Abstract
Email has become a crucial part of life as the Internet has developed. However, a massive influx of spam has threatened the usefulness of email communication. Many anti-spam techniques have been developed, including machine learning, authentication, and collaboration. However, little has been done from a systems perspective to provide an effective, robust, and efficient anti-spam solution. The arms race between spammers and anti-spam researchers has brought new challenges to the design of modern anti-spam systems.

This dissertation focuses on the systems aspect of the challenges that anti-spam researchers face in designing anti-spam approaches. In particular, we attempt to provide solutions to the challenges in the collaborative approach, the stand-alone approach, and the sender-based approach. These challenges are 1) preserving the privacy of email content in collaboration, 2) achieving both high accuracy and high processing speed, and 3) selectively punishing email senders without exact knowledge of whether a sender is a spammer or a normal user.

We design a novel message transformation technique that preserves the privacy of email content while deriving resemblance information for collaborative email classification. We also carefully design a communication protocol to ensure email privacy during information exchange among the collaborating entities. The experimental results demonstrate comparable accuracy and greater robustness relative to the Bayesian and Distributed Checksum Clearinghouse approaches. This dissertation also proposes a new metric for privacy evaluation and demonstrates a system with excellent privacy preservation.

This dissertation then explores the tradeoff between spam filtering accuracy and speed by using approximate classification. It demonstrates a speed improvement of about one order of magnitude over two well-known spam filters, while achieving identical false positive rates and similar false negative rates.

For cost-based approaches, we propose to push the spam filter to the early stage of the SMTP conversation and to determine the cost based on email quality and spam behavior. The experimental results show that, on state-of-the-art hardware, the proposed technique can effectively and significantly limit the ability of a spammer, even if the spammer possesses more CPU resources than a normal sender.