Magyar Számítógépes Nyelvészeti Konferencia
New method for spam-filtering
Sass Bálint
HAS Research Institute for Linguistics, H-1068 Budapest, Benczúr u. 33.
joker@nytud.hu
Keywords: spam-filtering, document classification, Naïve Bayesian Classifier
Unsolicited emails (spam) are becoming one of the most important problems of the internet. One of the main methods against spam is filtering, where incoming mails are divided into two classes: emails are marked as spam or as legitimate on the basis of their content. Thus spam-filtering can be considered a document classification problem. The so-called Naïve Bayesian Classifier (NBC) is one of the good document classification methods: a language model is built on the basis of examples of each category (the learning corpus), and then this model is used to determine which category a given document belongs to. The language model consists of the word frequency lists of each category.
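The classification step described above can be sketched as follows. This is a minimal illustration, not the implementation evaluated here: the function names, the toy corpus and the add-one (Laplace) smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


def train(examples):
    """Build one word-frequency list (Counter) per category.

    `examples` maps a category name to a list of already tokenized documents.
    """
    return {cat: Counter(tok for doc in docs for tok in doc)
            for cat, docs in examples.items()}


def classify(model, doc_counts, tokens):
    """Return the category with the highest log-posterior for `tokens`."""
    vocab = set().union(*model.values())
    total_docs = sum(doc_counts.values())
    best_cat, best_score = None, -math.inf
    for cat, freqs in model.items():
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[cat] / total_docs)
        n_tokens = sum(freqs.values())
        for tok in tokens:
            # Add-one smoothing is an assumption of this sketch.
            score += math.log((freqs[tok] + 1) / (n_tokens + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat


# Toy learning corpus, invented for illustration only.
corpus = {
    "spam": [["cheap", "pills", "buy", "now"], ["buy", "cheap", "watches"]],
    "legitimate": [["meeting", "agenda", "for", "monday"],
                   ["conference", "paper", "deadline"]],
}
model = train(corpus)
doc_counts = {cat: len(docs) for cat, docs in corpus.items()}

print(classify(model, doc_counts, ["buy", "cheap", "pills"]))
print(classify(model, doc_counts, ["paper", "deadline", "monday"]))
```

Working in log space avoids numerical underflow when many small per-word probabilities are multiplied.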
NBC is the basis of Paul Graham's spam-filtering method, which was published in 2002 [2]. It takes into account that spam-filtering is asymmetric: letting one spam through is not a big problem, but losing a legitimate email can be very costly.
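A rough sketch of how this asymmetry can be built into the scoring is given below. The constants (doubling the legitimate counts, ignoring tokens seen fewer than 5 times, 0.4 for unknown tokens, clamping to 0.01 and 0.99, the 15 most interesting tokens, a 0.9 threshold) follow the essay cited as [2] as far as I recall it; whether the implementation tested here used exactly these values is not stated, and the function names and toy counts are hypothetical.

```python
from math import prod


def token_probability(word, good, bad, ngood, nbad):
    """Spam probability of a single token.

    good/bad: token counts in the legitimate and spam corpora (dicts);
    ngood/nbad: number of legitimate and spam messages.
    """
    g = 2 * good.get(word, 0)  # doubling legitimate counts biases against false positives
    b = bad.get(word, 0)
    if g + b < 5:
        return 0.4             # rare or unknown token: treated as mildly innocent
    g_ratio = min(1.0, g / ngood)
    b_ratio = min(1.0, b / nbad)
    return max(0.01, min(0.99, b_ratio / (g_ratio + b_ratio)))


def spam_score(tokens, good, bad, ngood, nbad, n_interesting=15):
    """Combine the probabilities of the most interesting tokens."""
    probs = [token_probability(t, good, bad, ngood, nbad) for t in set(tokens)]
    # Keep the tokens whose probability is farthest from the neutral 0.5.
    probs = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n_interesting]
    p_spam = prod(probs)
    p_legit = prod(1.0 - p for p in probs)
    return p_spam / (p_spam + p_legit)


# Toy counts, invented for illustration only.
good = {"meeting": 10, "free": 1}
bad = {"free": 20, "viagra": 15}
ngood, nbad = 20, 20

THRESHOLD = 0.9  # a message is marked as spam only above this score
print(spam_score(["free", "viagra"], good, bad, ngood, nbad) > THRESHOLD)
print(spam_score(["meeting", "free"], good, bad, ngood, nbad) > THRESHOLD)
```

The high threshold and the doubled legitimate counts both push borderline messages towards the legitimate class, which is where the asymmetry enters.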
This method has many advantages: (1) very good filtering performance, (2) the filter is created automatically from spam and legitimate corpora, (3) it can be retrained from time to time, thus it can adapt itself, (4) by supplying the learning corpora, you can define what spam means for you.
I implemented this method and tested it on my incoming mails over the last six months. Precision was 98.6% and recall was 94.1%.
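For reference, the two measures can be computed as in the sketch below. It assumes the usual definitions with spam as the positive class (precision: the share of mails marked as spam that really are spam; recall: the share of all spam that was caught); the labels in the example are invented and do not reproduce the evaluation above.

```python
def precision_recall(true_labels, predicted_labels, positive="spam"):
    """Return (precision, recall) for the `positive` class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # legitimate mail lost
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam let through
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels, for illustration only.
truth = ["spam", "spam", "spam", "legitimate", "legitimate"]
guess = ["spam", "spam", "legitimate", "legitimate", "spam"]
print(precision_recall(truth, guess))
```

Given the asymmetry of the task, precision is the more critical figure, since every false positive is a lost legitimate email.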
It is clear that in this case the linguistic processing means only tokenization of emails and creation of word-frequency lists. Lemmatising the text or removing the most frequent words has been tried, but it did not result in a substantial improvement of performance [1]. It seems that in such relatively simple document classification tasks little linguistic processing can be enough. The algorithm is language-independent, therefore it can be used to filter emails written in any language.
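The preprocessing mentioned above can be sketched in a few lines. The token definition (runs of Unicode word characters, lowercased) and the optional `drop_most_frequent` parameter are assumptions of this example, not the tokenization rules of the tested implementation.

```python
import re
from collections import Counter

# In Python 3, \w is Unicode-aware for str patterns, so accented letters are kept.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split a message into lowercased word tokens."""
    return TOKEN_RE.findall(text.lower())


def frequency_list(messages, drop_most_frequent=0):
    """Build a word-frequency list, optionally removing the top-N words."""
    counts = Counter(tok for msg in messages for tok in tokenize(msg))
    for tok, _ in counts.most_common(drop_most_frequent):
        del counts[tok]
    return counts


print(tokenize("Olcsó gyógyszer, AKCIÓ!"))
print(frequency_list(["buy now", "buy cheap now", "buy"]))
```

No lemmatiser or language-specific resource is involved, which is what makes this kind of preprocessing usable for emails in any language.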
References
1. Androutsopoulos, I. et al.: An Evaluation of Naïve Bayesian Anti-Spam Filtering.
In: Proceedings of the 11th European Conference on Machine Learning, Workshop on Machine Learning in the New Information Age. (2000) 9-17
http://arxiv.org/PS_cache/cs/pdf/0006/0006013.pdf
2. Graham, P.: A Plan for Spam. (2002)
http://www.paulgraham.com/spam.html