New method for spam-filtering

Magyar Számítógépes Nyelvészeti Konferencia (Hungarian Conference on Computational Linguistics), p. 314


Sass Bálint

HAS Research Institute for Linguistics, H-1068 Budapest, Benczúr u. 33.

joker@nytud.hu

Keywords: spam-filtering, document classification, Naïve Bayesian Classifier

Unsolicited emails (spam) are becoming one of the most important problems of the internet. One main method against spam is filtering, where incoming mails are divided into two parts: emails are marked as spam or as legitimate on the basis of their content. Thus spam-filtering can be considered a document classification problem. The so-called Naïve Bayesian Classifier (NBC) is one of the good document classification methods: a language model is built on the basis of examples of each category (the learning corpus), and this model is then used to determine which category a given document belongs to. The language model consists of the word frequency lists of each category.
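The classification step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation evaluated in this paper: the function names `train` and `classify`, the toy corpus, the equal priors, and the add-one (Laplace) smoothing are all choices made for the example.

```python
import math
from collections import Counter

def train(corpus):
    """Build one word-frequency list (a Counter) per category.

    `corpus` maps a category name to a list of tokenized documents.
    """
    return {cat: Counter(tok for doc in docs for tok in doc)
            for cat, docs in corpus.items()}

def classify(model, priors, tokens):
    """Return the category with the highest log-posterior for `tokens`."""
    vocab = set().union(*model.values())  # all words seen in any category
    best_cat, best_score = None, -math.inf
    for cat, freq in model.items():
        total = sum(freq.values())
        score = math.log(priors[cat])
        for tok in tokens:
            # Add-one smoothing (an assumption of this sketch) keeps
            # unseen words from producing log(0).
            score += math.log((freq[tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Toy learning corpus, invented for illustration only.
corpus = {
    "spam": [["cheap", "pills", "buy", "now"], ["buy", "cheap", "watches"]],
    "legitimate": [["meeting", "at", "noon"], ["project", "report", "attached"]],
}
model = train(corpus)
priors = {"spam": 0.5, "legitimate": 0.5}
print(classify(model, priors, ["buy", "cheap"]))
```

Working in log space avoids numeric underflow when a message contains many tokens.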

NBC is the basis of Paul Graham's spam-filtering method, which was published in 2002 [2]. It takes into account that spam-filtering is asymmetric: it is not a big trouble if one spam gets through, but losing a legitimate email can be a misery.
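One way to express this asymmetry in code is to flag a message only when its combined spam probability is very high. The sketch below assumes independent tokens and equal priors; the function names and the threshold of 0.9 are illustrative assumptions, not parameters reported in this paper.

```python
import math

def combine(probs):
    """Combine per-token spam probabilities into one message probability.

    Assumes token independence and equal prior probability for both classes.
    """
    spam = math.prod(probs)
    legit = math.prod(1.0 - p for p in probs)
    return spam / (spam + legit)

def is_spam(probs, threshold=0.9):
    """Flag a message only above a high threshold (value is an assumption).

    A threshold well above 0.5 trades a few missed spams for fewer
    legitimate emails lost.
    """
    return combine(probs) > threshold

print(is_spam([0.6, 0.6]))   # mildly suspicious tokens
print(is_spam([0.99, 0.9]))  # strongly spam-indicating tokens
```

Raising the threshold makes the filter more conservative: borderline messages stay in the inbox.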

This method has many advantages: (1) very good filtering performance, (2) filter creation from spam and legitimate corpora is automatic, (3) it can be retrained from time to time, thus it can adapt itself, (4) by supplying the learning corpora, you can define what spam means for you.

I implemented this method and tested it on my incoming mails over the last six months. Precision was 98.6% and recall was 94.1%.
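For reference, precision and recall with spam as the positive class can be computed as in the sketch below. The function name and the four example labels are invented for illustration and are unrelated to the test data reported above.

```python
def precision_recall(predicted, actual, positive="spam"):
    """Return (precision, recall) for the `positive` class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)  # spam caught
    fp = sum(p == positive and a != positive for p, a in pairs)  # legitimate mail lost
    fn = sum(p != positive and a == positive for p, a in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented labels, for illustration only.
predicted = ["spam", "spam", "legitimate", "legitimate"]
actual = ["spam", "legitimate", "spam", "legitimate"]
print(precision_recall(predicted, actual))
```

Under this definition, precision falls when legitimate mail is marked as spam, and recall falls when spam is let through.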

It is clear that in this case linguistic processing means only tokenization of emails and creation of word-frequency lists. Lemmatising the text or removing the most frequent words has been tried, but it did not result in substantial improvement of performance [1]. It seems that in such relatively simple document classification tasks little linguistic processing can be enough. The algorithm is language-independent, therefore it can be used to filter emails written in any language.
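The minimal processing described here (tokenization followed by word counting) might look like the following sketch. The regular expression and the lowercasing step are assumptions made for the example; the paper does not specify its tokenizer.

```python
import re
from collections import Counter

def tokenize(text):
    r"""Split text into lowercase word tokens.

    In Python 3, \w matches Unicode letters in str patterns, so the same
    code handles accented characters without language-specific rules.
    """
    return re.findall(r"\w+", text.lower())

def frequency_list(messages):
    """Build a word-frequency list from an iterable of raw message texts."""
    counts = Counter()
    for text in messages:
        counts.update(tokenize(text))
    return counts

# Invented example messages mixing Hungarian and English words.
freq = frequency_list(["Olcsó gyógyszer, BUY now!", "Buy olcsó watches"])
print(freq.most_common(2))
```

No lemmatiser or stop-word list is involved, which is what keeps the approach independent of the language of the emails.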

References

1. Androutsopoulos, I. et al.: An Evaluation of Naïve Bayesian Anti-Spam Filtering. In: Proceedings of the 11th European Conference on Machine Learning, Workshop on Machine Learning in the New Information Age. (2000) 9-17
   http://arxiv.org/PS_cache/cs/pdf/0006/0006013.pdf

2. Graham, P.: A Plan for Spam. (2002)
   http://www.paulgraham.com/spam.html
