MTA SZTAKI DSD
Department of Distributed Systems
A new approach for searching translated plagiarism
Máté PATAKI
KOPI Plagiarism Search Portal
- KOPI Online Plagiarism Search and Information Portal
- MTA SZTAKI (Computer and Automation Research Institute, Hungarian Academy of Sciences)
- http://kopi.sztaki.hu/
Problem
1. A lot of students
2. Useful information on the web
3. Theses written digitally
4. Strong foreign language skills
Problem
- Test cases for plagiarism detection software, Debora Weber-Wulff, HTW Berlin, 2010
  - 48 different plagiarism checkers
  - 42 different tests
- The biggest gap in all the plagiarism checkers was the inability to locate translated plagiarism. While this is widely expected as the technology to make such detections simply is not there.
Problem
- CLEF 2010
- Potthast: Overview of the 2nd International Competition on Plagiarism Detection
- After analyzing all 17 reports, certain algorithmic patterns became apparent to which many participants followed independently. ... In order to simplify the detection of cross-language plagiarism, non-English documents in D are translated to
Other uses
- Building parallel corpora
- Searching for existing translations
- Analyzing the spread of news items
- Searching for citations
- Plagiarism detection
Why not Google Translate?
- Use an automatic translation engine
  - Expensive or bad quality
  - Quite poor for Hungarian
    - loose word order
    - conjugation
    - significantly different grammar
- Cross-lingual document retrieval
  - Whole documents
The algorithm
- Sentence based
  - word, n-word, limb, paragraph, document
- Similarity metric
- "Flat" dictionary: a collection of words and translations
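A minimal sketch of what a "flat" dictionary could look like, assuming it simply maps each source word to the set of all its known translations, with no senses or parts of speech. The function names and the toy Hungarian-English entries are hypothetical and not taken from the KOPI system.

```python
from collections import defaultdict


def build_flat_dictionary(pairs):
    """Collapse (source word, translation) pairs into word -> set of translations."""
    flat = defaultdict(set)
    for source, translation in pairs:
        flat[source.lower()].add(translation.lower())
    return dict(flat)


def sentence_to_bag(sentence, flat_dictionary):
    """Union of all translations of each word; words without an entry are kept unchanged."""
    bag = set()
    for word in sentence.lower().split():
        bag |= flat_dictionary.get(word, {word})
    return bag


# Toy Hungarian-English entries, invented for illustration only.
PAIRS = [("kutya", "dog"), ("kutya", "hound"), ("ugat", "bark"), ("ugat", "bay")]
FLAT = build_flat_dictionary(PAIRS)
BAG = sentence_to_bag("A kutya ugat", FLAT)
```

Here `BAG` holds every candidate translation of the sentence at once, which is why no word sense has to be chosen up front.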
The algorithm
- Bag of words
  - Advantages
    - no word disambiguation
    - no problem with synonyms
    - indifferent to word order
  - Disadvantages
    - large search space
    - linear search time
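The trade-off listed above can be illustrated with a small sketch. It assumes the similarity is the fraction of a candidate sentence's words found in the translated bag; the slides do not give the actual metric, so this formula, the names, and the toy data are assumptions.

```python
def bag_similarity(translated_bag, candidate_sentence):
    """Fraction of the candidate's words that occur in the translated bag."""
    words = candidate_sentence.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in translated_bag)
    return hits / len(words)


def linear_search(translated_bag, corpus_sentences, threshold=0.5):
    """Score every corpus sentence in turn: cost grows linearly with corpus size."""
    results = []
    for index, sentence in enumerate(corpus_sentences):
        score = bag_similarity(translated_bag, sentence)
        if score >= threshold:
            results.append((score, index))
    return sorted(results, reverse=True)


# Hypothetical bag of candidate translations for one source sentence.
BAG = {"the", "dog", "hound", "bark", "bay"}
CORPUS = ["the hound will bay", "bay the hound", "a cat sleeps"]
RESULTS = linear_search(BAG, CORPUS)
```

Because a bag has no order, "bay the hound" and "hound the bay" receive the same score. The price is that every sentence in the corpus has to be compared against the bag.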
The algorithm
- Important parameters
  - Stopwords (by language pair)
  - Proper names and unknown words
  - Threshold (f+ / f-)
  - Size of the dictionary
  - Score for found/not found words (by word class)
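How these parameters might interact can be sketched as a weighted score. Everything concrete below is an assumption made for illustration: the word classes, the weights, the penalty values, the threshold of 0.5, and the normalisation are invented, since the slides only name the parameters.

```python
from dataclasses import dataclass, field


@dataclass
class Params:
    # All values are hypothetical defaults, not the tuned KOPI settings.
    stopwords: frozenset = frozenset({"the", "in", "a"})
    threshold: float = 0.5
    found_score: dict = field(
        default_factory=lambda: {"noun": 2.0, "name": 2.0, "verb": 1.0, "other": 1.0}
    )
    not_found_penalty: dict = field(
        default_factory=lambda: {"noun": 1.0, "name": 1.0, "verb": 0.5, "other": 0.0}
    )


def score(candidate_words, translated_bag, word_class, params):
    """Weighted share of non-stopword candidate words found in the translated bag."""
    earned = 0.0
    possible = 0.0
    for word in candidate_words:
        if word in params.stopwords:
            continue  # stopwords carry no evidence either way
        cls = word_class.get(word, "other")
        weight = params.found_score.get(cls, 1.0)
        possible += weight
        if word in translated_bag:
            earned += weight
        else:
            earned -= params.not_found_penalty.get(cls, 0.0)
    return max(earned, 0.0) / possible if possible else 0.0


PARAMS = Params()
# A proper name ("budapest") has no dictionary entry, so it is carried over unchanged.
BAG = {"dog", "hound", "bark", "bay", "budapest"}
WORD_CLASS = {"hound": "noun", "bay": "verb", "budapest": "name"}
SCORE = score("the hound will bay in budapest".split(), BAG, WORD_CLASS, PARAMS)
IS_MATCH = SCORE >= PARAMS.threshold
```

Raising the threshold reduces false positives at the cost of more false negatives, which is the f+ / f- trade-off named in the list.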
Test environment
- English Wikipedia
  - 31 GB XML
  - 3 800 000 articles
- SZTAKI Desktop GRID
- Available as text: http://kopiwiki.dsd.sztaki.hu/
- Google Translate
  - For the research
Demo
- http://www.wikipedia.org
- http://translate.google.com
- http://kopi.sztaki.hu
Results
[Figure: probability (0.4 to 0.65) plotted against placing (1 to 49); series: Similarity, Query]
Results
Probability of finding at least x out of y sentences
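One simple way to reason about such probabilities is the binomial tail. It assumes that each of the y plagiarised sentences is found independently with the same probability p. Both the independence assumption and the value of p used below are illustrative and are not taken from the slides.

```python
from math import comb


def prob_at_least(x, y, p):
    """P(at least x of y sentences are found), each found independently with probability p."""
    return sum(comb(y, k) * p**k * (1 - p) ** (y - k) for k in range(x, y + 1))


# Hypothetical example: per-sentence hit rate 0.5, 10 plagiarised sentences,
# at least 3 of them must be found to flag the document.
P_FLAG = prob_at_least(3, 10, 0.5)
```

Under these assumptions a moderate per-sentence hit rate already gives a high chance of flagging a document that contains several translated sentences.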
Demo
Results
Statistics
[Figure: recall and bag size (# words) plotted against # max translations]
Statistics
Recall as a function of the number of translations per
[Figure: recall (0.42 to 0.58) plotted against # max translations (2 to 30)]
Statistics
[Figure: recall (0 to 0.9) plotted against # words (1 to 29); series: Similarity metric, Full system]
Current status
- Our multilingual plagiarism detection service was launched at the end of 2011
- 2012:
  - Added Hungarian-French within two days
KOPI Portal
http://kopi.sztaki.hu