A new approach for searching translated plagiarism KOPI

Academic year: 2022

MTA SZTAKI DSD

Department of Distributed Systems

A new approach for searching translated plagiarism

Máté PATAKI

KOPI Plagiarism Search Portal

• KOPI Online Plagiarism Search and Information Portal
• MTA SZTAKI (Computer and Automation Research Institute, Hungarian Academy of Sciences)
• http://kopi.sztaki.hu/

Problem

1. Many students
2. Useful information on the web
3. Theses written digitally
4. Strong foreign language skills

Problem

• Test cases for plagiarism detection software, Debora Weber-Wulff, HTW Berlin, 2010
  • 48 different plagiarism checkers
  • 42 different tests
  • The biggest gap in all the plagiarism checkers was the inability to locate translated plagiarism. While this is widely expected as the technology to make such detections simply is not there.

Problem

• CLEF 2010
  • Potthast: Overview of the 2nd International Competition on Plagiarism Detection
  • After analyzing all 17 reports, certain algorithmic patterns became apparent to which many participants followed independently. ... In order to simplify the detection of cross-language plagiarism, non-English documents in D are translated to

Other uses

• Building parallel corpora
• Searching for existing translations
• Analyzing the spread of news items
• Searching for citations
• Plagiarism detection

Why not Google Translate?

• Use an automatic translation engine
  • Expensive or bad quality
  • Quite poor for Hungarian
    • loose word order
    • conjugation
    • significantly different grammar
• Cross-lingual document retrieval
  • Whole documents

The algorithm

• Sentence based
  • word, n-word, limb, paragraph, document
• Similarity metric
  • “Flat” dictionary, collection of words and translations

The algorithm

• Bag of words
  • Advantages
    • no word disambiguation
    • no problem with synonyms
    • indifferent to word order
  • Disadvantages
    • large search space
    • linear search time
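
The bag-of-words idea can be sketched roughly as follows. This is only an illustration, not the KOPI implementation: the toy dictionary, the function names and the overlap ratio used as the score are all assumptions made for the example.

```python
import re

# Hypothetical toy "flat" dictionary: each Hungarian word maps to a set of
# possible English translations, with no sense disambiguation.
FLAT_DICT = {
    "a": {"the"},
    "kutya": {"dog", "hound"},
    "kergeti": {"chases", "pursues"},
    "macskát": {"cat"},
}

def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())

def translation_bag(sentence, dictionary):
    """Union of all translations of all words: the sentence's bag of words."""
    bag = set()
    for word in tokenize(sentence):
        bag |= dictionary.get(word, set())
    return bag

def similarity(source_sentence, candidate_sentence, dictionary):
    """Fraction of candidate words that occur in the source's translation bag."""
    bag = translation_bag(source_sentence, dictionary)
    candidate_words = set(tokenize(candidate_sentence))
    if not candidate_words:
        return 0.0
    return len(candidate_words & bag) / len(candidate_words)
```

Because the bag is an unordered set holding every listed translation, a candidate such as "The cat pursues the dog" matches the source "A kutya kergeti a macskát" fully despite the different word order and the synonym. The cost is that the bag grows with every extra translation, which is the large search space noted among the disadvantages.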

The algorithm

• Important parameters
  • Stopwords (by language pair)
  • Proper names and unknown words
  • Threshold (f+ / f-)
  • Size of the dictionary
  • Score for found/not found words (by word class)
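
To make the role of these parameters concrete, here is a hedged sketch of how they might interact in a scoring function. The slides do not give the actual formula, so the stopword list, the per-word-class scores, the penalty scheme, the normalization and the threshold value are all hypothetical placeholders.

```python
# All values below are hypothetical placeholders, not KOPI's tuned parameters.
STOPWORDS = {"the", "a", "of"}  # would be chosen per language pair
FOUND_SCORE = {"noun": 2.0, "verb": 1.5, "other": 1.0}
NOT_FOUND_PENALTY = {"noun": 1.0, "verb": 0.75, "other": 0.5}
THRESHOLD = 0.5  # lower: more false positives; higher: more false negatives

def score_candidate(candidate_words, translation_bag, source_words):
    """Score a candidate sentence given as lowercase (word, word_class) pairs.

    translation_bag: set of dictionary translations of the source sentence.
    source_words: set of raw source tokens, so that proper names and words
    missing from the dictionary can still match verbatim.
    """
    score = 0.0
    best = 0.0  # score the candidate would get if every word were found
    for word, word_class in candidate_words:
        if word in STOPWORDS:
            continue  # stopwords carry no evidence either way
        best += FOUND_SCORE[word_class]
        if word in translation_bag or word in source_words:
            score += FOUND_SCORE[word_class]
        else:
            score -= NOT_FOUND_PENALTY[word_class]
    if best == 0.0:
        return 0.0
    return max(score, 0.0) / best

def is_match(candidate_words, translation_bag, source_words):
    """Accept the candidate when its normalized score reaches the threshold."""
    return score_candidate(candidate_words, translation_bag, source_words) >= THRESHOLD
```

In this sketch a larger dictionary enlarges `translation_bag`, so more words are found but unrelated sentences also score higher, which is why dictionary size and threshold would have to be tuned together.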

Test environment

• English Wikipedia
  • 31 GB XML
  • 3 800 000 articles
  • SZTAKI Desktop GRID
  • Available as text: http://kopiwiki.dsd.sztaki.hu/
• Google Translate
  • For the research

Demo

• http://www.wikipedia.org
• http://translate.google.com
• http://kopi.sztaki.hu

Results

[Figure: probability (y-axis, 0.4 to 0.65) against placing (x-axis, 1 to 49); labels: Similarity, Query]

Results

Probabilities to find at least x out of y sentences
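
The slide does not say how these probabilities were obtained. One simple way to reason about such a figure, assuming (hypothetically) that each of the y copied sentences is found independently with the same probability p, is the binomial tail below; the function name and the example value of p are illustrative only.

```python
from math import comb

def prob_at_least(x, y, p):
    """P(at least x of y sentences are found), assuming each sentence is
    found independently with the same probability p (a simplification)."""
    return sum(comb(y, k) * p**k * (1 - p) ** (y - k) for k in range(x, y + 1))

# Example with a hypothetical per-sentence probability of 0.5:
# chance of finding at least one of ten copied sentences.
print(prob_at_least(1, 10, 0.5))
```

Under this assumption even a modest per-sentence probability gives a high chance of flagging a document once several sentences have been copied, since only one hit is needed to raise suspicion.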

Demo

Results

Statistics

[Figure: recall and bag size (# words) as a function of # max translations]

Statistics

Recall as a function of the number of translations per

[Figure: recall (y-axis, 0.42 to 0.58) against # max translations (x-axis, 2 to 30)]

Statistics

[Figure: recall (y-axis, 0 to 0.9) against # words (x-axis, 1 to 29); series: Similarity metric, Full system]

Current status

• Our multilingual plagiarism detection service started at the end of 2011
• 2012:
  • Added Hungarian-French within two days

KOPI Portal

http://kopi.sztaki.hu

Web: http://dsd.sztaki.hu Email: Mate.Pataki@sztaki.hu

Recap

• Translated plagiarism detection
• Alternative method to using machine translation
  • information retrieval + a new cross-language similarity metric
• Quick to add new language pairs
• Works for monolingual search as well
