Magyar Számítógépes Nyelvészeti Konferencia
Attempts and examples for the discovery of hidden information in the Concise explanatory dictionary of Hungarian (2nd edition, 2003)
Mártonfi Attila
MTA-ELTE Research Group of Academic Dictionary of Hungarian rumc i o n y t u d . hu
Keywords: lexicography, knowledge discovery, etymological statistics, Concise explanatory dictionary of Hungarian
Knowledge discovery and data mining (the latter being a part of the former) are fashionable areas of IT whose aim is to exploit databases that are characteristically commercial. However, the goal, namely extracting by machine as much hidden data and as many unknown patterns as possible, is essentially the same as the most general goal of scientific research; therefore their approach and toolkit are at least partially applicable to lexicographical databases. (Since lexicographical databases are usually smaller by orders of magnitude than the monumental commercial databases typical of the primary field of data mining, the hardware requirements of the operations are significantly lower, and the extractable information is more restricted.)
The first notable lexicographical database of Hungarian is Papp Ferenc’s Reverse-alphabetized dictionary of the Hungarian language (VégSz.) and its derivative database for PC. The database underlying Papp’s dictionary had four fields more than the paper version: the length in characters, the number of senses in ÉrtSz. (Explanatory dictionary of the Hungarian language), the etymology based on the Etymological dictionary of Hungarian, and the usage label given at the head of the entries in ÉrtSz. For typographical reasons these were omitted from the paper version and from its derivative database.
The new edition of the Concise explanatory dictionary of Hungarian (ÉKsz.2) was first prepared as an XML document, as an up-to-date lexicographical project should be. Although its grammatical information (which constitutes the skeleton of VégSz.) is substantially poorer, a more complete and more modern data table can be generated from it with suitable conversions. It is more modern because ÉKsz.2 provides up-to-date etymological facts about the widest group of Hungarian words. It is more complete because, apart from the part-of-speech and usage labels and the number of senses distinguished, every entry in this dictionary carries its absolute frequency based on the Hungarian National Corpus; furthermore, word length in phonemes or syllables can be coded.
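
The conversion from XML entries to a flat data table might be sketched as follows. This is only an illustration under stated assumptions: the element names (`entry`, `headword`, `pos`, `usage`, `etymology`, `frequency`, `sense`) and the sample entries are hypothetical and do not reproduce the actual ÉKsz.2 markup or its data. Length is counted in characters here; counting phonemes or syllables would need Hungarian-specific rules that are not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup and made-up entries; NOT the real ÉKsz.2 schema or content.
SAMPLE_XML = """
<dictionary>
  <entry>
    <headword>word1</headword>
    <pos>noun</pos>
    <etymology>inherited</etymology>
    <frequency>1200</frequency>
    <sense/><sense/><sense/>
  </entry>
  <entry>
    <headword>word2</headword>
    <pos>verb</pos>
    <usage>colloquial</usage>
    <etymology>loan</etymology>
    <frequency>50</frequency>
    <sense/>
  </entry>
</dictionary>
"""


def entries_to_rows(xml_text):
    """Flatten each <entry> element into one dictionary (one table row)."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("entry"):
        headword = entry.findtext("headword", default="")
        rows.append({
            "headword": headword,
            "length_chars": len(headword),
            "pos": entry.findtext("pos", default=""),
            # A missing usage label is stored as an empty string.
            "usage_label": entry.findtext("usage", default=""),
            "etymology": entry.findtext("etymology", default=""),
            # The number of senses is the count of direct <sense> children.
            "senses": len(entry.findall("sense")),
            "corpus_freq": int(entry.findtext("frequency", default="0")),
        })
    return rows


ROWS = entries_to_rows(SAMPLE_XML)
```

Each resulting row can then be loaded into a relational table with one column per field.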
Szeged, 10-11 December 2003
With some simple queries, the generated relational database yields token and type frequency indices for word groups defined by etymology, usage label, part of speech, or number of senses. Such token frequency indices could not be calculated formerly, for want of a satisfactory database or corpus background; the type frequencies make comparison possible with Papp’s examinations, which were based on earlier sources.
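
A query of this kind might look like the minimal sketch below, which uses an in-memory SQLite table. The table name `entries`, its columns, and all rows are assumptions made for illustration; none of the values come from ÉKsz.2 or the Hungarian National Corpus. Type frequency is taken as the number of headwords in a group, and token frequency as the sum of their corpus frequencies.

```python
import sqlite3

# Made-up rows for illustration only; not data from the dictionary.
# Columns: headword, etymology, pos, usage_label, senses, corpus_freq
ROWS = [
    ("word1", "inherited", "noun", "", 3, 1200),
    ("word2", "inherited", "verb", "", 5, 800),
    ("word3", "loan", "noun", "colloquial", 1, 50),
    ("word4", "loan", "noun", "", 2, 150),
    ("word5", "unknown", "adjective", "dialectal", 1, 10),
]

GROUPABLE = {"etymology", "pos", "usage_label", "senses"}


def frequency_indices(rows, group_column="etymology"):
    """Return {group: (type_frequency, token_frequency)} for one grouping column."""
    if group_column not in GROUPABLE:
        raise ValueError(f"cannot group by {group_column!r}")
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (headword TEXT, etymology TEXT, pos TEXT, "
        "usage_label TEXT, senses INTEGER, corpus_freq INTEGER)"
    )
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)", rows)
    # The column name is checked against GROUPABLE above, so it is safe to interpolate.
    query = (
        f"SELECT {group_column}, COUNT(*), SUM(corpus_freq) "
        f"FROM entries GROUP BY {group_column}"
    )
    result = {group: (types, tokens) for group, types, tokens in conn.execute(query)}
    conn.close()
    return result


INDICES = frequency_indices(ROWS)
```

Calling `frequency_indices(ROWS, "pos")` gives the same two indices per part of speech; the other grouping columns work the same way.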
With the toolkit of data mining, more interesting analyses could be performed: hidden patterns among the above parameters could be discovered by extracting association rules.
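
As an illustration of what extracting association rules involves, the sketch below computes support and confidence for simple one-to-one rules of the form "attribute=value implies attribute=value". The records, the thresholds, and the function name are hypothetical, and the resulting rules say nothing about the real dictionary. A full analysis would typically use an algorithm such as Apriori, which also handles multi-item antecedents.

```python
from itertools import permutations

# Made-up records for illustration only; not data from the dictionary.
RECORDS = [
    {"etymology": "loan", "pos": "noun", "senses": "one"},
    {"etymology": "loan", "pos": "noun", "senses": "one"},
    {"etymology": "loan", "pos": "verb", "senses": "many"},
    {"etymology": "inherited", "pos": "verb", "senses": "many"},
    {"etymology": "inherited", "pos": "noun", "senses": "many"},
]


def association_rules(records, min_support=0.4, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for one-to-one rules."""
    # Each record becomes a set of "attribute=value" items.
    transactions = [frozenset(f"{k}={v}" for k, v in r.items()) for r in records]
    total = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for antecedent, consequent in permutations(items, 2):
        with_antecedent = sum(1 for t in transactions if antecedent in t)
        with_both = sum(
            1 for t in transactions if antecedent in t and consequent in t
        )
        support = with_both / total                # share of records with both items
        confidence = with_both / with_antecedent   # share of antecedent records with the consequent
        if support >= min_support and confidence >= min_confidence:
            rules.append((antecedent, consequent, support, confidence))
    return rules


RULES = association_rules(RECORDS)
```

In this toy data, every record with `senses=one` also has `etymology=loan`, so that rule has confidence 1.0, while the reverse rule holds for only two of the three loan records.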