Attempts and examples for the discovery of hidden information of Concise explanatory dictionary of Hungarian (2nd edition, 2003)


Magyar Számítógépes Nyelvészeti Konferencia, Szeged, 10-11 December 2003


Mártonfi Attila

MTA-ELTE Research Group of Academic Dictionary of Hungarian rumc i o n y t u d . hu

Keywords: lexicography, knowledge discovery, etymological statistics, Concise explanatory dictionary of Hungarian

Knowledge discovery and data mining (the latter being a part of the former) are fashionable areas of IT whose aim is to exploit typically commercial databases. However, the goal (namely, extracting as much hidden data and as many unknown patterns as possible by machine) is essentially the same as the most general goal of scientific research; therefore its approach and toolkit are, at least partially, applicable to lexicographical databases. (Since lexicographical databases are usually orders of magnitude smaller than the monumental commercial databases found in the primary area of data mining, the hardware requirements of the operations are significantly lower and the extractable information is more restricted.)

The first notable lexicographical database of Hungarian is Papp Ferenc’s Reverse-alphabetized dictionary of the Hungarian language (VégSz.) and its derivative database on PC. The database underlying Papp’s dictionary had four more fields than the paper version: the length in characters, the number of senses in ÉrtSz. (Explanatory dictionary of the Hungarian language), the etymology based on the Etymological dictionary of Hungarian, and the usage label given in the head of entries in ÉrtSz. For typographical reasons these were omitted from the paper version and its derivative database.

The new edition of the Concise explanatory dictionary of Hungarian (ÉKsz.2), as an up-to-date lexicographical project should be, was first prepared as an XML document, and though its grammatical information (which constitutes the skeleton of VégSz.) is substantially poorer, suitable conversions can generate a more complete and more modern data table. It is more modern because ÉKsz.2 provides up-to-date etymological facts about the widest group of Hungarian words, and it is more complete because, apart from the part-of-speech and usage labels and the number of senses distinguished, every entry in this dictionary carries an absolute frequency based on the Hungarian National Corpus; furthermore, the word length in phonemes or syllables can be coded.
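The conversion from XML entries to a flat data table might look roughly like the following sketch. The markup is hypothetical: the element names (`entry`, `headword`, `pos`, `etymology`, `frequency`, `sense`) are assumptions made for illustration and do not reproduce the actual ÉKsz.2 schema, and the frequencies and sense glosses are invented. Syllable length is approximated by counting vowel letters.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names are illustrative only and do not
# reproduce the real ÉKsz.2 schema. Frequencies and glosses are invented.
SAMPLE_XML = """
<dictionary>
  <entry>
    <headword>asztal</headword>
    <pos>noun</pos>
    <etymology>Slavic</etymology>
    <frequency>1200</frequency>
    <sense>piece of furniture</sense>
    <sense>board, meals</sense>
  </entry>
  <entry>
    <headword>ház</headword>
    <pos>noun</pos>
    <etymology>Finno-Ugric</etymology>
    <frequency>3400</frequency>
    <sense>building</sense>
  </entry>
</dictionary>
"""

HUNGARIAN_VOWELS = set("aáeéiíoóöőuúüű")


def count_syllables(word):
    """Approximate the syllable count as the number of vowel letters."""
    return sum(1 for ch in word.lower() if ch in HUNGARIAN_VOWELS)


def entries_to_rows(xml_text):
    """Flatten each <entry> element into one dict, i.e. one table row."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("entry"):
        headword = entry.findtext("headword")
        rows.append({
            "headword": headword,
            "pos": entry.findtext("pos"),
            "etymology": entry.findtext("etymology"),
            "frequency": int(entry.findtext("frequency")),
            "n_senses": len(entry.findall("sense")),
            "n_syllables": count_syllables(headword),
        })
    return rows


rows = entries_to_rows(SAMPLE_XML)
```

Each resulting row carries the fields named in the text (part of speech, etymology, corpus frequency, number of senses, coded word length); a usage-label column could be added in the same way.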

With some simple queries the generated relational database gives token and type frequency indices for word groups defined by etymology, usage label, part of speech or number of senses. Such token frequency indices could not be calculated formerly, for want of a satisfactory database or corpus background; the type frequencies make possible a comparison with Papp’s examinations based on former sources.
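A query of this kind can be sketched with an in-memory SQLite table. The table layout, the column names and all rows are assumptions made for illustration; none of the figures come from ÉKsz.2 or the Hungarian National Corpus. Type frequency is taken here as the number of headwords in a group, token frequency as the sum of their corpus frequencies.

```python
import sqlite3

# Invented rows (headword, etymology, corpus frequency); not real data.
ROWS = [
    ("word_a", "Finno-Ugric", 500),
    ("word_b", "Finno-Ugric", 300),
    ("word_c", "Slavic", 200),
    ("word_d", "Turkic", 50),
    ("word_e", "Slavic", 150),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (headword TEXT, etymology TEXT, frequency INTEGER)"
)
conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", ROWS)

# Type frequency: number of headwords per group.
# Token frequency: summed corpus frequency of those headwords.
query = """
    SELECT etymology,
           COUNT(*)       AS type_freq,
           SUM(frequency) AS token_freq
    FROM entries
    GROUP BY etymology
    ORDER BY etymology
"""
indices = {etym: (types, tokens) for etym, types, tokens in conn.execute(query)}
conn.close()

for etym, (types, tokens) in indices.items():
    print(f"{etym}: {types} types, {tokens} tokens")
```

Grouping by usage label, part of speech or number of senses would only require changing the column named in `GROUP BY`.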

With the toolkit of data mining, more interesting analyses could be performed to discover hidden patterns among the above parameters by extracting association rules.
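As a rough illustration of what extracting association rules involves, the sketch below computes support and confidence for rules with a single antecedent and a single consequent. It is a deliberately restricted brute-force variant, not a full algorithm such as Apriori, and the abstract does not state which method was used. The records, attribute names and thresholds are invented and say nothing about actual Hungarian vocabulary.

```python
from collections import Counter
from itertools import permutations

# Invented records: each entry is a set of attribute=value items.
RECORDS = [
    {"etym=Slavic", "pos=noun", "label=none"},
    {"etym=Slavic", "pos=noun", "label=none"},
    {"etym=Slavic", "pos=verb", "label=none"},
    {"etym=Latin", "pos=noun", "label=formal"},
    {"etym=Latin", "pos=noun", "label=formal"},
]


def association_rules(records, min_support, min_confidence):
    """Return rules (a, b, support, confidence) meaning 'a implies b'.

    support    = share of records containing both a and b
    confidence = share of records containing a that also contain b
    """
    n = len(records)
    item_counts = Counter()
    pair_counts = Counter()
    for rec in records:
        item_counts.update(rec)
        # Every ordered pair of distinct items in the record.
        pair_counts.update(permutations(sorted(rec), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        confidence = count / item_counts[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


rules = association_rules(RECORDS, min_support=0.4, min_confidence=0.9)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}  support={support:.2f}  confidence={confidence:.2f}")
```

On real dictionary data the items would be the etymology, usage label, part-of-speech, sense-count and frequency-band values of each entry, and the thresholds would need tuning to the size of the database.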