

5.2 Rule-based Systems

5.2.1 A Rule-based System for Recognizing Named Entities

The linguistic workflow of the MNL project aimed at extracting the important pieces of information from the text and structure of the entire encyclopedia and assigning them to ontology classes, thus providing a knowledge representation that could serve as the basis of several web applications.

Since important pieces of information are mostly NEs, the main subtask of the project was NER. We extracted person, location, and organization names, titles of artworks, and temporal expressions1.

For implementing the steps of the linguistic workflow, we used GATE2, the General Architecture for Text Engineering [Cunningham et al., 2011], an open-source framework for solving text-processing problems. GATE provides built-in tools, such as tokenizers, sentence splitters,

1The project ran in 2003–2004, within the boundaries of the Magyar Nagylexikon Kiadó Zrt., the company responsible for editing and publishing the Hungarian encyclopedia, Magyar Nagylexikon. Two consequences follow: first, language processing resources were used as they were available then. Second, results remained unpublished because they were treated confidentially.

2We used version 2.1, the most current version in 2003.

and other higher-level processing resources, primarily for English. To adapt it to Hungarian, we had to create several language and processing resources. GATE has another built-in, language-independent functionality that is very useful for rule-based NER: JAPE, the Java Annotation Patterns Engine, which provides finite state transduction over annotations based on regular expressions.

To achieve our goal of annotating NEs in text, several pre-processing steps had to be taken. The complete workflow of NER is as follows:

Tokenization. Tokenization of the entire text of the encyclopedia was per-formed by GATE’s tokenizer module.
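The tokenization step can be illustrated with a minimal sketch; this stands in for GATE's actual tokenizer module, which is considerably more sophisticated:

```python
import re

def tokenize(text: str) -> list[str]:
    # A naive stand-in for GATE's tokenizer: split the text into
    # word/number tokens and single punctuation characters.
    # \w+ also matches accented Hungarian letters, since Python's
    # re module is Unicode-aware by default.
    return re.findall(r"\w+|[^\w\s]", text)

tokenize("Kossuth Lajos politikus.")
# → ["Kossuth", "Lajos", "politikus", "."]
```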

Word lemmatization. Language units annotated as words, i.e. non-NEs, were lemmatized, and full morphological annotation was assigned to them. This was performed by Jspell, a Java reimplementation of hunspell [Németh et al., 2004].

NE lemmatization. Since list lookup works on literally equal matches, forms of NEs occurring in the text had to be stripped of their suffixes.

However, hunspell works with a limited lexicon and contains only frequent names and their lemmatization rules. In addition, lemmatization rules for NEs are different from those for common nouns (cf. Subsection 2.3.2). We therefore adopted a new strategy. The lexicon file of hunspell was replaced with name lists generated from the inherent XML tagset of the encyclopedia3. Affixing rules that can operate on NEs were selected from the original affix file of hunspell, separately for each name type. In Hungarian, suffixation of foreign words and names follows the constraints of vowel harmony, so allomorphs are chosen based on the phonological form of the name in question. For this reason, a new component was added to the NE lemmatizer, which contains phonological and transcription rules for 20 languages.
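The vowel-harmony-driven choice of allomorphs can be sketched as follows. This is an illustrative simplification covering a single suffix (the inessive -ban/-ben), not the project's actual 20-language component:

```python
def inessive(name: str) -> str:
    """Attach the Hungarian inessive suffix, choosing the -ban/-ben
    allomorph by vowel harmony: the last back or front vowel decides,
    while i/í are transparent (skipped); names containing only
    transparent vowels default to the front form."""
    back, front = set("aáoóuú"), set("eéöőüű")
    harmony = "front"  # default for all-transparent names
    for ch in name.lower():
        if ch in back:
            harmony = "back"
        elif ch in front:
            harmony = "front"
    return name + ("ban" if harmony == "back" else "ben")

inessive("London")  # → "Londonban"
inessive("Berlin")  # → "Berlinben"
```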

Gazetteer list compilation. Even though ‘gazetteer’ originally means geographical directory, in the context of NER the term simply denotes a list of names. Gazetteers for each type of NE were generated from the inherent XML tagset of the encyclopedia. Since the text had been tagged manually by human editors, and the original goal of tagging was only to prepare articles for printing, the tags had many errors. For this reason, we checked and cleaned the lists manually. In addition, NEs had to be lemmatized to get exact matches. This was performed by the NE lemmatizer, and suffixed NE forms were changed to their lemmatized forms. The gazetteer lists also contained abbreviated forms of NEs.

3The encyclopedia was written and edited in an XML-based editorial system, which contained tags indicating several types of content, e.g. regnal name, place of birth and death, structured according to the rules of a pre-defined DTD.
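The interplay of NE lemmatization and exact-match list lookup can be sketched as below; the suffix list and gazetteer entries are illustrative stand-ins, not the project's actual resources:

```python
# Tiny illustrative samples; the real gazetteers were generated from the
# encyclopedia's XML tagset, and the suffix rules came from hunspell.
SUFFIXES = ["ban", "ben", "on", "en", "ön", "nak", "nek", "t"]
GAZETTEER = {
    "location": {"Budapest", "Szeged"},
    "person": {"Kossuth Lajos"},
}

def candidate_lemmas(token: str):
    """Yield the token itself, then each plausible suffix-stripped form."""
    yield token
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suf) and len(token) - len(suf) >= 3:
            yield token[: -len(suf)]

def lookup(token: str):
    """Return (NE type, lemma) on an exact gazetteer match, else None."""
    for lemma in candidate_lemmas(token):
        for ne_type, names in GAZETTEER.items():
            if lemma in names:
                return ne_type, lemma
    return None

lookup("Budapesten")  # → ("location", "Budapest")
```

Trying the unstripped form first avoids mangling names that merely end in a suffix-like string (e.g. the final t of Budapest).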

Transducing. First, GATE’s gazetteer module annotates elements of NEs: if a language unit occurring in the text exactly matches a unit in any of the gazetteer lists, it gets annotated with the appropriate tag, e.g. person first name, person full name, month, or day. Second, after investigating the text of the encyclopedia, we manually created regular expression patterns operating on these annotations. Finally, we used JAPE to build finite state transducers from the patterns and annotate NEs in the text.
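What a JAPE rule does can be approximated in a few lines of Python (the real system used JAPE grammars compiled to finite state transducers; the date pattern here is an invented example): scan the annotation stream and merge a month annotation followed by a numeric token into a single date annotation.

```python
def transduce(annotations):
    """annotations: list of (kind, text) pairs produced by the gazetteer
    and tokenizer, e.g. ("month", "május"), ("token", "1.").
    Merge month + numeric-token pairs into a single "date" annotation,
    mimicking a two-element JAPE pattern."""
    out, i = [], 0
    while i < len(annotations):
        if (i + 1 < len(annotations)
                and annotations[i][0] == "month"
                and annotations[i + 1][1].rstrip(".").isdigit()):
            out.append(("date", annotations[i][1] + " " + annotations[i + 1][1]))
            i += 2
        else:
            out.append(annotations[i])
            i += 1
    return out

transduce([("token", "született"), ("month", "május"), ("token", "1.")])
# → [("token", "született"), ("date", "május 1.")]
```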

Sentence splitting. Before splitting the text into sentences, abbreviations covered by gazetteer items were filtered out, so that their periods would not be treated as sentence-ending elements.
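The abbreviation-aware splitting step can be sketched as follows; the abbreviation set is an illustrative stand-in for the gazetteer-derived list:

```python
import re

# Illustrative sample; the real list came from the abbreviation gazetteer.
ABBREVIATIONS = {"dr.", "pl.", "kb.", "stb."}

def split_sentences(text: str) -> list[str]:
    """Naive splitter: a period ends a sentence unless it closes
    a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"\.", text):
        # the word that this period terminates
        word = text[:m.end()].split()[-1].lower()
        if word in ABBREVIATIONS:
            continue  # abbreviation period: not a sentence boundary
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

split_sentences("A terv kb. tíz évig készült. Végül megjelent.")
# → ["A terv kb. tíz évig készült.", "Végül megjelent."]
```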

This order of steps differs slightly from that currently applied by statistical NER systems, for which sentence splitting is usually the second pre-processing step after tokenization, since they operate on sentences and need sentence boundaries to recognize NEs. In contrast, our rule-based system operates on patterns of text, so sentence boundaries are not as important as in the case of supervised systems. Another difference is that we lemmatize NEs and non-NEs separately. For this purpose, the system has to distinguish NEs from non-NEs by the time of lemmatization, which requires a kind of combination of POS tagging and NER. As far as we know, there has been only one attempt to perform POS tagging and NER in parallel. Móra and Vincze [2012] emphasize that by exploiting the differences in the affixation of proper names and common nouns in Hungarian, joining the two steps may accelerate the identification of NEs.

The performance of our rule-based system was not measured in any of the standard ways, for several reasons. The first Hungarian NE-tagged gold standard corpus, the Szeged NER corpus [Szarvas et al., 2006a], which could have been used to evaluate our system, was not created until a few years later; and in any case, no firm conclusions could have been drawn from it, since encyclopedic text is very different from newswire.

For financial reasons, the project was terminated before the linguistic workflow could be completed and full results obtained, but manual checking of the output showed that our system was able to identify and classify a good proportion of the NEs in the text of the Hungarian encyclopedia.