
6.5 List Lookup Features

6.5.1 The Effects of Gazetteer List Size

One might think that NER could be performed using lists of person, place and organization names alone, but this is not the case. It is not feasible to list all names, since new companies are formed all the time, and new persons are born and receive new names. In addition, names occur in variations: Frederick Flintstone can be mentioned as Frederick, Fred or Freddy, all of which can also be combined with the last name. These variations would have to be listed as well.

Even if it were possible to list all names, there would still be the problem of overlap between lists, caused by the fact that a wide range of names refer to more than one object in the world (cf. Subsection 2.3.1).

Moreover, complex NEs can include common nouns or function words.

For example, People's Daily contains a common noun, a possessive marking and an adverb. If this name is included in a list of organization names, a feature like isPartOfOrg=1 can be assigned to every mention of its individual words in a text.
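A minimal sketch of how such a lookup feature might be computed is given below; the gazetteer contents, the feature name and the helper function are illustrative, not the exact implementation used in the systems described here.

```python
# Sketch: assigning an isPartOfOrg-style feature to every token that
# occurs inside any multi-word organization name in a gazetteer.
# Gazetteer entries and names below are illustrative examples only.

ORG_GAZETTEER = {"People's Daily", "New York Times"}

# Precompute the set of individual words occurring in any listed name.
ORG_WORDS = {word.lower() for name in ORG_GAZETTEER for word in name.split()}

def is_part_of_org(token: str) -> int:
    """Return 1 if the token occurs inside any gazetteer entry, else 0."""
    return int(token.lower() in ORG_WORDS)

tokens = "He reads the Daily news every morning".split()
features = [(t, is_part_of_org(t)) for t in tokens]
# 'Daily' receives the feature even though it is not an organization here,
# which is exactly the kind of noise the text describes.
```

Note how the common word "Daily" fires the feature in a non-name context; this is why such list-membership features are treated as evidence for a learner rather than as a decision rule.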

Cucchiarelli et al. [1998] reported that one of the bottlenecks in designing NER systems is the limited availability of large gazetteers, particularly for languages other than English. Their system relies on gazetteers of common proper nouns and a set of heuristic rules, similar to our rule-based system described in Subsection 5.2.1. As explained in Subsection 5.2.4, rule-based systems depend heavily on the size of their gazetteer lists.

However, the situation has changed since the 1990s. Currently, most NER systems use some kind of machine learning algorithm based on probabilities calculated from training data. The fact that we built NER systems for Hungarian and English with F-measures of 97.14% and 88.02%, respectively, using only surface features, digit patterns and morphological information, without the help of external lists, shows that statistical systems do not rely on gazetteers as much as rule-based systems do. In addition, large numbers of NEs are now available via the web. Online databases and collaboratively generated resources such as Wikipedia, DBpedia and Freebase open the door to the extraction of large lists of several types of names.
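The surface and digit-pattern features mentioned above can be illustrated with a small sketch; the feature inventory below is a set of common examples of such features, not the exact one used in the systems described in this thesis.

```python
import re

def surface_features(token: str) -> dict:
    """A few typical surface and digit-pattern features for a token.
    Illustrative feature set, not the thesis's exact inventory."""
    return {
        "is_capitalized": token[:1].isupper(),
        "all_caps": token.isupper(),
        "contains_digit": any(ch.isdigit() for ch in token),
        # Map letters to X and digits to D, keeping punctuation:
        "digit_pattern": re.sub(r"\d", "D", re.sub(r"[A-Za-z]", "X", token)),
        "word_length": len(token),
    }

surface_features("MUC-7")
# e.g. the digit_pattern feature becomes 'XXX-D'
```

Features of this kind generalize over unseen tokens (any "XXX-D"-shaped word shares the pattern), which is part of why statistical systems can do without exhaustive name lists.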

When building an NLP system, finding the right balance between precision and recall is one of the most essential requirements. Some applications concentrate on precision, while in others, once a minimum level of precision is assured, improvements are dominated by recall issues. Recall is impacted most heavily by out-of-vocabulary (OOV) effects, and OOV effects themselves are an almost direct function of the lists used in the system. Thus, the best way to improve the performance of such a system is to expand its lists, so as to address the leading cause of recall errors. The impact of OOV words on recall can also be mitigated to some extent by synonym-based techniques.
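The precision/recall balance discussed here is conventionally summarized by the F-measure, the (weighted) harmonic mean of the two; as a reminder, a generic implementation:

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall.
    beta = 1 gives the standard F-measure; beta > 1 weights recall
    more heavily, matching applications where, once a minimum
    precision is assured, improvements are dominated by recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f_measure(0.75, 0.6)   # harmonic mean, ~0.667 (values are illustrative)
```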

To illustrate the effects of standard mitigation techniques on precision and recall, we now make a slight detour and briefly describe a method we used for finding metaphorical expressions by means of different kinds of lists [Babarczy et al., 2010a,b; Babarczy and Simon, 2012]. For the automatic identification of metaphors, we searched the corpus for sentences containing one or more words characterising the source domain and one or more words representing the target domain of a given conceptual metaphor. Three different methods of compiling the word lists were tested: a) a word association experiment, b) a dictionary of synonyms, and c) a reference corpus. The first method is based on the assumption that the expressions people associate with a key word for the source domain and a key word for the target domain can provide a lexical profile for a given metaphor type. Word associations were collected in an online experiment. For the second method, the word lists obtained from the association experiment were expanded with the synonyms listed for the association words in a Hungarian thesaurus. Compared to the association lists, the size of the word lists increased substantially (see Table 6.7). For the third method, word lists for each source and target domain were extracted from a manually annotated corpus. Based on the three sets of word lists, the test corpus was automatically annotated, producing three files in which the sentences were marked with tags showing the type of conceptual metaphor the system identified. Each of the three annotations was then verified manually.
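The sentence-level co-occurrence criterion used for tagging can be sketched as follows; the word lists below are toy examples, not the actual lists compiled in the experiments.

```python
# Sketch of the co-occurrence test: a sentence is a metaphor candidate
# if it contains at least one source-domain word and at least one
# target-domain word. Word lists here are toy examples.

SOURCE_WORDS = {"journey", "road", "crossroads"}   # e.g. a JOURNEY source domain
TARGET_WORDS = {"life", "career"}                  # e.g. a LIFE target domain

def is_metaphor_candidate(sentence: str) -> bool:
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    return bool(words & SOURCE_WORDS) and bool(words & TARGET_WORDS)

is_metaphor_candidate("His life reached a crossroads.")   # True
is_metaphor_candidate("The road was closed.")             # False
```

Since the test fires on any co-occurrence, expanding either word list can only add matches: recall can rise, but every spurious co-occurrence costs precision, which is exactly the pattern Table 6.8 shows.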

words ↓ / method →    association    synonyms    corpus-based
source domain         1239           6348        126
target domain         674            5094        120

Table 6.7: Number of words in the lists compiled by the three methods.

Table 6.8 shows the results of the three methods. The most important finding is that when the association word lists were expanded with synonyms, recall improved, but only at the cost of a decline in precision. The corpus-based method, where the appropriate candidates were accurately extracted by hand, was clearly the most successful of the three strategies. (The values are very low overall, which indicates that our initial hypothesis, namely that the co-occurrence of psycholinguistically typical source domain and target domain words in a sentence is a good predictor of metaphoricity, receives no empirical support.)

method          recall (%)    precision (%)    F-measure (%)
association     3.8           7.5              5.6
synonyms        18.1          4.5              11.3
corpus-based    31.3          55.4             43.3

Table 6.8: Results of the three methods.

Turning back to the NER task: for proper names, similar synonym-based mitigation techniques break down. Based on the results presented above, we can hypothesize that expanding the gazetteer lists will result in higher recall at the cost of a decline in precision, and that shorter but accurately selected lists will improve both precision and recall. Related work on the effects of gazetteer list size in NER also confirms our hypothesis. Morgan et al. [1995], participants in the MUC-6 competition, report that the gazetteers provided by the organizers were not used in their system due to their limited effect on performance. Krupka and Hausman [1998] present lexicon size experiments in which the basic gazetteers contain 110,000 names, which were reduced to 25,000 and then 9,000 names while system performance degraded only marginally (from 91.60% to 91.45% and 89.13%, respectively). Moreover, they also show that the addition of an extra 42 entries to the gazetteers improves performance dramatically on the MUC-7 datasets. Based on these previous experiments, Mikheev et al. [1999] ask the following questions: how important are gazetteers? Is their size important? If gazetteers are important but their size is not, then what are the criteria for building them? They hypothesize, and confirm empirically, that using relatively small gazetteers of well-known names, rather than large gazetteers of low-frequency names, is sufficient. Indeed, while the best results do come from the run with the full gazetteers, the run with limited gazetteers decreases precision and recall by only 1–4 percentage points.
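The finding of Mikheev et al., that small gazetteers of well-known names suffice, suggests a simple frequency-based pruning step. The sketch below is a hypothetical illustration of that idea, not the procedure used in any of the cited experiments.

```python
from collections import Counter

def prune_gazetteer(gazetteer: set, corpus_tokens: list, keep: int) -> set:
    """Keep only the `keep` gazetteer entries most frequent in a reference
    corpus, discarding rare names. Hypothetical illustration of pruning a
    gazetteer to its well-known, high-frequency entries."""
    counts = Counter(t for t in corpus_tokens if t in gazetteer)
    return {name for name, _ in counts.most_common(keep)}

gazetteer = {"London", "Paris", "Zyrardow"}
corpus = ["London", "met", "Paris", "London", "in", "London"]
prune_gazetteer(gazetteer, corpus, keep=2)   # {'London', 'Paris'}
```

A learner can usually recover the dropped low-frequency names from context features, which is consistent with the small performance loss reported for the reduced-gazetteer runs.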