
version has been used for English NER [Simon and Nemeskey, 2012] (cf. Subsection 4.3.4), for recognizing metonymic NEs in English [Farkas et al., 2007] (cf. Subsection 3.3.2), and for the classification of semantic relations between pairs of nominals [Hendrickx et al., 2010]. In addition, it has also been used for shallow syntactic analysis of Hungarian texts [Recski and Varga, 2010], but we did not contribute to that work.

task                      train                test               F (%)
----------------------------------------------------------------------
Hu NER                    Szeged + wikilists   Szeged             95.48
En NER                    CoNLL + wikilists    CoNLL              86.34
En metonymy resolution
    loc-coarse            SemEval-2007         SemEval-2007       85.20
    org-coarse            SemEval-2007         SemEval-2007       76.70
En semantic relations     SemEval-2010         SemEval-2010       66.33
Hu chunking               Szeged Treebank      Szeged Treebank    89.87

Table 5.5: Best overall F-measures achieved by our system on several tasks.

Table 5.5 summarizes the best overall F-measures achieved by our system on several tasks. The results for NER and metonymy resolution are repeated from Chapters 4 and 3, respectively. The classification of semantic relations between pairs of nominals was a SemEval-2010 shared task, to which our system was submitted. The result published here (66.33% F-measure on the test set provided by the organizers) ranked 8th in the competition, so we decided not to write a system description paper. The task description and the results of all submitted systems are reported in Hendrickx et al. [2010]. The result is mentioned here only to illustrate the wide variety of NLP tasks for which our system can be used. Last but not least, the reimplemented huntag system is applied for shallow parsing under the name hunchunk [Recski and Varga, 2010]. For Hungarian, it was trained on the Szeged Treebank [Csendes et al., 2005].

handling a large number of rules is quite difficult, requiring a great deal of domain-specific knowledge engineering. In addition, these systems are brittle and not portable between different domains or tasks. Empirical methods, on the other hand, offer potential solutions to several problems in NLP, e.g. knowledge acquisition by means of automatic learning techniques, coverage by means of large amounts of data, robustness by means of frequency-based algorithms, and extensibility by means of portable systems. Although these problems are still not properly solved, using empirical methods results in higher performance.

The currently dominant technique used in the field of NER is supervised learning. Its disadvantage is that it requires large amounts of previously annotated data, so one might say that the human labour of creating rules has merely been shifted to that of building corpora. However, there is a relatively new, emerging field of NLP, and of NER in particular, which involves using unsupervised or semi-supervised techniques. In these cases, human labour has likewise been shifted, to constructing seed examples and/or embedding heuristics in the system. Unsupervised learning is a field where significant improvements can be made in the future. Another future direction is hybridization, combining the strengths of rationalist and empiricist methodologies.

Chapter 6

Feature Engineering

Features are descriptors or characteristic attributes of datapoints in text.

In supervised learning, a feature vector is assigned to each datapoint; each vector contains one or more Boolean- or string-valued features. Feature vector representation is a kind of abstraction over text. The task of the algorithm is then to find regularities in this large amount of information that are relevant to the classification task; in this case, to NER.
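As an illustration (not our actual feature set), a single token's feature vector might be represented as follows; all feature names and values here are hypothetical:

    # Illustrative sketch of a feature vector for one token, represented
    # as a mapping from feature names to Boolean or string values.
    # The feature names are hypothetical, not those of our system.
    def token_features(token):
        return {
            "word": token,                          # string-valued: surface form
            "lower": token.lower(),                 # string-valued: normalized form
            "is_capitalized": token[:1].isupper(),  # Boolean: surface property
            "is_all_caps": token.isupper(),         # Boolean: surface property
            "has_digit": any(c.isdigit() for c in token),  # Boolean: digit pattern
            "suffix3": token[-3:],                  # string-valued: crude morphology
        }

    print(token_features("Budapest"))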

NER is a typical sequence labelling task, i.e. the model has to assign NE labels to sequences, e.g. sentences in text. Several machine learning algorithms, however, such as maximum entropy modelling, realize it as a token-based classification task. Contextual information is not lost, since the features of neighbouring tokens can be added to the inspected token’s feature vector.
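A minimal sketch of this windowing idea, reusing token_features from the sketch above; the window size of two and the offset-prefix naming convention are our illustrative assumptions:

    # Copy the features of neighbouring tokens into the inspected token's
    # vector, keyed by relative position, so that a token-based classifier
    # still sees contextual information.
    def window_features(sentence, i, window=2):
        features = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(sentence):
                for name, value in token_features(sentence[j]).items():
                    features["%+d:%s" % (offset, name)] = value
        return features

    sent = ["John", "lives", "in", "Budapest", "."]
    print(window_features(sent, 3))  # "Budapest" plus two tokens of context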

In this chapter, we present the features most often used for NER in the token-based classification scenario. We categorize features along the dimension of what kind of properties they provide: surface properties, digit patterns, morphological or syntactic information, or gazetteer list membership. We also study the effect of gazetteer list size on the performance of NER systems.

Defining features for a supervised system is manual work, similar to coding patterns for a rule-based system. However, in the case of statistical methodology, it is the data and not the linguist that determines the usefulness of a feature. Human cognition tends to register only salient phenomena: it will regard as important some properties that corpus data then show to be unimportant and, conversely, will fail to notice important ones. For this reason, the power of features has to be measured on real data before their inclusion in the system. This is called feature engineering.

6.1 Methods

To measure the strength of features, we build NER systems for Hungarian and English in parallel, adding new features one by one. If the inspected feature is useful, we retain it and add the next one to the system. We consider a feature useful if adding it does not decrease the performance. The rationale behind this decision is that if a feature does not make things worse on the development set, it may be effective on the test set. Figures indicating F-measures achieved by the system after adding a useful feature are italicized in tables.

There are several feature selection methods used in machine learning which select a subset of relevant features for inclusion in the model. The simplest algorithm is to test each possible subset of features and find the one which maximizes the F-measure. An exhaustive search of the feature space, however, is generally impractical. Since we work with a large number of features and can rely on results from previous experiments, we decided to choose an incremental method of feature selection.

In this scenario, we start with an arbitrary feature subset, then attempt to find a better solution by incremental extension of the subset with new features. After each step, the system's performance is evaluated and compared to the previous results. The feature subset achieving the best overall F-measure is then retained.
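A minimal sketch of this greedy, incremental loop, combined with the retention criterion above (a feature is kept if the development-set F-measure does not decrease); train_and_evaluate is a hypothetical stand-in for training the tagger with a given feature subset and scoring it on the development set:

    # Incremental (greedy forward) feature selection. Candidates are tried
    # in a fixed order; a feature is retained if adding it does not
    # decrease the F-measure on the development set.
    def select_features(candidates, initial, train_and_evaluate):
        selected = list(initial)                 # start from an arbitrary subset
        best_f = train_and_evaluate(selected)    # baseline score
        for feature in candidates:
            f = train_and_evaluate(selected + [feature])
            if f >= best_f:                      # does not decrease: keep it
                selected.append(feature)
                best_f = f
        return selected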

Selection of the initial feature of each feature category is based on previous experiments: the strongest feature is added first. In this feature selection scenario, however, there are some features which would not have been removed had they been added to the system in a different order. Since these features cause only insignificant changes in performance, we are probably not off track in removing them.

As described in Subsection 5.3.2, the hunner system has built-in parameters for setting the number of iterations, the Gaussian penalty, the weight of the language model, and the number of features used for model building (cutoff). We set these parameters so as to keep the system as neutral as possible. Thus, we do not apply a Gaussian penalty and set the cutoff to 1, which means that all features are taken into account in model building. The language model weight is set to 1, thus the emission probabilities of the maximum entropy model and the transition probabilities of the HMM have equal weights. All experiments are run with 110 iterations, which has proven to be sufficient even with such a large number of features.
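Under our reading of this setup, the combination can be sketched as follows; with the language model weight set to 1, maxent emissions and HMM transitions contribute equally to a tag sequence's score (the function is illustrative, not the actual hunner code):

    import math

    # Illustrative scoring of one tag sequence: emission probabilities come
    # from the maximum entropy model, transition probabilities from the HMM;
    # lm_weight = 1.0 gives the two components equal weight.
    def sequence_log_score(emission_probs, transition_probs, lm_weight=1.0):
        return (sum(math.log(p) for p in emission_probs)
                + lm_weight * sum(math.log(p) for p in transition_probs))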

We use the reimplemented hunner system described in Subsection 5.3.2. Since it is language-independent, it can be used for Hungarian and English in parallel. For Hungarian, we use the same train–development–test split of the Szeged NER corpus that was used for the evaluation of the original hunner system. For experiments on English, we use the standard train–development–test sets of the CoNLL-2003 corpus. Unless mentioned otherwise, the results should be interpreted as standard F-measures (%) achieved on these development datasets. Finally, we evaluate the best feature combination on the corresponding test sets and measure the strength of each feature subset.
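For reference, the standard F-measure can be computed over exact entity spans in the usual CoNLL fashion (an entity counts as correct only if both its boundaries and its type match the gold standard); a minimal sketch:

    # Phrase-level precision, recall and F-measure over exact
    # (start, end, type) entity spans, CoNLL-style.
    def f_measure(gold, predicted):
        correct = len(gold & predicted)
        if correct == 0:
            return 0.0
        precision = correct / len(predicted)
        recall = correct / len(gold)
        return 2 * precision * recall / (precision + recall)

    gold = {(0, 1, "PER"), (3, 4, "LOC")}
    pred = {(0, 1, "PER"), (3, 4, "ORG")}
    print("%.2f" % (100 * f_measure(gold, pred)))  # 50.00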