
In this section, we give a brief overview of the main issues concerning corpus building. There are excellent textbooks on corpus linguistics, e.g. McEnery and Wilson [2001]; Lüdeling and Kytö [2008]; O’Keeffe and McCarthy [2010], to mention only a few, so we direct the reader to them for more complete discussions.

The very first step of corpus building is determining corpus design: intended uses of the corpus, the language variety to be covered, the domain(s) to be represented, the required size, and the future access of the corpus. The last criterion is of key importance in text collection, if one intends to build a corpus that is freely accessible, at least for research purposes. In addition to the effort put into data processing, a considerable amount of time has to be devoted to acquiring texts and clearing copyrights. The data collection process and negotiations on Intellectual Property Rights (IPR) matters may drag on for months. In corpus linguistics textbooks, issues of data collection and more specifically copyright clearance are hardly touched upon. However, there are a few current attempts at improving the state of affairs, e.g. Clercq and Perez [2010]; Xiao [2010].

A corpus is a well-organized collection of data, “collected within the boundaries of a sampling frame designed to allow the exploration of certain linguistic feature (or set of features) via the data collected” [McEnery, 2004]. If the object of study is a highly restricted sublanguage or a dead language, identifying the texts to be included in the corpus is straightforward. For example, when constructing the Old Hungarian corpus [Simon et al., 2011; Simon and Sass, 2012], we had to acquire all available sources from the Old Hungarian period (896–1526), creating a corpus of fixed size (approx. 2 million tokens). However, moving forward in the history of the Hungarian language, after the time of Gutenberg’s invention of the printing press, the amount of textual sources increases to the point where including all of them in a corpus seems impossible. In such cases, a sampling frame is of crucial importance. The corpus should aim for balance and representativeness within a specific sampling frame, in order to allow a particular variety of language to be studied.

The issue of representativeness is one of the most frequently discussed questions of corpus design (e.g. Biber [1993]). However, attempting to reach representativeness is like shooting at a moving target. Quoting Hunston [2008]: “representativeness is the relationship between the corpus and the body of language it is used to represent”. But what do we know about the body of language? Getting information about the language is the very reason why we build corpora. Maybe it is easier to give examples of unrepresentativeness. If one wants to build a representative general-aim corpus of present-day standard Hungarian, collecting only blogs about sports will not be enough. Or, citing McEnery’s example [McEnery, 2004]: imagine that a researcher decides to construct a corpus to assist in the task of developing a dialogue manager for a telephone ticket selling system. This researcher will not sample the novels of Jane Austen or movie subtitles to cover the language usage of phone dialogues. Thus, representativeness is a goal we can aim for, without being convinced that we will reach it.

The first phase of corpus building starts with the acquisition of source data. In the case of written texts, there are three methods to achieve this. In the most fortunate case, the corpus consists of texts which are available electronically, in some machine-readable format. If a source is only available in print, digitization is necessary, mainly in the form of manual scanning followed by a conversion process from the scanned images into regular text files aided by optical character recognition software. This step involves extensive manual proofreading and correction to ensure initial resources of good quality as input to further computational processing. The third method is typing up text by hand, which is usually avoided unless the texts concerned are not available in any other way. This is the case, for example, with old manuscripts, handwritten letters, or codices [Oravecz et al., 2010].
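To make the OCR-based digitization path concrete, the following is a minimal sketch that converts scanned page images into a raw text file using the pytesseract wrapper around the Tesseract engine; the library choice, the Hungarian language model and the file paths are illustrative assumptions rather than the toolchain used for the corpora discussed here.

```python
# Minimal OCR sketch (assumes Tesseract with the Hungarian language pack,
# plus the pytesseract and Pillow packages, are installed).
from pathlib import Path

from PIL import Image
import pytesseract


def ocr_scanned_pages(image_dir: str, output_file: str) -> None:
    """Convert a directory of scanned page images into one raw text file.

    The OCR output is only a first draft: as noted above, it still needs
    extensive manual proofreading and correction.
    """
    pages = sorted(Path(image_dir).glob("*.png"))
    with open(output_file, "w", encoding="utf-8") as out:
        for page in pages:
            text = pytesseract.image_to_string(Image.open(page), lang="hun")
            out.write(text + "\n")


# ocr_scanned_pages("scans/", "raw_text.txt")  # hypothetical paths
```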

McEnery and Wilson [2001] describe annotated corpora as being “enhanced with various types of linguistic information”. The development of annotation, which is more or less prototypical in modern language corpora, requires a number of standard NLP tasks: sentence segmentation and tokenization, morphological analysis and morphosyntactic disambiguation.

These basic processing steps are usually carried out automatically, since tokenizers, sentence splitters and POS taggers are reliable enough for certain languages such as English and Hungarian that a wholly automated annotation is feasible. Error rates associated with taggers are low, typically reported at around 3 percent. For example, the automatic morphosyntactic annotation in the Hungarian National Corpus reaches a general precision of about 97.5%, i.e. 2.5% of all wordforms have an erroneous analysis [Váradi, 2002]. Higher precision could only be achieved by manual annotation, which is usually not feasible for large amounts of data.
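As a minimal illustration of these basic processing steps, the sketch below runs sentence segmentation, tokenization and POS tagging with NLTK on an English example; the tool choice is an assumption made for illustration, not the Hungarian-specific pipeline (with morphological analysis and disambiguation) used for the Hungarian National Corpus.

```python
# Basic automatic annotation sketch: sentence segmentation, tokenization
# and POS tagging with NLTK (English models; illustrative only).
import nltk

# One-time model downloads (sentence splitter and POS tagger).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "The corpus was annotated automatically. Manual checking followed."

for sentence in nltk.sent_tokenize(text):   # sentence segmentation
    tokens = nltk.word_tokenize(sentence)   # tokenization
    tagged = nltk.pos_tag(tokens)           # POS tagging
    print(tagged)
```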

More typically, however, NLP tools are not sufficiently accurate to allow for fully automated annotation. In these cases, semi-manual or fully manual annotation is required. For building a highly accurately annotated corpus, first the annotation scheme has to be developed, then annotation guidelines have to be taught to the annotators. The more elaborate the guidelines, the cleaner and more useful the corpus, as long as the guidelines remain teachable; but when they become too complex, annotators begin to perform at an unacceptably high error rate. Guidelines have to define the annotation task, enumerate the types of language units to annotate, and give examples of what to annotate and what not to annotate.

(For examples of annotation guidelines for the NER task, see Section 2.1.) As we proceed in enriching the corpus with linguistic annotation, from the basic processing steps to more difficult semantic annotation levels, it becomes clear that the more linguistic and semantic knowledge is required, the more fluid the annotation process gets. There are some linguistic phenomena which are hard to define, as can be seen in the case of NEs (see Chapter 2) and metonymies (see Chapter 3). If the guidelines are not accurate enough, such linguistic phenomena will be identified and categorized based on the intuitions of the annotators. This strategy may be unproblematic for very clear-cut classes, but an exhaustive annotation will confront the researcher with many cases that are not clear-cut. In such cases, inter-annotator agreement is usually measured. The simplest measure is the joint probability of agreement:

\[
\frac{2 \times |\text{identically tagged entities}|}{|\text{entities tagged by annotator A}| + |\text{entities tagged by annotator B}|}
\]

This formula was used for calculating inter-annotator agreement in the case of finding and labelling metaphorical expressions in a corpus we built for a study of literal versus metaphorical language use [Babarczy et al., 2010b]. At the first attempt, inter-annotator agreement was only 17%. After refining the annotation instructions, we made a second attempt, which resulted in an agreement level of 48%, which is still a strikingly low value.

These results indicate that the definition of metaphoricity is problematic in itself, and that the refinement of annotation guidelines results in a more accurately annotated dataset. In contrast to finding metaphorical expressions, recognizing NEs in texts usually results in much higher inter-annotator agreement, mostly above 90%.
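For concreteness, the joint probability of agreement defined above can be computed over two annotators' entity annotations as in the following sketch; representing each annotated entity as a (span, label) pair is an assumption made purely for illustration.

```python
# Joint probability of agreement over two annotators' entity annotations.
# Each annotation set is assumed to contain (span, label) pairs, e.g.
# ((12, 18), "PER"); the exact representation is an illustrative choice.

def joint_agreement(annotator_a: set, annotator_b: set) -> float:
    """2 * |identical entities| / (|entities by A| + |entities by B|)."""
    identical = annotator_a & annotator_b
    total = len(annotator_a) + len(annotator_b)
    return 2 * len(identical) / total if total else 0.0


a = {((0, 5), "PER"), ((10, 20), "ORG"), ((25, 30), "LOC")}
b = {((0, 5), "PER"), ((10, 20), "LOC"), ((25, 30), "LOC")}
print(joint_agreement(a, b))  # 2 * 2 / (3 + 3) = 0.67
```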

However, the joint probability of agreement does not take into account that agreement may happen solely by chance. For this reason, other coefficients such as Cohen’s κ and Krippendorff’s α are traditionally more often used in CL [Artstein and Poesio, 2008]. The strength of agreement is said to be almost perfect above a κ value of 0.8, according to Landis and Koch [1977].
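To illustrate the chance correction, the sketch below computes Cohen’s κ directly from two annotators' label sequences (observed agreement corrected by the agreement expected from the annotators' label distributions); the toy labels are invented for illustration, and scikit-learn's cohen_kappa_score would be an equivalent off-the-shelf alternative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: proportion of items labelled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected (chance) agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lab] / n * freq_b[lab] / n
              for lab in freq_a.keys() | freq_b.keys())
    return (p_o - p_e) / (1 - p_e)


# Two annotators labelling the same ten tokens (toy data).
a = ["PER", "O", "O", "LOC", "O", "PER", "O", "ORG", "O", "O"]
b = ["PER", "O", "LOC", "LOC", "O", "PER", "O", "O", "O", "O"]
print(round(cohens_kappa(a, b), 3))  # 0.8 observed vs. 0.47 by chance
```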

When building a gold standard corpus, researchers aim for as high inter-annotator agreement as possible, so samples whose annotation cannot be agreed on are often excluded from the corpus (e.g. Markert and Nissim [2007b]), resulting in clean and noiseless corpora. NLP has been predominantly focused on relatively small and well-curated datasets; there are, however, new emerging attempts at processing data in non-standard sublanguages such as the language of tweets, blogs, social media, or historical data. In these NLP tasks, models based on cleaned data do not perform well, so researchers started using collaboratively constructed resources to substitute for or supplement conventional resources such as linguistically annotated corpora.

4.2 Gold Standard Corpora for Named Entity