

4.2.3 Errors corrected by one of the systems

In contrast to common, full word forms, the two systems behaved differently on shorter terms, abbreviations, and domain-specific words, especially in the task of error detection. The word-based system tended to change these words incorrectly to other forms, while the SMT system either left them in their original form if they were already correct, or corrected them to the proper form. Some examples are listed in Table 4.9.

Since abbreviations and shortened forms can be disambiguated only in context (Siklósi and Novák, 2013), their correction also requires contextual information, which is provided by the domain-specific language model. Similarly, the proper correction of special medical terms and Latin expressions is impossible without contextual information. The word-based system was either unable to suggest a correction ranked higher than the original form, or changed such words to common terms that gained a higher score due to their high frequency in texts from other domains. In some cases even originally correct words were overwritten. For example, the last row of Table 4.9 shows the results for the word vannas. This refers to a special type of scissors used in surgery, called “vannas scissors”. The word-based system altered this word to vannak ‘are’, which is a very common Hungarian word.


| Original form | Word-based | SMT | Gold standard | English translation |
|---|---|---|---|---|
| szemhéjszélt | szemhéjszéli | szemhéjszélt | szemhéjszélt | ‘side of eyelid’ |
| tu | tó | tu. | tu. | short form of ‘tumor’ |
| inf | in | inf. | inf. | short form of ‘inferioris’ |
| elasticum | elasticus | elasticum | elasticum | ‘elasticum’ |
| ell | el | ell. | ell. | short form of ‘check’ |
| cover | over | cover | cover | ‘cover’ (a medical test) |
| skia | skin | skia | skia | ‘skia’ (a medical test) |
| deg | meg | deg. | deg. | short form of ‘degenerate’ |
| jav | ja | jav. | jav. | short form of ‘correct’ |
| dec | de | dec. | dec. | short form of ‘December’ |
| tonopen | tonogen | Tonopen | Tonopen | ‘Tonopen’ (a medication) |
| ill | áll | ill. | ill. | short form of ‘or’ |
| amb | ab | amb. | amb. | short form of ‘ambulatory’ |
| vannas | vannak | vannas | vannas | ‘vannas’ (a medical tool) |

Table 4.9: Some examples for words corrected properly or untouched by the SMT system, but altered incorrectly by the word-based system
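The contrast discussed in this section can be illustrated with a minimal sketch. It is not the word-based or the SMT system used in this work: a context-free frequency ranker stands in for the former and a smoothed bigram score stands in for a domain-specific language model. All counts, the candidate list, and the function names are invented for illustration; the example context word olló (Hungarian for ‘scissors’) is also an assumption.

```python
from collections import Counter

# Invented toy counts, for illustration only (not taken from any real corpus).
# General-domain frequencies favour the common word "vannak" ('are').
GENERAL_UNIGRAMS = Counter({"vannak": 5000, "vannas": 2})
# Hypothetical in-domain bigram counts: "vannas" is followed by "olló" ('scissors').
DOMAIN_BIGRAMS = Counter({("vannas", "olló"): 30, ("vannak", "olló"): 0})


def word_based_choice(candidates, unigrams):
    """Pick the candidate with the highest context-free frequency."""
    return max(candidates, key=lambda w: unigrams[w])


def context_choice(candidates, next_word, bigrams):
    """Pick the candidate that best fits the following word (add-one smoothed count)."""
    return max(candidates, key=lambda w: bigrams[(w, next_word)] + 1)


# The original token plus one spelling variant at edit distance 1.
candidates = ["vannas", "vannak"]

print(word_based_choice(candidates, GENERAL_UNIGRAMS))     # vannak
print(context_choice(candidates, "olló", DOMAIN_BIGRAMS))  # vannas
```

Under these assumed counts, the frequency-only ranker overwrites the correct domain term with the more common word, while the context-aware score keeps the original form, mirroring the last row of Table 4.9.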

5 Identification and resolution of abbreviations

In which it is shown how many extra letters we use in our everyday language. Doctors need fewer. Still, in order to understand their message, some character refilling methods are described, with special emphasis on the ophthalmology domain. To make it more exciting, try to figure out what ‘Cat. incip. o. utr.’ stands for. By the end of this Chapter, this will be revealed.

Contents

5.1 Clinical abbreviations
  5.1.1 Series of abbreviations
  5.1.2 The lexical context of abbreviation sequences
5.2 Resources
  5.2.1 External lexicon
  5.2.2 Handmade lexicon
5.3 Methods
  5.3.1 Detection of abbreviations
  5.3.2 Resolving abbreviations based on external resources
  5.3.3 Unsupervised, corpus-induced resolution
5.4 Results and experiments
  5.4.1 Fine-tuning the parameters
5.5 Performance on resolving abbreviations


The task of abbreviation resolution is often treated as word sense disambiguation (WSD) (Navigli, 2012). The best-performing approaches to WSD use supervised machine learning techniques. In the case of less-resourced languages, however, neither manually annotated data nor an inventory of possible senses of abbreviations is available, both of which are prerequisites of supervised algorithms (Nasiruddin, 2013). Unsupervised WSD methods, on the other hand, are composed of two phases: word sense induction (WSI) must precede the disambiguation process. Possible senses for words or abbreviations can be induced from a corpus based on contextual features. However, such methods require large corpora to work properly, especially if the ratio of ambiguous terms and abbreviations is as high as in the case of clinical texts. Because confidentiality issues and quality problems make such large clinical corpora hard to obtain, this approach is not promising either.

In this Chapter, I introduce the behaviour of abbreviations in Hungarian clinical documents. Then, a corpus-based approach is described for the resolution of abbreviations, using the very few lexical resources available in Hungarian. As this method did not provide acceptable results, the construction of a domain-specific lexicon was unavoidable. Instead of trying to create huge resources covering the whole field of medical expressions, it is shown that small domain-specific lexicons are satisfactory and that the abbreviations to be included can be derived from the corpus itself. Finally, an analysis of the combination of these methods is presented.

5.1 Clinical abbreviations

The use of a kind of notational text is very common in clinical documents. This dense form of documentation contains a high ratio of standard or arbitrary abbreviations and symbols, some of which may be specific to a special domain or even to a doctor or administrator.

These shortened forms might refer to clinically relevant concepts or to common phrases that are very frequent in the specific domain. For clinicians, the meaning of most of these common phrases is as trivial as the standard shortened forms of clinical concepts, owing to their expertise and familiarity with the context. Some examples of abbreviations falling into these categories are shown in Table 5.1.