
3.3 Structuring and categorizing lines

3.3.3 Categorizing statements

Even though the PART tags have labelled each part according to the documentation template of the system, the titles of these fields rarely correspond to their content. For example, the status field is frequently used to hold all the information, be it originally anamnesis, treatment, therapy, or any other comment. Thus, it was necessary to categorize each statement in each part of the documents. The units of categorization were the concatenated lines (see Chapter 3.2). Moreover, lines containing tabular data were also recognized during this processing step, based on the indentation at the beginning of a line and the amount and appearance of whitespace within the line.
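The recognition of tabular lines can be illustrated with a minimal sketch. The text only states that indentation and internal whitespace were used, so the thresholds `MIN_INDENT` and `MIN_GAP_RUNS`, the function name `is_tabular`, and the decision to require both cues at once are assumptions made for illustration.

```python
import re

# Hypothetical thresholds; the actual values are not given in the text.
MIN_INDENT = 4      # leading spaces that suggest a table row
MIN_GAP_RUNS = 2    # wide whitespace runs inside the line (column gaps)

def is_tabular(line: str) -> bool:
    """Guess from the whitespace layout whether a line is a table row."""
    if not line.strip():
        return False
    expanded = line.expandtabs(4)
    # Indentation: number of spaces before the first visible character.
    indent = len(expanded) - len(expanded.lstrip(" "))
    # Internal gaps: runs of two or more spaces between visible characters.
    gaps = re.findall(r"\S( {2,})(?=\S)", expanded.strip())
    return indent >= MIN_INDENT and len(gaps) >= MIN_GAP_RUNS
```

Under these assumptions, an indented row such as `"      0.8   +1.0 Dsph   -2.5 Dcyl"` is flagged as tabular, while a running-text line such as `"Jelen státusz: ép viszonyok"` is not.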

The set of these categories, or intentional subheadings, was defined with the help of an ophthalmologist. The categories, their definitions, and an example sentence for each are shown in Table 3.1. Not all of these parts are present in every record, and there is no compulsory order for these types of statements. By tradition, however, the anamnesis is usually at the beginning, while the diagnoses and opinions are at the end. Almost every document includes visual acuity measurements (sometimes nothing else). The original documentation system also provides the possibility to type these data into different fields, but these templates are much less fine-grained, and doctors do not tend to use them.

First, using the preprocessed version of the texts, some patterns were identified based on part-of-speech tags and the semantic concept categories assigned to the most frequent entities.

For example, since verbs are rarely used, a past tense verb recognized in a sentence was a good indicator that the sentence belonged to the anamnesis or to the complaints of the patient (see Chapter 6.4).

Second, some indicator words were extracted from the documents. First of all, these were the line-initial words and short phrases that started with a capital letter and were followed by a colon and some more content. These phrases were then ordered by their frequency of occurrence. Then, each was manually assigned a category label referring to the type of statement that the phrase could indicate. For example, the phrase korábbi betegségek ‘previous illnesses’ was given the label Ana, referring to anamnesis. Table 3.2 shows some more examples of tags and the phrases labelled by them. Once all the phrases occurring at least 10 times in the whole corpus had been labelled, they were matched against the lines of each document that were found in PART sections and were not recognized as tabular data. If the line started with a phrase or any of its variations (differences in case, misspellings, punctuation marks and white spaces were allowed), then the line was labelled

18 3. Accessing the content

| tag | meaning | definition and example |
| --- | --- | --- |
| Tens | Tension | Eye pressure expressed in millimeters of mercury, representing the pressure inside the eye. e.g.: T:18 Hgmm |
| V/Refr | Refraction | Refraction/Visus/visual acuity refers to the clarity of vision. The examined correction is measured in spherical and cylindrical dioptres. e.g.: Refr: +1.25Dsph -2.5Dcyl 5’ |
| Ana | Anamnesis | Anamnesis includes patient history, family history and the description of the problem and the cause by the patient. May also include allergy information. e.g.: Hamar elfárad a szeme,vibráló fényeket kb.fél éve lát, nem tudja, melyik szemében. // His eyes quickly get tired, has seen flashing lights for about half a year, doesn’t know in which eye. |
| Dg | Diagnosis | The diagnoses found during the examination (or during previous examinations). e.g.: Dg: Cat. incip. ou., Dystrophia con. lu., Keratitis ou., háttér retinophatia és retinális elváltozások. // Dg: Cat. incip. ou., Dystrophia con. lu., Keratitis ou., Background retinopathy and retinal anomalies. |
| Beav | Treatment | The applied treatment during, before or after the actual examination. e.g.: 2010.05.27 08:25 - (2SOCT) OCT + FLAG |
| Vél | Opinion | The opinion or the suggestions of the doctor. It is not an official diagnosis, but might contain the description of the diagnosis. e.g.: vél: jelenleg szemészeti teendö nincs // op: no further action is needed |
| St | Status | The actual state of the patient. Usually includes the results of the performed examination but without the diagnoses. e.g.: Jelen státusz: // Present state: |
| Ther | Therapy | Applied or prescribed therapy. e.g.: 2x Humapent |
| BNO | BNO (ICD) | The BNO code of a disease or treatment. e.g.: BNO: H00100 CHALAZION |
| T | Test | Performed tests, except refraction measurements. Most commonly slit lamp or ultrasound tests are applied. e.g.: látótér od beszükült, de korábbinál jobb // Field of vision is narrower, but better than before. |
| V | Visus | Refraction/Visus/visual acuity refers to the clarity of vision. The examined correction is measured in spherical and cylindrical dioptres. e.g.: V: 0.8 +1.0 Dsph -2.5Dcyl 30’ =1.0 |
| Rl | Slit lamp | The slit lamp is used to examine the inner parts of the eye. The state of the different parts is described as seen by the doctor using the slit lamp. e.g.: Fundus: éles szélü, jó színü papilla nívóban, maculatáj fénytelen, sclerotikus erek, perif. ép. // Fundus: sharp edges, good colored standard papilla, the area of the macula is dim, sclerotic veins, perif. intact |
| Kontr | Control | The decision about the next visit or control examination. e.g.: Kontroll 2-3 hónap múlva vagy panasz esetén // Control in 2-3 months or in the case of complaints. |
| Műtét | Operation | Applied or prescribed operations. e.g.: Phaco + PCL impl. o.sin. (Dr R. Zs.) XXX |
| - | | e.g.: Tisztelt Háziorvos! // Dear Family Doctor, |

Table 3.1: The tags with their meanings, definitions, and an example sentence


with the tag the phrase belonged to. These first two steps were able to categorize 34% of the concatenated lines in the documents.
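The second step can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure actually used: the regular expression `CANDIDATE`, the helper names `normalize_phrase`, `candidate_phrases` and `label_line`, the use of `difflib.SequenceMatcher` to tolerate misspellings, and the `min_ratio` threshold are all hypothetical. Only the frequency cut-off of 10 and the allowed kinds of variation come from the text.

```python
import re
import unicodedata
from collections import Counter
from difflib import SequenceMatcher

# Line-initial phrase that starts with a letter and is followed by a colon
# and some more content (the capital-letter check is done separately).
CANDIDATE = re.compile(r"^\s*([^\W\d_][^:]{0,40}?)\s*:\s*\S")

def normalize_phrase(text: str) -> str:
    """Ignore case, punctuation and whitespace when comparing phrases."""
    text = unicodedata.normalize("NFC", text).casefold()
    return "".join(ch for ch in text if ch.isalnum())

def candidate_phrases(lines, min_count=10):
    """Collect capitalized line-initial phrases, ordered by frequency."""
    counts = Counter()
    for line in lines:
        m = CANDIDATE.match(line)
        if m and m.group(1)[0].isupper():
            counts[m.group(1).strip()] += 1
    return [(p, n) for p, n in counts.most_common() if n >= min_count]

def label_line(line, phrase_to_tag, min_ratio=0.85):
    """Return the tag of the best-matching phrase at the start of the line.

    phrase_to_tag holds the manually assigned labels; min_ratio is a
    hypothetical tolerance for misspellings.
    """
    norm_line = normalize_phrase(line)
    best_tag, best_ratio = None, 0.0
    for phrase, tag in phrase_to_tag.items():
        norm = normalize_phrase(phrase)
        if not norm:
            continue
        prefix = norm_line[:len(norm)]
        ratio = SequenceMatcher(None, prefix, norm).ratio()
        if ratio >= min_ratio and ratio > best_ratio:
            best_tag, best_ratio = tag, ratio
    return best_tag
```

With a mapping such as `{"Korábbi betegségek": "Ana", "Dg": "Dg"}`, a line like `"KORÁBBI BETEGSÉGEK: hypertonia"` would receive the tag Ana, while a line that starts with none of the phrases stays unlabelled and is left for the third step. Very short phrases match many line beginnings under this scheme, so a real implementation would likely need stricter conditions for them.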

Table 3.2: Examples of tags and some of the phrases labelled by the tag.

In the third step, the remaining lines were given a label. To do this, all lines labelled in the first two steps were collected for each tag (these will be referred to as tag collections). Then, for each line, the most similar tag collection was determined and the tag of this collection was assigned to the line. The similarity measure applied was the tf-idf weighted cosine similarity between a line (l) and a tag collection (c), defined by Formula 3.1:

\[ \mathrm{sim}(l, c) = \frac{\vec{l} \cdot \vec{c}}{\lVert \vec{l} \rVert \, \lVert \vec{c} \rVert} \tag{3.1} \]

where $\vec{l}$ contains the normalized set of words in line l, and $\vec{c}$ the normalized set of words contained in the tag collection c. During normalization, stopwords and punctuation marks were removed and numbers were replaced by the character x, so that the actual numerical values would not mislead the representation. As a result, all lines within PART sections were labelled with a tag. Finally, tabular lines were assigned the tag Vis, since these contained the detailed information about the visual acuity of the patient.
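The third step can be illustrated with a small self-contained sketch of tf-idf weighted cosine similarity between a line and the tag collections. Several details are assumptions, since the text does not specify them: the stopword list `STOPWORDS` is a tiny illustrative sample, the idf variant (`log(N / df) + 1`) is one common choice, each tag collection is treated as a single document, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "az", "és"}  # illustrative sample, not the list actually used

def normalize_tokens(text):
    """Lowercase, replace each number by 'x', drop punctuation and stopwords."""
    # Padding 'x' with spaces (an assumption) keeps it a separate token.
    text = re.sub(r"\d+(?:[.,]\d+)*", " x ", text.casefold())
    return [t for t in re.findall(r"\w+", text) if t not in STOPWORDS]

def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; words unseen in the collections get weight 0."""
    return {t: n * idf.get(t, 0.0) for t, n in Counter(tokens).items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_model(tag_collections):
    """tag_collections maps a tag to the lines labelled with it so far."""
    docs = {tag: [tok for line in lines for tok in normalize_tokens(line)]
            for tag, lines in tag_collections.items()}
    df = Counter(t for toks in docs.values() for t in set(toks))
    idf = {t: math.log(len(docs) / d) + 1.0 for t, d in df.items()}
    vectors = {tag: tfidf_vector(toks, idf) for tag, toks in docs.items()}
    return idf, vectors

def classify(line, idf, vectors):
    """Assign the tag of the most similar tag collection.

    If every similarity is 0, the first tag is returned.
    """
    vec = tfidf_vector(normalize_tokens(line), idf)
    return max(vectors, key=lambda tag: cosine(vec, vectors[tag]))
```

For instance, with tag collections built from `"T: 18 Hgmm"` and `"T: 16 Hgmm"` for Tens, and from two diagnosis lines for Dg, the unlabelled line `"Tensio 17 Hgmm"` is assigned Tens: after normalization it shares the tokens `x` and `hgmm` with the Tens collection, regardless of the concrete pressure value.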
