Morphological and Syntactic Annotation of Hungarian Webtext

Academic year: 2022

XI. Magyar Számítógépes Nyelvészeti Konferencia

Veronika Vincze1,2, Viktor Varga1, Petra Anna Papp1, Katalin Ilona Simkó1, János Zsibrita1, Richárd Farkas1

1University of Szeged, Department of Informatics, Szeged, Árpád tér 2.

{vinczev,zsibrita,rfarkas}@inf.u-szeged.hu, {varga.viktor.1991,papp.petra.anna,kata.simko}@gmail.com

2MTA-SZTE Research Group on Artificial Intelligence, Szeged, Tisza Lajos körút 103.

For some time now, internet communication has been used as a source of data for research. Web texts that try to mimic oral communication contain many abbreviations and errors, which make their linguistic processing more difficult. Our goal was to create a corpus of web texts and manually annotate it for morphology and syntax, so that it can support the development of future natural language processing applications for this domain.

Our corpus is made up of public Facebook comments (1208 sentences, 8615 tokens) and questions and answers from gyakorikerdesek.hu (728 sentences, 9702 tokens). Most posts are about users’ hobbies, personal interests and lifestyle.
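The reported sizes can be combined into overall totals and mean sentence lengths. The following is a minimal sketch that uses only the figures given above; the dictionary keys and the `summarize` helper are illustrative names, not part of the corpus distribution.

```python
# Sub-corpus sizes as reported in the abstract.
SUBCORPORA = {
    "Facebook comments": {"sentences": 1208, "tokens": 8615},
    "gyakorikerdesek.hu Q&A": {"sentences": 728, "tokens": 9702},
}


def summarize(subcorpora):
    """Return mean tokens per sentence for each source, plus overall totals."""
    mean_length = {}
    total_sentences = 0
    total_tokens = 0
    for name, counts in subcorpora.items():
        total_sentences += counts["sentences"]
        total_tokens += counts["tokens"]
        mean_length[name] = counts["tokens"] / counts["sentences"]
    return mean_length, total_sentences, total_tokens


means, n_sent, n_tok = summarize(SUBCORPORA)
print(n_sent, n_tok)  # 1936 18317
for source, value in means.items():
    print(f"{source}: {value:.2f} tokens per sentence")
```

On these figures the Facebook comments average about 7.1 tokens per sentence and the question-and-answer texts about 13.3, so the two sources differ noticeably in sentence length.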

First, we manually segmented the sentences and tokenised the text. Then, using one of the modules of magyarlanc, we built a corpus structurally similar to the Szeged Korpusz, in which the annotators manually assigned the contextually correct morphological code to each word. As in the Szeged Treebank and the Szeged Dependency Treebank, we also created manual constituency and dependency analyses for each sentence. We mainly followed the principles used in developing our two earlier, larger treebanks, but some modifications were unavoidable given the special form of these texts. The corpus is also annotated for semantic and discourse-level uncertainty markers, and we plan to annotate named entities in it as well.
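To make the annotation layers concrete, here is a minimal sketch of how one token with a morphological code and a dependency head might be represented, together with a check that a sentence's dependency annotation forms a tree. The `Token` fields, the `is_well_formed` helper, and the placeholder forms and codes (`w1`, `M1`, `dep`) are all assumptions made for illustration; they do not reproduce the corpus's actual file format or tag set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    index: int   # 1-based position in the sentence
    form: str    # surface form as written in the web text
    morph: str   # morphological code chosen by the annotator (placeholder here)
    head: int    # index of the syntactic head; 0 marks the root
    deprel: str  # dependency label (placeholder here)


def is_well_formed(sentence: List[Token]) -> bool:
    """Check for consecutive indices, exactly one root, valid heads, no cycles."""
    n = len(sentence)
    if [t.index for t in sentence] != list(range(1, n + 1)):
        return False
    if sum(1 for t in sentence if t.head == 0) != 1:
        return False
    heads = {t.index: t.head for t in sentence}
    if any(h < 0 or h > n for h in heads.values()):
        return False
    # Walk from every token up to the root; revisiting a node means a cycle.
    for start in heads:
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


# Hypothetical three-token sentence whose second token is the root.
good = [
    Token(1, "w1", "M1", 2, "dep"),
    Token(2, "w2", "M2", 0, "root"),
    Token(3, "w3", "M3", 2, "dep"),
]
# Tokens 1 and 2 point at each other, so the annotation is not a tree.
bad = [
    Token(1, "w1", "M1", 2, "dep"),
    Token(2, "w2", "M2", 1, "dep"),
    Token(3, "w3", "M3", 0, "root"),
]
print(is_well_formed(good), is_well_formed(bad))  # True False
```

A check of this kind is one way to catch slips in manual dependency annotation before the data is used as a benchmark.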

This first manually annotated Hungarian web corpus will be used as a test database in developing a morphological and syntactic parser optimised for the analysis of web texts. The corpus is currently too small to train statistical parsers; our goal, however, was to create a benchmark database. Because web texts vary so widely in both topic and genre, we believe that applying supervised machine learning techniques would not be a suitable solution; instead, we plan to use domain adaptation methods.
