
First International Workshop on Computational Linguistics for Uralic Languages Proceedings of the Workshop January 16

IWCLUL

First International Workshop on Computational Linguistics for Uralic Languages

Proceedings of the Workshop

January 16th, 2015

Tromsø, Norway


This work is licensed under a Creative Commons Attribution–NoDerivatives 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by-nd/4.0/. Page numbering and footers have been added by the editors.

WWW address: http://dx.doi.org/10.7557/scs.2015.2

eISSN: 2387-3086

DOI: (whole proceedings) 10.7557/scs.2015.2; for specific articles, see footers

Editors contact: iwclul-2015@googlegroups.com


Preface

The Uralic languages are an interesting group of languages from a computational-linguistic perspective. They share large parts of morphological and morphophonological complexity that is not present in the Indo-European group, which has traditionally dominated computational-linguistic research. This can be seen, for example, in the number of word forms per word, which in Indo-European languages is in the range of ones or tens, whereas for Uralic languages it is in the range of hundreds and thousands. Furthermore, Uralic languages share many geopolitical aspects: the national languages of the group (Finnish, Estonian and Hungarian) are small languages and only moderately resourced in terms of computational-linguistic resources, while being stable and not under threat of extinction; the recognised minority languages of western European states, such as North Sámi and Võro, are clearly in the category of lesser resourced and more threatened; and the majority of Uralic languages in the east of Europe and Russia are close to extinction. Common to all is that rapid development of more advanced computational-linguistic methods is required for the continued vitality of the languages in everyday life, to enable archiving and use of the languages with computers and other devices such as mobile applications.

Research in computational linguistics and Uralistics is being carried out in a handful of universities, research institutes and other sites by relatively few researchers.

Our intention in organising this conference is to gather these researchers together in order to share ideas and resources, to avoid duplicating efforts in gathering and enriching these scarce resources, and hopefully to found an ongoing tradition of concentrated effort in collecting and improving language resources and technologies for the survival of the Uralic languages.

For the conference we received 14 high-quality submissions on topics including computational lexicography, language documentation, optical character recognition, web-as-corpus, and automatic and rule-based morphological analysis methods. These are all very important for the preservation and development of Uralic languages. The submissions also covered a broad range of languages: North Sámi, Khanty, Mansi, Udmurt, Erzya, Moksha, Finnish and Estonian.

The conference was held at UiT Norgga árktalaš universitehta, Norway, on January 16th, 2015, and consisted of poster sessions, three talks, two tutorials, and an invited speech. The articles related to the poster sessions and the talks are included in these proceedings.

Tommi A. Pirinen, Francis M. Tyers, Trond Trosterud, conference organisers

Tromsø, 2015


Organisers

• Tommi A. Pirinen, Ollscoil Chathair Bhaile Átha Cliath

• Francis M. Tyers, UiT Norgga árktalaš universitehta

• Trond Trosterud, UiT Norgga árktalaš universitehta


Proceedings of 1st International Workshop in Computational Linguistics for Uralic Languages (IWCLUL 2015); ‹http://dx.doi.org/10.7557/scs.2015.2›


Programme committee

• Тимофей Архангельский, Национальный исследовательский университет «Высшая школа экономики»

• Lars Borin, Göteborgs universitet

• Марина Серафимовна Федина, Финн-йӧгра кывъяслы информатика отсӧг кузя регионкостса лаборатория

• Mark Fishel, Tartu ülikool

• Mikel L. Forcada, Universitat d’Alacant

• Mans Hulden, University of Colorado at Boulder

• Heiki-Jaan Kaalep, Tartu ülikool

• András Kornai, Budapesti Műszaki és Gazdaságtudományi Egyetem

• Krister Lindén, Helsingin yliopisto

• Tommi A. Pirinen, Ollscoil Chathair Bhaile Átha Cliath

• Gábor Prószéky, Pázmány Péter Katolikus Egyetem

• Aarne Ranta, Chalmers tekniska högskola

• Jack Rueter, Helsingin yliopisto

• Trond Trosterud, UiT Norgga árktalaš universitehta

• Francis M. Tyers, UiT Norgga árktalaš universitehta

• Sami Virpioja, Aalto-yliopisto

• Anssi Yli-Jyrä, Helsingin yliopisto


Contents

1 Invited spee 1

1.1 Direct comparison of language forms in two-level framework . . . 1

2 Tutorials 3 2.1 Grammatical Framework Tutorial with a Focus on Fenno-Ugric Languages 4 2.2 Language Documentation meets Language Technology . . . 8

3 Accepted Papers 19 3.1 Low-Resource Active Learning of North Sámi Morphological Segmen- tation . . . 20

3.2 Compiling the Uralic Dataset for NorthEuraLex a Lexicostatistical Database of Northern Eurasia . . . 34

3.3 Can Morphological Analyzers Improve the ality of Optical Character Recognition? . . . 45

3.4 Corpus.mari-language.com: A Rudimentary Corpus Searchable by Syn- tactic and Morphological Paerns . . . 57

3.5 Infinite Monkeys of Babel — Crowdsourcing for the beerment of OCR language material . . . 69

3.6 Multilingual Semantic MediaWiki for Finno-Ugric dictionaries . . . 75

3.7 e Finno-Ugric Languages and e Internet project . . . 87

3.8 On the Road to a Dialect Dictionary of Khanty Postpositions . . . 99

3.9 FinUgRevita: Developing Language Technology Tools for Udmurt and Mansi . . . 108 3.10 Automatic creation of bilingual dictionaries for Finno-Ugric languages 119
