
Proceedings of the Conference

September 18-20, 2017

Università di Pisa

Istituto di Linguistica Computazionale “A. Zampolli”, CNR Pisa

Edited by Simonetta Montemagni and Joakim Nivre

Cover design Chiara Mannari


Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

Simonetta Montemagni, Joakim Nivre (Eds.)

Linköping Electronic Conference Proceedings No. 139 ISSN: 1650-3686, eISSN: 1650-3740

ISBN: 978-91-7685-467-9

ACL Anthology W17-65

© 2017 The Authors (individual papers)

© 2017 The Editors (collection)

Inclusion of papers in this collection, electronic publication in the Linköping Electronic Conference Proceedings series, and inclusion in the ACL Anthology with permission of the copyright holders


Preface

The Depling 2017 conference in Pisa is the fourth meeting in the recently established series of international conferences on Dependency Linguistics which started in Barcelona in 2011 and continued in Prague and Uppsala in 2013 and 2015, respectively. The initiative to organize special meetings devoted to Dependency Linguistics, which is currently at the forefront of both theoretical and computational linguistics, has received great support from the community. We do hope that the present conference will manage to keep up the high standards set by the previous meetings.

This year we received 41 submissions by 93 authors from 27 countries; one submission was withdrawn before reviewing. Of the remaining 40 submissions (each reviewed by 3 members of the Program Committee), 30 were accepted, resulting in an acceptance rate of 75%. All in all, the proceedings contain a wide range of contributions to Dependency Linguistics, ranging from papers advancing new theoretical models, through empirical studies of one or more languages and experimental investigations of computational systems for dependency parsing and linguistic knowledge extraction, to the design and construction of dependency-based linguistic resources (both treebanks and lexicons) for a wide range of languages.

New to the 2017 edition is that the conference is held in conjunction with the biennial meeting of SIGPARSE, namely the International Conference on Parsing Technologies (IWPT 2017), organized by the Special Interest Group on “Natural Language Parsing” of the Association for Computational Linguistics (ACL). IWPT 2017 will take place immediately after Depling 2017, from the 20th to the 22nd of September 2017. The two conferences share an overlapping event, held on September 20th and focusing on different aspects of dependency parsing, in which the results of a shared task jointly organized by Depling and IWPT are presented and discussed from different and complementary perspectives.

The shared task, named “Extrinsic Parser Evaluation” (EPE) and playing the role of “bridge event” between the two conferences, aims to shed light on the downstream utility of various dependency representations (at the levels of accuracy available for different parsers), that is, to contrastively isolate the relative contributions of each type of representation (and the corresponding parsing systems) to a selection of state-of-the-art systems which use different types of text and exhibit broad domain and genre variation.

In addition to the accepted papers, the core conference program also includes the contributions of two distinguished keynote speakers, Yoav Goldberg (Bar Ilan University) and Eva Hajičová (Charles University in Prague). We are honoured that they accepted our invitation to contribute to Depling 2017 and thank them for agreeing to share their knowledge and expertise on key Dependency Linguistics topics with the conference participants.

Our sincere thanks go to the members of the Program Committee, who thoroughly reviewed all the submissions to the conference and provided detailed comments and suggestions, thus ensuring the quality of the published papers. Many thanks to the members of the Local Organizing Committee, who took care of all matters related to the local organization of the conference. Thanks are also due to Michela Carlino, who did a great job in putting the proceedings together, and to Chiara Mannari, for designing and constructing the Depling and IWPT+Depling conference websites and continuously updating them. Last but not least, we would like to acknowledge the support from endorsing organizations and institutions and from our sponsors, who generously provided funds and services that are crucial for the organization of this event. At the time of writing, Depling was sponsored by the newly founded “Italian Association of Computational Linguistics” (AILC) and by the University of Pisa. Special thanks are also due to the Institute for Computational Linguistics “Antonio Zampolli” of the Italian National Research Council (ILC-CNR) for its support in the organization of the event. Thanks, finally, to everyone who chose to submit their work to Depling 2017, without whom this volume literally would not exist.

We welcome you all to Depling 2017 in Pisa and wish you an enjoyable conference!

Simonetta Montemagni and Joakim Nivre

Program Co-Chairs, Depling 2017


Organizers

Program Co-Chairs

Simonetta Montemagni, Istituto di Linguistica Computazionale “A. Zampolli” - CNR

Joakim Nivre, Uppsala University

Local Organizing Committee

Giuseppe Attardi, Università di Pisa

Felice Dell’Orletta, Istituto di Linguistica Computazionale “A. Zampolli” - CNR

Alessandro Lenci, Università di Pisa

Simonetta Montemagni, Istituto di Linguistica Computazionale “A. Zampolli” - CNR

Maria Simi, Università di Pisa

Program Committee

Giuseppe Attardi, Università di Pisa

Miguel Ballesteros, IBM Research Watson

Xavier Blanco, Universitat Autònoma de Barcelona

Igor Boguslavsky, Universidad Politecnica de Madrid and Russian Academy of Sciences

Bernd Bohnet, Google

Cristina Bosco, Università di Torino

Marie Candito, Université Paris Diderot

Jinho Choi, University of Colorado at Boulder

Benoit Crabbé, Université Paris Diderot

Eric De La Clergerie, INRIA

Felice Dell’Orletta, Istituto di Linguistica Computazionale “A. Zampolli” - CNR

Marie-Catherine de Marneffe, The Ohio State University

Kim Gerdes, Sorbonne Nouvelle

Filip Ginter, University of Turku

Koldo Gojenola, University of the Basque Country UPV/EHU

Carlos Gómez-Rodríguez, Universidade da Coruña

Eva Hajičová, Charles University in Prague

Richard Hudson, University College London

Leonid Iomdin, Russian Academy of Sciences

Sylvain Kahane, Université Paris Ouest Nanterre

Marco Kuhlmann, Linköping University

François Lareau, Université de Montréal

Alessandro Lenci, Università di Pisa

Beth Levin, Stanford University

Haitao Liu, Zhejiang University

Marketa Lopatkova, Charles University in Prague

Ryan McDonald, Google

Igor Mel'čuk, University of Montreal

Wolfgang Menzel, Hamburg University

Paola Merlo, Université de Genève

Jasmina Milicevic, Dalhousie University

Henrik Høeg Müller, Copenhagen Business School

Alexis Nasr, Université de la Méditerranée

Pierre Nugues, Lund University

Kemal Oflazer, Carnegie Mellon University Qatar

Timothy Osborne, Zhejiang University

Jarmila Panevova, Charles University in Prague

Alain Polguère, Université de Lorraine ATILF CNRS

Prokopis Prokopidis, Institute for Language and Speech Processing/Athena RC, Greece

Owen Rambow, Columbia University

Ines Rehbein, Potsdam University

Dipti Sharma, IIIT, Hyderabad

Maria Simi, Università di Pisa

Reut Tsarfaty, Open University of Israel

Giulia Venturi, Istituto di Linguistica Computazionale “A. Zampolli” - CNR

Leo Wanner, Pompeu Fabra University

Daniel Zeman, Charles University in Prague

Yue Zhang, Singapore University of Technology and Design

Supporting Institutions

Università degli Studi di Pisa

o Dipartimento di Filologia, Letteratura e Linguistica

o Dipartimento di Informatica

Istituto di Linguistica Computazionale “A. Zampolli”, Consiglio Nazionale delle Ricerche

Sponsor

Associazione Italiana di Linguistica Computazionale (AILC)


Table of Contents

Invited Talk: Capturing Dependency Syntax with “Deep” Sequential Models

Yoav Goldberg ... 1

Invited Talk: Syntax-Semantics Interface: A Plea for a Deep Dependency Sentence Structure

Eva Hajičová ... 2

The Benefit of Syntactic vs. Linear N-Grams for Linguistic Description

Melanie Andresen and Heike Zinsmeister ... 4

On the Predicate-Argument Structure: Internal and Absorbing Scope

Igor Boguslavsky ... 15

On the Order of Words in Italian: A Study on Genre vs Complexity

Dominique Brunato and Felice Dell’Orletta ... 25

Revising the METU-Sabancı Turkish Treebank: An Exercise in Surface-Syntactic Annotation of Agglutinative Languages

Alicia Burga, Alp Öktem and Leo Wanner ... 32

Enhanced UD Dependencies with Neutralized Diathesis Alternation

Marie Candito, Bruno Guillaume, Guy Perrier and Djamé Seddah ... 42

Classifying Languages by Dependency Structure. Typologies of Delexicalized Universal Dependency Treebanks

Xinying Chen and Kim Gerdes ... 54

A Dependency Treebank for Kurmanji Kurdish

Memduh Gökırmak and Francis M. Tyers ... 64


What are the Limitations on the Flux of Syntactic Dependencies? Evidence from UD Treebanks

Sylvain Kahane, Chunxiao Yan and Marie-Amélie Botalla ... 73

Fully Delexicalized Contexts for Syntax-Based Word Embeddings

Jenna Kanerva, Sampo Pyysalo and Filip Ginter ... 83

Universal Dependencies for Dargwa Mehweb

Alexandra Kozhukhar ... 92

Menzerath-Altmann Law in Syntactic Dependency Structure

Ján Mačutek, Radek Čech and Jiří Milička ... 100

Assessing the Annotation Consistency of the Universal Dependencies Corpora

Marie-Catherine de Marneffe, Matias Grioni, Jenna Kanerva and Filip Ginter ... 108

To What Extent is Immediate Constituency Analysis Dependency-Based? A Survey of Foundational Texts

Nicolas Mazziotta and Sylvain Kahane ... 116

Dependency Structure of Binary Conjunctions (of the IF…, THEN… Type)

Igor Mel’čuk ... 127

Non-Projectivity in Serbian: Analysis of Formal and Linguistic Properties

Aleksandra Miletic and Assaf Urieli ... 135

Prices Go Up, Surge, Jump, Spike, Skyrocket, Go through the Roof… Intensifier Collocations with Parametric Nouns of Type PRICE

Jasmina Milićević ... 145

Chinese Descriptive and Resultative V-de Constructions. A Dependency-based Analysis

Ruochen Niu ... 154


The Component Unit. Introducing a Novel Unit of Syntactic Analysis

Timothy Osborne and Ruochen Niu ... 165

Control vs. Raising in English. A Dependency Grammar Account

Timothy Osborne and Matthew Reeve ... 176

Segmentation Granularity in Dependency Representations for Korean

Jungyeul Park ... 187

Universal Dependencies for Portuguese

Alexandre Rademaker, Fabricio Chalub, Livy Real, Cláudia Freitas, Eckhard Bick and Valeria de Paiva ... 197

UDLex: Towards Cross-language Subcategorization Lexicons

Giulia Rambelli, Alessandro Lenci and Thierry Poibeau ... 207

Universal Dependencies are Hard to Parse – or are They?

Ines Rehbein, Julius Steen, Bich-Ngoc Do and Anette Frank ... 218

Annotating Italian Social Media Texts in Universal Dependencies

Manuela Sanguinetti, Cristina Bosco, Alessandro Mazzei, Alberto Lavelli and Fabio Tamburini ... 229

Hungarian Copula Constructions in Dependency Syntax and Parsing

Katalin Ilona Simkó and Veronika Vincze ... 240

Semgrex-Plus: a Tool for Automatic Dependency-Graph Rewriting

Fabio Tamburini ... 248

Unity in Diversity: a Unified Parsing Strategy for Major Indian Languages

Juhi Tandon and Dipti Misra Sharma ... 255


Quantitative Comparative Syntax on the Cantonese-Mandarin Parallel Dependency Treebank

Tak-sum Wong, Kim Gerdes, Herman Leung and John Lee ... 266

Understanding Constraints on Non-Projectivity Using Novel Measures

Himanshu Yadav, Ashwini Vaidya and Samar Husain ... 276

Core Arguments in Universal Dependencies

Daniel Zeman ... 287


Capturing Dependency Syntax with “Deep” Sequential Models

Yoav Goldberg

Bar Ilan University

Department of Computer Science

Ramat-Gan, Israel

yoav.goldberg@gmail.com

Neural network (“deep learning”) models are taking machine learning approaches for language by storm. In particular, recurrent neural networks (RNNs), which are flexible non-Markovian models of sequential data, were shown to be effective for a variety of language processing tasks. Somewhat surprisingly, these seemingly purely sequential models are very capable at modeling syntactic phenomena, and using them results in very strong dependency parsers for a variety of languages.

In this talk, I will briefly describe recurrent networks and present empirical evidence for their capability of learning the subject-verb agreement relation in naturally occurring text, from relatively indirect supervision. This part is based on my joint work with Tal Linzen and Emmanuel Dupoux. I will then describe bi-directional recurrent networks - a simple extension of recurrent networks - and show how they can be used as the basis of state-of-the-art dependency parsers. This is based on my work with Eliyahu Kipperwasser, but will also touch on work by other researchers in that space.


Syntax-Semantics Interface: A Plea for a Deep Dependency Sentence Structure

Eva Hajičová

Charles University

Faculty of Mathematics and Physics

Institute of Formal and Applied Linguistics

Prague, Czech Republic

hajicova@ufal.mff.cuni.cz

In collaboration with Václava Kettnerová, Veronika Kolářová, Markéta Lopatková, Jarmila Panevová, and Dan Zeman (and with technical support of Jiří Mírovský)

The aim of the contribution is to bring arguments for a description of natural language that (i) includes a representation of a deep (underlying) sentence structure and (ii) is based on the relation of dependency. Our argumentation rests on linguistic considerations and stems from the Praguian linguistic background, with respect both to the Praguian structuralist tradition and to the formal framework of Functional Generative Description and the experience with building the Prague Dependency Treebank. The arguments, of course, are not novel, but we will try to gather and report on our experience when working with deep syntactic dependency relations in the description of language; the basic material will be Czech, but multilingual comparative aspects will be taken into account as well.

Speaking about a “deep” sentence structure, a natural question to ask is how “deep” this linguistic structure is to be. Relevant in this respect is the differentiation between ontological content and linguistic meaning. Two relations will be discussed in some detail and illustrated on examples from Czech and English, namely the relation of synonymy and that of ambiguity (homonymy). The relation of synonymy will be specified as an identity of meaning with respect to truth conditions, and it will be demonstrated how this criterion may help to test sentences and constructions for synonymy. The relation of ambiguity will be exemplified by two specific groups of examples, one concerning surface deletions and the necessity to reconstruct them in the deep structure, and the other group involving the notion of deep order of sentence elements with examples related to the phenomenon of information structure.

The necessity to distinguish surface and deep structure has led to several proposals of a multi-level description of language, both in the domain of theoretical linguistics and in the domain of annotation schemes of language corpora, such as LFG or CCG. We will describe in a nutshell the Prague Dependency Treebank, focusing on the deep (so-called tectogrammatical) level of annotation.

After some observations on the history of the dependency-based syntactic relations, attention will be focused on two basic topics, namely the issue of headedness and the notion of valency.

We will outline an approach to the distinction between arguments and adjuncts and their semantic optionality/obligatoriness based on two operational criteria, and we will demonstrate on the example of several Czech valency dictionaries how a dependency-based description brings together grammar and lexicon.

Among the many challenges that still await a deeper analysis, two will be briefly characterized, namely the phenomenon of projectivity and the representation of coordination.



To summarize, we argue that both attributes of our approach, namely “deep” and “dependency-based”, are important for a theoretical description of language if this description is supposed to help to reflect the relation between form and meaning, that is, when it is supposed to serve as a basis for language understanding. Despite undisputable recent progress in NLP, which relies more on computational methods than linguistic representations or features, we believe that for true understanding, having an adequate theory is worth the effort.

This work has been supported by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071) and by the project GA17-07313S of the Grant Agency of the Czech Republic.


The benefit of syntactic vs. linear n-grams for linguistic description

Melanie Andresen and Heike Zinsmeister

Universität Hamburg

Institute for German Language and Literature

Germany

{melanie.andresen, heike.zinsmeister}@uni-hamburg.de

Abstract

Automatic dependency annotations have been used in all kinds of language applications. However, there has been much less exploitation of dependency annotations for the linguistic description of language varieties. This paper presents an attempt to employ dependency annotations for describing style. We argue that for this purpose, linear n-grams (that follow the text’s surface) alone do not appropriately represent a language like German. For this claim, we present theoretically as well as empirically founded arguments. We suggest syntactic n-grams (that follow the dependency paths) as a possible solution. To demonstrate their potential, we compare the German academic languages of linguistics and literary studies using both linear and syntactic n-grams. The results show that the approach using syntactic n-grams allows for the detection of linguistically meaningful patterns that do not emerge in a linear n-gram analysis, e. g. complex verbs and light verb constructions.

1 Introduction

Linear n-grams in the sense of adjacent strings of tokens, parts of speech, etc. are a very common and successful way of modeling language in computational linguistics. However, linguistic structures do not always work in such linear ways. From a cross-linguistic perspective, some languages are less linearly organized than others. While many (though not all) syntactic structures in English can indeed be described by linear patterns, this is much less true for languages with a more flexible word order and other syntactic properties that induce long distance relations, e. g. German. Still, the linear n-gram approach is quite successful when used for applications in such languages.

In the present paper our aim is a slightly different one. We want to employ n-grams not as a means for an application but for linguistic description itself. This requires the language modeling to be more linguistically adequate and interpretable and not just to be a means to an end. We consider the use of syntactic n-grams in addition to linear ones to be a possibility to achieve this aim.

In order to motivate our approach, we will first introduce the concept of syntactic n-grams (section 2) and present related work (section 3). Then we will investigate the descriptive benefit of syntactic n-grams by, firstly, looking at theoretical descriptions of non-linear German syntax (section 4.1), and secondly, by investigating empirical consequences of such structures by describing cross-linguistic differences in Universal Dependencies (UD) treebanks, with a special focus on the comparison of English and German (section 4.2).

In the main part of this paper we will present a study of stylistic comparison between different academic disciplines, namely between linguistics and literary studies in German (section 5). To capture these differences, we will compare the frequencies of n-grams between the two disciplines and contrast the results yielded by linear and syntactic n-grams in section 6.

Finally, we will summarize our results in section 7. The analyses show that syntactic n-grams capture relevant structures that would be missed in a purely linear approach, e. g. complex verbs and light verb constructions.

2 Syntactic n-grams

Linear n-gram analysis is an omnipresent method in computational linguistics and has proven to be an easy to implement and highly appropriate approximation of how language works in many applications (see Jurafsky and Martin (2014, chap. 4) for an overview).

However, for the linguistic description of language this is often not satisfactory, as the underlying linguistic patterns are not always linear. One possible remedy for this issue is the approach of skip-grams (see e. g. Guthrie et al. (2006)), but they disregard linguistic structures and thus generate a lot of noise. Another approach for overcoming this problem is the use of syntactic n-grams. Instead of following the word order as it appears on the surface, they are based on dependency paths in the sentence.

A simple type of syntactic n-grams relying on unary-branching dependency structures is described by Sidorov et al. (2012):

[...] we consider as neighbors the words (or other elements like part-of-speech tags, etc.) that follow one another in the path of the syntactic tree, and not in the text. We call such n-Grams syntactic n-Grams (sn-Grams). (Sidorov et al., 2012, 1)

A more sophisticated approach is suggested by Goldberg and Orwant (2013). Their definition augments the one by Sidorov et al. (2012) by including all kinds of n-ary branching subtrees:

We define a syntactic-ngram to be a rooted connected dependency tree over k words, which is a subtree of a dependency tree over an entire sentence. (Goldberg and Orwant, 2013, 3)

This results in the additional inclusion of n-grams with more than one dependent per head, which is also advocated by Sidorov (2013).1

As a base for the more widespread use of syntactic n-grams, Goldberg and Orwant (2013) create a comprehensive database on the basis of the Google Books corpus for general use. In their representation of n-grams, they exclude functional words and include multiple layers of annotation (part of speech, dependency relation, head). In addition, they preserve the information about the word order in the text. Our analysis will be based on the simpler type of syntactic n-grams by Sidorov et al. (2012) (see section 5.2).

1 Compare also to the concept of catenae presented in Osborne et al. (2012).
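To make the contrast concrete, the following minimal sketch shows the difference between linear and syntactic bigrams on the subordinate-clause example from Figure 1b below; the CoNLL-style sentence encoding (1-based head indices, 0 = root) and the dependency analysis itself are illustrative, not the paper's actual pipeline.

```python
# A toy sentence as (form, head) pairs with 1-based head indices,
# 0 marking the root; the analysis is invented for illustration.
sentence = [  # "weil die erste Silbe immer unbetont ist"
    ("weil", 7), ("die", 4), ("erste", 4), ("Silbe", 7),
    ("immer", 7), ("unbetont", 7), ("ist", 0),
]

# Linear bigrams follow the surface order ...
linear = [(sentence[i][0], sentence[i + 1][0])
          for i in range(len(sentence) - 1)]

# ... while syntactic bigrams pair each word with its head in the tree.
syntactic = [(form, sentence[head - 1][0])
             for form, head in sentence if head != 0]

print(linear)     # ('weil', 'die'), ('die', 'erste'), ...
print(syntactic)  # ('weil', 'ist'), ('die', 'Silbe'), ... - 'weil' and
                  # 'ist' span the whole clause but are tree-adjacent
```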

3 Related Work

In this section, we will briefly refer to other types of syntactically motivated features and applications they were used in. Then we will look at the use of n-grams and syntactic features in authorship attribution and stylistic analysis.

Dependency-based features have been used for various applications. For example, Snow et al. (2004) use dependency paths between nouns as one feature to extract lexical hypernymy relations. Padó and Lapata (2007) use similar dependency subtrees as a feature to create general semantic space models. Versley (2013) uses subgraphs to describe larger structures, in particular implicit discourse relations in texts.

Syntactic features have also been systematically compared to linguistically less informed features like linear n-grams or bag-of-words approaches. Lapesa and Evert (2017) evaluate the performance of dependency-based and simpler window-based models for computing semantic similarity and find the simpler model to be superior in most cases. Bott and Schulte im Walde (2015) present similar findings when employing syntactically informed features in the task of predicting compositionality of German particle verbs.

Sidorov et al. (2012) use syntactic n-grams in an authorship attribution task. Their syntactic n-grams include the syntactic relation labels only and achieve good results compared to linear n-grams. Stamatatos (2009) gives an overview of the use of other types of syntactic features in authorship attribution. These include for instance syntactic rewrite rules based on phrase structures and syntactic errors. In a more recent study, van Cranenburgh and Bod (2017) successfully quantify the literariness of novels by using, among others, fragments of syntactic constituency trees as features. They stress the fact that these features have the advantage of being more interpretable than others that are not syntactically motivated.

N-gram approaches have also been used for more interpretative analyses in the humanities. Biber et al. (1999) and others investigate academic language with the help of so-called ‘lexical bundles’. In literary studies, Mahlberg (2013), among others, uses data-driven ‘clusters’ for describing the style of Charles Dickens’ prose. Both approaches rely on token-based n-grams only and do not make use of syntactic annotation.

Most of the computational linguistics approaches have in common that they use syntactic n-grams or syntactic subtrees for some practical application. Even the stylistic approaches aim at classifying documents rather than describing them. On the other hand, studies in the humanities that aim at describing and interpreting language tend to use rather simple features that do not include syntactic information. By merging the means of the first with the aims of the second group, we will explore the potential syntactic n-grams hold for the linguistic description of languages.

4 Non-linear structures

We will at first motivate the need for syntactic n-grams by considering non-linear structures in the sense of structures that are expressed in a discontinuous token string. This means that they cannot be captured by regular linear n-grams. In particular, we are interested in structures which occur frequently enough for us to expect them to have an impact on n-gram creation. Section 4.1 gives a theoretical foundation by introducing non-linear syntactic structures from German. Section 4.2 discusses empirical consequences of these properties with a special focus on the comparison of English and German.

4.1 Theoretical foundation

To what extent the syntactic structure of a language is linear is a question of typology and differs widely between languages. The use of n-grams for linguistic applications and analyses is a method that favors languages with dominantly linear structures, i. e. structures that are expressed by continuous token strings. German is one example of a language that is rich in non-linear structures.2 We will first focus on non-linear structures that are projective, i. e. structures that do not cause dependency paths to overlap. These are commonly discussed under the model of Topological Fields that describes German as using so-called bracketing structures: once the first part of the bracket is realized, the reader/hearer expects the second part to occur as well (see Kübler and Zinsmeister (2015, 73) or Becker and Frank (2002) for an English description). Three types of these structures can be distinguished:

2 The non-linear characteristics of German are most prominently described and parodied by Mark Twain (1880).

M. dehnt den Begriff auf neue Medien aus
M. extends the term to new media (particle)
(a) Example of a finite particle verb

weil die erste Silbe immer unbetont ist
because the first syllable always unstressed is
(b) Example of a subordinate clause with the verb in final position

mit dem von M. sehr genau beschriebenen Fall
with the by M. very exactly described case
(c) Example of a noun phrase

Figure 1: Examples of non-linear structures in German

Main clauses. In main clauses, several types of complex verbal structures lead to non-linearity:

• full verbs complemented by auxiliary and/or modal verbs,

• copula verbs complemented by predicatives,

• light verb constructions,

• finite particle verbs.

In all of these verb constructions, the finite part of the verb will be in second position while the other verbal elements are in final position. The number of phrases in between, in the so-called middle field, is theoretically unlimited. Figure 1a shows an example of the particle verb ausdehnen (‘to extend’) with the finite verbal part dehnt in second position and the separated particle aus in sentence-final position.

Subordinate clauses. This bracketing structure is opened by the phrase-initial subjunction and closed by the finite and non-finite verb forms that are in sentence-final position (see example in Figure 1b).

Noun phrases. Finally, German also has non-linear structures similar to English: the noun phrase is opened by a determiner (or indirectly by a preposition) and closed by the noun itself. In between, the phrase can be extended by mainly adjective phrases. Additionally, the German noun phrase can comprise structures in pre-nominal position that would be placed post-nominally in English, as shown in the example in Figure 1c.

Maier et al. (2014) present additional discontinuous structures that are characterized not only by the distance between their elements, but also by non-projective dependencies, i. e. by crossing dependencies: “extraposition, a placeholder/repeated element construction, topicalization, scrambling, local movement, parentheticals, and fronting of pronouns” (Maier et al., 2014, 1). However, these structures are much rarer than the projective non-linear ones described above and are not expected to be reflected in the frequency data of the n-gram analysis.

In the light of the example of German we have seen that there are languages with many non-linear structures that do not have an equivalent in English.

4.2 Empirical consequences

In order to empirically demonstrate and quantify the degree to which languages make use of non-linear structures and describe their nature, we focus on the distance between head and dependent in dependency annotated data in terms of surface tokens. For a cross-linguistic comparison we use the training data of Universal Dependencies 2.0 (Nivre et al., 2017). Table 1 shows the median and mean distance and standard deviation between head and dependent in several languages.3 Punctuation and the root were excluded from the calculation. A distance of 0 means that head and dependent are directly adjacent.
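A minimal sketch of this distance computation, assuming CoNLL-U formatted treebank files and reading "distance" as the number of tokens intervening between head and dependent (so 0 = directly adjacent); the file name is hypothetical.

```python
import statistics

def head_dep_distances(conllu_path):
    """Collect head-dependent distances from a CoNLL-U file,
    excluding punctuation and attachments to the root."""
    distances = []
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip sentence boundaries and comments
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue  # skip multi-word tokens and empty nodes
            idx, upos, head = int(cols[0]), cols[3], int(cols[6])
            if head == 0 or upos == "PUNCT":
                continue  # exclude the root relation and punctuation
            distances.append(abs(idx - head) - 1)
    return distances

d = head_dep_distances("de-ud-train.conllu")  # hypothetical file name
print(statistics.median(d), statistics.mean(d), statistics.stdev(d))
```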

3 The sample of languages is only a subset of more than 50 languages available in UD.

         median  mean    sd
Persian       0  2.62  5.09
German        1  2.28  4.02
Arabic        0  2.14  6.78
Dutch         1  2.06  3.54
English       1  1.77  3.32
French        1  1.71  3.92
Russian       1  1.70  3.51
Swedish       1  1.70  4.79
Czech         1  1.70  3.24
Turkish       0  1.69  3.46
Italian       1  1.68  4.12

Table 1: Distance between head and dependent in UD treebanks (without punctuation and root)

First, we can see that even in English – the language most applications were primarily developed for – head and dependent are often non-adjacent. On average, 1.77 words are in between head and dependent. Second, it becomes clear that the distances vary greatly also within languages, with Arabic and Persian having a very high standard deviation of 6.78 and 5.09, respectively. Even though one should bear in mind that some differences might be due to the language-specific implementations of the Universal Dependencies, we can assume that there are in fact differences between the languages.

Figure 2 shows, as an example, the distribution of the distances of the part of speech sconj (= subordinating conjunctions) to its head in more detail. Here, the differences between the languages are more pronounced than with other parts of speech. Turkish and Arabic do not have this part of speech. With a median of six (marked by the black line inside the box), German features the highest distance, followed by Persian, another verb-final language, and Dutch, which is similar to German in this respect.

In the remainder of this paper we will focus on German as an example of a language in which the average distance is significantly higher than in English4 and more variable.

4 t = 42.998, df = 386460, p-value < 2.2e-16

Figure 2: Distance to head of words with the part of speech sconj in all languages


Which syntactic structures are related to these differences? Figure 3a shows boxplots of the distance distributions between heads and dependents in English and German grouped by the part of speech of the dependent. The most obvious differences relate to the theoretical findings in section 4.1. German verbs and auxiliary verbs show much larger distances from their heads than their English counterparts, as can be expected because of the German bracketing structure. Subordinating conjunctions (SCONJ) show the largest difference in the two languages, with the interquartile ranges of their distributions not even overlapping. This reflects the German brackets in subordinate clauses, which result in a large distance between the subjunction and the finite verb of the subordinate clause.

Another clear difference is in pronouns, which are positioned early in the sentence in German (before or immediately after the finite verb, the so-called ‘Wackernagel position’, Cardinaletti and Roberts (2002, 133)), while their head (usually the main verb) can be sentence-final. Also nouns and adverbs tend to be slightly further away from their head in German than in English. This can probably be attributed to the generally freer word order in German (empirically shown in Futrell et al. (2015)).

Figure 3b shows the same relation from the other direction: the same distances grouped by the part of speech of the head. Again, German verbs and auxiliary verbs prove to be further away from their dependent than the English ones. Adjectives are another notable case. According to the Universal Dependencies’ guidelines, adjectives are considered the root of the sentence when they occur in predicative structures (e. g. This is very easy.). The copula is one of its dependents, which can again be far away from the predicative adjective in German.

Finally, all of the phenomena described above are also reflected when looking at the distances grouped by syntactic relation: many of the high-distance relations in German refer to different types of clauses (acl, advcl, ccomp, csubj) and complex verbs (aux, compound:prt (particle verbs)), especially in combination with passives (csubj:pass, nsubj:pass, aux:pass). mark is the relation between subjunctions and finite verbs in subordinate clauses. It also features a clear difference in distance between the two languages.

Figure 3: Distance between head and dependent in UD treebanks (without outliers); (a) distance by part of speech of the dependent, (b) distance by part of speech of the head

This section has shown that the non-linear structures described in section 4.1 have an impact on the distance between head and dependent. It could be demonstrated that these distances are much larger in German dependency structures than in English ones. This means that the modeling of German using only linear n-grams is not fully adequate for its linguistic description. In the next section, we will compare the contribution of syntactic and linear n-grams to a stylistic analysis of German academic language.

5 Study: Disciplinary differences in academic writing style

The following study is part of a larger project on stylistic analysis of German academic texts written in the disciplines of linguistics and literary studies, respectively. This field of research is motivated by the fact that these two disciplines are often combined in one common study program such as German Studies or German Language and Literature. While this suggests that the disciplines are very closely related, writing styles differ widely (see e. g. Afros and Schryer (2009)). We present an attempt to capture these differences by an n-gram analysis based on linear and syntactic n-grams.

5.1 Data and preprocessing

The study is based on a corpus of 60 German PhD theses, 30 for each of the two disciplines linguistics and literary studies.

All texts were accessible as PDF files. In a first preprocessing step, we converted them to HTML to use the HTML markup for semi-automatically deleting irrelevant parts of the text. In particular, we deleted parts that do not belong to the targeted varieties and often interrupt the running text: tables and figures, footnotes, citations and examples. We also removed all text sequences in parentheses, as most of them comprise references, especially in linguistics. Additionally, we excluded sentences with more than 40% of the words in quotes, assuming that they do not represent the target variety either. Other elements we had to exclude manually, e. g. title page, table of contents, and list of references. The resulting plain text version has a total count of 3,579,437 tokens.

We tokenized the texts using the system Punkt (Kiss and Strunk, 2006)5 and annotated the sentences with an off-the-shelf version of the MATE dependency parser (Bohnet, 2010) trained on the TIGER Corpus (Seeker and Kuhn, 2012). Note that, in contrast to the previous chapter, we decided against using Universal Dependencies. As this part of the study deals with German only, we consider the tag set developed specifically for German more appropriate. For the purpose of evaluation, two annotators consensually created a gold standard for a random sample of 22 sentences (600 tokens) against which we compared the parser’s output. The parser performance is good (UAS: 0.95, LAS: 0.93), especially given that it is applied to out-of-domain data.

5 http://www.nltk.org/api/nltk.tokenize.html, 23.07.2017

5.2 N-gram generation

We extracted several data sets from the preprocessed corpus:

• linear n-grams of sizes 2-5 using tokens, lemmas, pos-tags and dependency relation labels,

• syntactic n-grams of sizes 2-5 using tokens, lemmas, pos-tags and dependency relation labels, generated by taking every word of the sentence as a starting point and following the dependency path backwards by n steps (see the sketch below).
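The following sketch illustrates this generation step under our reading of the procedure: one annotation level at a time, path-following towards the root, with n-grams written head-first using ">" as in Table 2b. The token representation (dicts with "form", "lemma", "pos", "deprel" and a 0-based "head" index, None for the root) is an assumption for illustration, not the format of the actual pipeline.

```python
def unary_sngrams(sentence, n, level):
    """Unary syntactic n-grams of size n over one annotation level."""
    grams = []
    for i in range(len(sentence)):
        path, cur = [i], i
        # n words require n-1 head-to-head transitions up the tree
        while len(path) < n and sentence[cur]["head"] is not None:
            cur = sentence[cur]["head"]
            path.append(cur)
        if len(path) == n:
            grams.append(">".join(sentence[j][level]
                                  for j in reversed(path)))
    return grams

def all_sngrams(corpus):
    """Yield syntactic n-grams of sizes 2-5 on all four levels."""
    for sent in corpus:
        for n in range(2, 6):
            for level in ("form", "lemma", "pos", "deprel"):
                yield from unary_sngrams(sent, n, level)
```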

The data set for the present analysis is not sufficiently large to allow for a representation of syntactic n-grams that includes as many annotations as Goldberg and Orwant (2013) used. To avoid issues of data sparsity, only one level of information at a time is included, e. g. token OR lemma OR part of speech OR the dependency relation label. In line with Sidorov et al. (2012), the analysis is restricted to unary syntactic n-grams following only one branch in the syntactic tree.

We exclude n-grams with a total frequency of less than 10 from further analysis. For all the resulting n-grams we calculate relative frequencies in all 60 texts. The difference in frequency between the two subcorpora is assessed based on the t-test, as suggested by Paquot and Bestgen (2009) and Lijffijt et al. (2014). Each data set is then ranked according to the t-test’s p-values.
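A sketch of how such a ranking could be computed, assuming that for every n-gram surviving the frequency threshold we have already collected one relative frequency per thesis, split by discipline (30 values each); scipy's two-sample t-test stands in for the exact test variant of Paquot and Bestgen (2009).

```python
from scipy.stats import ttest_ind

def rank_ngrams(freqs_ling, freqs_lit):
    """freqs_*: dict mapping an n-gram to its list of 30 per-text
    relative frequencies. Returns n-grams sorted by ascending p-value,
    i.e. most distinctive between the two disciplines first."""
    ranked = []
    for gram in freqs_ling.keys() & freqs_lit.keys():
        t_stat, p_val = ttest_ind(freqs_ling[gram], freqs_lit[gram])
        ranked.append((p_val, t_stat, gram))
    ranked.sort()
    return ranked

example = rank_ngrams(
    {", die bei der": [0.8, 0.5, 0.6] * 10},  # invented toy frequencies
    {", die bei der": [1.4, 1.6, 1.2] * 10},
)
print(example[0])
```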

6 Results and Discussion

In the analysis, we inspect the degree of overlap between linear and syntactic n-grams in order to assess whether the two types truly give us complementary information (section 6.1). However, our main question is whether both types contribute meaningfully to a linguistic description of the disciplinary differences between linguistics and literary studies. Section 6.2 therefore gives an exemplary interpretation of the most distinctive linear and syntactic 4-grams. On that basis, the final section 6.3 presents an attempt to quantify linguistic interpretability.

6.1 Overlap between linear and syntactic n-grams

In order to first get a general idea of the added value of syntactic n-grams independent of our research question about disciplinary differences, we quantify the overlap between linear and syntactic n-grams. To this end we investigated to what degree the syntactic n-grams correspond to linear n-grams.

We calculated for all four levels (token, lemma, part of speech and dependency relation) to what extent the 200 highest-scoring syntactic n-grams correspond to linear n-grams.6 For each of the 200 syntactic n-gram types, we checked all corresponding token instances for linearity (score 1) or non-linearity (score 0) and calculated the mean for each type. The resulting value gives us information about the overlap of linear and syntactic n-grams: a score of 1 means that all token instances of the syntactic n-gram are also linear n-grams; a score of 0 means that none of the token instances of the syntactic n-gram correspond to linear n-grams.

6 With increasing n, the number of n-grams passing the frequency threshold of 10 decreases quickly. Therefore, the number for syntactic token 5-grams is only based on 37 items that do not necessarily achieve low p-values in the t-test. Also, the syntactic token 4-grams and linear token 4-/5-grams are partially based on items that do not pass the level of significance (p = 0.001).

Figure 4: Proportion of syntactic n-grams that correspond to a linear n-gram (by n-gram size and level of annotation)
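The linearity check itself reduces to testing whether the positions an n-gram was extracted from form a contiguous span of the sentence; a minimal sketch, assuming each instance is recorded as a tuple of 0-based token positions.

```python
def linearity_score(instances):
    """instances: list of position tuples for one syntactic n-gram type.
    Returns 1.0 if every instance is also a linear n-gram, 0.0 if none is."""
    def is_linear(positions):
        s = sorted(positions)
        return all(b - a == 1 for a, b in zip(s, s[1:]))
    return sum(is_linear(p) for p in instances) / len(instances)

print(linearity_score([(3, 4, 5), (7, 9, 10)]))  # -> 0.5
```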

Figure 4 shows the resulting distribution of overlap by n-gram size and level of annotation.

The proportion of linear n-grams is low, with a mean between 0.36 and 0.57 already for bigrams, depending on the level of annotation. As expected, the proportion of linear n-grams decreases as n increases. With every additional transition from one word to the next, the probability of at least one deviation from the linear order rises.

Additionally, there is a tendency of decreasing linearity with increasing abstractness from token to lemma to part of speech and dependency relation. One particular combination of tokens can be exclusively realized linearly, but a lemma comprises several different token combinations, which will not all be realized linearly. With increasing abstractness, more heterogeneous cases are subsumed under one label, making purely linear instances less and less likely.

However, it has to be borne in mind that syntactic n-grams with more than one branch were not included. These might correspond to linear n-grams to a higher degree, resulting in a higher overlap between the two types of n-grams. In the present analysis, linear n-grams cover some structures that correspond to syntactic units, but are not captured by our narrow approach to syntactic n-grams. Consequently, the widening of our realization of syntactic n-grams is advisable in future work.

6.2 Interpretation of linear and syntactic 4-grams

We will now focus on the possibilities of interpreting linear and syntactic n-grams in order to draw conclusions about linguistic properties of the German academic languages of linguistics and literary studies. In this section, we discuss one example in detail while the next section will present possibilities of quantifying these interpretations on a larger scale. The focus will be on token n-grams as they can easily be read by humans. Especially longer part-of-speech sequences (like ART-NN-APPR-PPOSAT-NN7) are quite abstract and require a person with experience with the tag set and possibly a set of example instances (see Andresen and Zinsmeister (2017) for an attempt to include these).

Table 2a and Table 2b show the 15 highest-scoring 4-grams for the linear and the syntactic data set, respectively. These are the n-grams with the highest difference in frequency when comparing the disciplines. In addition to the n-gram, an approximate translation into English is provided. Given the fragmentary nature of n-grams, these translations are sometimes based on additional assumptions about the context and do therefore only represent one of several possible meanings. The row color indicates in which discipline the n-gram is more frequent: n-grams more frequent in literary studies are colored gray, those more frequent in linguistics white.

Among the linear n-grams in Table 2a, structures following a comma dominate the ranking. This can be explained by the fact that the beginning of subordinate clauses is grammatically restricted to some specific patterns. Because of the grammatical gender in German, some structures reoccur in several similar forms. Many patterns that are significantly more frequent in literary studies indicate relative clauses (ranks 3, 4, 5, 7, 8 and 12). For linguistics this is only true for ranks 1, 11 and 15.

7 The tag set used here is the STTS (Schiller et al., 1999). This sequence corresponds to article – noun – preposition – possessive pronoun in attributive position – noun, e. g. the name of his mother.

rank | linear n-gram | literal translation | comment
1 | , die bei der | , that.3SG.F/3PL at the |
2 | davon aus , dass | expect that | fragment of: expect that the
3 | , das in der | , that.3SG.N in the |
4 | , in der er | , in which he |
5 | , der sich von | , that.3SG.M it.REFL of |
6 | aus , dass die | out, that the.3SG.F/3PL | fragment of: expect that the
7 | , in dem sie | , in that.3SG.M/N she/they |
8 | , in dem sich | , in that.3SG.M/N it.REFL |
9 | bei der Auswahl der | in the selection of |
10 | , ob es sich | , whether it it.REFL |
11 | , bei denen sich | , at which it.REFL |
12 | , der sich in | , that.3SG.M it.REFL in |
13 | , sich in die | , it.REFL in the |
14 | aus sich selbst heraus | out of it.REFL |
15 | , die sich auf | , that.3SG.F/3PL it.REFL on |

(a) Linear token 4-grams

rank | syntactic n-gram | literal translation | translation
1 | und>können>werden>. | and>can>be>. | and can be. (passive)
2 | rückt>in>Vordergrund>den | bring>to>fore>the | bring to the fore
3 | rückt>in>Nähe>die | bring>in>proximity>the | bring sth. closer to
4 | ist>in>Lage>der | is>in>condition>the | is capable of
5 | im>als>im>auch | in>as>in>also | in X as well as Y
6 | bei>als>bei>auch | at>as>at>also | at X as well as Y
7 | kann>werden>gelesen>als | can>be>read>as | can be read as
8 | werden>erläutert>im>Folgenden | is>explained>in the>following | In the following, ... is explained
9 | ist>in>Regel>der | is>in>rule>the | is generally
10 | war>in>Lage>der | was>in>condition>the | was capable of
11 | und>kann>nicht>mehr | and>can>not>anymore | and can no longer
12 | zu>Beginn>Jahrhunderts>des | at>beginning>century>the | at the beginning of the century
13 | werden>vorgestellt>Im>Folgenden | is>presented>in the>following | In the following, ... is presented
14 | in>Hälfte>Jahrhunderts>des | in>half>century>of the | in the ... half of the century
15 | stellt>in>Mittelpunkt>den | puts>in>center>the | centers/focuses on

(b) Syntactic token 4-grams

Table 2: Highest-scoring token 4-grams for linear and syntactic n-grams (rank based on t-test; gray = n-gram is more frequent in literary studies, white = n-gram is more frequent in linguistics)

Interestingly, all of these use the pronoun die, which can be feminine singular, but is more likely to be plural (independent of gender). We might derive the explanatory hypothesis that literary scholars write more about individuals while linguists are rather concerned with groups of phenomena in a generic way. This is in accordance with the intuitive idea of how these disciplines work.

The results for syntactic n-grams in Table 2b are quite different. The most distinctive is a very general complex verb pattern in passive voice with the modal verb can, which can be combined with any main verb and is more common in linguistics. There are also some more specific complex verbs that include a main verb (ranks 7, 8 and 13). Additionally, there are the light verb constructions in den Vordergrund rücken (‘bring to the fore’), in die Nähe rücken (‘bring sth. closer to sth. else’), in der Lage sein (‘be able to do sth.’) and in den Mittelpunkt stellen (‘focus on sth.’). All of these structures relate to the properties of German described in section 4.1 and would not be detected in a purely linear n-gram approach. Other syntactic n-grams refer to structures that can be captured similarly by linear n-grams, e. g. the syntactic 4-gram ist>in>Regel>der corresponds to the linear n-gram ist in der Regel. This reflects the findings of section 6.1, showing overlap as well as differences between the two types of n-grams.

6.3 Quantifying linguistic interpretability

Taking these interpretations as a starting point, we attempted to quantify the interpretability of linear and syntactic n-grams. Thereby we hope to objectify the n-grams’ potential and provide a foundation for a deepened comparison.

A sample of syntactic and linear n-grams8 was annotated by three annotators according to the following categories:

1. This n-gram contains a (complex) lexical unit (LEX) or overlaps with one (LEX-P).

2. This n-gram contains a grammatical structure (GRAM) or overlaps with one (GRAM-P).

3. This n-gram contains a structure that is ambiguous between lexicon and grammar (LEX-P GRAM-P).

4. This n-gram does not contain a (complex) lexical unit or grammatical structure (NONE).

8 For the n-gram sizes 2-5, we chose the 20 highest-scoring syntactic and linear token n-grams, respectively, giving a total sample size of 160 items.

For categories 1 to 3, the annotators were asked to additionally provide the lexical unit or grammatical structure they were thinking of.

The annotators reached an inter-annotator agreement of Fleiss’ κ = 0.55, which we consider satisfying given the natural ambiguity of the task. After discussing nine elements where no agreement was reached initially, all three annotators agreed on one category for 57% of items. For the rest, at least two annotators agreed on one category. The following results are based on a majority vote.
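For illustration, agreement of this kind can be computed with the Fleiss' kappa implementation in statsmodels; the toy annotation matrix below is invented, while the study used 160 items and three annotators.

```python
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = annotated items, columns = annotators; labels are the
# categories defined above (invented toy values)
annotations = [
    ["LEX", "LEX", "LEX-P"],
    ["GRAM", "GRAM", "GRAM"],
    ["NONE", "GRAM-P", "NONE"],
    ["LEX", "LEX", "LEX"],
    # ... one row per annotated n-gram
]

# aggregate_raters turns category labels into the items x categories
# count table that fleiss_kappa expects
counts, categories = aggregate_raters(annotations)
print(fleiss_kappa(counts))
```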

Figure 5: Annotation of information in n-grams dependent on n-gram type, n = 160

Figure 5 shows the distribution of annotation categories for the two n-gram types. For the linear n-grams, more grammatical phenomena were found, and for syntactic n-grams, more lexical phenomena (especially complete lexical items) were found. The difference is significant with p < 0.001 (Fisher’s Exact Test), which shows that there are many non-linear lexical items that are detected by the syntactic n-grams only. The number of non-interpretable instances is higher in syntactic n-grams (1 vs. 10 instances). These are e. g. sequences of only one word and the following punctuation, or sequences related to specific properties of the annotation scheme.

Regarding the concrete structures observed, there is a clear overlap in lexical phenomena, e. g. the sequence in der Regel (‘as a rule’) is a linear as well as a syntactic n-gram. Syntactic n-grams additionally capture light verb constructions that are non-linear (see section 4.1), e. g. den<Vordergrund<in<rückt (‘bring to the fore’), which might explain the higher proportion of lexical phenomena. In grammatical structures, on the other hand, there is hardly any overlap. While most linear n-grams (35 of 55 grammatical structures in total) capture different types of relative clauses (e. g. the trigram , die ihm, ‘that [...] him’), among the syntactic n-grams complex verb structures (11 of 19 grammatical structures in total) and phenomena of coordination (5 of 19) dominate.

Together, linear and syntactic n-grams result in an informative comparison of the two disciplines: in literary studies we find many more relative clauses and light verb constructions, while linguistics employs more complex verb forms like passive and modal verbs. A more comprehensive interpretation of these and more data with respect to the disciplinary differences is conducted in Andresen and Zinsmeister (2017).

The annotation experiment shows that linear and syntactic n-grams capture very different phenomena and can complement each other in useful ways. At this point, it is not possible to generalize these results as they need to be verified by analyzing more data of different genres (and languages).

7 Conclusion

The research presented in this paper shows that an analysis based on syntactic n-grams, understood as n-grams following the path of dependency relations in the sentence, can give linguistically meaningful insights into the properties of a language variety. We have demonstrated theoretically and empirically that there are many non-linear structures in languages like German. These are not adequately taken into consideration in a language representation based on linear n-grams only. Through the example of comparing the German academic languages of linguistics and literary studies we showed that linear and syntactic n-grams capture very different linguistic structures.

In our exemplary study, complex verbs and light verb constructions in particular could not be detected by the linear n-gram analysis.9 However, the analysis of syntactic n-grams is highly dependent on the quality of the dependency annotation. Also, some structures are frequent only because of specific properties of the annotation scheme. It remains a desideratum for future research to determine the influence of the annotation scheme and the potential of Universal Dependencies to allow for a cross-linguistic comparison of this type of analysis.

9 Our aim was to increase the coverage of phenomena included in the analysis. We do not attempt to automatically distinguish between light verb constructions and free verb-noun associations.

For the future, it would be desirable to include syntactic n-grams that take more than one dependent per head into account. Currently, patterns such as a verb with its subject and object, or a noun with two modifiers, are missed by the syntactic n-grams of our study. The linear n-grams can compensate for this only very partially. Also, the potential of dependency-based annotations should be systematically evaluated in comparison to other syntactic models, e. g. constituency-based models.

Acknowledgments

We would like to thank Yannick Versley and Fabian Barteld for their very helpful comments on an earlier version of the paper, Sarah Jablotschkin for contributing to the manual n-gram evaluation, and Piklu Gupta for improving our English. All remaining errors are our own.

References

Elena Afros and Catherine F. Schryer. 2009. Pro- motional (meta)discourse in research articles in lan- guage and literary studies. English for Specific Pur- poses, 28(1):58–68, January.

Melanie Andresen and Heike Zinsmeister. 2017. Ap- proximating Style by n-Gram-based Annotation. In Proceedings of the Workshop on Stylistic Variation, Copenhagen, Denmark, September.

Markus Becker and Anette Frank. 2002. A Stochastic Topological Parser of German. In Proceedings of COLING 2002, pages 71–77.

Douglas Biber, Stig Johansson, Geoffrey Leech, Su- san Conrad, and Edward Finegan. 1999. Longman Grammar of Spoken and Written English. Longman, Harlow.

Bernd Bohnet. 2010. Very High Accuracy and Fast Dependency Parsing is not a Contradiction. InPro- ceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Bei- jing, China.

Stefan Bott and Sabine Schulte im Walde. 2015.

Exploiting Fine-grained Syntactic Transfer Features to Predict the Compositionality of German Par- ticle Verbs. In Proceedings of the 11th Inter- national Conference on Computational Semantics, IWCS 2015, 15-17 April, 2015, Queen Mary Uni- versity of London, London, UK, pages 34–39.

(24)

Anna Cardinaletti and Ian Roberts. 2002. Clause Structure and X-Second. In Guglielmo Cinque, ed- itor, Functional Structure in DP and IP: The Car- tography of Syntactic Structures, volume 1, pages 123–166. Oxford University Press.

Richard Futrell, Kyle Mahowald, and Edward Gibson.

2015. Quantifying Word Order Freedom in Depen- dency Corpora. In Proceedings of the Third In- ternational Conference on Dependency Linguistics (Depling 2015), pages 91–100, Uppsala.

Yoav Goldberg and Jon Orwant. 2013. A Dataset of Syntactic-Ngrams over Time from a Very Large Cor- pus of English Books. InSecond Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 241–247, Atlanta, Georgia, USA, June. Association for Computational Linguistics.

David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006. A closer look at skip-gram modelling. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006), pages 1–4.

Dan Jurafsky and James H. Martin. 2014. Speech and Language Processing, volume 3. Pearson.

Tibor Kiss and Jan Strunk. 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics, 32(4):485–525, December.

Sandra Kübler and Heike Zinsmeister. 2015. Corpus Linguistics and Linguistically Annotated Corpora. Bloomsbury, London, New York.

Gabriella Lapesa and Stefan Evert. 2017. Large-scale evaluation of dependency-based DSMs: Are they worth the effort? In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 394–400, Valencia, Spain, April. Association for Computational Linguistics.

Jefrey Lijffijt, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila. 2014. Significance testing of word frequencies in corpora. Digital Scholarship in the Humanities, pages 1–24, December.

Michaela Mahlberg. 2013. Corpus Stylistics and Dickens's Fiction. Number 14 in Routledge Advances in Corpus Linguistics. Routledge, New York.

Wolfgang Maier, Miriam Kaeshammer, Peter Baumann, and Sandra Kübler. 2014. Discosuite - A Parser Test Suite for German Discontinuous Structures. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).

Joakim Nivre, Željko Agić, and Lars Ahrenberg. 2017. Universal Dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University.

Timothy Osborne, Michael Putnam, and Thomas Groß. 2012. Catenae: Introducing a Novel Unit of Syntactic Analysis. Syntax, 15(4):354–396, December.

Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199.

Magali Paquot and Yves Bestgen. 2009. Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. In Andreas H. Jucker, Daniel Schreier, and Marianne Hundt, editors, Corpora: Pragmatics and Discourse, pages 247–269. Brill, January.

Anne Schiller, Simone Teufel, Christine Thielen, and Christine Stöckert. 1999. Guidelines für das Tagging deutscher Textcorpora mit STTS (kleines und großes Tagset). Stuttgart, Tübingen.

Wolfgang Seeker and Jonas Kuhn. 2012. Making Ellipses Explicit in Dependency Conversion for a German Treebank. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 3132–3139, Istanbul, Turkey.

Grigori Sidorov, Francisco Velasquez, Efstathios Stamatatos, Alexander Gelbukh, and Liliana Chanona-Hernández. 2012. Syntactic Dependency-Based N-grams as Classification Features. In Ildar Batyrshin and Miguel González Mendoza, editors, Advances in Computational Intelligence, number 7630 in Lecture Notes in Computer Science, pages 1–11. Springer, October.

Grigori Sidorov. 2013. Syntactic Dependency Based N-grams in Rule Based Automatic English as Second Language Grammar Correction. International Journal of Computational Linguistics and Applications, 4(2):169–188.

Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2004. Learning syntactic patterns for automatic hypernym discovery. In NIPS, volume 17, pages 1297–1304.

Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556, March.

Mark Twain. 1880. A Tramp Abroad. Chatto & Windus, London.

Andreas van Cranenburgh and Rens Bod. 2017. A Data-Oriented Model of Literary Language. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1228–1238.

Yannick Versley. 2013. A graph-based approach for implicit discourse relations. Computational Linguistics in the Netherlands Journal, 3:148–173.


On the Predicate-Argument Structure: Internal and Absorbing Scope

Igor Boguslavsky
Russian Academy of Sciences, Institute for Information Transmission Problems, Russia
Universidad Politécnica de Madrid, Spain
igor.m.boguslavsky@gmail.com

Abstract

Valency filling is considered a major mechanism for constructing the semantic structure of the sentence from the semantic structures of words. This approach requires a broader view of valency and actant, covering all kinds of actant-bearing words and all types of valency filling. We introduce the concept of scope as a generalization of actant: it is any fragment of a Syntactic (SyntScope) or Semantic Structure (SemScope) that fills a valency of a predicate. An actant is a particular case of scope. We discuss two classes of situations, mostly on material from Russian, that manifest non-isomorphism between SyntScope and SemScope: (a) a meaning α that fills a valency of word L constitutes only a part of the meaning of word L′ (internal scope); (b) a predicate π is an internal component of the meaning of word L; π extends its valency (distinct from the valencies of L) to words different from L (absorbing scope).

1 Introduction

This paper is a continuation of a series of publications (Boguslavsky 1985, 1996, 1998, 2003, 2007, 2014, 2016) in which we discuss different types of valency slot filling. Several introductory remarks are in order.

First of all, instantiating valency slots, or, in a different terminology, identifying the arguments of predicates, is a major step in constructing the semantic structure of the sentence, because it is the main mechanism of meaning amalgamation, a kind of semantic glue that connects meanings together. This view of valencies implies that the concepts of valency and actant (or argument) should be interpreted more broadly than is often done. Here we follow the tradition of the Moscow Semantic School (MSS), which, in its turn, shares these notions with the Meaning–Text theory (Apresjan 1974, Mel'čuk 1974). For MSS, the starting point in defining the concept of valency of a word is the semantic analysis of the situation denoted by this word. The analytical semantic definition of a word, constructed according to certain rules (Apresjan 1995), should explicitly present all obligatory participants of the situation denoted by this word. For a word L to have a certain valency slot it is necessary, though not sufficient, that the situation denoted by L contain a corresponding participant in an intuitively obvious way. Another condition is that this participant should be expressible in a sentence along with L in a systematic way (Mel'čuk 2014). A word or phrase that denotes such a participant is said to fill (or instantiate) the valency slot and is called an actant.

The range of valency-bearing words is not restricted to verbs and nouns. Other parts of speech, such as adjectives, adverbs, particles, conjunctions, and prepositions, are equally entitled to be classed as actant-bearing words. Moreover, being non-prototypical predicates, they substantially extend our idea of the inventory of ways in which predicates instantiate their valencies.
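Purely as an illustration of this broader notion (not part of the original paper; the class design, the slot inventories, and all names below are our own assumptions), one might model valency slots and their filling along the following lines, treating a preposition as just as actant-bearing as a verb:

```python
# Sketch only: valency frames and actant filling for prototypical and
# non-prototypical predicates.
from dataclasses import dataclass, field

@dataclass
class Predicate:
    lemma: str
    slots: list                       # names of obligatory participants
    actants: dict = field(default_factory=dict)

    def fill(self, slot, filler):
        """A word or phrase instantiating a valency slot is an actant."""
        if slot not in self.slots:
            raise ValueError(f"{self.lemma!r} has no valency slot {slot!r}")
        self.actants[slot] = filler

# A verb as a prototypical predicate: the situation it denotes has
# intuitively obvious participants.
buy = Predicate("buy", ["buyer", "seller", "goods", "money"])
buy.fill("buyer", "Mary")
buy.fill("goods", "a book")

# A preposition is equally actant-bearing: 'between' relates a located
# object and (at least) two landmarks.
between = Predicate("between", ["object", "landmark1", "landmark2"])
between.fill("object", "the house")
between.fill("landmark1", "the river")
between.fill("landmark2", "the forest")

print(buy.actants)        # {'buyer': 'Mary', 'goods': 'a book'}
print(between.actants)
```

The point of the sketch is only that the filling operation is the same regardless of the part of speech of the predicate.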

The next remark is that we are going to speak about valency filling at two representation levels – at the level of the syntactic structure (SyntS) and at the level of the semantic structure (SemS). SyntS is a dependency tree, the nodes of which are lexical units (LUs) – lexemes or multiword expressions that function as a whole. In SemS, LUs are represented by their semantic decomposition, which is a

