
SemEval 2014

The 8th International Workshop on Semantic Evaluation (SemEval 2014)

Proceedings of the Workshop

August 23-24, 2014

Dublin, Ireland


Organized and sponsored in part by:

The ACL Special Interest Group on the Lexicon (SIGLEX)

© 2014: Papers marked with a Creative Commons or other specific license statement are copyright of their respective authors (or their employers).

ISBN 978-1-941643-24-2


Welcome to SemEval-2014

The Semantic Evaluation (SemEval) series of workshops focuses on the evaluation and comparison of systems that can analyse diverse semantic phenomena in text, with the aim of extending the current state-of-the-art in semantic analysis and of creating high-quality annotated datasets for a range of increasingly challenging problems in natural language semantics. SemEval provides an exciting forum for researchers to propose challenging research problems in semantics and to build systems and techniques to address them.

SemEval-2014 is the eighth workshop in the series. The first three workshops, SensEval-1 (1998), SensEval-2 (2001), and SensEval-3 (2004), focused on word sense disambiguation, each time growing in the number of languages offered in the tasks and in the number of participating teams. In 2007, the workshop was renamed SemEval, and over the next four editions (2007, 2010, 2012, and 2013) the nature of the tasks evolved to include semantic analysis tasks beyond word sense disambiguation.

Starting in 2012, SemEval became a yearly event.

This volume contains papers accepted for presentation at the SemEval-2014 International Workshop on Semantic Evaluation Exercises. SemEval-2014 was co-located with the 25th International Conference on Computational Linguistics (COLING) in Dublin.

SemEval-2014 included the following 10 shared tasks:

1. Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Entailment

2. Grammar Induction for Spoken Dialogue Systems

3. Cross-Level Semantic Similarity

4. Aspect Based Sentiment Analysis

5. L2 Writing Assistant

6. Supervised Semantic Parsing of Spatial Robot Commands

7. Analysis of Clinical Text

8. Broad-Coverage Semantic Dependency Parsing

9. Sentiment Analysis in Twitter

10. Multilingual Semantic Textual Similarity

About 185 teams submitted more than 500 systems for the 10 tasks of SemEval-2014. This volume contains both Task Description papers that describe each of the above tasks and System Description papers that describe the systems that participated in the above tasks. A total of 10 task description papers and 139 system description papers are included in this volume.

We are grateful to all program committee members for their thorough and thoughtful reviews; the papers in these proceedings have surely benefited from this feedback. We also thank the COLING 2014 conference organizers for the local organization and the forum. Finally, we most gratefully acknowledge the support of our sponsor, the ACL Special Interest Group on the Lexicon (SIGLEX).

Welcome to SemEval-2014!

Preslav Nakov and Torsten Zesch


SemEval Chairs:

• Preslav Nakov, Qatar Computing Research Institute

• Torsten Zesch, University of Duisburg-Essen

Reviewing Track Chairs:

• Carmen Banea, University of Michigan

• Raffaella Bernardi, University of Trento

• Mona Diab, The George Washington University

• Kais Dukes, University of Leeds

• Maarten van Gompel, Radboud University Nijmegen

• Elias Iosif, Athena Research and Innovation Center

• David Jurgens, Sapienza University of Rome

• Marco Kuhlmann, Linköping University

• Preslav Nakov, Qatar Computing Research Institute

• Stephan Oepen, Universitetet i Oslo

• Maria Pontiki, Institute for Language and Speech Processing

• Sameer Pradhan, Harvard University

• Sara Rosenthal, Columbia University

• Mohammad Taher Pilehvar, Sapienza University of Rome

Reviewers:

• Naveed Afzal, King Abdulaziz University North Branch Jeddah

• Apoorv Agarwal, Columbia University

• Eneko Agirre, University of the Basque Country

• Željko Agić, University of Zagreb

• Ameeta Agrawal, York University

• Mariana S. C. Almeida, Priberam / Instituto de Telecomunicações

• Miguel A. Alonso, Universidade da Coruña

• Ana Alves, CISUC - University of Coimbra and Polytechnic Institute of Coimbra

• Silvio Amir, INESC-ID, IST

• Beakal Gizachew Assefa, Koc University

• Georgia Athanasopoulou, Technical University of Crete

• Giuseppe Attardi, Università di Pisa

• Pedro Balage Filho, University of São Paulo

• Timothy Baldwin, The University of Melbourne

• Sivaji Bandyopadhyay, Jadavpur University


• Carmen Banea, University of Michigan

• Pierpaolo Basile, University of Bari

• Roberto Basili, University of Roma, Tor Vergata

• Osman Baskaya, Koç University

• Frederic Bechet, Aix Marseille Universite - LIF/CNRS

• Islam Beltagy, The University of Texas at Austin

• Gabor Berend, University Of Szeged

• Raffaella Bernardi, University of Trento

• Yves Bestgen, Université Catholique de Louvain

• Steven Bethard, University of Alabama at Birmingham

• Ergun Biçici, Centre for Next Generation Localisation, Dublin City University

• Johannes Bjerva, Center for Language and Cognition Groningen, University of Groningen

• Pavel Blinov, VSHU

• Bernd Bohnet, University of Birmingham

• Fritjof Bornebusch, University of Bremen

• Houda Bouamor, Carnegie Mellon University

• Caroline Brun, Xerox Research Centre Europe

• Tomáš Brychcín, University of West Bohemia

• Paul Buitelaar, INSIGHT, National University of Ireland, Galway

• Davide Buscaldi, LIPN, Université Paris 13

• Glaucia Cancino, Universität Bremen

• Yenier Castañeda, University of Matanzas, Cuba

• Giuseppe Castellucci, University of Rome, Tor Vergata

• Tawunrat Chalothorn, University of Northumbria at Newcastle

• Maryna Chernyshevich, IHS Inc. / IHS Global Belarus

• Arodami Chorianopoulou, Technical University of Crete

• Mark Cieliebak, Zurich University of Applied Sciences

• Philipp Cimiano, University of Bielefeld

• Francisco Couto, University of Lisbon

• Daniel Dahlmeier, SAP

• Hong-Jie Dai, Taipei Medical University

• Alexandre Denis, LORIA/University of Lorraine

• Mona Diab, The George Washington University

• Cicero dos Santos, IBM Research

• Yantao Du, Institute of Computer Science and Technology of Peking University

• Kais Dukes, University of Leeds

• Asif Ekbal, IIT Patna

• Kilian Evang, University of Groningen

• Stefan Evert, FAU Erlangen-Nürnberg

• Richárd Farkas, University of Szeged


• Lorenzo Ferrone, University of Rome, Tor Vergata

• Oliver Ferschke, UKP Lab, Technische Universität Darmstadt

• Tim Finin, University of Maryland

• Lucie Flekova, UKP Lab, Technische Universität Darmstadt

• Dan Flickinger, Stanford University

• Jennifer Foster, Dublin City University

• Dimitris Galanis, Institute for Language and Speech Processing, Athena Research Center

• Pablo Gamallo, CITIUS, University of Santiago de Compostela

• Björn Gambäck, Norwegian University of Science and Technology

• Marcos Garcia, University of Santiago de Compostela

• Aitor García Pablos, Vicomtech-IK4

• Georgi Georgiev, Ontotext AD

• Swarnendu Ghosh, Jadavpur University

• Daniel Gildea, University of Rochester

• Filip Ginter, University of Turku

• Paulo Gomes, CISUC - University of Coimbra

• Hugo Gonçalo Oliveira, CISUC, University of Coimbra

• Anubhav Gupta, Université de Franche-Comté

• Rohit Gupta, University of Wolverhampton

• Iryna Gurevych, UKP Lab, Technische Universität Darmstadt

• Yoan Gutiérrez, University of Matanzas

• Tobias Günther, Retresco GmbH

• Hussam Hamdan, Aix-Marseille Université

• Lushan Han, University of Maryland

• Viktor Hangya, University of Szeged

• Eva Hasler, University of Edinburgh

• Iris Hendrickx, Center for Language Studies, Radboud University Nijmegen

• Nora Hollenstein, University of Zurich

• Veronique Hoste, Ghent University

• Estevam Hruschka, Federal University of São Carlos

• Pingping Huang, Peking University

• Elias Iosif, Athena Research and Innovation Center

• Angelina Ivanova, University of Oslo

• Martin Jaggi, ETH Zurich

• Arun kumar Jayapal, Trinity College Dublin

• Sergio Jiménez, National University of Colombia

• Salud María Jiménez-Zafra, University of Jaén

• Richard Johansson, University of Gothenburg

• David Jurgens, Sapienza University of Rome

• Magdalena Kacmajor, IBM


• Rafael-Michael Karampatsis, Athens University of Economics and Business

• Abhay Kashyap, University of Maryland

• Rohit Kate, University of Wisconsin-Milwaukee

• Melinda Katona, University of Szeged

• Svetlana Kiritchenko, National Research Council Canada

• Ioannis Klasinas, Technical University of Crete

• Manfred Klenner, University of Zurich

• Evgeniy Kotelnikov, Vyatka State University of Humanities at Kirov

• Milen Kouylekov, University of Oslo

• Marco Kuhlmann, Linköping University

• Alice Lai, University of Illinois at Urbana-Champaign

• Man Lan, East China Normal University

• Joseph Le Roux, Université Paris Nord

• Els Lefever, LT3, Ghent University

• Maria Liakata, University of Warwick

• Pengfei Liu, The Chinese University of Hong Kong

• Peter Ljunglöf, University of Gothenburg and Chalmers University of Technology

• André Lynum, Norwegian University of Science and Technology

• Nikolaos Malandrakis, Signal Analysis and Interpretation Laboratory (SAIL), USC

• Suresh Manandhar, University of York

• Soumik Mandal, Jadavpur University

• Morgane Marchand, CEA-LIST / CNRS-LIMSI

• Marco Marelli, University of Trento

• Patricio Martinez-Barco, Universidad de Alicante

• André F. T. Martins, Priberam, Instituto de Telecomunicacoes

• Eugenio Martínez-Cámara, University of Jaén

• Sérgio Matos, DETI/IEETA, University of Aveiro, Portugal

• Willem Mattelaer, Katholieke Universiteit Leuven

• Kathy McKeown, Columbia University

• Helen Meng, The Chinese University of Hong Kong

• Todor Mihaylov, Sofia University

• Yasuhide Miura, Research & Technology Group, Fuji Xerox Co., Ltd.

• Saif Mohammad, National Research Council Canada

• Behrang Mohit, Carnegie Mellon University

• Andres Montoyo, University of Alicante

• Jose G. Moreno, Normandie University - GREYC

• Rafael Muñoz-Guillena, University of Alicante

• Preslav Nakov, Qatar Computing Research Institute

• Naveen Nandan, SAP Research & Innovation

• Shrikanth Narayanan, University of Southern California


• Sapna Negi, National University of Ireland, Galway

• Max Nitze, University of Bremen

• Stephan Oepen, Universitetet i Oslo

• Maite Oronoz, University of the Basque Country

• Reynier Ortega Bueno, CERPAMID, Cuba

• Woodley Packard, University of Washington

• Partha Pakray, Norwegian University of Science and Technology

• Thiago Pardo, University of São Paulo

• Parth Pathak, ezDI

• Braja Gopal Patra, Jadavpur University

• John Pavlopoulos, Athens University of Economics and Business

• Ted Pedersen, University of Minnesota, Duluth

• Viktor Pekar, University of Birmingham

• Mohammad Taher Pilehvar, Sapienza University of Rome

• David Pinto, Benemérita Universidad Autónoma de Puebla

• Maria Pontiki, Institute for Language and Speech Processing

• Alexandros Potamianos, National Technical University of Athens

• Sameer Pradhan, Harvard University

• Thomas Proisl, FAU Erlangen-Nürnberg

• Avinesh PVS, IBM India Pvt Ltd, IBM Software Group, Watson

• SV Ramanan, RelAgent Private Ltd

• German Rigau, UPV/EHU

• Miguel Rios, University of Wolverhampton

• Alejandro Riveros, Universidad Nacional de Colombia

• Sara Rosenthal, Columbia University

• Alex Rudnick, Indiana University

• José Saias, Universidade de Evora

• Kim Schouten, Erasmus University Rotterdam

• Clemens Schulze Wettendorf, FAU Erlangen Nürnberg

• Djamé Seddah, Université Paris Sorbonne (Paris IV)

• Nádia Silva, University of São Paulo

• Mario J. Silva, IST/INESC-ID

• Emilio Silva-Schlenker, Universidad Nacional de Colombia, Universidad de los Andes

• Michel Simard, National Research Council Canada

• Vikram Singh, Indian Institute of Technology-Patna

• Noah A. Smith, Carnegie Mellon University

• Josef Steinberger, University of West Bohemia

• Svetlana Stoyanchev, AT&T Labs Research

• Veselin Stoyanov, Facebook

• Md Arafat Sultan, University of Colorado Boulder


• Aleš Tamchyna, Charles University in Prague, UFAL MFF

• Liling Tan, Universität des Saarlandes

• Duyu Tang, Harbin Institute of Technology

• Cindi Thompson, University of San Francisco

• Zhu Tiantian, East China Normal University

• Zhiqiang Toh, Institute for Infocomm Research

• Lamia Tounsi, Dublin City University

• David Townsend, Montclair State University

• L. Alfonso Urena Lopez, University of Jaén

• Fatih Uzdilli, ZHAW Zurich University of Applied Sciences

• Antal van den Bosch, Radboud University Nijmegen

• Maarten van Gompel, Radboud University Nijmegen

• Cynthia Van Hee, Ghent University

• Kateřina Veselovská, Charles University in Prague

• Akriti Vij, SAP Research & Innovation, Singapore, and Nanyang Technological University, Singapore

• David Vilares, Universidade da Coruña

• Julio Villena-Román, Daedalus, S.A.

• Ngoc Phuoc An Vo, HLT-FBK, Trento

• Joachim Wagner, Centre for Next Generation Localisation, Dublin City University

• Wenting Wang, NA

• Andy Way, Centre for Next Generation Localisation, Dublin City University

• Janyce Wiebe, University of Pittsburgh

• Deniz Yuret, Koç University

• Roberto Zamparelli, Universitá di Trento

• Kalliopi Zervanou, University of Southern California

• Torsten Zesch, Language Technology Lab, University of Duisburg-Essen

• Yaoyun Zhang, University of Texas

• Fangxi Zhang, East China Normal University

• Jiang Zhao, East China Normal University

• Ming Zhou, Microsoft Research Asia

• Xiaodan Zhu, National Research Council Canada

Invited Speakers (jointly for SemEval and *SEM):

• Mark Steedman, University of Edinburgh

• Tim Baldwin, The University of Melbourne


Table of Contents

SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini and Roberto Zamparelli . . . . 1

SemEval-2014 Task 2: Grammar Induction for Spoken Dialogue Systems
Ioannis Klasinas, Elias Iosif, Katerina Louka and Alexandros Potamianos . . . . 9

SemEval-2014 Task 3: Cross-Level Semantic Similarity
David Jurgens, Mohammad Taher Pilehvar and Roberto Navigli . . . . 17

SemEval-2014 Task 4: Aspect Based Sentiment Analysis
Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos and Suresh Manandhar . . . . 27

SemEval 2014 Task 5 - L2 Writing Assistant
Maarten van Gompel, Iris Hendrickx, Antal van den Bosch, Els Lefever and Veronique Hoste . . . . 36

SemEval-2014 Task 6: Supervised Semantic Parsing of Robotic Spatial Commands
Kais Dukes . . . . 45

SemEval-2014 Task 7: Analysis of Clinical Text
Sameer Pradhan, Noémie Elhadad, Wendy Chapman, Suresh Manandhar and Guergana Savova . . . . 54

SemEval 2014 Task 8: Broad-Coverage Semantic Dependency Parsing
Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Dan Flickinger, Jan Hajic, Angelina Ivanova and Yi Zhang . . . . 63

SemEval-2014 Task 9: Sentiment Analysis in Twitter
Sara Rosenthal, Alan Ritter, Preslav Nakov and Veselin Stoyanov . . . . 73

SemEval-2014 Task 10: Multilingual Semantic Textual Similarity
Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau and Janyce Wiebe . . . . 81

AI-KU: Using Co-Occurrence Modeling for Semantic Similarity
Osman Baskaya . . . . 92

Alpage: Transition-based Semantic Graph Parsing with Syntactic Features
Corentin Ribeyre, Eric Villemonte de la Clergerie and Djamé Seddah . . . . 97

ASAP: Automatic Semantic Alignment for Phrases
Ana Alves, Adriana Ferrugento, Mariana Lourenço and Filipe Rodrigues . . . . 104

AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands
Svetlana Stoyanchev, Hyuckchul Jung, John Chen and Srinivas Bangalore . . . . 109

AUEB: Two Stage Sentiment Analysis of Social Network Messages
Rafael-Michael Karampatsis, John Pavlopoulos and Prodromos Malakasiotis . . . . 114

Bielefeld SC: Orthonormal Topic Modelling for Grammar Induction
John Philip McCrae and Philipp Cimiano . . . . 119


Biocom Usp: Tweet Sentiment Analysis with Adaptive Boosting Ensemble
Nádia Silva, Estevam Hruschka and Eduardo Hruschka . . . . 123

Biocom Usp: Tweet Sentiment Analysis with Adaptive Boosting Ensemble
Nádia Silva, Estevam Hruschka and Eduardo Hruschka . . . . 129

BioinformaticsUA: Concept Recognition in Clinical Narratives Using a Modular and Highly Efficient Text Processing Framework
Sérgio Matos, Tiago Nunes and José Luís Oliveira . . . . 135

Blinov: Distributed Representations of Words for Aspect-Based Sentiment Analysis at SemEval 2014
Pavel Blinov and Eugeny Kotelnikov . . . . 140

BUAP: Evaluating Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment
Saul Leon, Darnes Vilariño, David Pinto, Mireya Tovar and Beatriz Beltrán . . . . 145

BUAP: Evaluating Features for Multilingual and Cross-Level Semantic Textual Similarity
Darnes Vilariño, David Pinto, Saul Leon, Mireya Tovar and Beatriz Beltrán . . . . 149

BUAP: Polarity Classification of Short Texts
David Pinto, Darnes Vilariño, Saul Leon, Miguel Jasso and Cupertino Lucero . . . . 154

CECL: a New Baseline and a Non-Compositional Approach for the Sick Benchmark
Yves Bestgen . . . . 160

CISUC-KIS: Tackling Message Polarity Classification with a Large and Diverse Set of Features
João Leal, Sara Pinto, Ana Bento, Hugo Gonçalo Oliveira and Paulo Gomes . . . . 166

Citius: A Naive-Bayes Strategy for Sentiment Analysis on English Tweets
Pablo Gamallo and Marcos Garcia . . . . 171

CMU: Arc-Factored, Discriminative Semantic Dependency Parsing
Sam Thomson, Brendan O'Connor, Jeffrey Flanigan, David Bamman, Jesse Dodge, Swabha Swayamdipta, Nathan Schneider, Chris Dyer and Noah A. Smith . . . . 176

CMUQ-Hybrid: Sentiment Classification By Feature Engineering and Parameter Tuning
Kamla Al-Mannai, Hanan Alshikhabobakr, Sabih Bin Wasi, Rukhsar Neyaz, Houda Bouamor and Behrang Mohit . . . . 181

CMUQ@Qatar: Using Rich Lexical Features for Sentiment Analysis on Twitter
Sabih Bin Wasi, Rukhsar Neyaz, Houda Bouamor and Behrang Mohit . . . . 186

CNRC-TMT: Second Language Writing Assistant System Description
Cyril Goutte, Michel Simard and Marine Carpuat . . . . 192

Columbia NLP: Sentiment Detection of Sentences and Subjective Phrases in Social Media
Sara Rosenthal, Kathy McKeown and Apoorv Agarwal . . . . 198

COMMIT-P1WP3: A Co-occurrence Based Approach to Aspect-Level Sentiment Analysis
Kim Schouten, Flavius Frasincar and Franciska de Jong . . . . 203

Coooolll: A Deep Learning System for Twitter Sentiment Classification
Duyu Tang, Furu Wei, Bing Qin, Ting Liu and Ming Zhou . . . . 208


Copenhagen-Malmö: Tree Approximations of Semantic Parsing Problems
Natalie Schluter, Anders Søgaard, Jakob Elming, Dirk Hovy, Barbara Plank, Héctor Martínez Alonso, Anders Johanssen and Sigrid Klerke . . . . 213

DAEDALUS at SemEval-2014 Task 9: Comparing Approaches for Sentiment Analysis in Twitter
Julio Villena-Román, Janine García-Morera and José Carlos González-Cristóbal . . . . 218

DCU: Aspect-based Polarity Classification for SemEval Task 4
Joachim Wagner, Piyush Arora, Santiago Cortes, Utsab Barman, Dasha Bogdanova, Jennifer Foster and Lamia Tounsi . . . . 223

DIT: Summarisation and Semantic Expansion in Evaluating Semantic Similarity
Magdalena Kacmajor and John D. Kelleher . . . . 230

DLIREC: Aspect Term Extraction and Term Polarity Classification System
Zhiqiang Toh and Wenting Wang . . . . 235

DLS@CU: Sentence Similarity from Word Alignment
Md Arafat Sultan, Steven Bethard and Tamara Sumner . . . . 241

Duluth: Measuring Cross-Level Semantic Similarity with First and Second Order Dictionary Overlaps
Ted Pedersen . . . . 247

ECNU: A Combination Method and Multiple Features for Aspect Extraction and Sentiment Polarity Classification
Fangxi Zhang, Zhihua Zhang and Man Lan . . . . 252

ECNU: Expression- and Message-level Sentiment Orientation Classification in Twitter Using Multiple Effective Features
Jiang Zhao, Man Lan and Tiantian Zhu . . . . 259

ECNU: Leveraging on Ensemble of Heterogeneous Features and Information Enrichment for Cross Level Semantic Similarity Estimation
Tiantian Zhu and Man Lan . . . . 265

ECNU: One Stone Two Birds: Ensemble of Heterogenous Measures for Semantic Relatedness and Textual Entailment
Jiang Zhao, Tiantian Zhu and Man Lan . . . . 271

ezDI: A Hybrid CRF and SVM based Model for Detecting and Encoding Disorder Mentions in Clinical Notes
Parth Pathak, Pinal Patel, Vishal Panchal, Narayan Choudhary, Amrish Patel and Gautam Joshi . . . . 278

FBK-TR: Applying SVM with Multiple Linguistic Features for Cross-Level Semantic Similarity
Ngoc Phuoc An Vo, Tommaso Caselli and Octavian Popescu . . . . 284

FBK-TR: SVM for Semantic Relatedeness and Corpus Patterns for RTE
Ngoc Phuoc An Vo, Octavian Popescu and Tommaso Caselli . . . . 289

GPLSI: Supervised Sentiment Analysis in Twitter using Skipgrams
Javi Fernández, Yoan Gutiérrez, Jose Manuel Gómez and Patricio Martinez-Barco . . . . 294

haLF: Comparing a Pure CDSM Approach with a Standard Machine Learning System for RTE
Lorenzo Ferrone and Fabio Massimo Zanzotto . . . . 300


HulTech: A General Purpose System for Cross-Level Semantic Similarity based on Anchor Web Counts
Jose G. Moreno, Rumen Moraliyski, Asma Berrezoug and Gaël Dias . . . . 305

IHS R&D Belarus: Cross-domain extraction of product features using CRF
Maryna Chernyshevich . . . . 309

IITP: A Supervised Approach for Disorder Mention Detection and Disambiguation
Utpal Kumar Sikdar, Asif Ekbal and Sriparna Saha . . . . 314

IITP: Supervised Machine Learning for Aspect based Sentiment Analysis
Deepak Kumar Gupta and Asif Ekbal . . . . 319

IITPatna: Supervised Approach for Sentiment Analysis in Twitter
Raja Selvarajan and Asif Ekbal . . . . 324

Illinois-LH: A Denotational and Distributional Approach to Semantics
Alice Lai and Julia Hockenmaier . . . . 329

In-House: An Ensemble of Pre-Existing Off-the-Shelf Parsers
Yusuke Miyao, Stephan Oepen and Daniel Zeman . . . . 335

Indian Institute of Technology-Patna: Sentiment Analysis in Twitter
Vikram Singh, Arif Md. Khan and Asif Ekbal . . . . 341

INSIGHT Galway: Syntactic and Lexical Features for Aspect Based Sentiment Analysis
Sapna Negi and Paul Buitelaar . . . . 346

iTac: Aspect Based Sentiment Analysis using Sentiment Trees and Dictionaries
Fritjof Bornebusch, Glaucia Cancino, Melanie Diepenbeck, Rolf Drechsler, Smith Djomkam, Alvine Nzeungang Fanseu, Maryam Jalali, Marc Michael, Jamal Mohsen, Max Nitze, Christina Plump, Mathias Soeken, Fred Tchambo, Toni and Henning Ziegler . . . . 351

IUCL: Combining Information Sources for SemEval Task 5
Alex Rudnick, Levi King, Can Liu, Markus Dickinson and Sandra Kübler . . . . 356

IxaMed: Applying Freeling and a Perceptron Sequential Tagger at the Shared Task on Analyzing Clinical Texts
Koldo Gojenola, Maite Oronoz, Alicia Perez and Arantza Casillas . . . . 361

JOINT_FORCES: Unite Competing Sentiment Classifiers with Random Forest
Oliver Dürr, Fatih Uzdilli and Mark Cieliebak . . . . 366

JU_CSE: A Conditional Random Field (CRF) Based Approach to Aspect Based Sentiment Analysis
Braja Gopal Patra, Soumik Mandal, Dipankar Das and Sivaji Bandyopadhyay . . . . 370

JU-Evora: A Graph Based Cross-Level Semantic Similarity Analysis using Discourse Information
Swarnendu Ghosh, Nibaran Das, Teresa Gonçalves and Paulo Quaresma . . . . 375

Kea: Sentiment Analysis of Phrases Within Short Texts
Ameeta Agrawal and Aijun An . . . . 380

KUL-Eval: A Combinatory Categorial Grammar Approach for Improving Semantic Parsing of Robot Commands using Spatial Context
Willem Mattelaer, Mathias Verbeke and Davide Nitti . . . . 385


KUNLPLab: Sentiment Analysis on Twitter Data
Beakal Gizachew Assefa . . . . 391

Linköping: Cubic-Time Graph Parsing with a Simple Scoring Scheme
Marco Kuhlmann . . . . 395

LIPN: Introducing a new Geographical Context Similarity Measure and a Statistical Similarity Measure based on the Bhattacharyya coefficient
Davide Buscaldi, Jorge García Flores, Joseph Le Roux, Nadi Tomeh and Belém Priego Sanchez . . . . 400

LT3: Sentiment Classification in User-Generated Content Using a Rich Feature Set
Cynthia Van Hee, Marjan Van de Kauter, Orphee De Clercq, Els Lefever and Veronique Hoste . . . . 406

LyS: Porting a Twitter Sentiment Analysis Approach from Spanish to English
David Vilares, Miguel Hermo, Miguel A. Alonso, Carlos Gómez-Rodríguez and Yerai Doval . . . . 411

Meerkat Mafia: Multilingual and Cross-Level Semantic Textual Similarity Systems
Abhay Kashyap, Lushan Han, Roberto Yus, Jennifer Sleeman, Taneeya Satyapanich, Sunil Gandhi and Tim Finin . . . . 416

MindLab-UNAL: Comparing Metamap and T-mapper for Medical Concept Extraction in SemEval 2014 Task 7
Alejandro Riveros, Maria De Arteaga, Fabio González, Sergio Jimenez and Henning Müller . . . . 424

NILC_USP: An Improved Hybrid System for Sentiment Analysis in Twitter Messages
Pedro Balage Filho, Lucas Avanço, Thiago Pardo and Maria das Graças Volpe Nunes . . . . 428

NILC_USP: Aspect Extraction using Semantic Labels
Pedro Balage Filho and Thiago Pardo . . . . 433

NRC-Canada-2014: Detecting Aspects and Sentiment in Customer Reviews
Svetlana Kiritchenko, Xiaodan Zhu, Colin Cherry and Saif Mohammad . . . . 437

NRC-Canada-2014: Recent Improvements in the Sentiment Analysis of Tweets
Xiaodan Zhu, Svetlana Kiritchenko and Saif Mohammad . . . . 443

NTNU: Measuring Semantic Similarity with Sublexical Feature Representations and Soft Cardinality
André Lynum, Partha Pakray, Björn Gambäck and Sergio Jimenez . . . . 448

OPI: Semeval-2014 Task 3 System Description
Marek Kozlowski . . . . 454

Peking: Profiling Syntactic Tree Parsing Techniques for Semantic Graph Parsing
Yantao Du, Fan Zhang, Weiwei Sun and Xiaojun Wan . . . . 459

Potsdam: Semantic Dependency Parsing by Bidirectional Graph-Tree Transformations and Syntactic Parsing
Željko Agić and Alexander Koller . . . . 465

Priberam: A Turbo Semantic Parser with Second Order Features
André F. T. Martins and Mariana S. C. Almeida . . . . 471

RelAgent: Entity Detection and Normalization for Diseases in Clinical Records: a Linguistically Driven Approach
SV Ramanan and Senthil Nathan . . . . 477


RoBox: CCG with Structured Perceptron for Supervised Semantic Parsing of Robotic Spatial Commands
Kilian Evang and Johan Bos . . . . 482

RTM-DCU: Referential Translation Machines for Semantic Similarity
Ergun Bicici and Andy Way . . . . 487

RTRGO: Enhancing the GU-MLT-LT System for Sentiment Analysis of Short Messages
Tobias Günther, Jean Vancoppenolle and Richard Johansson . . . . 497

SA-UZH: Verb-based Sentiment Analysis
Nora Hollenstein, Michael Amsler, Martina Bachmann and Manfred Klenner . . . . 503

SAIL-GRS: Grammar Induction for Spoken Dialogue Systems using CF-IRF Rule Similarity
Kalliopi Zervanou, Nikolaos Malandrakis and Shrikanth Narayanan . . . . 508

SAIL: Sentiment Analysis using Semantic Similarity and Contrast Features
Nikolaos Malandrakis, Michael Falcone, Colin Vaz, Jesse James Bisogni, Alexandros Potamianos and Shrikanth Narayanan . . . . 512

SAP-RI: A Constrained and Supervised Approach for Aspect-Based Sentiment Analysis
Naveen Nandan, Daniel Dahlmeier, Akriti Vij and Nishtha Malhotra . . . . 517

SAP-RI: Twitter Sentiment Analysis in Two Days
Akriti Vij, Nishta Malhotra, Naveen Nandan and Daniel Dahlmeier . . . . 522

SeemGo: Conditional Random Fields Labeling and Maximum Entropy Classification for Aspect Based Sentiment Analysis
Pengfei Liu and Helen Meng . . . . 527

SemantiKLUE: Robust Semantic Similarity at Multiple Levels Using Maximum Weight Matching
Thomas Proisl, Stefan Evert, Paul Greiner and Besim Kabashi . . . . 532

Sensible: L2 Translation Assistance by Emulating the Manual Post-Editing Process
Liling Tan, Anne Schumann, Jose Martinez and Francis Bond . . . . 541

Senti.ue: Tweet Overall Sentiment Classification Approach for SemEval-2014 Task 9
José Saias . . . . 546

SentiKLUE: Updating a Polarity Classifier in 48 Hours
Stefan Evert, Thomas Proisl, Paul Greiner and Besim Kabashi . . . . 551

ShrdLite: Semantic Parsing Using a Handmade Grammar
Peter Ljunglöf . . . . 556

SimCompass: Using Deep Learning Word Embeddings to Assess Cross-level Similarity
Carmen Banea, Di Chen, Rada Mihalcea, Claire Cardie and Janyce Wiebe . . . . 560

SINAI: Voting System for Aspect Based Sentiment Analysis
Salud María Jiménez-Zafra, Eugenio Martínez-Cámara, Maite Martin and L. Alfonso Urena Lopez . . . . 566

SINAI: Voting System for Twitter Sentiment Analysis
Eugenio Martínez-Cámara, Salud María Jiménez-Zafra, Maite Martin and L. Alfonso Urena Lopez . . . . 572


SNAP: A Multi-Stage XML-Pipeline for Aspect Based Sentiment Analysis
Clemens Schulze Wettendorf, Robin Jegan, Allan Körner, Julia Zerche, Nataliia Plotnikova, Julian Moreth, Tamara Schertl, Verena Obermeyer, Susanne Streil, Tamara Willacker and Stefan Evert . . . . 578

SSMT: A Machine Translation Evaluation View To Paragraph-to-Sentence Semantic Similarity
Pingping Huang and Baobao Chang . . . . 585

SU-FMI: System Description for SemEval-2014 Task 9 on Sentiment Analysis in Twitter
Boris Velichkov, Borislav Kapukaranov, Ivan Grozev, Jeni Karanesheva, Todor Mihaylov, Yasen Kiprov, Preslav Nakov, Ivan Koychev and Georgi Georgiev . . . . 590

Supervised Methods for Aspect-Based Sentiment Analysis
Hussam Hamdan, Patrice Bellot and Frederic Bechet . . . . 596

Swiss-Chocolate: Sentiment Detection using Sparse SVMs and Part-Of-Speech n-Grams
Martin Jaggi, Fatih Uzdilli and Mark Cieliebak . . . . 601

Synalp-Empathic: A Valence Shifting Hybrid System for Sentiment Analysis
Alexandre Denis, Samuel Cruz-Lara, Nadia Bellalem and Lotfi Bellalem . . . . 605

SZTE-NLP: Aspect level opinion mining exploiting syntactic cues
Viktor Hangya, Gabor Berend, István Varga and Richárd Farkas . . . . 610

SZTE-NLP: Clinical Text Analysis with Named Entity Recognition
Melinda Katona and Richárd Farkas . . . . 615

TCDSCSS: Dimensionality Reduction to Evaluate Texts of Varying Lengths - an IR Approach
Arun kumar Jayapal, Martin Emms and John Kelleher . . . . 619

Team Z: Wiktionary as a L2 Writing Assistant
Anubhav Gupta . . . . 624

TeamX: A Sentiment Analyzer with Enhanced Lexicon Mapping and Weighting Scheme for Unbalanced Data
Yasuhide Miura, Shigeyuki Sakaki, Keigo Hattori and Tomoko Ohkuma . . . . 628

TeamZ: Measuring Semantic Textual Similarity for Spanish Using an Overlap-Based Approach
Anubhav Gupta . . . . 633

The Impact of Z_score on Twitter Sentiment Analysis
Hussam Hamdan, Patrice Bellot and Frederic Bechet . . . . 636

The Meaning Factory: Formal Semantics for Recognizing Textual Entailment and Determining Semantic Similarity
Johannes Bjerva, Johan Bos, Rob van der Goot and Malvina Nissim . . . . 642

Think Positive: Towards Twitter Sentiment Analysis from Scratch
Cicero dos Santos . . . . 647

ThinkMiners: Disorder Recognition using Conditional Random Fields and Distributional Semantics
Ankur Parikh, Avinesh PVS, Joy Mustafi, Lalit Agarwalla and Ashish Mungi . . . . 652

TJP: Identifying the Polarity of Tweets from Contexts
Tawunrat Chalothorn and Jeremy Ellman . . . . 657


TMUNSW: Disorder Concept Recognition and Normalization in Clinical Notes for SemEval-2014 Task 7

Jitendra Jonnagaddala, Manish Kumar, Hong-Jie Dai, Enny Rachmani and Chien-Yeh Hsu. . . .663 tucSage: Grammar Rule Induction for Spoken Dialogue Systems via Probabilistic Candidate Selection

Arodami Chorianopoulou, Georgia Athanasopoulou, Elias Iosif, Ioannis Klasinas and Alexandros Potamianos . . . .668 TUGAS: Exploiting unlabelled data for Twitter sentiment analysis

Silvio Amir, Miguel B. Almeida, Bruno Martins, João Filgueiras and Mario J. Silva . . . .673 Turku: Broad-Coverage Semantic Parsing with Rich Features

Jenna Kanerva, Juhani Luotolahti and Filip Ginter . . . .678 UBham: Lexical Resources and Dependency Parsing for Aspect-Based Sentiment Analysis

Viktor Pekar, Naveed Afzal and Bernd Bohnet . . . .683 UEdin: Translating L1 Phrases in L2 Context using Context-Sensitive SMT

Eva Hasler . . . .688 ÚFAL: Using Hand-crafted Rules in Aspect Based Sentiment Analysis on Parsed Data

Kateřina Veselovská and Aleš Tamchyna . . . .694 UIO-Lien: Entailment Recognition using Minimal Recursion Semantics

Elisabeth Lien and Milen Kouylekov . . . .699 UKPDIPF: Lexical Semantic Approach to Sentiment Polarity Prediction in Twitter Data

Lucie Flekova, Oliver Ferschke and Iryna Gurevych. . . .704 ULisboa: Identification and Classification of Medical Concepts

André Leal, Diogo Gonçalves, Bruno Martins and Francisco M Couto . . . .711 UMCC_DLSI_SemSim: Multilingual System for Measuring Semantic Textual Similarity

Alexander Chavez, Héctor Dávila, Yoan Gutiérrez, Antonio Fernández-Orquín, Andrés Montoyo and Rafael Muñoz . . . .716 UMCC_DLSI: A Probabilistic Automata for Aspect Based Sentiment Analysis

Yenier Castañeda, Armando Collazo, Elvis Crego, Jorge L. Garcia, Yoan Gutierrez, David Tomás, Andrés Montoyo and Rafael Muñoz . . . .722 UMCC_DLSI: Sentiment Analysis in Twitter using Polirity Lexicons and Tweet Similarity

Pedro Aniel Sánchez-Mirabal, Yarelis Ruano Torres, Suilen Hernández Alvarado, Yoan Gutiérrez, Andrés Montoyo and Rafael Muñoz . . . .727 UNAL-NLP: Combining Soft Cardinality Features for Semantic Textual Similarity, Relatedness and Entailment

Sergio Jimenez, George Dueñas, Julia Baquero and Alexander Gelbukh . . . .732 UNAL-NLP: Cross-Lingual Phrase Sense Disambiguation with Syntactic Dependency Trees

Emilio Silva-Schlenker, Sergio Jimenez and Julia Baquero . . . .743 UNIBA: Combining Distributional Semantic Models and Word Sense Disambiguation for Textual Similarity

Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro . . . .748


UniPi: Recognition of Mentions of Disorders in Clinical Text

Giuseppe Attardi, Vittoria Cozza and Daniele Sartiano . . . .754 UNITOR: Aspect Based Sentiment Analysis with Structured Learning

Giuseppe Castellucci, Simone Filice, Danilo Croce and Roberto Basili . . . .761 University_of_Warwick: SENTIADAPTRON - A Domain Adaptable Sentiment Analyser for Tweets - Meets SemEval

Richard Townsend, Aaron Kalair, Ojas Kulkarni, Rob Procter and Maria Liakata . . . .768 UO_UA: Using Latent Semantic Analysis to Build a Domain-Dependent Sentiment Resource

Reynier Ortega Bueno, Adrian Fonseca Bruzón, Carlos Muñiz Cuza, Yoan Gutiérrez and Andres Montoyo . . . .773 UoW: Multi-task Learning Gaussian Process for Semantic Textual Similarity

Miguel Rios. . . .779 UoW: NLP techniques developed at the University of Wolverhampton for Semantic Similarity and Textual Entailment

Rohit Gupta, Hanna Bechara, Ismail El Maarouf and Constantin Orasan. . . .785 USF: Chunking for Aspect-term Identification & Polarity Classification

Cindi Thompson. . . .790 UTexas: Natural Language Semantics using Distributional Semantics and Probabilistic Logic

Islam Beltagy, Stephen Roller, Gemma Boleda, Katrin Erk and Raymond Mooney . . . .796 UTH_CCB: A report for SemEval 2014 – Task 7 Analysis of Clinical Text

Yaoyun Zhang, Jingqi Wang, Buzhou Tang, Yonghui Wu, Min Jiang, Yukun Chen and Hua Xu . . . .802 UTU: Disease Mention Recognition and Normalization with CRFs and Vector Space Representations

Suwisa Kaewphan, Kai Hakala and Filip Ginter. . . .807 UW-MRS: Leveraging a Deep Grammar for Robotic Spatial Commands

Woodley Packard . . . .812 UWB: Machine Learning Approach to Aspect-Based Sentiment Analysis

Tomáš Brychcín, Michal Konkol and Josef Steinberger . . . .817 UWM: Applying an Existing Trainable Semantic Parser to Parse Robotic Spatial Commands

Rohit Kate . . . .823 UWM: Disorder Mention Extraction from Clinical Text Using CRFs and Normalization Using Learned Edit Distance Patterns

Omid Ghiasvand and Rohit Kate . . . .828 V3: Unsupervised Generation of Domain Aspect Terms for Aspect Based Sentiment Analysis

Aitor García Pablos, Montse Cuadros and German Rigau . . . .833 XRCE: Hybrid Classification for Aspect-based Sentiment Analysis

Caroline Brun, Diana Nicoleta Popa and Claude Roux. . . .838


SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment

Marco Marelli(1) Luisa Bentivogli(2) Marco Baroni(1) Raffaella Bernardi(1) Stefano Menini(1,2) Roberto Zamparelli(1)

(1) University of Trento, Italy

(2)FBK - Fondazione Bruno Kessler, Trento, Italy

{name.surname}@unitn.it,{bentivo,menini}@fbk.eu

Abstract

This paper presents the task on the evaluation of Compositional Distributional Semantics Models on full sentences, organized for the first time within SemEval-2014. Participation was open to systems based on any approach. Systems were presented with pairs of sentences and were evaluated on their ability to predict human judgments on (i) semantic relatedness and (ii) entailment. The task attracted 21 teams, most of which participated in both subtasks. We received 17 submissions in the relatedness subtask (for a total of 66 runs) and 18 in the entailment subtask (65 runs).

1 Introduction

Distributional Semantic Models (DSMs) approximate the meaning of words with vectors summarizing their patterns of co-occurrence in corpora. Recently, several compositional extensions of DSMs (CDSMs) have been proposed, with the purpose of representing the meaning of phrases and sentences by composing the distributional representations of the words they contain (Baroni and Zamparelli, 2010; Grefenstette and Sadrzadeh, 2011; Mitchell and Lapata, 2010; Socher et al., 2012). Despite the ever increasing interest in the field, the development of adequate benchmarks for CDSMs, especially at the sentence level, is still lagging. Existing data sets, such as those introduced by Mitchell and Lapata (2008) and Grefenstette and Sadrzadeh (2011), are limited to a few hundred instances of very short sentences with a fixed structure. In the last ten years, several large data sets have been developed for various computational semantics tasks, such as Semantic Text Similarity (STS) (Agirre et al., 2012) or Recognizing Textual Entailment (RTE) (Dagan et al., 2006).

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/

Working with such data sets, however, requires dealing with issues such as identifying multiword expressions, recognizing named entities or accessing encyclopedic knowledge, which have little to do with compositionality per se. CDSMs should instead be evaluated on data that are challenging for reasons due to semantic compositionality (e.g. context-cued synonymy resolution and other lexical variation phenomena, active/passive and other syntactic alternations, impact of negation at various levels, operator scope, and other effects linked to the functional lexicon). These issues do not occur frequently in, e.g., the STS and RTE data sets.

With these considerations in mind, we developed SICK (Sentences Involving Compositional Knowledge), a data set aimed at filling the void, including a large number of sentence pairs that are rich in the lexical, syntactic and semantic phenomena that CDSMs are expected to account for, but do not require dealing with other aspects of existing sentential data sets that are not within the scope of compositional distributional semantics. Moreover, we distinguished between generic semantic knowledge about general concept categories (such as knowledge that a couple is formed by a bride and a groom) and encyclopedic knowledge about specific instances of concepts (e.g., knowing the fact that the current president of the US is Barack Obama). The SICK data set contains many examples of the former, but none of the latter.

2 The Task

The Task involved two subtasks. (i) Relatedness: predicting the degree of semantic similarity between two sentences, and (ii) Entailment: detecting the entailment relation holding between them (see below for the exact definition). Sentence relatedness scores provide a direct way to evaluate CDSMs, insofar as their outputs are able to quantify the degree of semantic similarity between sentences. On the other hand, starting from the assumption that understanding a sentence means knowing when it is true, being able to verify whether an entailment is valid is a crucial challenge for semantic systems.

In the semantic relatedness subtask, given two sentences, systems were required to produce a relatedness score (on a continuous scale) indicating the extent to which the sentences were expressing a related meaning. Table 1 shows examples of sentence pairs with different degrees of semantic relatedness; gold relatedness scores are expressed on a 5-point rating scale.

In the entailment subtask, given two sentences A and B, systems had to determine whether the meaning of B was entailed by A. In particular, systems were required to assign to each pair either the ENTAILMENT label (when A entails B, viz., B cannot be false when A is true), the CONTRADICTION label (when A contradicts B, viz., B is false whenever A is true), or the NEUTRAL label (when the truth of B could not be determined on the basis of A). Table 2 shows examples of sentence pairs holding different entailment relations.

Participants were invited to submit up to five system runs for one or both subtasks. Developers of CDSMs were especially encouraged to participate, but developers of other systems that could tackle sentence relatedness or entailment tasks were also welcome. Besides being of intrinsic interest, the latter systems’ performance will serve to situate CDSM performance within the broader landscape of computational semantics.

3 The SICK Data Set

The SICK data set, consisting of about 10,000 English sentence pairs annotated for relatedness in meaning and entailment, was used to evaluate the systems participating in the task. The data set creation methodology is outlined in the following subsections, while all the details about data generation and annotation, quality control, and inter-annotator agreement can be found in Marelli et al. (2014).

3.1 Data Set Creation

SICK was built starting from two existing data sets: the 8K ImageFlickr data set1 and the SemEval-2012 STS MSR-Video Descriptions data set.2 The 8K ImageFlickr data set is a data set of images, where each image is associated with five descriptions. To derive SICK sentence pairs we randomly chose 750 images and sampled two descriptions from each of them. The SemEval-2012 STS MSR-Video Descriptions data set is a collection of sentence pairs sampled from the short video snippets which compose the Microsoft Research Video Description Corpus. A subset of 750 sentence pairs was randomly chosen from this data set to be used in SICK.

In order to generate SICK data from the 1,500 sentence pairs taken from the source data sets, a 3-step process was applied to each sentence composing the pair, namely (i) normalization, (ii) expansion and (iii) pairing. Table 3 presents an example of the output of each step in the process.

The normalization step was carried out on the original sentences (S0) to exclude or simplify instances that contained lexical, syntactic or semantic phenomena (e.g., named entities, dates, numbers, multiword expressions) that CDSMs are currently not expected to account for.

The expansion step was applied to each of the normalized sentences (S1) in order to create up to three new sentences with specific characteristics suitable to CDSM evaluation. In this step syntactic and lexical transformations with predictable effects were applied to each normalized sentence, in order to obtain (i) a sentence with a similar meaning (S2), (ii) a sentence with a logically contradictory or at least highly contrasting meaning (S3), and (iii) a sentence that contains most of the same lexical items, but has a different meaning (S4). This last step was carried out only where it could yield a meaningful sentence; as a result, not all normalized sentences have an (S4) expansion.

Finally, in the pairing step each normalized sentence in the pair was combined with all the sentences resulting from the expansion phase and with the other normalized sentence in the pair.

Considering the example in Table 3, S1a and S1b were paired. Then, S1a and S1b were each combined with S2a, S2b, S3a, S3b, S4a, and S4b, leading to a total of 13 different sentence pairs.

1 http://nlp.cs.illinois.edu/HockenmaierGroup/data.html
2 http://www.cs.york.ac.uk/semeval-2012/task6/index.php?id=data

Relatedness score    Example
1.6    A: “A man is jumping into an empty pool”
       B: “There is no biker jumping in the air”
2.9    A: “Two children are lying in the snow and are making snow angels”
       B: “Two angels are making snow on the lying children”
3.6    A: “The young boys are playing outdoors and the man is smiling nearby”
       B: “There is no boy playing outdoors and there is no man smiling”
4.9    A: “A person in a black jacket is doing tricks on a motorbike”
       B: “A man in a black jacket is doing tricks on a motorbike”

Table 1: Examples of sentence pairs with their gold relatedness scores (on a 5-point rating scale).

Entailment label    Example
ENTAILMENT       A: “Two teams are competing in a football match”
                 B: “Two groups of people are playing football”
CONTRADICTION    A: “The brown horse is near a red barrel at the rodeo”
                 B: “The brown horse is far from a red barrel at the rodeo”
NEUTRAL          A: “A man in a black jacket is doing tricks on a motorbike”
                 B: “A person is riding the bicycle on one wheel”

Table 2: Examples of sentence pairs with their gold entailment labels.

Furthermore, a number of pairs composed of completely unrelated sentences were added to the data set by randomly taking two sentences from two different pairs.

The result is a set of about 10,000 new sentence pairs, in which each sentence is contrasted with either a (near) paraphrase, a contradictory or strongly contrasting statement, another sentence with very high lexical overlap but different meaning, or a completely unrelated sentence. The rationale behind this approach was that of building a data set which encouraged the use of a compositional semantics step in understanding when two sentences have close meanings or entail each other, hindering methods based on individual lexical items, on the syntactic complexity of the two sentences or on pure world knowledge.
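The pairing scheme described in the steps above can be sketched in a few lines of Python. The function and variable names here are ours, for illustration only; they are not part of the task's tooling:

```python
def build_pairs(s1a, s1b, expansions_a, expansions_b):
    """Pairing step (illustrative sketch): the two normalized sentences
    are paired with each other, and each of them is then paired with
    every expanded sentence from both sides."""
    pairs = [(s1a, s1b)]
    for normalized in (s1a, s1b):
        for expanded in expansions_a + expansions_b:
            pairs.append((normalized, expanded))
    return pairs

# With three expansions per side (S2, S3, S4), this yields
# 1 + 2 * 6 = 13 pairs, matching the count given in the text.
pairs = build_pairs("S1a", "S1b", ["S2a", "S3a", "S4a"], ["S2b", "S3b", "S4b"])
assert len(pairs) == 13
```

Sentences lacking an S4 expansion simply contribute fewer pairs, which is why the data set size is approximate.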

3.2 Relatedness and Entailment Annotation

Each pair in the SICK data set was annotated to mark (i) the degree to which the two sentence meanings are related (on a 5-point scale), and (ii) whether one entails or contradicts the other (considering both directions). The ratings were collected through a large crowdsourcing study, where each pair was evaluated by 10 different subjects, and the order of presentation of the sentences was counterbalanced (i.e., 5 judgments were collected for each presentation order). Swapping the order of the sentences within each pair served a two-fold purpose: (i) evaluating the entailment relation in both directions and (ii) controlling possible bias due to priming effects in the relatedness task. Once all the annotations were collected, the relatedness gold score was computed for each pair as the average of the ten ratings assigned by participants, whereas a majority vote scheme was adopted for the entailment gold labels.
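The per-pair aggregation can be sketched as follows (a hypothetical helper, not the tooling actually used for the task):

```python
from collections import Counter
from statistics import mean

def gold_annotations(ratings, labels):
    """Aggregate the ten crowdsourced judgments for one sentence pair:
    the relatedness gold score is the average rating, and the
    entailment gold label is chosen by majority vote."""
    gold_score = mean(ratings)
    gold_label = Counter(labels).most_common(1)[0][0]
    return gold_score, gold_label

score, label = gold_annotations(
    [4, 5, 5, 4, 5, 5, 4, 5, 5, 4],
    ["ENTAILMENT"] * 7 + ["NEUTRAL"] * 3)
# score is 4.6; label is "ENTAILMENT"
```

How ties in the majority vote were resolved is not specified in this paper; `Counter.most_common` would break them by insertion order, which is only one possible choice.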

3.3 Data Set Statistics

For the purpose of the task, the data set was randomly split into training and test sets (50% and 50%), ensuring that each relatedness range and entailment category was equally represented in both sets. Table 4 shows the distribution of sentence pairs considering the combination of relatedness ranges and entailment labels. The “total” column indicates the total number of pairs in each range of relatedness, while the “total” row contains the total number of pairs in each entailment class.

Original pair
S0a: A sea turtle is hunting for fish
S0b: The turtle followed the fish
Normalized pair
S1a: A sea turtle is hunting for fish
S1b: The turtle is following the fish
Expanded pairs
S2a: A sea turtle is hunting for food
S2b: The turtle is following the red fish
S3a: A sea turtle is not hunting for fish
S3b: The turtle isn’t following the fish
S4a: A fish is hunting for a turtle in the sea
S4b: The fish is following the turtle

Table 3: Data set creation process.

SICK Training Set

relatedness    CONTRADICT    ENTAIL        NEUTRAL       TOTAL
1-2 range      0 (0%)        0 (0%)        471 (10%)     471
2-3 range      59 (1%)       2 (0%)        638 (13%)     699
3-4 range      498 (10%)     71 (1%)       1344 (27%)    1913
4-5 range      155 (3%)      1344 (27%)    352 (7%)      1851
TOTAL          712           1417          2805          4934

SICK Test Set

relatedness    CONTRADICT    ENTAIL        NEUTRAL       TOTAL
1-2 range      0 (0%)        1 (0%)        451 (9%)      452
2-3 range      59 (1%)       0 (0%)        615 (13%)     674
3-4 range      496 (10%)     65 (1%)       1398 (28%)    1959
4-5 range      157 (3%)      1338 (27%)    326 (7%)      1821
TOTAL          712           1404          2790          4906

Table 4: Distribution of sentence pairs across the Training and Test Sets.
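A split stratified by relatedness range and entailment label, as described above, can be approximated by the following sketch. The function is ours; the paper does not specify the exact procedure beyond the stratification requirement:

```python
import random

def stratified_split(pairs, key, seed=0):
    """Split `pairs` 50/50 so that each stratum -- e.g. a (relatedness
    range, entailment label) combination returned by `key(pair)` --
    is equally represented in both halves."""
    rng = random.Random(seed)
    strata = {}
    for p in pairs:
        strata.setdefault(key(p), []).append(p)
    train, test = [], []
    for items in strata.values():
        rng.shuffle(items)       # randomize within each stratum
        half = len(items) // 2
        train.extend(items[:half])
        test.extend(items[half:])
    return train, test
```

Odd-sized strata put the extra pair in the test half here, which is one reason the two halves in Table 4 are close but not identical in size.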

4 Evaluation Metrics and Baselines

Both subtasks were evaluated using standard metrics. In particular, the results on entailment were evaluated using accuracy, whereas the outputs on relatedness were evaluated using Pearson correlation, Spearman correlation, and Mean Squared Error (MSE). Pearson correlation was chosen as the official measure to rank the participating systems.
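These metrics can be computed with a few stdlib-only helpers. This is a minimal sketch; the official scorer may differ in details such as tie correction in the Spearman ranks:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between gold and predicted scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman = Pearson computed on ranks (no tie correction here)."""
    rank = lambda v: [sorted(v).index(x) for x in v]
    return pearson(rank(xs), rank(ys))

def mse(xs, ys):
    """Mean Squared Error between gold and predicted scores."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def accuracy(gold, pred):
    """Accuracy over ENTAILMENT / CONTRADICTION / NEUTRAL labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```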

Table 5 presents the performance of four baselines. The Majority baseline always assigns the most common label in the training data (NEUTRAL), whereas the Probability baseline assigns labels randomly according to their relative frequency in the training set. The Overlap baseline measures word overlap, with parameters (number of stop words and ENTAILMENT/NEUTRAL/CONTRADICTION thresholds) estimated on the training part of the data.

Baseline       Relatedness    Entailment
Chance         0              33.3%
Majority       NA             56.7%
Probability    NA             41.8%
Overlap        0.63           56.2%

Table 5: Performance of baselines. Figure of merit is Pearson correlation for relatedness and accuracy for entailment. NA = Not Applicable.
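For illustration, here is a crude overlap score of the kind such a baseline builds on. The stop-word list is a placeholder, and the real baseline estimated its parameters (including the thresholds mapping this score to the three labels) on the training data:

```python
def word_overlap(a, b, stop_words=frozenset({"a", "an", "the", "is", "are"})):
    """Jaccard overlap of lower-cased tokens, ignoring stop words.
    The stop-word set here is a placeholder, not the tuned one."""
    wa = {w.lower() for w in a.split()} - stop_words
    wb = {w.lower() for w in b.split()} - stop_words
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# A near-paraphrase pair from Table 1 scores a high overlap,
# which a threshold would then map to ENTAILMENT.
score = word_overlap("A person in a black jacket is doing tricks on a motorbike",
                     "A man in a black jacket is doing tricks on a motorbike")
```

That such a simple signal already reaches 56.2% entailment accuracy and 0.63 Pearson is the reason SICK was designed to reward composition rather than lexical matching.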

5 Submitted Runs and Results

Overall, 21 teams participated in the task. Participants were allowed to submit up to 5 runs for each subtask and had to choose the primary run to be included in the comparative evaluation. We received 17 submissions to the relatedness subtask (for a total of 66 runs) and 18 for the entailment subtask (65 runs).

We asked participants to pre-specify a primary run to encourage commitment to a theoretically-motivated approach, rather than post-hoc performance-based assessment. Interestingly, some participants used the non-primary runs to explore the performance one could reach by exploiting weaknesses in the data that are not likely to hold in future tasks of the same kind (for instance, run 3 submitted by The Meaning Factory exploited sentence ID ordering information, but it was not presented as a primary run).

Participants could also use non-primary runs to test smart baselines. In the relatedness subtask six non-primary runs slightly outperformed the official winning primary entry,3 while in the entailment task all of ECNU’s runs except run 4 were better than ECNU’s primary run. Interestingly, the differences between ECNU’s runs were due to the learning methods used.

3 They were: The Meaning Factory’s run3 (Pearson 0.84170), ECNU’s run2 (0.83893) and run5 (0.83500), and StanfordNLP’s run4 (0.83462) and run2 (0.83103).

We present the results achieved by primary runs against the Entailment and Relatedness subtasks in Table 6 and Table 7, respectively.4 We witnessed a very close finish in both subtasks, with 4 more systems within 3 percentage points of the winner in both cases. 4 of these 5 top systems were the same across the two subtasks. Most systems performed well above the best baselines from Table 5.

The overall performance pattern suggests that, owing perhaps to the more controlled nature of the sentences, as well as to the purely linguistic nature of the challenges it presents, SICK entailment is “easier” than RTE. Considering the first five RTE challenges (Bentivogli et al., 2009), the median values ranged from 56.20% to 61.75%, whereas the average values ranged from 56.45% to 61.97%. The entailment scores obtained on the SICK data set are considerably higher, being 77.06% for the median system and 75.36% for the average system. On the other hand, the relatedness task is more challenging than the one run on MSRvid (one of our data sources) at STS 2012, where the top Pearson correlation was 0.88 (Agirre et al., 2012).

6 Approaches

A summary of the approaches used by the systems to address the task is presented in Table 8.

In the table, systems in bold are those for which the authors submitted a paper (Ferrone and Zanzotto, 2014; Bjerva et al., 2014; Beltagy et al., 2014; Lai and Hockenmaier, 2014; Alves et al., 2014; León et al., 2014; Bestgen, 2014; Zhao et al., 2014; Vo et al., 2014; Biçici and Way, 2014; Lien and Kouylekov, 2014; Jimenez et al., 2014; Proisl and Evert, 2014; Gupta et al., 2014). For the others, we used the brief description sent with the system’s results, double-checking the information with the authors. In the table, “E” and “R” refer to the entailment and relatedness task respectively, and “B” to both.

Almost all systems combine several kinds of features. To highlight the role played by composition, we draw a distinction between compositional and non-compositional features, and divide the former into ‘fully compositional’ (systems that compositionally computed the meaning of the full sentences, though not necessarily by assigning meanings to intermediate syntactic constituents) and ‘partially compositional’ (systems that stop the composition at the level of phrases). As the table shows, thirteen systems used composition in at least one of the tasks; ten used composition for full sentences and six for phrases only. The best systems are among these thirteen systems.

4 ITTK’s primary run could not be evaluated due to technical problems with the submission. ITTK’s best non-primary run scored 78.2% accuracy in the entailment task and 0.76 r in the relatedness task.

ID                          Compose    ACCURACY
Illinois-LH run1            P/S        84.6
ECNU run1                   S          83.6
UNAL-NLP run1                          83.1
SemantiKLUE run1                       82.3
The Meaning Factory run1    S          81.6
CECL ALL run1                          80.0
BUAP run1                   P          79.7
UoW run1                               78.5
Uedinburgh run1             S          77.1
UIO-Lien run1                          77.0
FBK-TR run3                 P          75.4
StanfordNLP run5            S          74.5
UTexas run1                 P/S        73.2
Yamraj run1                            70.7
asjai run5                  S          69.8
haLF run2                   S          69.4
RTM-DCU run1                           67.2
UANLPCourse run2            S          48.7

Table 6: Primary run results for the entailment subtask. The table also shows whether a system exploits composition information at either the phrase (P) or sentence (S) level.

Let us focus on such compositional methods.

Concerning the relatedness task, the fine-grained analyses reported for several systems (Illinois-LH, The Meaning Factory and ECNU) show that purely compositional systems currently reach performance above 0.7 r. In particular, ECNU’s compositional feature gives 0.75 r, The Meaning Factory’s logic-based composition model 0.73 r, and Illinois-LH’s compositional features combined with Word Overlap 0.75 r. While competitive, these scores are lower than that of the best purely non-compositional system (UNAL-NLP), which reaches the 4th position (0.80 r for UNAL-NLP vs. 0.82 r obtained by the best system). UNAL-NLP, however, exploits an ad-hoc “negation” feature discussed below.

ID                          Compose    r        ρ        MSE
ECNU run1                   S          0.828    0.769    0.325
StanfordNLP run5            S          0.827    0.756    0.323
The Meaning Factory run1    S          0.827    0.772    0.322
UNAL-NLP run1                          0.804    0.746    0.359
Illinois-LH run1            P/S        0.799    0.754    0.369
CECL ALL run1                          0.780    0.732    0.398
SemantiKLUE run1                       0.780    0.736    0.403
RTM-DCU run1                           0.764    0.688    0.429
UTexas run1                 P/S        0.714    0.674    0.499
UoW run1                               0.711    0.679    0.511
FBK-TR run3                 P          0.709    0.644    0.591
BUAP run1                   P          0.697    0.645    0.528
UANLPCourse run2            S          0.693    0.603    0.542
UQeResearch run1                       0.642    0.626    0.822
ASAP run1                   P          0.628    0.597    0.662
Yamraj run1                            0.535    0.536    2.665
asjai run5                  S          0.479    0.461    1.104

Table 7: Primary run results for the relatedness subtask (r for Pearson and ρ for Spearman correlation). The table also shows whether a system exploits composition information at either the phrase (P) or sentence (S) level.

In the entailment task, the best non-compositional model (again UNAL-NLP) reaches the 3rd position, within close reach of the best system (83% UNAL-NLP vs. 84.5% obtained by the best system). Again, purely compositional models have lower performance. The haLF CDSM reaches 69.42% accuracy, and Illinois-LH’s Word Overlap combined with a compositional feature reaches 71.8%. The fine-grained analysis reported by Illinois-LH (Lai and Hockenmaier, 2014) shows that a fully compositional system (based on point-wise multiplication) fails to capture contradiction. It is better than partial phrase-based compositional models at recognizing entailment pairs, but worse than them at recognizing neutral pairs.

Given our more general interest in distributional approaches, in Table 8 we also classify the different DSMs used as ‘Vector Space Models’, ‘Topic Models’ and ‘Neural Language Models’. Due to the impact shown by learning methods (see ECNU’s results), we also report the different learning approaches used.

Several participating systems deliberately exploit ad-hoc features that, while not helping a true understanding of sentence meaning, exploit some systematic characteristics of SICK that should be controlled for in future releases of the data set.

In particular, the Textual Entailment subtask has been shown to rely too much on negative words and antonyms. The Illinois-LH team reports that, just by checking the presence of negative words (the Negation Feature in the table), one can detect 86.4% of the contradiction pairs, and by combining Word Overlap and antonyms one can detect 83.6% of neutral pairs and 82.6% of entailment pairs. This approach, however, is obviously very brittle (it would not have been successful, for instance, if negation had been optionally combined with word-rearranging in the creation of S4 sentences; see Section 3.1 above).

Finally, Table 8 reports on the use of external resources in the task. One of the reasons we created SICK was to have a compositional semantics benchmark that would not require too many external tools and resources (e.g., named-entity recognizers, gazetteers, ontologies). Judging by what the participants chose to use, we think we succeeded, as only standard NLP pre-processing tools (tokenizers, PoS taggers and parsers) and relatively few knowledge resources (mostly WordNet and paraphrase corpora) were used.

7 Conclusion

We presented the results of the first task on the evaluation of compositional distributional semantic models and other semantic systems on full sentences, organized within SemEval-2014. Two subtasks were offered: (i) predicting the degree of relatedness between two sentences, and (ii) detecting the entailment relation holding between them.

The task attracted noticeable attention in the community: 17 and 18 submissions for the relatedness and entailment subtasks, respectively, for a total of 21 participating teams. Participation was not limited to compositional models, but the majority of systems (13/21) used composition in at least one of the subtasks. Moreover, the top-ranking systems in both tasks use compositional features.

However, it must be noted that all systems also ex-
