
BUAP: Polarity Classification of Short Texts


ditionals, with a low number of lexical resources, is proposed. These relations are integrated into classical models of representation, such as bag of words, with the aim of improving the accuracy values obtained in the classification process. The influence of semantic operators such as modals and negations is analyzed, in particular the degree to which they affect the emotion present in a given paragraph or sentence.

One of the major advances in the task of sentiment analysis has been made in the framework of the SemEval competition. In 2013, several teams participated with different approaches: Becker et al. (2013); Han et al. (2013); Chawla et al. (2013); Balahur and Turchi (2013); Balage Filho and Pardo (2013); Moreira et al. (2013); Reckman et al. (2013); Tiantian et al. (2013); Marchand et al. (2013); Clark and Wicentwoski (2013); Hamdan et al. (2013); Martínez-Cámara et al. (2013); Levallois (2013). Most of these works have contributed to the mentioned task by proposing methods and techniques for representing and classifying documents, towards the automatic classification of sentiment in tweets.

3 Description of the Presented Approach

We have employed a supervised approach based on machine learning, in which we construct a classification model using the following general features obtained from the training corpus:

1. Character trigrams
2. PoS tag trigrams
3. Significant tweet words obtained by using a graph mining tool known as Subdue

The description of how we calculated each feature in order to construct a representation vector for each message is given as follows.

The probability of each character trigram given the polarity class, P(trigram|class), was calculated on the training corpus. Thereafter, we assigned a normalized probability to each sentence polarity by combining the probabilities of the character trigrams of the sentence, i.e., ∑_{i=1}^{|message|} log[P(trigram_i|class)]. Since we have four classes ("positive", "negative", "neutral" and "objective"), we obtained four features for the final vectorial representation of the message.

We then calculated four more features by performing a calculation similar to the previous one, but in this case using the PoS tags of the message.

For this purpose, we used the Twitter NLP and Part-of-Speech Tagging tool provided by Carnegie Mellon University (Owoputi et al., 2013). Since the PoS tag given by this tool is basically a single character, the same procedure can be applied.
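To make the construction of these eight features concrete, the following sketch estimates P(trigram|class) with add-alpha smoothing and sums the log-probabilities per class; the smoothing, the function names and the exact normalization are assumptions, since the paper does not give these details. The same two functions, applied to the string of PoS tags produced by the CMU tagger (one character per tag), yield the four PoS-based features.

```python
from collections import Counter
from math import log

CLASSES = ["positive", "negative", "neutral", "objective"]

def char_trigrams(text):
    """Overlapping character trigrams of a message."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train_trigram_model(messages, labels, alpha=1.0):
    """Count trigrams per class; add-alpha smoothing is an assumption."""
    counts = {c: Counter() for c in CLASSES}
    for text, label in zip(messages, labels):
        counts[label].update(char_trigrams(text))
    vocab = {t for c in CLASSES for t in counts[c]}
    totals = {c: sum(counts[c].values()) + alpha * len(vocab) for c in CLASSES}
    return counts, totals, alpha

def trigram_features(text, model):
    """Four features: sum_i log P(trigram_i | class), one value per class."""
    counts, totals, alpha = model
    trigrams = char_trigrams(text)
    return [sum(log((counts[c][t] + alpha) / totals[c]) for t in trigrams)
            for c in CLASSES]
```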

We performed preliminary experiments using these eight features on a trial corpus, and we observed that the results might be improved by selecting significant words that may not be discovered by the statistical techniques used so far. We therefore decided to make use of graph mining techniques in an attempt to find those significant words. In order to find them, we constructed a graph representation for each message class ("positive", "negative", "neutral" and "objective") using the training corpus. The manner in which we constructed those graphs is described below.

Formally, a graph is defined as G = (V, E, L, f), with V being the non-empty set of vertices, E ⊆ V × V the edges, L the tag set, and f : E → L a function that assigns a tag to each pair of associated vertices.

This graph-based representation attempts to capture the sequence among the sentence words, as well as the sequence among their PoS tags, with the aim of feeding a graph mining tool which may extract relevant features that can be further used for representing the texts. Thus, the set V is constructed from the different words and PoS tags of the target document.

In order to demonstrate the way we construct the graph for each short text, consider the following message: "ooh i love you for posting this :-)". The graph representation associated with this message is shown in Figure 1.
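Since Figure 1 is not reproduced here, the sketch below shows only one plausible way to build such a labelled graph with networkx: word nodes and PoS nodes chained by sequence edges, with each word linked to its tag. The edge labels and the example PoS tags are illustrative assumptions, not the exact layout used by the authors.

```python
import networkx as nx

def message_graph(tokens, pos_tags):
    """Labelled digraph over a message's words and PoS tags (illustrative)."""
    g = nx.DiGraph()
    for i, (word, tag) in enumerate(zip(tokens, pos_tags)):
        g.add_node(("w", i), label=word)
        g.add_node(("p", i), label=tag)
        g.add_edge(("w", i), ("p", i), label="pos")            # word -> its PoS tag
        if i > 0:
            g.add_edge(("w", i - 1), ("w", i), label="next")   # word sequence
            g.add_edge(("p", i - 1), ("p", i), label="next")   # PoS sequence
    return g

# Example message from the paper; the tags below are only illustrative.
g = message_graph(["ooh", "i", "love", "you", "for", "posting", "this", ":-)"],
                  ["!", "O", "V", "O", "P", "V", "D", "E"])
```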

Figure 1: Graph-based message representation with words and their corresponding PoS tags

Once each paragraph is represented by means of a graph, we apply a data mining algorithm in order to find subgraphs from which we can obtain the significant words, which will be, in our case, basically the nodes of these subgraphs. Subdue is a data mining tool widely used in structured domains. This tool has been used for discovering structured patterns in texts represented by means of graphs (Olmos et al., 2005). Subdue uses an evaluation model named "minimum encoding", a technique derived from the minimum description length principle (Rissanen, 1989), in which the best graph substructures are chosen. The best subgraphs are those that minimize the number of bits needed to represent the graph. In this case, the number of bits is calculated considering the size of the graph adjacency matrix. Thus, the best substructure is the one that minimizes I(S) + I(G|S), where I(S) is the number of bits required to describe the substructure S, and I(G|S) is the number of bits required to describe the graph G after it has been compacted by the substructure S.
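As a rough illustration of the I(S) + I(G|S) criterion, the sketch below charges a fixed label cost per vertex and one bit per adjacency-matrix cell; Subdue's actual minimum-encoding scheme is considerably more refined, so this is only a toy rendering of the idea.

```python
import numpy as np

def description_length_bits(adjacency, n_labels):
    """Toy description length: label bits per vertex plus one bit per
    adjacency-matrix cell (a crude stand-in for Subdue's encoding)."""
    n_vertices = adjacency.shape[0]
    label_bits = n_vertices * np.log2(max(n_labels, 2))
    return label_bits + adjacency.size

def mdl_score(sub_adj, compressed_adj, n_labels):
    """I(S) + I(G|S): bits for the substructure plus bits for the graph
    once occurrences of the substructure are compressed; lower is better."""
    return (description_length_bits(sub_adj, n_labels)
            + description_length_bits(compressed_adj, n_labels))
```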

By applying this procedure we obtained 597 significant negative words, 445 positive words, 616 objective words and 925 positive words. For the final representation vector we compiled the union of these words, obtaining 1,915 significant words. Therefore, the total number of features for each message was 1,923.
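Building on the earlier sketches, a message vector could then be assembled as the eight probability features followed by 1,915 indicator features for the Subdue-derived words; the binary presence encoding and the helper names are assumptions.

```python
def message_vector(text, pos_string, char_model, pos_model, significant_words):
    """1,923-dimensional representation: 4 character-trigram features,
    4 PoS-trigram features and 1,915 significant-word indicators (assumed binary)."""
    features = trigram_features(text, char_model)
    features += trigram_features(pos_string, pos_model)
    tokens = set(text.lower().split())
    features += [1.0 if w in tokens else 0.0 for w in significant_words]
    return features
```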

We used the training corpus provided at the competition (Rosenthal et al., 2014); however, we removed all messages tagged with the class "objective-OR-neutral", because these messages introduced noise into the classification process. In total, we constructed 5,217 message representation vectors, which fed a support vector machine classifier. We used the SVM implementation of the WEKA tool with default parameters for our experiments (Hall et al., 2009). The obtained results are shown in the next section.
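The authors used WEKA's SVM implementation with default parameters; the scikit-learn call below is only a rough stand-in (WEKA's SMO defaults correspond to a linear kernel with C = 1), not their exact setup.

```python
from sklearn.svm import SVC

# train_vectors: the 5,217 message vectors; train_labels: polarity classes (assumed names).
clf = SVC(kernel="linear", C=1.0)   # approximates WEKA SMO defaults
clf.fit(train_vectors, train_labels)
predictions = clf.predict(test_vectors)
```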

4 Experimental Results

The test corpus was made up of short texts (messages) categorized as: "LiveJournal2014", "SMS2013", "Twitter2013", "Twitter2014" and "Twitter2014Sarcasm". A complete description of the training and test datasets can be found in the task description paper (Rosenthal et al., 2014).

In Table 1 we can see the results obtained at the competition. Our approach performed slightly below the overall average in almost all cases, except when we processed the Twitter corpus with sarcasm characteristics. We consider that two main problems caused this result: 1) the corpus was very unbalanced and our approaches for alleviating this problem were not sufficient, and 2) from our particular point of view, there was a high difference between the vocabulary of the training and the test corpus, thus leading the classification model to fail.

Table 1: Results obtained at subtask B of the SemEval 2014 Task 9

System LiveJournal2014 SMS2013 Twitter2013 Twitter2014 Twitter2014Sarcasm Average
NRC-Canada-B 74.84 70.28 70.75 69.85 58.16 68.78
CISUC KIS-B-late 74.46 65.90 67.56 67.95 55.49 66.27
coooolll-B 72.90 67.68 70.40 70.14 46.66 65.56
TeamX-B 69.44 57.36 72.12 70.96 56.50 65.28
RTRGO-B 72.20 67.51 69.10 69.95 47.09 65.17
AUEB-B 70.75 64.32 63.92 66.38 56.16 64.31
SWISS-CHOCOLATE-B 73.25 66.43 64.81 67.54 49.46 64.30
SentiKLUE-B 73.99 67.40 69.06 67.02 43.36 64.17
TUGAS-B 69.79 62.77 65.64 69.00 52.87 64.01
SAIL-B 69.34 56.98 66.80 67.77 57.26 63.63
senti.ue-B 71.39 59.34 67.34 63.81 55.31 63.44
Synalp-Empathic-B 71.75 62.54 63.65 67.43 51.06 63.29
Lt 3-B 68.56 64.78 65.56 65.47 47.76 62.43
UKPDIPF-B 71.92 60.56 60.65 63.77 54.59 62.30
AMI ERIC-B 65.32 60.29 70.09 66.55 48.19 62.09
ECNU-B 69.44 59.75 62.31 63.17 51.43 61.22
LyS-B 69.79 60.45 66.92 64.92 42.40 60.90
SU-FMI-B-late 68.24 61.67 60.96 63.62 48.34 60.57
NILC USP-B-twitter 69.02 61.35 65.39 63.94 42.06 60.35
CMU-Qatar-B-late 65.63 62.95 65.11 65.53 40.52 59.95
columbia nlp-B 68.79 59.84 64.60 65.42 40.02 59.73
CMUQ-Hybrid-B-late 65.14 61.75 63.22 62.71 40.95 58.75
Citius-B 62.40 57.69 62.53 61.92 41.00 57.11
KUNLPLab-B 63.77 55.89 58.12 61.72 44.60 56.82
USP Biocom-B 67.80 53.57 58.05 59.21 43.56 56.44
UPV-ELiRF-B 64.11 55.36 63.97 59.33 37.46 56.05
Rapanakis-B 59.71 54.02 58.52 63.01 44.69 55.99
DejaVu-B 64.69 55.57 57.43 57.02 42.46 55.43
GPLSI-B 57.32 46.63 57.49 56.06 53.90 54.28
Indian Inst of Tech-Patna-B 60.39 51.96 52.58 57.25 41.33 52.70
BUAP-B 53.94 44.27 56.85 55.76 51.52 52.47
SAP-RI-B 57.86 49.00 50.18 55.47 48.64 52.23
UMCC DLSI Sem 53.12 50.01 51.96 55.40 42.76 50.65
Alberta-B 52.38 49.05 53.85 52.06 40.40 49.55
SINAI-B 58.33 57.34 50.59 49.50 31.15 49.38
IBM EG-B 59.24 46.62 54.51 52.26 34.14 49.35
SU-sentilab-B-tweet 55.11 49.60 50.17 49.52 31.49 47.18
lsis lif-B 61.09 38.56 46.38 52.02 34.64 46.54
IITPatna-B 54.68 40.56 50.32 48.22 36.73 46.10
UMCC DLSI Graph-B 47.81 36.66 43.24 45.49 53.15 45.27
University-of-Warwick-B 39.60 29.50 39.17 45.56 39.77 38.72
DAEDALUS-B 40.83 40.86 36.57 33.03 28.96 36.05
Overall average 63.81 55.82 59.72 60.30 45.43 57.02

5 Conclusions

We have presented an approach for detecting message polarity using basically three kinds of features: character trigrams, PoS tag trigrams, and significant words obtained by means of a graph mining tool. The obtained results show that these features were not sufficient for detecting the correct polarity of a given message with high precision. We consider that the unbalanced nature of the corpus and the fact that the vocabulary changed significantly from the training to the test corpus influenced the results we obtained at the competition. However, a deeper analysis of the evaluated datasets, which we plan to carry out, will allow us in the future to find more accurate features for the message polarity detection task.

References

Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. Sentiment analysis of twitter data. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 30–38, Portland, Oregon, June 2011.

Pedro Balage Filho and Thiago Pardo. Nilc usp: A hybrid system for sentiment analysis in twitter messages. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 568–572, Atlanta, Georgia, USA, June 2013.

Alexandra Balahur and Marco Turchi. Improving sentiment analysis in twitter using multilingual machine translated data. In Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013, pages 49–55, Hissar, Bulgaria, September 2013. INCOMA Ltd., Shoumen, Bulgaria.

Lee Becker, George Erhart, David Skiba, and Valentine Matula. Avaya: Sentiment analysis on twitter with self-training and polarity lexicon expansion. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 333–340, Atlanta, Georgia, USA, June 2013.

Karan Chawla, Ankit Ramteke, and Pushpak Bhattacharyya. Iitb-sentiment-analysts: Participation in sentiment analysis in twitter semeval 2013 task. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 495–500, Atlanta, Georgia, USA, June 2013.

Sam Clark and Rich Wicentwoski. Swatcs: Combining simple classifiers with estimated accuracy. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 425–429, Atlanta, Georgia, USA, June 2013.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The weka data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, November 2009. ISSN 1931-0145.

Hussam Hamdan, Frederic Béchet, and Patrice Bellot. Experiments with dbpedia, wordnet and sentiwordnet as resources for sentiment analysis in micro-blogging. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 455–459, Atlanta, Georgia, USA, June 2013.

Qi Han, Junfei Guo, and Hinrich Schuetze. Codex: Combining an svm classifier and character n-gram language models for sentiment analysis on twitter text. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 520–524, Atlanta, Georgia, USA, June 2013.

Clement Levallois. Umigon: sentiment analysis for tweets based on terms lists and heuristics. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 414–417, Atlanta, Georgia, USA, June 2013.

Morgane Marchand, Alexandru Ginsca, Romaric Besançon, and Olivier Mesnard. [lvic-limsi]: Using syntactic features and multi-polarity words for sentiment analysis in twitter. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 418–424, Atlanta, Georgia, USA, June 2013.

Eugenio Martínez-Cámara, Arturo Montejo-Ráez, M. Teresa Martín-Valdivia, and L. Alfonso Ureña López. Sinai: Machine learning and emotion of the crowd for sentiment analysis in microblogs. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 402–407, Atlanta, Georgia, USA, June 2013.

Silvio Moreira, João Filgueiras, Bruno Martins, Francisco Couto, and Mário J. Silva. Reaction: A naive machine learning approach for sentiment classification. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 490–494, Atlanta, Georgia, USA, June 2013.

Subhabrata Mukherjee and Pushpak Bhattacharyya. Sentiment analysis in Twitter with lightweight discourse analysis. In Proceedings of COLING 2012, pages 1847–1864, Mumbai, India, December 2012. The COLING 2012 Organizing Committee.

Ivan Olmos, Jesus A. Gonzalez, and Mauricio Osorio. Subgraph isomorphism detection using a code based representation. In FLAIRS Conference, pages 474–479, 2005.

Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL-HLT, pages 380–390, 2013.

Hilke Reckman, Cheyanne Baird, Jean Crawford, Richard Crowell, Linnea Micciulla, Saratendu Sethi, and Fruzsina Veress. teragram: Rule-based detection of sentiment phrases using sas sentiment analysis. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 513–519, Atlanta, Georgia, USA, June 2013.

Jorma Rissanen. Stochastic Complexity in Statistical Inquiry Theory. World Scientific Publishing Co., Inc., River Edge, NJ, USA, 1989. ISBN 981020311X.

Sara Rosenthal, Preslav Nakov, Alan Ritter, and Veselin Stoyanov. Semeval-2014 task 9: Sentiment analysis in twitter. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014), Dublin, Ireland, 2014.

Zhu Tiantian, Zhang Fangxi, and Man Lan. Ecnucs: A surface information based system description of sentiment analysis in twitter in the semeval-2013 (task 2). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 408–413, Atlanta, Georgia, USA, June 2013.

CECL: A New Baseline and a Non-Compositional Approach for the SICK Benchmark

Yves Bestgen

Centre for English Corpus Linguistics
Université catholique de Louvain
yves.bestgen@uclouvain.be

Abstract

This paper describes the two procedures for determining the semantic similarities between sentences submitted for the SemEval 2014 Task 1. MeanMaxSim, an unsupervised procedure, is proposed as a new baseline to assess the efficiency gain provided by compositional models. It outperforms a number of other baselines by a wide margin. Compared to the word-overlap baseline, it has the advantage of taking into account the distributional similarity between words that are also involved in compositional models. The second procedure aims at building a predictive model using as predictors MeanMaxSim and (transformed) lexical features describing the differences between each sentence of a pair. It finished sixth out of 17 teams in the textual similarity sub-task and sixth out of 19 in the textual entailment sub-task.

1 Introduction

The SemEval-2014 Task 1 (Marelli et al., 2014a) was designed to allow a rigorous evaluation of compositional distributional semantic models (CDSMs). CDSMs aim to represent the meaning of phrases and sentences by composing the distributional representations of the words they contain (Baroni et al., 2013; Bestgen and Cabiaux, 2002; Erk and Pado, 2008; Grefenstette, 2013; Kintsch, 2001; Mitchell and Lapata, 2010); they are thus an extension of Distributional Semantic Models (DSMs), which approximate the meaning of words with vectors summarizing their patterns of co-occurrence in a corpus (Baroni and Lenci, 2010; Bestgen et al., 2006; Kintsch, 1998; Landauer and Dumais, 1997).


The dataset for this task, called SICK (Sentences Involving Compositional Knowledge), consists of almost 10,000 English sentence pairs annotated for relatedness in meaning and entailment relation by ten annotators (Marelli et al., 2014b).

The rationale behind this dataset is that "understanding when two sentences have close meanings or entail each other crucially requires a compositional semantics step" (Marelli et al., 2014b), and thus that annotators judge the similarity between the two sentences of a pair by first building a mental representation of the meaning of each sentence and then comparing these two representations. However, another option was available to the annotators. They could have paid attention only to the differences between the sentences, and assessed the significance of these differences.

Such an approach could have been favored by the fact that the dataset was built on the basis of a thousand sentences modified by a limited number of (often) very specific transformations, producing sentence pairs that might seem quite repetitive. An analysis conducted during the training phase of the challenge brought some support for this hypothesis. The analysis focused on pairs of sentences in which the only difference between the two sentences was the replacement of one content word by another, as in A man is singing to a girl vs. A man is singing to a woman, but also in A man is sitting in a field vs. A man is running in a field. The material was divided into two parts, 3,500 sentence pairs in the training set and the remaining 1,500 in the test set. First, the average similarity score for each pair of interchanged words was calculated on the training set (e.g., in this sample, there were 16 sentence pairs in which woman and man were interchanged, and their mean similarity score was 3.6).

Then, these mean scores were used as the similarity scores of the sentence pairs of the test sample in which the same words were interchanged. The correlation between the actual scores and the predicted scores was 0.83 (N=92), a value that can be considered very high, given the restriction on the range in which the predicted similarity scores vary (min=3.5 and max=5.0; Howell, 2008, pp. 272-273). It is important to note that this observation does not prove that the participants have not built a compositional representation, especially as it only deals with a very specific type of transformation. It nevertheless suggests that analyzing only the differences between the sentences of a pair could allow the similarity between them to be effectively estimated.
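The replacement-word analysis can be reproduced along the following lines: detect the single interchanged word pair, average the gold scores per pair on the training split, and use those means as predictions on the test split. The pair-detection heuristic and the variable names are assumptions; only the reported r = 0.83 over 92 pairs comes from the paper.

```python
from collections import defaultdict
from statistics import mean
from scipy.stats import pearsonr

def interchanged_pair(sent_a, sent_b):
    """Return the two swapped words as a frozenset, or None if the sentences
    differ by anything other than one word-for-word replacement."""
    a, b = set(sent_a.lower().split()), set(sent_b.lower().split())
    diff = (a - b) | (b - a)
    return frozenset(diff) if len(diff) == 2 else None

def pair_mean_prediction(train, test):
    """train/test: lists of (sentence_a, sentence_b, gold_score) triples."""
    per_pair = defaultdict(list)
    for sa, sb, y in train:
        key = interchanged_pair(sa, sb)
        if key:
            per_pair[key].append(y)
    means = {k: mean(v) for k, v in per_pair.items()}
    gold, pred = [], []
    for sa, sb, y in test:
        key = interchanged_pair(sa, sb)
        if key in means:
            gold.append(y)
            pred.append(means[key])
    return pearsonr(gold, pred)   # the paper reports r = 0.83 with N = 92
```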

Following these observations, I opted to try to determine the degree of efficacy that can be achieved by two non-compositional approaches. The first approach, totally unsupervised, is proposed as a new baseline to evaluate the efficacy gains brought by compositional systems. The second, a supervised approach, aims to capitalize on the properties of the SICK benchmark. While these approaches have been developed specifically for the semantic relatedness sub-task, the second has also been applied to the textual entailment sub-task. This paper describes the two proposed approaches, their implementation in the context of SemEval 2014 Task 1, and the results obtained.

2 Proposed Approaches

2.1 A New Baseline for CDSMs

An evident baseline in the field of CDSMs is based on the proportion of common words in two sentences after the removal (or retention) of stop words (Cheung and Penn, 2012). Its main weakness is that it does not take into account the semantic similarities between the words that are combined in the CDSM models. It follows that a compositional approach may seem significantly better than this baseline, even if it is not compositionality that matters but only the distributional part. At first glance, this problem can be circumvented by using as a baseline a simple compositional model like the additive model. The analyses below show that this model is much less effective for the SICK dataset than the distributional baseline proposed here.

MeanMaxSim, the proposed baseline, is an extension of the classic measure based on the proportion of common words, taking advantage of distributional similarity but not of compositionality. It corresponds to the mean, calculated over all the words of the two sentences, of the maximum semantic similarity between each word in a sentence and all the words of the other sentence. More formally, given two sentences a = (a_1, ..., a_n) and b = (b_1, ..., b_m),

MMS = (∑_i max_j sim(a_i, b_j) + ∑_j max_i sim(a_i, b_j)) / (n + m)

In this study, the cosine between the word distributional representations was used as the measure of semantic similarity, but other measures may be used. The common words of the two sentences have an important impact on MeanMaxSim, since their similarity with themselves is equal to the maximum similarity possible. Their impact would be much lower if the average similarity between a word and all the words in the other sentence were employed instead of the maximum similarity. Several variants of this measure can be used, for example not taking into account every instance where a word is repeated in a sentence, or not allowing any single word to be the "most similar" to several other words.

2.2 A Non-Compositional Approach Based on the Differences Between the Sentences

The main limitation of the first approach in the context of this challenge is that it is completely unsupervised and therefore does not take advantage of the training set provided by the task organizers. The second approach addresses this limitation. It aims to build a predictive model, using as predictors MeanMaxSim but also lexical features describing the differences between each sentence of a pair. For the extraction of these features, each pair of sentences of the whole dataset (training and testing sets) is analyzed to identify all the lemmas that are not present with the same frequency in both sentences. Each of these differences is encoded as a feature whose value corresponds to the unsigned frequency difference. This step leads to a two-way contingency table with sentence pairs as rows and lexical features as columns. Correspondence Analysis (Blasius and Greenacre, 1994; Lebart et al., 2000), a statistical procedure available in many off-the-shelf software packages such as R (Nenadic and Greenacre, 2006), is then used to decompose this table into orthogonal dimensions, ordered according to the part of the association between rows and columns that they explain. Each row receives a coordinate on these dimensions, and these coordinates are used as predictors of the relatedness scores of the sentence pairs. In this way, not only are the frequencies of the lexical features transformed into continuous predictors, but these predictors also take into account the redundancy between the lexical features. Finally, a predictive model is built on the basis of the training set by means of multiple linear regression with stepwise selection of the best predictors. For the textual entailment sub-task, the same procedure was used, except that the linear regression was replaced by a linear discriminant analysis.
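A sketch of the supervised pipeline for the relatedness sub-task: an SVD-based correspondence analysis of the (pairs × lexical-differences) table, whose row coordinates are combined with MeanMaxSim as regression predictors. Stepwise predictor selection and the discriminant analysis used for entailment are omitted, and all variable names (diff_table, mms_scores, relatedness, the 60-dimension cut-off) are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def correspondence_analysis(table, n_dims):
    """Principal row coordinates of a correspondence analysis, computed from
    the SVD of the standardized residuals (no empty rows or columns assumed)."""
    P = table / table.sum()
    r = P.sum(axis=1, keepdims=True)            # row masses
    c = P.sum(axis=0, keepdims=True)            # column masses
    S = (P - r @ c) / np.sqrt(r @ c)            # standardized residuals
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    return (U[:, :n_dims] * s[:n_dims]) / np.sqrt(r)

# diff_table: pairs x lemmas, unsigned frequency differences (train + test together).
coords = correspondence_analysis(diff_table, n_dims=60)   # 60 is an assumed value
X = np.column_stack([mms_scores, coords])                 # MeanMaxSim + CA coordinates
reg = LinearRegression().fit(X[train_idx], relatedness[train_idx])
predicted = reg.predict(X[test_idx])
```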

3 Implementation Details

This section describes the steps and additional resources used to implement the proposed approaches for the SICK challenge.

3.1 Preprocessing of the Dataset

All sentences were tokenized and lemmatized with the Stanford Parser (de Marneffe et al., 2006; Toutanova et al., 2003).

3.2 Distributional Semantics

Latent Semantic Analysis (LSA), a classical DSM (Deerwester et al., 1991; Landauer et al., 1998), was used to gather the semantic similarity between words from corpora. The starting point of the analysis is a lexical table containing the frequencies of every word in each of the text segments included in the corpus. This table is submitted to a singular value decomposition, which extracts the most significant orthogonal dimensions. In this semantic space, the meaning of a word is represented by a vector, and the semantic similarity between two words is estimated by the cosine between their corresponding vectors.

Three corpora were used to estimate these similarities. The first one, the TASA corpus, is composed of excerpts, with an approximate average length of 250 words, obtained by a random sampling of texts that American students read (Landauer et al., 1998). The version to which T.K. Landauer (Institute of Cognitive Science, University of Colorado, Boulder) provided access contains approximately 12 million words.

The second corpus, the BNC (British National Corpus; Aston and Burnard, 1998), is composed of approximately 100 million words and covers many different genres. As the documents included in this corpus can be up to 45,000 words long, they were divided into segments of 250 words, the last segment of a text being deleted if it contained fewer than 250 words.

The third corpus (WIKI, approximately 600 million words after preprocessing) is derived from the Wikipedia Foundation database, downloaded in April 2011. It was built using WikiExtractor.py by A. Fuschetto. As for the BNC, the texts were cut into 250-word segments, and any segment of fewer than 250 words was deleted.

All these corpora were lemmatized by means of the TreeTagger (Schmid, 1994). In addition, a series of function words were removed, as well as all the words whose total frequency in the corpus was lower than 10. The resulting (log-entropy weighted) matrices of co-occurrences were submitted to a singular value decomposition (SVDPACKC; Berry et al., 1993) and the first 300 eigenvectors were retained.
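The weighting and decomposition step could look as follows; the exact log-entropy variant is an assumption, and a real 100-million-word corpus would call for a sparse SVD (the author used SVDPACKC) rather than the dense one shown here.

```python
import numpy as np

def log_entropy_weight(counts):
    """Log-entropy weighting of a (terms x segments) count matrix."""
    p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    global_weight = 1.0 + plogp.sum(axis=1) / np.log(counts.shape[1])
    return np.log(counts + 1.0) * global_weight[:, None]

def lsa_word_vectors(counts, k=300):
    """300-dimensional word vectors from a truncated SVD of the weighted matrix."""
    W = log_entropy_weight(counts)
    U, s, _ = np.linalg.svd(W, full_matrices=False)   # use a sparse SVD for large corpora
    return U[:, :k] * s[:k]
```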

3.3 Unsupervised Approach Details

Before estimating the semantic similarity between a pair of sentences using MeanMaxSim, words (in their lemmatized forms) considered as stop words were filtered out. This stop word list (n=82) was built specifically for the occasion on the basis of the list of the most frequent words in the training dataset.
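Putting the pieces together, the unsupervised score for a sentence pair would be computed roughly as below; STOP_WORDS stands for the 82-item list and lsa_vectors for the word vectors from the previous step, both assumed names.

```python
def content_vectors(lemmas, lsa_vectors, stop_words):
    """Keep vectors of lemmas that are neither stop words nor out of vocabulary."""
    return [lsa_vectors[w] for w in lemmas if w not in stop_words and w in lsa_vectors]

score = mean_max_sim(content_vectors(lemmas_a, lsa_vectors, STOP_WORDS),
                     content_vectors(lemmas_b, lsa_vectors, STOP_WORDS))
```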

3.4 Supervised Approach Details

To identify words not present with the same frequency in both sentences, all the lemmas (including those belonging to the stop word list) were taken into account. The optimization of the parameters of the predictive model was performed using a three-fold cross-validation procedure, with two thirds of the 5,000 sentence pairs for training and the remaining third for testing. The values tested by means of an exhaustive search were as follows (a sketch of the resulting grid is given after the list):

• Minimum threshold frequency of the lexical features in the complete dataset: from 10 to 70 by step of 10.

• Number of dimensions retained from the CA: from 10 to the total number of dimensions available by step of 10.

• P-value threshold to enter or remove predictors from the model: 0.01 and from 0.05 to 0.45 by step of 0.05.
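The grid itself is small enough to write down explicitly; only the ranges come from the paper, while the dictionary keys and the helper are invented for this sketch.

```python
def parameter_grid(total_ca_dims):
    """Exhaustive search space described in Section 3.4."""
    return {
        "min_feature_freq": list(range(10, 71, 10)),                          # 10 .. 70
        "n_ca_dims": list(range(10, total_ca_dims + 1, 10)),                  # 10 .. all dims
        "p_threshold": [0.01] + [round(0.05 * k, 2) for k in range(1, 10)],   # 0.01, 0.05 .. 0.45
    }
```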

This cross-validation procedure was repeated five times, each time changing the random distribution of sentence pairs in the samples. The final values of the three parameters were selected
