Aline Villavicencio

2023

pdf abs
Evaluating Open-Domain Dialogues in Latent Space with Next Sentence Prediction and Mutual Information
Kun Zhao | Bohao Yang | Chenghua Lin | Wenge Rong | Aline Villavicencio | Xiaohui Cui
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The long-standing one-to-many issue of the open-domain dialogues poses significant challenges for automatic evaluation methods, i.e., there may be multiple suitable responses which differ in semantics for a given conversational context. To tackle this challenge, we propose a novel learning-based automatic evaluation metric (CMN), which can robustly evaluate open-domain dialogues by augmenting Conditional Variational Autoencoders (CVAEs) with a Next Sentence Prediction (NSP) objective and employing Mutual Information (MI) to model the semantic similarity of text in the latent space. Experimental results on two open-domain dialogue datasets demonstrate the superiority of our method compared with a wide range of baselines, especially in handling responses which are distant to the “golden” reference responses in semantics.

2022

pdf abs
SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding
Harish Tayyar Madabushi | Edward Gow-Smith | Marcos Garcia | Carolina Scarton | Marco Idiart | Aline Villavicencio
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper presents the shared task on Multilingual Idiomaticity Detection and Sentence Embedding, which consists of two subtasks: (a) a binary classification task aimed at identifying whether a sentence contains an idiomatic expression, and (b) a task based on semantic text similarity which requires the model to adequately represent potentially idiomatic expressions in context. Each subtask includes different settings regarding the amount of training data. Besides the task description, this paper introduces the datasets in English, Portuguese, and Galician and their annotation procedure, the evaluation metrics, and a summary of the participant systems and their results. The task had close to 100 registered participants organised into twenty five teams making over 650 and 150 submissions in the practice and evaluation phases respectively.

Deep neural models, in particular Transformer-based pre-trained language models, require a significant amount of data to train. This need for data tends to lead to problems when dealing with idiomatic multiword expressions (MWEs), which are inherently less frequent in natural text. As such, this work explores sample efficient methods of idiomaticity detection. In particular we study the impact of Pattern Exploit Training (PET), a few-shot method of classification, and BERTRAM, an efficient method of creating contextual embeddings, on the task of idiomaticity detection. In addition, to further explore generalisability, we focus on the identification of MWEs not present in the training data. Our experiments show that while these methods improve performance on English, they are much less effective on Portuguese and Galician, leading to an overall performance about on par with vanilla mBERT. Regardless, we believe sample efficient methods for both identifying and representing potentially idiomatic MWEs are very encouraging and hold significant potential for future exploration.

pdf
Improving Tokenisation by Alternative Treatment of Spaces
Edward Gow-Smith | Harish Tayyar Madabushi | Carolina Scarton | Aline Villavicencio
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

pdf abs
Effective Cross-Task Transfer Learning for Explainable Natural Language Inference with T5
Irina Bigoulaeva | Rachneet Singh Sachdeva | Harish Tayyar Madabushi | Aline Villavicencio | Iryna Gurevych
Proceedings of the 3rd Workshop on Figurative Language Processing (FLP)

We compare sequential fine-tuning with a model for multi-task learning in the context where we are interested in boosting performance on two of the tasks, one of which depends on the other. We test these models on the FigLang2022 shared task which requires participants to predict language inference labels on figurative language along with corresponding textual explanations of the inference predictions. Our results show that while sequential multi-task learning can be tuned to be good at the first of two target tasks, it performs less well on the second and additionally struggles with overfitting. Our findings show that simple sequential fine-tuning of text-to-text models is an extraordinarily powerful method of achieving cross-task knowledge transfer while simultaneously predicting multiple interdependent targets. So much so, that our best model achieved the (tied) highest score on the task.

pdf bib abs
Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings
Marcely Zanon Boito | Bolaji Yusuf | Lucas Ondel | Aline Villavicencio | Laurent Besacier
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

Documenting languages helps to prevent the extinction of endangered dialects - many of which are otherwise expected to disappear by the end of the century. When documenting oral languages, unsupervised word segmentation (UWS) from speech is a useful, yet challenging, task. It consists in producing time-stamps for slicing utterances into smaller segments corresponding to words, being performed from phonetic transcriptions, or in the absence of these, from the output of unsupervised speech discretization models. These discretization models are trained using raw speech only, producing discrete speech units that can be applied for downstream (text-based) tasks. In this paper we compare five of these models: three Bayesian and two neural approaches, with regards to the exploitability of the produced units for UWS. For the UWS task, we experiment with two models, using as our target language the Mboshi (Bantu C25), an unwritten language from Congo-Brazzaville. Additionally, we report results for Finnish, Hungarian, Romanian and Russian in equally low-resource settings, using only 4 hours of speech. Our results suggest that neural models for speech discretization are difficult to exploit in our setting, and that it might be necessary to adapt them to limit sequence length. We obtain our best UWS results by using Bayesian models that produce high quality, yet compressed, discrete representations of the input speech signal.

pdf bib
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Smaranda Muresan | Preslav Nakov | Aline Villavicencio
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Smaranda Muresan | Preslav Nakov | Aline Villavicencio
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Findings of the Association for Computational Linguistics: ACL 2022
Smaranda Muresan | Preslav Nakov | Aline Villavicencio
Findings of the Association for Computational Linguistics: ACL 2022

2021

pdf abs
CogNLP-Sheffield at CMCL 2021 Shared Task: Blending Cognitively Inspired Features with Transformer-based Language Models for Predicting Eye Tracking Patterns
Peter Vickers | Rosa Wainwright | Harish Tayyar Madabushi | Aline Villavicencio
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

The CogNLP-Sheffield submissions to the CMCL 2021 Shared Task examine the value of a variety of cognitively and linguistically inspired features for predicting eye tracking patterns, as both standalone model inputs and as supplements to contextual word embeddings (XLNet). Surprisingly, the smaller pre-trained model (XLNet-base) outperforms the larger (XLNet-large), and despite evidence that multi-word expressions (MWEs) provide cognitive processing advantages, MWE features provide little benefit to either model.

pdf bib abs
Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021): Workshop and Shared Task Report
Ali Hürriyetoğlu | Hristo Tanev | Vanni Zavarella | Jakub Piskorski | Reyyan Yeniterzi | Osman Mutlu | Deniz Yuret | Aline Villavicencio
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)

This workshop is the fourth issue of a series of workshops on automatic extraction of socio-political events from news, organized by the Emerging Market Welfare Project, with the support of the Joint Research Centre of the European Commission and with contributions from many other prominent scholars in this field. The purpose of this series of workshops is to foster research and development of reliable, valid, robust, and practical solutions for automatically detecting descriptions of socio-political events, such as protests, riots, wars and armed conflicts, in text streams. This year workshop contributors make use of the state-of-the-art NLP technologies, such as Deep Learning, Word Embeddings and Transformers and cover a wide range of topics from text classification to news bias detection. Around 40 teams have registered and 15 teams contributed to three tasks that are i) multilingual protest news detection detection, ii) fine-grained classification of socio-political events, and iii) discovering Black Lives Matter protest events. The workshop also highlights two keynote and four invited talks about various aspects of creating event data sets and multi- and cross-lingual machine learning in few- and zero-shot settings.

pdf abs
Probing for idiomaticity in vector space models
Marcos Garcia | Tiago Kramer Vieira | Carolina Scarton | Marco Idiart | Aline Villavicencio
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Contextualised word representation models have been successfully used for capturing different word usages and they may be an attractive alternative for representing idiomaticity in language. In this paper, we propose probing measures to assess if some of the expected linguistic properties of noun compounds, especially those related to idiomatic meanings, and their dependence on context and sensitivity to lexical choice, are readily available in some standard and widely used representations. For that, we constructed the Noun Compound Senses Dataset, which contains noun compounds and their paraphrases, in context neutral and context informative naturalistic sentences, in two languages: English and Portuguese. Results obtained using four types of probing measures with models like ELMo, BERT and some of its variants, indicate that idiomaticity is not yet accurately represented by contextualised models

pdf abs
AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models
Harish Tayyar Madabushi | Edward Gow-Smith | Carolina Scarton | Aline Villavicencio
Findings of the Association for Computational Linguistics: EMNLP 2021

Despite their success in a variety of NLP tasks, pre-trained language models, due to their heavy reliance on compositionality, fail in effectively capturing the meanings of multiword expressions (MWEs), especially idioms. Therefore, datasets and methods to improve the representation of MWEs are urgently needed. Existing datasets are limited to providing the degree of idiomaticity of expressions along with the literal and, where applicable, (a single) non-literal interpretation of MWEs. This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings, spanning both English and Portuguese. We use this dataset in two tasks designed to test i) a language model’s ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms. Our experiments demonstrate that, on the task of detecting idiomatic usage, these models perform reasonably well in the one-shot and few-shot scenarios, but that there is significant scope for improvement in the zero-shot scenario. On the task of representing idiomaticity, we find that pre-training is not always effective, while fine-tuning could provide a sample efficient method of learning representations of sentences containing MWEs.

pdf abs
Assessing the Representations of Idiomaticity in Vector Models with a Noun Compound Dataset Labeled at Type and Token Levels
Marcos Garcia | Tiago Kramer Vieira | Carolina Scarton | Marco Idiart | Aline Villavicencio
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Accurate assessment of the ability of embedding models to capture idiomaticity may require evaluation at token rather than type level, to account for degrees of idiomaticity and possible ambiguity between literal and idiomatic usages. However, most existing resources with annotation of idiomaticity include ratings only at type level. This paper presents the Noun Compound Type and Token Idiomaticity (NCTTI) dataset, with human annotations for 280 noun compounds in English and 180 in Portuguese at both type and token level. We compiled 8,725 and 5,091 token level annotations for English and Portuguese, respectively, which are strongly correlated with the corresponding scores obtained at type level. The NCTTI dataset is used to explore how vector space models reflect the variability of idiomaticity across sentences. Several experiments using state-of-the-art contextualised models suggest that their representations are not capturing the noun compounds idiomaticity as human annotators. This new multilingual resource also contains suggestions for paraphrases of the noun compounds both at type and token levels, with uses for lexical substitution or disambiguation in context.

2020

pdf abs
Leveraging Contextual Embeddings and Idiom Principle for Detecting Idiomaticity in Potentially Idiomatic Expressions
Reyhaneh Hashempour | Aline Villavicencio
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

The majority of studies on detecting idiomatic expressions have focused on discovering potentially idiomatic expressions overlooking the context. However, many idioms like blow the whistle could be interpreted idiomatically or literally depending on the context. In this work, we leverage the Idiom Principle (Sinclair et al., 1991) and contextualized word embeddings (CWEs), focusing on Context2Vec (Melamud et al., 2016) and BERT (Devlin et al., 2019) to distinguish between literal and idiomatic senses of such expressions in context. We also experiment with a non-contextualized word embedding baseline, in this case word2Vec (Mikolov et al., 2013) and compare its performance with that of CWEs. The results show that CWEs outperform the non-CWEs, especially when the Idiom Principle is applied, as it improves the results by 6%. We further show that the Context2Vec model, trained based on Idiom Principle, can place potentially idiomatic expressions into distinct ‘sense’ (idiomatic/literal) regions of the embedding space, whereas Word2Vec and BERT seem to lack this capacity. The model is also capable of producing suitable substitutes for ambiguous expressions in context which is promising for downstream tasks like text simplification.

pdf bib
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
Aline Villavicencio | Benjamin Van Durme
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

pdf abs
Investigating Language Impact in Bilingual Approaches for Computational Language Documentation
Marcely Zanon Boito | Aline Villavicencio | Laurent Besacier
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

For endangered languages, data collection campaigns have to accommodate the challenge that many of them are from oral tradition, and producing transcriptions is costly. Therefore, it is fundamental to translate them into a widely spoken language to ensure interpretability of the recordings. In this paper we investigate how the choice of translation language affects the posterior documentation work and potential automatic approaches which will work on top of the produced bilingual corpus. For answering this question, we use the MaSS multilingual speech corpus (Boito et al., 2020) for creating 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment. Our results highlight that the choice of language for translation influences the word segmentation performance, and that different lexicons are learned by using different aligned translations. Lastly, this paper proposes a hybrid approach for bilingual word segmentation, combining boundary clues extracted from a non-parametric Bayesian model (Goldwater et al., 2009a) with the attentional word segmentation neural model from Godard et al. (2018). Our results suggest that incorporating these clues into the neural models’ input representation increases their translation and alignment quality, specially for challenging language pairs.

2019

pdf bib abs
Unsupervised Compositionality Prediction of Nominal Compounds
Silvio Cordeiro | Aline Villavicencio | Marco Idiart | Carlos Ramisch
Computational Linguistics, Volume 45, Issue 1 - March 2019

Nominal compounds such as red wine and nut case display a continuum of compositionality, with varying contributions from the components of the compound to its semantics. This article proposes a framework for compound compositionality prediction using distributional semantic models, evaluating to what extent they capture idiomaticity compared to human judgments. For evaluation, we introduce data sets containing human judgments in three languages: English, French, and Portuguese. The results obtained reveal a high agreement between the models and human predictions, suggesting that they are able to incorporate information about idiomaticity. We also present an in-depth evaluation of various factors that can affect prediction, such as model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size in the ability of the model to predict compositionality, and of a uniform combination of the components for best results.

pdf bib
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Mohit Bansal | Aline Villavicencio
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

pdf bib abs
When the whole is greater than the sum of its parts: Multiword expressions and idiomaticity
Aline Villavicencio
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

Multiword expressions (MWEs) feature prominently in the mental lexicon of native speakers (Jackendoff, 1997) in all languages and domains, from informal to technical contexts (Biber et al., 1999) with about four MWEs being produced per minute of discourse (Glucksberg, 1989). MWEs come in all shapes and forms, including idioms like rock the boat (as cause problems or disturb a situation) and compound nouns like monkey business (as dishonest behaviour). Their accurate detection and understanding may often require more than knowledge about individual words and how they can be combined (Fillmore, 1979), as they may display various degrees of idiosyncrasy, including lexical, syntactic, semantic and statistical (Sag et al., 2002; Baldwin and Kim, 2010), which provide new challenges and opportunities for language processing (Constant et al., 2017). For instance, while for some combinations the meaning can be inferred from their parts like olive oil (oil made of olives) this is not always the case, as in dark horse (meaning an unknown candidate who unexpectedly succeeds), and when processing a sentence some of the challenges are to identify which words form an expression (Ramisch, 2015), and whether the expression is idiomatic (Cordeiro et al., 2019). In this talk I will give an overview of advances on the identification and treatment of multiword expressions, in particular concentrating on techniques for identifying their degree of idiomaticity.

2018

pdf bib abs
Restricted Recurrent Neural Tensor Networks: Exploiting Word Frequency and Compositionality
Alexandre Salle | Aline Villavicencio
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Increasing the capacity of recurrent neural networks (RNN) usually involves augmenting the size of the hidden layer, with significant increase of computational cost. Recurrent neural tensor networks (RNTN) increase capacity using distinct hidden layer weights for each word, but with greater costs in memory usage. In this paper, we introduce restricted recurrent neural tensor networks (r-RNTN) which reserve distinct hidden layer weights for frequent vocabulary words while sharing a single set of weights for infrequent words. Perplexity evaluations show that for fixed hidden layer sizes, r-RNTNs improve language model performance over RNNs using only a small fraction of the parameters of unrestricted RNTNs. These results hold for r-RNTNs using Gated Recurrent Units and Long Short-Term Memory.

Semantic Verbal Fluency tests have been used in the detection of certain clinical conditions, like Dementia. In particular, given a sequence of semantically related words, a large number of switches from one semantic class to another has been linked to clinical conditions. In this work, we investigate three similarity measures for automatically identifying switches in semantic chains: semantic similarity from a manually constructed resource, and word association strength and semantic relatedness, both calculated from corpora. This information is used for building classifiers to distinguish healthy controls from clinical cases with early stages of Alzheimer’s Disease and Mild Cognitive Deficits. The overall results indicate that for clinical conditions the classifiers that use these similarity measures outperform those that use a gold standard taxonomy.

pdf abs
Incorporating Subword Information into Matrix Factorization Word Embeddings
Alexandre Salle | Aline Villavicencio
Proceedings of the Second Workshop on Subword/Character LEvel Models

The positive effect of adding subword information to word embeddings has been demonstrated for predictive models. In this paper we investigate whether similar benefits can also be derived from incorporating subwords into counting models. We evaluate the impact of different types of subwords (n-grams and unsupervised morphemes), with results confirming the importance of subword information in learning representations of rare and out-of-vocabulary words.

pdf bib
Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing
Marco Idiart | Alessandro Lenci | Thierry Poibeau | Aline Villavicencio
Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing

pdf
The brWaC Corpus: A New Open Resource for Brazilian Portuguese
Jorge A. Wagner Filho | Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

2016

pdf
Predicting the Compositionality of Nominal Compounds: Giving Word Embeddings a Hard Time
Silvio Cordeiro | Carlos Ramisch | Marco Idiart | Aline Villavicencio
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
How Naked is the Naked Truth? A Multilingual Lexicon of Nominal Compound Compositionality
Carlos Ramisch | Silvio Cordeiro | Leonardo Zilio | Marco Idiart | Aline Villavicencio
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations
Alexandre Salle | Aline Villavicencio | Marco Idiart
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Filtering and Measuring the Intrinsic Quality of Human Compositionality Judgments
Carlos Ramisch | Silvio Cordeiro | Aline Villavicencio
Proceedings of the 12th Workshop on Multiword Expressions

pdf bib
Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning
Anna Korhonen | Alessandro Lenci | Brian Murphy | Thierry Poibeau | Aline Villavicencio
Proceedings of the 7th Workshop on Cognitive Aspects of Computational Language Learning

pdf abs
Automatic Construction of Large Readability Corpora
Jorge Alberto Wagner Filho | Rodrigo Wilkens | Aline Villavicencio
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

This work presents a framework for the automatic construction of large Web corpora classified by readability level. We compare different Machine Learning classifiers for the task of readability assessment focusing on Portuguese and English texts, analysing the impact of variables like the feature inventory used in the resulting corpus. In a comparison between shallow and deeper features, the former already produce F-measures of over 0.75 for Portuguese texts, but the use of additional features results in even better results, in most cases. For English, shallow features also perform well as do classic readability formulas. Comparing different classifiers for the task, logistic regression obtained, in general, the best results, but with considerable differences between the results for two and those for three-classes, especially regarding the intermediary class. Given the large scale of the resulting corpus, for evaluation we adopt the agreement between different classifiers as an indication of readability assessment certainty. As a result of this work, a large corpus for Brazilian Portuguese was built, including 1.7 million documents and about 1.6 billion tokens, already parsed and annotated with 134 different textual attributes, along with the agreement among the various classifiers.

pdf
UFRGS&LIF at SemEval-2016 Task 10: Rule-Based MWE Identification and Predominant-Supersense Tagging
Silvio Cordeiro | Carlos Ramisch | Aline Villavicencio
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf abs
mwetoolkit+sem: Integrating Word Embeddings in the mwetoolkit for Semantic MWE Processing
Silvio Cordeiro | Carlos Ramisch | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents mwetoolkit+sem: an extension of the mwetoolkit that estimates semantic compositionality scores for multiword expressions (MWEs) based on word embeddings. First, we describe our implementation of vector-space operations working on distributional vectors. The compositionality score is based on the cosine distance between the MWE vector and the composition of the vectors of its member words. Our generic system can handle several types of word embeddings and MWE lists, and may combine individual word representations using several composition techniques. We evaluate our implementation on a dataset of 1042 English noun compounds, comparing different configurations of the underlying word embeddings and word-composition models. We show that our vector-based scores model non-compositionality better than standard association measures such as log-likelihood.

pdf abs
Multiword Expressions in Child Language
Rodrigo Wilkens | Marco Idiart | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The goal of this work is to introduce CHILDES-MWE, which contains English CHILDES corpora automatically annotated with Multiword Expressions (MWEs) information. The result is a resource with almost 350,000 sentences annotated with more than 70,000 distinct MWEs of various types from both longitudinal and latitudinal corpora. This resource can be used for large scale language acquisition studies of how MWEs feature in child language. Focusing on compound nouns (CN), we then verify in a longitudinal study if there are differences in the distribution and compositionality of CNs in child-directed and child-produced sentences across ages. Moreover, using additional latitudinal data, we investigate if there are further differences in CN usage and in compositionality preferences. The results obtained for the child-produced sentences reflect CN distribution and compositionality in child-directed sentences.

pdf abs
VerbLexPor: a lexical resource with semantic roles for Portuguese
Leonardo Zilio | Maria José Bocorny Finatto | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a lexical resource developed for Portuguese. The resource contains sentences annotated with semantic roles. The sentences were extracted from two domains: Cardiology research papers and newspaper articles. Both corpora were analyzed with the PALAVRAS parser and subsequently processed with a subcategorization frames extractor, so that each sentence that contained at least one main verb was stored in a database together with its syntactic organization. The annotation was manually carried out by a linguist using an annotation interface. Both the annotated and non-annotated data were exported to an XML format, which is readily available for download. The reason behind exporting non-annotated data is that there is syntactic information collected from the parser annotation in the non-annotated data, and this could be useful for other researchers. The sentences from both corpora were annotated separately, so that it is possible to access sentences either from the Cardiology or from the newspaper corpus. The full resource presents more than seven thousand semantically annotated sentences, containing 192 different verbs and more than 15 thousand individual arguments and adjuncts.

pdf abs
B2SG: a TOEFL-like Task for Portuguese
Rodrigo Wilkens | Leonardo Zilio | Eduardo Ferreira | Aline Villavicencio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Resources such as WordNet are useful for NLP applications, but their manual construction consumes time and personnel, and frequently results in low coverage. One alternative is the automatic construction of large resources from corpora like distributional thesauri, containing semantically associated words. However, as they may contain noise, there is a strong need for automatic ways of evaluating the quality of the resulting resource. This paper introduces a gold standard that can aid in this task. The BabelNet-Based Semantic Gold Standard (B2SG) was automatically constructed based on BabelNet and partly evaluated by human judges. It consists of sets of tests that present one target word, one related word and three unrelated words. B2SG contains 2,875 validated relations: 800 for verbs and 2,075 for nouns; these relations are divided among synonymy, antonymy and hypernymy. They can be used as the basis for evaluating the accuracy of the similarity relations on distributional thesauri by comparing the proximity of the target word with the related and unrelated options and observing if the related word has the highest similarity value among them. As a case study two distributional thesauri were also developed: one using surface forms from a large (1.5 billion word) corpus and the other using lemmatized forms from a smaller (409 million word) corpus. Both distributional thesauri were then evaluated against B2SG, and the one using lemmatized forms performed slightly better.

2015

pdf bib
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning
Robert Berwick | Anna Korhonen | Alessandro Lenci | Thierry Poibeau | Aline Villavicencio
Proceedings of the Sixth Workshop on Cognitive Aspects of Computational Language Learning

pdf
Distributional Thesauri for Portuguese: methodology evaluation
Rodrigo Wilkens | Leonardo Zilio | Eduardo Ferreira | Gabriel Gonçalves | Aline Villavicencio
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology

pdf
VerbLexPor: um recurso léxico com anotação de papéis semânticos para o português (VerbLexPor: a lexical resource annotated with semantic roles for Portuguese)
Leonardo Zilio | Maria José Bocorny Finatto | Aline Villavicencio
Proceedings of the 10th Brazilian Symposium in Information and Human Language Technology

2014

pdf bib
Proceedings of the 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL)
Alessandro Lenci | Muntsa Padró | Thierry Poibeau | Aline Villavicencio
Proceedings of the 5th Workshop on Cognitive Aspects of Computational Language Learning (CogACLL)

pdf
Nothing like Good Old Frequency: Studying Context Filters for Distributional Thesauri
Muntsa Padró | Marco Idiart | Aline Villavicencio | Carlos Ramisch
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf abs
Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them
Bruno Laranjeira | Viviane Moreira | Aline Villavicencio | Carlos Ramisch | Maria José Finatto
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Comparable corpora have been used as an alternative for parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms using them to collect comparable corpora on a specific domain. Then, we compare the evaluation of the focused crawling algorithms to the performance of linguistic processes executed after training with the corresponding generated corpora. Also, we propose a novel approach for focused crawling, exploiting the expressive power of multiword expressions.

pdf abs
Identification of Multiword Expressions in the brWaC
Rodrigo Boos | Kassius Prestes | Aline Villavicencio
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Although corpus size is a well known factor that affects the performance of many NLP tasks, for many languages large freely available corpora are still scarce. In this paper we describe one effort to build a very large corpus for Brazilian Portuguese, the brWaC, generated following the Web as Corpus kool initiative. To indirectly assess the quality of the resulting corpus we examined the impact of corpus origin in a specific task, the identification of Multiword Expressions with association measures, against a standard corpus. Focusing on nominal compounds, the expressions obtained from each corpus are of comparable quality and indicate that corpus origin has no impact on this task.

pdf abs
Comparing Similarity Measures for Distributional Thesauri
Muntsa Padró | Marco Idiart | Aline Villavicencio | Carlos Ramisch
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Distributional thesauri have been applied for a variety of tasks involving semantic relatedness. In this paper, we investigate the impact of three parameters: similarity measures, frequency thresholds and association scores. We focus on the robustness and stability of the resulting thesauri, measuring inter-thesaurus agreement when testing different parameter values. The results obtained show that low-frequency thresholds affect thesaurus quality more than similarity measures, with more agreement found for increasing thresholds. These results indicate the sensitivity of distributional thesauri to frequency. Nonetheless, the observed differences do not transpose over extrinsic evaluation using TOEFL-like questions. While this may be specific to the task, we argue that a careful examination of the stability of distributional resources prior to application is needed.

2013

pdf bib
Proceedings of the 9th Workshop on Multiword Expressions
Valia Kordoni | Carlos Ramisch | Aline Villavicencio
Proceedings of the 9th Workshop on Multiword Expressions

pdf
Language Acquisition and Probabilistic Models: keeping it simple
Aline Villavicencio | Marco Idiart | Robert Berwick | Igor Malioutov
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

pdf bib
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss
Robert Berwick | Anna Korhonen | Thierry Poibeau | Aline Villavicencio
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

pdf
An annotated English child language database
Aline Villavicencio | Beracah Yankama | Rodrigo Wilkens | Marco Idiart | Robert Berwick
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

pdf
I say have you say tem: profiling verbs in children data in English and Portuguese
Rodrigo Wilkens | Aline Villavicencio
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

pdf
Get out but don’t fall down: verb-particle constructions in child language
Aline Villavicencio | Marco Idiart | Carlos Ramisch | Vítor Araújo | Beracah Yankama | Robert Berwick
Proceedings of the Workshop on Computational Models of Language Acquisition and Loss

pdf bib
A Broad Evaluation of Techniques for Automatic Acquisition of Multiword Expressions
Carlos Ramisch | Vitor De Araujo | Aline Villavicencio
Proceedings of ACL 2012 Student Research Workshop

pdf abs
A large scale annotated child language construction database
Aline Villavicencio | Beracah Yankama | Marco Idiart | Robert Berwick
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Large scale annotated corpora of child language can be of great value in assessing theoretical proposals regarding language acquisition models. For example, they can help determine whether the type and amount of data required by a proposed language acquisition model can actually be found in a naturalistic data sample. To this end, several recent efforts have augmented the CHILDES child language corpora with POS tagging and parsing information for languages such as English. With the increasing availability of robust NLP systems and electronic resources, these corpora can be further annotated with more detailed information about the properties of words, verb argument structure, and sentences. This paper describes such an initiative for combining information from various sources to extend the annotation of the English CHILDES corpora with linguistic, psycholinguistic and distributional information, along with an example illustrating an application of this approach to the extraction of verb alternation information. The end result, the English CHILDES Verb Construction Database, is an integrated resource containing information such as grammatical relations, verb semantic classes, and age of acquisition, enabling more targeted complex searches involving different levels of annotation that can facilitate a more detailed analysis of the linguistic input available to children.

2011

pdf bib
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World
Valia Kordoni | Carlos Ramisch | Aline Villavicencio
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

pdf
Identifying and Analyzing Brazilian Portuguese Complex Predicates
Magali Sanches Duran | Carlos Ramisch | Sandra Maria Aluísio | Aline Villavicencio
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

pdf
Identification and Treatment of Multiword Expressions Applied to Information Retrieval
Otavio Acosta | Aline Villavicencio | Viviane Moreira
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

pdf
Fast and Flexible MWE Candidate Generation with the mwetoolkit
Vitor De Araujo | Carlos Ramisch | Aline Villavicencio
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World

pdf
Improving Lexical Alignment Using Hybrid Discriminative and Post-Processing Techniques
Paulo Schreiner | Aline Villavicencio | Leonardo Zilio | Helena M. Caseli
Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology

2010

pdf
Web-based and combined language models: a case study on noun compound identification
Carlos Ramisch | Aline Villavicencio | Christian Boitet
Coling 2010: Posters

pdf
Multiword Expressions in the wild? The mwetoolkit comes in handy
Carlos Ramisch | Aline Villavicencio | Christian Boitet
Coling 2010: Demonstrations

pdf
An Investigation on Polysemy and Lexical Organization of Verbs
Daniel Cerato Germann | Aline Villavicencio | Maity Siqueira
Proceedings of the NAACL HLT 2010 First Workshop on Computational Neurolinguistics

pdf
An Investigation on the Influence of Frequency on the Lexical Organization of Verbs
Daniel Cerato Germann | Aline Villavicencio | Maity Siqueira
Proceedings of TextGraphs-5 - 2010 Workshop on Graph-based Methods for Natural Language Processing

pdf bib
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications
Éric Laporte | Preslav Nakov | Carlos Ramisch | Aline Villavicencio
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications

pdf abs
mwetoolkit: a Framework for Multiword Expression Identification
Carlos Ramisch | Aline Villavicencio | Christian Boitet
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents the Multiword Expression Toolkit (mwetoolkit), an environment for type and language-independent MWE identification from corpora. The mwetoolkit provides a targeted list of MWE candidates, extracted and filtered according to a number of user-defined criteria and a set of standard statistical association measures. For generating corpus counts, the toolkit provides both a corpus indexation facility and a tool for integration with web search engines, while for evaluation, it provides validation and annotation facilities. The mwetoolkit also allows easy integration with a machine learning tool for the creation and application of supervised MWE extraction models if annotated data is available. In our experiment, the mwetoolkit was tested and evaluated in the context of MWE extraction in the biomedical domain. Our preliminary results show that the toolkit performs better than other approaches, especially concerning recall. Moreover, this first version can also be extended in several ways in order to improve the quality of the results.