David Vilares


2023

pdf
Another Dead End for Morphological Tags? Perturbed Inputs and Parsing
Alberto Muñoz-Ortiz | David Vilares
Findings of the Association for Computational Linguistics: ACL 2023

The usefulness of part-of-speech tags for parsing has been heavily questioned due to the success of word-contextualized parsers. Yet, most studies are limited to coarse-grained tags and high quality written content; while we know little about their influence when it comes to models in production that face lexical errors. We expand these setups and design an adversarial attack to verify if the use of morphological information by parsers: (i) contributes to error propagation or (ii) if on the other hand it can play a role to correct mistakes that word-only neural parsers make. The results on 14 diverse UD treebanks show that under such attacks, for transition- and graph-based models their use contributes to degrade the performance even faster, while for the (lower-performing) sequence labeling parsers they are helpful. We also show that if morphological tags were utopically robust against lexical perturbations, they would be able to correct parsing mistakes.

2022

pdf
The Fragility of Multi-Treebank Parsing Evaluation
Iago Alonso-Alonso | David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 29th International Conference on Computational Linguistics

Treebank selection for parsing evaluation and the spurious effects that might arise from a biased choice have not been explored in detail. This paper studies how evaluating on a single subset of treebanks can lead to weak conclusions. First, we take a few contrasting parsers, and run them on subsets of treebanks proposed in previous work, whose use was justified (or not) on criteria such as typology or data scarcity. Second, we run a large-scale version of this experiment, create vast amounts of random subsets of treebanks, and compare on them many parsers whose scores are available. The results show substantial variability across subsets and that although establishing guidelines for good treebank selection is hard, some inadequate strategies can be easily avoided.

pdf
LyS_ACoruña at SemEval-2022 Task 10: Repurposing Off-the-Shelf Tools for Sentiment Analysis as Semantic Dependency Parsing
Iago Alonso-Alonso | David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper addressed the problem of structured sentiment analysis using a bi-affine semantic dependency parser, large pre-trained language models, and publicly available translation models. For the monolingual setup, we considered: (i) training on a single treebank, and (ii) relaxing the setup by training on treebanks coming from different languages that can be adequately processed by cross-lingual language models. For the zero-shot setup and a given target treebank, we relied on: (i) a word-level translation of available treebanks in other languages to get noisy, unlikely-grammatical, but annotated data (we release as much of it as licenses allow), and (ii) merging those translated treebanks to obtain training data. In the post-evaluation phase, we also trained cross-lingual models that simply merged all the English treebanks and did not use word-level translations, and yet obtained better results. According to the official results, we ranked 8th and 9th in the monolingual and cross-lingual setups.

pdf
Cross-lingual Inflection as a Data Augmentation Method for Parsing
Alberto Muñoz-Ortiz | Carlos Gómez-Rodríguez | David Vilares
Proceedings of the Third Workshop on Insights from Negative Results in NLP

We propose a morphology-based method for low-resource (LR) dependency parsing. We train a morphological inflector for target LR languages, and apply it to related rich-resource (RR) treebanks to create cross-lingual (x-inflected) treebanks that resemble the target LR language. We use such inflected treebanks to train parsers in zero- (training on x-inflected treebanks) and few-shot (training on x-inflected and target language treebanks) setups. The results show that the method sometimes improves the baselines, but not consistently.

pdf
Parsing linearizations appreciate PoS tags - but some are fussy about errors
Alberto Muñoz-Ortiz | Mark Anderson | David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

PoS tags, once taken for granted as a useful resource for syntactic parsing, have become more situational with the popularization of deep learning. Recent work on the impact of PoS tags on graph- and transition-based parsers suggests that they are only useful when tagging accuracy is prohibitively high, or in low-resource scenarios. However, such an analysis is lacking for the emerging sequence labeling parsing paradigm, where it is especially relevant as some models explicitly use PoS tags for encoding and decoding. We undertake a study and uncover some trends. Among them, PoS tags are generally more useful for sequence labeling parsers than for other paradigms, but the impact of their accuracy is highly encoding-dependent, with the PoS-based head-selection encoding being best only when both tagging accuracy and resource availability are high.

2021

pdf
Not All Linearizations Are Equally Data-Hungry in Sequence Labeling Parsing
Alberto Muñoz-Ortiz | Michalina Strzyz | David Vilares
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Different linearizations have been proposed to cast dependency parsing as sequence labeling and solve the task as: (i) a head selection problem, (ii) finding a representation of the token arcs as bracket strings, or (iii) associating partial transition sequences of a transition-based parser to words. Yet, there is little understanding about how these linearizations behave in low-resource setups. Here, we first study their data efficiency, simulating data-restricted setups from a diverse set of rich-resource treebanks. Second, we test whether such differences manifest in truly low-resource setups. The results show that head selection encodings are more data-efficient and perform better in an ideal (gold) framework, but that such advantage greatly vanishes in favour of bracketing formats when the running setup resembles a real-world low-resource configuration.

pdf
On the logistical difficulties and findings of Jopara Sentiment Analysis
Marvin Agüero-Torales | David Vilares | Antonio López-Herrera
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

This paper addresses the problem of sentiment analysis for Jopara, a code-switching language between Guarani and Spanish. We first collect a corpus of Guarani-dominant tweets and discuss on the difficulties of finding quality data for even relatively easy-to-annotate tasks, such as sentiment analysis. Then, we train a set of neural models, including pre-trained language models, and explore whether they perform better than traditional machine learning ones in this low-resource setup. Transformer architectures obtain the best results, despite not considering Guarani during pre-training, but traditional machine learning models perform close due to the low-resource nature of the problem.

2020

pdf
Bracketing Encodings for 2-Planar Dependency Parsing
Michalina Strzyz | David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 28th International Conference on Computational Linguistics

We present a bracketing-based encoding that can be used to represent any 2-planar dependency tree over a sentence of length n as a sequence of n labels, hence providing almost total coverage of crossing arcs in sequence labeling parsing. First, we show that existing bracketing encodings for parsing as labeling can only handle a very mild extension of projective trees. Second, we overcome this limitation by taking into account the well-known property of 2-planarity, which is present in the vast majority of dependency syntactic structures in treebanks, i.e., the arcs of a dependency tree can be split into two planes such that arcs in a given plane do not cross. We take advantage of this property to design a method that balances the brackets and that encodes the arcs belonging to each of those planes, allowing for almost unrestricted non-projectivity (∼99.9% coverage) in sequence labeling parsing. The experiments show that our linearizations improve over the accuracy of the original bracketing encoding in highly non-projective treebanks (on average by 0.4 LAS), while achieving a similar speed. Also, they are especially suitable when PoS tags are not used as input parameters to the models.

pdf
A Unifying Theory of Transition-based and Sequence Labeling Parsing
Carlos Gómez-Rodríguez | Michalina Strzyz | David Vilares
Proceedings of the 28th International Conference on Computational Linguistics

We define a mapping from transition-based parsing algorithms that read sentences from left to right to sequence labeling encodings of syntactic trees. This not only establishes a theoretical relation between transition-based parsing and sequence-labeling parsing, but also provides a method to obtain new encodings for fast and simple sequence labeling parsing from the many existing transition-based parsers for different formalisms. Applying it to dependency parsing, we implement sequence labeling versions of four algorithms, showing that they are learnable and obtain comparable performance to existing encodings.

pdf
Discontinuous Constituent Parsing as Sequence Labeling
David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

This paper reduces discontinuous parsing to sequence labeling. It first shows that existing reductions for constituent parsing as labeling do not support discontinuities. Second, it fills this gap and proposes to encode tree discontinuities as nearly ordered permutations of the input sequence. Third, it studies whether such discontinuous representations are learnable. The experiments show that despite the architectural simplicity, under the right representation, the models are fast and accurate.

2019

pdf
Towards Making a Dependency Parser See
Michalina Strzyz | David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We explore whether it is possible to leverage eye-tracking data in an RNN dependency parser (for English) when such information is only available during training - i.e. no aggregated or token-level gaze features are used at inference time. To do so, we train a multitask learning model that parses sentences as sequence labeling and leverages gaze features as auxiliary tasks. Our method also learns to train from disjoint datasets, i.e. it can be used to test whether already collected gaze features are useful to improve the performance on new non-gazed annotated treebanks. Accuracy gains are modest but positive, showing the feasibility of the approach. It can serve as a first step towards architectures that can better leverage eye-tracking data or other complementary information available only for training sentences, possibly leading to improvements in syntactic parsing.

pdf
Viable Dependency Parsing as Sequence Labeling
Michalina Strzyz | David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We recast dependency parsing as a sequence labeling problem, exploring several encodings of dependency trees as labels. While dependency parsing by means of sequence labeling had been attempted in existing work, results suggested that the technique was impractical. We show instead that with a conventional BILSTM-based model it is possible to obtain fast and accurate parsers. These parsers are conceptually simple, not needing traditional parsing algorithms or auxiliary structures. However, experiments on the PTB and a sample of UD treebanks show that they provide a good speed-accuracy tradeoff, with results competitive with more complex approaches.

pdf
Harry Potter and the Action Prediction Challenge from Natural Language
David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We explore the challenge of action prediction from textual descriptions of scenes, a testbed to approximate whether text inference can be used to predict upcoming actions. As a case of study, we consider the world of the Harry Potter fantasy novels and inferring what spell will be cast next given a fragment of a story. Spells act as keywords that abstract actions (e.g. ‘Alohomora’ to open a door) and denote a response to the environment. This idea is used to automatically build HPAC, a corpus containing 82,836 samples and 85 actions. We then evaluate different baselines. Among the tested models, an LSTM-based approach obtains the best performance for frequent actions and large scene descriptions, but approaches such as logistic regression behave well on infrequent actions.

pdf
Better, Faster, Stronger Sequence Tagging Constituent Parsers
David Vilares | Mostafa Abdou | Anders Søgaard
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Sequence tagging models for constituent parsing are faster, but less accurate than other types of parsers. In this work, we address the following weaknesses of such constituent parsers: (a) high error rates around closing brackets of long constituents, (b) large label sets, leading to sparsity, and (c) error propagation arising from greedy decoding. To effectively close brackets, we train a model that learns to switch between tagging schemes. To reduce sparsity, we decompose the label set and use multi-task learning to jointly learn to predict sublabels. Finally, we mitigate issues from greedy decoding through auxiliary losses and sentence-level fine-tuning with policy gradient. Combining these techniques, we clearly surpass the performance of sequence tagging constituent parsers on the English and Chinese Penn Treebanks, and reduce their parsing time even further. On the SPMRL datasets, we observe even greater improvements across the board, including a new state of the art on Basque, Hebrew, Polish and Swedish.

pdf
HEAD-QA: A Healthcare Dataset for Complex Reasoning
David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.

pdf
Sequence Labeling Parsing by Learning across Representations
Michalina Strzyz | David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We use parsing as sequence labeling as a common framework to learn across constituency and dependency syntactic abstractions.To do so, we cast the problem as multitask learning (MTL). First, we show that adding a parsing paradigm as an auxiliary loss consistently improves the performance on the other paradigm. Secondly, we explore an MTL sequence labeling model that parses both representations, at almost no cost in terms of performance and speed. The results across the board show that on average MTL models with auxiliary losses for constituency parsing outperform single-task ones by 1.05 F1 points, and for dependency parsing by 0.62 UAS points.

pdf
Artificially Evolved Chunks for Morphosyntactic Analysis
Mark Anderson | David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)

2018

pdf
Grounding the Semantics of Part-of-Day Nouns Worldwide using Twitter
David Vilares | Carlos Gómez-Rodríguez
Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media

The usage of part-of-day nouns, such as ‘night’, and their time-specific greetings (‘good night’), varies across languages and cultures. We show the possibilities that Twitter offers for studying the semantics of these terms and its variability between countries. We mine a worldwide sample of multilingual tweets with temporal greetings, and study how their frequencies vary in relation with local time. The results provide insights into the semantics of these temporal expressions and the cultural and sociological factors influencing their usage.

pdf
Transition-based Parsing with Lighter Feed-Forward Networks
David Vilares | Carlos Gómez-Rodríguez
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

We explore whether it is possible to build lighter parsers, that are statistically equivalent to their corresponding standard version, for a wide set of languages showing different structures and morphologies. As testbed, we use the Universal Dependencies and transition-based dependency parsers trained on feed-forward networks. For these, most existing research assumes de facto standard embedded features and relies on pre-computation tricks to obtain speed-ups. We explore how these features and their size can be reduced and whether this translates into speed-ups with a negligible impact on accuracy. The experiments show that grand-daughter features can be removed for the majority of treebanks without a significant (negative or positive) LAS difference. They also show how the size of the embeddings can be notably reduced.

pdf
A Transition-Based Algorithm for Unrestricted AMR Parsing
David Vilares | Carlos Gómez-Rodríguez
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Non-projective parsing can be useful to handle cycles and reentrancy in AMR graphs. We explore this idea and introduce a greedy left-to-right non-projective transition-based parser. At each parsing configuration, an oracle decides whether to create a concept or whether to connect a pair of existing concepts. The algorithm handles reentrancy and arbitrary cycles natively, i.e. within the transition system itself. The model is evaluated on the LDC2015E86 corpus, obtaining results close to the state of the art, including a Smatch of 64%, and showing good behavior on reentrant edges.

pdf
Constituent Parsing as Sequence Labeling
Carlos Gómez-Rodríguez | David Vilares
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We introduce a method to reduce constituent parsing to sequence labeling. For each word wt, it generates a label that encodes: (1) the number of ancestors in the tree that the words wt and wt+1 have in common, and (2) the nonterminal symbol at the lowest common ancestor. We first prove that the proposed encoding function is injective for any tree without unary branches. In practice, the approach is made extensible to all constituency trees by collapsing unary branches. We then use the PTB and CTB treebanks as testbeds and propose a set of fast baselines. We achieve 90% F-score on the PTB test set, outperforming the Vinyals et al. (2015) sequence-to-sequence parser. In addition, sacrificing some accuracy, our approach achieves the fastest constituent parsing speeds reported to date on PTB by a wide margin.

2017

pdf
A non-projective greedy dependency parser with bidirectional LSTMs
David Vilares | Carlos Gómez-Rodríguez
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

The LyS-FASTPARSE team present BIST-COVINGTON, a neural implementation of the Covington (2001) algorithm for non-projective dependency parsing. The bidirectional LSTM approach by Kiperwasser and Goldberg (2016) is used to train a greedy parser with a dynamic oracle to mitigate error propagation. The model participated in the CoNLL 2017 UD Shared Task. In spite of not using any ensemble methods and using the baseline segmentation and PoS tagging, the parser obtained good results on both macro-average LAS and UAS in the big treebanks category (55 languages), ranking 7th out of 33 teams. In the all treebanks category (LAS and UAS) we ranked 16th and 12th. The gap between the all and big categories is mainly due to the poor performance on four parallel PUD treebanks, suggesting that some ‘suffixed’ treebanks (e.g. Spanish-AnCora) perform poorly on cross-treebank settings, which does not occur with the corresponding ‘unsuffixed’ treebank (e.g. Spanish). By changing that, we obtain the 11th best LAS among all runs (official and unofficial). The code is made available at https://github.com/CoNLL-UD-2017/LyS-FASTPARSE

pdf
Towards Syntactic Iberian Polarity Classification
David Vilares | Marcos Garcia | Miguel A. Alonso | Carlos Gómez-Rodríguez
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Lexicon-based methods using syntactic rules for polarity classification rely on parsers that are dependent on the language and on treebank guidelines. Thus, rules are also dependent and require adaptation, especially in multilingual scenarios. We tackle this challenge in the context of the Iberian Peninsula, releasing the first symbolic syntax-based Iberian system with rules shared across five official languages: Basque, Catalan, Galician, Portuguese and Spanish. The model is made available.

pdf
Detecting Perspectives in Political Debates
David Vilares | Yulan He
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We explore how to detect people’s perspectives that occupy a certain proposition. We propose a Bayesian modelling approach where topics (or propositions) and their associated perspectives (or viewpoints) are modeled as latent variables. Words associated with topics or perspectives follow different generative routes. Based on the extracted perspectives, we can extract the top associated sentences from text to generate a succinct summary which allows a quick glimpse of the main viewpoints in a document. The model is evaluated on debates from the House of Commons of the UK Parliament, revealing perspectives from the debates without the use of labelled data and obtaining better results than previous related solutions under a variety of evaluations.

2016

pdf
EN-ES-CS: An English-Spanish Code-Switching Twitter Corpus for Multilingual Sentiment Analysis
David Vilares | Miguel A. Alonso | Carlos Gómez-Rodríguez
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Code-switching texts are those that contain terms in two or more different languages, and they appear increasingly often in social media. The aim of this paper is to provide a resource to the research community to evaluate the performance of sentiment classification techniques on this complex multilingual environment, proposing an English-Spanish corpus of tweets with code-switching (EN-ES-CS CORPUS). The tweets are labeled according to two well-known criteria used for this purpose: SentiStrength and a trinary scale (positive, neutral and negative categories). Preliminary work on the resource is already done, providing a set of baselines for the research community.

pdf
LyS at SemEval-2016 Task 4: Exploiting Neural Activation Values for Twitter Sentiment Classification and Quantification
David Vilares | Yerai Doval | Miguel A. Alonso | Carlos Gómez-Rodríguez
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
One model, two languages: training bilingual parsers with harmonized treebanks
David Vilares | Carlos Gómez-Rodríguez | Miguel A. Alonso
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

pdf bib
Sentiment Analysis on Monolingual, Multilingual and Code-Switching Twitter Corpora
David Vilares | Miguel A. Alonso | Carlos Gómez-Rodríguez
Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

2014

pdf
LyS: Porting a Twitter Sentiment Analysis Approach from Spanish to English
David Vilares | Miguel Hermo | Miguel A. Alonso | Carlos Gómez-Rodríguez | Yerai Doval
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)