Luis Chiruzzo


2021

pdf bib
Experiments on a Guarani Corpus of News and Social Media
Santiago Góngora | Nicolás Giossa | Luis Chiruzzo
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

While Guarani is widely spoken in South America, obtaining a large amount of Guarani text from the web is hard. We present the building process of a Guarani corpus composed of a parallel Guarani-Spanish set of news articles, and a monolingual set of tweets. We perform some word embeddings experiments aiming at evaluating the quality of the Guarani split of the corpus, finding encouraging results but noticing that more diversity in text domains might be needed for further improvements.

pdf bib
Findings of the AmericasNLP 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas
Manuel Mager | Arturo Oncevay | Abteen Ebrahimi | John Ortega | Annette Rios | Angela Fan | Ximena Gutierrez-Vasques | Luis Chiruzzo | Gustavo Giménez-Lugo | Ricardo Ramos | Ivan Vladimir Meza Ruiz | Rolando Coto-Solano | Alexis Palmer | Elisabeth Mager-Hois | Vishrav Chaudhary | Graham Neubig | Ngoc Thang Vu | Katharina Kann
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper presents the results of the 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas. The shared task featured two independent tracks, and participants submitted machine translation systems for up to 10 indigenous languages. Overall, 8 teams participated with a total of 214 submissions. We provided training sets consisting of data collected from various sources, as well as manually translated sentences for the development and test sets. An official baseline trained on this data was also provided. Team submissions featured a variety of architectures, including both statistical and neural models, and for the majority of languages, many teams were able to considerably improve over the baseline. The best performing systems achieved 12.97 ChrF higher than baseline, when averaged across languages.

pdf bib
SemEval 2021 Task 7: HaHackathon, Detecting and Rating Humor and Offense
J. A. Meaney | Steven Wilson | Luis Chiruzzo | Adam Lopez | Walid Magdy
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

SemEval 2021 Task 7, HaHackathon, was the first shared task to combine the previously separate domains of humor detection and offense detection. We collected 10,000 texts from Twitter and the Kaggle Short Jokes dataset, and had each annotated for humor and offense by 20 annotators aged 18-70. Our subtasks were binary humor detection, prediction of humor and offense ratings, and a novel controversy task: to predict if the variance in the humor ratings was higher than a specific threshold. The subtasks attracted 36-58 submissions, with most of the participants choosing to use pre-trained language models. Many of the highest performing teams also implemented additional optimization techniques, including task-adaptive training and adversarial training. The results suggest that the participating systems are well suited to humor detection, but that humor controversy is a more challenging task. We discuss which models excel in this task, which auxiliary techniques boost their performance, and analyze the errors which were not captured by the best systems.

2020

pdf bib
Statistical Deep Parsing for Spanish Using Neural Networks
Luis Chiruzzo | Dina Wonsever
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

This paper presents the development of a deep parser for Spanish that uses a HPSG grammar and returns trees that contain both syntactic and semantic information. The parsing process uses a top-down approach implemented using LSTM neural networks, and achieves good performance results in terms of syntactic constituency and dependency metrics, and also SRL. We describe the grammar, corpus and implementation of the parser. Our process outperforms a CKY baseline and other Spanish parsers in terms of global metrics and also for some specific Spanish phenomena, such as clitics reduplication and relative referents.

pdf bib
Development of a Guarani - Spanish Parallel Corpus
Luis Chiruzzo | Pedro Amarilla | Adolfo Ríos | Gustavo Giménez Lugo
Proceedings of the 12th Language Resources and Evaluation Conference

This paper presents the development of a Guarani - Spanish parallel corpus with sentence-level alignment. The Guarani sentences of the corpus use the Jopara Guarani dialect, the dialect of Guarani spoken in Paraguay, which is based on Guarani grammar and may include several Spanish loanwords or neologisms. The corpus has around 14,500 sentence pairs aligned using a semi-automatic process, containing 228,000 Guarani tokens and 336,000 Spanish tokens extracted from web sources.

pdf bib
HAHA 2019 Dataset: A Corpus for Humor Analysis in Spanish
Luis Chiruzzo | Santiago Castro | Aiala Rosá
Proceedings of the 12th Language Resources and Evaluation Conference

This paper presents the development of a corpus of 30,000 Spanish tweets that were crowd-annotated with humor value and funniness score. The corpus contains approximately 38.6% of humorous tweets with an average score of 2.04 in a scale from 1 to 5 for the humorous tweets. The corpus has been used in an automatic humor recognition and analysis competition, obtaining encouraging results from the participants.

pdf bib
A Multi-level Annotated Corpus of Scientific Papers for Scientific Document Summarization and Cross-document Relation Discovery
Ahmed AbuRa’ed | Horacio Saggion | Luis Chiruzzo
Proceedings of the 12th Language Resources and Evaluation Conference

Related work sections or literature reviews are an essential part of every scientific article being crucial for paper reviewing and assessment. The automatic generation of related work sections can be considered an instance of the multi-document summarization problem. In order to allow the study of this specific problem, we have developed a manually annotated, machine readable data-set of related work sections, cited papers (e.g. references) and sentences, together with an additional layer of papers citing the references. We additionally present experiments on the identification of cited sentences, using as input citation contexts. The corpus alongside the gold standard are made available for use by the scientific community.

2018

pdf bib
Using Context to Improve the Spanish WordNet Translation
Alfonso Methol | Guillermo López | Juan Álvarez | Luis Chiruzzo | Dina Wonsever
Proceedings of the 9th Global Wordnet Conference

We present some strategies for improving the Spanish version of WordNet, part of the MCR, selecting new lemmas for the Spanish synsets by translating the lemmas of the corresponding English synsets. We used four simple selectors that resulted in a considerable improvement of the Spanish WordNet coverage, but with relatively lower precision, then we defined two context based selectors that improved the precision of the translations.

pdf bib
A Crowd-Annotated Spanish Corpus for Humor Analysis
Santiago Castro | Luis Chiruzzo | Aiala Rosá | Diego Garat | Guillermo Moncecchi
Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media

Computational Humor involves several tasks, such as humor recognition, humor generation, and humor scoring, for which it is useful to have human-curated data. In this work we present a corpus of 27,000 tweets written in Spanish and crowd-annotated by their humor value and funniness score, with about four annotations per tweet, tagged by 1,300 people over the Internet. It is equally divided between tweets coming from humorous and non-humorous accounts. The inter-annotator agreement Krippendorff’s alpha value is 0.5710. The dataset is available for general usage and can serve as a basis for humor detection and as a first step to tackle subjectivity.

pdf bib
Spanish HPSG Treebank based on the AnCora Corpus
Luis Chiruzzo | Dina Wonsever
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
What Sentence are you Referring to and Why? Identifying Cited Sentences in Scientific Literature
Ahmed AbuRa’ed | Luis Chiruzzo | Horacio Saggion
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

In the current context of scientific information overload, text mining tools are of paramount importance for researchers who have to read scientific papers and assess their value. Current citation networks, which link papers by citation relationships (reference and citing paper), are useful to quantitatively understand the value of a piece of scientific work, however they are limited in that they do not provide information about what specific part of the reference paper the citing paper is referring to. This qualitative information is very important, for example, in the context of current community-based scientific summarization activities. In this paper, and relying on an annotated dataset of co-citation sentences, we carry out a number of experiments aimed at, given a citation sentence, automatically identify a part of a reference paper being cited. Additionally our algorithm predicts the specific reason why such reference sentence has been cited out of five possible reasons.

2016

pdf bib
Some strategies for the improvement of a Spanish WordNet
Matias Herrera | Javier Gonzalez | Luis Chiruzzo | Dina Wonsever
Proceedings of the 8th Global WordNet Conference (GWC)

Although there are currently several versions of Princeton WordNet for different languages, the lack of development of some of these versions does not make it possible to use them in different Natural Language Processing applications. So is the case of the Spanish Wordnet contained in the Multilingual Central Repository (MCR), which we tried unsuccessfully to incorporate into an anaphora resolution application and also in search terms expansion. In this situation, different strategies to improve MCR Spanish WordNet coverage were put forward and tested, obtaining encouraging results. A specific process was conducted to increase the number of adverbs, and a few simple processes were applied which made it possible to increase, at a very low cost, the number of terms in the Spanish WordNet. Finally, a more complex method based on distributional semantics was proposed, using the relations between English Wordnet synsets, also returning positive results.

2013

pdf bib
Adaptation of a Rule-Based Translator to Río de la Plata Spanish
Ernesto López | Luis Chiruzzo | Dina Wonsever
Proceedings of the Workshop on Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants