Alfredo Maldonado

Also published as: Alfredo Maldonado Guerra, Alfredo Maldonado-Guerra

2020

English WordNet Random Walk Pseudo-Corpora
Filip Klubička | Alfredo Maldonado | Abhijit Mahalunkar | John Kelleher
Proceedings of the Twelfth Language Resources and Evaluation Conference

This is a resource description paper that describes the creation and properties of a set of pseudo-corpora generated artificially from a random walk over the English WordNet taxonomy. Our WordNet taxonomic random walk implementation allows the exploration of different random walk hyperparameters and the generation of a variety of different pseudo-corpora. We find that different combinations of parameters result in varying statistical properties of the generated pseudo-corpora. We have published a total of 81 pseudo-corpora that we have used in our previous research, but have not exhausted all possible combinations of hyperparameters, which is why we have also published a codebase that allows the generation of additional WordNet taxonomic pseudo-corpora as needed. Ultimately, such pseudo-corpora can be used to train taxonomic word embeddings, as a way of transferring taxonomic knowledge into a word embedding space.
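As a rough illustration of the idea in this abstract, the sketch below generates a tiny pseudo-corpus by random walks over a taxonomy. It is not the authors' published codebase: the toy taxonomy, the function names, and the two hyperparameters (`max_len` and `stop_prob`) are assumptions made for this example, and a real run would walk the English WordNet taxonomy instead.

```python
import random

# Toy hypernym -> hyponyms taxonomy (hypothetical stand-in for WordNet).
TAXONOMY = {
    "entity": ["animal", "plant"],
    "animal": ["dog", "cat"],
    "plant": ["tree", "flower"],
    "dog": [], "cat": [], "tree": [], "flower": [],
}


def build_neighbours(taxonomy):
    """Turn the parent -> children map into an undirected adjacency map."""
    neighbours = {node: set(children) for node, children in taxonomy.items()}
    for parent, children in taxonomy.items():
        for child in children:
            neighbours[child].add(parent)
    # Sort so that walks are reproducible for a given seed.
    return {node: sorted(adjacent) for node, adjacent in neighbours.items()}


def random_walk_sentence(neighbours, rng, max_len=8, stop_prob=0.15):
    """Walk from a random start node; each visited node becomes one token."""
    node = rng.choice(sorted(neighbours))
    sentence = [node]
    # Assumed hyperparameters: a hard length cap and a per-step stop probability.
    while len(sentence) < max_len and rng.random() >= stop_prob:
        node = rng.choice(neighbours[node])
        sentence.append(node)
    return sentence


def generate_pseudo_corpus(taxonomy, n_sentences=5, seed=0, **walk_kwargs):
    """Generate n_sentences pseudo-sentences, one per random walk."""
    rng = random.Random(seed)
    neighbours = build_neighbours(taxonomy)
    return [random_walk_sentence(neighbours, rng, **walk_kwargs)
            for _ in range(n_sentences)]


if __name__ == "__main__":
    for sentence in generate_pseudo_corpus(TAXONOMY, n_sentences=3, seed=42):
        print(" ".join(sentence))
```

Each walk yields one pseudo-sentence, so varying the assumed `max_len` and `stop_prob` changes sentence length and word frequency in the output, which is the kind of hyperparameter effect the abstract describes.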

2019

Synthetic, yet natural: Properties of WordNet random walk corpora and the impact of rare words on embedding performance
Filip Klubička | Alfredo Maldonado | Abhijit Mahalunkar | John Kelleher
Proceedings of the 10th Global Wordnet Conference

Creating word embeddings that reflect semantic relationships encoded in lexical knowledge resources is an open challenge. One approach is to use a random walk over a knowledge graph to generate a pseudo-corpus and use this corpus to train embeddings. However, the effect of the shape of the knowledge graph on the generated pseudo-corpora, and on the resulting word embeddings, has not been studied. To explore this, we use English WordNet, constrained to the taxonomic (tree-like) portion of the graph, as a case study. We investigate the properties of the generated pseudo-corpora, and their impact on the resulting embeddings. We find that the distributions in the pseudo-corpora exhibit properties found in natural corpora, such as Zipf’s and Heaps’ law, and also observe that the proportion of rare words in a pseudo-corpus affects the performance of its embeddings on word similarity.
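The two corpus statistics named in this abstract can be checked with very little code. The sketch below is an assumption about how one might do so, not the paper's analysis: `rank_frequency` and `zipf_slope` look at the rank-frequency curve (Zipf's law predicts a roughly linear relation in log-log space), and `vocabulary_growth` tracks vocabulary size against token count (the curve described by Heaps' law). All function names are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word type first."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct word types; a Zipf-like corpus gives
    a negative slope.
    """
    points = [(math.log(r), math.log(f)) for r, f in rank_frequency(tokens)]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    return covariance / variance


def vocabulary_growth(tokens):
    """Return (tokens seen so far, distinct types seen so far) pairs."""
    seen = set()
    growth = []
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((position, len(seen)))
    return growth
```

Applied to the flattened token stream of a pseudo-corpus, these functions give the raw curves that can then be compared against the same curves from a natural corpus.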

Measuring Gender Bias in Word Embeddings across Domains and Discovering New Gender Bias Word Categories
Kaytlin Chaloner | Alfredo Maldonado
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

Prior work has shown that word embeddings capture human stereotypes, including gender bias. However, there is a lack of studies testing the presence of specific gender bias categories in word embeddings across diverse domains. This paper aims to fill this gap by applying the WEAT bias detection method to four sets of word embeddings trained on corpora from four different domains: news, social networking, biomedical and a gender-balanced corpus extracted from Wikipedia (GAP). We find that some domains are definitely more prone to gender bias than others, and that the categories of gender bias present also vary for each set of word embeddings. We detect some gender bias in GAP. We also propose a simple but novel method for discovering new bias categories by clustering word embeddings. We validate this method through WEAT’s hypothesis testing mechanism and find it useful for expanding the relatively small set of well-known gender bias word categories commonly used in the literature.
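For readers unfamiliar with WEAT, the sketch below computes its effect size as defined in the original WEAT formulation: the difference between the mean associations of two target word sets with two attribute word sets, divided by the standard deviation of the associations over all target words. The function names and the choice of the sample standard deviation are assumptions of this example, it omits the permutation-based hypothesis test, and it is not the code used in the paper.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word, attrs_a, attrs_b):
    """Mean similarity to attribute set A minus mean similarity to set B."""
    sim_a = [cosine(word, a) for a in attrs_a]
    sim_b = [cosine(word, b) for b in attrs_b]
    return statistics.mean(sim_a) - statistics.mean(sim_b)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """WEAT effect size for target sets X, Y and attribute sets A, B."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    # Assumption: sample standard deviation over all target associations.
    spread = statistics.stdev(s_x + s_y)
    return (statistics.mean(s_x) - statistics.mean(s_y)) / spread
```

All inputs are lists of embedding vectors. A positive value indicates that the X targets lie closer to the A attributes, and the Y targets closer to the B attributes, than the reverse.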

2018

ADAPT at SemEval-2018 Task 9: Skip-Gram Word Embeddings for Unsupervised Hypernym Discovery in Specialised Corpora
Alfredo Maldonado | Filip Klubička
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper describes a simple but competitive unsupervised system for hypernym discovery. The system uses skip-gram word embeddings with negative sampling, trained on specialised corpora. Candidate hypernyms for an input word are predicted based on cosine similarity scores. Two sets of word embedding models were trained separately on two specialised corpora: a medical corpus and a music industry corpus. Our system scored highest in the medical domain among the competing unsupervised systems but performed poorly on the music industry domain. Our system does not depend on any external data other than raw specialised corpora.
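The candidate-ranking step mentioned in this abstract can be sketched as a nearest-neighbour search by cosine similarity. The example below is a minimal illustration under stated assumptions: the embeddings are hand-made two-dimensional toy vectors rather than trained skip-gram vectors, the function name `rank_candidates` is hypothetical, and any filtering or vocabulary restriction used by the actual system is not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query, embeddings, top_k=3):
    """Rank every other vocabulary word by cosine similarity to the query."""
    query_vector = embeddings[query]
    scored = [(word, cosine(query_vector, vector))
              for word, vector in embeddings.items()
              if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (hypothetical); real vectors would come from a trained model.
TOY_EMBEDDINGS = {
    "aspirin": (0.9, 0.1),
    "drug": (0.8, 0.2),
    "guitar": (0.1, 0.9),
    "instrument": (0.2, 0.8),
}

if __name__ == "__main__":
    for word, score in rank_candidates("aspirin", TOY_EMBEDDINGS, top_k=2):
        print(f"{word}\t{score:.3f}")
```

Because the ranking relies only on vector similarity, it needs no resource beyond the embeddings themselves, which matches the abstract's point that the system depends only on raw specialised corpora.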

CRF-Seq and CRF-DepTree at PARSEME Shared Task 2018: Detecting Verbal MWEs using Sequential and Dependency-Based Approaches
Erwan Moreau | Ashjan Alsulaimani | Alfredo Maldonado | Carl Vogel
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper describes two systems for detecting Verbal Multiword Expressions (VMWEs) which both competed in the closed track at the PARSEME VMWE Shared Task 2018. CRF-DepTree-categs implements an approach based on the dependency tree, intended to exploit the syntactic and semantic relations between tokens; CRF-Seq-nocategs implements a robust sequential method which requires only lemmas and morphosyntactic tags. Both systems ranked in the top half of the ranking, the latter ranking second for token-based evaluation. The code for both systems is published under the GNU General Public License version 3.0 and is available at http://github.com/erwanm/adapt-vmwe18.

2017

Detection of Verbal Multi-Word Expressions via Conditional Random Fields with Syntactic Dependency Features and Semantic Re-Ranking
Alfredo Maldonado | Lifeng Han | Erwan Moreau | Ashjan Alsulaimani | Koel Dutta Chowdhury | Carl Vogel | Qun Liu
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

A description of a system for identifying Verbal Multi-Word Expressions (VMWEs) in running text is presented. The system mainly exploits universal syntactic dependency features through a Conditional Random Fields (CRF) sequence model. The system competed in the Closed Track at the PARSEME VMWE Shared Task 2017, ranking 2nd place in most languages on full VMWE-based evaluation and 1st in three languages on token-based evaluation. In addition, this paper presents an option to re-rank the 10 best CRF-predicted sequences via semantic vectors, boosting its scores above other systems in the competition. We also show that all systems in the competition would struggle to beat a simple lookup baseline system and argue for a more purpose-specific evaluation scheme.

2016

Open Data Vocabularies for Assigning Usage Rights to Data Resources from Translation Projects
David Lewis | Kaniz Fatema | Alfredo Maldonado | Brian Walshe | Arturo Calvo
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

An assessment of the intellectual property requirements for data used in machine-aided translation is provided based on a recent EC-funded legal review. This is compared against the capabilities offered by current linked open data standards from the W3C for publishing and sharing translation memories from translation projects, and proposals for adequately addressing the intellectual property needs of stakeholders in translation projects using open data vocabularies are suggested.