Martin Benjamin


2017

Towards Producing Human-Validated Translation Resources for the Fula language through WordNet Linking
Khalil Mrini | Martin Benjamin
Proceedings of the Workshop Human-Informed Translation and Interpreting Technology

We propose methods to link automatically parsed linguistic data to the WordNet. We apply these methods to a trilingual dictionary in Fula, English and French. Dictionary entry parsing is used to collect the linguistic data. Then we connect it to the Open Multilingual WordNet (OMW) through two attempts, and use confidence scores to quantify accuracy. We obtained 11,000 entries in parsing and linked about 58% to the OMW on the first attempt, and an additional 14% on the second. These links are due to be validated by Fula speakers before being added to the Kamusi Project’s database.

2016

Problems and Procedures to Make Wordnet Data (Retro)Fit for a Multilingual Dictionary
Martin Benjamin
Proceedings of the 8th Global WordNet Conference (GWC)

The data compiled through many Wordnet projects can be a rich source of seed information for a multilingual dictionary. However, the original Princeton WordNet was not intended as a dictionary per se, and spawning other languages from it introduces inherent ambiguity that confounds precise inter-lingual linking. This paper discusses a new presentation of existing Wordnet data that displays joints (distance between predicted links) and substitution (degree of equivalence between confirmed pairs) as a two-tiered horizontal ontology. Improvements to make Wordnet data function as lexicography include term-specific English definitions where the topical synset glosses are inadequate, validation of mappings between each member of an English synset and each member of the synsets from other languages, removal of erroneous translation terms, creation of own-language definitions for the many languages where those are absent, and validation of predicted links between non-English pairs. The paper describes the current state and future directions of a system to crowdsource human review and expansion of Wordnet data, using gamification to build consensus-validated, dictionary-caliber data for languages now in the Global WordNet as well as new languages that do not have formal Wordnet projects of their own.

2015

Kamusi pre-D-source-side disambiguation and a sense aligned multilingual lexicon
Martin Benjamin | Amar Mukunda | Jeff Allen
Proceedings of Translating and the Computer 37

2014

Elephant Beer and Shinto Gates: Managing Similar Concepts in a Multilingual Database
Martin Benjamin
Proceedings of the Seventh Global Wordnet Conference

Small Languages, Big Data: Multilingual Computational Tools and Techniques for the Lexicography of Endangered Languages
Martin Benjamin | Paula Radetzky
Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages

Collaboration in the Production of a Massively Multilingual Lexicon
Martin Benjamin
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper discusses the multiple approaches to collaboration that the Kamusi Project is employing in the creation of a massively multilingual lexical resource. The project’s data structure enables the inclusion of large amounts of rich data within each sense-specific entry, with transitive concept-based links across languages. Data collection involves mining existing data sets, language experts using an online editing system, crowdsourcing, and games with a purpose. The paper discusses the benefits and drawbacks of each of these elements, and the steps the project is taking to account for those. Special attention is paid to guiding crowd members with targeted questions that produce results in a specific format. Collaboration is seen as an essential method for generating large amounts of linguistic data, as well as for validating the data so it can be considered trustworthy.