Chester Palen-Michel


2022

pdf
Multilingual Open Text Release 1: Public Domain News in 44 Languages
Chester Palen-Michel | June Kim | Constantine Lignos
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present a Multilingual Open Text (MOT), a new multilingual corpus containing text in 44 languages, many of which have limited existing text resources for natural language processing. The first release of the corpus contains over 2.8 million news articles and an additional 1 million short snippets (photo captions, video descriptions, etc.) published between 2001–2022 and collected from Voice of America’s news websites. We describe our process for collecting, filtering, and processing the data. The source material is in the public domain, our collection is licensed using a creative commons license (CC BY 4.0), and all software used to create the corpus is released under the MIT License. The corpus will be regularly updated as additional documents are published.

pdf
Toward More Meaningful Resources for Lower-resourced Languages
Constantine Lignos | Nolan Holley | Chester Palen-Michel | Jonne Sälevä
Findings of the Association for Computational Linguistics: ACL 2022

In this position paper, we describe our perspective on how meaningful resources for lower-resourced languages should be developed in connection with the speakers of those languages. Before advancing that position, we first examine two massively multilingual resources used in language technology development, identifying shortcomings that limit their usefulness. We explore the contents of the names stored in Wikidata for a few lower-resourced languages and find that many of them are not in fact in the languages they claim to be, requiring non-trivial effort to correct. We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand-annotated data. We then discuss the importance of creating annotations for lower-resourced languages in a thoughtful and ethical way that includes the language speakers as part of the development process. We conclude with recommended guidelines for resource development.

2021

pdf
SeqScore: Addressing Barriers to Reproducible Named Entity Recognition Evaluation
Chester Palen-Michel | Nolan Holley | Constantine Lignos
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

To address a looming crisis of unreproducible evaluation for named entity recognition, we propose guidelines and introduce SeqScore, a software package to improve reproducibility. The guidelines we propose are extremely simple and center around transparency regarding how chunks are encoded and scored. We demonstrate that despite the apparent simplicity of NER evaluation, unreported differences in the scoring procedure can result in changes to scores that are both of noticeable magnitude and statistically significant. We describe SeqScore, which addresses many of the issues that cause replication failures.

pdf
MasakhaNER: Named Entity Recognition for African Languages
David Ifeoluwa Adelani | Jade Abbott | Graham Neubig | Daniel D’souza | Julia Kreutzer | Constantine Lignos | Chester Palen-Michel | Happy Buzaaba | Shruti Rijhwani | Sebastian Ruder | Stephen Mayhew | Israel Abebe Azime | Shamsuddeen H. Muhammad | Chris Chinenye Emezue | Joyce Nakatumba-Nabende | Perez Ogayo | Aremu Anuoluwapo | Catherine Gitau | Derguene Mbaye | Jesujoba Alabi | Seid Muhie Yimam | Tajuddeen Rabiu Gwadabe | Ignatius Ezeani | Rubungo Andre Niyongabo | Jonathan Mukiibi | Verrah Otiende | Iroro Orife | Davis David | Samba Ngom | Tosin Adewumi | Paul Rayson | Mofetoluwa Adeyemi | Gerald Muriuki | Emmanuel Anebi | Chiamaka Chukwuneke | Nkiruka Odu | Eric Peter Wairagala | Samuel Oyerinde | Clemencia Siro | Tobius Saul Bateesa | Temilola Oloyede | Yvonne Wambui | Victor Akinode | Deborah Nabagereka | Maurice Katusiime | Ayodele Awokoya | Mouhamadane MBOUP | Dibora Gebreyohannes | Henok Tilaye | Kelechi Nwaike | Degaga Wolde | Abdoulaye Faye | Blessing Sibanda | Orevaoghene Ahia | Bonaventure F. P. Dossou | Kelechi Ogueji | Thierno Ibrahima DIOP | Abdoulaye Diallo | Adewale Akinfaderin | Tendai Marengereke | Salomey Osei
Transactions of the Association for Computational Linguistics, Volume 9

Abstract We take a step towards addressing the under- representation of the African continent in NLP research by bringing together different stakeholders to create the first large, publicly available, high-quality dataset for named entity recognition (NER) in ten African languages. We detail the characteristics of these languages to help researchers and practitioners better understand the challenges they pose for NER tasks. We analyze our datasets and conduct an extensive empirical evaluation of state- of-the-art methods across both supervised and transfer learning settings. Finally, we release the data, code, and models to inspire future research on African NLP.1

2019

pdf
SARAL: A Low-Resource Cross-Lingual Domain-Focused Information Retrieval System for Effective Rapid Document Triage
Elizabeth Boschee | Joel Barry | Jayadev Billa | Marjorie Freedman | Thamme Gowda | Constantine Lignos | Chester Palen-Michel | Michael Pust | Banriskhem Kayang Khonglah | Srikanth Madikeri | Jonathan May | Scott Miller
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

With the increasing democratization of electronic media, vast information resources are available in less-frequently-taught languages such as Swahili or Somali. That information, which may be crucially important and not available elsewhere, can be difficult for monolingual English speakers to effectively access. In this paper we present an end-to-end cross-lingual information retrieval (CLIR) and summarization system for low-resource languages that 1) enables English speakers to search foreign language repositories of text and audio using English queries, 2) summarizes the retrieved documents in English with respect to a particular information need, and 3) provides complete transcriptions and translations as needed. The SARAL system achieved the top end-to-end performance in the most recent IARPA MATERIAL CLIR+summarization evaluations. Our demonstration system provides end-to-end open query retrieval and summarization capability, and presents the original source text or audio, speech transcription, and machine translation, for two low resource languages.