Daniel Van Strien
Also published as: Daniel van Strien
2024
Documenting Geographically and Contextually Diverse Language Data Sources
Angelina McMillan-Major | Francesco De Toni | Zaid Alyafeai | Stella Biderman | Kimbo Chen | Gérard Dupont | Hady Elsahar | Chris Emezue | Alham Fikri Aji | Suzana Ilić | Nurulaqilla Khamis | Colin Leong | Maraim Masoud | Aitor Soroa | Pedro Ortiz Suarez | Daniel van Strien | Zeerak Talat | Yacine Jernite
Northern European Journal of Language Technology, Volume 10
Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLM). This quantitative approach has resulted in concerns for the rights of data subjects represented in data collections. This concern is exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.
Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia
Lucie Lucie-Aimée | Angela Fan | Tajuddeen Gwadabe | Isaac Johnson | Fabio Petroni | Daniel van Strien
2022
Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0
Francesco De Toni | Christopher Akiki | Javier De La Rosa | Clémentine Fourrier | Enrique Manjavacas | Stefan Schweter | Daniel Van Strien
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models
In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in 3 languages as a test bed, we use prompts to extract possible named entities. Our results show that a naive approach to prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but they also highlight the potential of such an approach for historical languages lacking labeled datasets. Moreover, we find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts.
Co-authors
- Francesco De Toni 2
- Alham Fikri Aji 1
- Christopher Akiki 1
- Zaid Alyafeai 1
- Stella Biderman 1
- Kimbo Chen 1
- Javier De La Rosa 1
- Gérard Dupont 1
- Hady Elsahar 1
- Chris Chinenye Emezue 1
- Angela Fan 1
- Clémentine Fourrier 1
- Tajuddeen Gwadabe 1
- Suzana Ilic 1
- Yacine Jernite 1
- Isaac Johnson 1
- Nurulaqilla Khamis 1
- Colin Leong 1
- Lucie Lucie-Aimée 1
- Enrique Manjavacas 1
- Maraim Masoud 1
- Angelina McMillan-Major 1
- Pedro Ortiz Suarez 1
- Fabio Petroni 1
- Stefan Schweter 1
- Aitor Soroa 1
- Zeerak Talat 1