Cui Ding
2026
The MultiplEYE Text Corpus: Towards a Diverse and Ever-Expanding Multilingual Text Corpus
Ramunė Kasperė | Anna Bondar | Sergiu Nisioi | Maja Stegenwallner-Schütz | Hanne B. Søndergaard Knudsen | Ana Matić | Eva Pavlinušić Vilus | Dorota Klimek-Jankowska | Chiara Tschirner | Not Battesta Soliva | Deborah N. Jakobi | Cui Ding | Dima Abu Romi | Cengiz Acarturk | Matilda Agdler | Anton Marius Alexandru | Mohd Faizan Ansari | Annalisa Arcidiacono | Elizabete Ausma Velta Barisa | Ana Bautista | Lisa Beinborn | Yevgeni Berzak | Nedeljka Bjelanović | Anna Isabelle Bothmann | Jan Brasser | Caterina Cacioli | Anila Çepani | Ilze Ceple | Adelina Cerpja | Dalí Chirino | Jan Chromý | Alessandro Corona Mendozza | Iria de-Dios-Flores | Nazik Dinçtopal Deniz | Ana Došen | Kristian Elersič | Inmaculada Fajardo | Zigmunds Freibergs | Angelina Ganebnaya | Shan Gao | Jéssica Gomes | Annjo Klungervik Greenall | Alba Haveriku | Miao He | Anamaria Hodivoianu | Yu-Yin Hsu | Amanda Isaksen | Andreia Janeiro | Kristine Jensen de López | Aleksandar Jevremovic | Vojislav Jovanovic | Hanna Kędzierska | Nik Kharlamov | Sara Kosutar | Nelda Kote | Vanja Kovic | Izabela Krejtz | Thyra Krosness | Oleksandra Kuvshynova | Eilam Lavy | Ella Lion | Marta Łockiewicz | Kaidi Lõo | Paula Luegi | Mircea Mihai Marin | Clara Martin | Svitlana Matvieieva | Diane C. Mézière | Xavier Mínguez-López | Valeriia Modina | Jurgita Motiejūnienė | Marie-Luise Müller | Tolgonai Nasipbek kyzy | Jamal Abdul Nasir | Johanne S. K. Nedergård | Ayşegül Özkan | Patrizia Paggio | Marijan Palmović | Maria Christina Panagiotopoulou | Alberto Parola | Helena Pérez | Klaudia Petersen | Anja Podlesek | Eva Pospíšilová | Marta Praulina | Mikuláš Preininger | Loredana Pungă | Diego Rossini | Špela Rot | Habib Sani Yahaya | Irina A. Sekerina | Anne Gabija Skadina | Jordi Solé-Casals | Lonneke van der Plas | Saara M. Varjopuro | Spyridoula Varlokosta | João Veríssimo | Oskari Juhapekka Virtanen | Nemanja Vračar | Mila Vulchanova | Ahmad Mustapha Wali | Peizheng Wu | Nilgün Yücel | Stefan Frank | Nora Hollenstein | Lena Jäger
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present the MultiplEYE Text Corpus, a large-scale, document-level, multi-parallel resource designed to advance cross-linguistic research on reading and language processing. The corpus provides paragraph-level alignment for texts in 39 languages spanning seven language families and seven scripts. Unlike many existing multilingual corpora, a substantial number of documents were originally written in languages other than English, reducing English-centric bias and supporting more typologically diverse investigations. The texts are carefully selected to balance linguistic richness with experimental feasibility, particularly for eye-tracking-while-reading studies. Developed within a multi-lab initiative, the MultiplEYE Text Corpus follows unified translation, alignment, and experimental design guidelines to ensure cross-linguistic comparability. Its inclusion of texts varying in type and difficulty enables research on discourse-level processing, genre effects, and individual differences across a wide range of languages. The text corpus and accompanying metadata provide a robust foundation for multilingual psycholinguistic and computational modeling research. Data and materials are publicly available at https://doi.org/10.23668/psycharchives.21750.
2025
Modeling Bottom-up Information Quality during Language Processing
Cui Ding | Yanning Yin | Lena Ann Jäger | Ethan Wilcox
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Contemporary theories model language processing as integrating both top-down expectations and bottom-up inputs. One major prediction of such models is that the quality of the bottom-up inputs modulates ease of processing—noisy inputs should lead to difficult and effortful comprehension. We test this prediction in the domain of reading. First, we propose an information-theoretic operationalization for the “quality” of bottom-up information as the mutual information (MI) between visual information and word identity. We formalize this prediction in a mathematical model of reading as a Bayesian update. Second, we test our operationalization by comparing participants’ reading times in conditions where words’ information quality has been reduced, either by occluding their top or bottom half, with full words. We collect data in English and Chinese. We then use multimodal language models to estimate the mutual information between visual inputs and words. We use these data to estimate the specific effect of reduced information quality on reading times. Finally, we compare how information is distributed across visual forms. In English and Chinese, the upper half contains more information about word identity than the lower half. However, the asymmetry is more pronounced in English, a pattern which is reflected in the reading times.
ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords
Sina Ahmadi | Micha David Hess | Elena Álvarez-Mellado | Alessia Battisti | Cui Ding | Anne Göhring | Yingqiang Gao | Zifan Jiang | Andrianos Michail | Peshmerge Morad | Joel Niklaus | Maria Christina Panagiotopoulou | Stefano Perrella | Juri Opitz | Anastassia Shaitarova | Rico Sennrich
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Lexical borrowing, the adoption of words from one language into another, is a ubiquitous linguistic phenomenon influenced by geopolitical, societal, and technological factors. This paper introduces ConLoan, a novel contrastive dataset comprising sentences with and without loanwords across 10 languages. Through systematic evaluation using this dataset, we investigate how state-of-the-art machine translation and language models process loanwords compared to their native alternatives. Our experiments reveal that these systems show systematic preferences for loanwords over native terms and exhibit varying performance across languages. These findings provide valuable insights for developing more linguistically robust NLP systems.
Using Information Theory to Characterize Prosodic Typology: The Case of Tone, Pitch-Accent and Stress-Accent
Ethan Wilcox | Cui Ding | Giovanni Acampa | Tiago Pimentel | Alex Warstadt | Tamar I Regev
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper argues that the relationship between lexical identity and prosody—one well-studied parameter of linguistic variation—can be characterized using information theory. We predict that languages that use prosody to make lexical distinctions should exhibit a higher mutual information between word identity and prosody, compared to languages that don’t. We test this hypothesis in the domain of pitch, which is used to make lexical distinctions in tonal languages, like Cantonese. We use a dataset of speakers reading sentences aloud in ten languages across five language families to estimate the mutual information between the text and their pitch curves. We find that, across languages, pitch curves display similar amounts of entropy. However, these curves are easier to predict given their associated text in the tonal languages, compared to pitch- and stress-accent languages, and thus the mutual information is higher in these languages, supporting our hypothesis. Our results support perspectives that view linguistic typology as gradient, rather than categorical.
Co-authors
- Lena Ann Jäger 2
- Maria Christina Panagiotopoulou 2
- Ethan Wilcox 2
- Jamal Abdul Nasir 1
- Dima Abu Romi 1
- Giovanni Acampa 1
- Cengiz Acarturk 1
- Matilda Agdler 1
- Sina Ahmadi 1
- Anton Marius Alexandru 1
- Mohd Faizan Ansari 1
- Annalisa Arcidiacono 1
- Hanne B. Søndergaard Knudsen 1
- Elizabete Ausma Velta Barisa 1
- Not Battesta Soliva 1
- Alessia Battisti 1
- Ana Bautista 1
- Lisa Beinborn 1
- Yevgeni Berzak 1
- Nedeljka Bjelanović 1
- Anna Bondar 1
- Anna Isabelle Bothmann 1
- Jan Brasser 1
- Caterina Cacioli 1
- Ilze Ceple 1
- Adelina Cerpja 1
- Dalí Chirino 1
- Jan Chromý 1
- Alessandro Corona Mendozza 1
- Nazik Dinctopal Deniz 1
- Ana Došen 1
- Kristian Elersič 1
- Inmaculada Fajardo 1
- Stefan L. Frank 1
- Zigmunds Freibergs 1
- Angelina Ganebnaya 1
- Yingqiang Gao 1
- Shan Gao 1
- Jéssica Gomes 1
- Annjo Klungervik Greenall 1
- Anne Göhring 1
- Alba Haveriku 1
- Miao He 1
- Micha David Hess 1
- Anamaria Hodivoianu 1
- Nora Hollenstein 1
- Yu-Yin Hsu 1
- Amanda Isaksen 1
- Deborah N. Jakobi 1
- Andreia Janeiro 1
- Kristine Jensen de López 1
- Aleksandar Jevremovic 1
- Zifan Jiang 1
- Vojislav Jovanovic 1
- Ramunė Kasperė 1
- Nik Kharlamov 1
- Dorota Klimek-Jankowska 1
- Nelda Kote 1
- Vanja Kovic 1
- Sara Košutar 1
- Izabela Krejtz 1
- Thyra Krosness 1
- Oleksandra Kuvshynova 1
- Hanna Kędzierska 1
- Eilam Lavy 1
- Ella Lion 1
- Paula Luegi 1
- Kaidi Lõo 1
- Mircea Mihai Marin 1
- Clara Martin 1
- Ana Matić 1
- Svitlana Matvieieva 1
- Andrianos Michail 1
- Valeriia Modina 1
- Peshmerge Morad 1
- Jurgita Motiejūnienė 1
- Diane C. Mézière 1
- Xavier Mínguez-López 1
- Marie-Luise Müller 1
- Tolgonai Nasipbek kyzy 1
- Johanne S. K. Nedergård 1
- Joel Niklaus 1
- Sergiu Nisioi 1
- Juri Opitz 1
- Patrizia Paggio 1
- Marijan Palmović 1
- Alberto Parola 1
- Eva Pavlinušić Vilus 1
- Stefano Perrella 1
- Klaudia Petersen 1
- Tiago Pimentel 1
- Anja Podlesek 1
- Eva Pospíšilová 1
- Marta Praulina 1
- Mikuláš Preininger 1
- Loredana Pungă 1
- Helena Pérez 1
- Tamar I Regev 1
- Diego Rossini 1
- Špela Rot 1
- Habib Sani Yahaya 1
- Irina A. Sekerina 1
- Rico Sennrich 1
- Anastassia Shaitarova 1
- Anne Gabija Skadina 1
- Jordi Solé-Casals 1
- Maja Stegenwallner-Schütz 1
- Chiara Tschirner 1
- Saara M. Varjopuro 1
- Spyridoula Varlokosta 1
- João Veríssimo 1
- Oskari Juhapekka Virtanen 1
- Nemanja Vračar 1
- Mila Vulchanova 1
- Ahmad Mustapha Wali 1
- Alex Warstadt 1
- Peizheng Wu 1
- Yanning Yin 1
- Nilgün Yücel 1
- Iria de-Dios-Flores 1
- Lonneke van der Plas 1
- Elena Álvarez-Mellado 1
- Anila Çepani 1
- Ayşegül Özkan 1
- Marta Łockiewicz 1