Miao He


2026

The MultiplEYE Text Corpus: Towards a Diverse and Ever-Expanding Multilingual Text Corpus
Ramunė Kasperė | Anna Bondar | Sergiu Nisioi | Maja Stegenwallner-Schütz | Hanne B. Søndergaard Knudsen | Ana Matić | Eva Pavlinušić Vilus | Dorota Klimek-Jankowska | Chiara Tschirner | Not Battesta Soliva | Deborah N. Jakobi | Cui Ding | Dima Abu Romi | Cengiz Acarturk | Matilda Agdler | Anton Marius Alexandru | Mohd Faizan Ansari | Annalisa Arcidiacono | Elizabete Ausma Velta Barisa | Ana Bautista | Lisa Beinborn | Yevgeni Berzak | Nedeljka Bjelanović | Anna Isabelle Bothmann | Jan Brasser | Caterina Cacioli | Anila Çepani | Ilze Ceple | Adelina Cerpja | Dalí Chirino | Jan Chromý | Alessandro Corona Mendozza | Iria de-Dios-Flores | Nazik Dinçtopal Deniz | Ana Došen | Kristian Elersič | Inmaculada Fajardo | Zigmunds Freibergs | Angelina Ganebnaya | Shan Gao | Jéssica Gomes | Annjo Klungervik Greenall | Alba Haveriku | Miao He | Anamaria Hodivoianu | Yu-Yin Hsu | Amanda Isaksen | Andreia Janeiro | Kristine Jensen de López | Aleksandar Jevremovic | Vojislav Jovanovic | Hanna Kędzierska | Nik Kharlamov | Sara Kosutar | Nelda Kote | Vanja Kovic | Izabela Krejtz | Thyra Krosness | Oleksandra Kuvshynova | Eilam Lavy | Ella Lion | Marta Łockiewicz | Kaidi Lõo | Paula Luegi | Mircea Mihai Marin | Clara Martin | Svitlana Matvieieva | Diane C. Mézière | Xavier Mínguez-López | Valeriia Modina | Jurgita Motiejūnienė | Marie-Luise Müller | Tolgonai Nasipbek kyzy | Jamal Abdul Nasir | Johanne S. K. Nedergård | Ayşegül Özkan | Patrizia Paggio | Marijan Palmović | Maria Christina Panagiotopoulou | Alberto Parola | Helena Pérez | Klaudia Petersen | Anja Podlesek | Eva Pospíšilová | Marta Praulina | Mikuláš Preininger | Loredana Pungă | Diego Rossini | Špela Rot | Habib Sani Yahaya | Irina A. Sekerina | Anne Gabija Skadina | Jordi Solé-Casals | Lonneke van der Plas | Saara M. Varjopuro | Spyridoula Varlokosta | João Veríssimo | Oskari Juhapekka Virtanen | Nemanja Vračar | Mila Vulchanova | Ahmad Mustapha Wali | Peizheng Wu | Nilgün Yücel | Stefan Frank | Nora Hollenstein | Lena Jäger
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present the MultiplEYE Text Corpus, a large-scale, document-level, multi-parallel resource designed to advance cross-linguistic research on reading and language processing. The corpus provides paragraph-level alignment for texts in 39 languages spanning seven language families and seven scripts. Unlike many existing multilingual corpora, a substantial number of documents were originally written in languages other than English, reducing English-centric bias and supporting more typologically diverse investigations. The texts are carefully selected to balance linguistic richness with experimental feasibility, particularly for eye-tracking-while-reading studies. Developed within a multi-lab initiative, the MultiplEYE Text Corpus follows unified translation, alignment, and experimental design guidelines to ensure cross-linguistic comparability. Its inclusion of texts varying in type and difficulty enables research on discourselevel processing, genre effects, and individual differences across a wide range of languages. The text corpus and accompanying metadata provide a robust foundation for multilingual psycholinguistic and computational modeling research. Data and materials are publicly available at https://doi.org/10.23668/psycharchives.21750.

2025

This paper addresses the important yet underexplored task of **multi-class sentiment analysis (MCSA)**, which remains challenging due to the subtle semantic differences between adjacent sentiment categories and the scarcity of high-quality annotated data. To tackle these challenges, we propose **RD-MCSA** (**R**ationales and **D**emonstrations-based **M**ulti-**C**lass **S**entiment **A**nalysis), an In-Context Learning (ICL) framework designed to enhance MCSA performance under limited supervision by integrating classification rationales with adaptively selected demonstrations. First, semantically grounded classification rationales are generated from a representative, class-balanced subset of annotated samples selected using a tailored balanced coreset algorithm. These rationales are then paired with demonstrations chosen through a similarity-based mechanism powered by a **multi-kernel Gaussian process (MK-GP)**, enabling large language models (LLMs) to more effectively capture fine-grained sentiment distinctions. Experiments on five benchmark datasets demonstrate that RD-MCSA consistently outperforms both supervised baselines and standard ICL methods across various evaluation metrics.
Search
Co-authors
Venues
Fix author