Toru Urakawa
2025
Unsupervised Sentence Readability Estimation Based on Parallel Corpora for Text Simplification
Rina Miyata | Toru Urakawa | Hideaki Tamori | Tomoyuki Kajiwara
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)
We train a relative sentence readability estimator from a corpus without absolute sentence readability. Since sentence readability depends on the reader’s knowledge, objective and absolute readability assessments require costly annotation by experts. Therefore, few corpora have absolute sentence readability, while parallel corpora for text simplification with relative sentence readability between two sentences are available for many languages. With multilingual applications in mind, we propose a method to estimate relative sentence readability based on parallel corpora for text simplification. Experimental results on ranking a set of English sentences by readability show that our method outperforms existing unsupervised methods and is comparable to supervised methods based on absolute sentence readability.
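The core idea of such pairwise supervision can be illustrated with a minimal sketch. This is not the paper's method: the surface features, perceptron-style updates, and toy sentence pairs below are all illustrative assumptions. It only shows how a parallel simplification corpus, where each simple sentence is assumed more readable than its complex counterpart, yields training signal for a relative readability scorer without any absolute labels.

```python
# Minimal illustrative sketch (NOT the paper's method): learn a relative
# readability scorer from complex/simple pairs via pairwise margin updates.
# Features and example pairs are hypothetical.

def features(sentence):
    words = sentence.split()
    avg_len = sum(len(w) for w in words) / len(words)
    return [len(words), avg_len]  # longer, denser sentences tend to be harder

def score(w, x):
    # Higher score = estimated to be more readable.
    return sum(wi * xi for wi, xi in zip(w, x))

def train(pairs, epochs=20, lr=0.1):
    # Each pair: (complex_sentence, simple_sentence).
    # Relative constraint from the corpus: score(simple) > score(complex).
    w = [0.0, 0.0]
    for _ in range(epochs):
        for complex_s, simple_s in pairs:
            xc, xs = features(complex_s), features(simple_s)
            if score(w, xs) - score(w, xc) < 1.0:  # margin violated
                w = [wi + lr * (s - c) for wi, s, c in zip(w, xs, xc)]
    return w

# Toy parallel simplification pairs (illustrative only).
pairs = [
    ("The committee postponed deliberations indefinitely.",
     "The group put off the talks."),
    ("Precipitation is anticipated throughout the metropolitan area.",
     "Rain is expected in the city."),
]
w = train(pairs)

# Rank an unseen sentence set from most to least readable, as in the
# paper's English ranking evaluation setting.
sents = ["He ate lunch.", "The adjudication was subsequently invalidated."]
ranked = sorted(sents, key=lambda s: score(w, features(s)), reverse=True)
```

The learned scorer never sees an absolute readability label; it only exploits the relative ordering implicit in each complex/simple pair, which is what makes parallel simplification corpora usable as supervision.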
2024
A Japanese News Simplification Corpus with Faithfulness
Toru Urakawa | Yuya Taguchi | Takuro Niitsuma | Hideaki Tamori
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Text Simplification enhances the readability of texts for specific audiences. However, automated models may introduce unwanted content or omit essential details, necessitating a focus on maintaining faithfulness to the original input. Furthermore, existing simplified corpora contain instances of low faithfulness. Motivated by this issue, we present a new Japanese simplification corpus designed to prioritize faithfulness. Our collection comprises 7,075 paired sentences simplified from newspaper articles. This process involved collaboration with language education experts who followed guidelines balancing readability and faithfulness. Through corpus analysis, we confirmed that our dataset preserves the content of the original text, including personal names, dates, and city names. Manual evaluation showed that our corpus robustly maintains faithfulness to the original text, surpassing other existing corpora. Furthermore, evaluation by non-native readers confirmed its readability to the target audience. Through the experiment of fine-tuning and in-context learning, we demonstrated that our corpus enhances faithful sentence simplification.