Kosuke Sato


2025

Can Language Models Handle a Non-Gregorian Calendar? The Case of the Japanese wareki
Mutsumi Sasaki | Go Kamoda | Ryosuke Takahashi | Kosuke Sato | Kentaro Inui | Keisuke Sakaguchi | Benjamin Heinzerling
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Temporal reasoning and knowledge are essential capabilities for language models (LMs). While much prior work has analyzed and improved temporal reasoning in LMs, most studies have focused solely on the Gregorian calendar. However, many non-Gregorian systems, such as the Japanese, Hijri, and Hebrew calendars, are in active use and reflect culturally grounded conceptions of time. If and how well current LMs can accurately handle such non-Gregorian calendars has not been evaluated so far. Here, we present a systematic evaluation of how well language models handle one such non-Gregorian system: the Japanese *wareki*. We create datasets that require temporal knowledge and reasoning using *wareki* dates. Evaluating open and closed LMs, we find that some models can perform calendar conversions, but GPT-4o, DeepSeek V3, and even Japanese-centric models struggle with Japanese calendar arithmetic and knowledge involving *wareki* dates. Error analysis suggests corpus frequency of Japanese calendar expressions and a Gregorian bias in the models' knowledge as possible explanations. Our results show the importance of developing LMs that are better equipped for culture-specific tasks such as calendar understanding.
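For readers unfamiliar with the *wareki* system, the calendar arithmetic the abstract refers to reduces to an era-based offset: each era (Meiji, Taishō, Shōwa, Heisei, Reiwa) restarts year counting at 1, so Reiwa 5 corresponds to 2023. A minimal illustrative sketch of that year conversion (not the paper's evaluation code, and ignoring mid-year era boundaries) looks like this:

```python
# Illustrative sketch of wareki <-> Gregorian year conversion.
# Era start years (year 1 of each era); month/day boundaries are ignored here.
ERA_START = {
    "Meiji": 1868,
    "Taisho": 1912,
    "Showa": 1926,
    "Heisei": 1989,
    "Reiwa": 2019,
}

def wareki_to_gregorian(era: str, year: int) -> int:
    """Convert a wareki year (e.g. Reiwa 5) to a Gregorian year (2023)."""
    return ERA_START[era] + year - 1

def gregorian_to_wareki(year: int) -> tuple[str, int]:
    """Convert a Gregorian year to the era in effect for that year (simplified)."""
    # Pick the latest era whose first year is not after the given year.
    era = max((e for e, start in ERA_START.items() if start <= year), key=ERA_START.get)
    return era, year - ERA_START[era] + 1

assert wareki_to_gregorian("Reiwa", 5) == 2023
assert gregorian_to_wareki(1989) == ("Heisei", 1)  # mid-year era change not modeled
```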

2024

Multi-Criteria Evaluation Framework of Selecting Response-worthy Chats in Live Streaming
Zhantao Lai | Kosuke Sato
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Live streaming, a dynamic medium that merges real-time audiovisual content with interactive text-based chat, presents unique challenges for maintaining viewer engagement and ensuring streamers’ well-being. This study introduces a multi-criteria evaluation framework designed to identify response-worthy chats during live streaming. We propose a system that evaluates chats based on sentiment polarity and intensity, contextual relevance, and topic uniqueness. We also constructed a human-annotated dataset to validate the framework, demonstrating closer alignment with human preferences compared to single-criterion baselines. This framework not only supports the development of more responsive and engaging live streaming environments but also contributes to the broader field of dialog systems by highlighting the distinct needs of real-time, large-scale conversational contexts.
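To make the multi-criteria idea concrete, a hypothetical sketch (not the paper's implementation) might score each chat on the three criteria named in the abstract and combine them into a single response-worthiness score; the class name, weights, and threshold below are placeholders for illustration only:

```python
# Illustrative sketch of combining the three criteria from the abstract
# (sentiment polarity/intensity, contextual relevance, topic uniqueness)
# into one response-worthiness score. Weights and threshold are made up.
from dataclasses import dataclass

@dataclass
class ChatScore:
    sentiment: float   # polarity/intensity of the message, in [0, 1]
    relevance: float   # similarity to the current stream context, in [0, 1]
    uniqueness: float  # novelty of the topic vs. recent chats, in [0, 1]

    def combined(self, w_sent: float = 0.4, w_rel: float = 0.4, w_uniq: float = 0.2) -> float:
        """Weighted combination of the three criteria."""
        return w_sent * self.sentiment + w_rel * self.relevance + w_uniq * self.uniqueness

def is_response_worthy(score: ChatScore, threshold: float = 0.6) -> bool:
    """Flag a chat for the streamer when its combined score clears a threshold."""
    return score.combined() >= threshold

# Example: an emotionally intense, on-topic, fairly novel chat message.
print(is_response_worthy(ChatScore(sentiment=0.9, relevance=0.8, uniqueness=0.5)))  # True
```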