Xiao Xiao


2026

Large Language Models (LLMs) excel at mathematical reasoning in English, but their performance in low-resource languages remains underexplored. This gap is particularly critical in the Indonesian context, where equitable access to AI systems depends on robust multilingual reasoning across diverse local languages. We introduce MATH-IDN, a multilingual benchmark for mathematical problem solving in Indonesian, Javanese, Sundanese, and Buginese, with English as a reference, following the MATH dataset. We evaluate multiple open-source LLMs, including math-specialized, Southeast-Asian-adapted, and general-purpose models, under a zero-shot chain-of-thought setting. Results show that MATH-IDN presents a challenging and discriminative benchmark, revealing substantial performance gaps in low-resource languages, particularly Buginese, and highlighting key limitations in current multilingual reasoning capabilities. Our data and code are available at https://github.com/aialt/MATH-IND.
Generating presentation videos from scientific papers is challenging due to the need for long-document discourse planning and cross-lingual grounding. Existing Paper2Video systems are largely monolingual and often rely on single-pass pipelines, which can limit the coherence and informativeness of the resulting presentations. We present mPresenter, a multilingual agentic Paper2Video system that decomposes the task into planning, audience-oriented critique, layout-aware slide generation, and multilingual figure interpretation, enabling iterative refinement at the discourse level. To facilitate reproducible evaluation, we also introduce mPreBench, a multilingual benchmark that evaluates presentation videos via question answering as a proxy for effective information transfer. Experimental results indicate that mPresenter improves question-answering accuracy relative to prior systems, while maintaining affordable cost and latency.

2024

Can French intonation be taught in the classroom using gesture-controlled speech synthesis on a tablet? The fundamental frequency and duration of four declarative sentences, four polar questions, and four utterances expressing incredulity (1 to 4 syllables), produced by two Ukrainian beginner learners of French, were compared before and after four weekly training sessions. The learners had to listen to a reference recording, then view the model on the tablet, trace the intonation by hand, listen to the synthesized result, and finally trace and listen to their own contour without a guide. They initially produced declarative sentences with rising intonation, and distinguished declaratives from polar questions after training. The expression of incredulity improved for one learner. The other showed some difficulty mastering the technology. This first case study using gesture-controlled speech synthesis suggests a promising approach that allows more intonation practice in the classroom.
Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones, which complete a task triggered by natural language by predicting a sequence of API actions. Even though the task relies heavily on past actions and visual observations, existing studies typically consider little of the semantic information carried by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes the description of the previous actions, the current screen, and, more importantly, the action thinking about which actions should be performed and the outcomes that the chosen action leads to. We demonstrate that, in a zero-shot setting on three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling. To further facilitate research in this line, we construct a dataset, Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations. Experiments show that fine-tuning a 1B model (i.e., AUTO-UI-base) on our AitZ dataset achieves performance on par with CogAgent-Chat-18B.
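
As a rough illustration of the context that CoAT is described as taking (previous actions, the current screen, action thinking, and the expected outcome), the components could be assembled into a single text prompt along the following lines. This is a minimal sketch only: the class name `CoATContext`, the function `build_coat_prompt`, the field names, and the section labels are all hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoATContext:
    """Hypothetical container for the CoAT components named in the abstract."""
    goal: str                                   # natural-language task
    previous_actions: List[str] = field(default_factory=list)
    screen_description: str = ""                # text description of the current screen
    action_thinking: str = ""                   # reasoning about which action to take
    expected_outcome: str = ""                  # outcome expected from the chosen action


def build_coat_prompt(ctx: CoATContext) -> str:
    """Join the components into one prompt string (illustrative layout only)."""
    # Number the action history; fall back to a placeholder on the first step.
    history = "\n".join(
        f"{i}. {action}" for i, action in enumerate(ctx.previous_actions, 1)
    ) or "(none)"
    sections = [
        ("Goal", ctx.goal),
        ("Previous actions", history),
        ("Current screen", ctx.screen_description),
        ("Action thinking", ctx.action_thinking),
        ("Expected outcome", ctx.expected_outcome),
    ]
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


if __name__ == "__main__":
    example = CoATContext(
        goal="Turn on Wi-Fi",
        previous_actions=["Tap the Settings icon"],
        screen_description="Settings page listing Network, Display, Sound.",
        action_thinking="Wi-Fi is under Network, so tap Network next.",
        expected_outcome="The Network page opens with a Wi-Fi toggle.",
    )
    print(build_coat_prompt(example))
```

Keeping each component in its own labeled section makes it easy to drop one (for example, the expected outcome) and compare prompts, which is one way such a context could be ablated.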