Yuuki Tachioka
2026
d-itlab at SemEval-2026 Task 12: Per-Option Surprisal and Multi-Stage Gating for Precision-Oriented Causal Reasoning
Yasunori Terao | Yuuki Tachioka
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
We describe the system submitted by d-itlab to SemEval-2026 Task 12 (Abductive Event Reasoning), which requires selecting the most plausible direct cause(s) of an observed event from candidate options grounded in reference documents. Our approach combines (i) per-option multi-stage LLM inference that evaluates each option independently with progressively stricter verification, (ii) surprisal-based features obtained by teacher-forcing candidate sentences and measuring token-level negative log-likelihood, and (iii) an XGBoost ensemble trained on these heterogeneous features to produce a precision-oriented final prediction. On the official test set, our system scored 0.91, ranking third among 116 participating teams.
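The surprisal feature in (ii) can be illustrated with a minimal sketch. It assumes the per-token natural-log probabilities of each candidate sentence have already been obtained from a teacher-forced pass. The function name `mean_surprisal`, the option labels, and the probability values are hypothetical and are not taken from the submitted system.

```python
import math


def mean_surprisal(token_logprobs):
    """Mean negative log-likelihood per token, in nats."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical teacher-forced log-probabilities for two candidate options.
option_logprobs = {
    "A": [math.log(0.5), math.log(0.25)],
    "B": [math.log(0.1), math.log(0.1)],
}

# One surprisal feature per option; a downstream classifier could consume these.
features = {name: mean_surprisal(lp) for name, lp in option_logprobs.items()}
least_surprising = min(features, key=features.get)
```

In this toy input, option A has the lower mean surprisal. In a feature-based system such a value would be one input among several rather than a decision rule on its own.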
Diagnosing LLMs via Information Spectrum Analysis: Tail Behavior and the Effects of Side Information
Yuuki Tachioka
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) exhibit non-stationary generation: their output distributions shift with prompts, retrieved documents, and decoding conditions. Under such variability, average likelihood metrics can obscure heterogeneous behaviors across samples, especially in high-surprisal tails where failures often occur. We propose an information-spectrum-based diagnostic framework that treats LLMs as general sources without assuming stationarity, ergodicity, or the asymptotic equipartition property. We define sequence-level self-information density (coding rate; mean surprisal) and construct an empirical information spectrum from finite samples, enabling operational estimates of spectrum quantiles and width. We further introduce an information gain spectrum, a teacher-forced likelihood-based measure that evaluates the same generated sequence with and without side information. Across multiple Japanese LLMs and QA settings, we observe that correctness differences are often more visible in the high-surprisal tail than in the mean coding rate, and that side information can reshape tail behavior in heterogeneous ways across sequences. We also observe that instruction tuning changes the spectrum structure, making tail statistics and spectrum width more predictive of correctness than the mean coding rate. Overall, our analysis illustrates how spectrum-based diagnostics complement average-based metrics for understanding conditional generation.
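As a rough illustration of the quantities named in this abstract, the sketch below computes a per-sequence coding rate, empirical spectrum quantiles, a spectrum width, and a per-sequence gain from side information. All definitions here are assumptions made for illustration: the width is taken as the difference between two quantiles, the gain as the coding rate without side information minus the rate with it, and every function name and numeric value is hypothetical. The paper's exact definitions may differ.

```python
def coding_rate(token_logprobs):
    """Sequence-level mean surprisal (nats per token) from token log-probs."""
    return -sum(token_logprobs) / len(token_logprobs)


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def spectrum_width(rates, lower=0.25, upper=0.75):
    """Assumed width: distance between an upper and a lower quantile."""
    return quantile(rates, upper) - quantile(rates, lower)


def information_gain(rates_without, rates_with):
    """Assumed per-sequence gain: rate without side info minus rate with it."""
    return [a - b for a, b in zip(rates_without, rates_with)]


# Hypothetical coding rates for five generated sequences.
rates = [1.0, 2.0, 3.0, 4.0, 5.0]
median_rate = quantile(rates, 0.5)
width = spectrum_width(rates)

# Hypothetical rates for the same two sequences scored without and with
# side information; the gain is positive for one and negative for the other.
gains = information_gain([2.0, 3.0], [1.5, 3.5])
```

The mixed signs in `gains` mirror the abstract's observation that side information can affect individual sequences differently, which a single averaged rate would hide.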