Tomoki Tsujimura


2026

To reliably interpret the evolving context of an LLM as a reasoning trace, the underlying belief of the LLM needs to transition consistently with the progression of the context. We focus on evaluating whether the beliefs held by a model remain consistent before and after the context is extended. Previous research on consistency evaluation typically uses datasets with ground-truth answers, which is problematic because task-solving ability acts as a confounding factor and obscures the direct evaluation of consistency. Furthermore, cases where inconsistency stems from multiple errors are difficult to evaluate. We propose a new evaluation method that assesses the consistency of LLMs in a multiple-choice question answering format, designed so that any option chosen is correct, which allows belief consistency to be evaluated directly. It also supports the isolation of errors such as reasoning failures and biases. We reveal that belief consistency does not improve solely with model size scaling, whereas continual pre-training on code and mathematics text improves it. Furthermore, models trained on code and mathematics text show a seemingly contradictory result of increased logical failures, indicating that belief consistency and superficial consistency are not necessarily directly linked.

2017

This paper describes our TTI-COIN system that participated in SemEval-2017 Task 10. We investigated appropriate embeddings to adapt a neural end-to-end entity and relation extraction system, LSTM-ER, to this task. We participated in the full task setting of entity segmentation, entity classification, and relation classification (scenario 1) and in the setting of relation classification only (scenario 3). The system was directly applied to scenario 1 without modifying the code, thanks to its generality and flexibility. Our evaluation results show that the choice of appropriate pre-trained embeddings affected the performance significantly. With the best embeddings, our system was ranked third in scenario 1 with a micro F1 score of 0.38. We also confirm that our system can produce a micro F1 score of 0.48 for scenario 3 on the test data, and this score is close to the score of the 3rd ranked system in the task.