Chaoran Liu


2026

We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior resources such as NeuBAROCO and JFLD, which emphasize general or belief-aligned logic, BIS Reasoning 1.0 systematically introduces logically valid yet belief-inconsistent syllogisms to expose belief bias—the tendency to accept believable conclusions irrespective of validity. We benchmark a representative suite of cutting-edge models—including OpenAI GPT-5 variants, GPT-4o, Qwen, and prominent Japanese LLMs—under a uniform, zero-shot protocol. Reasoning-centric models achieve near-perfect accuracy on BIS Reasoning 1.0 (e.g., Qwen3-32B ≈99% and GPT-5-mini up to ≈99.7%), while GPT-4o attains around 80%. Earlier Japanese-specialized models underperform, often well below 60%, whereas the latest llm-jp-3.1-13b-instruct4 markedly improves to the mid-80% range. These results indicate that robustness to belief-inconsistent inputs is driven more by explicit reasoning optimization than by language specialization or scale alone. Our analysis further shows that even top-tier systems falter when logical validity conflicts with intuitive or factual beliefs, and that performance is sensitive to prompt design and inference-time reasoning effort. We discuss implications for safety-critical domains—law, healthcare, and scientific literature—where strict logical fidelity must override intuitive belief to ensure reliability.
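
The abstract does not spell out the evaluation protocol; as a rough, hypothetical illustration, the sketch below poses a single belief-inconsistent but logically valid syllogism to an OpenAI-compatible chat model in zero-shot fashion and scores the verdict. The example item, prompt wording, and answer parsing are assumptions made for illustration, not the dataset's actual format or the paper's evaluation harness.

```python
# Hypothetical zero-shot check of one belief-inconsistent syllogism.
# The item, prompt wording, and parsing are illustrative, not the
# BIS Reasoning 1.0 format or the paper's evaluation code.
from openai import OpenAI

client = OpenAI()

# Logically valid but belief-inconsistent: premise 1 is factually false.
item = {
    "premise_1": "All mammals can fly.",
    "premise_2": "All whales are mammals.",
    "conclusion": "Therefore, all whales can fly.",
    "gold": "valid",
}

prompt = (
    f"Premise 1: {item['premise_1']}\n"
    f"Premise 2: {item['premise_2']}\n"
    f"Conclusion: {item['conclusion']}\n"
    "Does the conclusion follow logically from the premises? "
    "Answer with exactly one word: valid or invalid."
)

response = client.chat.completions.create(
    model="gpt-4o",  # swap in any benchmarked model
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
prediction = response.choices[0].message.content.strip().lower()

# "invalid" contains "valid", so check the stricter string first.
verdict = "invalid" if "invalid" in prediction else ("valid" if "valid" in prediction else "unparsed")
print("correct" if verdict == item["gold"] else f"incorrect ({verdict})")
```

A model that answers from belief rather than logic would reply "invalid" here, which is exactly the failure mode the benchmark probes.
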
We investigate whether large language models (LLMs) can improve through recursive training on self-generated text, a topic on which prior studies report conflicting outcomes: some find evidence of performance gains (i.e., self-improvement), while others observe performance degradation (i.e., model collapse). To clarify this discrepancy, we use the OLMo-2 models as non-toy LLMs and perform multiple rounds of continual pre-training on self-generated text produced under different prompting strategies and data-filtering settings. Our experiments show that naive recursive self-training improves neither perplexity nor downstream task performance, regardless of model size. These results suggest that the model collapse observed in naive recursive training is inherent to the training procedure itself, while self-improvement likely owes its success not to the model's autonomous refinement but to human-designed, strategic synthetic-data pipelines that inject external intelligence.
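
To make the setup concrete, here is a minimal sketch of one naive recursive round: sample text from the current model, then continue pre-training on that self-generated text alone. The model name, seed prompt, and hyperparameters are placeholders and do not reflect the paper's OLMo-2 configuration, prompting strategies, or filtering pipeline.

```python
# Minimal sketch of naive recursive self-training (placeholder configuration,
# not the paper's OLMo-2 setup, prompting strategies, or data filtering).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-2-1124-7B"  # any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()

SEED_PROMPT = "Write a short encyclopedic passage.\n"  # one fixed strategy; the paper varies this

def generate_corpus(n_samples: int, max_new_tokens: int = 256) -> list[str]:
    """Sample text from the current model to form the next round's training data."""
    model.eval()
    texts = []
    with torch.no_grad():
        for _ in range(n_samples):
            inputs = tokenizer(SEED_PROMPT, return_tensors="pt").to(model.device)
            out = model.generate(**inputs, do_sample=True, top_p=0.95,
                                 max_new_tokens=max_new_tokens)
            texts.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return texts

def continual_pretrain(texts: list[str], lr: float = 1e-5) -> None:
    """One causal-LM pass over the self-generated corpus, one document per step."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for text in texts:
        batch = tokenizer(text, return_tensors="pt", truncation=True,
                          max_length=1024).to(model.device)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

for round_idx in range(3):          # multiple recursive rounds
    corpus = generate_corpus(1000)  # self-generated text only, no external data
    continual_pretrain(corpus)
```
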
Large language models (LLMs) improve with more training data, but practical limits on data collection increasingly constrain further scaling. Advances in instruction-following LLMs have enabled controlled, high-quality text generation, making synthetic data a promising remedy. However, its effectiveness for pre-training non-English LLMs remains underexplored. We study this question for Japanese under a fixed token budget in which organic Japanese Web text constitutes only a small share, while far more organic English Web text is available, along with instruction-following LLMs capable of generating fluent Japanese. We compare three strategies to fill the data shortfall: generating synthetic Japanese text, repeating the limited Japanese Web text, and using English Web text. Experiments show that synthetic Japanese corpora outperform both baselines and approach the performance achieved when the entire token budget is filled with additional organic Japanese Web text.
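
As a rough, toy illustration of the fixed-budget comparison, the sketch below fills a token budget with the scarce organic Japanese Web text and then tops it up using one of the three strategies. The corpora, per-document token counts, and budget are placeholders; the actual tokenization, filtering, and mixing pipeline is not shown here.

```python
# Toy sketch of filling a fixed pre-training token budget under the three
# strategies; corpora, token counts, and the budget are placeholders.
from itertools import cycle
from typing import Iterable

TOKEN_BUDGET = 100_000  # total pre-training tokens (illustrative)

# Placeholder corpora as (text, token_count) pairs.
japanese_web_docs = [("日本語のWebテキスト ...", 1_000)] * 20             # scarce organic Japanese
synthetic_japanese_docs = [("LLMが生成した日本語テキスト ...", 1_000)] * 200
english_web_docs = [("organic English Web text ...", 1_000)] * 200

def take_tokens(docs: Iterable[tuple[str, int]], budget: int) -> tuple[list[str], int]:
    """Greedily take documents until the token budget is spent."""
    chosen, used = [], 0
    for text, n_tokens in docs:
        if used + n_tokens > budget:
            break
        chosen.append(text)
        used += n_tokens
    return chosen, used

def build_mixture(filler: Iterable[tuple[str, int]]) -> list[str]:
    """Use all organic Japanese Web text first, then fill the shortfall."""
    corpus, used = take_tokens(japanese_web_docs, TOKEN_BUDGET)
    fill, _ = take_tokens(filler, TOKEN_BUDGET - used)
    return corpus + fill

mix_synthetic = build_mixture(synthetic_japanese_docs)   # 1) synthetic Japanese text
mix_repeated = build_mixture(cycle(japanese_web_docs))   # 2) repeat the limited Japanese Web text
mix_english = build_mixture(english_web_docs)            # 3) organic English Web text
print(len(mix_synthetic), len(mix_repeated), len(mix_english))  # each fills the same budget
```
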