Joshua Ong Jun Leang
2026
PiCSAR: Probabilistic Confidence Selection and Ranking for Reasoning Chains
Joshua Ong Jun Leang | Zheng Zhao | Aryo Pradipta Gema | Sohee Yang | Wai-Chung Kwan | Xuanli He | Wenda Li | Pasquale Minervini | Eleonora Giunchiglia | Shay B Cohen
Findings of the Association for Computational Linguistics: ACL 2026
Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection and Ranking for Reasoning Chains (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. This method utilises both the scores of the reasoning path (*reasoning confidence*) and the final answer (*answer confidence*). PiCSAR achieves substantial gains across several benchmarks (+11.7 on AIME2024, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 20 out of 25 comparisons. Our analysis reveals that correct reasoning chains exhibit higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
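The selection rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes per-token log-probabilities for the reasoning chain and for the final answer are already available from the sampling model, and it assumes the joint log-likelihood is the plain, unnormalised sum of the two parts. The names `Candidate`, `joint_score`, and `select_best` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """One sampled generation (hypothetical container, not from the paper)."""
    answer: str
    reasoning_logprobs: Sequence[float]  # per-token log-probs of the reasoning chain
    answer_logprobs: Sequence[float]     # per-token log-probs of the final answer


def reasoning_confidence(c: Candidate) -> float:
    # Log-likelihood of the reasoning chain: sum of its token log-probs.
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    # Log-likelihood of the final answer given the reasoning.
    return sum(c.answer_logprobs)


def joint_score(c: Candidate) -> float:
    # Assumption: joint log-likelihood = reasoning term + answer term,
    # with no length normalisation.
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: List[Candidate]) -> Candidate:
    # Best-of-n selection: keep the candidate with the highest joint score.
    return max(candidates, key=joint_score)


candidates = [
    Candidate("42", [-0.1, -0.2], [-0.05]),
    Candidate("17", [-1.0, -2.0], [-0.5]),
]
best = select_best(candidates)
```

Because the score needs only the model's own token log-probabilities, no reward model or training step is involved, which matches the training-free description above.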
2025
CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning
Joshua Ong Jun Leang | Aryo Pradipta Gema | Shay B. Cohen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Mathematical reasoning remains a significant challenge for large language models (LLMs), despite progress in prompting techniques such as Chain-of-Thought (CoT). We present **Chain of Mathematically Annotated Thought (CoMAT)**, which enhances reasoning through two stages: *Symbolic Conversion* (converting natural language queries into symbolic form) and *Reasoning Execution* (deriving answers from symbolic representations). CoMAT operates entirely with a single LLM and without external solvers. Across four LLMs, CoMAT outperforms traditional CoT on six out of seven benchmarks, achieving gains of 4.48% on MMLU-Redux (MATH) and 4.58% on GaoKao MCQ. In addition to improved performance, CoMAT ensures faithfulness and verifiability, offering a transparent reasoning process for complex mathematical tasks.
Are We Done with MMLU?
Aryo Pradipta Gema | Joshua Ong Jun Leang | Giwon Hong | Alessio Devoto | Alberto Carlo Maria Mancino | Rohit Saxena | Xuanli He | Yu Zhao | Xiaotang Du | Mohammad Reza Ghasemi Madani | Claire Barale | Robert McHardy | Joshua Harris | Jean Kaddour | Emile Van Krieken | Pasquale Minervini
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error annotation protocol. Then, we create MMLU-Redux, which is a subset of 5,700 manually re-annotated questions across all 57 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU’s error-ridden questions to enhance its future utility and reliability as a benchmark. Therefore, we open up MMLU-Redux for additional annotation.
Theorem Prover as a Judge for Synthetic Data Generation
Joshua Ong Jun Leang | Giwon Hong | Wenda Li | Shay B. Cohen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge, affecting data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone. In response, we introduce *iterative autoformalisation*, an approach that iteratively refines theorem prover formalisation to mitigate errors, thereby increasing the execution rate on the Lean prover from 60% to 87%. Building upon that, we introduce *Theorem Prover as a Judge (TP-as-a-Judge)*, a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation. Finally, we present *Reinforcement Learning from Theorem Prover Feedback (RLTPF)*, a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF). Across multiple LLMs, applying *TP-as-a-Judge* and *RLTPF* improves benchmark performance with only 3,508 samples, achieving a 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.
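The refinement loop behind iterative autoformalisation can be sketched generically. This is only an illustration under stated assumptions: `formalise` stands in for an LLM call that produces a formal statement (optionally conditioned on the previous error message), `check` stands in for running the theorem prover, and `max_rounds` is an arbitrary cap. All three names are hypothetical and none of them is taken from the paper or from any Lean tooling.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: both callables are supplied by the caller.
Formaliser = Callable[[str, Optional[str]], str]  # (problem, feedback) -> formal text
Checker = Callable[[str], Tuple[bool, str]]       # formal text -> (executes?, message)


def iterative_autoformalise(
    problem: str,
    formalise: Formaliser,
    check: Checker,
    max_rounds: int = 3,
) -> Optional[str]:
    """Retry formalisation, feeding prover errors back, until it executes."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = formalise(problem, feedback)
        ok, message = check(candidate)
        if ok:
            return candidate  # the prover accepted this formalisation
        feedback = message    # pass the error into the next attempt
    return None               # no executable formalisation within the budget


# Toy stand-ins so the loop can be exercised without an LLM or a prover.
def toy_formalise(problem: str, feedback: Optional[str]) -> str:
    return "good" if feedback else "bad"


def toy_check(candidate: str) -> Tuple[bool, str]:
    return (candidate == "good", "error: could not elaborate")


result = iterative_autoformalise("1 + 1 = 2", toy_formalise, toy_check)
single_try = iterative_autoformalise("1 + 1 = 2", toy_formalise, toy_check, max_rounds=1)
```

With the toy stand-ins, the first attempt fails and the second succeeds once the error message is fed back, which is the mechanism the abstract credits for the higher execution rate.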