2025
CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning
Joshua Ong Jun Leang | Aryo Pradipta Gema | Shay B Cohen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Mathematical reasoning remains a significant challenge for large language models (LLMs), despite progress in prompting techniques such as Chain-of-Thought (CoT). We present **Chain of Mathematically Annotated Thought (CoMAT)**, which enhances reasoning through two stages: *Symbolic Conversion* (converting natural language queries into symbolic form) and *Reasoning Execution* (deriving answers from symbolic representations). CoMAT operates entirely with a single LLM and without external solvers. Across four LLMs, CoMAT outperforms traditional CoT on six out of seven benchmarks, achieving gains of 4.48% on MMLU-Redux (MATH) and 4.58% on GaoKao MCQ. In addition to improved performance, CoMAT ensures faithfulness and verifiability, offering a transparent reasoning process for complex mathematical tasks.
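To make the two-stage setup concrete, here is a minimal, illustrative sketch of a CoMAT-style pipeline. It is not the paper's released code; the `llm` callable, prompt wording, and mock example are placeholder assumptions.

```python
# Illustrative sketch of a CoMAT-style two-stage pipeline (not the paper's code).
# `llm` stands in for any single-model chat/completion call that returns a string.
from typing import Callable

SYMBOLIC_PROMPT = (
    "Convert the following maths question into a symbolic form "
    "(variables, equations, constraints). Do not solve it yet.\n\nQuestion: {question}"
)
REASONING_PROMPT = (
    "You are given the symbolic form of a maths question. Reason step by step "
    "over the symbols only and state the final answer.\n\nSymbolic form:\n{symbolic}\n\nAnswer:"
)

def comat_style_answer(question: str, llm: Callable[[str], str]) -> dict:
    """Stage 1: symbolic conversion; Stage 2: reasoning execution over the symbols."""
    symbolic = llm(SYMBOLIC_PROMPT.format(question=question))
    answer = llm(REASONING_PROMPT.format(symbolic=symbolic))
    # Returning both stages keeps the trace inspectable, which is what makes the
    # reasoning verifiable rather than a single opaque completion.
    return {"symbolic": symbolic, "answer": answer}

if __name__ == "__main__":
    mock_llm = lambda prompt: "x + 3 = 7  =>  x = 4" if "Convert" in prompt else "4"
    print(comat_style_answer("What number plus 3 equals 7?", mock_llm))
```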
GRADA: Graph-based Reranking against Adversarial Documents Attack
Jingjie Zheng | Aryo Pradipta Gema | Giwon Hong | Xuanli He | Pasquale Minervini | Youcheng Sun | Qiongkai Xu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Retrieval Augmented Generation (RAG) frameworks can improve the factual accuracy of large language models (LLMs) by integrating external knowledge from retrieved documents, thereby overcoming the limitations of models’ static intrinsic knowledge. However, these systems are susceptible to adversarial attacks that manipulate the retrieval process by introducing documents that are adversarial yet semantically similar to the query. Notably, while these adversarial documents resemble the query, they exhibit weak similarity to benign documents in the retrieval set. We therefore propose **G**raph-based **R**eranking against **A**dversarial **D**ocument **A**ttacks (GRADA), a simple yet effective framework that preserves retrieval quality while significantly reducing the success of adversaries. Our study evaluates the effectiveness of our approach through experiments conducted on six LLMs: GPT-3.5-Turbo, GPT-4o, Llama3.1-8b-Instruct, Llama3.1-70b-Instruct, Qwen2.5-7b-Instruct, and Qwen2.5-14b-Instruct. We use three datasets to assess performance, with results from the Natural Questions dataset demonstrating up to an 80% reduction in attack success rates while maintaining minimal loss in accuracy.
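As a rough illustration of the idea (not the authors' implementation), the sketch below scores retrieved documents by propagating similarity over a document-document graph, with query similarity as a prior; documents that resemble the query but are isolated from the rest of the retrieval set fall in the ranking. The PageRank-style scoring rule is an assumption chosen for illustration.

```python
# Illustrative graph-based reranking over retrieved documents (not GRADA's code).
# Intuition from the abstract: adversarial documents mimic the query but are
# weakly similar to the benign documents, so a score propagated over the
# doc-doc similarity graph pushes them down the ranking.
import numpy as np

def rerank_by_graph(doc_doc_sim: np.ndarray, query_doc_sim: np.ndarray,
                    damping: float = 0.85, iters: int = 50) -> np.ndarray:
    """Return document indices sorted by a PageRank-style score on the doc-doc
    similarity graph, with query similarity as the teleport prior."""
    n = len(query_doc_sim)
    doc_doc_sim = doc_doc_sim.copy()
    np.fill_diagonal(doc_doc_sim, 0.0)                      # no self-edges
    col_sums = doc_doc_sim.sum(axis=0, keepdims=True)
    transition = np.divide(doc_doc_sim, col_sums,
                           out=np.full_like(doc_doc_sim, 1.0 / n),
                           where=col_sums > 0)              # column-stochastic
    prior = query_doc_sim / query_doc_sim.sum()
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = damping * transition @ scores + (1 - damping) * prior
    return np.argsort(-scores)

# Toy example: doc 2 is query-like but isolated from the benign documents.
doc_doc = np.array([[1.0, 0.8, 0.1],
                    [0.8, 1.0, 0.1],
                    [0.1, 0.1, 1.0]])
query = np.array([0.7, 0.6, 0.9])
print(rerank_by_graph(doc_doc, query))  # doc 2 (query-like but isolated) ranks last
```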
Self-Training Large Language Models for Tool-Use Without Demonstrations
Ne Luo | Aryo Pradipta Gema | Xuanli He | Emile Van Krieken | Pietro Lesci | Pasquale Minervini
Findings of the Association for Computational Linguistics: NAACL 2025
Large language models (LLMs) remain prone to factual inaccuracies and computational errors, including hallucinations and mistakes in mathematical reasoning. Recent work has augmented LLMs with tools to mitigate these shortcomings, but this often requires curated gold tool-use demonstrations. In this paper, we investigate whether LLMs can learn to use tools without demonstrations. First, we analyse zero-shot prompting strategies to guide LLMs in tool utilisation. Second, we propose a self-training method to synthesise tool-use traces using the LLM itself. We compare supervised fine-tuning and preference fine-tuning for training the model on datasets constructed from existing Question Answering (QA) datasets, i.e., TriviaQA and GSM8K. Experiments show that tool-use enhances performance on a long-tail knowledge task, with a 3.7% improvement on PopQA, which is used solely for evaluation, but leads to mixed results on other datasets, i.e., TriviaQA, GSM8K, and NQ-Open. Our findings highlight the potential and challenges of integrating external tools into LLMs without demonstrations.
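A minimal sketch of such a self-training loop is given below, assuming hypothetical `llm` and `run_tool` callables; keeping only traces whose final answer matches the gold answer is one plausible filter, not necessarily the one used in the paper.

```python
# Rough sketch of a self-training loop for tool-use (not the authors' code):
# the model proposes tool-use traces for existing QA pairs, and only traces
# whose final answer matches the gold answer are kept as fine-tuning data.
from typing import Callable, Dict, List

def synthesise_traces(qa_pairs: List[Dict], llm: Callable[[str], str],
                      run_tool: Callable[[str], str], samples_per_q: int = 4) -> List[Dict]:
    kept = []
    for ex in qa_pairs:
        for _ in range(samples_per_q):
            call = llm(f"Question: {ex['question']}\nPropose one tool call:")
            observation = run_tool(call)
            answer = llm(f"Question: {ex['question']}\nTool output: {observation}\nAnswer:")
            if answer.strip() == ex["answer"].strip():   # self-filter on correctness
                kept.append({"question": ex["question"],
                             "trace": [call, observation, answer]})
    return kept  # later used for supervised or preference fine-tuning

if __name__ == "__main__":
    qa = [{"question": "2 + 2?", "answer": "4"}]
    mock_llm = lambda p: "calc(2 + 2)" if "tool call" in p else "4"
    mock_tool = lambda call: str(eval(call[len("calc("):-1]))
    print(synthesise_traces(qa, mock_llm, mock_tool, samples_per_q=1))
```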
Are We Done with MMLU?
Aryo Pradipta Gema | Joshua Ong Jun Leang | Giwon Hong | Alessio Devoto | Alberto Carlo Maria Mancino | Rohit Saxena | Xuanli He | Yu Zhao | Xiaotang Du | Mohammad Reza Ghasemi Madani | Claire Barale | Robert McHardy | Joshua Harris | Jean Kaddour | Emile Van Krieken | Pasquale Minervini
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Maybe not. We identify and analyse errors in the popular Massive Multitask Language Understanding (MMLU) benchmark. Even though MMLU is widely adopted, our analysis demonstrates numerous ground-truth errors that obscure the true capabilities of LLMs. For example, we find that 57% of the analysed questions in the Virology subset contain errors. To address this issue, we introduce a comprehensive framework for identifying dataset errors using a novel error annotation protocol. We then create MMLU-Redux, a subset of 5,700 manually re-annotated questions spanning all 57 MMLU subjects. Using MMLU-Redux, we demonstrate significant discrepancies with the model performance metrics that were originally reported. Our results strongly advocate for revising MMLU’s error-ridden questions to enhance its future utility and reliability as a benchmark. We therefore open up MMLU-Redux for additional annotation.
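The toy snippet below (illustrative only, with made-up predictions and labels) shows why re-annotation shifts reported metrics: the same model predictions are scored once against the original ground truths and once against corrected ones.

```python
# Toy illustration (not the MMLU-Redux pipeline) of how corrected ground truths
# change measured accuracy for identical model predictions.
def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

preds            = ["A", "C", "B", "D", "A"]
original_labels  = ["A", "B", "B", "D", "C"]   # contains annotation errors
corrected_labels = ["A", "C", "B", "D", "A"]   # after manual re-annotation
print(accuracy(preds, original_labels), accuracy(preds, corrected_labels))  # 0.6 vs 1.0
```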
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Yu Zhao | Alessio Devoto | Giwon Hong | Xiaotang Du | Aryo Pradipta Gema | Hongru Wang | Xuanli He | Kam-Fai Wong | Pasquale Minervini
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context—this phenomenon, known as context-memory knowledge conflicts, can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. Analysing the internal activations of LLMs, we find that they can internally register the signals of knowledge conflict at mid-layers. Such signals allow us to detect whether a knowledge conflict occurs and use inference-time intervention strategies to resolve it. In this work, we propose SpARE, a training-free representation engineering method that uses pre-trained sparse auto-encoders (SAEs) to control the knowledge selection behaviour of LLMs. SpARE identifies the functional features that control the knowledge selection behaviours and applies them to edit the internal activations of LLMs at inference time. Our experimental results show that SpARE can effectively control the usage of either knowledge source to resolve knowledge conflict in open-domain question-answering tasks, surpassing existing representation engineering methods (+10%) as well as contrastive decoding methods (+15%).
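As a conceptual sketch (not the SpARE release), the snippet below shows a training-free activation edit in which selected SAE decoder directions are added to or subtracted from a mid-layer residual activation; the feature indices, dimensions, and steering strength are hypothetical.

```python
# Conceptual sketch of an SAE-based activation edit (not the SpARE code):
# selected sparse auto-encoder features are added to / removed from a mid-layer
# residual activation to steer whether the model follows contextual or
# parametric knowledge. All indices and magnitudes below are hypothetical.
import numpy as np

def edit_activation(h: np.ndarray, sae_decoder: np.ndarray,
                    add_feats: list, remove_feats: list,
                    strength: float = 4.0) -> np.ndarray:
    """h: (d_model,) residual-stream activation; sae_decoder: (n_features, d_model)."""
    steered = h.copy()
    for f in add_feats:                       # push towards the desired behaviour
        steered += strength * sae_decoder[f]
    for f in remove_feats:                    # suppress the competing behaviour
        steered -= strength * sae_decoder[f]
    return steered

rng = np.random.default_rng(0)
h = rng.normal(size=768)                      # toy activation
decoder = rng.normal(size=(16384, 768))       # toy SAE decoder matrix
print(edit_activation(h, decoder, add_feats=[12], remove_feats=[407]).shape)  # (768,)
```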