Zinan Tang


2025

A Strategic Coordination Framework of Small LMs Matches Large LMs in Data Synthesis
Xin Gao | Qizhi Pei | Zinan Tang | Yu Li | Honglin Lin | Jiang Wu | Lijun Wu | Conghui He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

While data synthesis and distillation are promising strategies for enhancing small language models, current approaches rely heavily on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework of multiple small LMs that aggregates specialized roles to achieve the iterative refinement and quality control typically performed by a single large LM. In this collaborative framework, the small LMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborating small LMs can achieve data-level parity with distillation from large LMs. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents.
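As a rough illustration of the peer-review-style coordination described in the abstract, the sketch below outlines one synthesis round in Python. The `small_lm_generate` helper, role prompts, and acceptance rule are hypothetical placeholders under my own assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a GRA-style peer-review synthesis round (assumptions:
# `small_lm_generate` stands in for any small-LM inference call; prompts and
# the REJECT convention are illustrative, not the paper's exact protocol).

from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    response: str


def small_lm_generate(role: str, prompt: str) -> str:
    """Placeholder: query the small LM assigned to `role` with `prompt`."""
    raise NotImplementedError


def gra_round(seed_topic: str, num_reviewers: int = 3) -> Sample | None:
    # Generator proposes a candidate instruction-response pair.
    draft = small_lm_generate(
        "generator", f"Write an instruction and its answer about: {seed_topic}"
    )

    # Each Reviewer critiques the draft's quality and diversity.
    reviews = [
        small_lm_generate(
            "reviewer", f"Critique this sample for quality and diversity:\n{draft}"
        )
        for _ in range(num_reviewers)
    ]

    # Adjudicator resolves conflicting reviews and finalizes or discards the sample.
    verdict = small_lm_generate(
        "adjudicator",
        "Given the draft and reviews below, output a final corrected sample "
        "or the single word REJECT.\n" + draft + "\n" + "\n".join(reviews),
    )
    if verdict.strip() == "REJECT":
        return None
    instruction, _, response = verdict.partition("\n")
    return Sample(instruction=instruction, response=response)
```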

LEMMA: Learning from Errors for MatheMatical Advancement in LLMs
Zhuoshi Pan | Yu Li | Honglin Lin | Qizhi Pei | Zinan Tang | Wei Wu | Chenlin Ming | H. Vicky Zhao | Conghui He | Lijun Wu
Findings of the Association for Computational Linguistics: ACL 2025

Large language models (LLMs) have demonstrated remarkable reasoning capability in solving mathematical problems. However, existing approaches primarily focus on improving the quality of correct training data, e.g., distilling high-quality correct solutions from advanced models, while neglecting the value contained in error data, which may hinder the model’s reflective ability. Although some studies attempt to leverage error data, they often involve complex mechanisms, such as Monte Carlo Tree Search (MCTS), to explore error nodes. In this work, we propose to enhance LLMs’ reasoning ability by Learning from Errors for MatheMatical Advancement (LEMMA). LEMMA constructs fine-tuning data consisting of an incorrect solution with an erroneous step and a reflection connecting it to a correct solution. Specifically, we systematically analyze model-generated error types and introduce an error-type grounded mistake augmentation method to collect diverse and representative errors. Correct solutions are obtained either by fixing the errors or by generating a fresh solution from scratch. By fine-tuning on the constructed dataset, the model is able to self-correct errors autonomously within the generation process, without relying on external critique models. Experimental results demonstrate that LEMMA achieves significant performance improvements over other strong models with fewer than 90k training examples.
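The data construction described above can be pictured with a small sketch. The schema and the flattened fine-tuning format below are illustrative assumptions of mine, not the released LEMMA data format.

```python
# Minimal sketch of assembling one LEMMA-style training example: an erroneous
# solution, a reflection flagging the faulty step, and a correct continuation
# (either a fix of the error or a fresh restart). Field names are assumptions.

from dataclasses import dataclass


@dataclass
class LemmaExample:
    problem: str
    wrong_solution: str    # solution containing a representative error type
    reflection: str        # text identifying and explaining the erroneous step
    correct_solution: str  # corrected continuation or a fresh solution


def to_finetune_text(ex: LemmaExample) -> str:
    """Flatten one example into a single fine-tuning target string."""
    return (
        f"Problem: {ex.problem}\n"
        f"Attempt: {ex.wrong_solution}\n"
        f"Reflection: {ex.reflection}\n"
        f"Corrected solution: {ex.correct_solution}"
    )
```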

Big Escape Benchmark: Evaluating Human-Like Reasoning in Language Models via Real-World Escape Room Challenges
Zinan Tang | QiYao Sun
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

Large Language Models (LLMs) have recently demonstrated remarkable reasoning capabilities across a wide range of tasks. While many benchmarks have been developed for specific academic subjects, coding, or constrained visual tasks, they often fail to fully capture the breadth, diversity, and dynamic nature of real-world human reasoning. Further, creating high-quality, complex multimodal reasoning benchmarks typically requires significant manual effort and expert annotation, which is costly and time-consuming. To address these limitations, we introduce Big Escape Bench, a novel multimodal reasoning benchmark derived from popular reality shows and television programs. Big Escape Bench leverages unique characteristics of TV content, providing a rich source of challenging and realistic multimodal reasoning problems. Key advantages include: questions guaranteed to be human-solvable and of moderate difficulty; problems reflecting diverse, real-world scenarios and knowledge domains; and high inherent quality due to content produced by professional program teams. Notably, we develop an automated pipeline that converts content from these programs into a standardized benchmark format, significantly reducing the manual effort compared to traditional dataset construction. We conduct extensive experiments evaluating state-of-the-art (SOTA) LLMs and Multimodal Large Language Models (MLLMs) on Big Escape Bench. Our results reveal a surprising performance gap: while the questions are readily solved by human viewers (about 60% accuracy), even the most advanced models (best: 40.50% accuracy) fall significantly short of human-level accuracy. This highlights that, despite recent progress, MLLMs still face substantial challenges in robustly performing the kind of diverse, dynamic, and context-dependent reasoning that is trivial for humans in routine situations. Big Escape Bench serves as a valuable tool for identifying current limitations of MLLMs and fostering future research towards more human-like multimodal reasoning.
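To make the evaluation setup concrete, the following sketch shows how accuracy might be computed over benchmark items of this kind. The item schema and the `model_answer` call are assumptions made for illustration and may differ from the released benchmark and its evaluation protocol.

```python
# Minimal sketch of scoring a model on Big Escape Bench-style multiple-choice
# items (assumptions: each item carries a question, program frames, choices,
# and a ground-truth label; `model_answer` stands in for any MLLM query).

from dataclasses import dataclass, field


@dataclass
class EscapeItem:
    question: str
    image_paths: list[str] = field(default_factory=list)  # frames from the program
    choices: list[str] = field(default_factory=list)
    answer: str = ""  # ground-truth choice label, e.g. "B"


def model_answer(item: EscapeItem) -> str:
    """Placeholder: query an MLLM with the question, images, and choices."""
    raise NotImplementedError


def accuracy(items: list[EscapeItem]) -> float:
    """Fraction of items where the model's choice matches the ground truth."""
    if not items:
        return 0.0
    correct = sum(model_answer(it).strip() == it.answer for it in items)
    return correct / len(items)
```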