2025
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
Jiarui Lu | Thomas Holleis | Yizhe Zhang | Bernhard Aumayer | Feng Nan | Haoping Bai | Shuang Ma | Shen Ma | Mengyu Li | Guoli Yin | Zirui Wang | Ruoming Pang
Findings of the Association for Computational Linguistics: NAACL 2025
Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works have focused on evaluating either stateless web services (RESTful APIs) from a single-turn user prompt or off-policy dialog trajectories, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs, providing brand-new insights into the tool-use capabilities of LLMs. Datasets and evaluation scripts of ToolSandbox are released at <placeholder>.
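To picture what stateful execution, an implicit state dependency, and milestone-based scoring over an arbitrary trajectory look like, here is a minimal, self-contained Python sketch. The names (WorldState, enable_cellular, send_message, Milestone) are illustrative assumptions and are not the released ToolSandbox API.

# Minimal sketch of stateful tool execution with an implicit state
# dependency and milestone checking. All names are hypothetical, not
# the actual ToolSandbox API.
from dataclasses import dataclass, field


@dataclass
class WorldState:
    cellular_on: bool = False          # implicit dependency for messaging
    sent_messages: list = field(default_factory=list)


def enable_cellular(state: WorldState) -> str:
    state.cellular_on = True
    return "cellular enabled"


def send_message(state: WorldState, to: str, body: str) -> str:
    # send_message implicitly depends on cellular service being on
    if not state.cellular_on:
        return "error: no cellular service"
    state.sent_messages.append((to, body))
    return "message sent"


@dataclass
class Milestone:
    name: str
    check: callable  # predicate over the world state


def evaluate(state: WorldState, milestones: list) -> float:
    # Dynamic evaluation: score the fraction of milestones reached,
    # regardless of the exact trajectory the agent took to get there.
    hits = sum(m.check(state) for m in milestones)
    return hits / len(milestones)


if __name__ == "__main__":
    state = WorldState()
    enable_cellular(state)                      # resolve the implicit dependency
    send_message(state, "+1234567890", "hi")    # now succeeds
    milestones = [
        Milestone("cellular_on", lambda s: s.cellular_on),
        Milestone("message_sent", lambda s: len(s.sent_messages) == 1),
    ]
    print(evaluate(state, milestones))          # 1.0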
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Guoli Yin | Haoping Bai | Shuang Ma | Feng Nan | Yanchao Sun | Zhaoyang Xu | Shen Ma | Jiarui Lu | Xiang Kong | Aonan Zhang | Dian Ang Yap | Yizhe Zhang | Karsten Ahnert | Vik Kamath | Mathias Berglund | Dominic Walsh | Tobias Gindele | Juergen Wiest | Zhengfeng Lai | Xiaoming Simon Wang | Jiulong Shan | Meng Cao | Ruoming Pang | Zirui Wang
Findings of the Association for Computational Linguistics: NAACL 2025
Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to deeply discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 20 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance.
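A small sketch can illustrate the offline, environment-free evaluation style described above: every prompt is tagged with a domain and a capability, and scores are aggregated along both axes. The record format and field names below are assumptions for illustration, not the released MMAU dataset schema.

# Minimal sketch of offline evaluation in the spirit of MMAU: no
# interactive environment, just tagged prompts scored and aggregated
# by domain and by capability. Record format is hypothetical.
from collections import defaultdict

tasks = [
    {"domain": "Tool-use", "capability": "Planning",
     "prompt": "Which tool should be called first?", "answer": "search"},
    {"domain": "Mathematics", "capability": "Problem-solving",
     "prompt": "What is 17 * 3?", "answer": "51"},
]


def dummy_model(prompt: str) -> str:
    # Stand-in for an LLM call; always answers "51" for illustration.
    return "51"


def evaluate(model, tasks):
    by_domain, by_capability = defaultdict(list), defaultdict(list)
    for task in tasks:
        correct = model(task["prompt"]).strip() == task["answer"]
        by_domain[task["domain"]].append(correct)
        by_capability[task["capability"]].append(correct)
    mean = lambda xs: sum(xs) / len(xs)
    return ({d: mean(v) for d, v in by_domain.items()},
            {c: mean(v) for c, v in by_capability.items()})


print(evaluate(dummy_model, tasks))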
2023
STAIR: Learning Sparse Text and Image Representation in Grounded Tokens
Chen Chen | Bowen Zhang | Liangliang Cao | Jiguang Shen | Tom Gunter | Albin Jose | Alexander Toshev | Yantao Zheng | Jonathon Shlens | Ruoming Pang | Yinfei Yang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Image and text retrieval is one of the foundational tasks in the vision and language domain with multiple real-world applications. State-of-the-art contrastive approaches, e.g. CLIP, ALIGN, represent images and texts as dense embeddings and calculate the similarity in the dense embedding space as the matching score. On the other hand, sparse semantic features like bag-of-words models are more interpretable, but are believed to suffer from inferior accuracy compared to dense representations. In this work, we show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations. We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space. Each token in the space is a (sub-)word in the vocabulary, which is not only interpretable but also easy to integrate with existing information retrieval systems. The STAIR model significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements on COCO-5k text→image and image→text retrieval, respectively. It also achieves better performance than CLIP on both ImageNet zero-shot classification and linear probing.
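The matching step in a sparse token space can be pictured with a short sketch: each image or text is a sparse mapping from vocabulary tokens to non-negative weights, and the matching score is a sparse dot product, which is exactly what makes the representation compatible with inverted-index retrieval systems. The token weights below are made up for illustration and are not outputs of the STAIR model.

# Minimal sketch of matching with sparse (sub-)word representations:
# each example is a sparse dict from vocabulary tokens to weights, and
# the matching score is a sparse dot product. Weights are illustrative.

def sparse_score(query: dict, doc: dict) -> float:
    # Only tokens active in both representations contribute, so the score
    # can be computed efficiently from an inverted index.
    smaller, larger = sorted((query, doc), key=len)
    return sum(w * larger[t] for t, w in smaller.items() if t in larger)

text_repr  = {"dog": 1.8, "frisbee": 1.2, "park": 0.4}   # from the text tower
image_repr = {"dog": 2.1, "grass": 0.9, "frisbee": 0.7}  # from the image tower

print(sparse_score(text_repr, image_repr))  # 4.62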
2019
Monotonic Infinite Lookback Attention for Simultaneous Machine Translation
Naveen Arivazhagan | Colin Cherry | Wolfgang Macherey | Chung-Cheng Chiu | Semih Yavuz | Ruoming Pang | Wei Li | Colin Raffel
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Simultaneous machine translation begins to translate each source sentence before the source speaker is finished speaking, with applications to live and streaming scenarios. Simultaneous systems must carefully schedule their reading of the source sentence to balance quality against latency. We present the first simultaneous translation system to learn an adaptive schedule jointly with a neural machine translation (NMT) model that attends over all source tokens read thus far. We do so by introducing Monotonic Infinite Lookback (MILk) attention, which maintains both a hard, monotonic attention head to schedule the reading of the source sentence, and a soft attention head that extends from the monotonic head back to the beginning of the source. We show that MILk’s adaptive schedule allows it to arrive at latency-quality trade-offs that compare favorably to those of a recently proposed wait-k strategy for many latency values.
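The interaction between the two attention heads can be pictured with a toy decode-time sketch: the hard monotonic head greedily decides when to stop reading new source tokens, and the soft head then attends over the entire prefix read so far (the "infinite lookback"). This greedy inference-time view is an assumption for illustration; it omits the expected-attention training procedure described in the paper, and all shapes and scores are made up.

# Toy, inference-time sketch of MILk attention: a hard monotonic head
# scans source encodings and decides when to stop reading; a soft head
# then attends over positions 0..t (everything read so far).
import numpy as np

rng = np.random.default_rng(0)
d = 8
source = rng.standard_normal((10, d))   # encoder states, read incrementally
query = rng.standard_normal(d)          # current decoder state


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def milk_attention_step(query, source, prev_head=0):
    # Hard monotonic head: keep reading while the stop probability is < 0.5.
    head = prev_head
    while head < len(source) - 1 and sigmoid(query @ source[head]) < 0.5:
        head += 1                        # read one more source token
    # Soft head: attend over the full prefix 0..head (infinite lookback).
    scores = source[: head + 1] @ query
    weights = softmax(scores)
    context = weights @ source[: head + 1]
    return head, context


head, context = milk_attention_step(query, source)
print(head, context.shape)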