Terry Jingchen Zhang


2026

Post-cutoff performance decay has been widely interpreted as a temporal signal for benchmark contamination. We critically examine this belief and demonstrate that this temporal signal is highly sensitive to how benchmark questions are constructed. Specifically, we show that LLM-generated questions can produce remarkably different temporal patterns compared to fill-in-the-blank questions directly retrieved from the very same materials. We validated this finding on previous benchmarks that reported clear post-cutoff performance decay, such as LiveCodeBench, and further showed that a simple LLM transformation could effectively remove this temporal pattern when evaluated on the same models. We also provide a mechanistic understanding of our observation using influence function analysis. Overall, this work offers a new perspective on the sensitivity of the temporal contamination signal and highlights the need for more robust contamination detection methods for reliable AI evaluation.
Large language models are increasingly deployed in multi-agent systems to overcome context limitations by distributing information across agents. However, whether LLM-based agents can reliably coordinate when each observes only a fragment of the global problem remains unclear. Existing benchmarks often prescribe agent roles or interaction patterns, conflating coordination ability with role-based priors. We introduce SILO-BENCH, a role-free benchmark for evaluating free-form collaboration under information silos. The benchmark comprises 30 algorithmic tasks with exact ground-truth answers, organized into 3 complexity levels based on optimal communication complexity: aggregation, mesh, and global shuffle. To systematically probe coordination capabilities, we instantiate 54 configurations by varying 3 communication protocols, 6 agent scales and 3 frontier LLMs, conducting 1,620 experiments. We evaluate agent behavior along three dimensions: Success Rate, Token Consumption, and Communication Density. Our experiments reveal a fundamental Communication-Reasoning Gap: agents communicate actively, yet fail to translate interaction into effective distributed computation. Performance collapses as complexity increases, with Level-III tasks achieving zero success beyond 50 agents. These findings demonstrate that current LLMs cannot escape information silos through coordination alone. SILO-BENCH provides a foundation for tracking progress toward genuinely collaborative multi-agent systems. The code is available at https://github.com/jwyjohn/acl26-silo-bench.

2025

The rapid development of Large Language Models (LLMs) opens up the possibility of using them as personal tutors. This has led to the development of several intelligent tutoring systems and learning assistants that use LLMs as back-ends with various degrees of engineering. In this study, we seek to compare human tutors with LLM tutors in terms of engagement, empathy, scaffolding, and conciseness. We ask human tutors to compare the performance of an LLM tutor with that of a human tutor in teaching grade-school math word problems on these qualities. We find that annotators with teaching experience perceive LLMs as showing higher performance than human tutors in all 4 metrics. The biggest advantage is in empathy, where 80% of our annotators prefer the LLM tutor more often than the human tutors. Our study paints a positive picture of LLMs as tutors and indicates that these models can be used to reduce the load on human teachers in the future.