Li Zhang
Papers on this page may belong to the following people: Li Zhang, Li Zhang, Li Zhang (AWS), Li Zhang (Birmingham), Li Zhang (Google), Li Zhang (Google), Li Zhang (IBM-china), Li Zhang (Nankai), Li Zhang (Newcastle, UK), Li Zhang (State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications), Li Zhang (Teesside University), Li Zhang (China Telecom Research Institute), Li Zhang (UC San Diego), Li Zhang (UK), Li Zhang (University of Pennsylvania), Li Zhang (Wuhan)
2026
Quantifying the Impact of Structured Output Format on Large Language Models through Causal Inference
Han Yuan | Yue Zhao | Li Zhang | Wuqiong Luo | Zheng Ma
Findings of the Association for Computational Linguistics: EACL 2026
Structured output from large language models (LLMs) has enhanced efficiency in processing generated information and is increasingly adopted in industrial applications. Prior studies have investigated the impact of structured output on LLMs’ generation quality, often presenting one-sided findings. Some suggest that structured format enhances completeness and factual accuracy, while others argue that it restricts the reasoning capacity of LLMs and leads to reductions in standard evaluation metrics. Potential limitations of these assessments include restricted testing scenarios, weakly controlled comparative settings, and reliance on coarse metrics. In this work, we present a refined analysis using causal inference. Based on one assumed and two guaranteed constraints, we derive five potential causal structures characterizing the influence of structured output on LLMs’ generation: (1) collider without m-bias, (2) collider with m-bias, (3) single cause from instruction, (4) single cause from output format, and (5) independence. Across seven public reasoning tasks and one that we developed, we find that coarse metrics report positive, negative, or neutral effects of structured output on GPT-4o’s generation. However, causal inference reveals no causal impact in 43 out of 48 scenarios. Of the remaining 5, 3 involve multifaceted causal structures influenced by concrete instructions. Further experiments show that OpenAI-o3 is more resilient to output formats than the general-purpose GPT-4o and GPT-4.1, highlighting a previously unrecognized advantage of reasoning models.
Language Model as Planner and Formalizer under Constraints
Cassie Huang | Stuti Mohan | Ziyi Yang | Stefanie Tellex | Li Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
LLMs have been widely used in planning, either as planners to generate action sequences end-to-end, or as formalizers to represent the planning domain and problem in a formal language that can derive plans deterministically. However, both lines of work rely on standard benchmarks that include only generic and simplistic environmental specifications, leading to potential overestimation of the planning ability of LLMs and safety concerns in downstream tasks. We bridge this gap by augmenting widely used planning benchmarks with manually annotated, fine-grained, and rich natural language constraints spanning four formally defined categories. Over 4 state-of-the-art reasoning LLMs, 4 formal languages, and 4 datasets, we show that the introduction of one-sentence constraints consistently halves performance, indicating current LLMs’ lack of robustness and an avenue for future research.