Zhe Su
2026
PlotGen-Bench: Evaluating VLMs on Generating Visualization Code from Diverse Plots across Multiple Libraries
Yi Zhao | Zhen Yang | Shuaiqi Duan | Wenmeng Yu | Zhe Su | Jibing Gong | Jie Tang
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in vision–language models (VLMs) have expanded their multimodal code generation capabilities, yet their ability to generate executable visualization code from plots, especially in complex 3D, animated, plot-to-plot transformation, or multi-library scenarios, remains underexplored. To address this gap, we introduce PlotGen-Bench, a comprehensive benchmark for evaluating plot-to-code generation under realistic and complex visualization scenarios. The benchmark spans 9 major categories, 30 subcategories, and 3 core tasks (plot replication, plot transformation, and multi-library generation), covering 2D, 3D, and animated plots across 5 widely used visualization libraries. Through systematic evaluation of state-of-the-art open- and closed-source VLMs, we find that open-source models still lag considerably behind in visual fidelity and semantic consistency, despite achieving comparable code executability. Moreover, all models exhibit substantial degradation on reasoning-intensive tasks such as chart type conversion and animation generation. PlotGen-Bench establishes a rigorous foundation for advancing research toward more capable and reliable VLMs for visualization authoring and code synthesis, with all data and code available at https://plotgen.github.io.
Glyph: Scaling Context Windows via Visual-Text Compression
Jiale Cheng | Yusen Liu | Xinyu Zhang | Yulin Fei | Wenyi Hong | Ruiliang Lyu | Weihan Wang | Zhe Su | Xiaotao Gu | Xiao Liu | Yushi Bai | Jie Tang | Hongning Wang | Minlie Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) conventionally represent text as sequences of discrete tokens, making long-context scaling largely a matter of processing more tokens more efficiently. We instead explore a complementary direction: increasing how much original context each token represents. To this end, we introduce Glyph, a framework that renders long texts into compact visual pages and processes them with a vision-language model (VLM), allowing a fixed context window to cover substantially more text. To make visual compression practical, Glyph combines continual pre-training on rendered long-text data, an LLM-driven genetic search to identify rendering configurations that balance compression and task performance, and post-training with supervised fine-tuning and reinforcement learning. Across multiple long-context benchmarks, Glyph achieves 3–4× token compression while maintaining performance comparable to strong text-only LLMs such as Qwen3-8B, with over 4× faster prefilling and decoding and 2× faster supervised fine-tuning. Under more aggressive compression, a VLM with a 128K context window can handle tasks that would otherwise require up to 1M input tokens. Our code and model are released at https://github.com/thu-coai/Glyph.
2025
AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents
Zhe Su | Xuhui Zhou | Sanketh Rangreji | Anubha Kabra | Julia Mendelsohn | Faeze Brahman | Maarten Sap
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Truthfulness (adherence to factual accuracy) and utility (satisfying human needs and instructions) are both fundamental aspects of Large Language Models, yet these goals often conflict (e.g., selling a car with known flaws), making it challenging to achieve both in real-world deployments. We propose AI-LieDar, a framework to study how LLM-based agents navigate these scenarios in a multi-turn interactive setting. We design a set of real-world scenarios where language agents are instructed to achieve goals that are in conflict with being truthful during a multi-turn conversation with simulated human agents. To evaluate truthfulness at scale, we develop a truthfulness detector inspired by psychological literature to assess the agents’ responses. Our experiment demonstrates that all models are truthful less than 50% of the time, although truthfulness and goal achievement (utility) rates vary across models. We further test the steerability of LLMs towards truthfulness, finding that models can be directed to be deceptive, and even truth-steered models still lie. These findings reveal the complex nature of truthfulness in LLMs and underscore the importance of further research to ensure the safe and reliable deployment of LLMs and AI agents.
SOTOPIA-S4: a user-friendly system for flexible, customizable, and large-scale social simulation
Xuhui Zhou | Zhe Su | Sophie Feng | Jiaxu Zhou | Jen-tse Huang | Hsien-Te Kao | Spencer Lynch | Svitlana Volkova | Tongshuang Wu | Anita Woolley | Hao Zhu | Maarten Sap
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Social simulation through large language model (LLM) agents is a promising approach to explore and validate social science hypotheses. We present SOTOPIA-S4, a fast, flexible, and scalable social simulation system that addresses the technical barriers of current frameworks while enabling practitioners to generate realistic, multi-turn and multi-party interactions with customizable evaluation metrics for hypothesis testing. SOTOPIA-S4 comes as a pip package that contains a simulation engine, an API server with flexible RESTful APIs for simulation management, and a web interface that enables both technical and non-technical users to design, run, and analyze simulations without programming. We demonstrate the usefulness of SOTOPIA-S4 with two use cases involving dyadic hiring negotiation scenarios and multi-party planning scenarios.
2024
Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs
Xuhui Zhou | Zhe Su | Tiwalayo Eisape | Hyunwoo Kim | Maarten Sap
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent advances in large language models (LLMs) have enabled richer social simulations, allowing for the study of various social phenomena. However, most recent work has taken an omniscient perspective on these simulations (e.g., a single LLM generating all interlocutors), which is fundamentally at odds with the non-omniscient, information-asymmetric interactions that involve humans and AI agents in the real world. To examine these differences, we develop an evaluation framework to simulate social interactions with LLMs in various settings (omniscient, non-omniscient). Our experiments show that LLMs perform better in unrealistic, omniscient simulation settings but struggle in ones that more accurately reflect real-world conditions with information asymmetry. Moreover, we illustrate the limitations inherent in learning from omniscient simulations. Our findings indicate that addressing information asymmetry remains a fundamental challenge for LLM-based agents.
2023
Uncovering and Categorizing Social Biases in Text-to-SQL
Yan Liu | Yan Gao | Zhe Su | Xiaokang Chen | Elliott Ash | Jian-Guang Lou
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large pre-trained language models are acknowledged to carry social bias towards different demographics, which can further amplify existing stereotypes in our society and cause even more harm. Text-to-SQL is an important task, models of which are mainly adopted by administrative industries, where unfair decisions may lead to catastrophic consequences. However, existing Text-to-SQL models are trained on clean, neutral datasets, such as Spider and WikiSQL. This, to some extent, covers up social bias in models under ideal conditions, which nevertheless may emerge in real application scenarios. In this work, we aim to uncover and mitigate social bias in Text-to-SQL models. We summarize the categories of social bias that may occur in structural data for Text-to-SQL models. We build test benchmarks and reveal that models with similar task accuracy can contain social bias at very different rates. We show how to take advantage of our methodology to assess and mitigate social bias in the downstream Text-to-SQL task.
Co-authors
- Maarten Sap 3
- Xuhui Zhou 3
- Jie Tang 2
- Elliott Ash 1
- Yushi Bai 1
- Faeze Brahman 1
- Xiaokang Chen 1
- Jiale Cheng 1
- Shuaiqi Duan 1
- Tiwalayo Eisape 1
- Yulin Fei 1
- Sophie Feng 1
- Yan Gao 1
- Jibing Gong 1
- Xiaotao Gu 1
- Wenyi Hong 1
- Jen-tse Huang 1
- Minlie Huang 1
- Anubha Kabra 1
- Hsien-Te Kao 1
- Hyunwoo Kim 1
- Yan Liu 1
- Yusen Liu 1
- Xiao Liu 1
- Jian-Guang Lou 1
- Spencer Lynch 1
- Ruiliang Lyu 1
- Julia Mendelsohn 1
- Sanketh Rangreji 1
- Svitlana Volkova 1
- Weihan Wang 1
- Hongning Wang 1
- Anita Woolley 1
- Tongshuang Wu 1
- Zhen Yang 1
- Wenmeng Yu 1
- Xinyu Zhang 1
- Yi Zhao 1
- Jiaxu Zhou 1
- Hao Zhu 1