Ying-Cong Chen
2026
Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention
Zhen Yang | Mingyang Zhang | Feng Chen | Ganggui Ding | Liang Hou | Xin Tao | Ying-Cong Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized—only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model’s KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks—e.g., +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0—while remaining highly efficient.
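The selective intervention described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the entropy threshold `tau`, the guidance scale `w`, and the specific guidance formula (the standard classifier-free guidance combination) are all assumptions chosen for illustration. The sketch also takes the negative-prompt logits as a given input and does not model the KV-cache reuse mentioned in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def selective_cfg(cond_logits, neg_logits, tau=1.0, w=1.5):
    """Apply guidance only when the conditional distribution is uncertain.

    cond_logits: next-token logits from the normal prompt.
    neg_logits:  next-token logits from a negative (or empty) prompt.
    tau, w:      hypothetical threshold and guidance scale, not values
                 taken from the paper.
    Returns (logits, intervened).
    """
    h = entropy(softmax(cond_logits))
    if h <= tau:
        # Confident position: leave the model's logits untouched.
        return list(cond_logits), False
    # Uncertain position: standard CFG combination of the two logit sets.
    guided = [n + w * (c - n) for c, n in zip(cond_logits, neg_logits)]
    return guided, True


# A peaked distribution is left alone; a flat one is guided.
confident, hit_a = selective_cfg([10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
uncertain, hit_b = selective_cfg([0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
```

Under these assumptions, the extra forward pass for the negative prompt would only be needed at positions where the entropy exceeds `tau`, which is where the claimed efficiency would come from.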
2025
PreGenie: An Agentic Framework for High-quality Visual Presentation Generation
Xiaojie Xu | Xinli Xu | Sirui Chen | Haoyu Chen | Fan Zhang | Ying-Cong Chen
Findings of the Association for Computational Linguistics: EMNLP 2025
Visual presentations are vital for effective communication. Early attempts to automate their creation using deep learning often faced issues such as poorly organized layouts, inaccurate text summarization, and a lack of image understanding, leading to mismatched visuals and text. These limitations restrict their application in formal contexts like business and scientific research. To address these challenges, we propose PreGenie, an agentic and modular framework powered by multimodal large language models (MLLMs) for generating high-quality visual presentations. PreGenie is built on the Slidev presentation framework, where slides are rendered from Markdown code. It operates in two stages: (1) Analysis and Initial Generation, which summarizes multimodal input and generates initial code, and (2) Review and Re-generation, which iteratively reviews intermediate code and rendered slides to produce final, high-quality presentations. Each stage leverages multiple MLLMs that collaborate and share information. Comprehensive experiments demonstrate that PreGenie excels in multimodal understanding, outperforming existing models in both aesthetics and content consistency, while aligning more closely with human design preferences.
Orchestrating Audio: Multi-Agent Framework for Long-Video Audio Synthesis
Yehang Zhang | Xinli Xu | Xiaojie Xu | Doudou Zhang | Li Liu | Ying-Cong Chen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Video-to-audio synthesis, which generates synchronized audio for visual content, critically enhances viewer immersion and narrative coherence in film and interactive media. However, video-to-audio dubbing for long-form content remains an unsolved challenge due to dynamic semantic shifts, audio diversity, and the absence of dedicated datasets. While existing methods excel on short videos, they falter in long scenarios (e.g., movies) due to fragmented synthesis and inadequate cross-scene consistency. We propose LVAS-Agent, a multi-agent framework that offers a coordinated, multi-component approach to long-video audio generation. Our approach decomposes long-video synthesis into four steps: scene segmentation, script generation, audio design, and audio synthesis. To enable systematic evaluation, we introduce LVAS-Bench, the first benchmark with 207 professionally curated long videos spanning diverse scenarios. Experiments show that our method outperforms state-of-the-art V2A models in overall audio synthesis quality.