2025
Growing Through Experience: Scaling Episodic Grounding in Language Models
Chunhui Zhang | Sirui Wang | Zhongyu Ouyang | Xiangchi Yuan | Soroush Vosoughi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Language models (LMs) require effective episodic grounding (the ability to learn from and apply past experiences) to perform well at physical planning tasks. Current approaches struggle to scale and integrate episodic memory, a limitation that is especially pronounced for medium-sized LMs (7B parameters), whereas larger LMs (70-405B) offer untapped potential through their hierarchical representations and extensive pre-trained knowledge. To unlock this potential for grounding, we present a scalable weak-to-strong episodic learning framework that efficiently transfers episodic behaviors from smaller to larger LMs. It combines Monte Carlo tree search for structured experience collection with a novel distillation method that preserves LM capabilities while incorporating episodic memory, enabling larger LMs to leverage their inherent advantages for improved physical planning. Experiments show our solution outperforms top proprietary LMs by 3.45% across diverse planning and question-answering tasks. Layer-wise probing reveals systematic improvements in task alignment, particularly in later LM layers, and the framework generalizes stably to unseen scenarios even as planning steps increase, whereas baselines deteriorate sharply beyond a complexity threshold of four planning steps.
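As a rough illustration of the experience-collection step described above, the sketch below runs a generic Monte Carlo tree search over an abstract planning environment and extracts the best trajectory as (state, action) pairs that could serve as episodic training data for distillation. The environment interface (legal_actions, step, is_terminal, reward), the UCB constant, and the simulation budget are assumptions for illustration, not the paper's implementation.

from __future__ import annotations

import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    state: tuple                      # hashable environment state
    parent: Node | None = None
    action: str | None = None         # action taken from parent to reach this node
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def ucb(node: Node, c: float = 1.4) -> float:
    # Upper confidence bound used during selection; unvisited nodes win first.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits
    )

def mcts_collect(env, root_state, n_simulations: int = 100):
    """Run MCTS and return (state, action) pairs along the most-visited
    trajectory, usable as episodic training examples for distillation."""
    root = Node(state=root_state)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: descend by UCB until reaching a leaf.
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: add children for legal actions if non-terminal.
        if not env.is_terminal(node.state):
            for a in env.legal_actions(node.state):
                node.children.append(Node(state=env.step(node.state, a),
                                          parent=node, action=a))
            node = random.choice(node.children)
        # 3. Rollout: random playout to a terminal state.
        state = node.state
        while not env.is_terminal(state):
            state = env.step(state, random.choice(env.legal_actions(state)))
        reward = env.reward(state)
        # 4. Backpropagation of the rollout reward.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Extract the greedy (most-visited) trajectory as episodic experience.
    experience, node = [], root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
        experience.append((node.parent.state, node.action))
    return experience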
Temporal Working Memory: Query-Guided Segment Refinement for Enhanced Multimodal Understanding
Xingjian Diao | Chunhui Zhang | Weiyi Wu | Zhongyu Ouyang | Peijun Qing | Ming Cheng | Soroush Vosoughi | Jiang Gui
Findings of the Association for Computational Linguistics: NAACL 2025
Multimodal foundation models (MFMs) have demonstrated significant success in tasks such as visual captioning, question answering, and image-text retrieval. However, these models face inherent limitations due to their finite internal capacity, which restricts their ability to process extended temporal sequences—an essential requirement for comprehensive video and audio analysis. To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs. It selectively retains task-relevant information across temporal dimensions, ensuring that critical details are preserved throughout the processing of video and audio content. The TWM uses a query-guided attention approach to focus on the most informative multimodal segments within temporal sequences. By retaining only the most relevant content, TWM optimizes the use of the model’s limited capacity, enhancing its temporal modeling ability. This plug-and-play module can be easily integrated into existing MFMs. With our TWM, nine state-of-the-art models exhibit significant performance improvements across tasks such as video captioning, question answering, and video-text retrieval. By enhancing temporal modeling, TWM extends the capability of MFMs to handle complex, time-sensitive data effectively. Our code is available at https://github.com/xid32/NAACL_2025_TWM.
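A minimal sketch of the query-guided selection idea under assumed shapes: segment features are scored against a query embedding by cosine similarity, and only the top-k segments are kept within a fixed memory budget. The function name select_segments, the scoring rule, and the budget k are assumptions for illustration; the authors' implementation is in the linked repository.

import torch
import torch.nn.functional as F

def select_segments(segment_feats: torch.Tensor,
                    query_feat: torch.Tensor,
                    k: int = 8) -> tuple[torch.Tensor, torch.Tensor]:
    """
    segment_feats: (T, D) features for T temporal segments (video or audio).
    query_feat:    (D,)   embedding of the task query (e.g. the question).
    Returns the k highest-scoring segments and their indices, in temporal order.
    """
    # Cosine similarity between each segment and the query.
    scores = F.cosine_similarity(segment_feats, query_feat.unsqueeze(0), dim=-1)  # (T,)
    k = min(k, segment_feats.size(0))
    top = torch.topk(scores, k=k).indices
    top = torch.sort(top).values          # keep the retained segments in temporal order
    return segment_feats[top], top

# Usage: 32 segments of 512-d features, keep the 8 most query-relevant ones.
segments = torch.randn(32, 512)
query = torch.randn(512)
kept, idx = select_segments(segments, query, k=8)
print(kept.shape, idx.tolist())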
Visibility as Survival: Generalizing NLP for Native Alaskan Language Identification
Ivory Yang | Chunhui Zhang | Yuxin Wang | Zhongyu Ouyang | Soroush Vosoughi
Findings of the Association for Computational Linguistics: ACL 2025
Indigenous languages remain largely invisible in commercial language identification (LID) systems, a stark reality exemplified by Google Translate’s LangID tool, which supports over 100 languages but excludes all 150 Indigenous languages of North America. This technological marginalization is particularly acute for Alaska’s 20 Native languages, all of which face endangerment despite their rich linguistic heritage. We present GenAlaskan, a framework demonstrating how both large language models and specialized classifiers can effectively identify these languages with minimal data. Working closely with Native Alaskan community members, we create Akutaq-2k, a carefully curated dataset of 2,000 sentences spanning all 20 languages, named after the traditional Yup’ik dessert that symbolizes the blending of diverse elements. We design few-shot prompting strategies for proprietary and open-source LLMs, achieving nearly perfect accuracy with just 40 examples per language. While initial zero-shot attempts show limited success, our systematic attention head pruning reveals the architectural components critical for accurate language differentiation, providing insights into model decision-making for low-resource languages. Our results challenge the notion that effective Indigenous language identification requires massive resources or corporate infrastructure, demonstrating that targeted technological interventions can drive meaningful progress in preserving endangered languages in the digital age.
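To make the few-shot setup concrete, the sketch below assembles a language-identification prompt from a handful of labeled example sentences per language followed by the query sentence. The prompt wording, the helper build_lid_prompt, and the placeholder example sentences are assumptions for illustration, not the Akutaq-2k data or the paper's exact prompts.

import random

def build_lid_prompt(examples_by_language: dict[str, list[str]],
                     query_sentence: str,
                     shots_per_language: int = 2) -> str:
    # Few-shot prompt: labeled demonstrations per language, then the query.
    lines = ["Identify the language of the final sentence.",
             "Answer with the language name only.", ""]
    for language, sentences in examples_by_language.items():
        for sent in random.sample(sentences, min(shots_per_language, len(sentences))):
            lines.append(f"Sentence: {sent}")
            lines.append(f"Language: {language}")
            lines.append("")
    lines.append(f"Sentence: {query_sentence}")
    lines.append("Language:")
    return "\n".join(lines)

# Usage with placeholder (not real) example sentences:
demo = {
    "Yup'ik": ["<Yup'ik example sentence 1>", "<Yup'ik example sentence 2>"],
    "Inupiaq": ["<Inupiaq example sentence 1>", "<Inupiaq example sentence 2>"],
}
print(build_lid_prompt(demo, "<sentence to classify>"))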
Pretrained Image-Text Models are Secretly Video Captioners
Chunhui Zhang | Yiren Jian | Zhongyu Ouyang | Soroush Vosoughi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Developing video captioning models is computationally expensive, and the dynamic nature of video complicates the design of multimodal models that can effectively caption these sequences. However, we find that with minimal computational resources and without complex modifications to address video dynamics, an image-based model can be repurposed to outperform several specialized video captioning systems. Our adapted model demonstrates top-tier performance on major benchmarks, ranking 2nd on MSR-VTT and MSVD, and 3rd on VATEX. We transform a typical image captioning model, BLIP-2, into a competitive video captioner by post-training it on only 6,000 video-text pairs and simply concatenating frames, far less data than other methods, which use 2.5 to 144 million pairs. From a resource optimization perspective, this video captioning study focuses on three fundamental factors: optimizing model scale, maximizing data efficiency, and incorporating reinforcement learning. This extensive study demonstrates that a lightweight, image-based adaptation strategy can rival state-of-the-art video captioning systems, offering a practical solution for low-resource scenarios.
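One plausible reading of the frame-concatenation idea is sketched below: uniformly sample a few frames, tile them into a single image, and caption the result with an off-the-shelf BLIP-2 checkpoint. The frame count, the horizontal tiling layout, and the checkpoint choice are assumptions for illustration rather than the paper's recipe, which additionally post-trains the model on the 6,000 video-text pairs.

import cv2
import numpy as np
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

def sample_frames(video_path: str, n_frames: int = 4) -> list[Image.Image]:
    # Uniformly sample n_frames frames from the video as RGB PIL images.
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), n_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

def tile_horizontally(frames: list[Image.Image], height: int = 224) -> Image.Image:
    # Resize frames to a common height and paste them side by side.
    resized = [f.resize((int(f.width * height / f.height), height)) for f in frames]
    canvas = Image.new("RGB", (sum(f.width for f in resized), height))
    x = 0
    for f in resized:
        canvas.paste(f, (x, 0))
        x += f.width
    return canvas

# Caption the tiled frames with an off-the-shelf BLIP-2 checkpoint.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
image = tile_horizontally(sample_frames("example_video.mp4"))  # hypothetical file path
inputs = processor(images=image, return_tensors="pt")
caption = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)
print(caption)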
2024
Working Memory Identifies Reasoning Limits in Language Models
Chunhui Zhang | Yiren Jian | Zhongyu Ouyang | Soroush Vosoughi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
This study explores the inherent limitations of large language models (LLMs) from a scaling perspective, focusing on the upper bounds of their cognitive capabilities. We integrate insights from cognitive science to quantitatively examine how LLMs perform on n-back tasks, a benchmark for working memory that requires temporarily holding and manipulating information. Our findings reveal that despite increased model size, LLMs still face significant challenges in holding and processing information effectively, especially under complex task conditions. We also assess various prompting strategies, revealing their diverse impacts on LLM performance. The results highlight the struggle of current LLMs to autonomously discover optimal problem-solving patterns without relying heavily on manually corrected prompts. To move beyond these constraints, fundamental improvements in LLMs' planning and search capabilities are essential for autonomous reasoning; improving these capabilities will reduce the reliance on external corrections and enable LLMs to become more autonomous in their problem-solving processes.
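For concreteness, the sketch below generates one verbal n-back trial of the kind used to probe working memory: a letter sequence with planted matches n steps back, plus gold match/no-match labels for scoring a model's answers. The sequence length, alphabet, match rate, and prompt wording are assumptions for illustration, not the paper's exact protocol.

import random

def make_nback_trial(n: int = 2, length: int = 12, match_rate: float = 0.3):
    letters, targets = [], []
    for i in range(length):
        if i >= n and random.random() < match_rate:
            letters.append(letters[i - n])     # planted n-back match
        else:
            letters.append(random.choice("BCDFGHJKLMNP"))
        # Positions before index n are labeled no-match by convention.
        targets.append("match" if i >= n and letters[i] == letters[i - n] else "no-match")
    prompt = (
        f"You will see a sequence of letters one at a time. For each letter, "
        f"say 'match' if it is identical to the letter {n} steps earlier, "
        f"otherwise say 'no-match'.\nSequence: {' '.join(letters)}"
    )
    return prompt, targets

prompt, gold = make_nback_trial(n=2)
print(prompt)
print("Gold labels:", gold)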
Learning Musical Representations for Music Performance Question Answering
Xingjian Diao | Chunhui Zhang | Tingxuan Wu | Ming Cheng | Zhongyu Ouyang | Weiyi Wu | Jiang Gui
Findings of the Association for Computational Linguistics: EMNLP 2024
Music performances are representative scenarios for audio-visual modeling. Unlike common scenarios with sparse audio, music performances involve dense audio signals throughout. While existing multimodal learning methods for audio-visual question answering demonstrate impressive capabilities in general scenarios, they fall short on fundamental problems within music performances: they underexplore the interaction between multimodal signals in a performance and fail to consider the distinctive characteristics of instruments and music. As a result, existing methods tend to answer questions about musical performances inaccurately. To bridge these research gaps, first, given the intricate multimodal interconnectivity inherent to music data, our primary backbone is designed to incorporate multimodal interactions within the context of music; second, to enable the model to learn music characteristics, we annotate and release rhythm and music-source labels for the current music datasets; third, for time-aware audio-visual modeling, we align the model’s music predictions with the temporal dimension. Our experiments achieve state-of-the-art results on the Music AVQA datasets. Our code is available at: https://github.com/xid32/Amuse.
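A minimal sketch of what the time-aware supervision could look like under assumed shapes: per-frame predictions of which sound sources are active are trained against frame-level activity annotations with a binary cross-entropy objective. The tensor shapes and the loss choice are assumptions for illustration, not the Amuse implementation in the linked repository.

import torch
import torch.nn.functional as F

def temporal_alignment_loss(frame_logits: torch.Tensor,
                            source_activity: torch.Tensor) -> torch.Tensor:
    """
    frame_logits:    (B, T, S) model scores for S sound sources at T frames.
    source_activity: (B, T, S) binary annotations of which sources are active.
    Aligns predictions with the temporal dimension via per-frame supervision.
    """
    return F.binary_cross_entropy_with_logits(frame_logits, source_activity.float())

# Usage with random placeholders: batch of 2 clips, 50 frames, 4 sources.
logits = torch.randn(2, 50, 4)
labels = torch.randint(0, 2, (2, 50, 4))
print(temporal_alignment_loss(logits, labels).item())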