Zhan Su
2026
A Dual-View Analysis of Multiple Languages in Colonial Newspapers
Zhan Su | Xiaoya Chen | Fengran Mo | Ida L. Vos | Prayag Tiwari | Yazhou Zhang | Qian Zheng | Natália da Silva Perez
Findings of the Association for Computational Linguistics: ACL 2026
Historical newspapers from the colonial period offer valuable evidence of how racializing language evolved over time. However, studying this type of historical data poses several challenges: 1) Data scarcity: large, annotated historical datasets are difficult to acquire, which hinders comprehensive analysis of racialization; 2) Digitized materials frequently contain Optical Character Recognition (OCR) errors and other types of noise that complicate text extraction and computational analysis; 3) Colonial newspapers are often multilingual and written in archaic prose, which reduces the effectiveness of NLP tools developed for modern, single-language texts. This paper addresses these challenges through a dual-view analysis that jointly studies multilingual event extraction and temporal semantic shift. Specifically, we introduce a contextual question answering (CQA) dataset and a visual question answering (VQA) dataset derived from eighteenth- and nineteenth-century colonial newspapers. Content-wise, we focus on how enslaved people were described by enslavers as well as how they articulated their own condition, using QA pairs from newspapers written in Dutch, English-French, and Spanish. Our results show that LLMs are still limited for low-resource VQA tasks. For temporal semantic change, we train temporal word embeddings with a compass. The study concludes that racialization is a fluid process of linguistic recalibration in which the decline of slavery merely shifted the language of control onto new categories of labor and identity.
Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion
Shunian Chen | Xinyuan Xie | Zheshu Chen | Owen Lee | Liyan Zhao | Zhan Su | Qilin Sun | Benyou Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
High-quality, large-scale audio captioning is crucial for advancing audio understanding, yet current automated methods often generate captions that lack fine-grained detail and contextual accuracy, primarily due to their reliance on limited unimodal or superficial multimodal information. Drawing inspiration from human auditory perception, which adeptly integrates cross-modal cues and performs sophisticated auditory scene analysis, we introduce a novel two-stage automated pipeline. This pipeline first employs specialized pretrained models to extract diverse contextual cues (e.g., speech, music, general sounds, and visual information from associated video). A large language model (LLM) then synthesizes these rich, multimodal inputs to generate detailed and context-aware audio captions. Key contributions of this work include: (1) the proposed scalable method for fine-grained audio caption generation; (2) FusionAudio, a new large-scale dataset comprising 1.2 million such detailed captions, combined with 6 million QA pairs; and (3) enhanced audio models developed using FusionAudio, specifically a CLAP-based audio encoder with superior audio-text alignment and instruction following. This paper paves the way for more nuanced and accurate automated understanding of complex audio environments.
2024
History-Aware Conversational Dense Retrieval
Fengran Mo | Chen Qu | Kelong Mao | Tianyu Zhu | Zhan Su | Kaiyu Huang | Jian-Yun Nie
Findings of the Association for Computational Linguistics: ACL 2024
Conversational search facilitates complex information retrieval by enabling multi-turn interactions between users and the system. Supporting such interactions requires a comprehensive understanding of the conversational inputs to formulate a good search query based on historical information. In particular, the search query should include the relevant information from the previous conversation turns. However, current approaches for conversational dense retrieval primarily rely on fine-tuning a pre-trained ad-hoc retriever using the whole conversational search session, which can be lengthy and noisy. Moreover, existing approaches are limited by the amount of manual supervision signals in the existing datasets. To address the aforementioned issues, we propose a **H**istory-**A**ware **Conv**ersational **D**ense **R**etrieval (HAConvDR) system, which incorporates two ideas: context-denoised query reformulation and automatic mining of supervision signals based on the actual impact of historical turns. Experiments on two public conversational search datasets demonstrate the improved history modeling capability of HAConvDR, in particular for long conversations with topic shifts.