Xiaoyuan Zhang
2025
Can LLMs Understand Unvoiced Speech? Exploring EMG-to-Text Conversion with LLMs
Payal Mohapatra | Akash Pandey | Xiaoyuan Zhang | Qi Zhu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Unvoiced electromyography (EMG) is an effective communication tool for individuals unable to produce vocal speech. However, most prior methods rely on paired voiced and unvoiced EMG signals, along with speech data, for unvoiced EMG-to-text conversion, which is not practical for these individuals. Given the rise of large language models (LLMs) in speech recognition, we explore their potential to understand unvoiced speech. To this end, we address the challenge of learning from unvoiced EMG alone and propose a novel EMG adaptor module that maps EMG features to an LLM’s input space, achieving an average word error rate of 0.49 on a closed-vocabulary unvoiced EMG-to-text task. Even with a conservative data availability of just six minutes, our approach improves performance over specialized models by nearly 20%. While LLMs have been shown to be extendable to new language modalities—such as audio—understanding articulatory biosignals, like unvoiced EMG, is more challenging. This work takes a crucial first step toward enabling LLMs to comprehend unvoiced speech using surface EMG.
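A minimal sketch of what such an EMG adaptor could look like, assuming precomputed EMG feature frames and a frozen decoder-only LLM; the module layout, names, and dimensions below are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class EMGAdaptor(nn.Module):
    """Illustrative adaptor: maps EMG feature frames into an LLM's input embedding space."""
    def __init__(self, emg_dim: int, llm_dim: int, stride: int = 4):
        super().__init__()
        # Temporal downsampling so the EMG sequence length is closer to token granularity.
        self.conv = nn.Conv1d(emg_dim, llm_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, emg_feats: torch.Tensor) -> torch.Tensor:
        # emg_feats: (batch, time, emg_dim) -> (batch, time // stride, llm_dim)
        x = self.conv(emg_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)

# Usage: the adapted EMG "soft tokens" would be prepended to the text prompt embeddings.
adaptor = EMGAdaptor(emg_dim=112, llm_dim=4096)          # hypothetical feature / LLM sizes
emg = torch.randn(2, 800, 112)                            # toy batch of EMG feature frames
soft_tokens = adaptor(emg)                                 # (2, 200, 4096)
```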
TS-CLIP: Time Series Understanding by CLIP
Ziwen Chen | Xiaoyuan Zhang | Ming Zhu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Contrastive Language–Image Pre-training (CLIP) has recently demonstrated remarkable success in aligning vision and language. Aligning time series with text leverages the rich semantic cues of language to enhance interpretability and generalization, addressing a largely underexplored area of research. Although applying the CLIP training paradigm to time-series and language pairs is promising, it may result in label collapse due to the sparse semantic annotations and the absence of visual cues in time-series data. To address this, we introduce Time Series CLIP (TS-CLIP), a novel approach that tackles label collapse using a synonym bank mechanism. The synonym bank exploits word analogy phenomena to generate potential synonym embeddings as alignment targets. Specifically, the synonym bank facilitates aligning time series with a word distribution instead of a precise textual description. We conducted extensive zero-shot and few-shot experiments on 128 sub-datasets from the UCR archive. The results show that TS-CLIP achieves state-of-the-art (SOTA) performance in zero-shot settings on 51 datasets. Comprehensive ablation studies and visualization analyses reveal that TS-CLIP effectively aligns time series with natural language. To the best of our knowledge, this is the first foundational model to achieve general time series and natural language alignment. TS-CLIP introduces a new paradigm for the semantic understanding of time series and opens the possibility of integrating the time series modality into multimodal large models.
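A minimal sketch of a CLIP-style contrastive loss against a synonym bank, assuming each class comes with K synonym text embeddings and that the series is aligned to the class's synonym distribution via its mean embedding; this is an assumed simplification, not TS-CLIP's exact objective:

```python
import torch
import torch.nn.functional as F

def synonym_bank_clip_loss(ts_emb, syn_bank_emb, labels, temperature=0.07):
    """
    ts_emb:       (B, D) time-series encoder outputs
    syn_bank_emb: (C, K, D) K synonym text embeddings per class
    labels:       (B,)   class index of each series
    """
    ts_emb = F.normalize(ts_emb, dim=-1)
    syn_bank_emb = F.normalize(syn_bank_emb, dim=-1)
    # Align each series with a word distribution, here summarized by the mean synonym embedding.
    class_proto = F.normalize(syn_bank_emb.mean(dim=1), dim=-1)   # (C, D)
    logits = ts_emb @ class_proto.t() / temperature               # (B, C)
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors.
loss = synonym_bank_clip_loss(torch.randn(8, 256), torch.randn(5, 4, 256),
                              torch.randint(0, 5, (8,)))
```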
Enhancing LLM-Based Social Bot via an Adversarial Learning Framework
Fanqi Kong | Xiaoyuan Zhang | Xinyu Chen | Yaodong Yang | Song-Chun Zhu | Xue Feng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Developing Large Language Model (LLM) agents that exhibit human-like behavior, encompassing not only individual heterogeneity rooted in unique user profiles but also adaptive response to socially connected neighbors, is a significant research challenge. Social media platforms, with their diverse user data and explicit social structures, provide an ideal testbed for such investigations. This paper introduces EvoBot, an Evolving LLM-based social Bot that significantly enhances human-like generative capabilities through a novel adversarial learning framework. EvoBot is initialized by Supervised Fine-Tuning (SFT) on representative data from social media and then iteratively refines its generation of sophisticated, human-like content via Direct Preference Optimization (DPO). This refinement is guided by feedback from a co-adapting Detector which concurrently improves its ability to distinguish EvoBot from humans, thereby creating an increasingly challenging learning environment for EvoBot. Experiments demonstrate that EvoBot generates content aligned with diverse user profiles, increasingly bypassing the co-adapting Detector through human-like expression. Moreover, it exhibits strong social responsiveness, more accurately modeling real-world opinion dynamics and information spread in multi-agent simulations. The framework also yields a more robust Detector, underscoring its broader utility for both advanced agent development and related detection tasks. The code is available at https://anonymous.4open.science/r/EvoBot-036D.
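A minimal sketch of how detector feedback could be turned into DPO preference pairs in such an adversarial loop; the helper names and the toy generator/detector below are hypothetical stand-ins for the data flow, not the released EvoBot code:

```python
import random

def build_dpo_pairs(prompts, generate, detector_score, n_samples=2):
    """For each prompt, sample candidate posts from the bot and prefer the one the
    co-adapting detector judges most human-like; the pairs then feed a DPO update."""
    pairs = []
    for p in prompts:
        candidates = [generate(p) for _ in range(n_samples)]
        scored = sorted(candidates, key=detector_score, reverse=True)
        pairs.append({"prompt": p, "chosen": scored[0], "rejected": scored[-1]})
    return pairs

# Toy stand-ins for the LLM bot and detector, just to show the loop's data flow.
toy_generate = lambda p: p + " -> " + random.choice(["reply A", "reply B", "reply C"])
toy_detector = lambda text: random.random()   # higher = judged more human-like
dpo_data = build_dpo_pairs(["post about the weather"], toy_generate, toy_detector)
print(dpo_data)
```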