Chen Yang


2025

This paper presents the results of FIE2025, a shared task evaluating the ability of Large Language Models (LLMs) to perform factivity inference on Chinese texts: whether LLMs can correctly discern the veridical status of propositions encoded in complement clauses. Responses to the task reflect the extent to which LLMs can grasp the implicit truth judgments that human speakers convey through texts, as well as their subjective stances. Such a capability is crucial for autonomous inference in intelligent agents and for achieving fluid human–AI interaction. The task was hosted on the Alibaba Tianchi platform and evaluated through two tracks: with and without fine-tuning. A mixed dataset was constructed, combining synthetic sentences and authentic corpus instances. It comprises about 3,000 items labeled by expert linguists, including 845 (300+545) manually created items and 2,143 (700+1,443) items selected from existing corpora. A total of 404 submissions from 74 teams were successfully received by the Tianchi system. Overall, under current technological conditions, the key to successful factivity inference lies in whether LLMs can effectively identify different types of predicates and various contextual conditions in the given texts. Models that support long-context prompt inputs tend to achieve the best inference performance when provided with many shots. This shared task deepened our understanding of the factivity phenomenon in Chinese, expanded the influence of factivity research within natural language processing, and set an exploratory precedent for future activities focusing on factivity inference in Chinese and, potentially, other languages.
We show that internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge, and propose **RECALL**, a novel representation-aware model merging framework for continual learning without access to historical data. RECALL computes inter-model similarity from layer-wise hidden representations over clustered typical samples, and performs adaptive, hierarchical parameter fusion to align knowledge across models. This design preserves domain-general features in shallow layers while allowing task-specific adaptation in deeper layers. Unlike prior methods that require task labels or incur performance trade-offs, RECALL achieves seamless multi-domain integration and strong resistance to catastrophic forgetting. Extensive experiments across five NLP tasks and multiple continual learning scenarios show that RECALL outperforms baselines in both knowledge retention and generalization, providing a scalable and data-free solution for evolving LLMs.
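The core idea of the RECALL abstract can be illustrated in a minimal sketch: compare two models layer by layer via the hidden states they produce on shared probe samples, then weight parameter fusion by that similarity. All function names, the weighting rule, and the shape conventions below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def layerwise_similarity(hidden_a, hidden_b):
    """Cosine similarity between two models' mean hidden states, per layer.

    hidden_a, hidden_b: lists of (num_samples, dim) arrays, one per layer,
    collected over the same clustered probe samples (illustrative setup).
    """
    sims = []
    for ha, hb in zip(hidden_a, hidden_b):
        a, b = ha.mean(axis=0), hb.mean(axis=0)
        sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)))
    return sims

def merge_params(params_a, params_b, sims):
    """Hierarchical fusion sketch: highly similar (domain-general) layers are
    averaged evenly; dissimilar (task-specific) layers lean toward model A."""
    merged = []
    for wa, wb, s in zip(params_a, params_b, sims):
        alpha = min(max(0.5 * (1 + (1 - s)), 0.0), 1.0)  # assumed weighting rule
        merged.append(alpha * wa + (1 - alpha) * wb)
    return merged
```

Under this assumed rule, layers whose representations agree are blended 50/50, while disagreeing layers retain more of one model's task-specific weights.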
Recent advances in large language models (LLMs) have driven significant progress in end-to-end spoken dialogue models (SDMs). In contrast to text-based LLMs, the evaluation framework for SDMs should encompass both cognitive dimensions (e.g., logical reasoning, knowledge) and speech-related aspects (e.g., paralinguistic cues, audio quality). However, comprehensive evaluations of SDMs in speech-to-speech (S2S) scenarios are still lacking. To address this gap, we propose **URO-Bench**, an extensive benchmark for SDMs. Notably, URO-Bench is the first S2S benchmark that covers evaluation of multilingualism, multi-round dialogues, and paralinguistics. Our benchmark is divided into two difficulty levels, a basic track and a pro track, each comprising 20 test sets that evaluate a spoken dialogue model's abilities in **U**nderstanding, **R**easoning, and **O**ral conversation. Evaluations on our proposed benchmark reveal that current open-source SDMs perform rather well on daily QA tasks but lag behind their backbone LLMs in instruction-following ability and also suffer from catastrophic forgetting. Their performance in advanced evaluations of paralinguistic information and audio understanding remains subpar, highlighting the need for further research in this direction. We hope that URO-Bench can facilitate the development of spoken dialogue models by providing a multifaceted evaluation of existing models and helping to track progress in this area.

2024

In this paper, we introduce Holistic Semantic Embedding and Global Contrast (HS-GC), an end-to-end approach to learning instance- and cluster-level representations. Specifically, for instance-level representation learning, we introduce a new loss function that exploits different layers of semantic information in a deep neural network to provide a more holistic semantic text representation. Contrastive learning is applied to these representations to improve the model's ability to represent text instances. Additionally, for cluster-level representation learning, we propose two strategies that use global updates to construct cluster centers from a global view. Extensive experimental evaluation on five text datasets shows that our method outperforms state-of-the-art models. In particular, on the SearchSnippets dataset, our method outperforms the latest comparison method by 4.4% in normalized mutual information. On the StackOverflow and TREC datasets, our method improves clustering accuracy by 5.9% and 3.2%, respectively.
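The two ingredients described above, a multi-layer "holistic" embedding and globally updated cluster centers, can be sketched as follows. This is a minimal illustration under assumed conventions (per-layer sentence vectors, an EMA-style global update); the function names and momentum parameter are not from the paper.

```python
import numpy as np

def holistic_embedding(layer_outputs, weights=None):
    """Combine per-layer sentence embeddings into one holistic vector,
    a sketch of exploiting multiple layers of semantic information."""
    if weights is None:
        weights = np.ones(len(layer_outputs)) / len(layer_outputs)
    stacked = np.stack(layer_outputs)            # (num_layers, dim)
    return (weights[:, None] * stacked).sum(axis=0)

def global_center_update(centers, embeddings, assignments, momentum=0.9):
    """Update each cluster center from all currently assigned embeddings
    at once (a 'global view'), via an exponential moving average (assumed)."""
    new_centers = centers.copy()
    for k in range(len(centers)):
        mask = assignments == k
        if mask.any():
            new_centers[k] = (momentum * centers[k]
                              + (1 - momentum) * embeddings[mask].mean(axis=0))
    return new_centers
```

Updating centers from the full set of assignments, rather than per mini-batch, is one plausible reading of the "global update" strategy the abstract mentions.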