2025
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
Jianlyu Chen | Nan Wang | Chaofan Li | Bo Wang | Shitao Xiao | Han Xiao | Hao Liao | Defu Lian | Zheng Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.
Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis
Junzhuo Li | Bo Wang | Xiuze Zhou | Peijie Jiang | Jia Liu | Xuming Hu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture dynamic routing-expert interactions in sparse MoE architectures. To address this issue, we propose a cross-level attribution algorithm to analyze sparse MoE architectures (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense models (Qwen 1.5-7B, Llama-7B, Mistral-7B). Results show MoE models achieve 31% higher per-layer efficiency via a “mid-activation, late-amplification” pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a “basic-refinement” framework—shared experts handle general tasks (entity recognition), while routed experts specialize in domain-specific processing (geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r=0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: deep Qwen-MoE mitigates expert failures (e.g., 43% MRR drop in geographic tasks when blocking top-10 experts) through shared expert redundancy, whereas shallow OLMoE suffers severe degradation (76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.
ECC: Synergizing Emotion, Cause and Commonsense for Empathetic Dialogue Generation
Xu Wang | Bo Wang | Yihong Tang | Dongming Zhao | Jing Liu | Ruifang He | Yuexian Hou
Proceedings of the 31st International Conference on Computational Linguistics
Empathy improves human-machine dialogue systems by enhancing the user’s experience. While traditional models have aimed to detect and express users’ emotions from dialogue history, they neglect the crucial and complex interactions among emotion, emotion causes, and commonsense. To address this, we introduce the ECC (Emotion, Cause, and Commonsense) framework, which leverages specialized encoders to capture the key features of emotion, cause, and commonsense and collaboratively models these through a Conditional Variational Auto-Encoder. ECC further employs novel loss functions to refine the interplay of three factors and generates empathetic responses using an energy-based model supported by ODE sampling. Empirical results on the EmpatheticDialogues dataset demonstrate that ECC outperforms existing baselines, offering a robust solution for empathetic dialogue generation.
RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems
Yihong Tang | Bo Wang | Xu Wang | Dongming Zhao | Jing Liu | Ruifang He | Yuexian Hou
Proceedings of the 31st International Conference on Computational Linguistics
Role-playing systems powered by large language models (LLMs) have become increasingly influential in emotional communication applications. However, these systems are susceptible to character hallucinations, where the model deviates from predefined character roles and generates responses that are inconsistent with the intended persona. This paper presents the first systematic analysis of character hallucination from an attack perspective, introducing the RoleBreak framework. Our framework identifies two core mechanisms—query sparsity and role-query conflict—as key factors driving character hallucination. Leveraging these insights, we construct a novel dataset, RoleBreakEval, to evaluate existing hallucination mitigation techniques. Our experiments reveal that even enhanced models trained to minimize hallucination remain vulnerable to attacks. To address these vulnerabilities, we propose a novel defence strategy, the Narrator Mode, which generates supplemental context through narration to mitigate role-query conflicts and improve query generalization. Experimental results demonstrate that Narrator Mode significantly outperforms traditional refusal-based strategies by reducing hallucinations, enhancing fidelity to character roles and queries, and improving overall narrative coherence.
Mitigating the Discrepancy Between Video and Text Temporal Sequences: A Time-Perception Enhanced Video Grounding method for LLM
Xuefen Li | Bo Wang | Ge Shi | Chong Feng | Jiahao Teng
Proceedings of the 31st International Conference on Computational Linguistics
Existing video LLMs typically excel at capturing the overall description of a video but lack the ability to demonstrate an understanding of temporal dynamics and a fine-grained grasp of localized content within the video. In this paper, we propose a Time-Perception Enhanced Video Grounding method via Boundary Perception and Temporal Reasoning, aimed at mitigating LLMs’ difficulties in understanding the discrepancies between video and text temporality. Specifically, to address the inherent biases in current datasets, we design a series of boundary-perception tasks to enable LLMs to capture accurate video temporality. To tackle LLMs’ insufficient understanding of temporal information, we develop specialized tasks for boundary perception and temporal relationship reasoning to deepen LLMs’ perception of video temporality. Our experimental results show significant improvements across three datasets: ActivityNet, Charades, and DiDeMo (achieving up to 11.2% improvement on R@0.3), demonstrating the effectiveness of our proposed temporal awareness-enhanced data construction method.
REARANK: Reasoning Re-ranking Agent via Reinforcement Learning
Le Zhang | Bo Wang | Xipeng Qiu | Siva Reddy | Aishwarya Agrawal
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We present REARANK, a large language model (LLM)-based listwise reasoning reranking agent. REARANK explicitly reasons before reranking, significantly improving both performance and interpretability. Leveraging reinforcement learning and data augmentation, REARANK achieves substantial improvements over baseline models across popular information retrieval benchmarks, notably requiring only 179 annotated samples. Built on top of Qwen2.5-7B, our REARANK-7B demonstrates performance comparable to GPT-4 on both in-domain and out-of-domain benchmarks and even surpasses GPT-4 on reasoning-intensive BRIGHT benchmarks. These results underscore the effectiveness of our approach and highlight how reinforcement learning can enhance LLM reasoning capabilities in reranking.
Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation
Junzhuo Li | Bo Wang | Xiuze Zhou | Xuming Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Mixture-of-Experts (MoE) models offer immense capacity via sparsely gated expert subnetworks, yet adapting them to multiple domains without catastrophic forgetting remains an open challenge. Existing approaches either incur prohibitive computation, suffer cross-domain interference, or require separate runs per domain. We propose DES-MoE, a dynamic expert specialization framework for multi-domain adaptation of Mixture-of-Experts models. DES-MoE addresses catastrophic forgetting through three innovations: (1) an adaptive router balancing pre-trained knowledge retention and task-specific updates via distillation, (2) real-time expert-domain correlation mapping to isolate domain-specific gradients, and (3) a three-phase adaptive fine-tuning schedule that progressively freezes non-specialized parameters. Evaluated on six domains (math, code, law, etc.), DES-MoE matches single-domain ESFT performance while training one unified model, reduces forgetting by 89% compared to full fine-tuning as domains scale from 2 to 6, and achieves 68% faster convergence than conventional methods. Our work establishes dynamic expert isolation as a scalable paradigm for multi-task MoE adaptation.
R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning
Yuan Li | Qi Luo | Xiaonan Li | Bufan Li | Qinyuan Cheng | Bo Wang | Yining Zheng | Yuxin Wang | Zhangyue Yin | Xipeng Qiu
Findings of the Association for Computational Linguistics: EMNLP 2025
Retrieval-Augmented Generation (RAG) integrates external knowledge with Large Language Models (LLMs) to enhance factual correctness and mitigate hallucination. However, dense retrievers often become the bottleneck of RAG systems due to their limited parameters compared to LLMs and their inability to perform step-by-step reasoning. While prompt-based iterative RAG attempts to address these limitations, it is constrained by human-designed workflows. To address these limitations, we propose R3-RAG, which uses Reinforcement learning to make the LLM learn how to Reason and Retrieve step by step, thus retrieving comprehensive external knowledge and leading to correct answers. R3-RAG is divided into two stages. We first use cold start to make the model learn the manner of iteratively interleaving reasoning and retrieval. Then we use reinforcement learning to further harness its ability to better explore the external retrieval environment. Specifically, we propose two rewards for R3-RAG: 1) answer correctness for outcome reward, which judges whether the trajectory leads to a correct answer; 2) relevance-based document verification for process reward, encouraging the model to retrieve documents that are relevant to the user question, through which we can let the model learn how to iteratively reason and retrieve relevant documents to get the correct answer. Experimental results show that R3-RAG significantly outperforms baselines and can transfer well to different retrievers.
jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
Michael Günther | Saba Sturua | Mohammad Kalim Akram | Isabelle Mohr | Andrei Ungureanu | Bo Wang | Sedigheh Eslami | Scott Martens | Maximilian Werk | Nan Wang | Han Xiao
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
We introduce jina-embeddings-v4, a 3.8 billion parameter embedding model that unifies text and image representations, with a novel architecture supporting both single-vector and multi-vector embeddings. It achieves high performance on both single-modal and cross-modal retrieval tasks, and is particularly strong in processing visually rich content such as tables, charts, diagrams, and mixed-media formats that incorporate both image and textual information. We also introduce JVDR, a novel benchmark for visually rich document retrieval that includes more diverse materials and query types than previous efforts. We use JVDR to show that jina-embeddings-v4 greatly improves on state-of-the-art performance for these kinds of tasks.
gowithnlp at SemEval-2025 Task 10: Leveraging Entity-Centric Chain of Thought and Iterative Prompt Refinement for Multi-Label Classification
Bo Wang | Ruichen Song | Xiangyu Wang | Ge Shi | Linmei Hu | Heyan Huang | Chong Feng
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
This paper presents our system for Subtask 10 of Entity Framing, which focuses on assigning one or more hierarchical roles to named entities in news articles. Our approach iteratively refines prompts and utilizes the Entity-Centric Chain of Thought to complete the task. Specifically, to minimize ambiguity in label definitions, we use the model’s predictions as supervisory signals, iteratively refining the category definitions. Furthermore, to minimize the interference of irrelevant information during inference, we incorporate entity-related information into the CoT framework, allowing the model to focus more effectively on entity-centric reasoning. Our system achieved the highest ranking on the leaderboard in the Russian main role classification and the second in English, with an accuracy of 0.8645 and 0.9362, respectively. We discuss the impact of several components of our multilingual classification approach, highlighting their effectiveness.
Cognitive Mirroring for DocRE: A Self-Supervised Iterative Reflection Framework with Triplet-Centric Explicit and Implicit Feedback
Xu Han | Bo Wang | Yueheng Sun | Dongming Zhao | Zongfeng Qu | Ruifang He | Yuexian Hou | Qinghua Hu
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)
Large language models (LLMs) have advanced document-level relation extraction (DocRE), but DocRE is more complex than sentence-level relation extraction (SentRE), facing challenges like diverse relation types, coreference resolution and long-distance dependencies. Traditional pipeline methods, which detect relations before generating triplets, often propagate errors and harm performance. Meanwhile, fine-tuning methods require extensive human-annotated data, and in-context learning (ICL) underperforms compared to supervised approaches. We propose an iterative reflection framework for DocRE, inspired by human non-linear reading cognition. The framework leverages explicit and implicit relations between triplets to provide feedback for LLMs refinement. Explicit feedback uses logical rules-based reasoning, while implicit feedback reconstructs triplets into documents for comparison. This dual-process iteration mimics human semantic cognition, enabling dynamic optimization through self-generated supervision. For the first time, this achieves zero-shot performance comparable to fully supervised models. Experiments show our method surpasses existing LLM-based approaches and matches state-of-the-art BERT-based methods.
2024
WangLab at MEDIQA-CORR 2024: Optimized LLM-based Programs for Medical Error Detection and Correction
Augustin Toma | Ronald Xie | Steven Palayew | Patrick Lawler | Bo Wang
Proceedings of the 6th Clinical Natural Language Processing Workshop
Medical errors in clinical text pose significant risks to patient safety. The MEDIQA-CORR 2024 shared task focuses on detecting and correcting these errors across three subtasks: identifying the presence of an error, extracting the erroneous sentence, and generating a corrected sentence. In this paper, we present our approach that achieved top performance in all three subtasks. For the MS dataset, which contains subtle errors, we developed a retrieval-based system leveraging external medical question-answering datasets. For the UW dataset, reflecting more realistic clinical notes, we created a pipeline of modules to detect, localize, and correct errors. Both approaches utilized the DSPy framework for optimizing prompts and few-shot examples in large language model (LLM) based programs. Our results demonstrate the effectiveness of LLM based programs for medical error correction. However, our approach has limitations in addressing the full diversity of potential errors in medical documentation. We discuss the implications of our work and highlight future research directions to advance the robustness and applicability of medical error detection and correction systems.
WangLab at MEDIQA-M3G 2024: Multimodal Medical Answer Generation using Large Language Models
Ronald Xie | Steven Palayew | Augustin Toma | Gary Bader | Bo Wang
Proceedings of the 6th Clinical Natural Language Processing Workshop
This paper outlines our submission to the MEDIQA2024 Multilingual and Multimodal Medical Answer Generation (M3G) shared task. We report results for two standalone solutions under the English category of the task, the first involving two consecutive API calls to the Claude 3 Opus API and the second involving training an image-disease label joint embedding in the style of CLIP for image classification. These two solutions scored 1st and 2nd place respectively on the competition leaderboard, substantially outperforming the next best solution. Additionally, we discuss insights gained from post-competition experiments. While the performance of these two described solutions has significant room for improvement due to the difficulty of the shared task and the challenging nature of medical visual question answering in general, we identify the multi-stage LLM approach and the CLIP image classification approach as promising avenues for further investigation.
Memorize Step by Step: Efficient Long-Context Prefilling with Incremental Memory and Decremental Chunk
Zhiyuan Zeng | Qipeng Guo | Xiaoran Liu | Zhangyue Yin | Wentao Shu | Mianqiu Huang | Bo Wang | Yunhua Zhou | Linlin Li | Qun Liu | Xipeng Qiu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The evolution of Large Language Models (LLMs) has led to significant advancements, with models like Claude and Gemini capable of processing contexts up to 1 million tokens. However, efficiently handling long sequences remains challenging, particularly during the prefilling stage when input lengths exceed GPU memory capacity. Traditional methods often segment the sequence into chunks and compress them iteratively with fixed-size memory. However, our empirical analysis shows that the fixed-size memory results in wasted computational and GPU memory resources. Therefore, we introduce Incremental Memory (IM), a method that starts with a small memory size and gradually increases it, optimizing computational efficiency. Additionally, we propose Decremental Chunk based on Incremental Memory (IMDC), which reduces chunk size while increasing memory size, ensuring stable and lower GPU memory usage. Our experiments demonstrate that IMDC is consistently faster (1.45x) and reduces GPU memory consumption by 23.3% compared to fixed-size memory, achieving comparable performance on the LongBench Benchmark.
SparkRA: A Retrieval-Augmented Knowledge Service System Based on Spark Large Language Model
Dayong Wu | Jiaqi Li | Baoxin Wang | Honghong Zhao | Siyuan Xue | Yanjie Yang | Zhijun Chang | Rui Zhang | Li Qian | Bo Wang | Shijin Wang | Zhixiong Zhang | Guoping Hu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Large language models (LLMs) have shown remarkable achievements across various language tasks. To enhance the performance of LLMs in scientific literature services, we developed the scientific literature LLM (SciLit-LLM) through pre-training and supervised fine-tuning on scientific literature, building upon the iFLYTEK Spark LLM. Furthermore, we present a knowledge service system Spark Research Assistant (SparkRA) based on our SciLit-LLM. SparkRA is accessible online and provides three primary functions: literature investigation, paper reading, and academic writing. As of July 30, 2024, SparkRA has garnered over 50,000 registered users, with a total usage count exceeding 1.3 million.
A Comparative Study of Explicit and Implicit Gender Biases in Large Language Models via Self-evaluation
Yachao Zhao | Bo Wang | Yan Wang | Dongming Zhao | Xiaojia Jin | Jijun Zhang | Ruifang He | Yuexian Hou
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
While extensive work has examined the explicit and implicit biases in large language models (LLMs), little research explores the relation between these two types of biases. This paper presents a comparative study of the explicit and implicit biases in LLMs grounded in social psychology. Social psychology distinguishes between explicit and implicit biases by whether the bias can be self-recognized by individuals. Aligning with this conceptualization, we propose a self-evaluation-based two-stage measurement of explicit and implicit biases within LLMs. First, the LLM is prompted to automatically fill templates with social targets to measure implicit bias toward these targets, where the bias is less likely to be self-recognized by the LLM. Then, the LLM is prompted to self-evaluate the templates filled by itself to measure explicit bias toward the same targets, where the bias is more likely to be self-recognized by the LLM. Experiments conducted on state-of-the-art LLMs reveal human-like inconsistency between explicit and implicit occupational gender biases. This work bridges a critical gap where prior studies concentrate solely on either explicit or implicit bias. We advocate that future work highlight the relation between explicit and implicit biases in LLMs.
Continuous Relational Diffusion Driven Topic Model with Multi-grained Text for Microblog
Chenhao Wu | Ruifang He | Chang Liu | Bo Wang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
A topic model is a statistical model that leverages unsupervised learning to mine hidden topics in document collections. The data sparsity and colloquialism of social texts make it difficult to accurately mine the topics. Traditional methods assume that there are only 0/1-state relationships between the two parties in the social networks, but the relationship status in real life is more complicated, such as continuously changing relationships with different degrees of intimacy. This paper proposes a continuous relational diffusion driven topic model (CRTM) with multi-grained text for microblogs to realize the continuous representation of the relationship state and make up for the context and structural information lost by previous representation methods. Multi-grained text representation learning further distinguishes the impact of formal and informal expression on the topics and alleviates colloquialism problems. Specifically, based on the original social network, the reconstructed social network with continuous relationship status is obtained by using information diffusion technology. A graph convolution model is utilized to learn node embeddings through the new social network. Finally, neural variational inference is applied to generate topics according to continuous relationships. We validate CRTM on three real datasets, and the experimental results show the effectiveness of the scheme.
Emotion Recognition in Conversation via Dynamic Personality
Yan Wang | Bo Wang | Yachao Zhao | Dongming Zhao | Xiaojia Jin | Jijun Zhang | Ruifang He | Yuexian Hou
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Emotion recognition in conversation (ERC) is a field that aims to classify the emotion of each utterance within conversational contexts. This presents significant challenges, particularly in handling emotional ambiguity across various speakers and contextual factors. Existing ERC approaches have primarily focused on modeling conversational contexts while incorporating only superficial speaker attributes such as names, memories, and interactions. Recent works introduce personality as an essential deep speaker factor for emotion recognition, but rely on static personality, overlooking dynamic variability during conversations. Advances in personality psychology conceptualize personality as dynamic, proposing that personality states can change across situations. In this paper, we introduce ERC-DP, a novel model considering the dynamic personality of speakers during conversations. ERC-DP accounts for past utterances from the same speaker as the situation impacting dynamic personality. It combines personality modeling with prompt design and fine-grained classification modules. Through a series of comprehensive experiments, ERC-DP demonstrates superior performance on three benchmark conversational datasets.
Global and Local Hierarchical Prompt Tuning Framework for Multi-level Implicit Discourse Relation Recognition
Lei Zeng | Ruifang He | Haowen Sun | Jing Xu | Chang Liu | Bo Wang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Multi-level implicit discourse relation recognition (MIDRR) is a challenging task to recognize the hierarchical discourse relations between the arguments with the absence of connectives. Recent methods tend to incorporate the static hierarchical structure containing all senses (defined as global hierarchy) into prompt tuning through a path prompt template or hierarchical label refining. However, hierarchical modeling is independent of the verbalizer, resulting in a failure to effectively utilize the output probability distribution information of the verbalizer. Besides, they ignore the utilization of the dynamic hierarchical label sequence for each instance (defined as local hierarchy) in prompt tuning. In this paper, we propose a global and local hierarchical prompt tuning (GLHPT) framework, which utilizes prior knowledge of PLMs while better incorporating hierarchical information from two aspects. We leverage bottom-up propagated probability as the global hierarchy to inject it into a multi-level verbalizer (MLV). Furthermore, we design a local hierarchy-driven contrastive learning (LHCL) to improve the probability distribution of the MLV. Finally, our model achieves competitive results on two benchmarks.
Representation Degeneration Problem in Prompt-based Models for Natural Language Understanding
Qingyan Zhao | Ruifang He | Jinpeng Zhang | Chang Liu | Bo Wang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Prompt-based fine-tuning (PF), by aligning with the training objective of pre-trained language models (PLMs), has shown improved performance on many few-shot natural language understanding (NLU) benchmarks. However, the word embedding space of PLMs exhibits anisotropy, which is called the representation degeneration problem. In this paper, we explore the self-similarity within the same context and identify the anisotropy of the feature embedding space in the PF model. Given that the performance of PF models is dependent on feature embeddings, we inevitably pose the hypothesis: this anisotropy limits the performance of the PF models. Based on our experimental findings, we propose CLMA, a Contrastive Learning framework based on the [MASK] token and Answers, to alleviate the anisotropy in the embedding space. By combining our proposed counter-intuitive SSD, a Supervised Signal based on embedding Distance, our approach outperforms mainstream methods on many NLU benchmarks in few-shot experimental settings. In subsequent experiments, we analyze the capability of our method to capture deep semantic cues and the impact of the anisotropy in the feature embedding space on the performance of the PF model.
Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
Rohan Jha | Bo Wang | Michael Günther | Georgios Mastrapas | Saba Sturua | Isabelle Mohr | Andreas Koukounas | Mohammad Kalim Wang | Nan Wang | Han Xiao
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT’s late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context windows and multilingual retrieval. Leveraging Matryoshka Representation Loss, we further demonstrate that reducing the embedding dimensionality from 128 to 64 has an insignificant impact on the model’s retrieval performance and cuts storage requirements by up to 50%. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
2023
WangLab at MEDIQA-Chat 2023: Clinical Note Generation from Doctor-Patient Conversations using Large Language Models
John Giorgi | Augustin Toma | Ronald Xie | Sondra Chen | Kevin An | Grace Zheng | Bo Wang
Proceedings of the 5th Clinical Natural Language Processing Workshop
This paper describes our submission to the MEDIQA-Chat 2023 shared task for automatic clinical note generation from doctor-patient conversations. We report results for two approaches: the first fine-tunes a pre-trained language model (PLM) on the shared task data, and the second uses few-shot in-context learning (ICL) with a large language model (LLM). Both achieve high performance as measured by automatic metrics (e.g. ROUGE, BERTScore) and ranked second and first, respectively, of all submissions to the shared task. Expert human scrutiny indicates that notes generated via the ICL-based approach with GPT-4 are preferred about as often as human-written notes, making it a promising path toward automated note generation from doctor-patient conversations.
Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval
John Giorgi | Luca Soldaini | Bo Wang | Gary Bader | Kyle Lo | Lucy Wang | Arman Cohan
Findings of the Association for Computational Linguistics: EMNLP 2023
Multi-document summarization (MDS) assumes a set of topic-related documents are provided as input. In practice, this document set is not always available; it would need to be retrieved given an information need, i.e. a question or topic statement, a setting we dub “open-domain” MDS. We study this more challenging setting by formalizing the task and bootstrapping it using existing datasets, retrievers and summarizers. Via extensive automatic and human evaluation, we determine: (1) state-of-the-art summarizers suffer large reductions in performance when applied to open-domain MDS, (2) additional training in the open-domain setting can reduce this sensitivity to imperfect retrieval, and (3) summarizers are insensitive to the retrieval of duplicate documents and the order of retrieved documents, but highly sensitive to other errors, like the retrieval of irrelevant documents. Based on our results, we provide practical guidelines to enable future work on open-domain MDS, e.g. how to choose the number of retrieved documents to summarize. Our results suggest that new retrieval and summarization methods and annotated resources for training and evaluation are necessary for further progress in the open-domain setting.
Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
Michael Günther | Louis Milliken | Jonathan Geuter | Georgios Mastrapas | Bo Wang | Han Xiao
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)
Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating textual inputs into numerical representations, capturing the semantics of the text. These models excel in applications like dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, offers in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). Furthermore, to increase the model’s awareness of grammatical negation, we construct a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.
2022
Dataset Debt in Biomedical Language Modeling
Jason Fries | Natasha Seelam | Gabriel Altay | Leon Weber | Myungsun Kang | Debajyoti Datta | Ruisi Su | Samuele Garda | Bo Wang | Simon Ott | Matthias Samwald | Wojciech Kusa
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models
Large-scale language modeling and natural language prompting have demonstrated exciting capabilities for few and zero shot learning in NLP. However, translating these successes to specialized domains such as biomedicine remains challenging, due in part to biomedical NLP’s significant dataset debt – the technical costs associated with data that are not consistently documented or easily incorporated into popular machine learning frameworks at scale. To assess this debt, we crowdsourced curation of datasheets for 167 biomedical datasets. We find that only 13% of datasets are available via programmatic access and 30% lack any documentation on licensing and permitted reuse. Our dataset catalog is available at: https://tinyurl.com/bigbio22.
A sequence-to-sequence approach for document-level relation extraction
John Giorgi | Gary Bader | Bo Wang
Proceedings of the 21st Workshop on Biomedical Language Processing
Motivated by the fact that many relations cross the sentence boundary, there has been increasing interest in document-level relation extraction (DocRE). DocRE requires integrating information within and across sentences, capturing complex interactions between mentions of entities. Most existing methods are pipeline-based, requiring entities as input. However, jointly learning to extract entities and relations can improve performance and be more efficient due to shared parameters and training steps. In this paper, we develop a sequence-to-sequence approach, seq2rel, that can learn the subtasks of DocRE (entity extraction, coreference resolution and relation extraction) end-to-end, replacing a pipeline of task-specific components. Using a simple strategy we call entity hinting, we compare our approach to existing pipeline-based methods on several popular biomedical datasets, in some cases exceeding their performance. We also report the first end-to-end results on these datasets for future comparison. Finally, we demonstrate that, under our model, an end-to-end approach outperforms a pipeline-based approach. Our code, data and trained models are available at https://github.com/johngiorgi/seq2rel. An online demo is available at https://share.streamlit.io/johngiorgi/seq2rel/main/demo.py.
CR-GIS: Improving Conversational Recommendation via Goal-aware Interest Sequence Modeling
Jinfeng Zhou | Bo Wang | Zhitong Yang | Dongming Zhao | Kun Huang | Ruifang He | Yuexian Hou
Proceedings of the 29th International Conference on Computational Linguistics
Conversational recommendation systems (CRS) aim to determine a goal item by sequentially tracking users’ interests through multi-turn conversation. In CRS, implicit patterns of the user interest sequence guide the smooth transition of dialog utterances to the goal item. However, with the convenient explicit knowledge of a knowledge graph (KG), existing KG-based CRS methods over-rely on the explicit separate KG links to model the user interests but ignore the rich goal-aware implicit interest sequence patterns in a dialog. In addition, the interest sequence is also not fully used to generate smoothly transited utterances. We propose CR-GIS with a parallel star framework. First, an interest-level star graph is designed to model the goal-aware implicit user interest sequence. Second, a hierarchical Star Transformer is designed to guide the multi-turn utterance generation with the interest-level star graph. Extensive experiments verify the effectiveness of CR-GIS in achieving more accurate recommended items with more fluent and coherent dialog utterances.
TopKG: Target-oriented Dialog via Global Planning on Knowledge Graph
Zhitong Yang | Bo Wang | Jinfeng Zhou | Yue Tan | Dongming Zhao | Kun Huang | Ruifang He | Yuexian Hou
Proceedings of the 29th International Conference on Computational Linguistics
Target-oriented dialog aims to reach a global target through multi-turn conversation. The key to the task is global planning towards the target, which flexibly guides the dialog concerning the context. However, existing target-oriented dialog works take a local and greedy strategy for response generation, where global planning is absent. In this work, we propose global planning for target-oriented dialog on a commonsense knowledge graph (KG). We design a global reinforcement learning method with the planned paths to flexibly adjust the local response generation model towards the global target. We also propose a KG-based method to collect target-oriented samples automatically from the chit-chat corpus for model training. Experiments show that our method can reach the target with a higher success rate, fewer turns, and more coherent responses.
Multi-Attribute Controlled Text Generation with Contrastive-Generator and External-Discriminator
Guisheng Liu | Yi Li | Yanqing Guo | Xiangyang Luo | Bo Wang
Proceedings of the 29th International Conference on Computational Linguistics
Though existing research has achieved impressive results in controlled text generation, it focuses mainly on single-attribute control. However, in applications like automatic comments, the topic and sentiment need to be controlled simultaneously. In this work, we propose a new framework for multi-attribute controlled text generation. To achieve this, we design a contrastive-generator that can effectively generate texts with more attributes. In order to increase the convergence of the text on the desired attributes, we adopt an external-discriminator to distinguish whether the generated text holds the desired attributes. Moreover, we propose top-n weighted decoding to further improve the relevance of texts to attributes. Automated evaluations and human evaluations show that our framework achieves remarkable controllability in multi-attribute generation while keeping the text fluent and diverse. It also yields promising performance on zero-shot generation.
Aligning Recommendation and Conversation via Dual Imitation
Jinfeng Zhou | Bo Wang | Minlie Huang | Dongming Zhao | Kun Huang | Ruifang He | Yuexian Hou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Human conversations of recommendation naturally involve the shift of interests which can align the recommendation actions and conversation process to make accurate recommendations with rich explanations. However, existing conversational recommendation systems (CRS) ignore the advantage of user interest shift in connecting recommendation and conversation, which leads to an ineffective loose coupling structure of CRS. To address this issue, by modeling the recommendation actions as recommendation paths in a knowledge graph (KG), we propose DICR (Dual Imitation for Conversational Recommendation), which designs a dual imitation to explicitly align the recommendation paths and user interest shift paths in a recommendation module and a conversation module, respectively. By exchanging alignment signals, DICR achieves bidirectional promotion between recommendation and conversation modules and generates high-quality responses with accurate recommendations and coherent explanations. Experiments demonstrate that DICR outperforms the state-of-the-art models on recommendation and conversation performance with automatic, human, and novel explainability metrics.
CodeExp: Explanatory Code Document Generation
Haotian Cui | Chenglong Wang | Junjie Huang | Jeevana Priya Inala | Todd Mytkowicz | Bo Wang | Jianfeng Gao | Nan Duan
Findings of the Association for Computational Linguistics: EMNLP 2022
Developing models that can automatically generate detailed code explanations can greatly benefit software maintenance and programming education. However, existing code-to-text generation models often produce only high-level summaries of code that do not capture implementation-level choices essential for these scenarios. To fill in this gap, we propose the code explanation generation task. We first conducted a human study to identify the criteria for high-quality explanatory docstrings for code. Based on that, we collected and refined a large-scale code docstring corpus and formulated automatic evaluation metrics that best match human assessments. Finally, we present a multi-stage fine-tuning strategy and baseline models for the task. Our experiments show that (1) our refined training dataset lets models achieve better performance in the explanation generation tasks compared to larger-scale unrefined data (15x larger), and (2) fine-tuned models can generate well-structured long docstrings comparable to human-written ones. We envision our training dataset, human-evaluation protocol, recommended metrics, and fine-tuning strategy can boost future code explanation research. The code and annotated data are available at https://github.com/subercui/CodeExp.
Template-based Abstractive Microblog Opinion Summarization
Iman Munire Bilal | Bo Wang | Adam Tsakalidis | Dong Nguyen | Rob Procter | Maria Liakata
Transactions of the Association for Computational Linguistics, Volume 10
We introduce the task of microblog opinion summarization (MOS) and share a dataset of 3100 gold-standard opinion summaries to facilitate research in this domain. The dataset contains summaries of tweets spanning a 2-year period and covers more topics than any other public Twitter summarization dataset. Summaries are abstractive in nature and have been created by journalists skilled in summarizing news articles following a template separating factual information (main story) from author opinions. Our method differs from previous work on generating gold-standard summaries from social media, which usually involves selecting representative posts and thus favors extractive summarization models. To showcase the dataset’s utility and challenges, we benchmark a range of abstractive and extractive state-of-the-art summarization models and achieve good performance, with the former outperforming the latter. We also show that fine-tuning is necessary to improve performance and investigate the benefits of using different sample sizes.
2021
DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations
John Giorgi | Osvald Nitski | Bo Wang | Gary Bader
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Sentence embeddings are an important component of many natural language processing (NLP) systems. Like word embeddings, sentence embeddings are typically learned on large text corpora and then transferred to various downstream tasks, such as clustering and retrieval. Unlike word embeddings, the highest performing solutions for learning sentence embeddings require labelled data, limiting their usefulness to languages and domains where labelled data is abundant. In this paper, we present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. Inspired by recent advances in deep metric learning (DML), we carefully design a self-supervised objective for learning universal sentence embeddings that does not require labelled training data. When used to extend the pretraining of transformer-based language models, our approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders. Importantly, our experiments suggest that the quality of the learned embeddings scales with both the number of trainable parameters and the amount of unlabelled training data. Our code and pretrained models are publicly available and can be easily adapted to new domains or used to embed unseen text.
Evaluation of Thematic Coherence in Microblogs
Iman Munire Bilal | Bo Wang | Maria Liakata | Rob Procter | Adam Tsakalidis
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Collecting together microblogs representing opinions about the same topics within the same timeframe is useful to a number of different tasks and practitioners. A major question is how to evaluate the quality of such thematic clusters. Here we create a corpus of microblog clusters from three different domains and time windows and define the task of evaluating thematic coherence. We provide annotation guidelines and human annotations of thematic coherence by journalist experts. We subsequently investigate the efficacy of different automated evaluation metrics for the task. We consider a range of metrics including surface level metrics, ones for topic model coherence and text generation metrics (TGMs). While surface level metrics perform well, outperforming topic coherence metrics, they are not as consistent as TGMs. TGMs are more reliable than all other metrics considered for capturing thematic coherence in microblog clusters due to being less sensitive to the effect of time windows.
CRFR: Improving Conversational Recommender Systems via Flexible Fragments Reasoning on Knowledge Graphs
Jinfeng Zhou | Bo Wang | Ruifang He | Yuexian Hou
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Although paths of user interest shifts in knowledge graphs (KGs) can benefit conversational recommender systems (CRS), explicit reasoning on KGs has not been well considered in CRS, due to the complexity of high-order and incomplete paths. We propose CRFR, which effectively does explicit multi-hop reasoning on KGs with a conversational context-based reinforcement learning model. Considering the incompleteness of KGs, instead of learning a single complete reasoning path, CRFR flexibly learns multiple reasoning fragments which are likely contained in the complete paths of interest shifts. A fragments-aware unified model is then designed to fuse the fragment information from item-oriented and concept-oriented KGs to enhance the CRS response with entities and words from the fragments. Extensive experiments demonstrate CRFR’s SOTA performance on recommendation, conversation and conversation interpretability.
Eliminating Sentiment Bias for Aspect-Level Sentiment Classification with Unsupervised Opinion Extraction
Bo Wang | Tao Shen | Guodong Long | Tianyi Zhou | Yi Chang
Findings of the Association for Computational Linguistics: EMNLP 2021
Aspect-level sentiment classification (ALSC) aims at identifying the sentiment polarity of a specified aspect in a sentence. ALSC is a practical setting in aspect-based sentiment analysis because no opinion term labeling is needed, but it fails to interpret why a sentiment polarity is derived for the aspect. To address this problem, recent works fine-tune pre-trained Transformer encoders for ALSC to extract an aspect-centric dependency tree that can locate the opinion words. However, the induced opinion words only provide an intuitive cue far below human-level interpretability. Besides, the pre-trained encoder tends to internalize an aspect’s intrinsic sentiment, causing sentiment bias and thus affecting model performance. In this paper, we propose a span-based anti-bias aspect representation learning framework. It first eliminates the sentiment bias in the aspect embedding by adversarial learning against aspects’ prior sentiment. Then, it aligns the distilled opinion candidates with the aspect by span-based dependency modeling to highlight the interpretable opinion terms. Our method achieves new state-of-the-art performance on five benchmarks, with the capability of unsupervised opinion extraction.
2020
Information Extraction from Swedish Medical Prescriptions with Sig-Transformer Encoder
John Pougué Biyong | Bo Wang | Terry Lyons | Alejo Nevado-Holgado
Proceedings of the 3rd Clinical Natural Language Processing Workshop
Relying on large pretrained language models such as Bidirectional Encoder Representations from Transformers (BERT) for encoding, and adding a simple prediction layer, has led to impressive performance in many clinical natural language processing (NLP) tasks. In this work, we present a novel extension to the Transformer architecture, by incorporating signature transform with the self-attention model. This architecture is added between embedding and prediction layers. Experiments on a new Swedish prescription dataset show the proposed architecture to be superior in two of the three information extraction tasks, compared to baseline models. Finally, we evaluate two different embedding approaches: applying Multilingual BERT, and translating the Swedish text to English and then encoding it with a BERT model pretrained on clinical notes.
2019
DeepGeneMD: A Joint Deep Learning Model for Extracting Gene Mutation-Disease Knowledge from PubMed Literature
Feifan Liu | Xiaoyu Zheng | Bo Wang | Catarina Kiefe
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks
Understanding the pathogenesis of genetic diseases through different gene activities and their relations to relevant diseases is important for new drug discovery and drug repositioning. In this paper, we present a joint deep learning model in a multi-task learning paradigm for gene mutation-disease knowledge extraction, DeepGeneMD, which adapts the state-of-the-art hierarchical multi-task learning framework for joint inference on named entity recognition (NER) and relation extraction (RE) in the context of the AGAC (Active Gene Annotation Corpus) track at 2019 BioNLP Open Shared Tasks (BioNLP-OST). It simultaneously extracts gene mutation related activities, diseases, and their relations from the published scientific literature. In DeepGeneMD, we explore the task decomposition to create auxiliary subtasks so that more interactions between different learning subtasks can be leveraged in model training. Our model achieves the average F1 score of 0.45 on recognizing gene activities and disease entities, ranking 2nd in the AGAC NER task; and the average F1 score of 0.35 on extracting relations, ranking 1st in the AGAC RE task.
2018
OpenNMT System Description for WNMT 2018: 800 words/sec on a single-core CPU
Jean Senellart | Dakun Zhang | Bo Wang | Guillaume Klein | Jean-Pierre Ramatchandirin | Josep Crego | Alexander Rush
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation
We present a system description of the OpenNMT Neural Machine Translation entry for the WNMT 2018 evaluation. In this work, we developed a heavily optimized NMT inference model targeting a high-performance CPU system. The final system uses a combination of four techniques, all of which lead to significant speed-ups in combination: (a) sequence distillation, (b) architecture modifications, (c) precomputation, particularly of vocabulary, and (d) CPU-targeted quantization. This work achieves the fastest performance of the shared task, and led to the development of new features that have been integrated into OpenNMT and made available to the community.
2017
TDParse: Multi-target-specific sentiment recognition on Twitter
Bo Wang | Maria Liakata | Arkaitz Zubiaga | Rob Procter
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers
Existing target-specific sentiment recognition methods consider only a single target per tweet, and have been shown to miss nearly half of the actual targets mentioned. We present a corpus of UK election tweets, with an average of 3.09 entities per tweet and more than one type of sentiment in half of the tweets. This requires a method for multi-target-specific sentiment recognition, which we develop by using the context around a target as well as syntactic dependencies involving the target. We present results of our method on both a benchmark corpus of single targets and the multi-target election corpus, showing state-of-the-art performance in both corpora and outperforming previous approaches to the multi-target sentiment task as well as deep learning models for single-target sentiment.
TOTEMSS: Topic-based, Temporal Sentiment Summarisation for Twitter
Bo Wang | Maria Liakata | Adam Tsakalidis | Spiros Georgakopoulos Kolaitis | Symeon Papadopoulos | Lazaros Apostolidis | Arkaitz Zubiaga | Rob Procter | Yiannis Kompatsiaris
Proceedings of the IJCNLP 2017, System Demonstrations
We present a system for time sensitive, topic based summarisation of the sentiment around target entities and topics in collections of tweets. We describe the main elements of the system and illustrate its functionality with two examples of sentiment analysis of topics related to the 2017 UK general election.
SYSTRAN Purely Neural MT Engines for WMT2017
Yongchao Deng | Jungi Kim | Guillaume Klein | Catherine Kobus | Natalia Segal | Christophe Servan | Bo Wang | Dakun Zhang | Josep Crego | Jean Senellart
Proceedings of the Second Conference on Machine Translation
2015
WarwickDCS: From Phrase-Based to Target-Specific Sentiment Recognition
Richard Townsend | Adam Tsakalidis | Yiwei Zhou | Bo Wang | Maria Liakata | Arkaitz Zubiaga | Alexandra Cristea | Rob Procter
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
2010
All in Strings: a Powerful String-based Automatic MT Evaluation Metric with Multiple Granularities
Junguo Zhu | Muyun Yang | Bo Wang | Sheng Li | Tiejun Zhao
Coling 2010: Posters
2009
A Statistical Machine Translation Model Based on a Synthetic Synchronous Grammar
Hongfei Jiang | Muyun Yang | Tiejun Zhao | Sheng Li | Bo Wang
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers
References Extension for the Automatic Evaluation of MT by Syntactic Hybridization
Bo Wang | Tiejun Zhao | Muyun Yang | Sheng Li
Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009
2008
Diagnostic Evaluation of Machine Translation Systems Using Automatically Constructed Linguistic Check-Points
Ming Zhou | Bo Wang | Shujie Liu | Mu Li | Dongdong Zhang | Tiejun Zhao
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)
Bootstrapping Both Product Features and Opinion Words from Chinese Customer Reviews with Cross-Inducing
Bo Wang | Houfeng Wang
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I