Jing Li

Other people with similar names: Jing Li , Jing Li

2025

Although large language models demonstrate strong performance across various domains, they still struggle with numerous bad cases in mathematical reasoning. Previous approaches to learning from errors synthesize training data by solely extrapolating from isolated bad cases, thereby failing to generalize the extensive patterns inherent within these cases. This paper presents Self-Error-Instruct (SEI), a framework that addresses these model weaknesses and synthesizes more generalized targeted training data. Specifically, we explore a target model on two mathematical datasets, GSM8K and MATH, to pinpoint bad cases. Then, we generate error keyphrases for these cases based on the instructor model’s (GPT-4o) analysis and identify error types by clustering these keyphrases. Next, we sample a few bad cases during each generation for each identified error type and input them into the instructor model, which synthesizes additional training data using a self-instruct approach. This new data is refined through a one-shot learning process to ensure that only the most effective examples are kept. Finally, we use these curated data to fine-tune the target model, iteratively repeating the process to enhance performance. We apply our framework to various models and observe improvements in their reasoning abilities across both in-domain and out-of-domain mathematics datasets. These results demonstrate the effectiveness of self-error instruction in improving LLMs’ mathematical reasoning through error generalization.

Code embeddings capture the semantic representations of code and are crucial for various code-related large language model (LLM) applications, such as code search. Previous training primarily relies on optimizing the InfoNCE loss by comparing positive natural language (NL)-code pairs with in-batch negatives. However, due to the sparse nature of code contexts, training solely by comparing the major differences between positive and negative pairs may fail to capture deeper semantic nuances. To address this issue, we propose a novel order-augmented strategy for improved code search (OASIS). It leverages order-based similarity labels to train models to capture subtle differences in similarity among negative pairs. Extensive benchmark evaluations demonstrate that our OASIS model significantly outperforms previous state-of-the-art models focusing solely on major positive-negative differences. It underscores the value of exploiting subtle differences among negative pairs with order labels for effective code embedding training.

pdf bib abs
CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
Jingqian Zhao | Bingbing Wang | Geng Tu | Yice Zhang | Qianlong Wang | Bin Liang | Jing Li | Ruifeng Xu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training.Current studies mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant and up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism in a Chain-of-Thought manner to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.

pdf bib abs
DrFrattn: Directly Learn Adaptive Policy from Attention for Simultaneous Machine Translation
Libo Zhao | Jing Li | Ziqian Zeng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Simultaneous machine translation (SiMT) necessitates a robust read/write (R/W) policy to determine the optimal moments for translation, thereby balancing translation quality and latency. Effective timing in translation can align source and target tokens accurately. The attention mechanism within translation models inherently provides valuable alignment information. Building on this, previous research has attempted to modify the attention mechanism’s structure to leverage its alignment properties during training, employing multi-task learning to derive the read/write policy. However, this multi-task learning approach may compromise the efficacy of the attention mechanism itself. This raises a natural question: why not directly learn the read/write policy from the well-trained attention mechanism? In this study, we propose DrFrattn, a method that directly learns adaptive policies from the attention mechanism. Experimental results across various benchmarks demonstrate that our approach achieves an improved balance between translation accuracy and latency.

pdf bib abs
VIVA+: Human-Centered Situational Decision-Making
Zhe Hu | Yixiao Ren | Guanzhong Liu | Jing Li | Yu Yin
Findings of the Association for Computational Linguistics: EMNLP 2025

Multimodal Large Language Models (MLLMs) show promising results for embodied agents in operating meaningfully in complex, human-centered environments. Yet, evaluating their capacity for nuanced, human-like reasoning and decision-making remains challenging. In this work, we introduce VIVA+, a cognitively grounded benchmark for evaluating the reasoning and decision-making of MLLMs in human-centered situations. VIVA+ consists of 1,317 real-world situations paired with 6,373 multiple-choice questions, targeting three core abilities for decision-making: (1) Foundational Situation Comprehension, (2) Context-Driven Action Justification, and (3) Reflective Reasoning. Together, these dimensions provide a systematic framework for assessing a model’s ability to perceive, reason, and act in socially meaningful ways. We evaluate the latest commercial and open-source models on VIVA+, where we reveal distinct performance patterns and highlight significant challenges. We further explore targeted training and multi-step reasoning strategies, which yield consistent performance improvements. Finally, our in-depth analysis highlights current model limitations and provides actionable insights for advancing MLLMs toward more robust, context-aware, and socially adept decision-making in real-world settings.

pdf bib abs
Large Language Models as Reader for Bias Detection
Xuan Luo | Jing Li | Zhong Wenzhong | Geng Tu | Ruifeng Xu
Findings of the Association for Computational Linguistics: EMNLP 2025

Detecting bias in media content is crucial for maintaining information integrity and promoting inclusivity. Traditional methods analyze text from the writer’s perspective, which analyzes textual features directly from the writer’s intent, leaving the reader’s perspective underexplored. This paper investigates whether Large Language Models (LLMs) can be leveraged as readers for bias detection by generating reader-perspective comments. Experiments are conducted on the BASIL (news bias) and BeyondGender (gender bias) datasets with LLMs Gemma-7B, Phi-3-3.8B, Llama3.1-8B, Llama3.1-70B, and GPT4. The results demonstrate the effectiveness of reader-perspective comments for open-source LLMs, achieving performance comparable to GPT4’s. The findings highlight the significance of emotion-related comments, which are generally more beneficial than value-related ones in bias detection. In addition, experiments on Llamas show that comment selection ensures consistent performance regardless of model sizes and comment combinations. This study is particularly beneficial for small-size open-source LLMs.

2024

pdf bib abs
PsFuture: A Pseudo-Future-based Zero-Shot Adaptive Policy for Simultaneous Machine Translation
Libo Zhao | Jing Li | Ziqian Zeng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Simultaneous Machine Translation (SiMT) requires target tokens to be generated in real-time as streaming source tokens are consumed. Traditional approaches to SiMT typically require sophisticated architectures and extensive parameter configurations for training adaptive read/write policies, which in turn demand considerable computational power and memory. We propose PsFuture, the first zero-shot adaptive read/write policy for SiMT, enabling the translation model to independently determine read/write actions without the necessity for additional training. Furthermore, we introduce a novel training strategy, Prefix-to-Full (P2F), specifically tailored to adjust offline translation models for SiMT applications, exploiting the advantages of the bidirectional attention mechanism inherent in offline models. Experiments across multiple benchmarks demonstrate that our zero-shot policy attains performance on par with strong baselines and the P2F method can further enhance performance, achieving an outstanding trade-off between translation quality and latency.

pdf bib abs
VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values
Zhe Hu | Yixiao Ren | Jing Li | Yu Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large vision-language models (VLMs) focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions under a vision-depicted situation. VIVA contains 1,062 images depicting diverse real-world situations and the manually annotated decisions grounded in them. Given an image there, the model should select the most appropriate action to address the situation and provide the relevant human values and reason underlying the decision. Extensive experiments based on VIVA show the limitation of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.

pdf bib abs
LLM4Decompile: Decompiling Binary Code with Large Language Models
Hanzhuo Tan | Qi Luo | Jing Li | Yuqun Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over the LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results.

As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research issue. Previous red teaming approaches for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.

pdf bib abs
IndiVec: An Exploration of Leveraging Large Language Models for Media Bias Detection with Fine-Grained Bias Indicators
Luyang Lin | Lingzhi Wang | Xiaoyan Zhao | Jing Li | Kam-Fai Wong
Findings of the Association for Computational Linguistics: EACL 2024

This study focuses on media bias detection, crucial in today’s era of influential social media platforms shaping individual attitudes and opinions. In contrast to prior work that primarily relies on training specific models tailored to particular datasets, resulting in limited adaptability and subpar performance on out-of-domain data, we introduce a general bias detection framework, IndiVec, built upon large language models. IndiVec begins by constructing a fine-grained media bias database, leveraging the robust instruction-following capabilities of large language models and vector database techniques. When confronted with new input for bias detection, our framework automatically selects the most relevant indicator from the vector database and employs majority voting to determine the input’s bias label. IndiVec excels compared to previous methods due to its adaptability (demonstrating consistent performance across diverse datasets from various sources) and explainability (providing explicit top-k indicators to interpret bias predictions). Experimental results on four political bias datasets highlight IndiVec’s significant superiority over baselines. Furthermore, additional experiments and analysis provide profound insights into the framework’s effectiveness.

pdf bib
Muffin: Mitigating Unhelpfulness in Emotional Support Conversations with Multifaceted AI Feedback
Jiashuo Wang | Chunpu Xu | Chak Tou Leong | Wenjie Li | Jing Li
Findings of the Association for Computational Linguistics: ACL 2024

pdf bib abs
RePALM: Popular Quote Tweet Generation via Auto-Response Augmentation
Erxin Yu | Jing Li | Chunpu Xu
Findings of the Association for Computational Linguistics: ACL 2024

A quote tweet enables users to share others’ content while adding their own commentary. In order to enhance public engagement through quote tweets, we investigate the task of generating popular quote tweets. This task aims to produce quote tweets that garner higher popularity, as indicated by increased likes, replies, and retweets. Despite the impressive language generation capabilities of large language models (LLMs), there has been limited research on how LLMs can effectively learn the popularity of text to better engage the public. Therefore, we introduce a novel approach called Response-augmented Popularity-Aligned Language Model (RePALM), which aligns language generation with popularity by leveraging insights from augmented auto-responses provided by readers. We utilize the Proximal Policy Optimization framework with a dual-reward mechanism to jointly optimize for the popularity of the quote tweet and its consistency with the auto-responses. In our experiments, we collected two datasets consisting of quote tweets containing external links and those referencing others’ tweets. Extensive results demonstrate the superiority of RePALM over advanced language models that do not incorporate response augmentation.

Social media bot detection is increasingly crucial with the rise of social media platforms. Existing methods predominantly construct social networks as graph and utilize graph neural networks (GNNs) for bot detection. However, most of these methods focus on how to improve the performance of GNNs while neglecting the community structure within social networks. Moreover, GNNs based methods still face problems such as poor model generalization due to the relatively small scale of the dataset and over-smoothness caused by information propagation mechanism. To address these problems, we propose the Community-Aware Heterogeneous Graph Contrastive Learning framework (i.e., CACL), which constructs social network as heterogeneous graph with multiple node types and edge types, and then utilizes community-aware module to mine both hard positive samples and hard negative samples for supervised graph contrastive learning with adaptive graph enhancement algorithms. Extensive experiments demonstrate that our framework addresses the previously mentioned challenges and outperforms competitive baselines on three social media bot benchmarks.

pdf bib abs
Generative Deduplication For Socia Media Data Selection
Xianming Li | Jing Li
Findings of the Association for Computational Linguistics: EMNLP 2024

Social media data exhibits severe redundancy caused by its noisy nature. It leads to increased training time and model bias in its processing. To address this issue, we propose a novel Generative Deduplication framework for social media data selection by removing semantically duplicate data. While related work involves data selection in the task-specific training, our model functions as an efficient pre-processing method to universally enhance social media NLP pipelines. Specifically, we train a generative model via self-supervised learning to predict keyword to capture the semantics of noisy social media text for deduplication. Meanwhile, time-dimensional Gaussian noise is added to improve training complexity and avoid learning trivial features. Extensive experiments suggest that our model can better reduce training samples while improving performance than baselines. The results show our model’s potential to broadly advance social media language understanding in effectiveness and efficiency.

pdf bib abs
Who Responded to Whom: The Joint Effects of Latent Topics and Discourse in Conversation Structure
Lu Ji | Lei Chen | Jing Li | Zhongyu Wei | Qi Zhang | Xuanjing Huang
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)

Vast amount of online conversations are produced on a daily basis, resulting in a pressing need to automatic conversation understanding. As a basis to structure a discussion, we identify the responding relations in the conversation discourse, which link response utterances to their initiations. To figure out who responded to whom, here we explore how the consistency of topic contents and dependency of discourse roles indicate such interactions, whereas most prior work ignore the effects of latent factors underlying word occurrences. We propose a neural model to learn latent topics and discourse in word distributions, and predict pairwise initiation-response links via exploiting topic consistency and discourse dependency. Experimental results on both English and Chinese conversations show that our model significantly outperforms the previous state of the arts.

pdf bib abs
Cantonese Natural Language Processing in the Transformers Era
Rong Xiang | Ming Liao | Jing Li
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)

Despite being spoken by a large population of speakers worldwide, Cantonese is under-resourced in terms of the data scale and diversity compared to other major languages. This limitation has excluded it from the current “pre-training and fine-tuning” paradigm that is dominated by Transformer architectures.In this paper, we provide a comprehensive review on the existing resources and methodologies for Cantonese Natural Language Processing, covering the recent progress in language understanding, text generation and development of language models.We finally discuss two aspects of the Cantonese language that could make it potentially challenging even for state-of-the-art architectures: colloquialism and multilinguality.

2022

pdf bib abs
Doctor Recommendation in Online Health Forums via Expertise Learning
Xiaoxin Lu | Yubo Zhang | Jing Li | Shi Zong
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Huge volumes of patient queries are daily generated on online health forums, rendering manual doctor allocation a labor-intensive task. To better help patients, this paper studies a novel task of doctor recommendation to enable automatic pairing of a patient to a doctor with relevant expertise. While most prior work in recommendation focuses on modeling target users from their past behavior, we can only rely on the limited words in a query to infer a patient’s needs for privacy reasons. For doctor modeling, we study the joint effects of their profiles and previous dialogues with other patients and explore their interactions via self-learning. The learned doctor embeddings are further employed to estimate their capabilities of handling a patient query with a multi-head attention mechanism. For experiments, a large-scale dataset is collected from Chunyu Yisheng, a Chinese online health forum, where our model exhibits the state-of-the-art results, outperforming baselines only consider profiles and past dialogues to characterize a doctor.

Complaining is a speech act that expresses a negative inconsistency between reality and human’s expectations. While prior studies mostly focus on identifying the existence or the type of complaints, in this work, we present the first study in computational linguistics of measuring the intensity of complaints from text. Analyzing complaints from such perspective is particularly useful, as complaints of certain degrees may cause severe consequences for companies or organizations. We first collect 3,103 posts about complaints in education domain from Weibo, a popular Chinese social media platform. These posts are then annotated with complaints intensity scores using Best-Worst Scaling (BWS) method. We show that complaints intensity can be accurately estimated by computational models with best mean square error achieving 0.11. Furthermore, we conduct a comprehensive linguistic analysis around complaints, including the connections between complaints and sentiment, and a cross-lingual comparison for complaints expressions used by Chinese and English speakers. We finally show that our complaints intensity scores can be incorporated for better estimating the popularity of posts on social media.

Co-authors

Venues

Fix author