Tong Wu

Papers on this page may belong to the following people: Tong Wu

2026

NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating
Tong Wu | Thanet Markchom | Huizhi(elly) Liang
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1–5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task.

NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression
Tong Wu | Nicolay Rusnachenko | Huizhi(elly) Liang
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence–arousal (VA) regression. This paper describes a system developed for Track A, Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, using dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language–domain pair (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models under a few-shot prompting setting, demonstrating that task-specific fine-tuning outperforms these LLM-based methods across all evaluation datasets.
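
The sigmoid-scaled outputs mentioned in this abstract can be illustrated with a minimal sketch. It assumes the scaling is a plain affine map of a sigmoid onto the stated [1, 9] range; the function names (`scale_to_va`, `predict_va`) and the use of raw logits as inputs are illustrative assumptions, not details taken from the paper.

```python
import math

# Score range stated in the abstract; everything else here is an assumption.
VA_MIN, VA_MAX = 1.0, 9.0


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def scale_to_va(logit: float) -> float:
    """Map an unbounded regression-head output into [VA_MIN, VA_MAX]."""
    return VA_MIN + (VA_MAX - VA_MIN) * sigmoid(logit)


def predict_va(valence_logit: float, arousal_logit: float) -> tuple:
    """Hypothetical dual-head readout: one scaled score per dimension."""
    return scale_to_va(valence_logit), scale_to_va(arousal_logit)


if __name__ == "__main__":
    print(predict_va(0.0, 2.0))
```

Bounding the outputs this way keeps every prediction inside the valid score range without clipping, which is one plausible motivation for the design.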

Delayed Wh-Question Development in Children with Hearing Loss: Evidence for Morphosyntactic Vulnerability from Corpus-Based NLP and LLM Analyses
Tong Wu
Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026)

This study provides corpus-based evidence that English-speaking children with hearing loss (CHL) show both quantitative and qualitative delays in wh-question development compared to typically developing (TD) peers. Using Natural Language Processing (NLP)/Large Language Model (LLM) based methods and two clinical subcorpora from CHILDES, we analyzed child utterances across several syntactic dimensions: frequency, lexical diversity, structural completeness, clausal embedding, wh-fronting, and utterance length. CHL produced significantly fewer wh-questions, used a narrower range of wh-types, showed lower rates of embedding, and more structural incompleteness. These differences were most evident in syntactically complex forms, such as embedded and canonical fronted wh-questions. The results support input-sensitive and usage-based accounts of syntactic development and highlight the need for enriched linguistic input in supporting CHL’s grammatical growth. Importantly, these group differences persisted when controlling for overall language development as indexed by mean length of utterance (MLU) in words, indicating that CHL’s difficulties with wh-questions are not reducible to general grammatical delay. Methodologically, the study combines dependency-parsing-based analyses with exploratory LLM evaluation to assess the feasibility and limits of automated approaches to spontaneous child language. NLP-based analyses were more stable for formally defined syntactic features, while GPT-based analysis showed mixed performance, performing better on global structural judgments than on fine-grained syntactic diagnostics.

Targeting the Needle, Ignoring the Haystack: Anchoring Crucial Cues for Evolving Scam Call Detection via an LLM-Assisted Classifier
Tong Wu | Qinliang Su | Jianxing Yu | Bo Liang | Minhua Huang
Findings of the Association for Computational Linguistics: ACL 2026

Automatic detection of fraudulent voice calls is essential for online service platforms but faces significant challenges due to the scarcity of labeled data and the continuous evolution of conversational contexts. Standard supervised methods often fail to generalize, as they tend to overfit to variable background narratives rather than capturing the core deceptive intent. In this paper, we propose a lightweight framework that anchors detection on Semantic Primitives, a set of stable, interpretable evidentiary cues derived from expert knowledge. Our approach decomposes the fraud detection task into two distinct stages: identifying the presence of these predefined semantic signals within the transcript, and deriving a final verdict through a logical combination of the detected cues. By explicitly prioritizing stable evidence over diverse conversational noise, this framework ensures that decisions are based on verifiable fraud tactics rather than spurious correlations. Experimental results demonstrate that our method achieves superior robustness and efficiency compared to traditional baselines, particularly in scenarios with shifting service contexts.
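
The two-stage decomposition described in this abstract (detect predefined cues, then combine them logically) can be sketched roughly as below. Everything specific here is hypothetical: the cue names, the keyword lists, and the combination rule are invented for illustration, and a simple keyword match stands in for the LLM-assisted cue detector the abstract refers to.

```python
from typing import Dict, List

# Hypothetical cues and keywords, invented for illustration only;
# the paper's actual Semantic Primitives are not listed in the abstract.
CUE_KEYWORDS: Dict[str, List[str]] = {
    "urgency": ["immediately", "right now"],
    "payment_request": ["transfer", "gift card"],
    "impersonation": ["bank officer", "police"],
}


def detect_cues(transcript: str) -> Dict[str, bool]:
    """Stage 1: flag which predefined cues appear in the transcript.

    A keyword match is used here as a stand-in for an LLM-based detector.
    """
    text = transcript.lower()
    return {
        cue: any(keyword in text for keyword in keywords)
        for cue, keywords in CUE_KEYWORDS.items()
    }


def verdict(cues: Dict[str, bool]) -> bool:
    """Stage 2: combine detected cues with a (hypothetical) logical rule."""
    return cues["payment_request"] and (cues["urgency"] or cues["impersonation"])


if __name__ == "__main__":
    call = "This is the police, transfer the money immediately."
    print(verdict(detect_cues(call)))
```

Because the final decision depends only on the cue flags, changes in the surrounding conversation leave the verdict untouched unless they add or remove a cue, which is the kind of robustness the abstract argues for.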

2025

Ensuring the safety of large language models (LLMs) is paramount, yet identifying potential vulnerabilities is challenging. While manual red teaming is effective, it is time-consuming, costly, and lacks scalability. Automated red teaming (ART) offers a more cost-effective alternative, automatically generating adversarial prompts to expose LLM vulnerabilities. However, current ART efforts lack a robust framework that explicitly frames red teaming as an effectively learnable task. To address this gap, we propose Automated Progressive Red Teaming (APRT) as an effectively learnable framework. APRT leverages three core modules: an Intention Expanding LLM that generates diverse initial attack samples, an Intention Hiding LLM that crafts deceptive prompts, and an Evil Maker to manage prompt diversity and filter ineffective samples. The three modules collectively and progressively explore and exploit LLM vulnerabilities through multi-round interactions. In addition to the framework, we further propose a novel indicator, Attack Effectiveness Rate (AER), to mitigate the limitations of existing evaluation metrics. By measuring the likelihood of eliciting unsafe but seemingly helpful responses, AER aligns closely with human evaluations. Extensive experiments with both automatic and human evaluations demonstrate the effectiveness of APRT across both open- and closed-source LLMs. Specifically, APRT effectively elicits 54% unsafe yet useful responses from Meta’s Llama-3-8B-Instruct, 50% from GPT-4o (API access), and 39% from Claude-3.5 (API access), showcasing its robust attack capability and transferability across LLMs (especially from open-source LLMs to closed-source LLMs). The code and seed data are available at https://github.com/tjunlp-lab/APRT.

Predicting and Evaluating Item Responses Using Machine Learning, Text Embeddings, and LLMs
Evelyn Johnson | Hsin-Ro Wei | Tong Wu | Huan Liu
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Works in Progress

This work-in-progress study compares the accuracy of machine learning and large language models in predicting student responses to field-test items on a social-emotional learning assessment. We evaluate how well each method replicates actual responses and compare the item parameters generated from synthetic data with those derived from actual student data.

The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents
Feiran Jia | Tong Wu | Xin Qin | Anna Squicciarini
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Model (LLM) agents are increasingly being deployed as conversational assistants capable of performing complex real-world tasks through tool integration. This enhanced ability to interact with external systems and process various data sources, while powerful, introduces significant security vulnerabilities. In particular, indirect prompt injection attacks pose a critical threat, where malicious instructions embedded within external data sources can manipulate agents to deviate from user intentions. While existing defenses show promise, they struggle to maintain robust security while preserving task functionality. We propose a novel and orthogonal perspective that reframes agent security from preventing harmful actions to ensuring task alignment, requiring every agent action to serve user objectives. Based on this insight, we develop Task Shield, a test-time defense mechanism that systematically verifies whether each instruction and tool call contributes to user-specified goals. Through experiments on the AgentDojo benchmark, we demonstrate that Task Shield reduces attack success rates (2.07%) while maintaining high task utility (69.79%) on GPT-4o, significantly outperforming existing defenses in various real-world scenarios.

2024

Decomposing Directional Serial Verb Constructions in Mandarin: A Preliminary Study
Tong Wu
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures including the Kullback-Leibler (KL), reverse Kullback-Leibler (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when little distribution overlap exists between the teacher and the student. In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation, which degrade logits-based KD for diverse NLP tasks. We propose the Sinkhorn Knowledge Distillation (SinKD), which exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions. Moreover, thanks to the properties of the Sinkhorn metric, we can dispense with sample-wise KD, which restricts the perception of divergence to each teacher-student sample pair. Instead, we propose a batch-wise reformulation to capture geometric intricacies of distributions across samples in the high-dimensional space. Comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights the superiority of our approach over state-of-the-art methods on all kinds of LLMs with encoder-only, encoder-decoder, and decoder-only architectures.
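
For readers unfamiliar with the Sinkhorn distance named in this abstract, the sketch below shows the generic entropy-regularized optimal transport computation between two discrete distributions. It is not the paper's batch-wise SinKD loss: the 0/1 cost matrix, the regularization strength `eps`, the iteration count, and the function name `sinkhorn_distance` are all assumptions made for illustration.

```python
import math
from typing import List


def sinkhorn_distance(
    a: List[float],
    b: List[float],
    cost: List[List[float]],
    eps: float = 0.1,      # assumed regularization strength
    n_iters: int = 200,    # assumed fixed iteration budget
) -> float:
    """Entropy-regularized transport cost between distributions a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan is diag(u) K diag(v); return its total cost.
    return sum(
        u[i] * K[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m)
    )


if __name__ == "__main__":
    teacher = [0.9, 0.1]
    student = [0.1, 0.9]
    cost = [[0.0, 1.0], [1.0, 0.0]]  # hypothetical 0/1 cost between classes
    print(sinkhorn_distance(teacher, student, cost))
```

Unlike KL-style divergences, the value depends on the cost of moving probability mass between classes, so it stays informative even when the two distributions barely overlap.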

2023

Intersectional Stereotypes in Large Language Models: Dataset and Analysis
Weicheng Ma | Brian Chiang | Tong Wu | Lili Wang | Soroush Vosoughi
Findings of the Association for Computational Linguistics: EMNLP 2023

Despite many stereotypes targeting intersectional demographic groups, prior studies on stereotypes within Large Language Models (LLMs) primarily focus on broader, individual categories. This research bridges this gap by introducing a novel dataset of intersectional stereotypes, curated with the assistance of the ChatGPT model and manually validated. Moreover, this paper offers a comprehensive analysis of intersectional stereotype propagation in three contemporary LLMs by leveraging this dataset. The findings underscore the urgency of focusing on intersectional biases in ongoing efforts to reduce stereotype prevalence in LLMs.