Yizhi Li
Also published as: Yizhi LI
2026
Context as a Tool: Context Management for Long-Horizon SWE-Agents
Shukai Liu | Bo Jiang | Jian Yang | Yizhi LI | Jinyang Guo | Xianglong Liu | Bryan Dai
Findings of the Association for Computational Linguistics: ACL 2026
Agents based on large language models have recently shown strong potential on real-world software engineering (SWE) tasks that require long-horizon interaction with repository-scale codebases. However, most existing agents rely on append-only context maintenance or passively triggered compression heuristics, which often lead to context explosion, semantic drift, and degraded reasoning in long-running interactions. We propose Cat, a new context management paradigm that elevates context maintenance to a callable tool integrated into the decision-making process of agents. Cat formalizes a structured context workspace consisting of stable task semantics, condensed long-term memory, and high-fidelity short-term interactions, and enables agents to proactively compress historical trajectories into actionable summaries at appropriate milestones. To support context management for SWE-agents, we propose a trajectory-level supervision framework, CaT-Generator, based on an offline data construction pipeline that injects context-management actions into complete interaction trajectories. Using this framework, we train a context-aware model, SWE-Compressor. Experiments on SWE-Bench-Verified demonstrate that SWE-Compressor reaches a 57.6% solved rate and significantly outperforms ReAct-based agents and static compression baselines, while maintaining stable and scalable long-horizon reasoning under a bounded context budget.
HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution
Hanhua Hong | Yizhi LI | Jiaoyan Chen | Sophia Ananiadou | Xiaoli Li | Jung-jae Kim | Chenghua Lin
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in large language models have highlighted their potential to automate computational research, particularly reproducing experimental results. However, existing approaches still use fixed sequential agent pipelines with weak global coordination, which limits their robustness and overall performance. In this work, we propose Hierarchical Research Agent System (HiRAS), a hierarchical multi-agent framework for end-to-end paper reproduction that employs supervisory manager agents to coordinate specialised agents across fine-grained stages. We also identify limitations in the reference-free evaluation of the Paper2Code benchmark and introduce Paper2Code-Extra (P2C-Ex), a refined protocol that incorporates repository-level information and better aligns with the original reference-based metric. We conduct extensive evaluation, validating the effectiveness and robustness of our proposed methods, and observing improvements, including >10% relative performance gain above the previous state-of-the-art using open-source backbone models and significantly reduced hallucination in the evaluation. All code and data will be made publicly available.
CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs
Siyi Li | Jiajun Shi | Shiwen Ni | Ge Zhang | Shuaimin Li | Shijian Wang | Zhoufutu Wen | Yizhi LI | Hamid Alinejad-Rokny | Jiaheng Liu | Min Yang | Wenhao Huang
Findings of the Association for Computational Linguistics: ACL 2026
Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal – how much of a CoT is necessary versus structurally redundant – that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.
LoopCoder: Scaling Code Intelligence via Looped Language Models
Jian Yang | Wei Zhang | Shuyue Guo | Yizhi LI | Linzheng Chai | Zhengmao Ye | Shukai Liu | Yuyang Song | Jiajun Wu | Che Liu | Tianyu Zheng | Siwei Wu | Leo L | Xudong Ma | Chuan Hao | Ran Tao | Yan Xing | Jianzhou Wang | Mingjie Tang | Aishan Liu | Zhoujun Li | Xianglong Liu | Weifeng Lv | Bryan Dai
Findings of the Association for Computational Linguistics: ACL 2026
While large language models (LLMs) have mastered syntax-level code generation, complex algorithmic reasoning remains a challenge, typically addressed by scaling model depth and parameter count. Universal Transformers (UT) offer a compelling alternative by introducing a recurrent inductive bias that aligns with the recursive nature of programming logic. However, training looped architectures at scale has historically been hindered by severe instability and optimization difficulties associated with backpropagation through time (BPTT). We present LoopCoder (40B-A80B) pre-trained on 12T+ code and general tokens, along with LoopCoder-Thinking and LoopCoder-Instruct variants—the first large-scale looped transformer for code, achieving comparable performance to standard dense architectures with more parameters. Unlike prior approaches that restrict recurrence to small-scale tasks, we implement a comprehensive looped training protocol spanning both pre-training and post-training phases. We initiate the model via dense-to-loop transformation, folding a pre-trained dense checkpoint to initialize a recurrent block, followed by rigorous looped pre-training and specialized post-training for instruction following and reasoning. Our results establish a robust recipe for scaling coding intelligence via recurrent computation, proving that dense checkpoints serve as an optimal foundation for evolving into dynamic, looped reasoners.
MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models
Siwei Wu | King Zhu | Yu Bai | Yiming Liang | Yizhi Li | Haoning Wu | Jiaheng Liu | Ruibo Liu | Xingwei Qu | Xuxin Cheng | Ge Zhang | Wenhao Huang | Chenghua Lin
Findings of the Association for Computational Linguistics: EACL 2026
Current multi-modal benchmarks primarily focus on facts within individual images. However, they overlook the associative relations among multiple images, which require commonsense reasoning grounded in associated knowledge at different granularities (i.e., image-level and entity-level) as well as the ability to perceive the order of images. Therefore, we propose a multi-image relational association task and a meticulously curated Multi-granularity Multi-image Relational Association (MMRA) benchmark, comprising 1,024 samples. To systematically evaluate current LVLMs, we establish a system of associative relations among images that contains 11 subtasks (e.g., UsageSimilarity and SubEvent) at two granularity levels (i.e., image-level and entity-level), based on relations in ConceptNet. Our experiments reveal that entity-level multi-image perception tasks pose greater challenges for LVLMs than image-level tasks. Moreover, LVLMs perform poorly on spatial-related tasks, indicating limited spatial awareness. Furthermore, we find that LVLMs exhibit weak image order perception capabilities, and we design a method to significantly improve this ability, demonstrating that most current LVLMs do not adequately consider image order perception during pre-training.
2025
Adversarial Defense without Adversarial Defense: Enhancing Language Model Robustness via Instance-level Principal Component Removal
Yang Wang | Chenghao Xiao | Yizhi Li | Stuart E. Middleton | Noura Al Moubayed | Chenghua Lin
Transactions of the Association for Computational Linguistics, Volume 13
Pre-trained language models (PLMs) have driven substantial progress in natural language processing but remain vulnerable to adversarial attacks, raising concerns about their robustness in real-world applications. Previous studies have sought to mitigate the impact of adversarial attacks by introducing adversarial perturbations into the training process, either implicitly or explicitly. While both strategies enhance robustness, they often incur high computational costs. In this work, we propose a simple yet effective add-on module that enhances the adversarial robustness of PLMs by removing instance-level principal components, without relying on conventional adversarial defenses or perturbing the original training data. Our approach transforms the embedding space to approximate Gaussian properties, thereby reducing its susceptibility to adversarial perturbations while preserving semantic relationships. This transformation aligns embedding distributions in a way that minimizes the impact of adversarial noise on decision boundaries, enhancing robustness without requiring adversarial examples or costly training-time augmentation. Evaluations on eight benchmark datasets show that our approach improves adversarial robustness while maintaining comparable before-attack accuracy to baselines, achieving a balanced trade-off between robustness and generalization.
MIO: A Foundation Model on Multimodal Tokens
Zekun Moore Wang | King Zhu | Chunpu Xu | Wangchunshu Zhou | Jiaheng Liu | Yibo Zhang | Jessie Wang | Ning Shi | Siyu Li | Yizhi Li | Haoran Que | Zhaoxiang Zhang | Yuanxing Zhang | Ge Zhang | Ke Xu | Jie Fu | Wenhao Huang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they still lack true any-to-any understanding and generation. Recently, the release of GPT-4o has showcased the remarkable potential of any-to-any LLMs for complex real-world tasks, enabling omnidirectional input and output across images, speech, and text. However, it is closed-source and does not support the generation of multimodal interleaved sequences. To address this gap, we present MIO, which is trained on a mixture of discrete tokens across four modalities using causal multimodal modeling. MIO undergoes a four-stage training process: (1) alignment pre-training, (2) interleaved pre-training, (3) speech-enhanced pre-training, and (4) comprehensive supervised fine-tuning on diverse textual, visual, and speech tasks. Our experimental results indicate that MIO exhibits competitive, and in some cases superior, performance compared to previous dual-modal baselines, any-to-any model baselines, and even modality-specific baselines. Moreover, MIO demonstrates advanced capabilities inherent to its any-to-any feature, such as interleaved video-text generation, chain-of-visual-thought reasoning, visual guideline generation, instructional image editing, etc.
DocMMIR: A Framework for Document Multi-modal Information Retrieval
Zirui Li | Siwei Wu | Yizhi Li | Xingyu Wang | Yi Zhou | Chenghua Lin
Findings of the Association for Computational Linguistics: EMNLP 2025
The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains—including Wikipedia articles, scientific papers (arXiv), and presentation slides—within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal dataset, comprising 450K training, 19.2K validation, and 19.2K test documents, serving as both a benchmark to reveal the shortcomings of existing MMIR models and a training set for further improvement. The dataset systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current state-of-the-art MLLMs (CLIP, BLIP2, SigLIP-2, ALIGN) when applied to our tasks, with only CLIP (ViT-L/14) demonstrating reasonable zero-shot performance. Through systematic investigation of cross-modal fusion strategies and loss function selection on the CLIP (ViT-L/14) model, we develop an optimised approach that achieves a +31% improvement in MRR@10 metrics from zero-shot baseline to fine-tuned model. Our findings offer crucial insights and practical guidance for future development in unified multimodal document retrieval tasks.
LIME: Less Is More for MLLM Evaluation
King Zhu | Qianbo Zang | Shian Jia | Siwei Wu | Feiteng Fang | Yizhi Li | Shuyue Guo | Tianyu Zheng | Jiawei Guo | Bo Li | Haoning Wu | Xingwei Qu | Jian Yang | Ruibo Liu | Xiang Yue | Jiaheng Liu | Chenghua Lin | Hamid Alinejad-Rokny | Min Yang | Shiwen Ni | Wenhao Huang | Ge Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Multimodal Large Language Models (MLLMs) are measured on numerous benchmarks like image captioning, visual question answering, and reasoning. However, these benchmarks often include overly simple or uninformative samples, making it difficult to effectively distinguish the performance of different MLLMs. Additionally, evaluating models across many benchmarks creates a significant computational burden. To address these issues, we propose LIME (Less Is More for MLLM Evaluation), a refined and efficient benchmark curated using a semi-automated pipeline. This pipeline filters out uninformative samples and eliminates answer leakage by focusing on tasks that require image-based understanding. Our experiments show that LIME reduces the number of samples by 76% and evaluation time by 77%, while more effectively distinguishing different models’ abilities. Notably, we find that traditional automatic metrics like CIDEr are insufficient for evaluating MLLMs’ captioning performance, and excluding the caption task score yields a more accurate reflection of overall model performance. All code and data are available at https://anonymous.4open.science/r/LIME-49CD
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Jiawei Guo | Tianyu Zheng | Yizhi Li | Yuelin Bai | Bo Li | Yubo Wang | King Zhu | Graham Neubig | Wenhu Chen | Xiang Yue
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse reasoning-intensive tasks. Experiments demonstrate that training MLLMs on our dataset not only significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%), but also yields improvements of up to 4% on non-reasoning-based benchmarks.
2024
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models
Yizhi Li | Ge Zhang | Xingwei Qu | Jiali Li | Zhaoqun Li | Noah Wang | Hao Li | Ruibin Yuan | Yinghao Ma | Kai Zhang | Wangchunshu Zhou | Yiming Liang | Lei Zhang | Lei Ma | Jiajun Zhang | Zuowen Li | Wenhao Huang | Chenghua Lin | Jie Fu
Findings of the Association for Computational Linguistics: ACL 2024
The advancement of large language models (LLMs) has enhanced the ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction-following. Yet, their effectiveness often diminishes in low-resource languages like Chinese, exacerbated by biased evaluations from data leakage, casting doubt on their true generalizability to new linguistic territories. In response, we introduce the Chinese Instruction-Following Benchmark (**CIF-Bench**), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance, totaling 45,000 data instances. Our evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts. This work not only uncovers the current limitations of LLMs in handling Chinese language tasks but also sets a new standard for future LLM generalizability research, pushing towards the development of more adaptable, culturally informed, and linguistically diverse models.
Which Side Are You On? A Multi-task Dataset for End-to-End Argument Summarisation and Evaluation
Hao Li | Yuping Wu | Viktor Schlegel | Riza Batista-Navarro | Tharindu Madusanka | Iqra Zahid | Jiayan Zeng | Xiaochi Wang | Xinran He | Yizhi Li | Goran Nenadic
Findings of the Association for Computational Linguistics: ACL 2024
With the recent advances of large language models (LLMs), it is no longer infeasible to build an automated debate system that helps people to synthesise persuasive arguments. Previous work attempted this task by integrating multiple components. In our work, we introduce an argument mining dataset that captures the end-to-end process of preparing an argumentative essay for a debate, which covers the tasks of claim and evidence identification (Task 1 ED), evidence convincingness ranking (Task 2 ECR), argumentative essay summarisation and human preference ranking (Task 3 ASR) and metric learning for automated evaluation of resulting essays, based on human feedback along argument quality dimensions (Task 4 SQE). Our dataset contains 14k examples of claims that are fully annotated with various properties supporting the aforementioned tasks. We evaluate multiple generative baselines for each of these tasks, including representative LLMs. We find that while they show promising results on individual tasks in our benchmark, their end-to-end performance on all four tasks in succession deteriorates significantly, both in automated measures and in human-centred evaluation. This challenge presented by our proposed dataset motivates future research on end-to-end argument mining and summarisation. The repository of this project is available at https://github.com/HarrywillDr/ArgSum-Datatset.
ChatMusician: Understanding and Generating Music Intrinsically with LLM
Ruibin Yuan | Hanfeng Lin | Yi Wang | Zeyue Tian | Shangda Wu | Tianhao Shen | Ge Zhang | Yuhang Wu | Cong Liu | Ziya Zhou | Liumeng Xue | Ziyang Ma | Qin Liu | Tianyu Zheng | Yizhi Li | Yinghao Ma | Yiming Liang | Xiaowei Chi | Ruibo Liu | Zili Wang | Chenghua Lin | Qifeng Liu | Tao Jiang | Wenhao Huang | Wenhu Chen | Jie Fu | Emmanouil Benetos | Gus Xia | Roger Dannenberg | Wei Xue | Shiyin Kang | Yike Guo
Findings of the Association for Computational Linguistics: ACL 2024
While LLMs demonstrate impressive capabilities in musical knowledge, we find that music reasoning is still an unsolved task. We introduce ChatMusician, an open-source large language model (LLM) that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. ChatMusician is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 by a noticeable margin. We show that ChatMusician preserves or even surpasses the original LLaMA2 7B’s language abilities by evaluating on the MMLU benchmark. Our work reveals that LLMs can be an excellent compressor for music, which can be seen as humanity’s creative language, but there remains significant territory to be conquered. We release our 5B token music-language corpora MusicPiles, the collected MusicTheoryBench, code, model and demo.
SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval
Siwei Wu | Yizhi Li | Kang Zhu | Ge Zhang | Yiming Liang | Kaijing Ma | Chenghao Xiao | Haoran Zhang | Bohao Yang | Wenhu Chen | Wenhao Huang | Noura Al Moubayed | Jie Fu | Chenghua Lin
Findings of the Association for Computational Linguistics: ACL 2024
Multi-modal information retrieval (MMIR) is a rapidly evolving field where significant progress has been made through advanced representation learning and cross-modality alignment research, particularly in image-text pairing. However, current benchmarks for evaluating MMIR performance on image-text pairings overlook the scientific domain, which has a notable gap with the generic data since the caption of scientific charts and tables usually describes the analysis of experimental results or scientific principles in contrast to human activity or scenery depicted in generic images. To bridge this gap, we develop a scientific domain-specific MMIR benchmark (SciMMIR) by leveraging open-access research paper corpora to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions from scientific documents. We further annotate the image-text pairs with a two-level subset-subcategory hierarchy to facilitate a more comprehensive evaluation of the baselines. We conduct zero-shot and fine-tuned evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP, BLIP, and BLIP-2. Our findings offer critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the effects of different visual and textual encoders.
2023
Length is a Curse and a Blessing for Document-level Semantics
Chenghao Xiao | Yizhi Li | G Hudson | Chenghua Lin | Noura Al Moubayed
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
In recent years, contrastive learning (CL) has been extensively utilized to recover sentence and document-level encoding capability from pre-trained language models. In this work, we question the length generalizability of CL-based models, i.e., their vulnerability towards length-induced semantic shift. We verify not only that length vulnerability is a significant yet overlooked research gap, but also that we can devise unsupervised CL methods solely depending on the semantic signal provided by document length. We first derive the theoretical foundations underlying length attacks, showing that elongating a document would intensify the high intra-document similarity that is already brought by CL. Moreover, we find that the isotropy promised by CL is highly dependent on the length range of text exposed in training. Inspired by these findings, we introduce a simple yet universal document representation learning framework, **LA(SER)3**: length-agnostic self-reference for semantically robust sentence representation learning, achieving state-of-the-art unsupervised performance on the standard information retrieval benchmark. [Our code is publicly available.](https://github.com/gowitheflow-1998/LA-SER-cubed)
2022
TranSHER: Translating Knowledge Graph Embedding with Hyper-Ellipsoidal Restriction
Yizhi Li | Wei Fan | Chao Liu | Chenghua Lin | Jiang Qian
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Knowledge graph embedding methods are important for the knowledge graph completion (or link prediction) task. One state-of-the-art method, PairRE, leverages two separate vectors to model complex relations (i.e., 1-to-N, N-to-1, and N-to-N) in knowledge graphs. However, such a method strictly restricts entities to the hyper-ellipsoid surfaces, which limits the optimization of entity distribution, leading to suboptimal performance of knowledge graph completion. To address this issue, we propose a novel score function TranSHER, which leverages relation-specific translations between head and tail entities to relax the constraint of hyper-ellipsoid restrictions. By introducing an intuitive and simple relation-specific translation, TranSHER can provide more direct guidance on optimization and capture more semantic characteristics of entities with complex relations. Experimental results show that TranSHER achieves state-of-the-art performance on link prediction and generalizes well to datasets in different domains and scales. Our code is publicly available at https://github.com/yizhilll/TranSHER.
HERB: Measuring Hierarchical Regional Bias in Pre-trained Language Models
Yizhi Li | Ge Zhang | Bohao Yang | Chenghua Lin | Anton Ragni | Shi Wang | Jie Fu
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022
Fairness has become a trending topic in natural language processing (NLP) and covers biases targeting certain social groups such as genders and religions. Yet regional bias, another long-standing global discrimination problem, remains unexplored. Consequently, we provide a study analysing the regional bias learned by the pre-trained language models (LMs) that are broadly used in NLP tasks. While verifying the existence of regional bias in LMs, we find that the biases on regional groups can be largely affected by the corresponding geographical clustering. We accordingly propose a hierarchical regional bias evaluation method (HERB) utilising the information from the sub-region clusters to quantify the bias in the pre-trained LMs. Experiments show that our hierarchical metric can effectively evaluate the regional bias with regard to comprehensive topics and measure the potential regional bias that can be propagated to downstream tasks. Our code is available at https://github.com/Bernard-Yang/HERB.
Co-authors
- Chenghua Lin 11
- Ge Zhang 8
- Wenhao Huang 7
- Jie Fu 5
- Siwei Wu 5
- Yiming Liang 4
- Jiaheng Liu 4
- Tianyu Zheng 4
- King Zhu 4
- Noura Al Moubayed 3
- Wenhu Chen 3
- Ruibo Liu 3
- Xingwei Qu 3
- Chenghao Xiao 3
- Jian Yang 3
- Hamid Alinejad-Rokny 2
- Bryan Dai 2
- Shuyue Guo 2
- Jiawei Guo 2
- Hao Li 2
- Bo Li 2
- Shukai Liu 2
- Xianglong Liu 2
- Yinghao Ma 2
- Shiwen Ni 2
- Haoning Wu 2
- Min Yang 2
- Bohao Yang 2
- Ruibin Yuan 2
- Xiang Yue 2
- Wangchunshu Zhou 2
- Sophia Ananiadou 1
- Yuelin Bai 1
- Yu Bai (白宇) 1
- Riza Theresa Batista-Navarro 1
- Emmanouil Benetos 1
- Linzheng Chai 1
- Jiaoyan Chen 1
- Xuxin Cheng 1
- Xiaowei Chi 1
- Roger Dannenberg 1
- Wei Fan 1
- Feiteng Fang 1
- Yike Guo 1
- Jinyang Guo 1
- Chuan Hao 1
- Xinran He 1
- Hanhua Hong 1
- G Hudson 1
- Shian Jia 1
- Tao Jiang 1
- Bo Jiang 1
- Shiyin Kang 1
- Jung-jae Kim 1
- Leo L 1
- Jiali Li 1
- Zhaoqun Li 1
- Zuowen Li 1
- Xiaoli Li 1
- Siyi Li 1
- Shuaimin Li 1
- Siyu Li 1
- Zirui Li 1
- Zhoujun Li 1
- Hanfeng Lin 1
- Cong Liu 1
- Qin Liu 1
- Qifeng Liu 1
- Chao Liu 1
- Che Liu 1
- Aishan Liu 1
- Weifeng Lv 1
- Lei Ma 1
- Ziyang Ma 1
- Xudong Ma 1
- Kaijing Ma 1
- Tharindu Madusanka 1
- Stuart E. Middleton 1
- Goran Nenadic 1
- Graham Neubig 1
- Jiang Qian 1
- Haoran Que 1
- Anton Ragni 1
- Viktor Schlegel 1
- Tianhao Shen 1
- Jiajun Shi 1
- Ning Shi 1
- Yuyang Song 1
- Mingjie Tang 1
- Ran Tao 1
- Zeyue Tian 1
- Noah Wang 1
- Xiaochi Wang 1
- Yi Wang 1
- Zili Wang 1
- Yang Wang 1
- Shijian Wang 1
- Zekun Moore Wang 1
- Jiashuo Wang 1
- Xingyu Wang 1
- Yubo Wang 1
- Shi Wang 1
- Jianzhou Wang 1
- Zhoufutu Wen 1
- Yuping Wu 1
- Shangda Wu 1
- Yuhang Wu 1
- Jiajun Wu 1
- Gus Xia 1
- Yan Xing 1
- Chunpu Xu 1
- Ke Xu 1
- Liumeng Xue 1
- Wei Xue 1
- Zhengmao Ye 1
- Iqra Zahid 1
- Qianbo Zang 1
- Jiayan Zeng 1
- Kai Zhang 1
- Lei Zhang 1
- Jiajun Zhang 1
- Yibo Zhang 1
- Zhaoxiang Zhang 1
- Yuanxing Zhang 1
- Wei Zhang 1
- Haoran Zhang 1
- Ziya Zhou 1
- Yi Zhou 1
- Kang Zhu 1