Jiaxu Li


2025

pdf bib
MHALO: Evaluating MLLMs as Fine-grained Hallucination Detectors
Yishuo Cai | Renjie Gu | Jiaxu Li | Xuancheng Huang | Junzhe Chen | Xiaotao Gu | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2025

Hallucination remains a critical challenge for multimodal large language models (MLLMs), undermining their reliability in real-world applications. While fine-grained hallucination detection (FHD) holds promise for enhancing high-quality vision-language data construction and model alignment through enriched feedback signals, automated solutions for this task have yet to be systematically explored. Inspired by the concept of “MLLM as a Judge”, we introduce MHALO, the first comprehensive benchmark specifically designed for evaluating MLLMs’ capability in performing token-level FHD. Our benchmark encompasses 12 distinct hallucination types spanning both multimodal perception and reasoning domains. Through extensive evaluations of 9 selected MLLMs, we reveal substantial performance limitations, with the leading model achieving an average F1IoU of only 40.59%. To address this limitation, we develop HaloDet-4B, a specialized model trained on our curated training data, which significantly outperforms existing models. We hope the benchmark can provide valuable insights for future research on hallucination mitigation in MLLMs. The code and dataset will be publicly available.

2024

pdf bib
Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity
Eric Khiu | Hasti Toossi | David Anugraha | Jinyu Liu | Jiaxu Li | Juan Flores | Leandro Roman | A. Seza Doğruöz | En-Shiun Lee
Findings of the Association for Computational Linguistics: EACL 2024

Fine-tuning and testing a multilingual large language model is a challenge for low-resource languages (LRLs) since it is an expensive process. While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors (the size of the fine-tuning corpus, domain similarity between fine-tuning and testing corpora, and language similarity between source and target languages), which can potentially impact the model performance by using classical regression models. Our results indicate that domain similarity has the most important impact on predicting the performance of Machine Translation models.