2025
SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection
Yi-Fan Lu | Xian-Ling Mao | Tian Lan | Tong Zhang | Yu-Shi Zhu | Heyan Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automatic evaluation for Open Domain Event Detection (ODED) is a highly challenging task, because ODED is characterized by a vast diversity of unconstrained output labels from various domains. Nearly all existing evaluation methods for ODED first construct evaluation benchmarks with limited labels and domain coverage, and then evaluate ODED methods using metrics based on token-level label matching rules. However, this kind of evaluation framework faces two issues: (1) the limited evaluation benchmarks are not representative of the real world, making it difficult to accurately reflect the performance of various ODED methods in real-world scenarios; (2) evaluation metrics based on token-level matching rules fail to capture the semantic similarity between predictions and gold labels. To address these two problems, we propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection (SEOE), which constructs a more representative evaluation benchmark and introduces a semantic evaluation metric. Specifically, our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains, with a cost-effective supplementary annotation strategy to ensure the benchmark's representativeness; the strategy also allows new event types and domains to be added in the future. Then, SEOE leverages large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels to enhance the reliability of the evaluation. Extensive experiments validate the representativeness of the benchmark and the reliability of the semantic evaluation metric. Existing ODED methods are thoroughly evaluated, and the error patterns of their predictions are analyzed, revealing several insightful findings.
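As a concrete illustration of the semantic metric described above, here is a minimal sketch of an LLM-judged semantic F1-score over event-type labels. The greedy one-to-one matching and the `same_event` judge are illustrative assumptions, not SEOE's actual prompts or matching procedure.

```python
# Minimal sketch of an LLM-judged semantic F1-score over event-type labels.
# The greedy matching and the `same_event` judge are illustrative assumptions,
# not SEOE's actual prompts or matching procedure.
from typing import Callable

def semantic_f1(predicted: list[str], gold: list[str],
                same_event: Callable[[str, str], bool]) -> float:
    """Greedy one-to-one semantic matching between predicted and gold labels."""
    unmatched_gold = list(gold)
    true_positives = 0
    for pred in predicted:
        for g in unmatched_gold:
            if same_event(pred, g):  # the judge decides semantic equivalence
                true_positives += 1
                unmatched_gold.remove(g)
                break
    if not predicted or not gold or true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy judge: normalized exact match. A real judge would query an LLM agent
# with fine-grained definitions of semantically similar labels.
toy_judge = lambda a, b: a.strip().lower() == b.strip().lower()
print(semantic_f1(["Cyber Attack", "Protest"], ["cyber attack", "Election"], toy_judge))  # 0.5
```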
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark
Rong-Cheng Tu | Zi-Ao Ma | Tian Lan | Yuehao Zhao | Heyan Huang | Xian-Ling Mao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Driven by the remarkable progress in diffusion models, text-to-image generation has achieved substantial advancements, underscoring the urgent need for robust automatic quality assessment. This task is inherently complex, requiring evaluations that range from object presence and attribute correctness to relational consistency and visual fidelity. Consequently, current state-of-the-art MLLM-based approaches often rely on powerful commercial models such as GPT-4o, which offer superior reasoning and instruction-following capabilities but are not universally accessible. In contrast, while open-source MLLMs demonstrate promising skills in vision and language understanding, they underperform in comprehensive image quality assessment. To address these challenges, we propose a task-decomposed evaluation framework based on GPT-4o to automatically construct a specialized training dataset, breaking down the multifaceted evaluation process into simpler sub-tasks and thus reducing learning complexity. Building on this dataset, we design novel training strategies to distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6, enabling it to better follow instructions across diverse assessment criteria. Furthermore, to reliably and comprehensively assess prior works and our proposed model, we manually annotate a meta-evaluation benchmark that includes chain-of-thought explanations alongside quality scores for generated images. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-based baseline, VIEScore, with over a 4.6% improvement in Spearman and Kendall correlations with human judgments.
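To illustrate the task-decomposition idea, the sketch below scores each sub-task separately and then aggregates. The sub-task list, the `ask_mllm` helper, and the uniform averaging are hypothetical placeholders, not the paper's framework.

```python
# Illustrative sketch of task-decomposed image-quality evaluation (not the
# paper's code): each sub-task is scored separately, then aggregated. The
# sub-task list, the `ask_mllm` helper, and uniform averaging are assumptions.
SUB_TASKS = {
    "object_presence": "Are all objects mentioned in the prompt present?",
    "attributes": "Do object attributes (color, count, size) match the prompt?",
    "relations": "Are spatial and semantic relations between objects correct?",
    "fidelity": "Is the image visually coherent and free of artifacts?",
}

def ask_mllm(image_path: str, question: str) -> float:
    """Placeholder for an MLLM call; a real system would send the image and
    question to an evaluator model and parse a 0-10 score from its reply."""
    return 5.0  # dummy score so the sketch runs end to end

def evaluate(image_path: str, prompt: str) -> dict[str, float]:
    scores = {name: ask_mllm(image_path, f"Prompt: {prompt}\n{q}")
              for name, q in SUB_TASKS.items()}
    scores["overall"] = sum(scores.values()) / len(SUB_TASKS)
    return scores

print(evaluate("example.png", "a red cat sleeping on a blue sofa"))
```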
Training Language Models to Critique With Multi-agent Feedback
Tian Lan | Wenwei Zhang | Chengqi Lyu | Shuaibin Li | Chen Xu | Heyan Huang | Dahua Lin | Xian-Ling Mao | Kai Chen
Findings of the Association for Computational Linguistics: EMNLP 2025
Critique ability, a meta-cognitive capability of humans, is challenging for LLMs to improve. While human annotation can enhance critique ability effectively, most recent works primarily rely on supervised fine-tuning (SFT) with critiques generated by a single LLM such as GPT-4, which is more scalable and cost-effective. However, such model-generated critiques often suffer from inherent flaws due to the complexity of critique. Consequently, fine-tuning LLMs on these flawed critiques not only limits performance but also propagates errors into the learned model. To address this issue, we propose MultiCritique, a unified framework that leverages multi-agent feedback to improve critique ability in both the supervised fine-tuning (SFT) and reinforcement learning (RL) stages. In the SFT stage, MultiCritique aggregates high-quality multi-agent critiques through a fine-grained meta-critique mechanism. In the RL stage, preference critiques are constructed and refined by validating their contributions to revisions, thereby enhancing the robustness of RL in improving critique ability. Based on MultiCritique, we construct SFT and RL datasets. Extensive experimental results on two benchmarks highlight the key benefits of our dataset, including superior quality, enhanced data efficiency, strong generalization to unseen tasks, and improvements in the general capability of LLMs. Notably, our fine-tuned 7B model significantly surpasses advanced 7B-13B models, approaching advanced 70B LLMs and GPT-4. Resources are publicly available.
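A rough sketch of the SFT-stage idea, under loose assumptions: gather critiques from several agents, score each with a meta-critique judge, and keep the best as supervision. The `ToyAgent` class, the `meta_judge` proxy, and the top-2 selection are invented for illustration and are not MultiCritique's actual pipeline.

```python
# Hypothetical sketch of multi-agent critique aggregation for SFT data
# construction; `ToyAgent` and `meta_judge` stand in for LLM calls.
from dataclasses import dataclass

@dataclass
class ToyAgent:
    name: str
    def critique(self, response: str) -> str:
        return f"[{self.name}] The response could cite evidence for its claims."

def build_sft_critique(response: str, agents, meta_judge) -> str:
    """Aggregate multi-agent critiques via meta-critique quality scores."""
    critiques = [agent.critique(response) for agent in agents]
    scored = [(meta_judge(response, text), text) for text in critiques]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return "\n\n".join(text for _, text in scored[:2])  # keep the best critiques

agents = [ToyAgent("judge-a"), ToyAgent("judge-b"), ToyAgent("judge-c")]
meta_judge = lambda resp, crit: len(crit)  # toy proxy for critique quality
print(build_sft_critique("Some model response.", agents, meta_judge))
```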
2022
Cross-Lingual Phrase Retrieval
Heqi Zheng | Xiao Zhang | Zewen Chi | Heyan Huang | Yan Tan | Tian Lan | Wei Wei | Xian-Ling Mao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Cross-lingual retrieval aims to retrieve relevant text across languages. Current methods typically achieve cross-lingual retrieval by learning language-agnostic text representations at the word or sentence level. However, how to learn phrase representations for cross-lingual phrase retrieval is still an open problem. In this paper, we propose XPR, a cross-lingual phrase retriever that extracts phrase representations from unlabeled example sentences. Moreover, we create a large-scale cross-lingual phrase retrieval dataset, which contains 65K bilingual phrase pairs and 4.2M example sentences in 8 English-centric language pairs. Experimental results show that XPR outperforms state-of-the-art baselines that utilize word-level or sentence-level representations. XPR also shows impressive zero-shot transferability, enabling the model to perform retrieval on a language pair unseen during training. Our dataset, code, and trained models are publicly available at github.com/cwszz/XPR/.
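One simple way to extract a phrase representation from an example sentence, sketched here for illustration only (this is not the released XPR code): mean-pool the contextual subword embeddings that fall inside the phrase's character span, using a multilingual encoder such as XLM-R.

```python
# Illustrative sketch only (not the released XPR code): build a phrase
# representation by mean-pooling the contextual subword embeddings that fall
# inside the phrase's character span, using a multilingual encoder.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def phrase_embedding(sentence: str, phrase: str) -> torch.Tensor:
    start = sentence.index(phrase)           # assumes the phrase occurs verbatim
    end = start + len(phrase)
    enc = tokenizer(sentence, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state[0]      # (seq_len, dim)
    # Keep subword tokens whose character span overlaps the phrase span;
    # special tokens have (0, 0) offsets and are excluded by e > s.
    mask = torch.tensor([s < end and e > start and e > s for s, e in offsets])
    return hidden[mask].mean(dim=0)                       # mean-pooled phrase vector

vec = phrase_embedding("The central bank raised interest rates.", "interest rates")
print(vec.shape)  # torch.Size([768])
```

Cross-lingual retrieval would then reduce to nearest-neighbor search (e.g., by cosine similarity) between such vectors extracted from different languages.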
TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning
Yixuan Su | Fangyu Liu | Zaiqiao Meng | Tian Lan | Lei Shu | Ehsan Shareghi | Nigel Collier
Findings of the Association for Computational Linguistics: NAACL 2022
Masked language models (MLMs) such as BERT have revolutionized the field of Natural Language Understanding in the past few years. However, existing pre-trained MLMs often output an anisotropic distribution of token representations that occupies a narrow subset of the entire representation space. Such token representations are not ideal, especially for tasks that demand discriminative semantic meanings of distinct tokens. In this work, we propose TaCL (Token-aware Contrastive Learning), a novel continual pre-training approach that encourages BERT to learn an isotropic and discriminative distribution of token representations. TaCL is fully unsupervised and requires no additional data. We extensively test our approach on a wide range of English and Chinese benchmarks. The results show that TaCL brings consistent and notable improvements over the original BERT model. Furthermore, we conduct a detailed analysis to reveal the merits and inner workings of our approach.
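The sketch below shows a token-level contrastive objective in the spirit of the abstract: each student token representation is treated as a query whose positive is the teacher representation at the same position, with all other positions serving as negatives. The temperature and exact loss form are assumptions, not the paper's training objective.

```python
# Minimal sketch of a token-level contrastive objective in the spirit of the
# abstract (not the authors' training code): each student token representation
# is pulled toward the teacher representation at the same position and pushed
# away from the teacher representations at all other positions.
import torch
import torch.nn.functional as F

def token_contrastive_loss(student: torch.Tensor,
                           teacher: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """student, teacher: (seq_len, dim) token representations of one sentence."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    logits = s @ t.T / temperature       # (seq_len, seq_len) cosine similarities
    targets = torch.arange(s.size(0))    # positive pair = same token position
    return F.cross_entropy(logits, targets)

loss = token_contrastive_loss(torch.randn(12, 768), torch.randn(12, 768))
print(loss.item())
```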