2025
pdf
bib
abs
DHP Benchmark: Are LLMs Good NLG Evaluators?
Yicheng Wang
|
Jiayi Yuan
|
Yu-Neng Chuang
|
Zhuoer Wang
|
Yingchi Liu
|
Mark Cusick
|
Param Kulkarni
|
Zhengping Ji
|
Yasser Ibrahim
|
Xia Hu
Findings of the Association for Computational Linguistics: NAACL 2025
Large Language Models (LLMs) are increasingly serving as evaluators in Natural Language Generation (NLG) tasks; this is often referred to as “LLM-as-a-judge” paradigm. However, the capabilities of LLMs in evaluating NLG quality remain underexplored. Current studies depend on human assessments and simple metrics that fail to capture the discernment of LLMs across diverse NLG tasks. To address this gap, we propose the Discernment of Hierarchical Perturbation (DHP) benchmarking framework, which provides quantitative discernment scores for LLMs. This framework leverages hierarchically perturbed text data and statistical tests to systematically measure the NLG evaluation capabilities of LLMs. We re-established six evaluation datasets for this benchmark, covering four NLG tasks: Summarization, Story Completion, Question Answering, and Translation. Our comprehensive benchmarking of five major LLM families provides critical insight into their strengths and limitations as NLG evaluators. Our dataset is available at https://huggingface.co/datasets/YCWANGVINCE/DHP_Benchmark.
2019
pdf
bib
abs
Uncover Sexual Harassment Patterns from Personal Stories by Joint Key Element Extraction and Categorization
Yingchi Liu
|
Quanzhi Li
|
Marika Cifor
|
Xiaozhong Liu
|
Qiong Zhang
|
Luo Si
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
The number of personal stories about sexual harassment shared online has increased exponentially in recent years. This is in part inspired by the #MeToo and #TimesUp movements. Safecity is an online forum for people who experienced or witnessed sexual harassment to share their personal experiences. It has collected >10,000 stories so far. Sexual harassment occurred in a variety of situations, and categorization of the stories and extraction of their key elements will provide great help for the related parties to understand and address sexual harassment. In this study, we manually annotated those stories with labels in the dimensions of location, time, and harassers’ characteristics, and marked the key elements related to these dimensions. Furthermore, we applied natural language processing technologies with joint learning schemes to automatically categorize these stories in those dimensions and extract key elements at the same time. We also uncovered significant patterns from the categorized sexual harassment stories. We believe our annotated data set, proposed algorithms, and analysis will help people who have been harassed, authorities, researchers and other related parties in various ways, such as automatically filling reports, enlightening the public in order to prevent future harassment, and enabling more effective, faster action to be taken.
pdf
bib
abs
Rumor Detection on Social Media: Datasets, Methods and Opportunities
Quanzhi Li
|
Qiong Zhang
|
Luo Si
|
Yingchi Liu
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda
Social media platforms have been used for information and news gathering, and they are very valuable in many applications. However, they also lead to the spreading of rumors and fake news. Many efforts have been taken to detect and debunk rumors on social media by analyzing their content and social context using machine learning techniques. This paper gives an overview of the recent studies in the rumor detection field. It provides a comprehensive list of datasets used for rumor detection, and reviews the important studies based on what types of information they exploit and the approaches they take. And more importantly, we also present several new directions for future research.
2018
pdf
bib
abs
NAI-SEA at SemEval-2018 Task 5: An Event Search System
Yingchi Liu
|
Quanzhi Li
|
Luo Si
Proceedings of the 12th International Workshop on Semantic Evaluation
In this paper, we describe Alibaba’s participating system in the semEval-2018 Task5: Counting Events and Participants in the Long Tail. We designed and implemented a pipeline system that consists of components to extract question properties and document features, document event category classifications, document retrieval and document clustering. To retrieve the majority of the relevant documents, we carefully designed our system to extract key information from each question and document pair. After retrieval, we perform further document clustering to count the number of events. The task contains 3 subtasks, on which we achieved F1 score of 78.33, 50.52, 63.59 , respectively, for document level retrieval. Our system ranks first in all the three subtasks on document level retrieval, and it also ranks first in incident-level evaluation by RSME measure in subtask 3.