Liu Liu


2025

ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks
Yan Yang | Dongxu Li | Haoning Wu | Bei Chen | Liu Liu | Liyuan Pan | Junnan Li
Findings of the Association for Computational Linguistics: ACL 2025

Solving expert-level multimodal tasks is a key milestone toward general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to evolve, evaluating frontier multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that demand professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently collected from professionals based on their productivity demands. It spans 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 of the latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, all of them face significant challenges in visual perception, textual understanding, domain knowledge, and advanced reasoning. Our benchmark is publicly accessible at TBC.
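
The abstract does not detail the judging protocol, so the following is only a minimal sketch of a pairwise MLLM-as-a-Judge comparison under stated assumptions: the query_mllm client, the prompt template, and the verdict parsing are hypothetical placeholders, not part of ProBench.

    # Minimal sketch of pairwise MLLM-as-a-Judge comparison (illustrative only).
    from dataclasses import dataclass

    @dataclass
    class Sample:
        image_path: str   # multimodal input accompanying the query
        question: str     # open-ended expert query
        answer_a: str     # response from candidate model A
        answer_b: str     # response from candidate model B

    def query_mllm(prompt: str, image_path: str) -> str:
        """Placeholder for a call to a judge MLLM API (hypothetical); returns a dummy verdict."""
        return "TIE"

    def judge_pair(sample: Sample) -> str:
        """Ask the judge model which answer better serves the query: 'A', 'B', or 'TIE'."""
        prompt = (
            "You are grading two answers to an expert-level query.\n"
            f"Query: {sample.question}\n"
            f"Answer A: {sample.answer_a}\n"
            f"Answer B: {sample.answer_b}\n"
            "Reply with exactly 'A', 'B', or 'TIE'."
        )
        verdict = query_mllm(prompt, sample.image_path).strip().upper()
        return verdict if verdict in {"A", "B", "TIE"} else "TIE"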

MONTROSE: LLM-driven Monte Carlo Tree Search Self-Refinement for Cross-Domain Rumor Detection
Shanshan Liu | Menglong Lu | Zhen Huang | Zejiang He | Liu Liu | Zhigang Sun | Dongsheng Li
Findings of the Association for Computational Linguistics: ACL 2025

With the emergence of new topics on social media as sources of rumor dissemination, addressing the distribution shift between source and target domains remains a crucial task in cross-domain rumor detection. Existing feature-alignment methods, which aim to reduce the discrepancies between domains, are often susceptible to task interference during training, while data-distribution-alignment methods, which rely on existing data to synthesize new training samples, inherently introduce noise. To address these challenges, a new cross-domain rumor detection method, MONTROSE, is proposed. It combines LLM-driven Monte Carlo Tree Search (MCTS) data synthesis, which generates high-quality synthetic data for the target domain, with a domain-sharpness-aware (DSAM) self-refinement approach that trains rumor detection models effectively on these synthetic data. Experiments demonstrate the superior performance of MONTROSE in cross-domain rumor detection.
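
To make the data-synthesis idea concrete, here is an illustrative sketch of LLM-driven MCTS for generating target-domain samples. The UCT tree policy, the hypothetical llm_rewrite proposal step, and the reward heuristic are assumptions for illustration; the paper's actual search design may differ.

    # Sketch: MCTS where each node is a candidate synthetic post and an LLM proposes rewrites.
    import math
    import random

    class Node:
        def __init__(self, text, parent=None):
            self.text = text          # candidate synthetic sample
            self.parent = parent
            self.children = []
            self.visits = 0
            self.value = 0.0

    def uct(node, c=1.4):
        """Upper-confidence bound used to pick which child to explore next."""
        if node.visits == 0:
            return float("inf")
        return node.value / node.visits + c * math.sqrt(
            math.log(node.parent.visits) / node.visits
        )

    def llm_rewrite(text: str) -> str:
        """Placeholder: ask an LLM to rewrite `text` toward the target domain (hypothetical)."""
        return text + " (rewritten)"

    def reward(text: str) -> float:
        """Placeholder quality score, e.g. a detector's confidence on the target domain."""
        return random.random()

    def mcts(seed_text: str, iterations: int = 50) -> str:
        root = Node(seed_text)
        for _ in range(iterations):
            # Selection: descend by UCT until a leaf is reached.
            node = root
            while node.children:
                node = max(node.children, key=uct)
            # Expansion: one LLM-proposed rewrite of the selected sample.
            child = Node(llm_rewrite(node.text), parent=node)
            node.children.append(child)
            # Evaluation and backpropagation of the reward.
            r = reward(child.text)
            while child is not None:
                child.visits += 1
                child.value += r
                child = child.parent
        # Return the most-visited first-level rewrite as the synthetic sample.
        return max(root.children, key=lambda n: n.visits).text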

2023

Vector Based Stylistic Analysis on Ancient Chinese Books: Take the Three Commentaries on the Spring and Autumn Annals as an Example
Yue Qi | Liu Liu | Bin Li | Dongbo Wang
Proceedings of the Ancient Language Processing Workshop

The Commentary of Gongyang, the Commentary of Guliang, and the Commentary of Zuo are collectively called the Three Commentaries on the Spring and Autumn Annals; they supplement and interpret the content of the Spring and Autumn Annals and are valuable for historical and literary research. In traditional research paradigms, scholars have often explored the differences among the Three Commentaries through close reading of details in context. From the perspective of computational humanities, this paper examines differences in the language style of the Three Commentaries through language representations learned with deep learning methods. Specifically, this study vectorizes the texts at the word and sentence levels and maps the resulting vectors into the same plane to reveal differences in word and sentence usage across the Three Commentaries. The results show that the Commentary of Gongyang and the Commentary of Guliang are relatively similar, while the Commentary of Zuo is significantly different. This paper verifies the feasibility of deep learning methods for stylistic analysis in computational humanities and provides a valuable perspective for studying the Three Commentaries on the Spring and Autumn Annals.
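
As a rough illustration of the "vectorize and project into the same plane" step, the sketch below uses TF-IDF character n-grams and PCA as stand-ins for the paper's deep-learning representations; the corpus sentences, the feature choice, and the projection method are all assumptions for illustration.

    # Sketch: embed sentences from the three texts and project them onto one 2D plane.
    from sklearn.decomposition import PCA
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder sentences; in practice these would be sentences from each commentary.
    corpora = {
        "Gongyang": ["example sentence 1", "example sentence 2"],
        "Guliang":  ["example sentence 3", "example sentence 4"],
        "Zuo":      ["example sentence 5", "example sentence 6"],
    }

    texts, labels = [], []
    for name, sentences in corpora.items():
        texts.extend(sentences)
        labels.extend([name] * len(sentences))

    # Character n-grams are a common choice for Classical Chinese (assumption).
    vectors = TfidfVectorizer(analyzer="char", ngram_range=(1, 2)).fit_transform(texts)
    points = PCA(n_components=2).fit_transform(vectors.toarray())

    # Points close together on this plane suggest similar word/sentence usage.
    for label, (x, y) in zip(labels, points):
        print(f"{label}: ({x:.3f}, {y:.3f})")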