This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
YahuiLiu
Also published as:
亚慧 刘
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
Video captioning can be used to assess the video understanding capabilities of Multimodal Large Language Models (MLLMs).However, existing benchmarks and evaluation protocols suffer from crucial issues, such as inadequate or homogeneous creation of key points, exorbitant cost of data creation, and limited evaluation scopes. To address these issues, we propose an automatic framework, named AutoCaption, which leverages Monte Carlo Tree Search (MCTS) to construct numerous and diverse descriptive sentences (i.e., key points) that thoroughly represent video content in an iterative way. This iterative captioning strategy enables the continuous enhancement of video details such as actions, objects’ attributes, environment details, etc. We apply AutoCaption to curate MCTS-VCB, a fine-grained video caption benchmark covering video details, thereby enabling a comprehensive evaluation of MLLMs on the video captioning task. We evaluate more than 20 open- and closed-source MLLMs of varying sizes on MCTS-VCB. Results show that MCTS-VCB can effectively and comprehensively evaluate the video captioning capability, with Gemini-1.5-Pro achieving the highest F1 score of 71.2. Interestingly, we fine-tune InternVL2.5-8B with the AutoCaption-generated data, which helps the model achieve an overall improvement of 25.0% on MCTS-VCB and 16.3% on DREAM-1K, further demonstrating the effectiveness of AutoCaption. The code and data are available at https://github.com/tjunlp-lab/MCTS-VCB.
Recent advances in reasoning with large language models (LLMs) have popularized Long Chain-of-Thought (LCoT), a strategy that encourages deliberate and step-by-step reasoning before producing a final answer. While LCoTs have enabled expert-level performance in complex tasks, how the internal structures of their reasoning chains drive, or even predict, the correctness of final answers remains a critical yet underexplored question. In this work, we present LCoT2Tree, an automated framework that converts sequential LCoTs into hierarchical tree structures and thus enables deeper structural analysis of LLM reasoning. Using graph neural networks (GNNs), we reveal that structural patterns extracted by LCoT2Tree, including exploration, backtracking, and verification, serve as stronger predictors of final performance across a wide range of tasks and models. Leveraging an explainability technique, we further identify critical thought patterns such as over-branching that account for failures. Beyond diagnostic insights, the structural patterns by LCoT2Tree support practical applications, including improving Best-of-N decoding effectiveness. Overall, our results underscore the critical role of internal structures of reasoning chains, positioning LCoT2Tree as a powerful tool for diagnosing, interpreting, and improving reasoning in LLMs.
“We participate in the open track of the Chinese frame semantic parsing (CFSP) task, i.e., CCL24Eval Task 1, and our submission ranks first. FSP is an important task in Natural Language Processing, aiming to extract the frame semantic structures from sentences, which can be divided into three subtasks, e.g., Frame Identification (FI), Argument Identification (AI), and Role Identification (RI). In this paper, we use the LLM Gemini 1.0 to evaluate the three subtasks of CFSP, and present the techniques and strategies we employed to enhance subtasks performance. For FI, we leverage mapping and similarity strategies to minimize the candidate frames for each target word, which can reduce the complexity of the LLM in identifying the appropriate frame. For AI and RI subtasks, we utilize the results from small models as auxiliary information and apply data augmentation, self-training, and model ensemble techniques on these small models to further enhance the performance of subtasks.”
Current practices in metric evaluation focus on one single dataset, e.g., Newstest dataset in each year’s WMT Metrics Shared Task. However, in this paper, we qualitatively and quantitatively show that the performances of metrics are sensitive to data. The ranking of metrics varies when the evaluation is conducted on different datasets. Then this paper further investigates two potential hypotheses, i.e., insignificant data points and the deviation of i.i.d assumption, which may take responsibility for the issue of data variance. In conclusion, our findings suggest that when evaluating automatic translation metrics, researchers should take data variance into account and be cautious to report the results on unreliable datasets, because it may leads to inconsistent results with most of the other datasets.
During the past decade, neural network models have made tremendous progress on in-domain semantic role labeling (SRL). However, performance drops dramatically under the out-of-domain setting. In order to facilitate research on cross-domain SRL, this paper presents MuCPAD, a multi-domain Chinese predicate-argument dataset, which consists of 30,897 sentences and 92,051 predicates from six different domains. MuCPAD exhibits three important features. 1) Based on a frame-free annotation methodology, we avoid writing complex frames for new predicates. 2) We explicitly annotate omitted core arguments to recover more complete semantic structure, considering that omission of content words is ubiquitous in multi-domain Chinese texts. 3) We compile 53 pages of annotation guidelines and adopt strict double annotation for improving data quality. This paper describes in detail the annotation methodology and annotation process of MuCPAD, and presents in-depth data analysis. We also give benchmark results on cross-domain SRL based on MuCPAD.
Sequence-to-sequence neural generation models have achieved promising performance on short text conversation tasks. However, they tend to generate generic/dull responses, leading to unsatisfying dialogue experience. We observe that in the conversation tasks, each query could have multiple responses, which forms a 1-to-n or m-to-n relationship in the view of the total corpus. The objective function used in standard sequence-to-sequence models will be dominated by loss terms with generic patterns. Inspired by this observation, we introduce a statistical re-weighting method that assigns different weights for the multiple responses of the same query, and trains the common neural generation model with the weights. Experimental results on a large Chinese dialogue corpus show that our method improves the acceptance rate of generated responses compared with several baseline models and significantly reduces the number of generated generic responses.