Tongyan Hu


2025

VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
Tingyu Song | Tongyan Hu | Guo Gan | Yilun Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, multimodal large language models (MLLMs) have been extensively explored in video question answering. However, most existing assessments focus on natural videos, overlooking synthetic videos (e.g., AI-generated content). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we design a re-prompt pipeline, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.

FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
Tiansheng Hu | Tongyan Hu | Liuyang Bai | Yilun Zhao | Arman Cohan | Chen Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent LLMs have demonstrated promising ability in solving finance-related problems. However, applying LLMs in real-world finance applications remains challenging due to their high-risk, high-stakes nature. This paper introduces FinTrust, a comprehensive benchmark specifically designed for evaluating the trustworthiness of LLMs in finance applications. Our benchmark focuses on a wide range of alignment issues based on practical context and features fine-grained tasks for each dimension of trustworthiness evaluation. We assess eleven LLMs on FinTrust and find that proprietary models like o4-mini outperform in most tasks, such as safety, while open-source models like DeepSeek-V3 have an advantage in specific areas like industry-level fairness. For challenging tasks like fiduciary alignment and disclosure, all LLMs fall short, showing a significant gap in legal awareness. We believe that FinTrust can be a valuable benchmark for LLMs’ trustworthiness evaluation in the finance domain.