2025
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation
Zheqi He | Yesheng Liu | Jing-Shu Zheng | Xuejing Li | Jin-Ge Yao | Bowen Qin | Richeng Xuan | Xi Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM, with a demonstration video available at https://youtu.be/L7EtacjoM0k.
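For concreteness, here is a minimal sketch of the decoupled design described above. The endpoint name and payload fields are hypothetical, not the actual FlagEvalMM API: the inference process (backed by, e.g., vLLM or SGLang) only produces predictions, then posts them to a separate evaluation service that computes the metrics.

# Minimal sketch (hypothetical endpoint/fields, not the FlagEvalMM API).
import json
from urllib import request

EVAL_SERVICE = "http://localhost:8000"  # hypothetical evaluation service

def submit_predictions(task: str, predictions: list) -> dict:
    """POST model outputs to the evaluation service and return its metrics."""
    payload = json.dumps({"task": task, "predictions": predictions}).encode()
    req = request.Request(
        f"{EVAL_SERVICE}/evaluate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# The inference side can use any backend; only the predictions cross the wire.
preds = [{"question_id": 0, "answer": "a red bicycle"}]
print(submit_predictions("vqa_demo", preds))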
FlagEval-Arena: A Side-by-Side Comparative Evaluation Platform for Large Language Models and Text-Driven AIGC
Jing-Shu Zheng | Richeng Xuan | Bowen Qin | Zheqi He | Tongshuai Ren | Xuejing Li | Jin-Ge Yao | Xi Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
We introduce FlagEval-Arena, an evaluation platform for side-by-side comparisons of large language models and text-driven AIGC systems. Compared with the well-known LM Arena (LMSYS Chatbot Arena), we reimplement our own framework with the flexibility to introduce new mechanisms or features. Our platform enables side-by-side evaluation not only for language models and vision-language models, but also for text-to-image and text-to-video synthesis. We specifically target a Chinese audience, with a stronger focus on the Chinese language, more models developed by Chinese institutes, and more general usage beyond the technical community. As a result, we currently observe interesting differences from the results typically presented by LM Arena. Our platform is available at https://flageval.baai.org/#/arena.
2021
Issues with Entailment-based Zero-shot Text Classification
Tingting Ma | Jin-Ge Yao | Chin-Yew Lin | Tiejun Zhao
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
The general format of natural language inference (NLI) makes it tempting to use for zero-shot text classification: cast any target label into a hypothesis sentence and verify whether it is entailed by the input, aiming at generic classification applicable to any specified label space. In this opinion piece, we point out a few overlooked issues that are yet to be discussed in this line of work. We observe huge variance across different classification datasets among standard BERT-based NLI models, and surprisingly find that pre-trained BERT without any fine-tuning can yield performance competitive with BERT fine-tuned for NLI. Concerned that these models rely heavily on spurious lexical patterns for prediction, we also experiment with preliminary approaches for more robust NLI, but the results are in general negative. Our observations reveal implicit but challenging difficulties in entailment-based zero-shot text classification.
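A minimal sketch of the entailment-based recipe the paper examines, using an off-the-shelf NLI model via the Hugging Face zero-shot pipeline (the model choice and hypothesis template are illustrative, not the paper's exact setup):

# Cast each candidate label into a hypothesis sentence and let an NLI
# model score whether the input entails it.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The team clinched the championship with a last-minute goal.",
    candidate_labels=["sports", "politics", "technology"],
    hypothesis_template="This text is about {}.",
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label first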
2019
A Simple Recipe towards Reducing Hallucination in Neural Surface Realisation
Feng Nie | Jin-Ge Yao | Jinpeng Wang | Rong Pan | Chin-Yew Lin
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Recent neural language generation systems often hallucinate content (i.e., produce irrelevant or contradictory facts), especially when trained on loosely corresponding pairs of input structures and text. To mitigate this issue, we propose to integrate a language understanding module for data refinement with self-training iterations, effectively inducing strong equivalence between the input data and the paired text. Experiments on the E2E challenge dataset show that our proposed framework can remove more than 50% of the unaligned noise (relative) from the original data-text pairs. A vanilla sequence-to-sequence neural NLG model trained on the refined data improves on content correctness compared with the current state-of-the-art ensemble generator.
Towards Improving Neural Named Entity Recognition with Gazetteers
Tianyu Liu | Jin-Ge Yao | Chin-Yew Lin
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Most recently proposed neural models for named entity recognition are purely data-driven, with a strong emphasis on eliminating the effort of collecting external resources or designing hand-crafted features. This can increase the chance of overfitting, since the models cannot access any supervision signal beyond the small amount of annotated data, limiting their power to generalize beyond the annotated entities. In this work, we show that properly utilizing external gazetteers can benefit segmental neural NER models. We add a simple module to the recently proposed hybrid semi-Markov CRF architecture and observe some promising results.
A Closer Look at Recent Results of Verb Selection for Data-to-Text NLG
Guanyi Chen | Jin-Ge Yao
Proceedings of the 12th International Conference on Natural Language Generation
Automatic natural language generation systems need to use contextually appropriate verbs when describing different kinds of facts or events, which has triggered research interest in verb selection for data-to-text generation. In this paper, we discuss a few limitations of the current task settings and evaluation metrics. We also provide two simple, efficient, interpretable baseline approaches for statistical selection of trend verbs, which achieve strong performance on both the previously used evaluation metrics and our new evaluation.
2018
On the Abstractiveness of Neural Document Summarization
Fangfang Zhang | Jin-ge Yao | Rui Yan
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Many modern neural document summarization systems based on encoder-decoder networks are designed to produce abstractive summaries. We attempted to verify the degree of abstractiveness of modern neural abstractive summarization systems by calculating overlaps in terms of various types of units. Observing that many abstractive systems tend to be near-extractive in practice, we also implemented a pure copy system, which achieved results comparable to abstractive summarizers while being far more computationally efficient. These findings suggest the possibility of future efforts towards more efficient systems that better utilize the vocabulary of the original document.
Learning Latent Semantic Annotations for Grounding Natural Language to Structured Data
Guanghui Qin | Jin-Ge Yao | Xuening Wang | Jinpeng Wang | Chin-Yew Lin
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Previous work on grounded language learning did not fully capture the semantics underlying the correspondences between structured world-state representations and texts, especially those between numerical values and lexical terms. In this paper, we attempt to learn explicit latent semantic annotations from paired structured tables and texts, establishing correspondences between various types of values and texts. We model the joint probability of data fields, texts, phrasal spans, and latent annotations with an adapted semi-hidden Markov model, and impose a soft statistical constraint to further improve performance. As a by-product, we leverage the induced annotations to extract templates for language generation. Experimental results suggest the feasibility of the setting studied here, as well as the effectiveness of our proposed framework.
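For intuition, a generic semi-Markov factorization of this kind (notation illustrative, not the paper's exact adapted model) segments the text x into K spans, each carrying a latent annotation l_k:

p(x, \mathbf{l}) = \prod_{k=1}^{K} p(l_k \mid l_{k-1}) \, p(x_{b_k:e_k} \mid l_k)

where (b_k, e_k) are the boundaries of the k-th span; a soft statistical constraint such as the one mentioned above would be imposed on top of this factorization.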
Operation-guided Neural Networks for High Fidelity Data-To-Text Generation
Feng Nie | Jinpeng Wang | Jin-Ge Yao | Rong Pan | Chin-Yew Lin
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Recent neural models for data-to-text generation are mostly based on data-driven end-to-end training of encoder-decoder networks. Even though the generated texts are mostly fluent and informative, these models often produce descriptions that are inconsistent with the input structured data. This is a critical issue, especially in domains that require inference or calculation over raw data. In this paper, we attempt to improve the fidelity of neural data-to-text generation by utilizing pre-executed symbolic operations. We propose a framework called Operation-guided Attention-based sequence-to-sequence network (OpAtt), with a specifically designed gating mechanism as well as a quantization module for operation results, to utilize information from pre-executed operations. Experiments on two sports datasets show that our proposed method clearly improves the fidelity of the generated texts to the input structured data.
Data2Text Studio: Automated Text Generation from Structured Data
Longxu Dou | Guanghui Qin | Jinpeng Wang | Jin-Ge Yao | Chin-Yew Lin
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Data2Text Studio is a platform for automated text generation from structured data. It is equipped with a semi-HMM model that automatically extracts high-quality templates and corresponding trigger conditions from parallel data, which improves the interactivity and interpretability of the generated text. In addition, several easy-to-use tools are provided for developers to edit the templates of pre-trained models, and APIs are released so that developers can call the pre-trained models to generate texts in third-party applications. We conduct experiments on the RotoWire datasets for template extraction and text generation. The results show that our model achieves improvements on both tasks.
Using Intermediate Representations to Solve Math Word Problems
Danqing Huang | Jin-Ge Yao | Chin-Yew Lin | Qingyu Zhou | Jian Yin
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
To solve math word problems, previous statistical approaches attempt to learn a direct mapping from a problem description to its corresponding equation system. However, such mappings do not capture a few higher-order operations that cannot be explicitly represented in equations but are required to solve the problem. The gap between natural language and equations makes it difficult for a learned model to generalize from limited data. In this work we present an intermediate meaning representation scheme that tries to reduce this gap. We use a sequence-to-sequence model with a novel attention regularization term to generate the intermediate forms, then execute them to obtain the final answers. Since the intermediate forms are latent, we propose an iterative labeling framework that learns by leveraging supervision signals from both equations and answers. Our experiments show that using intermediate forms outperforms directly predicting equations.
2017
Leveraging Diverse Lexical Chains to Construct Essays for Chinese College Entrance Examination
Liunian Li | Xiaojun Wan | Jin-ge Yao | Siming Yan
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
In this work we study the challenging task of automatically constructing essays for the Chinese college entrance examination, where the topic is specified in advance. We explore a sentence-extraction framework based on diversified lexical chains to capture coherence and richness. Experimental analysis shows the effectiveness of our approach and reveals the importance of information richness in essay writing.
Content Selection for Real-time Sports News Construction from Commentary Texts
Jin-ge Yao | Jianmin Zhang | Xiaojun Wan | Jianguo Xiao
Proceedings of the 10th International Conference on Natural Language Generation
We study the task of automatically constructing sports news reports from live commentary, focusing on content selection. Rather than receiving every piece of text of a sports match before news construction, as in previous related work, we verify the feasibility of a more challenging but more useful setting: generating news reports on the fly by treating the live text input as a stream. Specifically, we design various scoring functions to address the different requirements of the task. The near-submodularity of these scoring functions makes it possible to adapt efficient greedy algorithms even in streaming settings. Experiments suggest that our proposed framework already produces results comparable to previous work that relies on a supervised learning-to-rank model with heavy feature engineering.
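As a toy sketch of the budgeted greedy selection that such near-submodular scoring enables (the word-coverage score here is illustrative, not one of the paper's scoring functions):

# Greedily pick the sentence with the largest marginal gain until the
# budget is exhausted or no sentence adds new coverage.
def coverage_gain(covered: set, sentence: str) -> int:
    return len(set(sentence.split()) - covered)

def greedy_select(sentences: list, budget: int) -> list:
    selected, covered = [], set()
    while len(selected) < budget:
        best = max(
            (s for s in sentences if s not in selected),
            key=lambda s: coverage_gain(covered, s),
            default=None,
        )
        if best is None or coverage_gain(covered, best) == 0:
            break
        selected.append(best)
        covered |= set(best.split())
    return selected

commentary = [
    "Early goal puts the home side ahead",
    "Defender booked for a late tackle",
    "Early goal puts the home side ahead again",
]
print(greedy_select(commentary, budget=2))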
2016
Towards Constructing Sports News from Live Text Commentary
Jianmin Zhang | Jin-ge Yao | Xiaojun Wan
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
2015
Phrase-based Compressive Cross-Language Summarization
Jin-ge Yao | Xiaojun Wan | Jianguo Xiao
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
2014
Joint Decoding of Tree Transduction Models for Sentence Compression
Jin-ge Yao | Xiaojun Wan | Jianguo Xiao
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)