2025
pdf
bib
abs
TRUSTEVAL: A Dynamic Evaluation Toolkit on Trustworthiness of Generative Foundation Models
Yanbo Wang
|
Jiayi Ye
|
Siyuan Wu
|
Chujie Gao
|
Yue Huang
|
Xiuying Chen
|
Yue Zhao
|
Xiangliang Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Ensuring the trustworthiness of Generative Foundation Models (GenFMs) is a pressing challenge as they gain widespread use. Existing evaluation toolkits are often limited in scope, dynamism, and flexibility. This paper introduces TRUSTEVAL, a dynamic and comprehensive toolkit designed for evaluating GenFMs across various dimensions. TRUSTEVAL supports both dynamic dataset generation and evaluation, offering advanced features including comprehensiveness, usability, and flexibility. TRUSTEVAL integrates diverse generative models, datasets, evaluation methods, metrics, inference efficiency enhancement, and evaluation report generation. Through case studies, we demonstrate TRUSTEVAL’s potential to advance the trustworthiness evaluation of GenFMs.
2024
pdf
bib
abs
SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark
Zhenwen Liang
|
Kehan Guo
|
Gang Liu
|
Taicheng Guo
|
Yujun Zhou
|
Tianyu Yang
|
Jiajun Jiao
|
Renjie Pi
|
Jipeng Zhang
|
Xiangliang Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
The paper introduces SceMQA, a novel benchmark for scientific multimodal question answering at the college entrance level. It addresses a critical educational phase often overlooked in existing benchmarks, spanning high school to pre-college levels. SceMQA focuses on core science subjects including Mathematics, Physics, Chemistry, and Biology. It features a blend of multiple-choice and free-response formats, ensuring a comprehensive evaluation of AI models’ abilities. Additionally, our benchmark provides specific knowledge points for each problem and detailed explanations for each answer. SceMQA also uniquely presents problems with identical contexts but varied questions to facilitate a more thorough and accurate assessment of reasoning capabilities. In the experiment, we evaluate both open-source and close-source state-of-the-art Multimodal Large Language Models (MLLMs), across various experimental settings. The results show that further research and development are needed in developing more capable MLLM, as highlighted by only 50% to 60% accuracy achieved by the strongest models.
pdf
bib
abs
MinT: Boosting Generalization in Mathematical Reasoning via Multi-view Fine-tuning
Zhenwen Liang
|
Dian Yu
|
Xiaoman Pan
|
Wenlin Yao
|
Qingkai Zeng
|
Xiangliang Zhang
|
Dong Yu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Reasoning in mathematical domains remains a significant challenge for relatively small language models (LMs). Many current methods focus on specializing LMs in mathematical reasoning and rely heavily on distilling knowledge from powerful yet inefficient large LMs (LLMs). In this work, we explore a new direction that avoids over-reliance on LLM teachers, introducing a multi-view fine-tuning method that efficiently exploits existing mathematical problem datasets with diverse annotation styles. Our approach uniquely considers the various annotation formats as different “views” that may help each other and leverage them in training the model. By postpending distinct instructions to input questions, models can learn to generate solutions in diverse formats in a flexible manner. Experimental results show that our strategy enables relatively small LMs to outperform prior approaches that heavily rely on knowledge distillation, as well as carefully established baselines. Additionally, the proposed method grants the models promising generalization ability across various views and datasets, and the capability to learn from inaccurate or incomplete noisy data. We hope our multi-view training paradigm could inspire future studies in other machine reasoning domains.
2023
pdf
bib
abs
UniMath: A Foundational and Multimodal Mathematical Reasoner
Zhenwen Liang
|
Tianyu Yang
|
Jipeng Zhang
|
Xiangliang Zhang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
While significant progress has been made in natural language processing (NLP), existing methods exhibit limitations in effectively interpreting and processing diverse mathematical modalities. Therefore, we introduce UniMath, a versatile and unified system designed for multimodal mathematical reasoning tasks. Tackling complex problem-solving in arithmetic, geometry, and table-based math, UniMath utilizes a fine-tuned T5 model augmented with a variational autoencoder (VAE)-based image tokenizer. By jointly training and evaluating the model on three diverse datasets - SVAMP, GeoQA, and TableMWP, UniMath achieves state-of-the-art performance. The model’s generalization ability is further demonstrated via fine-tuning on two additional datasets, MathQA and Geo-Proving. Through comprehensive evaluations, we showcase that joint training across diverse math tasks improves overall model performance and enhances its ability to generalize across different mathematical reasoning tasks. This pioneering approach provides a blueprint and inspires further efforts on unified mathematical reasoning with deep learning systems.
pdf
bib
abs
Let GPT be a Math Tutor: Teaching Math Word Problem Solvers with Customized Exercise Generation
Zhenwen Liang
|
Wenhao Yu
|
Tanmay Rajpurohit
|
Peter Clark
|
Xiangliang Zhang
|
Ashwin Kalyan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
In this paper, we present a novel approach for distilling math word problem solving capabilities from large language models (LLMs) into smaller, more efficient student models. Our approach is designed to consider the student model’s weaknesses and foster a tailored learning experience by generating targeted exercises aligned with educational science principles, such as knowledge tracing and personalized learning. Concretely, we let GPT-3 be a math tutor and run two steps iteratively: 1) assessing the student model’s current learning status on a GPT-generated exercise book, and 2) improving the student model by training it with tailored exercise samples generated by GPT-3. Experimental results reveal that our approach outperforms LLMs (e.g., GPT-3 and PaLM) in accuracy across three distinct benchmarks while employing significantly fewer parameters. Furthermore, we provide a comprehensive analysis of the various components within our methodology to substantiate their efficacy.
pdf
bib
Don’t be Blind to Questions: Question-Oriented Math Word Problem Solving
Zhenwen Liang
|
Jipeng Zhang
|
Xiangliang Zhang
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
2022
pdf
bib
abs
Scientific Paper Extractive Summarization Enhanced by Citation Graphs
Xiuying Chen
|
Mingzhe Li
|
Shen Gao
|
Rui Yan
|
Xin Gao
|
Xiangliang Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
In a citation graph, adjacent paper nodes share related scientific terms and topics. The graph thus conveys unique structure information of document-level relatedness that can be utilized in the paper summarization task, for exploring beyond the intra-document information.In this work, we focus on leveraging citation graphs to improve scientific paper extractive summarization under different settings.We first propose a Multi-granularity Unsupervised Summarization model (MUS) as a simple and low-cost solution to the task.MUS finetunes a pre-trained encoder model on the citation graph by link prediction tasks.Then, the abstract sentences are extracted from the corresponding paper considering multi-granularity information.Preliminary results demonstrate that citation graph is helpful even in a simple unsupervised framework.Motivated by this, we next propose a Graph-based Supervised Summarizationmodel (GSS) to achieve more accurate results on the task when large-scale labeled data are available.Apart from employing the link prediction as an auxiliary task, GSS introduces a gated sentence encoder and a graph information fusion module to take advantage of the graph information to polish the sentence representation.Experiments on a public benchmark dataset show that MUS and GSS bring substantial improvements over the prior state-of-the-art model.
pdf
bib
abs
ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture
Youssef Mohamed
|
Mohamed Abdelfattah
|
Shyma Alhuwaider
|
Feifan Li
|
Xiangliang Zhang
|
Kenneth Church
|
Mohamed Elhoseiny
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
This paper introduces ArtELingo, a new benchmark and dataset, designed to encourage work on diversity across languages and cultures. Following ArtEmis, a collection of 80k artworks from WikiArt with 0.45M emotion labels and English-only captions, ArtELingo adds another 0.79M annotations in Arabic and Chinese, plus 4.8K in Spanish to evaluate “cultural-transfer” performance. 51K artworks have 5 annotations or more in 3 languages. This diversity makes it possible to study similarities and differences across languages and cultures. Further, we investigate captioning tasks, and find diversity improves the performance of baseline models. ArtELingo is publicly available at ‘www.artelingo.org‘ with standard splits and baseline models. We hope our work will help ease future research on multilinguality and culturally-aware AI.
pdf
bib
abs
Analogical Math Word Problems Solving with Enhanced Problem-Solution Association
Zhenwen Liang
|
Jipeng Zhang
|
Xiangliang Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Math word problem (MWP) solving is an important task in question answering which requires human-like reasoning ability. Analogical reasoning has long been used in mathematical education, as it enables students to apply common relational structures of mathematical situations to solve new problems. In this paper, we propose to build a novel MWP solver by leveraging analogical MWPs, which advance the solver’s generalization ability across different kinds of MWPs. The key idea, named analogy identification, is to associate the analogical MWP pairs in a latent space, i.e., encoding an MWP close to another analogical MWP, while leaving away from the non-analogical ones. Moreover, a solution discriminator is integrated into the MWP solver to enhance the association between an MWP and its true solution. The evaluation results verify that our proposed analogical learning strategy promotes the performance of MWP-BERT on Math23k over the state-of-the-art model Generate2Rank, with 5 times fewer parameters in the encoder. We also find that our model has a stronger generalization ability in solving difficult MWPs due to the analogical learning from easy MWPs.
pdf
bib
abs
Unsupervised Mitigating Gender Bias by Character Components: A Case Study of Chinese Word Embedding
Xiuying Chen
|
Mingzhe Li
|
Rui Yan
|
Xin Gao
|
Xiangliang Zhang
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
Word embeddings learned from massive text collections have demonstrated significant levels of discriminative biases. However, debias on the Chinese language, one of the most spoken languages, has been less explored. Meanwhile, existing literature relies on manually created supplementary data, which is time- and energy-consuming. In this work, we propose the first Chinese Gender-neutral word Embedding model (CGE) based on Word2vec, which learns gender-neutral word embeddings without any labeled data. Concretely, CGE utilizes and emphasizes the rich feminine and masculine information contained in radicals, i.e., a kind of component in Chinese characters, during the training procedure. This consequently alleviates discriminative gender biases. Experimental results on public benchmark datasets show that our unsupervised method outperforms the state-of-the-art supervised debiased word embedding models without sacrificing the functionality of the embedding model.
pdf
bib
abs
ArMATH: a Dataset for Solving Arabic Math Word Problems
Reem Alghamdi
|
Zhenwen Liang
|
Xiangliang Zhang
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper studies solving Arabic Math Word Problems by deep learning. A Math Word Problem (MWP) is a text description of a mathematical problem that can be solved by deriving a math equation to reach the answer. Effective models have been developed for solving MWPs in English and Chinese. However, Arabic MWPs are rarely studied. This paper contributes the first large-scale dataset for Arabic MWPs, which contains 6,000 samples of primary-school math problems, written in Modern Standard Arabic (MSA). Arabic MWP solvers are then built with deep learning models and evaluated on this dataset. In addition, a transfer learning model is built to let the high-resource Chinese MWP solver promote the performance of the low-resource Arabic MWP solver. This work is the first to use deep learning methods to solve Arabic MWP and the first to use transfer learning to solve MWP across different languages. The transfer learning enhanced solver has an accuracy of 74.15%, which is 3% higher than the solver without using transfer learning. We make the dataset and solvers available in public for encouraging more research of Arabic MWPs:
https://github.com/reem-codes/ArMATH2021
pdf
bib
abs
Capturing Relations between Scientific Papers: An Abstractive Model for Related Work Section Generation
Xiuying Chen
|
Hind Alamro
|
Mingzhe Li
|
Shen Gao
|
Xiangliang Zhang
|
Dongyan Zhao
|
Rui Yan
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Given a set of related publications, related work section generation aims to provide researchers with an overview of the specific research area by summarizing these works and introducing them in a logical order. Most of existing related work generation models follow the inflexible extractive style, which directly extract sentences from multiple original papers to form a related work discussion. Hence, in this paper, we propose a Relation-aware Related work Generator (RRG), which generates an abstractive related work from the given multiple scientific papers in the same research area. Concretely, we propose a relation-aware multi-document encoder that relates one document to another according to their content dependency in a relation graph. The relation graph and the document representation are interacted and polished iteratively, complementing each other in the training process. We also contribute two public datasets composed of related work sections and their corresponding papers. Extensive experiments on the two datasets show that the proposed model brings substantial improvements over several strong baselines. We hope that this work will promote advances in related work generation task.
pdf
bib
abs
Data-Efficient Language Shaped Few-shot Image Classification
Zhenwen Liang
|
Xiangliang Zhang
Findings of the Association for Computational Linguistics: EMNLP 2021
Many existing works have demonstrated that language is a helpful guider for image understanding by neural networks. We focus on a language-shaped learning problem in a few-shot setting, i.e., using language to improve few-shot image classification when language descriptions are only available during training. We propose a data-efficient method that can make the best usage of the few-shot images and the language available only in training. Experimental results on dataset ShapeWorld and Birds show that our method outperforms other state-of-the-art baselines in language-shaped few-shot learning area, especially when training data is more severely limited. Therefore, we call our approach data-efficient language-shaped learning (DF-LSL).