Chen Xing
2025
Generating Spatial Knowledge Graphs from Automotive Diagrams for Question Answering
Steve Bakos | Chen Xing | Heidar Davoudi | Aijun An | Ron DiCarlantonio
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Answering “Where is the X button?” with “It’s next to the Y button” is unhelpful if the user knows neither location. Useful answers require obvious landmarks as reference points. We address this by generating, from a vehicle dashboard diagram, a spatial knowledge graph (SKG) that captures the spatial relationships between each dashboard component and its nearby landmarks, and by using the SKG to help answer questions. We evaluate three distinct generation pipelines (Per-Attribute, Per-Component, and a Single-Shot baseline) to create the SKG using Large Vision-Language Models (LVLMs). On a new 65-vehicle dataset, we demonstrate that a decomposed Per-Component pipeline is the most effective strategy for generating a high-quality SKG; the graph produced by this method, when evaluated with a novel Significance score, identifies landmarks achieving 71.3% agreement with human annotators. This work enables downstream QA systems to provide more intuitive, landmark-based answers.
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
Kaustubh Deshpande | Ved Sirdeshmukh | Johannes Baptist Mols | Lifeng Jin | Ed-Yeremai Hernandez-Cardona | Dean Lee | Jeremy Kritz | Willow E. Primack | Summer Yue | Chen Xing
Findings of the Association for Computational Linguistics: ACL 2025
We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic among current human-LLM interactions, but are also challenging to all current frontier LLMs. All four challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop an LLM-as-judge with instance-level rubrics to facilitate an automatic evaluation method with fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (October 2024) achieving just a 41.4% average accuracy.
2024
FOFO: A Benchmark to Evaluate LLMs’ Format-Following Capability
Congying Xia | Chen Xing | Jiangshu Du | Xinyi Yang | Yihao Feng | Ran Xu | Wenpeng Yin | Caiming Xiong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper presents FoFo, a pioneering benchmark for evaluating large language models’ (LLMs) ability to follow complex, domain-specific formats, a crucial yet under-examined capability for their application as AI agents. Despite LLMs’ advancements, existing benchmarks fail to assess their format-following proficiency adequately. FoFo fills this gap with a diverse range of real-world formats and instructions, developed through an AI-Human collaborative method. Our evaluation across both open-source (e.g., Llama 2, WizardLM) and closed-source (e.g., GPT-4, PALM2, Gemini) LLMs highlights three key findings: open-source models significantly lag behind closed-source ones in format adherence; LLMs’ format-following performance is independent of their content generation quality; and LLMs’ format proficiency varies across different domains. These insights suggest the need for specialized tuning for format-following skills and highlight FoFo’s role in guiding the selection of domain-specific AI agents. FoFo will be publicly released, contributing a critical tool for advancing LLM evaluation and application.
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing
Jiangshu Du | Yibo Wang | Wenting Zhao | Zhongfen Deng | Shuaiqi Liu | Renze Lou | Henry Peng Zou | Pranav Narayanan Venkit | Nan Zhang | Mukund Srinath | Haoran Ranran Zhang | Vipul Gupta | Yinghui Li | Tao Li | Fei Wang | Qin Liu | Tianlin Liu | Pengzhi Gao | Congying Xia | Chen Xing | Cheng Jiayang | Zhaowei Wang | Ying Su | Raj Sanjay Shah | Ruohao Guo | Jing Gu | Haoran Li | Kangda Wei | Zihao Wang | Lu Cheng | Surangika Ranathunga | Meng Fang | Jie Fu | Fei Liu | Ruihong Huang | Eduardo Blanco | Yixin Cao | Rui Zhang | Philip S. Yu | Wenpeng Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Claim: This work is not advocating the use of LLMs for paper (meta-)reviewing. Instead, we present a comparative analysis to identify and distinguish LLM activities from human activities. Two research goals: i) Enable better recognition of instances when someone implicitly uses LLMs for reviewing activities; ii) Increase community awareness that LLMs, and AI in general, are currently inadequate for performing tasks that require a high level of expertise and nuanced judgment.
This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as they have to spend more time reading, writing, and reviewing papers. This raises the question: how can LLMs potentially assist researchers in alleviating their heavy workload?
This study focuses on the topic of LLMs as NLP Researchers, particularly examining the effectiveness of LLMs in assisting paper (meta-)reviewing and its recognizability. To address this, we constructed the ReviewCritique dataset, which includes two types of information: (i) NLP papers (initial submissions rather than camera-ready) with both human-written and LLM-generated reviews, and (ii) “deficiency” labels and corresponding explanations for individual segments of each review, annotated by experts. Using ReviewCritique, this study explores two threads of research questions: (i) “LLMs as Reviewers”: how do reviews generated by LLMs compare with those written by humans in terms of quality and distinguishability? (ii) “LLMs as Metareviewers”: how effectively can LLMs identify potential issues, such as deficient or unprofessional review segments, within individual paper reviews? To our knowledge, this is the first work to provide such a comprehensive analysis.
2023
Improving Gender Fairness of Pre-Trained Language Models without Catastrophic Forgetting
Zahra Fatemi | Chen Xing | Wenhao Liu | Caimming Xiong
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Existing studies addressing gender bias of pre-trained language models usually build a small gender-neutral data set and conduct a second phase of pre-training on the model with such data. However, given the limited size and concentrated focus of the gender-neutral data, catastrophic forgetting would occur during second-phase pre-training. Forgetting information in the original training data may damage the model’s downstream performance by a large margin. In this work, we empirically show that catastrophic forgetting occurs in such methods by evaluating them with general NLP tasks in GLUE. Then, we propose a new method, GEnder Equality Prompt (GEEP), to improve gender fairness of pre-trained models with less forgetting. GEEP freezes the pre-trained model and learns gender-related prompts with gender-neutral data. Empirical results show that GEEP not only achieves SOTA performance on gender fairness tasks, but also forgets less and performs better on GLUE by a large margin.
2022
DocQueryNet: Value Retrieval with Arbitrary Queries for Form-like Documents
Mingfei Gao | Le Xue | Chetan Ramaiah | Chen Xing | Ran Xu | Caiming Xiong
Proceedings of the 29th International Conference on Computational Linguistics
We propose DocQueryNet, a value retrieval method with arbitrary queries for form-like documents, to reduce the human effort of processing forms. Unlike previous methods that only address a fixed set of field items, our method predicts the target value for an arbitrary query based on an understanding of the layout and semantics of a form. To further boost model performance, we propose a simple document language modeling (SimpleDLM) strategy to improve document understanding in large-scale model pre-training. Experimental results show that DocQueryNet outperforms previous designs significantly, and that SimpleDLM further improves our performance on value retrieval by around 17% F1 score compared with the state-of-the-art pre-training method. Code is available at https://github.com/salesforce/QVR-SimpleDLM.
Co-authors
- Zhoujun Li 3
- Yu Wu 3
- Wei Wu 3
- Ming Zhou 3
- Jiangshu Du 2
- Congying Xia 2
- Caiming Xiong 2
- Ran Xu 2
- Wenpeng Yin 2
- Aijun An 1
- Steve Bakos 1
- Eduardo Blanco 1
- Yixin Cao 1
- Lu Cheng 1
- Heidar Davoudi 1
- Zhongfen Deng 1
- Kaustubh Deshpande 1
- Ron DiCarlantonio 1
- Meng Fang 1
- Zahra Fatemi 1
- Yihao Feng 1
- Jie Fu 1
- Mingfei Gao 1
- Pengzhi Gao 1
- Jing Gu 1
- Ruohao Guo 1
- Vipul Gupta 1
- Ed-Yeremai Hernandez-Cardona 1
- Ruihong Huang 1
- Cheng Jiayang 1
- Lifeng Jin 1
- Jeremy Kritz 1
- Dean Lee 1
- Yinghui Li 1
- Tao Li 1
- Haoran Li 1
- Chaozhuo Li 1
- Wenhao Liu 1
- Shuaiqi Liu 1
- Qin Liu 1
- Tianlin Liu 1
- Fei Liu 1
- Renze Lou 1
- Johannes Baptist Mols 1
- Pranav Narayanan Venkit 1
- Willow E. Primack 1
- Chetan Ramaiah 1
- Surangika Ranathunga 1
- Raj Sanjay Shah 1
- Ved Sirdeshmukh 1
- Mukund Srinath 1
- Ying Su 1
- Yibo Wang 1
- Fei Wang 1
- Zhaowei Wang 1
- Zihao Wang 1
- Kangda Wei 1
- Caimming Xiong 1
- Can Xu 1
- Le Xue 1
- Xinyi Yang 1
- Philip S. Yu 1
- Summer Yue 1
- Nan Zhang 1
- Haoran Ranran Zhang 1
- Rui Zhang 1
- Wenting Zhao 1
- Henry Peng Zou 1