Qi Xu


2024

pdf
Through the MUD: A Multi-Defendant Charge Prediction Benchmark with Linked Crime Elements
Xiao Wei | Qi Xu | Hang Yu | Qian Liu | Erik Cambria
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The current charge prediction datasets mostly focus on single-defendant criminal cases.However, real-world criminal cases usually involve multiple defendants whose criminal facts are intertwined. In an early attempt to fill this gap, we introduce a new benchmark that encompasses legal cases involving multiple defendants, where each defendant is labeled with a charge and four types of crime elements, i.e., Object Element, Objective Element, Subject Element, and Subjective Element. Based on the dataset, we further develop an interpretable model called EJudge that incorporates crime elements and legal rules to infer charges. We observe that predicting crime charges while providing corresponding rationales benefits the interpretable AI system. Extensive experiments show that EJudge significantly surpasses state-of-the-art methods, which verify the importance of crime elements and legal rules in multi-defendant charge prediction. The source code and dataset are available at https://anonymous.4open.science/r/MCP_1-6010.

pdf
GroundingGPT: Language Enhanced Multi-modal Grounding Model
Zhaowei Li | Qi Xu | Dong Zhang | Hang Song | YiQing Cai | Qi Qi | Ran Zhou | Junting Pan | Zefeng Li | Vu Tu | Zhida Huang | Tao Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-modal large language models (MLLMs) have demonstrated remarkable performance across various tasks. However, these models often prioritize capturing global information and overlook the importance of perceiving local information. This limitation hinders their ability to effectively understand fine-grained details and handle grounding tasks that necessitate nuanced comprehension. Although some recent works have made strides in this, they have primarily focused on single-modality inputs. Therefore, we propose GroundingGPT, an end-to-end language enhanced multi-modal grounding model. It is designed to perform fine-grained grounding tasks for three modalities: image, video and audio. To enhance the model’s performance, we adopt a coarse-to-fine training strategy, utilizing a three-stage training approach to progressively enhance the model’s semantic awareness and fine-grained understanding capabilities. Additionally, we employ a diversified stage-specific dataset construction pipeline, developing a multi-modal, multi-granularity dataset tailored for training the model in different stages. Extensive experiments conducted on multiple multi-modal benchmarks demonstrate that our model achieves impressive fine-grained understanding of multi-modal inputs on grounding tasks while maintaining or improving its global comprehension capabilities. Our code, model, and dataset are available at https://github.com/lzw-lzw/GroundingGPT.