Chen Zhu

2025

pdf bib abs
KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning over Knowledge Graph
Jinhao Jiang | Kun Zhou | Xin Zhao | Yang Song | Chen Zhu | Hengshu Zhu | Ji-Rong Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we aim to improve the reasoning ability of large language models(LLMs) over knowledge graphs(KGs) to answer complex questions. Inspired by existing methods that design the interaction strategy between LLMs and KG, we propose an autonomous LLM-based agent framework, called KG-Agent, which enables a small LLM to actively make decisions until finishing the reasoning process over KGs. In KG-Agent, we integrate the LLM, multifunctional toolbox, KG-based executor, and knowledge memory, and develop an iteration mechanism that autonomously selects the tool and then updates the memory for reasoning over KG. To guarantee the effectiveness, we leverage program language to formulate the multi-hop reasoning process over the KG and synthesize a code-based instruction dataset to fine-tune the base LLM. Extensive experiments demonstrate that only using 10K samples for tuning LLaMA2-7B can outperform competitive methods using larger LLMs or more data, on both in-domain and out-domain datasets. Our code and data will be publicly released.

Text-to-audio (T2A) generation has achieved remarkable progress in generating a variety of audio outputs from language prompts. However, current state-of-the-art T2A models still struggle to satisfy human preferences for prompt-following and acoustic quality when generating complex multi-event audio. To improve the performance of the model in these high-level applications, we propose to enhance the basic capabilities of the model with AI feedback learning. First, we introduce fine-grained AI audio scoring pipelines to: 1) verify whether each event in the text prompt is present in the audio (Event Occurrence Score), 2) detect deviations in event sequences from the language description (Event Sequence Score), and 3) assess the overall acoustic and harmonic quality of the generated audio (Acoustic&Harmonic Quality). We evaluate these three automatic scoring pipelines and find that they correlate significantly better with human preferences than other evaluation metrics. This highlights their value as both feedback signals and evaluation metrics. Utilizing our robust scoring pipelines, we construct a large audio preference dataset, T2A-FeedBack, which contains 41k prompts and 249k audios, each accompanied by detailed scores. Moreover, we introduce T2A-EpicBench, a benchmark that focuses on long captions, multi-events, and story-telling scenarios, aiming to evaluate the advanced capabilities of T2A models. Finally, we demonstrate how T2A-FeedBack can enhance current state-of-the-art audio model. With simple preference tuning, the audio generation model exhibits significant improvements in both simple (AudioCaps test set) and complex (T2A-EpicBench) scenarios.

2024

Multimodal Named Entity Recognition and Grounding (MNERG) aims to extract paired textual and visual entities from texts and images. It has been well explored through a two-step paradigm: initially identifying potential visual entities using object detection methods and then aligning the extracted textual entities with their corresponding visual entities. However, when it comes to fine-grained MNERG, the long-tailed distribution of textual entity categories and the performance of object detectors limit the effectiveness of traditional methods. Specifically, more detailed classification leads to many low-frequency categories, and existing object detection methods often fail to pinpoint subtle regions within images. To address these challenges, we propose the Granular Entity Mapper (GEM) framework. Firstly, we design a multi-granularity entity recognition module, followed by a reranking module based on the Multimodal Large Language Model (MLLM) to incorporate hierarchical information of entity categories, visual cues, and external textual resources collectively for accurate fine-grained textual entity recognition. Then, we utilize a pre-trained Large Visual Language Model (LVLM) as an implicit visual entity grounder that directly deduces relevant visual entity regions from the entire image without the need for bounding box training. Experimental results on the GMNER and FMNERG datasets demonstrate that our GEM framework achieves state-of-the-art results on the fine-grained content extraction task.

pdf bib abs
From Technology to Market. Bilingual Corpus on the Evaluation of Technology Opportunity Discovery
Amir Hazem | Kazuyuki Motohashi | Chen Zhu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

As companies aim to enhance and expand their product portfolios, Technology Opportunity Discovery (TOD) has gained increasing interest. To comprehend the role of emerging technologies in innovation, we introduce a novel technology-market corpus in English and Japanese languages, and conduct a comprehensive empirical evaluation of the linkage between technology and the market. Our dataset comprises English patents extracted from the USPTO database and Japanese patents from the Japanese Patent Office (JPO), along with their associated products for each stock market company. We compare several static and contextualized word embedding methods to construct a technology-market space and propose an effective methodology based on a fine-tuned BERT model for linking technology to the market.

pdf bib abs
MRT: Multi-modal Short- and Long-range Temporal Convolutional Network for Time-sync Comment Video Behavior Prediction
Weihao Zhao | Weidong He | Hao Wang | Haoyang Bi | Han Wu | Chen Zhu | Tong Xu | Enhong Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

As a fresh way to improve the user viewing experience, videos of time-sync comments have attracted a lot of interest. Many efforts have been made to explore the effectiveness of time-sync comments for various applications. However, due to the complexity of interactions among users, videos, and comments, it still remains challenging to understand users’ behavior on time-sync comments. Along this line, we study the problem of time-sync comment behavior prediction with considerations of both historical behaviors and multi-modal information of visual frames and textual comments. Specifically, we propose a novel Multi-modal short- and long-Range Temporal Convolutional Network model, namely MRT. Firstly, we design two amplified Temporal Convolutional Networks with different sizes of receptive fields, to capture both short- and long-range surrounding contexts for each frame and time-sync comments. Then, we design a bottle-neck fusion module to obtain the multi-modal enhanced representation. Furthermore, we take the user preferences into consideration to generate the personalized multi-model semantic representation at each timestamp. Finally, we utilize the binary cross-entropy loss to optimize MRT on the basis of users’ historical records. Through comparing with representative baselines, we demonstrate the effectiveness of MRT and qualitatively verify the necessity and utility of short- and long-range contextual and multi-modal information through extensive experiments.

2023

pdf bib abs
How Many Demonstrations Do You Need for In-context Learning?
Jiuhai Chen | Lichang Chen | Chen Zhu | Tianyi Zhou
Findings of the Association for Computational Linguistics: EMNLP 2023

Large language models (LLMs) are capable to perform complex reasoning by in-context learning (ICL) when provided with a few input-output demonstrations (demos) and more powerful when intermediate reasoning steps (chain of thoughts (CoT)) of the demos are given. Is it necessary to use multi-demo in ICL? In this paper, we study ICL using fewer demos for each test query on the tasks in (Wei et al., 2022). Surprisingly, we do not observe significant degradation when using only one randomly chosen demo. To study this phenomenon, for each test query, we categorize demos into “positive demos” leading to the correct answer, and “negative demos” resulting in wrong answers. Our analysis reveals an inherent bias in those widely studied datasets and the redundancy of demos: most demos are positive for a majority of test queries, which explains the good performance of ICL with one random demo. Moreover, ICL (with and w/o CoT) using only one positive demo significantly outperforms multi-demo ICL adopted by most previous works, indicating the weakness of LLMs in finding positive demo(s) for input queries, which is difficult to evaluate on the biased datasets. Furthermore, we observe a counterintuitive behavior of ICL using multi-demo, i.e., its accuracy degrades(improves) when given more positive(negative) demos. This implies that ICL can be easily misguided by interference among demos and their spurious correlations. Our analyses highlight several fundamental challenges that need to be addressed in LLMs training, ICL, and benchmark design.