This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
YuejieZhang
Fixing paper assignments
Please select all papers that do not belong to this person.
Indicate below which author they should be assigned to.
Randomized Controlled Trials (RCTs) are rigorous clinical studies crucial for reliable decision-making, but their credibility can be compromised by bias. The Cochrane Risk of Bias tool (RoB 2) assesses this risk, yet manual assessments are time-consuming and labor-intensive. Previous approaches have employed Large Language Models (LLMs) to automate this process. However, they typically focus on manually crafted prompts and a restricted set of simple questions, limiting their accuracy and generalizability. Inspired by the human bias assessment process, we propose RoBGuard, a novel framework for enhancing LLMs to assess the risk of bias in RCTs. Specifically, RoBGuard integrates medical knowledge-enhanced question reformulation, multimodal document parsing, and multi-expert collaboration to ensure both completeness and accuracy. Additionally, to address the lack of suitable datasets, we introduce two new datasets: RoB-Item and RoB-Domain. Experimental results demonstrate RoBGuard’s effectiveness on the RoB-Item dataset, outperforming existing methods.
Document-level relation extraction (DocRE) identifies relations between entities across an entire document. However, as the number and complexity of entities and entity-pair relations grow, the problem space expands quadratically, causing incomplete annotations and frequent false negatives, especially in biomedical datasets due to high construction costs. This leads to low recall in real-world scenarios. To address this, we propose GLiM, a novel framework that reduces the problem space using a graph-enhanced Transformer-based model and leverages large language models (LLMs) for reasoning. GLiM employs a cascaded approach: first, a graph-enhanced Transformer processes entity-pair relations with finer granularity by dynamically adjusting the graph size based on the number of entities; then, LLM inference handles challenging cases. Experiments show that GLiM boosts average recall and F1 scores by +6.34 and +4.41, respectively, outperforming state-of-the-art models on biomedical benchmarks. These results demonstrate the effectiveness of combining graph-enhanced Transformers with LLM inference for biomedical DocRE. Code will be released at https://github.com/HaoFang10/GLiM.
Role-playing agents (RPAs) powered by large language models (LLMs) have been widely utilized in dialogue systems for their capability to deliver personalized interactions. Current evaluations of RPAs mainly focus on personality fidelity, tone imitation, and knowledge consistency, while overlooking emotional fidelity, a key factor that affects user experience. To this end, we propose a benchmark called EmoCharacter to assess emotional fidelity of RPAs in dialogues. EmoCharacter includes two benchmark datasets (single-turn and multi-turn dialogues), three evaluation settings, and six metrics to measure the emotional fidelity between RPAs and the characters they portray. Based on EmoCharacter, we conduct extensive evaluations on RPAs powered by seven widely used LLMs with representative role-playing methods. Our empirical findings reveal that: (1) Contrary to intuition, current role-playing methods often reduce the emotional fidelity of LLMs in dialogues; (2) Enhancing the general capabilities of LLMs does not necessarily improve the emotional fidelity of RPAs; (3) Fine-tuning or In-Context Learning based on real dialogue data can enhance emotional fidelity.
With the Generative Pre-trained Transformer 3.5 (GPT-3.5) exhibiting remarkable reasoning and comprehension abilities in Natural Language Processing (NLP), most Question Answering (QA) research has primarily centered around general QA tasks based on GPT, neglecting the specific challenges posed by Complex Table QA. In this paper, we propose to incorporate GPT-3.5 to address such challenges, in which complex tables are reconstructed into tuples and specific prompt designs are employed for dialogues. Specifically, we encode each cell’s hierarchical structure, position information, and content as a tuple. By enhancing the prompt template with an explanatory description of the meaning of each tuple and the logical reasoning process of the task, we effectively improve the hierarchical structure awareness capability of GPT-3.5 to better parse the complex tables. Extensive experiments and results on Complex Table QA datasets, i.e., the open-domain dataset HiTAB and the aviation domain dataset AIT-QA show that our approach significantly outperforms previous work on both datasets, leading to state-of-the-art (SOTA) performance.
Generating paragraph captions for untrimmed videos without event annotations is challenging, especially when aiming to enhance precision and minimize repetition at the same time. To address this challenge, we propose a module called Sparse Frame Grouping (SFG). It dynamically groups event information with the help of action information for the entire video and excludes redundant frames within pre-defined clips. To enhance the performance, an Intra Contrastive Learning technique is designed to align the SFG module with the core event content in the paragraph, and an Inter Contrastive Learning technique is employed to learn action-guided context with reduced static noise simultaneously. Extensive experiments are conducted on two benchmark datasets (ActivityNet Captions and YouCook2). Results demonstrate that SFG outperforms the state-of-the-art methods on all metrics.
Neural network based models have achieved impressive results on the sentence classification task. However, most of previous work focuses on designing more sophisticated network or effective learning paradigms on monolingual data, which often suffers from insufficient discriminative knowledge for classification. In this paper, we investigate to improve sentence classification by multilingual data augmentation and consensus learning. Comparing to previous methods, our model can make use of multilingual data generated by machine translation and mine their language-share and language-specific knowledge for better representation and classification. We evaluate our model using English (i.e., source language) and Chinese (i.e., target language) data on several sentence classification tasks. Very positive classification performance can be achieved by our proposed model.