Nam LeHai
Also published as:
Nam Le Hai
When addressing the phenomenon of similar classes, existing methods in few-shot continual relation extraction (FCRE) face two main challenges: non-representative prototypes and representation bias, especially when the number of available samples is limited. In our work, we propose Minion to address these challenges. First, we leverage the General Orthogonal Frame (GOF) structure, based on the concept of Neural Collapse, to create robust class prototypes with clear separation, even between analogous classes. Second, we utilize label description representations as global class representatives within the fast-slow contrastive learning paradigm. These representations consistently encapsulate the essential attributes of each relation, acting as global information that helps mitigate overfitting and reduce the representation bias caused by the limited local few-shot examples within a class. Extensive experiments on well-known FCRE benchmarks show that our method outperforms state-of-the-art approaches, demonstrating its effectiveness for advancing RE systems.
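As a rough, illustrative sketch of the fixed orthogonal-prototype idea behind a GOF-style structure (not the authors' implementation), the code below builds mutually orthogonal unit-norm class prototypes and assigns features to the nearest one; all function and parameter names are placeholders.

```python
import numpy as np

def orthogonal_prototypes(num_classes: int, feat_dim: int, seed: int = 0) -> np.ndarray:
    """Build mutually orthogonal, unit-norm class prototypes (requires feat_dim >= num_classes)."""
    rng = np.random.default_rng(seed)
    # QR decomposition of a random matrix yields orthonormal columns.
    q, _ = np.linalg.qr(rng.standard_normal((feat_dim, num_classes)))
    return q.T  # shape (num_classes, feat_dim); rows are orthonormal

def assign_to_prototypes(features: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Assign each L2-normalized feature to the prototype with the highest cosine similarity."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    return np.argmax(feats @ prototypes.T, axis=1)

if __name__ == "__main__":
    protos = orthogonal_prototypes(num_classes=5, feat_dim=64)
    x = np.random.default_rng(1).standard_normal((3, 64))
    print(assign_to_prototypes(x, protos))  # prints an array of 3 class indices
```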
CodeLLMs are widely used for code generation, yet their ability to handle repository-level dependencies remains underexplored. We introduce RepoExec, a benchmark for evaluating repository-level code generation, focusing on executability, functional correctness, and dependency utilization. Our study evaluates 18 models, revealing that retaining full dependency context yields the best performance, while smaller context sizes can be misleading. Pretrained LLMs excel in correctness but often reimplement dependencies, while instruction-tuned models better utilize dependencies but sometimes introduce unnecessary complexity. We propose an instruction-tuning dataset that improves dependency handling and introduce a new metric, Dependency Invocation Rate (DIR), to measure context utilization. Experiments show that instruction-tuned models improve DIR by over 10%, and multi-round debugging further enhances both correctness and dependency use. RepoExec provides a comprehensive framework to advance CodeLLMs for real-world applications. The dataset and source code are available at https://github.com/FSoft-AI4Code/RepoExec.
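The Dependency Invocation Rate described above could be approximated as the fraction of provided dependency functions that the generated code actually calls. The sketch below does this with Python's ast module and is only an illustrative approximation, not the official RepoExec metric code; `dependency_names` is assumed to be supplied by the benchmark.

```python
import ast

def dependency_invocation_rate(generated_code: str, dependency_names: set[str]) -> float:
    """Fraction of provided dependency functions that the generated code calls."""
    try:
        tree = ast.parse(generated_code)
    except SyntaxError:
        return 0.0
    called = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                called.add(func.id)        # e.g. normalize(x)
            elif isinstance(func, ast.Attribute):
                called.add(func.attr)      # e.g. utils.normalize(x)
    if not dependency_names:
        return 0.0
    return len(called & dependency_names) / len(dependency_names)

if __name__ == "__main__":
    code = "def solve(x):\n    return normalize(x) + 1\n"
    print(dependency_invocation_rate(code, {"normalize", "load_config"}))  # 0.5
```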
Retrieval-Augmented Generation (RAG) enhances large language models by grounding their outputs in external knowledge. Recent advances in Graph-based RAG (GRAG) frameworks, such as GraphRAG, LightRAG, and HippoRAG2, integrate knowledge graphs into the retrieval process to improve multi-hop reasoning and semantic coherence. While effective in monolingual settings, these methods remain underexplored in cross-lingual scenarios and face limitations in semantic granularity and entity alignment. In this work, we propose MaGiX, the first GRAG framework tailored for English–Vietnamese cross-lingual question answering. MaGiX constructs a multi-granular cross-lingual knowledge graph using fine-grained attribute descriptions and cross-synonym edges, and incorporates a custom multilingual embedding model trained with contrastive learning for semantic alignment. During retrieval, MaGiX leverages graph-based reasoning and a semantic-aware reranking strategy to enhance cross-lingual relevance. Experiments across five benchmarks show that MaGiX substantially outperforms prior GRAG systems in both retrieval accuracy and generation quality, advancing structured retrieval for multilingual QA.
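To illustrate one ingredient, the sketch below shows how cross-synonym edges between English and Vietnamese entity nodes might be added based on cosine similarity in a shared multilingual embedding space; networkx, the threshold, and all names are illustrative assumptions, not MaGiX's actual graph construction.

```python
import numpy as np
import networkx as nx

def add_cross_synonym_edges(graph: nx.Graph,
                            en_entities: dict[str, np.ndarray],
                            vi_entities: dict[str, np.ndarray],
                            threshold: float = 0.8) -> nx.Graph:
    """Link English and Vietnamese entity nodes whose multilingual embeddings
    are close in cosine similarity (a stand-in for 'cross-synonym edges')."""
    for en_name, en_vec in en_entities.items():
        for vi_name, vi_vec in vi_entities.items():
            sim = float(np.dot(en_vec, vi_vec) /
                        (np.linalg.norm(en_vec) * np.linalg.norm(vi_vec)))
            if sim >= threshold:
                graph.add_edge(en_name, vi_name, kind="cross_synonym", weight=sim)
    return graph

# Toy example: two nearly identical embeddings stand in for a real multilingual encoder.
rng = np.random.default_rng(0)
v = rng.standard_normal(16)
g = add_cross_synonym_edges(nx.Graph(), {"Hanoi": v},
                            {"Hà Nội": v + 0.01 * rng.standard_normal(16)})
print(list(g.edges(data=True)))
```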
Few-shot Continual Relation Extraction (FCRE) has emerged as a significant challenge in information extraction, requiring relation extraction (RE) systems to sequentially identify new relations from limited labeled samples. While existing studies have demonstrated promising results in FCRE, they often overlook the issue of similar relations, which is a critical factor contributing to catastrophic forgetting. In this work, we propose Sirus, a novel method that utilizes relation descriptions and dynamic clustering on these descriptions to identify similar relations. Leveraging this information, we introduce innovative loss functions specifically designed to enhance the distinction between relations, with a focus on learning to differentiate similar ones. Experimental results show that our approach effectively mitigates the problem of catastrophic forgetting and outperforms state-of-the-art methods by a large margin. Additionally, we explore the potential of Large Language Model Embeddings (LLMEs) with representation learning and embedding capabilities, demonstrating their promise for advancing FCRE systems.
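One plausible reading of "dynamic clustering on relation descriptions" is to embed each relation's textual description, cluster the embeddings, and treat relations sharing a cluster as similar, so that a dedicated loss can push them apart. The scikit-learn sketch below illustrates that reading only; the embeddings and cluster count are assumptions rather than Sirus's actual design.

```python
import numpy as np
from sklearn.cluster import KMeans

def similar_relation_groups(description_embeddings: np.ndarray,
                            relation_names: list[str],
                            n_clusters: int) -> dict[int, list[str]]:
    """Group relations whose description embeddings fall in the same cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(
        description_embeddings)
    groups: dict[int, list[str]] = {}
    for name, label in zip(relation_names, labels):
        groups.setdefault(int(label), []).append(name)
    return groups

# Toy example: 6 relations whose (fake) description embeddings form two clusters.
emb = np.vstack([np.random.default_rng(i).standard_normal(32) + (5 if i < 3 else -5)
                 for i in range(6)])
print(similar_relation_groups(emb, [f"rel_{i}" for i in range(6)], n_clusters=2))
```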
Document retrieval plays a crucial role in numerous question-answering systems, yet research has concentrated on the general knowledge domain and resource-rich languages like English. In contrast, it remains largely underexplored for low-resource languages and cross-lingual scenarios within specialized domains such as the legal domain. We present a novel dataset designed for cross-lingual retrieval between Vietnamese and English, which not only covers the general domain but also extends to the legal field. Additionally, we propose an auxiliary loss function and a symmetrical training strategy that significantly enhance the performance of state-of-the-art models on these retrieval tasks. Our contributions offer a significant resource and methodology aimed at improving cross-lingual retrieval in both legal and general QA settings, facilitating further advancements in document retrieval research across multiple languages and a broader spectrum of specialized domains. All the resources related to our work can be accessed at huggingface.co/datasets/bkai-foundation-models/crosslingual.
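As a hedged illustration of what a symmetrical training strategy with an auxiliary loss could look like for cross-lingual retrieval (not the paper's exact formulation), the PyTorch sketch below scores both retrieval directions and adds a weighted auxiliary alignment term; every name and the choice of auxiliary term are assumptions.

```python
import torch
import torch.nn.functional as F

def retrieval_loss(query_emb, doc_emb, temperature=0.05):
    """In-batch cross-entropy where the i-th query matches the i-th document."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(doc_emb, dim=-1).T / temperature
    targets = torch.arange(sims.size(0), device=sims.device)
    return F.cross_entropy(sims, targets)

def symmetric_training_step(vi_query, en_doc, en_query, vi_doc, aux_weight=0.1):
    """Train both retrieval directions (vi->en and en->vi) plus an auxiliary term
    that pulls paired translated queries together (an assumed choice of auxiliary loss)."""
    main = retrieval_loss(vi_query, en_doc) + retrieval_loss(en_query, vi_doc)
    aux = F.mse_loss(F.normalize(vi_query, dim=-1), F.normalize(en_query, dim=-1))
    return main + aux_weight * aux

# Random tensors stand in for encoder outputs of parallel queries and documents.
b, d = 4, 128
loss = symmetric_training_step(torch.randn(b, d), torch.randn(b, d),
                               torch.randn(b, d), torch.randn(b, d))
print(loss.item())
```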
Few-shot Continual Relation Extraction (FCRE) is an emerging and dynamic area of study where models can sequentially integrate knowledge from new relations with limited labeled data while circumventing catastrophic forgetting and preserving prior knowledge from pre-trained backbones. In this work, we introduce a novel method that leverages often-discarded language model heads. By employing these components via a mutual information maximization strategy, our approach helps maintain prior knowledge from the pre-trained backbone and strategically aligns the primary classification head, thereby enhancing model performance. Furthermore, we explore the potential of Large Language Models (LLMs), renowned for their wealth of knowledge, in addressing FCRE challenges. Our comprehensive experimental results underscore the efficacy of the proposed method and offer valuable insights for future work.
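A common way to operationalize mutual information maximization between two views of the same example is an InfoNCE lower bound. The sketch below shows that generic estimator with placeholder inputs; it is one plausible reading of the strategy, not the paper's actual objective, and the two "views" here are random tensors standing in for the classification representation and a projection of the frozen language-model head's output.

```python
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(view_a: torch.Tensor, view_b: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE estimate used as a lower bound on the mutual information between
    two views of the same examples (rows are paired positives)."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    # Minimizing this cross-entropy maximizes the InfoNCE bound on MI.
    return F.cross_entropy(logits, targets)

# view_a could be the classification-head input, view_b a projection of the frozen
# LM head's output for the same sentence; random tensors stand in for both here.
print(infonce_mi_lower_bound(torch.randn(8, 768), torch.randn(8, 768)).item())
```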
Conversational search is a task that aims to retrieve documents given the user's current question together with the full history of the conversation. Most previous methods follow a multi-stage approach that relies on reformulating the question. This reformulation step is critical, as it can lead to suboptimal document ranking. Other approaches have tried to rank documents directly, but most rely on a dataset containing pseudo-labels. In this work, we propose a lightweight and novel training technique for a contextualized ranking model based on SPLADE. Building on SPLADE's sparse representations, we show that our model, when combined with the T5Mono re-ranking model, achieves results competitive with those obtained by participants in the TREC CAsT 2020 and 2021 evaluation campaigns. The source code is available at https://github.com/anonymous.
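For readers unfamiliar with SPLADE-style sparse representations, the sketch below computes the standard log(1 + ReLU(MLM logits)) term with max pooling over tokens and scores query-document pairs by dot product; it follows the published SPLADE formulation in spirit but is not the authors' code, and the random tensors stand in for a real masked-language-model head's output.

```python
import torch

def splade_representation(mlm_logits: torch.Tensor,
                          attention_mask: torch.Tensor) -> torch.Tensor:
    """SPLADE-style sparse vector: log(1 + ReLU(logits)), max-pooled over tokens.
    mlm_logits: (batch, seq_len, vocab_size); attention_mask: (batch, seq_len)."""
    weights = torch.log1p(torch.relu(mlm_logits))
    weights = weights * attention_mask.unsqueeze(-1)   # zero out padding positions
    return weights.max(dim=1).values                   # (batch, vocab_size)

def score(query_rep: torch.Tensor, doc_rep: torch.Tensor) -> torch.Tensor:
    """Relevance as a dot product between vocabulary-sized sparse vectors."""
    return (query_rep * doc_rep).sum(dim=-1)

# Random logits stand in for a masked-language-model head's output.
q = splade_representation(torch.randn(1, 6, 30522), torch.ones(1, 6))
d = splade_representation(torch.randn(1, 40, 30522), torch.ones(1, 40))
print(score(q, d).item())
```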