Nguyen Thi Ngoc Diep


2026

Cross-lingual topic modeling aims to discover shared semantic structures across languages, yet existing models depend on sparse bilingual resources and often yield incoherent or weakly aligned topics. Recent LLM-based refinements improve interpretability but are costly, document-level, and prone to hallucination, with prior white-box approaches requiring inaccessible token probabilities. We propose LLM-XTM, a framework that integrates LLM-guided topic refinement with self-consistency uncertainty quantification, enabling black-box, stable, and scalable enhancement of cross-lingual topic models. Experiments on multilingual corpora show that LLM-XTM achieves superior topic coherence and alignment while reducing reliance on bilingual dictionaries and expensive LLM calls.
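As a rough illustration of black-box self-consistency, the sketch below scores agreement among several sampled LLM outputs and accepts a refinement only when that agreement is high. The abstract does not specify how LLM-XTM computes its uncertainty score, so the function names, the majority-share measure, and the 0.6 threshold are all assumptions made for this example. The samples are assumed to come from repeated calls to an LLM, which are not shown.

```python
from collections import Counter


def self_consistency(samples):
    """Return (majority_answer, agreement) for a list of sampled answers.

    Agreement is the share of samples that match the majority answer after
    simple normalization. This is an illustrative measure, not the paper's.
    """
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(s.strip().lower() for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


def accept_refinement(samples, threshold=0.6):
    """Keep the majority answer only if agreement reaches the threshold.

    The threshold value is a hypothetical default chosen for the example.
    """
    answer, agreement = self_consistency(samples)
    return answer if agreement >= threshold else None


# Hypothetical sampled suggestions for one topic word.
suggestions = ["economy", "Economy ", "sports", "economy"]
label, agreement = self_consistency(suggestions)
```

Because the score uses only the sampled text, it needs no access to token probabilities, which is the property the abstract refers to as black-box.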
Cross-Tokenizer Knowledge Distillation (CTKD) enables knowledge transfer between a large language model and a smaller student, even when they employ different tokenizers. While existing approaches mainly focus on token-level alignment strategies, which are often brittle and sensitive to discrepancies between tokenizers, we argue that the method of aggregating tokens into more robust representations before distillation is of equal importance. In this paper, we introduce SRA (Span Representation Alignment for Large Language Model Distillation), a novel framework that reframes CTKD through the physical lens of Multi-Particle Dynamical Systems. SRA shifts the fundamental unit of alignment from tokens to robust, tokenizer-agnostic spans. We model each span as a cluster of particles and represent its state by its Center of Mass (CoM), an attention-weighted average that captures rich semantic information. We leverage the concept of span centers of mass with attention-derived weighting to prioritize the most salient spans. In addition, we employ a geometric regularizer to preserve the structural integrity of the representation space and introduce aligned span logit distillation to enhance knowledge transfer across models. In challenging cross-architecture distillation experiments, SRA consistently and significantly outperforms state-of-the-art CTKD baselines, validating our physically grounded approach.
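The Center of Mass described above is an attention-weighted average of the token vectors in a span. A minimal sketch of that computation follows. The function name, the half-open span convention, and the uniform fallback for zero weights are assumptions for illustration, and the abstract does not say how the attention weights are derived.

```python
def span_center_of_mass(hidden_states, attention_weights, span):
    """Attention-weighted average of token vectors in the span [start, end).

    hidden_states: list of per-token vectors (lists of floats).
    attention_weights: one non-negative weight per token (assumed given).
    """
    start, end = span
    vectors = hidden_states[start:end]
    weights = list(attention_weights[start:end])
    if not vectors:
        raise ValueError("span selects no tokens")
    total = sum(weights)
    if total <= 0:
        # Assumption: fall back to a plain mean when all weights are zero.
        weights = [1.0] * len(vectors)
        total = float(len(vectors))
    dim = len(vectors[0])
    return [
        sum(w * v[d] for w, v in zip(weights, vectors)) / total
        for d in range(dim)
    ]


# Toy example: three tokens with 2-dimensional hidden states.
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
attn = [0.1, 0.3, 0.6]
com = span_center_of_mass(hidden, attn, (0, 2))  # span covers tokens 0 and 1
```

Since the result depends only on which tokens fall inside the span, teacher and student spans covering the same text can be compared even when their tokenizers split that text differently.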
Knowledge Distillation (KD) has established itself as a pivotal technique for compressing large pre-trained language models. However, existing methods that force a student to strictly mimic the teacher’s sentence embeddings or internal features often incur prohibitive computational costs and yield suboptimal performance due to the inherent capacity gap. To address these challenges, we propose TALAS (Teacher-Anchored Layer Alignment with Sharpness-aware minimization), a unified framework that synergizes hierarchical (multi-layer) alignment with robust optimization. First, we introduce a Teacher-Anchored mechanism that selectively distills final sentence embeddings only into the student’s upper layers, thereby reducing overhead while respecting capacity constraints. Second, we bridge the semantic gap in lower layers via Layer-Aligned Self-Distillation, which propagates knowledge top-down using internal geometric relational constraints in the embedding space. Finally, to prevent the student from memorizing point-wise teacher noise, we integrate Adaptive Sharpness-Aware Minimization (ASAM) into the training objective, guiding the model towards flat minima for enhanced generalization. Empirical results on standard sentence embedding benchmarks demonstrate that TALAS consistently outperforms strong distillation baselines while achieving superior training efficiency in terms of computational cost and memory footprint.

2025

Data scarcity is a major challenge in Few-shot Continual Relation Extraction (FCRE), where models must learn new relations from limited data while retaining past knowledge. Current methods, restricted by minimal data streams, struggle with catastrophic forgetting and overfitting. To overcome this, we introduce a novel *data augmentation strategy* that transforms single input sentences into complex texts by integrating both old and new data. Our approach sharpens model focus, enabling precise identification of word relationships based on specified relation types. By embedding adversarial training effects and leveraging new training perspectives through special objective functions, our method enhances model performance significantly. Additionally, we explore Sharpness-Aware Minimization (SAM) in Few-shot Continual Learning. Our extensive experiments uncover fascinating behaviors of SAM across tasks and offer valuable insights for future research in this dynamic field.
Few-shot Continual Relation Extraction (FCRE) has emerged as a significant challenge in information extraction, necessitating that relation extraction (RE) systems can sequentially identify new relations with limited labeled samples. While existing studies have demonstrated promising results in FCRE, they often overlook the issue of similar relations, which is a critical factor contributing to catastrophic forgetting. In this work, we propose Sirus, a novel method that utilizes relation descriptions and dynamic clustering on these descriptions to identify similar relations. Leveraging this information, we introduce innovative loss functions specifically designed to enhance the distinction between relations, with a focus on learning to differentiate similar ones. Experimental results show that our approach can effectively mitigate the problem of catastrophic forgetting and outperforms state-of-the-art methods by a large margin. Additionally, we explore the potential of Large Language Model Embeddings (LLMEs) with representation learning and embedding capabilities, demonstrating their promise for advancing FCRE systems.
Existing toxic content detection models face significant limitations, such as lack of transparency, customization, and reproducibility. These challenges stem from the closed-source nature of their training data and the paucity of explanations for their evaluation mechanism. To address these issues, we propose a dataset creation mechanism that integrates voting and chain-of-thought processes, producing a high-quality open-source dataset for toxic content detection. Our methodology ensures diverse classification metrics for each sample and includes both classification scores and explanatory reasoning for the classifications. We utilize the dataset created through our proposed mechanism to train our model, which is then compared against existing widely used detectors. Our approach not only enhances transparency and customizability but also facilitates better fine-tuning for specific use cases. This work contributes a robust framework for developing toxic content detection models, emphasizing openness and adaptability, thus paving the way for more effective and user-specific content moderation solutions.
Retrieval-Augmented Generation (RAG) enhances large language models by grounding their outputs in external knowledge. Recent advances in Graph-based RAG (GRAG) frameworks, such as GraphRAG, LightRAG, and HippoRAG2, integrate knowledge graphs into the retrieval process to improve multi-hop reasoning and semantic coherence. While effective in monolingual settings, these methods remain underexplored in cross-lingual scenarios and face limitations in semantic granularity and entity alignment. In this work, we propose MaGiX, the first GRAG framework tailored for English–Vietnamese cross-lingual question answering. MaGiX constructs a multi-granular cross-lingual knowledge graph using fine-grained attribute descriptions and cross-synonym edges, and incorporates a custom multilingual embedding model trained with contrastive learning for semantic alignment. During retrieval, MaGiX leverages graph-based reasoning and a semantic-aware reranking strategy to enhance cross-lingual relevance. Experiments across five benchmarks show that MaGiX substantially outperforms prior GRAG systems in both retrieval accuracy and generation quality, advancing structured retrieval for multilingual QA.
Knowledge distillation (KD) is crucial for compressing large text embedding models, but faces challenges when teacher and student models use different tokenizers (Cross-Tokenizer KD, or CTKD). Vocabulary mismatches impede the transfer of relational knowledge encoded in deep representations, such as hidden states and attention matrices, which are vital for producing high-quality embeddings. Existing CTKD methods often focus on direct output alignment, neglecting this crucial structural information. We propose a novel framework tailored for CTKD embedding model distillation. We first map tokens one-to-one via Minimum Edit Distance (MinED). Then, we distill intra-model relational knowledge by aligning attention matrix patterns using Centered Kernel Alignment, focusing on the top-m most important tokens of the directly mapped tokens. Simultaneously, we align final hidden states via Optimal Transport with Importance-Scored Mass Assignment, which emphasizes semantically important token representations, based on importance scores derived from attention weights. We evaluate distillation from state-of-the-art embedding models (e.g., LLM2Vec, BGE) to a BERT-base-uncased model on embedding-reliant tasks such as text classification, sentence pair classification, and semantic textual similarity. Our proposed framework significantly outperforms existing CTKD baselines. By preserving attention structure and prioritizing key representations, our approach yields smaller, high-fidelity embedding models despite tokenizer differences.
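The MinED mapping step can be pictured as matching each teacher token to the student token with the smallest edit distance. The sketch below is one simple way to do that, using Levenshtein distance and a nearest-match rule. The abstract does not give the exact mapping procedure, so `map_tokens` and its tie-breaking are assumptions, and this version does not enforce a strict one-to-one assignment.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def map_tokens(teacher_tokens, student_tokens):
    """Map each teacher token to its closest student token.

    Ties go to the earliest student token. Several teacher tokens may map
    to the same student token in this simplified version.
    """
    return {
        t: min(student_tokens, key=lambda s: edit_distance(t, s))
        for t in teacher_tokens
    }


# Hypothetical token strings from two differently styled tokenizers.
mapping = map_tokens(["▁hello", "world"], ["hello", "##world", "the"])
```

Edit distance tolerates tokenizer-specific markers such as `▁` or `##`, which is why surface-similar tokens from different vocabularies can still be paired.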
To address the phenomenon of similar classes, existing methods in few-shot continual relation extraction (FCRE) face two main challenges: non-representative prototypes and representation bias, especially when the number of available samples is limited. In our work, we propose Minion to address these challenges. Firstly, we leverage the General Orthogonal Frame (GOF) structure, based on the concept of Neural Collapse, to create robust class prototypes with clear separation, even between analogous classes. Secondly, we utilize label description representations as global class representatives within the fast-slow contrastive learning paradigm. These representations consistently encapsulate the essential attributes of each relation, acting as global information that helps mitigate overfitting and reduces representation bias caused by the limited local few-shot examples within a class. Extensive experiments on well-known FCRE benchmarks show that our method outperforms state-of-the-art approaches, demonstrating its effectiveness for advancing RE systems.