This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
Sentence representation learning (SRL) aims to learn sentence embeddings that conform to the semantic information of sentences. In recent years, fine-tuning methods based on pre-trained models and contrastive learning frameworks have significantly advanced the quality of sentence representations. However, within the semantic space of SRL models, both word embeddings and sentence representations derived from word embeddings exhibit substantial redundant information, which can adversely affect the precision of sentence representations. Existing approaches predominantly optimize training strategies to alleviate the redundancy problem, lacking fine-grained guidance on reducing redundant representations. This paper proposes a novel approach that dynamically identifies and reduces redundant information from a dimensional perspective, training the SRL model to redistribute semantics on different dimensions, and entailing better sentence representations. Extensive experiments across seven semantic text similarity benchmarks demonstrate the effectiveness and generality of the proposed method. A comprehensive analysis of the experimental results is conducted, and the code/data will be released.
Large language models (LLMs) demonstrate remarkable text generation and syntax parsing capabilities in high-resource languages. However, their performance notably declines in low-resource languages due to memory forgetting stemming from semantic interference across languages. To address this issue, we propose a novel deep hierarchical syntax understanding approach to improve the cross-lingual semantic memory capability of LLMs. First, we design a multi-task joint fine-tuning strategy to implicitly align linguistic knowledge between source and target languages in LLMs, which is leveraged to initially parse the target text. Second, we automatically construct the multilingual dependency label banks based on the statistical structure information from the Universal Dependencies (UD) data. Third, we obtain each label’s memory strength via in-depth analysis of the initial parsing tree and its dependency label bank. Finally, memory strength is further exploited to guide LLMs to learn the linguistic commonalities from multilingual dependency label banks, thus activating the memory ability of weak labels. Experimental results on four benchmark datasets show that our method can dramatically improve the parsing accuracy of all baseline models, leading to new state-of-the-art results. Further analysis reveals that our approach can effectively enhance the weak syntactic label memory cognition of LLMs by combining the advantages of both implicit multi-task fine-tuning and explicit label bank guiding. Our code and dependency label banks are released at https://github.com/Flamelunar/memory_dep.
Generative Information Retrieval is an emerging retrieval paradigm that exhibits remarkable performance in monolingual scenarios. However, applying these methods to multilingual retrieval still encounters two primary challenges, cross-lingual identifier misalignment and identifier inflation. To address these limitations, we propose Multilingual Generative Retrieval via Cross-lingual Semantic Compression (MGR-CSC), a novel framework that unifies semantically equivalent multilingual keywords into shared atoms to align semantics and compresses the identifier space, and we propose a dynamic multi-step constrained decoding strategy during retrieval. MGR-CSC improves cross-lingual alignment by assigning consistent identifiers and enhances decoding efficiency by reducing redundancy. Experiments demonstrate that MGR-CSC achieves outstanding retrieval accuracy, improving by 6.83% on mMarco100k and 4.77% on mNQ320k, while reducing document identifiers length by 74.51% and 78.2%, respectively. We publicly release our dataset and code at https://github.com/simengggg/MGR-CSC
Large language models (LLMs) based Multilingual Knowledge Graph Completion (MKGC) aim to predict missing facts by leveraging LLMs’ multilingual understanding capabilities, improving the completeness of multilingual knowledge graphs (KGs).However, existing MKGC research underutilizes the multilingual capabilities of LLMs and ignores the shareability of cross-lingual knowledge.In this paper, we propose a novel MKGC framework that leverages multilingual shared knowledge to significantly enhance performance through two components: Knowledge-level Grouped Mixture of Experts (KL-GMoE) and Iterative Entity Reranking (IER).KL-GMoE efficiently models shared knowledge, while IER significantly enhances its utilization.To evaluate our framework, we constructed a mKG dataset containing 5 languages and conducted comprehensive comparative experiments with existing state-of-the-art (SOTA) MKGC method.The experimental results demonstrate that our framework achieves improvements of 5.47%, 3.27%, and 1.01% in the Hits@1, Hits@3, and Hits@10 metrics, respectively, compared with SOTA MKGC method.Further experimental analysis revealed the properties of knowledge sharing in settings of unseen and unbalanced languages.We have released the dataset and code for our work on https://github.com/gaoxiaofei07/KL-GMoE.
Existing research on news summarization primarily focuses on single-language single-document (SLSD), single-language multi-document (SLMD) or cross-language single-document (CLSD). However, in real-world scenarios, news about an international event often involves multiple documents in different languages, i.e., mixed-language multi-document (MLMD). Therefore, summarizing MLMD news is of great significance. However, the lack of datasets for MLMD news summarization has constrained the development of research in this area. To fill this gap, we construct a mixed-language multi-document news summarization dataset (MLMD-news), which contains four different languages and 10,992 source document cluster and target summary pairs. Additionally, we propose a graph-based extract-generate model and benchmark various methods on the MLMD-news dataset and publicly release our dataset and code, aiming to advance research in summarization within MLMD scenarios.
“Dialect speech recognition has always been one of the challenges in Automatic Speech Recog-nition (ASR) systems. While lots of ASR systems perform well in Mandarin, their performancesignificantly drops when handling dialect speech. This is mainly due to the obvious differencesbetween dialects and Mandarin in pronunciation and the data scarcity of dialect speech. In thispaper, we propose DialectMoE, a Chinese multi-dialects speech recognition model based onMixture-of-Experts (MoE) in a low-resource conditions. Specifically, DialectMoE assigns inputsequences to a set of experts using a dynamic routing algorithm, with each expert potentiallytrained for a specific dialect. Subsequently, the outputs of these experts are combined to derivethe final output. Due to the similarities among dialects, distinct experts may offer assistance inrecognizing other dialects as well. Experimental results on the Datatang dialect public datasetshow that, compared with the baseline model, DialectMoE reduces Character Error Rate (CER)for Sichuan, Yunnan, Hubei and Henan dialects by 23.6%, 32.6%, 39.2% and 35.09% respec-tively. The proposed DialectMoE model demonstrates outstanding performance in multi-dialectsspeech recognition.”
Large language models (LLMs) have demonstrated remarkable capabilities in comprehensively handling various types of natural language processing (NLP) tasks. However, there are significant differences in the knowledge and abilities required for different tasks. Therefore, it is important to understand whether the same LLM processes different tasks in the same way. Are there specific neurons in a LLM for different tasks? Inspired by neuroscience, this paper pioneers the exploration of whether distinct neurons are activated when a LLM handles different tasks. Compared with current research exploring the neurons of language and knowledge, task-specific neurons present a greater challenge due to their abstractness, diversity, and complexity. To address these challenges, this paper proposes a method for task-specific neuron localization based on Causal Gradient Variation with Special Tokens (CGVST). CGVST identifies task-specific neurons by concentrating on the most significant tokens during task processing, thereby eliminating redundant tokens and minimizing interference from non-essential neurons. Compared to traditional neuron localization methods, our approach can more effectively identify task-specific neurons. We conduct experiments across eight different public tasks. Experiments involving the inhibition and amplification of identified neurons demonstrate that our method can accurately locate task-specific neurons.
With the strong representational capabilities of pre-trained language models, dependency parsing in resource-rich languages has seen significant advancements. However, the parsing accuracy drops sharply when the model is transferred to low-resource language due to distribution shifts. To alleviate this issue, we propose a representation alignment and adversarial model to filter out useful knowledge from rich-resource language and ignore useless ones. Our proposed model consists of two components, i.e., an alignment network in the input layer for selecting useful language-specific features and an adversarial network in the encoder layer for augmenting the language-invariant contextualized features. Experiments on the benchmark datasets show that our proposed model outperforms RoBERTa-enhanced strong baseline models by 1.37 LAS and 1.34 UAS. Detailed analysis shows that both alignment and adversarial networks are equally important in alleviating the distribution shifts problem and can complement each other. In addition, the comparative experiments demonstrate that both the alignment and adversarial networks can substantially facilitate extracting and utilizing relevant target language features, thereby increasing the adaptation capability of our proposed model.
Existing accent transfer works rely on parallel data or speech recognition models. This paper focuses on the practical application of accent transfer and aims to implement accent transfer using non-parallel datasets. The study has encountered the challenge of speech representation disentanglement and modeling accents. In our accent modeling transfer framework, we manage to solve these problems by two proposed methods. First, we learn the suprasegmental information associated with tone to finely model the accents in terms of tone and rhythm. Second, we propose to use mutual information learning to disentangle the accent features and control the accent of the generated speech during the inference time. Experiments show that the proposed framework attains superior performance to the baseline models in terms of accentedness and audio quality.
Knowledge Graph Embedding (KGE) has been proposed and successfully utilized to knowledge Graph Completion (KGC). But classic KGE paradigm often fail in unseen relation representations. Previous studies mainly utilize the textual descriptions of relations and its neighbor relations to represent unseen relations. In fact, the semantics of a relation can be expressed by three kinds of graphs: factual graph, ontology graph, textual description graph, and they can complement each other. A more common scenario in the real world is that seen and unseen relations appear at the same time. In this setting, the training set (only seen relations) and testing set (both seen and unseen relations) own different distributions. And the train-test inconsistency problem will make KGE methods easiy overfit on seen relations and under-performance on unseen relations. In this paper, we propose decoupling mixture-of-graph experts (DMoG) for unseen relations learning, which could represent the unseen relations in the factual graph by fusing ontology and textual graphs, and decouple fusing space and reasoning space to alleviate overfitting for seen relations. The experiments on two unseen only public datasets and a mixture dataset verify the effectiveness of the proposed method, which improves the state-of-the-art methods by 6.84% in Hits@10 on average.
Change captioning is to use a natural language sentence to describe the fine-grained disagreement between two similar images. Viewpoint change is the most typical distractor in this task, because it changes the scale and location of the objects and overwhelms the representation of real change. In this paper, we propose a Relation-embedded Representation Reconstruction Network (Rˆ3Net) to explicitly distinguish the real change from the large amount of clutter and irrelevant changes. Specifically, a relation-embedded module is first devised to explore potential changed objects in the large amount of clutter. Then, based on the semantic similarities of corresponding locations in the two images, a representation reconstruction module (RRM) is designed to learn the reconstruction representation and further model the difference representation. Besides, we introduce a syntactic skeleton predictor (SSP) to enhance the semantic interaction between change localization and caption generation. Extensive experiments show that the proposed method achieves the state-of-the-art results on two public datasets.