Recent studies have explored Continual Instruction Tuning (CIT) in Multimodal Large Language Models (MLLMs), with a primary focus on Task-incremental CIT, where MLLMs are required to continually acquire new tasks. However, the more practical and challenging Domain-incremental CIT, which focuses on the continual adaptation of MLLMs to new domains, remains underexplored. In this paper, we propose a new Sparse Mixture of Experts (SMoE) based method for domain-incremental CIT in MLLMs. During training, we learn a domain-specific SMoE module for each new domain in every FFN sub-layer of MLLMs, preventing the catastrophic forgetting caused by inter-domain conflicts. Moreover, we equip each SMoE module with a domain-specific autoregressive loss (DSAL), which is used to identify the most suitable SMoE module for processing each test instruction during inference. To further enhance the SMoE module’s ability to learn domain knowledge, we design an adaptive threshold-based router (AT-Router) that allocates computing resources (experts) to instruction tokens based on their importance. Finally, we establish a new benchmark to evaluate the efficacy of our method and advance future research. Extensive experiments show that our method consistently outperforms all competitive baselines.
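To make the routing mechanism concrete, below is a minimal PyTorch sketch of an adaptive threshold-based router in the spirit of the AT-Router described above; the class name, the fixed threshold, and the cumulative-probability activation rule are illustrative assumptions, since the abstract does not specify the exact formulation.

    # Hypothetical sketch of an adaptive threshold-based router: each token
    # keeps the smallest set of experts whose cumulative routing probability
    # reaches the threshold, so harder-to-route tokens activate more experts.
    import torch
    import torch.nn as nn

    class ATRouter(nn.Module):
        def __init__(self, d_model, n_experts, threshold=0.5):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)
            self.threshold = threshold  # assumed fixed; could also be learned

        def forward(self, x):                      # x: (n_tokens, d_model)
            probs = self.gate(x).softmax(dim=-1)   # (n_tokens, n_experts)
            sorted_p, idx = probs.sort(dim=-1, descending=True)
            cum = sorted_p.cumsum(dim=-1)
            # keep an expert if the probability mass before it is still below
            # the threshold; the top-1 expert is therefore always kept
            keep = (cum - sorted_p) < self.threshold
            mask = torch.zeros_like(probs).scatter(-1, idx, keep.float())
            weights = probs * mask
            return weights / weights.sum(dim=-1, keepdim=True)

    router = ATRouter(d_model=16, n_experts=4)
    weights = router(torch.randn(3, 16))  # per-token expert weights, variable sparsity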
Few-shot relation extraction (FSRE) aims to enhance the model’s generalization to new relations with very few labeled instances (support instances). Most existing studies use prototype networks (ProtoNets) for FSRE and assume that the support set, which adapts the model to new relations, contains only accurately labeled instances. However, this assumption is usually unrealistic, as even carefully annotated datasets often contain mislabeled instances. Thus, it is essential to enhance the robustness of FSRE models to noisy labels in the support set, but this issue remains unexplored. In this paper, we first conduct a preliminary study, revealing the high sensitivity of ProtoNets to such noisy labels. Meanwhile, we discover that fully leveraging mislabeled support instances is crucial for enhancing the model’s robustness. To this end, we propose a self-denoising model for FSRE, which can automatically correct noisy labels of support instances. Specifically, our model comprises two core components: 1) a label correction module (LCM), used to correct mislabeled support instances based on the distances between them in the embedding space, and 2) a relation classification module (RCM), designed to achieve more robust relation prediction using the corrected labels generated by the LCM. Moreover, we propose a feedback-based training strategy that trains the LCM and RCM to synergistically handle noisy labels in the support set. Experimental results on two public datasets show the effectiveness and robustness of our model. Notably, even in scenarios without noisy labels, our model significantly outperforms all competitive baselines.
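As an illustration of distance-based label correction, here is a minimal sketch of the idea behind the LCM under simplifying assumptions (the function name is illustrative, and the paper’s actual module is learned; this sketch simply relabels each support instance to its nearest class prototype in embedding space):

    # Hypothetical simplification of distance-based label correction:
    # relabel each support instance to its nearest class prototype.
    import torch

    def correct_labels(emb, labels, n_classes):
        """emb: (n, d) support embeddings; labels: (n,) possibly noisy labels.
        Assumes every class has at least one support instance."""
        protos = torch.stack([emb[labels == c].mean(dim=0)
                              for c in range(n_classes)])  # (n_classes, d)
        dists = torch.cdist(emb, protos)                   # (n, n_classes)
        return dists.argmin(dim=-1)                        # corrected labels

    emb = torch.randn(10, 32)
    noisy = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
    corrected = correct_labels(emb, noisy, n_classes=2)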
Speech Relation Extraction (SpeechRE) aims to extract relation triplets from speech data. However, existing studies usually use synthetic speech to train and evaluate SpeechRE models, hindering the further development of SpeechRE due to the disparity between synthetic and real speech. Meanwhile, the modality gap issue, unexplored in SpeechRE, limits the performance of existing models. In this paper, we construct two real SpeechRE datasets to facilitate subsequent research and propose a Multi-level Cross-modal Alignment Model (MCAM) for SpeechRE. Our model consists of three components: 1) a speech encoder, extracting speech features from the input speech; 2) an alignment adapter, mapping these speech features into a semantic space suitable for the text decoder; and 3) a text decoder, autoregressively generating relation triplets based on the speech features. During training, we first introduce an additional text encoder to serve as a semantic bridge between the speech encoder and the text decoder, and then train the alignment adapter to align the output features of the speech and text encoders at multiple levels. In this way, we can effectively train the alignment adapter to bridge the modality gap between the speech encoder and the text decoder. Experimental results and in-depth analysis on our datasets strongly demonstrate the efficacy of our method.
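To illustrate what multi-level alignment can look like, here is a minimal sketch of an alignment loss under assumed instantiations (the abstract does not specify the exact levels or loss functions; token-level and sequence-level MSE are placeholders, and the speech and text feature sequences are assumed length-aligned for simplicity):

    # Assumed instantiation of multi-level feature alignment: pull the
    # adapter's speech features toward the text encoder's features at
    # both the token level and the pooled sequence level.
    import torch
    import torch.nn.functional as F

    def alignment_loss(speech_feats, text_feats):
        """speech_feats: adapter output (T, d); text_feats: text encoder
        output (T, d), assumed length-aligned here for simplicity."""
        token_loss = F.mse_loss(speech_feats, text_feats)                # fine-grained
        seq_loss = F.mse_loss(speech_feats.mean(0), text_feats.mean(0))  # global
        return token_loss + seq_loss

    loss = alignment_loss(torch.randn(20, 64), torch.randn(20, 64))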
k-Nearest-Neighbor Machine Translation (kNN-MT) has become an important research direction in NMT in recent years. Its main idea is to retrieve useful key-value pairs from an additional datastore to modify translations without updating the NMT model. However, noisy retrieved pairs can dramatically deteriorate the model’s performance. In this paper, we conduct a preliminary study and find that this problem results from not fully exploiting the predictions of the NMT model. To alleviate the impact of noise, we propose a confidence-enhanced kNN-MT model with robust training. Concretely, we introduce NMT confidence to refine the modeling of two important components of kNN-MT: the kNN distribution and the interpolation weight. Meanwhile, we inject two types of perturbations into the retrieved pairs for robust training. Experimental results on four benchmark datasets demonstrate that our model not only achieves significant improvements over current kNN-MT models but also exhibits better robustness. Our code is available at https://github.com/DeepLearnXMU/Robust-knn-mt.
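As a concrete illustration of confidence-enhanced interpolation, here is a minimal sketch (the exact parameterization is not given in the abstract; using the NMT model’s top-token probability as its confidence is an assumption):

    # Hypothetical confidence-based interpolation for kNN-MT: when the
    # NMT model is confident in its own prediction, the kNN distribution
    # retrieved from the datastore is weighted less.
    import torch

    def interpolate(p_nmt, p_knn):
        """p_nmt, p_knn: (vocab_size,) probability distributions."""
        confidence = p_nmt.max()   # NMT confidence in its top token
        lam = 1.0 - confidence     # interpolation weight for the kNN distribution
        return lam * p_knn + (1.0 - lam) * p_nmt

    p = interpolate(torch.tensor([0.7, 0.2, 0.1]), torch.tensor([0.1, 0.8, 0.1]))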
We present ClidSum, a benchmark dataset for building cross-lingual summarization systems on dialogue documents. It consists of 67k+ dialogue documents and 112k+ annotated summaries in different target languages. Based on the proposed ClidSum, we introduce two benchmark settings for supervised and semi-supervised scenarios, respectively. We then build various baseline systems in different paradigms (pipeline and end-to-end) and conduct extensive experiments on ClidSum to provide deeper analyses. Furthermore, we propose mDialBART, which extends mBART via further pre-training, where multiple objectives help the pre-trained model capture the structural characteristics and key content of dialogues as well as the transformation from the source to the target language. Experimental results show that mDialBART, as an end-to-end model, outperforms strong pipeline models on ClidSum. Finally, we discuss the specific challenges that current approaches face on this task and suggest multiple promising directions for future research. We have released the dataset and code at https://github.com/krystalan/ClidSum.
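For readers wondering how multiple further pre-training objectives can be combined, here is a minimal sketch of a weighted joint loss (the objective names and weights are illustrative placeholders, not the actual objectives used by mDialBART):

    # Illustrative weighted combination of several pre-training objectives.
    import torch

    def joint_pretraining_loss(losses, weights=None):
        """losses: dict mapping objective name to a scalar loss tensor."""
        weights = weights or {name: 1.0 for name in losses}
        return sum(weights[name] * loss for name, loss in losses.items())

    total = joint_pretraining_loss({'objective_a': torch.tensor(1.2),
                                    'objective_b': torch.tensor(0.8)})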
In this paper, we focus on the problem of citing sentence generation, which entails generating a short text that captures the salient information in a cited paper and the connection between the citing and cited papers. We present BACO, a BAckground knowledge- and COntent-based framework for citing sentence generation, which considers two types of information: (1) background knowledge, obtained by leveraging structural information from a citation network; and (2) content, which represents in-depth information about what to cite and why to cite. First, a citation network is encoded to provide background knowledge. Second, we apply salience estimation to identify what to cite by estimating the importance of sentences in the cited paper. During the decoding stage, both types of information are combined to facilitate text generation, and we jointly train the generator with a citation function classifier to make the model aware of why to cite. Experimental results show that our framework outperforms competitive baselines.
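To make the “what to cite” step concrete, below is a minimal sketch of salience estimation under an assumed scoring function (the abstract does not state how importance is computed; cosine similarity to the citing context is a placeholder):

    # Assumed salience estimation: score each sentence of the cited paper
    # by its similarity to an embedding of the citing context.
    import torch
    import torch.nn.functional as F

    def salience_scores(sent_embs, context_emb):
        """sent_embs: (n_sents, d) cited-paper sentence embeddings;
        context_emb: (d,) citing-context embedding."""
        return F.cosine_similarity(sent_embs, context_emb.unsqueeze(0), dim=-1)

    scores = salience_scores(torch.randn(5, 32), torch.randn(32))
    top = scores.topk(2).indices  # most salient sentences to condition on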
In aspect-level sentiment classification (ASC), it is prevalent to equip dominant neural models with attention mechanisms so as to capture the importance of each context word with respect to the given aspect. However, such mechanisms tend to excessively focus on a few frequent words with sentiment polarities while ignoring infrequent ones. In this paper, we propose a progressive self-supervised attention learning approach for neural ASC models, which automatically mines useful attention supervision information from a training corpus to refine attention mechanisms. Specifically, we iteratively conduct sentiment predictions on all training instances. At each iteration, the context word with the maximum attention weight is extracted as the one with an active/misleading influence on the correct/incorrect prediction of each instance, and the word itself is then masked for subsequent iterations. Finally, we augment the conventional training objective with a regularization term, which enables ASC models to continue focusing equally on the extracted active context words while decreasing the weights of the misleading ones. Experimental results on multiple datasets show that our proposed approach yields better attention mechanisms, leading to substantial improvements over two state-of-the-art neural ASC models. Source code and trained models are available at https://github.com/DeepLearnXMU/PSSAttention.
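The mining procedure described above can be sketched as follows (a hypothetical simplification covering one mining step for one instance, with illustrative function and variable names):

    # Hypothetical single mining step: extract the context word with the
    # maximum attention weight, label it active or misleading depending on
    # whether the prediction was correct, and mask it for later iterations.
    import torch

    def mine_step(attn, pred, gold, masked):
        """attn: (seq_len,) attention weights; masked: positions mined so far."""
        attn = attn.clone()
        if masked:
            attn[list(masked)] = float('-inf')  # ignore already-mined words
        pos = int(attn.argmax())
        label = 'active' if pred == gold else 'misleading'
        masked.add(pos)
        return pos, label

    masked = set()
    pos, label = mine_step(torch.tensor([0.1, 0.6, 0.3]), pred=1, gold=1, masked=masked)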