Van-Hien Tran
This work introduces PrahokBART, a compact pre-trained sequence-to-sequence model trained from scratch for Khmer using carefully curated Khmer and English corpora. We focus on improving the pre-training corpus quality and addressing the linguistic issues of Khmer, which are ignored in existing multilingual models, by incorporating linguistic components such as word segmentation and normalization. We evaluate PrahokBART on three generative tasks: machine translation, text summarization, and headline generation, where our results demonstrate that it outperforms mBART50, a strong multilingual pre-trained model. Additionally, our analysis provides insights into the impact of each linguistic module and evaluates how effectively our model handles spaces during text generation, which is crucial for the naturalness of texts in Khmer.
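Below is a minimal sketch of the kind of Khmer-specific preprocessing described above, i.e. normalization followed by word segmentation. The `khmer_segment` callable is a placeholder for an actual Khmer word segmenter, and the zero-width-space handling is an illustrative assumption rather than PrahokBART's exact pipeline.

```python
import unicodedata

# Placeholder pipeline: normalize Khmer text, then segment it into words so a
# subword tokenizer can be trained on linguistically meaningful units.
ZWSP = "\u200b"  # zero-width space, often used to mark word boundaries in Khmer

def normalize(text: str) -> str:
    # Canonical Unicode normalization plus removal of stray zero-width spaces.
    return unicodedata.normalize("NFC", text).replace(ZWSP, "")

def preprocess(text: str, khmer_segment) -> list[str]:
    # `khmer_segment` is a hypothetical word segmenter passed in by the caller.
    return khmer_segment(normalize(text))
```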
Large language models (LLMs) have demonstrated strong zero-shot machine translation (MT) performance but often exhibit gender bias that is present in their training data, especially when translating into grammatically gendered languages. In this paper, we investigate whether explicitly providing gender information can mitigate this issue and improve translation quality. We propose a two-step approach: (1) inferring entity gender from context, and (2) incorporating this information into prompts using either Structured Tagging or Natural Language. Experiments with five LLMs across four language pairs show that explicit gender cues consistently reduce gender errors, with structured tagging yielding the largest gains. Our results highlight prompt-level gender disambiguation as a simple yet effective strategy for more accurate and fair zero-shot MT.
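A minimal sketch of how inferred entity genders might be injected into an MT prompt under the two styles mentioned above (Structured Tagging vs. Natural Language). The tag format, prompt wording, and the `build_prompt` helper are illustrative assumptions, not the paper's exact templates.

```python
def build_prompt(src_sentence: str, entity_genders: dict, target_lang: str,
                 style: str = "tag") -> str:
    """Inject inferred entity genders into a zero-shot MT prompt."""
    if style == "tag":
        # Structured Tagging: wrap each entity with an inline gender marker.
        annotated = src_sentence
        for entity, gender in entity_genders.items():
            annotated = annotated.replace(entity, f'<entity gender="{gender}">{entity}</entity>')
        context = annotated
    else:
        # Natural Language: state the genders as a plain-text hint.
        hints = ", ".join(f"'{e}' is {g}" for e, g in entity_genders.items())
        context = f"{src_sentence}\n(Note: {hints}.)"
    return f"Translate the following sentence into {target_lang}:\n{context}"

# Example usage with a hypothetical sentence and gender assignments.
print(build_prompt("The doctor greeted the nurse.",
                   {"doctor": "female", "nurse": "male"}, "Spanish"))
```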
Machine Translation (MT) has made great strides with the use of Large Language Models (LLMs) and advanced prompting techniques. However, translating sentences with ambiguous words remains challenging, especially when LLMs have limited proficiency in the source language. This paper introduces two methods to enhance MT performance by leveraging the word sense disambiguation capabilities of LLMs. The first method integrates all the available senses of an ambiguous word into the prompting template. The second method uses a pre-trained source language model to predict the correct sense of the ambiguous word, which is then incorporated into the prompting template. Additionally, we propose two prompting template styles for providing word sense information to LLMs. Experiments on the HOLLY dataset demonstrate the effectiveness of our approach in improving MT performance.
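A minimal sketch of the two sense-aware prompting variants described above. The function names and template wording are illustrative assumptions; the paper's actual templates may differ.

```python
def prompt_with_all_senses(sentence: str, word: str, senses: list[str],
                           target_lang: str) -> str:
    # Variant 1: expose every available sense of the ambiguous word.
    sense_list = "\n".join(f"- {s}" for s in senses)
    return (f"The word '{word}' in the sentence below may mean:\n{sense_list}\n"
            f"Translate the sentence into {target_lang}:\n{sentence}")

def prompt_with_predicted_sense(sentence: str, word: str, predicted_sense: str,
                                target_lang: str) -> str:
    # Variant 2: a source-language WSD model has already picked one sense.
    return (f"In this sentence, '{word}' means: {predicted_sense}.\n"
            f"Translate the sentence into {target_lang}:\n{sentence}")
```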
Zero-shot in-context learning allows large language models (LLMs) to perform tasks using only provided instructions. However, pre-trained LLMs often face calibration issues in zero-shot scenarios, leading to challenges such as hallucinations and off-target translations that compromise output quality, particularly in machine translation (MT). This paper introduces a new method to improve zero-shot MT using fixed prefix pair bootstrapping. By initializing translations with an accurate bilingual prefix pair at the start of both source and target sentences, this approach effectively guides the model to generate precise target-language outputs. Extensive evaluations across four model architectures and multiple translation directions demonstrate significant and consistent improvements, showcasing the potential of this straightforward strategy to enhance zero-shot MT performance.
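A string-level sketch of fixed prefix pair bootstrapping for a decoder-only LLM. The German-English prefix pair and the prompt layout are illustrative assumptions, not necessarily the paper's exact configuration.

```python
SRC_PREFIX = "Die Antwort lautet:"   # source-side half of the bilingual prefix pair
TGT_PREFIX = "The answer is:"        # target-side half

def bootstrap_prompt(src_sentence: str) -> str:
    # The prefix pair is prepended to the source sentence and used to start the
    # target side, anchoring generation in the correct target language.
    return (f"German: {SRC_PREFIX} {src_sentence}\n"
            f"English: {TGT_PREFIX}")

def strip_bootstrap(target_side: str) -> str:
    # Remove the bootstrapped prefix from the full target-side string to
    # recover the translation of the original sentence.
    return target_side.removeprefix(TGT_PREFIX).strip()
```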
Low-resource machine translation (LRMT) poses a substantial challenge due to the scarcity of parallel training data. This paper introduces a new method to improve the transfer of the embedding layer from the Parent model to the Child model in LRMT, utilizing the trained token embeddings of the Parent model’s high-resource vocabulary. Our approach projects all tokens into a shared semantic space and measures the semantic similarity between tokens in the low-resource and high-resource languages. These measures are then used to initialize the token representations in the Child model’s low-resource vocabulary. We evaluated our approach on three benchmark datasets of low-resource language pairs: Myanmar-English, Indonesian-English, and Turkish-English. The experimental results demonstrate that our method outperforms previous methods in terms of translation quality. Additionally, our approach is computationally efficient, leading to reduced training time compared to prior work.
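A rough sketch of similarity-based embedding initialization for the Child model, assuming token vectors have already been projected into a shared semantic space (e.g. cross-lingual word embeddings). The softmax-weighted averaging is an illustrative choice rather than the paper's exact scheme.

```python
import numpy as np

def init_child_embeddings(shared_child: np.ndarray,    # (Vc, d_shared) child tokens in shared space
                          shared_parent: np.ndarray,   # (Vp, d_shared) parent tokens in shared space
                          parent_emb: np.ndarray,      # (Vp, d_model) trained Parent embeddings
                          temperature: float = 0.1) -> np.ndarray:
    # Cosine similarity between every child-vocabulary token and every
    # parent-vocabulary token in the shared semantic space.
    c = shared_child / np.linalg.norm(shared_child, axis=1, keepdims=True)
    p = shared_parent / np.linalg.norm(shared_parent, axis=1, keepdims=True)
    sim = c @ p.T                                       # (Vc, Vp)
    # Turn similarities into weights and average the trained Parent embeddings.
    w = np.exp(sim / temperature)
    w /= w.sum(axis=1, keepdims=True)
    return w @ parent_emb                               # (Vc, d_model)
```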
Zero-shot relation extraction (ZSRE) aims to predict target relations that are not observed during training. While most previous studies have focused on fully supervised relation extraction and achieved high performance, far less effort has been devoted to ZSRE. This study proposes a new model incorporating discriminative embedding learning for both sentences and semantic relations. In addition, a self-adaptive comparator network is used to judge whether a sentence and a candidate relation are semantically consistent. Experimental results on two benchmark datasets show that the proposed method significantly outperforms state-of-the-art methods.
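An illustrative sketch of a comparator that scores the consistency between a sentence embedding and a relation embedding. The layer sizes and the concatenation-based matching features are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Comparator(nn.Module):
    """Scores how consistent a sentence embedding is with a relation embedding."""
    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 4, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, sent: torch.Tensor, rel: torch.Tensor) -> torch.Tensor:
        # Combine the two embeddings with difference and product features,
        # then predict a single consistency score per pair.
        feats = torch.cat([sent, rel, torch.abs(sent - rel), sent * rel], dim=-1)
        return self.net(feats).squeeze(-1)
```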
This paper presents CovRelex, a scientific paper retrieval system that targets entities and relations via relation extraction over COVID-19 scientific papers. This work aims to build a system that helps users efficiently acquire knowledge from the large volume of COVID-19 scientific papers being published at a rapid pace. Our system can be accessed at https://www.jaist.ac.jp/is/labs/nguyen-lab/systems/covrelex/.
Recently, relation classification has achieved considerable success by exploiting deep neural networks. In this paper, we propose a new model that effectively combines Segment-level Attention-based Convolutional Neural Networks (SACNNs) and Dependency-based Recurrent Neural Networks (DepRNNs). While SACNNs allow the model to selectively focus on the important segments of the raw sequence, DepRNNs help capture long-distance dependencies along the shortest dependency path between the relation entities. Experiments on the SemEval-2010 Task 8 dataset show that our model is comparable to the state of the art without using any external lexical features.
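A rough sketch of how a segment-level CNN and an LSTM over the shortest dependency path might be combined for relation classification. Dimensions and the fusion by concatenation are illustrative assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class SegmentCNNDepRNN(nn.Module):
    """Fuses a CNN over a sentence segment with a BiLSTM over the dependency path."""
    def __init__(self, emb_dim=300, n_filters=100, hidden=100, n_classes=19):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(n_filters + 2 * hidden, n_classes)

    def forward(self, segment_emb, sdp_emb):
        # segment_emb: (B, L, emb_dim) embeddings of a sentence segment
        # sdp_emb:     (B, P, emb_dim) embeddings along the shortest dependency path
        conv_feat = torch.relu(self.conv(segment_emb.transpose(1, 2))).max(dim=2).values
        _, (h, _) = self.lstm(sdp_emb)
        dep_feat = torch.cat([h[0], h[1]], dim=-1)  # final states of both directions
        return self.out(torch.cat([conv_feat, dep_feat], dim=-1))
```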