Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Yang (Trista) Cao, Isabel Papadimitriou, Anaelia Ovalle, Marcos Zampieri, Francis Ferraro, Swabha Swayamdipta (Editors)

Anthology ID:: 2024.naacl-srw
Month:: June
Year:: 2024
Address:: Mexico City, Mexico
Venue:: NAACL
SIG:
Publisher:: Association for Computational Linguistics
URL:: https://aclanthology.org/2024.naacl-srw
DOI:
Bib Export formats:: BibTeX
PDF:: https://preview.aclanthology.org/dois-2013-emnlp/2024.naacl-srw.pdf

pdf bib
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Yang (Trista) Cao | Isabel Papadimitriou | Anaelia Ovalle | Marcos Zampieri | Francis Ferraro | Swabha Swayamdipta

pdf bib abs
Systematic Analysis for Pretrained Language Model Priming for Parameter-Efficient Fine-tuning
Shih-Cheng Huang | Shih-Heng Wang | Min-Han Shih | Saurav Sahay | Hung-yi Lee

Parameter-efficient (PE) methods (like Prompts or Adapters) for adapting pre-trained language models (PLM) to downstream tasks have been popular recently. However, hindrances still prevent these methods from reaching their full potential. For example, two significant challenges are few-shot adaptation and cross-task generalization. To tackle these issues, we propose a general PE priming framework to enhance and explore the few-shot adaptation and generalization ability of PE methods. In this framework, PLMs are primed with PE methods for rapidly adapting to various target tasks. To evaluate the generalization ability of these PE methods, we conduct experiments on a few-shot cross-domain benchmark containing 160 diverse NLP tasks. Our experiment not only reveals the best priming strategy but also verifies that priming facilitates the adaptation to target tasks.

pdf bib abs
Rephrasing Invokes Better Generations for Large Language Models
Haoran Yang | Hongyuan Lu | Wai Lam

In the realm of emerging multitasking abilities of Large language models (LLMs), methodologies like prompt tuning enable low-cost adaptation to downstream tasks without retraining the model. However, automatic input pre-processing when LLMs are unavailable is currently under-studied. This paper proposes ReLLM (Rephrasing for LLMs), a method that automatically paraphrases input content for better output generations. ReLLM replaces low-frequency lexical items with their high-frequency counterparts. This substitution is particularly beneficial for low-resource language tasks that lack sufficient training data and resources. ReLLM is user-friendly and requires no additional LLM training. Experimental results in cross-lingual summarization, and natural language inference demonstrate the effectiveness of ReLLM.

pdf abs
Exploring Compositional Generalization of Large Language Models
Haoran Yang | Hongyuan Lu | Wai Lam | Deng Cai

In this paper, we study the generalization ability of large language models (LLMs) with respect to compositional instructions, which are instructions that can be decomposed into several sub-instructions. We argue that the ability to generalize from simple instructions to more intricate compositional instructions represents a key aspect of the out-of-distribution generalization for LLMs. Since there are no specialized datasets for studying this phenomenon, we first construct a dataset with the help of ChatGPT, guided by the self-instruct technique. Then, we fine-tune and evaluate LLMs on these datasets. Interestingly, our experimental results indicate that training LLMs on higher-order compositional instructions enhances their performance on lower-order ones, but the reverse does not hold true.

pdf abs
Explainable CED: A Dataset for Explainable Critical Error Detection in Machine Translation
Dahyun Jung | Sugyeong Eo | Chanjun Park | Heuiseok Lim

Critical error detection (CED) in machine translation is a task that aims to detect errors that significantly distort the intended meaning. However, the existing study of CED lacks explainability due to the absence of content addressing the reasons for catastrophic errors. To address this limitation, we propose Explainable CED, a dataset that introduces the attributes of error explanation and correction regarding critical errors. Considering the advantage of reducing time costs and mitigating human annotation bias, we leverage a large language model in the data construction process. To improve the quality of the dataset and mitigate hallucination, we compare responses from the model and introduce an additional data filtering method through feedback scoring. The experiment demonstrates that the dataset appropriately reflects a consistent explanation and revision for errors, validating the reliability of the dataset.

pdf abs
SMARTR: A Framework for Early Detection using Survival Analysis of Longitudinal Texts
Jean-Thomas Baillargeon | Luc Lamontagne

This paper presents an innovative approach to the early detection of expensive insurance claims by leveraging survival analysis concepts within a deep learning framework exploiting textual information from claims notes. Our proposed SMARTR model addresses limitations of state-of-the-art models, such as handling data-label mismatches and non-uniform data frequency, to enhance a posteriori classification and early detection. Our results suggest that incorporating temporal dynamics and empty period representation improves model performance, highlighting the importance of considering time in insurance claim analysis. The approach appears promising for application to other insurance datasets.

pdf abs
Fast Exact Retrieval for Nearest-neighbor Lookup (FERN)
Richard Zhu

Exact nearest neighbor search is a computationally intensive process, and even its simpler sibling — vector retrieval — can be computationally complex. This is exacerbated when retrieving vectors which have high-dimension d relative to the number of vectors, N, in the database. Exact nearest neighbor retrieval has been generally acknowledged to be a O(Nd) problem with no sub-linear solutions. Attention has instead shifted towards Approximate Nearest-Neighbor (ANN) retrieval techniques, many of which have sub-linear or even logarithmic time complexities. However, if our intuition from binary search problems (e.g. d=1 vector retrieval) carries, there ought to be a way to retrieve an organized representation of vectors without brute-forcing our way to a solution. For low dimension (e.g. d=2 or d=3 cases), kd-trees provide a O(dlog N) algorithm for retrieval. Unfortunately the algorithm deteriorates rapidly to a O(dN) solution at high dimensions (e.g. k=128), in practice. We propose a novel algorithm for logarithmic Fast Exact Retrieval for Nearest-neighbor lookup (FERN), inspired by kd-trees. The algorithm achieves O(dlog N) look-up with 100% recall on 10 million d=128 uniformly randomly generated vectors.

pdf abs
Start Simple: Progressive Difficulty Multitask Learning
Yunfei Luo | Yuyang Liu | Rukai Cai | Tauhidur Rahman

The opaque nature of neural networks, often described as black boxes, poses significant challenges in understanding their learning mechanisms, which limit our ability to fully optimize and trust these models.Inspired by how humans learn, this paper proposes a novel neural network training strategy that employs multitask learning with progressive difficulty subtasks, which we believe can potentially shed light on the internal learning mechanisms of neural networks.We implemented this strategy across a range of NLP tasks, data sets, and neural network architectures and observed notable improvements in model performance.This suggests that neural networks may be able to extract common features and internalize shared representations across similar subtasks that differ in their difficulty.Analyzing this strategy could lead us to more interpretable and robust neural networks, enhancing both their performance and our understanding of their nature.

Spurred by recent advances in Large Language Models (LLMs), virtual assistants are poised to take a leap forward in terms of their dialogue capabilities. Yet a major bottleneck to achieving genuinely transformative task-oriented dialogue capabilities remains the scarcity of high quality data. Existing datasets, while impressive in scale, have limited domain coverage and contain few genuinely challenging conversational phenomena; those which are present are typically unlabelled, making it difficult to assess the strengths and weaknesses of models without time-consuming and costly human evaluation. Moreover, creating high quality dialogue data has until now required considerable human input, limiting both the scale of these datasets and the ability to rapidly bootstrap data for a new target domain. We aim to overcome these issues with LUCID, a modularised and highly automated LLM-driven data generation system that produces realistic, diverse and challenging dialogues. We use LUCID to generate a seed dataset of 4,277 conversations across 100 intents to demonstrate its capabilities, with a human review finding consistently high quality labels in the generated data.

pdf abs
Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages
Sankalp Bahad | Pruthwik Mishra | Parameswari Krishnamurthy | Dipti Sharma

Named Entity Recognition (NER) is a use-ful component in Natural Language Process-ing (NLP) applications. It is used in varioustasks such as Machine Translation, Summa-rization, Information Retrieval, and Question-Answering systems. The research on NER iscentered around English and some other ma-jor languages, whereas limited attention hasbeen given to Indian languages. We analyze thechallenges and propose techniques that can betailored for Multilingual Named Entity Recog-nition for Indian Languages. We present a hu-man annotated named entity corpora of ∼40Ksentences for 4 Indian languages from two ofthe major Indian language families. Addition-ally, we show the transfer learning capabilitiesof pre-trained transformer models from a highresource language to multiple low resource lan-guages through a series of experiments. Wealso present a multilingual model fine-tunedon our dataset, which achieves an F1 score of∼0.80 on our dataset on average. We achievecomparable performance on completely unseenbenchmark datasets for Indian languages whichaffirms the usability of our model.

pdf abs
Knowledge-centered conversational agents with a drive to learn
Selene Baez Santamaria

We create an adaptive conversational agent that assesses the quality of its knowledge and is driven to become more knowledgeable. Unlike agents with predefined tasks, ours can leverage people as diverse sources to meet its knowledge needs. We test the agent in social contexts, where personal and subjective information can be obtained through dialogue. We provide the agent both with generic methods for assessing its knowledge quality (e.g. correctness, completeness, redundancy, interconnectedness, and diversity), as well as with generic capabilities to improve its knowledge by leveraging external sources. We demonstrate that the agent can learn effective policies to acquire the knowledge needed by assessing the efficiency of these capabilities during interaction. Our framework enables on-the-fly learning, offering a dynamic and adaptive approach to shaping conversational interactions.

pdf abs
Exploring Inherent Biases in LLMs within Korean Social Context: A Comparative Analysis of ChatGPT and GPT-4
Seungyoon Lee | Dong Kim | Dahyun Jung | Chanjun Park | Heuiseok Lim

Large Language Models (LLMs) have significantly impacted various fields requiring advanced linguistic understanding, yet concerns regarding their inherent biases and ethical considerations have also increased. Notably, LLMs have been critiqued for perpetuating stereotypes against diverse groups based on race, sexual orientation, and other attributes. However, most research analyzing these biases has predominantly focused on communities where English is the primary language, neglecting to consider the cultural and linguistic nuances of other societies. In this paper, we aim to explore the inherent biases and toxicity of LLMs, specifically within the social context of Korea. We devise a set of prompts that reflect major societal issues in Korea and assign varied personas to both ChatGPT and GPT-4 to assess the toxicity of the generated sentences. Our findings indicate that certain personas or prompt combinations consistently yield harmful content, highlighting the potential risks associated with specific persona-issue alignments within the Korean cultural framework. Furthermore, we discover that GPT-4 can produce more than twice the level of toxic content than ChatGPT under certain conditions.

pdf abs
To Clarify or not to Clarify: A Comparative Analysis of Clarification Classification with Fine-Tuning, Prompt Tuning, and Prompt Engineering
Alina Leippert | Tatiana Anikina | Bernd Kiefer | Josef Genabith

Misunderstandings occur all the time in human conversation but deciding on when to ask for clarification is a challenging task for conversational systems that requires a balance between asking too many unnecessary questions and running the risk of providing incorrect information. This work investigates clarification identification based on the task and data from (Xu et al., 2019), reproducing their Transformer baseline and extending it by comparing pre-trained language model fine-tuning, prompt tuning and manual prompt engineering on the task of clarification identification. Our experiments show strong performance with LM and a prompt tuning approach with BERT and RoBERTa, outperforming standard LM fine-tuning, while manual prompt engineering with GPT-3.5 proved to be less effective, although informative prompt instructions have the potential of steering the model towards generating more accurate explanations for why clarification is needed.

pdf abs
Detecting Response Generation Not Requiring Factual Judgment
Ryohei Kamei | Daiki Shiono | Reina Akama | Jun Suzuki

With the remarkable development of large language models (LLMs), ensuring the factuality of output has become a challenge.However, having all the contents of the response with given knowledge or facts is not necessarily a good thing in dialogues.This study aimed to achieve both attractiveness and factuality in a dialogue response for which a task was set to predict sentences that do not require factual correctness judgment such as agreeing, or personal opinions/feelings.We created a dataset, dialogue dataset annotated with fact-check-needed label (DDFC), for this task via crowdsourcing, and classification tasks were performed on several models using this dataset.The model with the highest classification accuracy could yield about 88% accurate classification results.

pdf abs
Unknown Script: Impact of Script on Cross-Lingual Transfer
Wondimagegnhue Tufa | Ilia Markov | Piek Vossen

Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often overlooked aspect in this domain: the influence of the source language of a language model on language transfer performance. We consider a case where the target language and its script are not part of the pre-trained model. We conduct a series of experiments on monolingual and multilingual models that are pre-trained on different tokenization methods to determine factors that affect cross-lingual transfer to a new language with a unique script. Our findings reveal the importance of the tokenizer as a stronger factor than the shared script, language similarity, and model size.

pdf abs
Improving Repository-level Code Search with Text Conversion
Mizuki Kondo | Daisuke Kawahara | Toshiyuki Kurabayashi

The ability to generate code using large language models (LLMs) has been increasing year by year. However, studies on code generation at the repository level are not very active. In repository-level code generation, it is necessary to refer to related code snippets among multiple files. By taking the similarity between code snippets, related files are searched and input into an LLM, and generation is performed. This paper proposes a method to search for related files (code search) by taking similarities not between code snippets but between the texts converted from the code snippets by the LLM. We confirmed that converting to text improves the accuracy of code search.

pdf abs
Improving Multi-lingual Alignment Through Soft Contrastive Learning
Minsu Park | Seyeon Choi | Chanyeol Choi | Jun-Seong Kim | Jy-yong Sohn

Making decent multi-lingual sentence representations is critical to achieve high performances in cross-lingual downstream tasks. In this work, we propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model. Given translation sentence pairs, we train a multi-lingual model in a way that the similarity between cross-lingual embeddings follows the similarity of sentences measured at the mono-lingual teacher model. Our method can be considered as contrastive learning with soft labels defined as the similarity between sentences. Our experimental results on five languages show that our contrastive loss with soft labels far outperforms conventional constrastive loss with hard labels in various benchmarks for bitext mining tasks and STS tasks. In addition, our method outperforms existing multi-lingual embeddings including LaBSE, for Tatoeba dataset.

pdf abs
Few-Shot Event Argument Extraction Based on a Meta-Learning Approach
Aboubacar Tuo | Romaric Besançon | Olivier Ferret | Julien Tourille

Few-shot learning techniques for Event Extraction are developed to alleviate the cost of data annotation. However, most studies on few-shot event extraction only focus on event trigger detection and no study has been proposed on argument extraction in a meta-learning context. In this paper, we investigate few-shot event argument extraction using prototypical networks, casting the task as a relation classification problem. Furthermore, we propose to enhance the relation embeddings by injecting syntactic knowledge into the model using graph convolutional networks. Our experimental results show that our proposed approach achieves strong performance on ACE 2005 in several few-shot configurations, and highlight the importance of syntactic knowledge for this task. More generally, our paper provides a unified evaluation framework for meta-learning approaches for argument extraction.

pdf abs
Investigating Web Corpus Filtering Methods for Language Model Development in Japanese
Rintaro Enomoto | Arseny Tolmachev | Takuro Niitsuma | Shuhei Kurita | Daisuke Kawahara

The development of large language models (LLMs) is becoming increasingly significant, and there is a demand for high-quality, large-scale corpora for their pretraining.The quality of a web corpus is especially essential to improve the performance of LLMs because it accounts for a large proportion of the whole corpus. However, filtering methods for Web corpora have yet to be established.In this paper, we present empirical studies to reveal which filtering methods are indeed effective and analyze why they are.We build classifiers and language models in Japanese that can process large amounts of corpora rapidly enough for pretraining LLMs in limited computational resources. By evaluating these filtering methods based on a Web corpus quality evaluation benchmark, we reveal that the most accurate method is the N-gram language model. Indeed, we empirically present that strong filtering methods can rather lead to lesser performance in downstream tasks.We also report that the proportion of some specific topics in the processed documents decreases significantly during the filtering process.

pdf abs
Referring Expressions in Human-Robot Common Ground: A Thesis Proposal
Jaap Kruijt

In this PhD, we investigate the processes through which common ground shapes the pragmatic use of referring expressions in Human-Robot Interaction. A central point in our investigation is the interplay between a growing common ground and changes in the surrounding context, which can create ambiguity, variation and the need for pragmatic interpretations. We outline three objectives that define the scope of our work: 1) obtaining data with common ground interactions, 2) examining reference-making, and 3) evaluating the robot interlocutor. We use datasets as well as a novel interactive experimental framework to investigate the linguistic processes involved in shaping referring expressions. We also design an interactive robot model, which models these linguistic processes and can use pragmatic inference to resolve referring expressions. With this work, we contribute to existing work in HRI, reference resolution and the study of common ground.

pdf abs
Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection
Mohammed Rahaman | Julia Ive

Code clone detection is challenging, as sourcecode can be written in different languages, do-mains, and styles. In this paper, we arguethat source code is inherently a graph, not asequence, and that graph-based methods aremore suitable for code clone detection thansequence-based methods. We compare the per-formance of two state-of-the-art models: Code-BERT (Feng et al., 2020), a sequence-basedmodel, and CodeGraph (Yu et al., 2023), agraph-based model, on two benchmark data-sets: BCB (Svajlenko et al., 2014) and PoolC(PoolC, no date). We show that CodeGraphoutperforms CodeBERT on both data-sets, es-pecially on cross-lingual code clones. To thebest of our knowledge, this is the first work todemonstrate the cross-lingual code clone detec-tion showing superiority on graph-based meth-ods over sequence-based methods

Text Style Transfer (TST) seeks to alter the style of text while retaining its core content. Given the constraints of limited parallel datasets for TST, we propose CoTeX, a framework that leverages large language models (LLMs) alongside chain-of-thought (CoT) prompting to facilitate TST. CoTeX distills the complex rewriting and reasoning capabilities of LLMs into more streamlined models capable of working with both non-parallel and parallel data. Through experimentation across four TST datasets, CoTeX is shown to surpass traditional supervised fine-tuning and knowledge distillation methods, particularly in low-resource settings. We conduct a comprehensive evaluation, comparing CoTeX against current unsupervised, supervised, in-context learning (ICL) techniques, and instruction-tuned LLMs. Furthermore, CoTeX distinguishes itself by offering transparent explanations for its style transfer process.

pdf abs
Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation
Hao Wang | Tetsuro Morimura | Ukyo Honda | Daisuke Kawahara

Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT). However, a performance gap exists between NAR and autoregressive models due to the large decoding space and difficulty in capturing dependency between target words accurately. Compounding this, preparing appropriate training data for NAR models is a non-trivial task, often exacerbating exposure bias. To address these challenges, we apply reinforcement learning (RL) to Levenshtein Transformer, a representative edit-based NAR model, demonstrating that RL with self-generated data can enhance the performance of edit-based NAR models. We explore two RL approaches: stepwise reward maximization and episodic reward maximization. We discuss the respective pros and cons of these two approaches and empirically verify them. Moreover, we experimentally investigate the impact of temperature setting on performance, confirming the importance of proper temperature setting for NAR models’ training.

pdf abs
Evaluation Dataset for Japanese Medical Text Simplification
Koki Horiguchi | Tomoyuki Kajiwara | Yuki Arase | Takashi Ninomiya

We create a parallel corpus for medical text simplification in Japanese, which simplifies medical terms into expressions that patients can understand without effort.While text simplification in the medial domain is strongly desired by society, it is less explored in Japanese because of the lack of language resources.In this study, we build a parallel corpus for Japanese text simplification evaluation in the medical domain using patients’ weblogs.This corpus consists of 1,425 pairs of complex and simple sentences with or without medical terms.To tackle medical text simplification without a training corpus of the corresponding domain, we repurpose a Japanese text simplification model of other domains.Furthermore, we propose a lexically constrained reranking method that allows to avoid technical terms to be output.Experimental results show that our method contributes to achieving higher simplification performance in the medical domain.

pdf abs
Multi-Source Text Classification for Multilingual Sentence Encoder with Machine Translation
Reon Kajikawa | Keiichiro Yamada | Tomoyuki Kajiwara | Takashi Ninomiya

To reduce the cost of training models for each language for developers of natural language processing applications, pre-trained multilingual sentence encoders are promising.However, since training corpora for such multilingual sentence encoders contain only a small amount of text in languages other than English, they suffer from performance degradation for non-English languages.To improve the performance of pre-trained multilingual sentence encoders for non-English languages, we propose a method of machine translating a source sentence into English and then inputting it together with the source sentence in a multi-source manner.Experimental results on sentiment analysis and topic classification tasks in Japanese revealed the effectiveness of the proposed method.

In the pursuit of supporting more languages around the world, tools that characterize properties of languages play a key role in expanding the existing multilingual NLP research. In this study, we focus on a widely used typological knowledge base, URIEL, which aggregates linguistic information into numeric vectors. Specifically, we delve into the soundness and reproducibility of the approach taken by URIEL in quantifying language similarity. Our analysis reveals URIEL’s ambiguity in calculating language distances and in handling missing values. Moreover, we find that URIEL does not provide any information about typological features for 31% of the languages it represents, undermining the reliabilility of the database, particularly on low-resource languages. Our literature review suggests URIEL and lang2vec are used in papers on diverse NLP tasks, which motivates us to rigorously verify the database as the effectiveness of these works depends on the reliability of the information the tool provides.

pdf abs
Coding Open-Ended Responses using Pseudo Response Generation by Large Language Models
Yuki Zenimoto | Ryo Hasegawa | Takehito Utsuro | Masaharu Yoshioka | Noriko Kando

Survey research using open-ended responses is an important method thatcontributes to the discovery of unknown issues and new needs. However,survey research generally requires time and cost-consuming manual dataprocessing, indicating that it is difficult to analyze large dataset.To address this issue, we propose an LLM-based method to automate partsof the grounded theory approach (GTA), a representative approach of thequalitative data analysis. We generated and annotated pseudo open-endedresponses, and used them as the training data for the coding proceduresof GTA. Through evaluations, we showed that the models trained withpseudo open-ended responses are quite effective compared with thosetrained with manually annotated open-ended responses. We alsodemonstrate that the LLM-based approach is highly efficient andcost-saving compared to human-based approach.

pdf abs
Cross-Task Generalization Abilities of Large Language Models
Qinyuan Ye

Humans can learn a new language task efficiently with only few examples, by leveraging their knowledge and experience obtained when learning prior tasks. Enabling similar cross-task generalization abilities in NLP systems is fundamental for approaching the goal of general intelligence and expanding the reach of language technology in the future.In this thesis proposal, I will present my work on (1) benchmarking cross-task generalization abilities with diverse NLP tasks; (2) developing model architectures for improving cross-task generalization abilities; (3) analyzing and predicting the generalization landscape of current state-of-the-art large language models. Additionally, I will outline future research directions, along with preliminary thoughts on addressing them.

pdf abs
Commentary Generation from Data Records of Multiplayer Strategy Esports Game
Zihan Wang | Naoki Yoshinaga

Esports, a sports competition on video games, has become one of the most important sporting events. Although esports play logs have been accumulated, only a small portion of them accompany text commentaries for the audience to retrieve and understand the plays. In this study, we therefore introduce the task of generating game commentaries from esports’ data records. We first build large-scale esports data-to-text datasets that pair structured data and commentaries from a popular esports game, League of Legends. We then evaluate Transformer-based models to generate game commentaries from structured data records, while examining the impact of the pre-trained language models. Evaluation results on our dataset revealed the challenges of this novel task. We will release our dataset to boost potential research in the data-to-text generation community.

pdf abs
Facilitating Opinion Diversity through Hybrid NLP Approaches
Michiel Van Der Meer

Modern democracies face a critical issue of declining citizen participation in decision-making. Online discussion forums are an important avenue for enhancing citizen participation. This thesis proposal 1) identifies the challenges involved in facilitating large-scale online discussions with Natural Language Processing (NLP), 2) suggests solutions to these challenges by incorporating hybrid human-AI technologies, and 3) investigates what these technologies can reveal about individual perspectives in online discussions. We propose a three-layered hierarchy for representing perspectives that can be obtained by a mixture of human intelligence and large language models. We illustrate how these representations can draw insights into the diversity of perspectives and allow us to investigate interactions in online discussions.

pdf abs
HybridBERT - Making BERT Pretraining More Efficient Through Hybrid Mixture of Attention Mechanisms
Gokul Srinivasagan | Simon Ostermann

Pretrained transformer-based language models have produced state-of-the-art performance in most natural language understanding tasks. These models undergo two stages of training: pretraining on a huge corpus of data and fine-tuning on a specific downstream task. The pretraining phase is extremely compute-intensive and requires several high-performance computing devices like GPUs and several days or even months of training, but it is crucial for the model to capture global knowledge and also has a significant impact on the fine-tuning task. This is a major roadblock for researchers without access to sophisticated computing resources. To overcome this challenge, we propose two novel hybrid architectures called HybridBERT (HBERT), which combine self-attention and additive attention mechanisms together with sub-layer normalization. We introduce a computing budget to the pretraining phase, limiting the training time and usage to a single GPU. We show that HBERT attains twice the pretraining accuracy of a vanilla-BERT baseline. We also evaluate our proposed models on two downstream tasks, where we outperform BERT-base while accelerating inference. Moreover, we study the effect of weight initialization with a limited pretraining budget. The code and models are publicly available at: www.github.com/gokulsg/HBERT/.