2025
Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation
Boxuan Lyu | Hidetaka Kamigaito | Kotaro Funakoshi | Manabu Okumura
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Maximum a posteriori decoding, a commonly used method for neural machine translation (NMT), aims to maximize the estimated posterior probability. However, high estimated probability does not always lead to high translation quality. Minimum Bayes Risk (MBR) decoding offers an alternative by seeking hypotheses with the highest expected utility. Inspired by Quality Estimation (QE) reranking, which uses the QE model as a ranker, we propose source-based MBR (sMBR) decoding, a novel approach that utilizes quasi-sources (generated via paraphrasing or back-translation) as “support hypotheses” and a reference-free quality estimation metric as the utility function, marking the first work to solely use sources in MBR decoding. Experiments show that sMBR outperforms QE reranking and the standard MBR decoding. Our findings suggest that sMBR is a promising approach for NMT decoding.
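As an illustration of the standard MBR selection rule mentioned in this abstract, here is a minimal sketch. It assumes a toy unigram-F1 utility in place of a real MT metric, and the names `token_f1` and `mbr_decode` are hypothetical, not taken from the paper. In sMBR as described above, the support set would instead hold quasi-sources and the utility would be a reference-free QE metric.

```python
from typing import Callable, Sequence


def token_f1(hyp: str, ref: str) -> float:
    """Toy utility (assumption): F1 over unigram sets, a stand-in for a real metric."""
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(
    hypotheses: Sequence[str],
    supports: Sequence[str],
    utility: Callable[[str, str], float],
) -> str:
    """Return the hypothesis with the highest average utility against the support set."""

    def expected_utility(h: str) -> float:
        return sum(utility(h, s) for s in supports) / len(supports)

    return max(hypotheses, key=expected_utility)


candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat is on the mat",
    "dogs bark loudly",
]
# Standard MBR commonly reuses the sampled candidates as the support set.
best = mbr_decode(candidates, candidates, token_f1)
print(best)
```

The selected candidate is the one that agrees most with the rest of the sample, rather than the one with the highest model probability.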
CoAM: Corpus of All-Type Multiword Expressions
Yusuke Ide | Joshua Tanner | Adam Nohejl | Jacob Hoffman | Justin Vasselli | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process to enhance data quality consisting of human annotation, human review, and automated consistency checking. Additionally, for the first time in a dataset of MWE identification, CoAM’s MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form. Through experiments using CoAM, we find that a fine-tuned large language model outperforms MWEasWSD, which achieved the state-of-the-art performance on the DiMSUM dataset. Furthermore, analysis using our MWE type tagged data reveals that Verb MWEs are easier than Noun MWEs to identify across approaches.
Diversity Explains Inference Scaling Laws: Through a Case Study of Minimum Bayes Risk Decoding
Hidetaka Kamigaito | Hiroyuki Deguchi | Yusuke Sakai | Katsuhiko Hayashi | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Inference methods play an important role in eliciting the performance of large language models (LLMs). Currently, LLMs rely on inference methods that use multiple generated samples, which can be derived from Minimum Bayes Risk (MBR) decoding. Previous studies have conducted empirical analyses to clarify the improvements in generation performance achieved by MBR decoding and have reported various observations. However, the theoretical underpinnings of these findings remain uncertain. To address this, we offer a new theoretical interpretation of MBR decoding from the perspective of bias–diversity decomposition. In this interpretation, the error in the quality estimation of hypotheses by MBR decoding is decomposed into two main factors: bias, which considers the closeness between the utility function and human evaluation, and diversity, which represents the variability in the quality estimation of the utility function. The theoretical analysis reveals the difficulty of simultaneously improving bias and diversity, confirming the validity of enhancing MBR decoding performance by increasing diversity. Furthermore, we reveal that diversity can explain one aspect of inference scaling laws that describe performance improvement by increasing sample size. Moreover, experiments across multiple NLP tasks yielded results consistent with these theoretical characteristics. Our code is available at https://github.com/naist-nlp/mbr-bias-diversity.
Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
BQA: Body Language Question Answering Dataset for Video Large Language Models
Shintaro Ozaki | Kazuki Hayashi | Miyu Oba | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
A large part of human communication relies on nonverbal cues such as facial expressions, eye contact, and body language. Unlike language or sign language, such nonverbal communication lacks formal rules, requiring complex reasoning based on commonsense understanding. Enabling current Video Large Language Models (VideoLLMs) to accurately interpret body language is a crucial challenge, as human unconscious actions can easily cause the model to misinterpret their intent. To address this, we propose BQA, a body language question answering dataset, to validate whether the model can correctly interpret emotions from short video clips of body language, which cover 26 emotion labels. We evaluated various VideoLLMs on BQA with and without Multimodal Chain of Thought (CoT) and found that understanding body language is challenging. Our analyses of the wrong answers show that certain VideoLLMs gave largely biased answers depending on the age group and ethnicity of the individuals. We also found consistent error patterns in VideoLLMs.
Translating Movie Subtitles by Large Language Models using Movie-meta Information
Ashmari Pramodya | Yusuke Sakai | Justin Vasselli | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Large language models (LLMs) have advanced natural language processing by understanding, generating, and manipulating texts. Although recent studies have shown that prompt engineering can reduce computational effort and potentially improve translation quality, prompt designs specific to different domains remain challenging. Besides, movie subtitle translation is particularly challenging and understudied, as it involves handling colloquial language, preserving cultural nuances, and drawing on contextual information such as the movie’s theme and storyline to ensure accurate meaning. This study aims to fill this gap by focusing on the translation of movie subtitles through the use of prompting strategies that incorporate the movie’s meta-information, e.g., movie title, summary, and genre. We build a multilingual dataset that aligns the OpenSubtitles dataset with the corresponding Wikipedia articles and investigate different prompts and their effect on translation performance. Our experiments with GPT-3.5, GPT-4o, and LLaMA-3 models show that the presence of meta-information improves translation accuracy. These findings further emphasize the importance of designing appropriate prompts and highlight the potential of LLMs to enhance subtitle translation quality.
SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala
Ashmari Pramodya | Nirasha Nelki | Heshan Shalinda | Chamila Liyanage | Yusuke Sakai | Randil Pushpananda | Ruvan Weerasinghe | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 Sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.
LoCt-Instruct: An Automatic Pipeline for Constructing Datasets of Logical Continuous Instructions
Hongyu Sun | Yusuke Sakai | Haruki Sakajo | Shintaro Ozaki | Kazuki Hayashi | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Continuous instruction following closely mirrors real-world tasks by requiring models to solve sequences of interdependent steps, yet existing multi-step instruction datasets suffer from three key limitations: (1) lack of logical coherence across turns, (2) narrow topical breadth and depth, and (3) reliance on rigid templates or heavy manual effort. We introduce LoCt-Pipeline, a novel pipeline that leverages modern LLMs’ reasoning capabilities to assemble rich, topic-related single-instruction data into multi-turn dialogues, producing chains that are logically coherent, progressively deepen in content, and span diverse domains without fixed templates or extensive human annotation. We employed this pipeline to construct LoCt-Instruct for assessing models’ problem-solving abilities. The generated chains serve as a testbed for benchmarking a variety of models, including reasoning-oriented architectures, instruction-tuned variants, and state-of-the-art closed-source LLMs on their capacity to follow and correctly respond to each step. Our results reveal a substantial performance gap between current LLMs and human solvers. These findings highlight the need for more robust continuous instruction following. We publicly release the dataset and end-to-end pipeline.
AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts
Soichiro Murakami | Peinan Zhang | Hidetaka Kamigaito | Hiroya Takamura | Manabu Okumura
Findings of the Association for Computational Linguistics: NAACL 2025
Effective linguistic choices that attract potential customers play crucial roles in advertising success. This study aims to explore the linguistic features of ad texts that influence human preferences. Although the creation of attractive ad texts is an active area of research, progress in understanding the specific linguistic features that affect attractiveness is hindered by several obstacles. First, human preferences are complex and influenced by multiple factors, including the content of ad texts, such as brand names, and their linguistic styles, making analysis challenging. Second, publicly available ad text datasets that include human preferences, such as ad performance metrics and human feedback that reflect people’s interests, are lacking. To address these problems, we present AdParaphrase, a paraphrase dataset that contains human preferences for pairs of ad texts that are semantically equivalent but differ in terms of wording and style. This dataset allows for preference analysis that focuses on the differences in linguistic features. Our analysis revealed that ad texts preferred by human judges have higher fluency, longer length, more nouns, and use of bracket symbols. Furthermore, we demonstrate that an ad text-generation model that considers these findings significantly improves the attractiveness of a given text. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase.
Considering Length Diversity in Retrieval-Augmented Summarization
Juseon-Do | Jaesung Hwang | Jingun Kwon | Hidetaka Kamigaito | Manabu Okumura
Findings of the Association for Computational Linguistics: NAACL 2025
This study investigates retrieval-augmented summarization by specifically examining the impact of exemplar summary lengths because previous methods have not considered length constraints. We propose a Diverse Length-aware Maximal Marginal Relevance (DL-MMR) algorithm to better control summary lengths. This algorithm combines the query relevance with diverse target lengths in retrieval-augmented summarization. Unlike previous methods that necessitate exhaustive exemplar-exemplar relevance comparisons using MMR, DL-MMR considers the exemplar target length as well and avoids comparing exemplars to each other, thereby reducing computational cost and conserving memory during the construction of an exemplar pool. Experimental results showed the effectiveness of DL-MMR, which considers length diversity, compared to the original MMR algorithm. DL-MMR additionally achieved a 781,513-fold memory saving and a 500,092-fold reduction in computational cost, while maintaining the same level of informativeness.
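For context, the original MMR procedure that this abstract contrasts with can be sketched as follows. This is the classic greedy formulation, not DL-MMR itself; the function name `mmr_select`, the trade-off parameter `lam`, and the toy scores are assumptions for illustration. The `similarity[i][j]` lookups are the exemplar-exemplar comparisons that, according to the abstract, DL-MMR avoids.

```python
from typing import List, Sequence


def mmr_select(
    relevance: Sequence[float],
    similarity: Sequence[Sequence[float]],
    k: int,
    lam: float = 0.5,
) -> List[int]:
    """Greedy MMR: trade query relevance against redundancy with already-selected items."""
    selected: List[int] = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Items 0 and 1 are near-duplicates; item 2 is less relevant but different.
relevance = [0.9, 0.85, 0.3]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
picked = mmr_select(relevance, similarity, k=2, lam=0.5)
print(picked)  # [0, 2]
```

Because every candidate is compared against every selected item, the pairwise similarity table grows quadratically with the pool size, which is the cost the abstract says DL-MMR reduces.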
Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models
Shintaro Ozaki | Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Katsuhiko Hayashi | Taro Watanabe
Findings of the Association for Computational Linguistics: NAACL 2025
As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and the demand for explanations generated by LVLMs is expected to grow. However, pre-training of the Vision Encoder and the integrated training of LLMs with the Vision Encoder are mainly conducted using English training data, leaving it uncertain whether LVLMs can fully realize their potential when generating explanations in languages other than English. In addition, multilingual QA benchmarks whose datasets are created using machine translation contain cultural differences and biases, which remain issues for their use as evaluation tasks. To address these challenges, this study created an extended dataset in multiple languages without relying on machine translation. This dataset, which takes into account nuances and country-specific phrases, was then used to evaluate the explanation generation abilities of LVLMs. Furthermore, this study examined whether Instruction-Tuning in resource-rich English improves performance in other languages. Our findings indicate that LVLMs perform worse in languages other than English compared to English. In addition, it was observed that LVLMs struggle to effectively manage the knowledge learned from English data.
Efficient Nearest Neighbor based Uncertainty Estimation for Natural Language Processing Tasks
Wataru Hashimoto | Hidetaka Kamigaito | Taro Watanabe
Findings of the Association for Computational Linguistics: NAACL 2025
Trustworthiness in model predictions is crucial for safety-critical applications in the real world. However, deep neural networks often suffer from the issues of uncertainty estimation, such as miscalibration. In this study, we propose k-Nearest Neighbor Uncertainty Estimation (kNN-UE), which is a new uncertainty estimation method that uses not only the distances from the neighbors, but also the ratio of labels in the neighbors. Experiments on sentiment analysis, natural language inference, and named entity recognition show that our proposed method outperforms the baselines and recent density-based methods in several calibration and uncertainty metrics. Moreover, our analyses indicate that approximate nearest neighbor search techniques reduce the inference overhead without significantly degrading the uncertainty estimation performance when they are appropriately combined.
Long-Tail Crisis in Nearest Neighbor Language Models
Yuto Nishida | Makoto Morishita | Hiroyuki Deguchi | Hidetaka Kamigaito | Taro Watanabe
Findings of the Association for Computational Linguistics: NAACL 2025
The k-nearest-neighbor language model (kNN-LM), one of the retrieval-augmented language models, improves the perplexity for given text by directly accessing a large datastore built from any text data during inference. A widely held hypothesis for the success of kNN-LM is that its explicit memory, i.e., the datastore, enhances predictions for long-tail phenomena. However, prior works have primarily shown its ability to retrieve long-tail contexts, leaving the model’s performance in estimating the probabilities of long-tail target tokens during inference underexplored. In this paper, we investigate the behavior of kNN-LM on low-frequency tokens, examining prediction probability, retrieval accuracy, and token distribution in the datastore. Our experimental results reveal that kNN-LM does not improve prediction performance for low-frequency tokens but mainly benefits high-frequency tokens regardless of long-tail contexts in the datastore.
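To make the datastore mechanism concrete, here is a minimal sketch of how kNN-LM is commonly formulated: retrieved neighbors are turned into a distribution over target tokens, which is then interpolated with the base LM distribution. The function names, the interpolation weight `lam`, the temperature, and the toy numbers are assumptions for illustration, not details from this paper.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def knn_distribution(
    neighbors: List[Tuple[str, float]], temperature: float = 1.0
) -> Dict[str, float]:
    """Softmax over negative distances, aggregated per target token."""
    weights: Dict[str, float] = defaultdict(float)
    for token, distance in neighbors:
        weights[token] += math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(
    p_lm: Dict[str, float], p_knn: Dict[str, float], lam: float = 0.25
) -> Dict[str, float]:
    """Mix the two distributions: lam * p_kNN + (1 - lam) * p_LM."""
    vocab = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab
    }


# Hypothetical next-token example: the base LM is undecided,
# while two of three retrieved neighbors point to "Paris".
p_lm = {"Paris": 0.5, "London": 0.5}
retrieved = [("Paris", 0.0), ("Paris", 0.0), ("London", 0.0)]
p_final = interpolate(p_lm, knn_distribution(retrieved), lam=0.25)
print(p_final)
```

A token that rarely appears as a target in the datastore receives little mass from the retrieval term, which is one way to read the abstract's finding about low-frequency tokens.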
AdParaphrase v2.0: Generating Attractive Ad Texts Using a Preference-Annotated Paraphrase Dataset
Soichiro Murakami | Peinan Zhang | Hidetaka Kamigaito | Hiroya Takamura | Manabu Okumura
Findings of the Association for Computational Linguistics: ACL 2025
Identifying factors that make ad text attractive is essential for advertising success. This study proposes AdParaphrase v2.0, a dataset for ad text paraphrasing, containing human preference data, to enable the analysis of the linguistic factors and to support the development of methods for generating attractive ad texts. Compared with v1.0, this dataset is 20 times larger, comprising 16,460 ad text paraphrase pairs, each annotated with preference data from ten evaluators, thereby enabling a more comprehensive and reliable analysis. Through the experiments, we identified multiple linguistic features of engaging ad texts that were not observed in v1.0 and explored various methods for generating attractive ad texts. Furthermore, our analysis demonstrated the relationships between human preference and ad performance, and highlighted the potential of reference-free metrics based on large language models for evaluating ad text attractiveness. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase-v2.0.
Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries
Haruki Sakajo | Yusuke Ide | Justin Vasselli | Yusuke Sakai | Yingtao Tian | Hidetaka Kamigaito | Taro Watanabe
Findings of the Association for Computational Linguistics: ACL 2025
Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
Decoding Uncertainty: The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models
Wataru Hashimoto | Hidetaka Kamigaito | Taro Watanabe
Findings of the Association for Computational Linguistics: EMNLP 2025
Decoding strategies manipulate the probability distribution underlying the output of a language model and can therefore affect both generation quality and its uncertainty. In this study, we investigate the impact of decoding strategies on uncertainty estimation in Large Language Models (LLMs). Our experiments show that Contrastive Search, which mitigates repetition, yields better uncertainty estimates on average across a range of preference-aligned LLMs. In contrast, the benefits of these strategies sometimes diverge when the model is only post-trained with supervised fine-tuning, i.e. without explicit alignment.
Length Representations in Large Language Models
Sangjun Moon | Dasom Choi | Jingun Kwon | Hidetaka Kamigaito | Manabu Okumura
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) have shown remarkable capabilities across various tasks, which are learned from massive amounts of text-based data. Although LLMs can control output sequence length, particularly in instruction-based settings, the internal mechanisms behind this control remain unexplored. In this study, we provide empirical evidence on how output sequence length information is encoded within the internal representations in LLMs. In particular, our findings show that multi-head attention mechanisms are critical in determining output sequence length, which can be adjusted in a disentangled manner. By scaling specific hidden units within the model, we can control the output sequence length without losing the informativeness of the generated text, thereby indicating that length information is partially disentangled from semantic information. Moreover, some hidden units become increasingly active as prompts become more length-specific, thus reflecting the model’s internal awareness of this attribute. Our findings suggest that LLMs have learned robust and adaptable internal mechanisms for controlling output length without any external control.
Superfluous Instruction: Vulnerabilities Stemming from Task-Specific Superficial Expressions in Instruction Templates
Toma Suzuki | Yusuke Sakai | Justin Vasselli | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)
Large language models (LLMs) achieve high performance through instruction-tuning, which involves learning various tasks using instruction templates. However, these templates often contain task-specific expressions, which are words that frequently appear in certain contexts but do not always convey the actual meaning of that context, even if they seem closely related to the target task. Biases inherent in such instruction templates may be learned by LLMs during training, potentially degrading performance when the models encounter superficial expressions. In this study, we propose a method that adds instructions to FLAN templates, without altering the base instruction, to produce “superfluous instructions”. This allows us to investigate the vulnerabilities of LLMs caused by overfitting to task-specific expressions embedded in instruction templates. The experimental results revealed that the inclusion of superficial words strongly related to each task in the instruction text can alter the output, regardless of the intended meaning.
How to Make the Most of LLMs’ Grammatical Knowledge for Acceptability Judgments
Yusuke Ide | Yuto Nishida | Justin Vasselli | Miyu Oba | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The grammatical knowledge of language models (LMs) is often measured using a benchmark of linguistic minimal pairs, where LMs are presented with a pair of acceptable and unacceptable sentences and required to judge which is more acceptable. Conventional approaches compare sentence probabilities directly, but large language models (LLMs) provide nuanced evaluation methods using prompts and templates. We therefore investigate how to derive the most accurate acceptability judgments from LLMs to comprehensively evaluate their grammatical knowledge. Through extensive experiments in both English and Chinese, we compare nine judgment methods and demonstrate that two of them, in-template LP (a probability readout method) and Yes/No probability computing (a prompting-based method), achieve higher accuracy than the conventional approach. Our analysis reveals that the top two methods excel in different linguistic phenomena, suggesting they access different aspects of the LLMs’ grammatical knowledge. We find that ensembling the two methods achieves even higher accuracy. Consequently, we recommend these techniques, either individually or ensembled, as more effective alternatives to conventional approaches for assessing grammatical knowledge in LLMs.
Tonguescape: Exploring Language Models Understanding of Vowel Articulation
Haruki Sakajo | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Vowels are primarily characterized by tongue position. Humans have discovered these features of vowel articulation through their own experience and explicit objective observation such as using MRI. With this knowledge and our experience, we can explain and understand the relationship between tongue positions and vowels, and this knowledge is helpful for language learners to learn pronunciation. Since language models (LMs) are trained on a large amount of data that includes linguistic and medical fields, our preliminary studies indicate that an LM is able to explain the pronunciation mechanisms of vowels. However, it is unclear whether multi-modal LMs, such as vision LMs, align textual information with visual information. One question arises: do LMs associate real tongue positions with vowel articulation? In this study, we created video and image datasets from the existing real-time MRI dataset and investigated whether LMs can understand vowel articulation based on tongue positions using vision-based information. Our findings suggest that LMs exhibit potential for understanding vowels and tongue positions when reference examples are provided while they have difficulties without them. Our code for dataset building is available on GitHub.
Reliability of Distribution Predictions by LLMs: Insights from Counterintuitive Pseudo-Distributions
Toma Suzuki | Ayuki Katayama | Seiji Gobara | Ryo Tsujimoto | Hibiki Nakatani | Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
The proportion of responses to a question and its options, known as the response distribution, enables detailed analysis of human society. Recent studies highlight the use of Large Language Models (LLMs) for predicting response distributions as a cost-effective survey method. However, the reliability of these predictions remains unclear. LLMs often generate answers by blindly following instructions rather than applying rational reasoning based on pretraining-acquired knowledge. This study investigates whether LLMs can rationally estimate distributions when presented with explanations of “artificially generated distributions” that are against commonsense. Specifically, we assess whether LLMs recognize counterintuitive explanations and adjust their predictions or simply follow these inconsistent explanations. Results indicate that smaller or less human-optimized LLMs tend to follow explanations uncritically, while larger or more optimized models are better at resisting counterintuitive explanations by leveraging their pretraining-acquired knowledge. These findings shed light on factors influencing distribution prediction performance in LLMs and are crucial for developing reliable distribution predictions using language models.
Knowledge Editing Induces Underconfidence in Language Models
Ryo Hasegawa | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)
As language models continue to scale, the demand for knowledge editing, a retraining-free knowledge update method, has increased. However, since knowledge editing directly alters token prediction probabilities acquired during pretraining, the probabilities may diverge from the empirical distribution. In this study, we analyze the impact of knowledge editing to compare the alignment between token prediction probabilities and task accuracy by calculating confidence calibration before and after knowledge editing. Our results reveal that, for tasks requiring semantic understanding, the range of increase in token prediction probabilities tends to be smaller than that of accuracy improvement, suggesting that knowledge editing methods lead to less confidence in prediction.
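The comparison between prediction probabilities and accuracy described in this abstract is often summarized with expected calibration error (ECE). The sketch below is a generic ECE computation under that assumption; the abstract does not state which calibration measure the paper uses, and the function name, bin count, and toy data are hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between accuracy and mean confidence per confidence bin."""
    n = len(confidences)
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# An underconfident toy model: every answer is right, but confidence is only 0.6.
confidences = [0.6, 0.6, 0.6, 0.6]
correct = [1, 1, 1, 1]
ece = expected_calibration_error(confidences, correct)
print(round(ece, 3))  # 0.4
```

In this toy case accuracy exceeds confidence, which is the direction of miscalibration the abstract calls underconfidence.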
2024
Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates
Yusuke Sakai | Adam Nohejl | Jiangnan Hang | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed for specific prompts is inappropriate for instruction tuning, which aims to perform well with any prompt. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores between templates. Comprehensive analysis of English and Japanese LLMs reveals that the high variance among templates has a significant impact on the fair evaluation of LLMs.
pdf
bib
abs
Can we obtain significant success in RST discourse parsing by using Large Language Models?
Aru Maekawa
|
Tsutomu Hirao
|
Hidetaka Kamigaito
|
Manabu Okumura
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Recently, decoder-only pre-trained large language models (LLMs), with several tens of billion parameters, have significantly impacted a wide range of natural language processing (NLP) tasks. While encoder-only or encoder-decoder pre-trained language models have already proved to be effective in discourse parsing, the extent to which LLMs can perform this task remains an open research question. Therefore, this paper explores how beneficial such LLMs are for Rhetorical Structure Theory (RST) discourse parsing. Here, the parsing process for both fundamental top-down and bottom-up strategies is converted into prompts, which LLMs can work with. We employ Llama 2 and fine-tune it with QLoRA, which has fewer parameters that can be tuned. Experimental results on three benchmark datasets, RST-DT, Instr-DT, and the GUM corpus, demonstrate that Llama 2 with 70 billion parameters in the bottom-up strategy obtained state-of-the-art (SOTA) results with significant differences. Furthermore, our parsers demonstrated generalizability when evaluated on RST-DT: in spite of being trained with the GUM corpus, they obtained performance similar to that of existing parsers trained with RST-DT.
pdf
bib
abs
Generating Diverse Translation with Perturbed kNN-MT
Yuto Nishida
|
Makoto Morishita
|
Hidetaka Kamigaito
|
Taro Watanabe
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Generating multiple translation candidates would enable users to choose the one that satisfies their needs. Although there has been work on diversified generation, there exists room for improving the diversity mainly because the previous methods do not address the overcorrection problem: the model underestimates a prediction that is largely different from the training data, even if that prediction is likely. This paper proposes methods that generate more diverse translations by introducing perturbed k-nearest neighbor machine translation (kNN-MT). Our methods expand the search space of kNN-MT and help incorporate diverse words into candidates by addressing the overcorrection problem. Our experiments show that the proposed methods drastically improve candidate diversity and control the degree of diversity by tuning the perturbation’s magnitude.
pdf
bib
abs
Are Data Augmentation Methods in Named Entity Recognition Applicable for Uncertainty Estimation?
Wataru Hashimoto
|
Hidetaka Kamigaito
|
Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
This work investigates the impact of data augmentation on confidence calibration and uncertainty estimation in Named Entity Recognition (NER) tasks. For the future advance of NER in safety-critical fields like healthcare and finance, it is essential to achieve accurate predictions with calibrated confidence when applying Deep Neural Networks (DNNs), including Pre-trained Language Models (PLMs), as a real-world application. However, DNNs are prone to miscalibration, which limits their applicability. Moreover, existing methods for calibration and uncertainty estimation are computationally expensive. Our investigation in NER found that data augmentation improves calibration and uncertainty in cross-genre and cross-lingual settings, and especially in the in-domain setting. Furthermore, we showed that the calibration for NER tends to be more effective when the perplexity of the sentences generated by data augmentation is lower, and that increasing the size of the augmentation further improves calibration and uncertainty.
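Confidence calibration of the kind discussed in this abstract is commonly quantified with the expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. The sketch below is a generic ECE computation, not necessarily the metric or binning used in the paper; the function name and the choice of 10 equal-width bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE: weighted mean of |avg confidence - accuracy| over bins.

    confidences: predicted probabilities in [0, 1], one per prediction.
    correct: booleans, True where the prediction was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four predictions made with 0.9 confidence, of which three are correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A lower value means confidence tracks accuracy more closely; a model that is always right when it reports confidence 1.0 scores 0.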
pdf
bib
abs
Exploring Intrinsic Language-specific Subspaces in Fine-tuning Multilingual Neural Machine Translation
Zhe Cao
|
Zhi Qu
|
Hidetaka Kamigaito
|
Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Multilingual neural machine translation models support fine-tuning hundreds of languages simultaneously. However, fine-tuning solely on full parameters is inefficient, potentially leading to negative interactions among languages. In this work, we demonstrate that the fine-tuning for a language occurs in its intrinsic language-specific subspace with a tiny fraction of the entire parameters. Thus, we propose language-specific LoRA to isolate intrinsic language-specific subspaces. Furthermore, we propose architecture learning techniques and introduce a gradual pruning schedule during fine-tuning to exhaustively explore the optimal setting and the minimal intrinsic subspaces for each language, resulting in a lightweight yet effective fine-tuning procedure. The experimental results on a 12-language subset and a 30-language subset of FLORES-101 show that our methods not only outperform full-parameter fine-tuning by up to 2.25 spBLEU but also reduce trainable parameters to 0.4% for high- and medium-resource languages and 1.6% for low-resource ones.
pdf
bib
Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters
Zhiyu Guo
|
Hidetaka Kamigaito
|
Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
pdf
bib
abs
Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model
Mana Makinae
|
Yusuke Sakai
|
Hidetaka Kamigaito
|
Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Simultaneous Speech Translation (SiST) begins translating before the entire source input is received, making it crucial to balance quality and latency. In real interpreting situations, interpreters manage this simultaneity by breaking sentences into smaller segments and translating them while maintaining the source order as much as possible. SiST could benefit from this approach to balance quality and latency. However, current corpora used for simultaneous tasks often involve significant word reordering in translation, which is not ideal given that interpreters faithfully follow source syntax as much as possible. Inspired by conference interpreting by humans utilizing the salami technique, we introduce the Simul-MuST-C, a dataset created by leveraging the Large Language Model (LLM), specifically GPT-4o, which aligns the target text as closely as possible to the source text by using minimal chunks that contain enough information to be interpreted. Experiments on three language pairs show that the effectiveness of segmented-base monotonicity in training data varies with the grammatical distance between the source and the target, with grammatically distant language pairs benefiting the most in achieving quality while minimizing latency.
pdf
bib
abs
Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair
Yusuke Sakai
|
Mana Makinae
|
Hidetaka Kamigaito
|
Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
In Simultaneous Machine Translation (SiMT), training with a simultaneous interpretation (SI) corpus is an effective method for achieving high quality yet low latency. However, constructing such a corpus is challenging due to high costs and limitations in annotator capabilities; as a result, existing SI corpora are limited. Therefore, we propose a method to convert existing speech translation (ST) corpora into interpretation-style corpora, maintaining the original word order and preserving the entire source content using Large Language Models (LLM-SI-Corpus). We demonstrate that fine-tuning SiMT models using the LLM-SI-Corpus reduces latency while achieving better quality compared to models fine-tuned with other corpora in both speech-to-text and text-to-text settings. The LLM-SI-Corpus is available at https://github.com/yusuke1997/LLM-SI-Corpus.
pdf
bib
mbrs: A Library for Minimum Bayes Risk Decoding
Hiroyuki Deguchi
|
Yusuke Sakai
|
Hidetaka Kamigaito
|
Taro Watanabe
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
pdf
bib
abs
InstructCMP: Length Control in Sentence Compression through Instruction-based Large Language Models
Juseon-Do
|
Jingun Kwon
|
Hidetaka Kamigaito
|
Manabu Okumura
Findings of the Association for Computational Linguistics: ACL 2024
Extractive summarization can produce faithful summaries but often requires additional constraints such as a desired summary length. Traditional sentence compression models do not typically consider the constraints because of their restricted model abilities, which require model modifications for coping with them. To bridge this gap, we propose Instruction-based Compression (InstructCMP), an approach to the sentence compression task that can consider the length constraint through instructions by leveraging the zero-shot task-solving abilities of Large Language Models (LLMs). For this purpose, we created new evaluation datasets by transforming traditional sentence compression datasets into an instruction format. By using the datasets, we first reveal that the current LLMs still face challenges in accurately controlling the length for a compressed text. To address this issue, we propose an approach named length priming, that incorporates additional length information into the instructions without external resources. While the length priming effectively works in a zero-shot setting, a training dataset with the instructions would further improve the ability of length control. Thus, we additionally created a training dataset in an instruction format to fine-tune the model on it. Experimental results and analysis show that applying the length priming significantly improves performances of InstructCMP in both zero-shot and fine-tuning settings without the need of any model modifications.
pdf
bib
abs
Centroid-Based Efficient Minimum Bayes Risk Decoding
Hiroyuki Deguchi
|
Yusuke Sakai
|
Hidetaka Kamigaito
|
Taro Watanabe
|
Hideki Tanaka
|
Masao Utiyama
Findings of the Association for Computational Linguistics: ACL 2024
Minimum Bayes risk (MBR) decoding achieved state-of-the-art translation performance by using COMET, a neural metric that has a high correlation with human evaluation. However, MBR decoding requires quadratic time since it computes the expected score between a translation hypothesis and all reference translations. We propose centroid-based MBR (CBMBR) decoding to improve the speed of MBR decoding. Our method clusters the reference translations in the feature space, and then calculates the score using the centroids of each cluster. The experimental results show that our CBMBR not only improved the decoding speed of the expected score calculation 5.7 times, but also outperformed vanilla MBR decoding in translation quality by up to 0.5 COMET in the WMT’22 En↔Ja, En↔De, En↔Zh, and WMT’23 En↔Ja translation tasks.
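The contrast between scoring against every reference and scoring against cluster centroids can be sketched on toy vectors. This is a minimal illustration under stated assumptions, not the authors' implementation: cosine similarity stands in for the COMET-based utility, the 2-D vectors stand in for sentence embeddings, the plain k-means routine and the unweighted average over centroids are assumed choices, and all function names are hypothetical.

```python
import math
import random


def cosine(u, v):
    # Stand-in utility; the paper uses a COMET-based score instead.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def expected_utility(hyp, supports):
    return sum(cosine(hyp, s) for s in supports) / len(supports)


def vanilla_mbr(hyps, refs):
    # Every hypothesis is scored against every reference: O(N * M) utility calls.
    scores = [expected_utility(h, refs) for h in hyps]
    return max(range(len(hyps)), key=scores.__getitem__)


def kmeans(points, k, iters=20, seed=0):
    # Plain Lloyd's algorithm with random initial centroids (assumed variant).
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def centroid_mbr(hyps, refs, k):
    # Hypotheses are scored against k centroids only: O(N * k) utility calls.
    centroids = kmeans(refs, k)
    scores = [expected_utility(h, centroids) for h in hyps]
    return max(range(len(hyps)), key=scores.__getitem__)


# Toy "embeddings"; the hypotheses double as pseudo-references.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0]]
print(vanilla_mbr(embeddings, embeddings), centroid_mbr(embeddings, embeddings, k=2))
```

With k much smaller than the number of references, the scoring step needs far fewer utility evaluations, which is the source of the speed-up described in the abstract.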
pdf
bib
abs
mCSQA: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans
Yusuke Sakai
|
Hidetaka Kamigaito
|
Taro Watanabe
Findings of the Association for Computational Linguistics: ACL 2024
It is very challenging to curate a dataset for language-specific knowledge and common sense in order to evaluate natural language understanding capabilities of language models. Due to the limitation in the availability of annotators, most current multilingual datasets are created through translation, which cannot evaluate such language-specific aspects. Therefore, we propose Multilingual CommonsenseQA (mCSQA) based on the construction process of CSQA but leveraging language models for a more efficient construction, e.g., by asking LMs to generate questions/answers, refine answers, and verify QAs, followed by reduced human effort for verification. The constructed dataset is a benchmark for cross-lingual language-transfer capabilities of multilingual LMs, and experimental results showed high language-transfer capabilities for questions that LMs could easily solve, but lower transfer capabilities for questions requiring deep knowledge or commonsense. This highlights the necessity of language-specific datasets for evaluation and training. Finally, our method demonstrated that multilingual LMs could create QA including language-specific knowledge, significantly reducing the dataset creation cost compared to manual creation. The datasets are available at https://huggingface.co/datasets/yusuke1997/mCSQA.
pdf
bib
abs
Cross-lingual Contextualized Phrase Retrieval
Huayang Li
|
Deng Cai
|
Zhi Qu
|
Qu Cui
|
Hidetaka Kamigaito
|
Lemao Liu
|
Taro Watanabe
Findings of the Association for Computational Linguistics: EMNLP 2024
Phrase-level dense retrieval has shown many appealing characteristics in downstream NLP tasks by leveraging the fine-grained information that phrases offer. In our work, we propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval, which aims to augment cross-lingual applications by addressing polysemy using context information. However, the lack of specific training data and models is the primary challenge to achieving our goal. To address this, we extract pairs of cross-lingual phrases using word alignment information automatically induced from parallel sentences. Subsequently, we train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning, which encourages the hidden representations of phrases with similar contexts and semantics to align closely. Comprehensive experiments on both the cross-lingual phrase retrieval task and a downstream task, i.e., machine translation, demonstrate the effectiveness of CCPR. On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher. When utilizing CCPR to augment the large-language-model-based translator, it achieves average gains of 0.7 and 1.5 in BERTScore for translations from X=>En and vice versa, respectively, on the WMT16 dataset. We will release our code and data.
pdf
bib
abs
Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches
Tsutomu Hirao
|
Naoki Kobayashi
|
Hidetaka Kamigaito
|
Manabu Okumura
|
Akisato Kimura
Findings of the Association for Computational Linguistics: EMNLP 2024
This paper tackles a new task: discourse parsing for videos, inspired by text discourse parsing based on Rhetorical Structure Theory (RST). The task aims to construct an RST tree for a video to represent its storyline and illustrate the event relationships. We first construct a benchmark dataset by identifying events with their time spans, providing corresponding captions, and constructing RST trees with events as leaves. We then evaluate baseline approaches to video RST parsing: the ‘parsing after captioning’ framework and parsing via visual features. The results show that a parser using gold captions performed the best, while parsers relying on generated captions performed the worst; a parser using visual features provided intermediate performance. However, we observed that parsing via visual features could be improved by pre-training it with video captioning designed to produce a coherent video story. Furthermore, we demonstrated that RST trees obtained from videos contribute to multimodal summarization consisting of keyframes with texts.
2023
pdf
bib
abs
Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models
Hidetaka Kamigaito
|
Katsuhiko Hayashi
|
Taro Watanabe
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
In this paper, we propose a table and image generation task to verify how the knowledge about entities acquired from natural language is retained in Vision & Language (V & L) models. This task consists of two parts: the first is to generate a table containing knowledge about an entity and its related image, and the second is to generate an image from an entity with a caption and a table containing related knowledge of the entity. In both tasks, the model must know the entities used to perform the generation properly. We created the Wikipedia Table and Image Generation (WikiTIG) dataset from about 200,000 infoboxes in English Wikipedia articles to perform the proposed tasks. We evaluated the performance on the tasks with respect to the above research question using the V & L model OFA, which has achieved state-of-the-art results in multiple tasks. Experimental results show that OFA forgets part of its entity knowledge by pre-training as a complement to improve the performance of image related tasks.
pdf
bib
abs
Bidirectional Transformer Reranker for Grammatical Error Correction
Ying Zhang
|
Hidetaka Kamigaito
|
Manabu Okumura
Findings of the Association for Computational Linguistics: ACL 2023
Pre-trained seq2seq models have achieved state-of-the-art results in the grammatical error correction task. However, these models still suffer from a prediction bias due to their unidirectional decoding. Thus, we propose a bidirectional Transformer reranker (BTR) that re-estimates the probability of each candidate sentence generated by the pre-trained seq2seq model. The BTR preserves the seq2seq-style Transformer architecture but utilizes a BERT-style self-attention mechanism in the decoder to compute the probability of each target token by using masked language modeling to capture bidirectional representations from the target context. For guiding the reranking, the BTR adopts negative sampling in the objective function to minimize the unlikelihood. During inference, the BTR gives final results after comparing the reranked top-1 results with the original ones by an acceptance threshold. Experimental results show that, in reranking candidates from a pre-trained seq2seq model, T5-base, the BTR on top of T5-base could yield 65.47 and 71.27 F0.5 scores on the CoNLL-14 and BEA test sets, respectively, and a 59.52 GLEU score on the JFLEG corpus, with improvements of 0.36, 0.76, and 0.48 points compared with the original T5-base. Furthermore, when reranking candidates from T5-large, the BTR on top of T5-base improved the original T5-large by 0.26 points on the BEA test set.