Masafumi Oyamada


2025

pdf bib
On Synthesizing Data for Context Attribution in Question Answering
Gorjan Radevski | Kiril Gashteovski | Shahbaz Syed | Christopher Malon | Sebastien Nicolas | Chia-Chien Hung | Timo Sztyler | Verena Heußer | Wiem Ben Rim | Masafumi Enomoto | Kunihiro Takeoka | Masafumi Oyamada | Goran Glavaš | Carolin Lawrence
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Question Answering (QA) accounts for a significant portion of LLM usage “in the wild”. However, LLMs sometimes produce false or misleading responses, also known as “hallucinations”. Therefore, grounding the generated answers in contextually provided information—i.e., providing evidence for the generated text—is paramount for LLMs’ trustworthiness. Providing this information is the task of context attribution. In this paper, we systematically study LLM-based approaches for this task, namely we investigate (i) zero-shot inference, (ii) LLM ensembling, and (iii) fine-tuning of small LMs on synthetic data generated by larger LLMs. Our key contribution is SynQA: a novel generative strategy for synthesizing context attribution data. Given selected context sentences, an LLM generates QA pairs that are supported by these sentences. This leverages LLMs’ natural strengths in text generation while ensuring clear attribution paths in the synthetic training data. We show that the attribution data synthesized via SynQA is highly effective for fine-tuning small LMs for context attribution in different QA tasks and domains. Finally, with a user study, we validate the usefulness of small LMs (fine-tuned on synthetic data from SynQA) in context attribution for QA.
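
A minimal sketch of the generative step the abstract describes: given context sentences selected as the attribution target, an LLM is prompted to produce a QA pair that those sentences support, so the selected sentences double as gold attribution labels. The prompt wording, the OpenAI client usage, and the model name below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative SynQA-style synthesis step (not the paper's code): prompt an
# LLM to write a QA pair that is fully supported by the selected sentences.
from openai import OpenAI  # assumed client; any chat-completion API would do

client = OpenAI()

def synthesize_qa(selected_sentences: list[str], model: str = "gpt-4o-mini") -> str:
    """Return a QA pair grounded in the given sentences (hypothetical prompt)."""
    context = " ".join(selected_sentences)
    prompt = (
        "Write one question and its answer such that the answer is fully "
        "supported by the following sentences and nothing else:\n"
        f"{context}\n"
        "Format: Question: ... Answer: ..."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# The selected sentences then serve as the attribution target for the
# generated QA pair when fine-tuning a small context-attribution model.
```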

pdf bib
Understanding the Impact of Confidence in Retrieval Augmented Generation: A Case Study in the Medical Domain
Shintaro Ozaki | Yuta Kato | Siyuan Feng | Masayo Tomita | Kazuki Hayashi | Wataru Hashimoto | Ryoma Obara | Masafumi Oyamada | Katsuhiko Hayashi | Hidetaka Kamigaito | Taro Watanabe
ACL 2025

Retrieval Augmented Generation (RAG) complements the knowledge of Large Language Models (LLMs) by leveraging external information to enhance response accuracy for queries. This approach is widely applied across fields, taking advantage of its ability to inject the most up-to-date information, and researchers are focusing on understanding and improving this aspect to unlock the full potential of RAG in high-stakes applications. However, despite the potential of RAG to address these needs, the mechanisms behind the confidence levels of its outputs remain underexplored. Our study focuses on the impact of RAG, specifically examining whether RAG increases the confidence of LLM outputs in the medical domain. We conduct this analysis across various configurations and models. We evaluate confidence by treating the model’s predicted probability as its output and calculating several evaluation metrics, including calibration error, entropy, best probability, and accuracy. Experimental results across multiple datasets confirm that certain models possess the capability to judge for themselves whether an inserted document relates to the correct answer. These results suggest that evaluating models based on their output probabilities can determine whether they are suited to function as generators in the RAG framework. Our approach also allows us to evaluate whether the models properly handle retrieved documents.
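
The confidence metrics named in the abstract (calibration error, entropy, best probability, accuracy) can be computed from per-example predicted class probabilities roughly as below. This is a generic sketch assuming a standard binned expected-calibration-error variant, not the paper's evaluation code.

```python
# Generic sketch of the confidence metrics mentioned in the abstract,
# computed from per-example predicted probability distributions.
import numpy as np

def confidence_metrics(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> dict:
    """probs: (N, C) predicted probabilities; labels: (N,) gold class indices."""
    best_prob = probs.max(axis=1)                       # model confidence per example
    preds = probs.argmax(axis=1)
    accuracy = (preds == labels).mean()
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1).mean()

    # Expected calibration error: bin examples by confidence and compare
    # average confidence with accuracy inside each bin (one common variant).
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (best_prob > lo) & (best_prob <= hi)
        if mask.any():
            ece += mask.mean() * abs((preds[mask] == labels[mask]).mean() - best_prob[mask].mean())

    return {"accuracy": accuracy, "entropy": entropy,
            "best_prob": best_prob.mean(), "ece": ece}
```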

pdf bib
Can Large Language Models Invent Algorithms to Improve Themselves?
Yoichi Ishibashi | Taro Yano | Masafumi Oyamada
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Large Language Models (LLMs) have shown remarkable performance improvements and are rapidly gaining adoption in industry. However, the methods for improving LLMs are still designed by humans, which restricts the invention of new model-improving algorithms to human expertise and imagination. To address this, we propose the Self-Developing framework, which enables LLMs to autonomously generate and learn model-improvement algorithms. In this framework, the seed model generates, applies, and learns model-improving algorithms, continuously improving both the seed model and the algorithms themselves. Among model-improving strategies, we focus on model merging algorithms. In mathematical reasoning tasks, Self-Developing discovers novel merging strategies and outperforms human-designed methods. On GSM8k, the discovered algorithms improve the seed model by 6% and surpass human-designed methods by 4.3%. Moreover, they exhibit strong transferability, achieving a 7.4% performance gain on out-of-domain models. These results suggest that LLMs can autonomously develop effective model-improvement techniques beyond human intuition.
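
For reference, the object a model-merging algorithm manipulates is the set of model parameters; the sketch below shows plain weighted averaging of two checkpoints' state dicts. It is only an illustrative baseline of what such an algorithm looks like, not one of the algorithms discovered by the Self-Developing framework.

```python
# Illustrative baseline for model merging: weighted averaging of two state
# dicts. The Self-Developing framework searches for merging algorithms of
# this general kind, but more sophisticated than simple interpolation.
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate matching parameters of two checkpoints."""
    merged = {}
    for name, param_a in sd_a.items():
        param_b = sd_b[name]
        merged[name] = alpha * param_a + (1.0 - alpha) * param_b
    return merged

# Usage (hypothetical checkpoint paths):
# merged = merge_state_dicts(torch.load("seed_model.pt"), torch.load("expert.pt"), alpha=0.6)
```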

2024

pdf bib
Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing
Haochen Zhang | Yuyang Dong | Chuan Xiao | Masafumi Oyamada
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format. We instruction-tune local LLMs as universal DP task solvers that operate on a single, local, low-priced GPU, ensuring data security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, our models, Jellyfish-7B/8B/13B, are competitive with GPT-3.5/4 models and generalize well to unseen tasks while barely compromising the base models’ abilities in NLP tasks. Meanwhile, Jellyfish offers enhanced reasoning capabilities compared to GPT-3.5. Our models are available at https://huggingface.co/NECOUDBFM/Jellyfish, and our instruction dataset is available at https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct
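
A hedged usage sketch for the released model: the repository id NECOUDBFM/Jellyfish comes from the abstract, while the assumption that it loads as a standard causal LM through Hugging Face transformers, the prompt wording, and the generation settings are illustrative.

```python
# Hedged usage sketch: load Jellyfish as a causal LM and ask it to solve a
# data-preprocessing instruction (entity matching, in this toy example).
# Prompt wording and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NECOUDBFM/Jellyfish"  # repository named in the abstract
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

instruction = (
    "Are these two records the same entity?\n"
    "Record A: {name: 'Apple MacBook Pro 14', price: 1999}\n"
    "Record B: {name: 'MacBook Pro 14-inch (Apple)', price: 1999}\n"
    "Answer yes or no."
)
inputs = tokenizer(instruction, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```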

pdf bib
Relevance, Diversity, and Exclusivity: Designing Keyword-augmentation Strategy for Zero-shot Classifiers
Taro Yano | Kunihiro Takeoka | Masafumi Oyamada
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

Zero-shot text classification involves categorizing text into classes without labeled data, typically using a pre-trained language model to compute the correlation between text and class names. This makes it essential for class names to contain sufficient information. Existing methods incorporate semantically similar keywords related to class names, but the properties of effective keywords remain unclear. We demonstrate that effective keywords should possess three properties: 1) keyword relevance to the task objective, 2) inter-class exclusivity, and 3) intra-class diversity. We also propose an automatic method for acquiring keywords that satisfy these properties without additional knowledge bases or data. Experiments on nine real-world datasets show our method outperforms existing approaches in fully zero-shot and generalized zero-shot settings. Ablation studies further confirm the importance of all three properties for superior performance.
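
A generic sketch of keyword-augmented zero-shot classification: each class name is expanded with keywords and the input text is scored against the expanded class descriptions. The sentence-transformers model, the keyword sets, and the cosine-similarity scoring are assumptions for illustration, not the paper's exact setup, which acquires keywords automatically under the three properties above.

```python
# Generic sketch of keyword-augmented zero-shot classification: score a text
# against class descriptions built from the class name plus its keywords.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

class_keywords = {  # hypothetical keyword sets; the paper acquires these automatically
    "sports": ["match", "tournament", "athlete"],
    "finance": ["stocks", "interest rate", "earnings"],
}

def classify(text: str) -> str:
    descriptions = [f"{name}: {', '.join(kws)}" for name, kws in class_keywords.items()]
    scores = util.cos_sim(model.encode(text), model.encode(descriptions))[0]
    return list(class_keywords)[int(scores.argmax())]

print(classify("The central bank raised rates, lifting bank earnings."))
```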

2023

pdf bib
Context Quality Matters in Training Fusion-in-Decoder for Extractive Open-Domain Question Answering
Kosuke Akimoto | Kunihiro Takeoka | Masafumi Oyamada
Findings of the Association for Computational Linguistics: EMNLP 2023

Retrieval-augmented generation models augment knowledge encoded in a language model by providing additional relevant external knowledge (context) during generation. Although it has been shown that the quantity and quality of context impact the performance of retrieval-augmented generation models during inference, limited research explores how these characteristics affect model training. This paper explores how context quantity and quality during model training affect the performance of Fusion-in-Decoder (FiD), the state-of-the-art retrieval-augmented generation model, in extractive open-domain question answering tasks. Experimental results suggest that FiD models overfit to context quality during training and show suboptimal performance when evaluated on a different context quality. Through the experimental results, we also reveal that FiD models trained with different context quality have different cross-attention distribution patterns. Specifically, as context quality during training increases, FiD models tend to attend more uniformly to each passage in context. Finally, based on these observations, we propose a method to mitigate overfitting to specific context quality by introducing bias to the cross-attention distribution, which we demonstrate to be effective in improving the performance of FiD models on different context quality.
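
The abstract does not specify the exact form of the proposed cross-attention bias, so the sketch below only illustrates the mechanism in the abstract's own terms: adding a bias to passage-level attention logits before the softmax changes how attention mass spreads across retrieved passages. The bias shown is a made-up flattening term, not the paper's formulation.

```python
# Conceptual sketch only: how a per-passage bias added to attention logits
# before the softmax reshapes the attention distribution over passages.
import torch
import torch.nn.functional as F

def biased_passage_attention(logits: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """logits: (num_passages,) raw attention scores; bias: same shape."""
    return F.softmax(logits + bias, dim=-1)

logits = torch.tensor([2.0, 0.5, 0.1])
flattening_bias = -0.5 * logits  # hypothetical bias pushing toward uniform attention
print(biased_passage_attention(logits, torch.zeros(3)))   # unbiased distribution
print(biased_passage_attention(logits, flattening_bias))  # flatter distribution
```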

2021

pdf bib
Low-resource Taxonomy Enrichment with Pretrained Language Models
Kunihiro Takeoka | Kosuke Akimoto | Masafumi Oyamada
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Taxonomies are symbolic representations of hierarchical relationships between terms or entities. While taxonomies are useful in broad applications, manually updating or maintaining them is labor-intensive and difficult to scale in practice. Conventional supervised methods for this enrichment task fail to find optimal parents of new terms in low-resource settings where only small taxonomies are available because of overfitting to hierarchical relationships in the taxonomies. To tackle the problem of low-resource taxonomy enrichment, we propose Musubu, an efficient framework for taxonomy enrichment in low-resource settings with pretrained language models (LMs) as knowledge bases to compensate for the shortage of information. Musubu leverages an LM-based classifier to determine whether or not input term pairs have hierarchical relationships. Musubu also utilizes Hearst patterns to generate queries to leverage implicit knowledge from the LM efficiently for more accurate prediction. We empirically demonstrate the effectiveness of our method in extensive experiments on taxonomies from both a SemEval task and real-world retailer datasets.
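
A minimal sketch of the Hearst-pattern querying the abstract describes: a candidate (hyponym, hypernym) pair is rendered into pattern sentences, and an LM-based classifier scores each query for whether it expresses a hierarchical (is-a) relationship. The pattern list and the BERT checkpoint below are illustrative assumptions; in the paper, such a classifier is fine-tuned on the available small taxonomy.

```python
# Minimal sketch of Hearst-pattern query generation plus an LM classifier,
# as described in the abstract. Patterns and checkpoint are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

HEARST_PATTERNS = [
    "{hyper} such as {hypo}",
    "{hypo} is a kind of {hyper}",
    "{hyper} including {hypo}",
]

def build_queries(hypo: str, hyper: str) -> list[str]:
    """Render a candidate term pair into Hearst-pattern query sentences."""
    return [p.format(hypo=hypo, hyper=hyper) for p in HEARST_PATTERNS]

# Stand-in classifier: here the classification head is untrained; in practice
# it would be fine-tuned on the small taxonomy to judge is-a relationships.
model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

queries = build_queries("espresso machine", "kitchen appliance")
inputs = tokenizer(queries, padding=True, return_tensors="pt")
with torch.no_grad():
    probs = classifier(**inputs).logits.softmax(-1)[:, 1]  # score per query
print(dict(zip(queries, probs.tolist())))
```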