Yusuke Ide
Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process, consisting of human annotation, human review, and automated consistency checking, to enhance data quality. Additionally, for the first time in an MWE identification dataset, CoAM’s MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form. Through experiments using CoAM, we find that a fine-tuned large language model outperforms MWEasWSD, which achieved state-of-the-art performance on the DiMSUM dataset. Furthermore, analysis using our MWE type-tagged data reveals that Verb MWEs are easier to identify than Noun MWEs across approaches.
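The abstract does not state how identification performance is scored; one common convention for MWE identification (used, for example, in the PARSEME shared tasks) is MWE-based exact-match F1 over sets of token indices, which also accommodates discontiguous expressions. A minimal sketch under that assumption:

```python
# Minimal sketch of MWE-based exact-match precision/recall/F1.
# The exact metric used for CoAM is an assumption here; scoring MWEs as
# sets of (possibly discontiguous) token indices is one common convention.

def mwe_f1(gold: list[set[frozenset[int]]], pred: list[set[frozenset[int]]]) -> dict:
    """gold/pred: one set of MWEs per sentence, each MWE a frozenset of token indices."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: one gold MWE (tokens 1-3) found exactly, plus one spurious prediction.
gold = [{frozenset({1, 2, 3})}]
pred = [{frozenset({1, 2, 3}), frozenset({5, 6})}]
print(mwe_f1(gold, pred))  # precision 0.5, recall 1.0, f1 ~0.667
```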
Large-scale Vision-Language Models (LVLMs) process both images and text, excelling in multimodal tasks such as image captioning and description generation. However, while these models excel at generating factual content, their ability to generate and evaluate texts that reflect different perspectives on the same image, depending on the context, has not been sufficiently explored. To address this, we propose IRR: Image Review Rank, a novel evaluation framework designed to assess critic review texts from multiple perspectives. IRR evaluates LVLMs by measuring how closely their judgments align with human interpretations. We validate it using a dataset of images from 15 categories, each with five critic review texts and annotated rankings in both English and Japanese, totaling over 2,000 data instances. Our results indicate that, although LVLMs exhibited consistent performance across languages, their correlation with human annotations was insufficient, highlighting the need for further advancements. These findings underscore the limitations of current evaluation methods and the need for approaches that better capture human reasoning in Vision & Language tasks.
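The abstract does not name the agreement statistic; a natural reading of "how closely their judgments align with human interpretations" is a rank correlation between the model's ranking of the five reviews and the annotated human ranking. The sketch below uses Spearman's rho as an illustrative, assumed choice, averaged over images:

```python
# Illustrative sketch of a rank-correlation readout in the spirit of IRR.
# Spearman's rho between model and human rankings is an assumption, not
# necessarily the statistic used in the paper.
from scipy.stats import spearmanr

def review_rank_agreement(model_rankings, human_rankings):
    """Each element is a list of ranks (1 = best) over the same five reviews."""
    rhos = []
    for m, h in zip(model_rankings, human_rankings):
        rho, _ = spearmanr(m, h)
        rhos.append(rho)
    return sum(rhos) / len(rhos)

# Toy example: two images, five critic reviews each.
model = [[1, 2, 3, 4, 5], [2, 1, 4, 3, 5]]
human = [[1, 3, 2, 4, 5], [1, 2, 3, 4, 5]]
print(review_rank_agreement(model, human))
```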
Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
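The abstract summarizes the method at a high level; the sketch below is one possible reading, with two explicit assumptions not taken from the paper: a greedy longest-match segmentation stands in for the BPE fallback that occurs when a subword is removed, and a removed subword's embedding is estimated as the mean of its fallback pieces, processing target subwords from shortest to longest so that earlier estimates can feed later ones.

```python
# Illustrative sketch of fallback-based embedding estimation.
# Assumptions (not from the paper): greedy longest-match segmentation
# approximates the BPE fallback, and a removed subword's embedding is
# the mean of the embeddings of the shorter subwords it falls back to.
import numpy as np

def segment(subword: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation using only subwords in `vocab`."""
    pieces, i = [], 0
    while i < len(subword):
        for j in range(len(subword), i, -1):
            if subword[i:j] in vocab or j == i + 1:  # single characters always allowed
                pieces.append(subword[i:j])
                i = j
                break
    return pieces

def estimate_embeddings(source_emb: dict[str, np.ndarray], targets: list[str]) -> dict[str, np.ndarray]:
    """Estimate embeddings for `targets`, shortest first, so that longer
    subwords can reuse estimates already made for their shorter pieces."""
    emb = dict(source_emb)
    for sw in sorted(targets, key=len):
        pieces = segment(sw, set(emb))
        vecs = [emb[p] for p in pieces if p in emb]
        if vecs:
            emb[sw] = np.mean(vecs, axis=0)
    return emb
```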
The grammatical knowledge of language models (LMs) is often measured using a benchmark of linguistic minimal pairs, where LMs are presented with a pair of acceptable and unacceptable sentences and required to judge which is more acceptable. Conventional approaches compare sentence probabilities directly, but large language models (LLMs) enable more nuanced evaluation methods using prompts and templates. We therefore investigate how to derive the most accurate acceptability judgments from LLMs to comprehensively evaluate their grammatical knowledge. Through extensive experiments in both English and Chinese, we compare nine judgment methods and demonstrate that two of them, in-template LP (a probability readout method) and Yes/No probability computing (a prompting-based method), achieve higher accuracy than the conventional approach. Our analysis reveals that the top two methods excel in different linguistic phenomena, suggesting they access different aspects of the LLMs’ grammatical knowledge. We find that ensembling the two methods achieves even higher accuracy. Consequently, we recommend these techniques, either individually or ensembled, as more effective alternatives to conventional approaches for assessing grammatical knowledge in LLMs.
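For concreteness, the sketch below shows the conventional direct probability comparison and a template-wrapped variant in the spirit of in-template LP, using the Hugging Face transformers API; the model name and template wording are placeholders, not the paper's exact setup.

```python
# Minimal sketch of sentence-probability comparison for minimal pairs.
# Template wording and model choice are assumptions, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def log_prob(text: str) -> float:
    """Total log-probability of the text under the LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean NLL per predicted token
    return -loss.item() * (ids.shape[1] - 1)

def judge(acceptable: str, unacceptable: str, template: str = "{}") -> bool:
    """Return True if the LM assigns higher probability to the acceptable sentence."""
    return log_prob(template.format(acceptable)) > log_prob(template.format(unacceptable))

pair = ("The cats are sleeping.", "The cats is sleeping.")
print(judge(*pair))                                           # conventional LP comparison
print(judge(*pair, template='A grammatical sentence: "{}"'))  # in-template variant (illustrative)
```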
We report the findings of the 2024 Multilingual Lexical Simplification Pipeline shared task. We released a new dataset comprising 5,927 instances of lexical complexity prediction and lexical simplification on common contexts across 10 languages, split into trial (300) and test (5,627). 10 teams participated across 2 tracks and 10 languages, with 233 runs evaluated across all systems. Five teams participated in all languages for the lexical complexity prediction task and four teams participated in all languages for the lexical simplification task. Teams employed a range of strategies, making use of open and closed source large language models for lexical simplification, as well as feature-based approaches for lexical complexity prediction. The highest-scoring team on the combined multilingual data obtained a Pearson’s correlation of 0.6241 and an ACC@1@Top1 of 0.3772, demonstrating that there is still room for improvement on two difficult sub-tasks of the lexical simplification pipeline.
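For readers unfamiliar with the two headline metrics, the sketch below spells out assumed definitions: Pearson's correlation between predicted and gold complexity scores for the LCP track, and ACC@1@Top1 for the LS track, read here as the fraction of instances where the system's top-ranked substitute matches the most frequently suggested gold substitute.

```python
# Sketch of the two headline metrics under assumed definitions.
from scipy.stats import pearsonr

def lcp_pearson(pred_scores, gold_scores):
    """Pearson's r between predicted and gold complexity scores."""
    r, _ = pearsonr(pred_scores, gold_scores)
    return r

def acc1_top1(system_top1, gold_top1):
    """Fraction of instances where the system's top substitute equals the
    most frequently suggested gold substitute (one string per instance)."""
    matches = sum(s.lower() == g.lower() for s, g in zip(system_top1, gold_top1))
    return matches / len(gold_top1)

print(lcp_pearson([0.2, 0.5, 0.9], [0.1, 0.6, 0.8]))
print(acc1_top1(["easy", "hard"], ["easy", "tough"]))  # 0.5
```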
Geoparsing is a fundamental technique for analyzing geo-entity information in text, which is useful for geographic applications, e.g., tourist spot recommendation. We focus on document-level geoparsing that considers geographic relatedness among geo-entity mentions and present a Japanese travelogue dataset designed for training and evaluating document-level geoparsing systems. Our dataset comprises 200 travelogue documents with rich geo-entity information: 12,171 mentions, 6,339 coreference clusters, and 2,551 geo-entities linked to geo-database entries.
We present preliminary findings on the MultiLS dataset, developed in support of the 2024 Multilingual Lexical Simplification Pipeline (MLSP) Shared Task. This dataset currently comprises 300 instances of lexical complexity prediction and lexical simplification across 10 languages. In this paper, we (1) describe the annotation protocol in support of the contribution of future datasets and (2) present summary statistics on the existing data that we have gathered. Multilingual lexical simplification can be used to help low-ability readers engage with otherwise difficult texts in their native, often low-resourced, languages.
The tasks of lexical complexity prediction (LCP) and complex word identification (CWI) commonly presuppose that difficult-to-understand words are shared by the target population. Meanwhile, personalization methods have also been proposed to adapt models to individual needs. We verify that a recent Japanese LCP dataset is representative of its target population by partially replicating the annotation. Through a further reannotation, we show that native Chinese speakers perceive complexity differently due to Sino-Japanese vocabulary. To explore the possibilities of personalization, we compare competitive baselines trained on the group mean ratings and individual ratings in terms of performance for an individual. We show that the model trained on a group mean performs similarly to an individual model in the CWI task, while achieving good LCP performance for an individual is difficult. We also experiment with adapting a fine-tuned BERT model, which results in only marginal improvements across all settings.
Lexical complexity prediction (LCP) is the task of predicting the complexity of words in a text on a continuous scale. It plays a vital role in simplifying or annotating complex words to assist readers. To study lexical complexity in Japanese, we construct the first Japanese LCP dataset. Our dataset provides separate complexity scores for Chinese/Korean annotators and others to address the readers’ L1-specific needs. In the baseline experiment, we demonstrate the effectiveness of a BERT-based system for Japanese LCP.
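The abstract does not detail the baseline architecture; a standard way to set up a BERT-based LCP regressor with the transformers library is sketched below, where the model name, input format, and regression head are all assumptions (the head is randomly initialized here and would need fine-tuning on the LCP data before its scores are meaningful).

```python
# Minimal sketch of a BERT-based LCP regressor (sequence regression, num_labels=1).
# Model name and input format are placeholders, not the paper's exact setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-multilingual-cased"  # placeholder; a Japanese BERT would be used in practice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

def predict_complexity(target_word: str, sentence: str) -> float:
    """Encode the target word together with its sentence context and read the
    single regression logit as a complexity score (untrained until fine-tuned)."""
    enc = tok(target_word, sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        score = model(**enc).logits.squeeze().item()
    return score
```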