Hiroshi Kanayama

2025

pdf bib abs
A Simple-Yet-Efficient Instruction Augmentation Method for Zero-Shot Sentiment Classification
Yang Zhao | Masayasu Muraoka | Issei Yoshida | Bishwaranjan Bhattacharjee | Hiroshi Kanayama
Proceedings of the 31st International Conference on Computational Linguistics

Instruction tuning significantly enhances the performance of large language models in tasks such as sentiment classification. Previous studies have leveraged labeled instances from sentiment benchmark datasets to instruction-tune LLMs, improving zero-shot sentiment classification performance. In this work, we propose a simple-yet-efficient instruction augmentation method which does not rely on any actual labeled sentiment instances. With just 240 pseudo instruction instances, the proposed method significantly improve the classification performance across several LLMs on 12 benchmark datasets, increasing scores by 30 points and outperforming LLMs that utilize more complex instruction tuning methods by 5.1 points. Surprisingly, the models tuned with 240 pseudo-instructions even outperform those tuned with actual domain-specific instruction instances. Despite method’s simplicity, our further analysis suggests that the probability shift toward the positive and negative classes and its generalization ability may be the primary driver of the improvement.

pdf bib abs
Bias Analysis and Mitigation through Protected Attribute Detection and Regard Classification
Takuma Udagawa | Yang Zhao | Hiroshi Kanayama | Bishwaranjan Bhattacharjee
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) acquire general linguistic knowledge from massive-scale pretraining. However, pretraining data mainly comprised of web-crawled texts contain undesirable social biases which can be perpetuated or even amplified by LLMs. In this study, we propose an efficient yet effective annotation pipeline to investigate social biases in the pretraining corpora. Our pipeline consists of protected attribute detection to identify diverse demographics, followed by regard classification to analyze the language polarity towards each attribute. Through our experiments, we demonstrate the effect of our bias analysis and mitigation measures, focusing on Common Crawl as the most representative pretraining corpus.

2024

pdf bib abs
Incorporating Syntax and Lexical Knowledge to Multilingual Sentiment Classification on Large Language Models
Hiroshi Kanayama | Yang Zhao | Ran Iwamoto | Takuya Ohko
Findings of the Association for Computational Linguistics: ACL 2024

This paper exploits a sentiment extractor supported by syntactic and lexical resources to enhance multilingual sentiment classification solved through the generative approach, without retraining LLMs. By adding external information of words and phrases that have positive/negative polarities, the multilingual sentiment classification error was reduced by up to 33 points, and the combination of two approaches performed best especially in high-performing pairs of LLMs and languages.

pdf bib abs
LLM Neologism: Emergence of Mutated Characters due to Byte Encoding
Ran Iwamoto | Hiroshi Kanayama
Proceedings of the 17th International Natural Language Generation Conference

The process of language generation, which selects the most probable tokens one by one, may intrinsically result in output strings that humans never utter. We name this phenomenon “LLM neologism” and investigate it focusing on Japanese, Chinese, and Korean languages, where tokens can be smaller than characters. Our findings show that LLM neologism occurs through the combination of two high-frequency words with common tokens. We also clarify the cause of LLM neologism in the tokenization process with limited vocabularies. The results of this study provides important clues for better encoding of multibyte characters, aiming to prevent catastrophic results in AI-generated documents.

pdf bib abs
Automatic Manipulation of Training Corpora to Make Parsers Accept Real-world Text
Hiroshi Kanayama | Ran Iwamoto | Masayasu Muraoka | Takuya Ohko | Kohtaroh Miyamoto
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

This paper discusses how to build a practical syntactic analyzer, and addresses the distributional differences between existing corpora and actual documents in applications. As a case study we focus on noun phrases that are not headed by a main verb and sentences without punctuation at the end, which are rare in a number of Universal Dependencies corpora but frequently appear in the real-world use cases of syntactic parsers. We converted the training corpora so that their distribution is closer to that in realistic inputs, and obtained the better scores both in general syntax benchmarking and a sentiment detection task, a typical application of dependency analysis.

2023

pdf bib abs
Sentence Identification with BOS and EOS Label Combinations
Takuma Udagawa | Hiroshi Kanayama | Issei Yoshida
Findings of the Association for Computational Linguistics: EACL 2023

The sentence is a fundamental unit in many NLP applications. Sentence segmentation is widely used as the first preprocessing task, where an input text is split into consecutive sentences considering the end of the sentence (EOS) as their boundaries. This task formulation relies on a strong assumption that the input text consists only of sentences, or what we call the sentential units (SUs). However, real-world texts often contain non-sentential units (NSUs) such as metadata, sentence fragments, nonlinguistic markers, etc. which are unreasonable or undesirable to be treated as a part of an SU. To tackle this issue, we formulate a novel task of sentence identification, where the goal is to identify SUs while excluding NSUs in a given text. To conduct sentence identification, we propose a simple yet effective method which combines the beginning of the sentence (BOS) and EOS labels to determine the most probable SUs and NSUs based on dynamic programming. To evaluate this task, we design an automatic, language-independent procedure to convert the Universal Dependencies corpora into sentence identification benchmarks. Finally, our experiments on the sentence identification task demonstrate that our proposed method generally outperforms sentence segmentation baselines which only utilize EOS labels.

Semantic role labeling (SRL) identifies the predicate-argument structure in a sentence. This task is usually accomplished in four steps: predicate identification, predicate sense disambiguation, argument identification, and argument classification. Errors introduced at one step propagate to later steps. Unfortunately, the existing SRL evaluation scripts do not consider the full effect of this error propagation aspect. They either evaluate arguments independent of predicate sense (CoNLL09) or do not evaluate predicate sense at all (CoNLL05), yielding an inaccurate SRL model performance on the argument classification task. In this paper, we address key practical issues with existing evaluation scripts and propose a more strict SRL evaluation metric PriMeSRL. We observe that by employing PriMeSRL, the quality evaluation of all SoTA SRL models drops significantly, and their relative rankings also change. We also show that PriMeSRLsuccessfully penalizes actual failures in SoTA SRL models.

pdf bib abs
Incorporating Syntactic Knowledge into Pre-trained Language Model using Optimization for Overcoming Catastrophic Forgetting
Ran Iwamoto | Issei Yoshida | Hiroshi Kanayama | Takuya Ohko | Masayasu Muraoka
Findings of the Association for Computational Linguistics: EMNLP 2023

Syntactic knowledge is invaluable information for many tasks which handle complex or long sentences, but typical pre-trained language models do not contain sufficient syntactic knowledge. Thus it results in failures in downstream tasks that require syntactic knowledge. In this paper, we explore additional training to incorporate syntactic knowledge to a language model. We designed four pre-training tasks that learn different syntactic perspectives. For adding new syntactic knowledge and keeping a good balance between the original and additional knowledge, we addressed the problem of catastrophic forgetting that prevents the model from keeping semantic information when the model learns additional syntactic knowledge. We demonstrated that additional syntactic training produced consistent performance gains while clearly avoiding catastrophic forgetting.

2022

pdf bib abs
A Simple Yet Effective Corpus Construction Method for Chinese Sentence Compression
Yang Zhao | Hiroshi Kanayama | Issei Yoshida | Masayasu Muraoka | Akiko Aizawa
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Deletion-based sentence compression in the English language has made significant progress over the past few decades. However, there is a lack of large-scale and high-quality parallel corpus (i.e., (sentence, compression) pairs) for the Chinese language to train an efficient compression system. To remedy this shortcoming, we present a dependency-tree-based method to construct a Chinese corpus with 151k pairs of sentences and compression based on Chinese language-specific characteristics. Subsequently, we trained both extractive and generative neural compression models using the constructed corpus. The experimental results show that our compression model can generate high-quality compressed sentences on both automatic and human evaluation metrics compared with the baselines. The results of the faithfulness evaluation also indicated that the Chinese compression model trained on our constructed corpus can produce more faithful compressed sentences. Furthermore, a dataset with 1,000 pairs of sentences and ground truth compression was manually created for automatic evaluation, which, we believe, will benefit future research on Chinese sentence compression.

2021

pdf bib abs
A Universal Dependencies Corpora Maintenance Methodology Using Downstream Application
Ran Iwamoto | Hiroshi Kanayama | Alexandre Rademaker | Takuya Ohko
Proceedings of the Third Workshop on Computational Typology and Multilingual NLP

This paper investigates updates of Universal Dependencies (UD) treebanks in 23 languages and their impact on a downstream application. Numerous people are involved in updating UD’s annotation guidelines and treebanks in various languages. However, it is not easy to verify whether the updated resources maintain universality with other language resources. Thus, validity and consistency of multilingual corpora should be tested through application tasks involving syntactic structures with PoS tags, dependency labels, and universal features. We apply the syntactic parsers trained on UD treebanks from multiple versions (2.0 to 2.7) to a clause-level sentiment extractor. We then analyze the relationships between attachment scores of dependency parsers and performance in application tasks. For future UD developments, we show examples of outputs that differ depending on version.

2020

pdf bib abs
Interactive Construction of User-Centric Dictionary for Text Analytics
Ryosuke Kohita | Issei Yoshida | Hiroshi Kanayama | Tetsuya Nasukawa
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We propose a methodology to construct a term dictionary for text analytics through an interactive process between a human and a machine, which helps the creation of flexible dictionaries with precise granularity required in typical text analysis. This paper introduces the first formulation of interactive dictionary construction to address this issue. To optimize the interaction, we propose a new algorithm that effectively captures an analyst’s intention starting from only a small number of sample terms. Along with the algorithm, we also design an automatic evaluation framework that provides a systematic assessment of any interactive method for the dictionary creation task. Experiments using real scenario based corpora and dictionaries show that our algorithm outperforms baseline methods, and works even with a small number of interactions.

pdf bib abs
Scalable Cross-lingual Treebank Synthesis for Improved Production Dependency Parsers
Yousef El-Kurdi | Hiroshi Kanayama | Efsun Sarioglu Kayi | Vittorio Castelli | Todd Ward | Radu Florian
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

We present scalable Universal Dependency (UD) treebank synthesis techniques that exploit advances in language representation modeling which leverage vast amounts of unlabeled general-purpose multilingual text. We introduce a data augmentation technique that uses synthetic treebanks to improve production-grade parsers. The synthetic treebanks are generated using a state-of-the-art biaffine parser adapted with pretrained Transformer models, such as Multilingual BERT (M-BERT). The new parser improves LAS by up to two points on seven languages. The production models’ LAS performance improves as the augmented treebanks scale in size, surpassing performance of production models trained on originally annotated UD treebanks.

pdf bib abs
How Universal are Universal Dependencies? Exploiting Syntax for Multilingual Clause-level Sentiment Detection
Hiroshi Kanayama | Ran Iwamoto
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper investigates clause-level sentiment detection in a multilingual scenario. Aiming at a high-precision, fine-grained, configurable, and non-biased system for practical use cases, we have designed a pipeline method that makes the most of syntactic structures based on Universal Dependencies, avoiding machine-learning approaches that may cause obstacles to our purposes. We achieved high precision in sentiment detection for 17 languages and identified the advantages of common syntactic structures as well as issues stemming from structural differences on Universal Dependencies. In addition to reusable tips for handling multilingual syntax, we provide a parallel benchmarking data set for further research.

2018

pdf bib abs
A neural parser as a direct classifier for head-final languages
Hiroshi Kanayama | Masayasu Muraoka | Ryosuke Kohita
Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP

This paper demonstrates a neural parser implementation suitable for consistently head-final languages such as Japanese. Unlike the transition- and graph-based algorithms in most state-of-the-art parsers, our parser directly selects the head word of a dependent from a limited number of candidates. This method drastically simplifies the model so that we can easily interpret the output of the neural model. Moreover, by exploiting grammatical knowledge to restrict possible modification types, we can control the output of the parser to reduce specific errors without adding annotated corpora. The neural parser performed well both on conventional Japanese corpora and the Japanese version of Universal Dependency corpus, and the advantages of distributed representations were observed in the comparison with the non-neural conventional model.

This paper discusses the representation of coordinate structures in the Universal Dependencies framework for two head-final languages, Japanese and Korean. UD applies a strict principle that makes the head of coordination the left-most conjunct. However, the guideline may produce syntactic trees which are difficult to accept in head-final languages. This paper describes the status in the current Japanese and Korean corpora and proposes alternative designs suitable for these languages.

2017

pdf bib abs
Multilingual Training of Crosslingual Word Embeddings
Long Duong | Hiroshi Kanayama | Tengfei Ma | Steven Bird | Trevor Cohn
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Crosslingual word embeddings represent lexical items from different languages using the same vector space, enabling crosslingual transfer. Most prior work constructs embeddings for a pair of languages, with English on one side. We investigate methods for building high quality crosslingual word embeddings for many languages in a unified vector space. In this way, we can exploit and combine strength of many languages. We obtained high performance on bilingual lexicon induction, monolingual similarity and crosslingual document classification tasks.

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

pdf bib abs
A Semi-universal Pipelined Approach to the CoNLL 2017 UD Shared Task
Hiroshi Kanayama | Masayasu Muraoka | Katsumasa Yoshikawa
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper presents our system submitted for the CoNLL 2017 Shared Task, “Multilingual Parsing from Raw Text to Universal Dependencies.” We ran the system for all languages with our own fully pipelined components without relying on re-trained baseline systems. To train the dependency parser, we used only the universal part-of-speech tags and distance between words, and applied deterministic rules to assign dependency labels. The simple and delexicalized models are suitable for cross-lingual transfer approaches and a universal language model. Experimental results show that our model performed well in some metrics and leads discussion on topics such as contribution of each component and on syntactic similarities among languages.

2016

pdf bib
Learning Crosslingual Word Embeddings without Bilingual Corpora
Long Duong | Hiroshi Kanayama | Tengfei Ma | Steven Bird | Trevor Cohn
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

We present an attempt to port the international syntactic annotation scheme, Universal Dependencies, to the Japanese language in this paper. Since the Japanese syntactic structure is usually annotated on the basis of unique chunk-based dependencies, we first introduce word-based dependencies by using a word unit called the Short Unit Word, which usually corresponds to an entry in the lexicon UniDic. Porting is done by mapping the part-of-speech tagset in UniDic to the universal part-of-speech tagset, and converting a constituent-based treebank to a typed dependency tree. The conversion is not straightforward, and we discuss the problems that arose in the conversion and the current solutions. A treebank consisting of 10,000 sentences was built by converting the existent resources and currently released to the public.

This paper describes a framework for multilingual translation using existing translation engines. Our method allows translation between non-English languages through English as a “hub language”. This hub language method has two major problems: “information loss” and “error accumulation”. In order to address these problems, we represent the hub language using the Linguistic Annotation Language (LAL), which contains English syntactic information and source language information. We show the effectiveness of the annotation approach with a series of experiments.

pdf bib
Paraphrasing Rules for Automatic Evaluation of Translation into Japanese
Hiroshi Kanayama
Proceedings of the Second International Workshop on Paraphrasing