2024
pdf
abs
Automatic Manipulation of Training Corpora to Make Parsers Accept Real-world Text
Hiroshi Kanayama
|
Ran Iwamoto
|
Masayasu Muraoka
|
Takuya Ohko
|
Kohtaroh Miyamoto
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
This paper discusses how to build a practical syntactic analyzer, and addresses the distributional differences between existing corpora and actual documents in applications. As a case study we focus on noun phrases that are not headed by a main verb and sentences without punctuation at the end, which are rare in a number of Universal Dependencies corpora but frequently appear in the real-world use cases of syntactic parsers. We converted the training corpora so that their distribution is closer to that in realistic inputs, and obtained the better scores both in general syntax benchmarking and a sentiment detection task, a typical application of dependency analysis.
2023
pdf
abs
Incorporating Syntactic Knowledge into Pre-trained Language Model using Optimization for Overcoming Catastrophic Forgetting
Ran Iwamoto
|
Issei Yoshida
|
Hiroshi Kanayama
|
Takuya Ohko
|
Masayasu Muraoka
Findings of the Association for Computational Linguistics: EMNLP 2023
Syntactic knowledge is invaluable information for many tasks which handle complex or long sentences, but typical pre-trained language models do not contain sufficient syntactic knowledge. Thus it results in failures in downstream tasks that require syntactic knowledge. In this paper, we explore additional training to incorporate syntactic knowledge to a language model. We designed four pre-training tasks that learn different syntactic perspectives. For adding new syntactic knowledge and keeping a good balance between the original and additional knowledge, we addressed the problem of catastrophic forgetting that prevents the model from keeping semantic information when the model learns additional syntactic knowledge. We demonstrated that additional syntactic training produced consistent performance gains while clearly avoiding catastrophic forgetting.
2021
pdf
abs
A Universal Dependencies Corpora Maintenance Methodology Using Downstream Application
Ran Iwamoto
|
Hiroshi Kanayama
|
Alexandre Rademaker
|
Takuya Ohko
Proceedings of the Third Workshop on Computational Typology and Multilingual NLP
This paper investigates updates of Universal Dependencies (UD) treebanks in 23 languages and their impact on a downstream application. Numerous people are involved in updating UD’s annotation guidelines and treebanks in various languages. However, it is not easy to verify whether the updated resources maintain universality with other language resources. Thus, validity and consistency of multilingual corpora should be tested through application tasks involving syntactic structures with PoS tags, dependency labels, and universal features. We apply the syntactic parsers trained on UD treebanks from multiple versions (2.0 to 2.7) to a clause-level sentiment extractor. We then analyze the relationships between attachment scores of dependency parsers and performance in application tasks. For future UD developments, we show examples of outputs that differ depending on version.