So Miyagawa


2024

Language Atlas of Japanese and Ryukyuan (LAJaR): A Linguistic Typology Database for Endangered Japonic Languages
Kanji Kato | So Miyagawa | Natsuko Nakagawa
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

LAJaR (Language Atlas of Japanese and Ryukyuan) is a linguistic typology database focusing on micro-variation in the Japonic (Japanese and Ryukyuan) languages. This paper reports the design and progress of this ongoing database project. We also present a case study that uses the database to examine zero copulas among the Japonic languages.

2023

Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen | Emily Öhman | Flammie Pirinen | Khalid Alnajjar | So Miyagawa | Yuri Bizzoni | Niko Partanen | Jack Rueter
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

Machine Translation for Highly Low-Resource Language: A Case Study of Ainu, a Critically Endangered Indigenous Language in Northern Japan
So Miyagawa
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

This paper explores the potential of Machine Translation (MT) in preserving and revitalizing Ainu, an indigenous language of Japan classified as critically endangered by UNESCO. By leveraging Marian MT, an open-source Neural Machine Translation framework, this study addresses the challenging linguistic features of Ainu and the limitations of available resources. The research implemented a meticulous methodology involving rigorous preprocessing of data, prudent training of the model, and robust evaluation using the SacreBLEU metric. The findings underscore the system’s efficacy, achieving a SacreBLEU score of 32.90 for Japanese-to-Ainu translation. This promising result highlights the capacity of MT systems to support language preservation and aligns with recent research emphasizing the potential of computational techniques for low-resource languages. The paper concludes by affirming the significant role of MT in the broader context of language preservation, serving as a crucial tool in the fight against language extinction. The study paves the way for future research to harness advanced MT techniques and develop more sophisticated models for endangered languages.

Building Okinawan Lexicon Resource for Language Reclamation/Revitalization and Natural Language Processing Tasks such as Universal Dependencies Treebanking
So Miyagawa | Kanji Kato | Miho Zlazli | Salvatore Carlino | Seira Machida
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

The Open Multilingual Online Lexicon of Okinawan (OMOLO) project aims to create an accessible, user-friendly digital lexicon for the endangered Okinawan language using digital humanities tools and methodologies. The multilingual web application, available in Japanese, English, Portuguese, and Spanish, will benefit language learners, researchers, and the Okinawan community in Japan and diaspora countries such as the U.S., Brazil, and Peru. The project also lays the foundation for an Okinawan UD Treebank, which will support computational analysis and the development of language technology tools such as parsers, machine translation systems, and speech recognition software. The OMOLO project demonstrates the potential of computational linguistics in preserving and revitalizing endangered languages and can serve as a blueprint for similar initiatives.

2019

The Making of Coptic Wordnet
Laura Slaughter | Luis Morgado Da Costa | So Miyagawa | Marco Büchler | Amir Zeldes | Heike Behlmer
Proceedings of the 10th Global Wordnet Conference

With the increasing availability of wordnets for ancient languages, such as Ancient Greek and Latin, gaps remain in the coverage of less studied languages of antiquity. This paper reports on the construction and evaluation of a new wordnet for Coptic, the language of Late Roman, Byzantine and Early Islamic Egypt in the first millennium CE. We present our approach to constructing the wordnet, which uses multilingual Coptic dictionaries and wordnets for five different languages. We further discuss the results of this effort and outline our ongoing and future work.