Asuka Sumida

2022

Resource of Wikipedias in 31 Languages Categorized into Fine-Grained Named Entities
Satoshi Sekine | Kouta Nakayama | Masako Nomoto | Maya Ando | Asuka Sumida | Koji Matsuda
Proceedings of the 29th International Conference on Computational Linguistics

This paper describes a resource of Wikipedias in 31 languages categorized into Extended Named Entity (ENE), which has 219 fine-grained NE categories. We first categorized 920K Japanese Wikipedia pages according to the ENE scheme using machine learning, followed by manual validation. We then organized a shared task of Wikipedia categorization in 30 languages. The training data were derived from the Japanese categorization and the language links, and the task was to categorize the Wikipedia pages in 30 languages with no language links from Japanese Wikipedia (20M pages in total). Thirteen groups with 24 systems participated in the 2020 and 2021 tasks, sharing their outputs for resource-building. The Japanese categorization accuracy was 98.5%, and the best performance among the 30 languages ranged from 80 to 93 in F-measure. Using ensemble learning, we created outputs with an average F-measure of 86.8, which is 1.7 better than the best single systems. The total size of the resource is 32.5M pages, including the training data. We call this resource creation scheme “Resource by Collaborative Contribution (RbCC)”. We also constructed structuring tasks (attribute extraction and link prediction) using RbCC under our ongoing project, “SHINRA.”

2009

2008

Hacking Wikipedia for Hyponymy Relation Acquisition
Asuka Sumida | Kentaro Torisawa
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

Boosting Precision and Recall of Hyponymy Relation Acquisition from Hierarchical Layouts in Wikipedia
Asuka Sumida | Naoki Yoshinaga | Kentaro Torisawa
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper proposes an extension of Sumida and Torisawa's method of acquiring hyponymy relations from hierarchical layouts in Wikipedia (Sumida and Torisawa, 2008). We extract hyponymy relation candidates (HRCs) from the hierarchical layouts in Wikipedia by regarding all subordinate items of an item x in the hierarchical layouts as x's hyponym candidates, while Sumida and Torisawa (2008) extracted only direct subordinate items of an item x as x's hyponym candidates. We then select plausible hyponymy relations from the acquired HRCs by running a filter based on machine learning with novel features, which further improve the precision of the resulting hyponymy relations. Experimental results show that we acquired more than 1.34 million hyponymy relations with a precision of 90.1%.
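
The difference between the two candidate-extraction strategies in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Item` class, the function names, and the tea example are hypothetical, and the machine-learning filter that selects plausible relations is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Item:
    """One node of a hierarchical layout (hypothetical structure)."""
    text: str
    children: List["Item"] = field(default_factory=list)


def descendants(item: Item) -> Iterator[Item]:
    """Yield every item below `item`, at any depth."""
    for child in item.children:
        yield child
        yield from descendants(child)


def direct_candidates(root: Item) -> List[Tuple[str, str]]:
    """(hypernym, hyponym) pairs using direct subordinate items only."""
    pairs = [(root.text, child.text) for child in root.children]
    for child in root.children:
        pairs.extend(direct_candidates(child))
    return pairs


def all_candidates(root: Item) -> List[Tuple[str, str]]:
    """(hypernym, hyponym) pairs using all subordinate items of each item."""
    pairs = [(root.text, d.text) for d in descendants(root)]
    for child in root.children:
        pairs.extend(all_candidates(child))
    return pairs


# Hypothetical layout: a heading with nested list items.
tree = Item("Tea", [
    Item("Black tea", [Item("Darjeeling"), Item("Assam")]),
    Item("Green tea"),
])

direct = direct_candidates(tree)
extended = all_candidates(tree)
extra = set(extended) - set(direct)  # pairs only the extended strategy finds
```

On this toy layout the extended strategy additionally pairs "Tea" with "Darjeeling" and "Assam", which shows how considering all subordinate items can raise recall, and also why a subsequent filter is needed to keep precision high.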