Masatoshi Tsuchiya

2022

Developing a Dataset of Overridden Information in Wikipedia
Masatoshi Tsuchiya | Yasutaka Yokoi
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper proposes a new task of detecting information override. Because not all information on the Web is updated in a timely manner, information that has been overridden by another information source needs to be discarded. The task is formalized as a binary classification problem: determining whether a reference sentence has overridden a target sentence. To investigate this task, this paper describes a procedure for constructing a dataset of overridden information by collecting sentence pairs from the differences between two versions of Wikipedia. The dataset under development shows that the old version of Wikipedia contains much overridden information and that detecting information override is necessary.
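The abstract does not spell out how sentence pairs are collected from the two Wikipedia versions, so the following is only a minimal sketch under stated assumptions: sentences are split naively on terminal punctuation, the two versions are compared with Python's standard `difflib`, and each replaced old sentence is paired with the new sentence at the same position. The function names `split_sentences` and `candidate_pairs` and the sample text are hypothetical. The resulting (target, reference) pairs would still need a label before they could be used for binary classification.

```python
import difflib
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_pairs(old_text, new_text):
    """Return (target, reference) pairs for sentences rewritten between two versions."""
    old = split_sentences(old_text)
    new = split_sentences(new_text)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # Hypothetical heuristic: pair replaced sentences position by position.
            pairs.extend(zip(old[i1:i2], new[j1:j2]))
    return pairs


# Invented sample text, not taken from Wikipedia.
old_version = "The tower is 300 m tall. It opened in 1990. It is the tallest in the city."
new_version = "The tower is 300 m tall. It opened in 1990. A taller tower opened nearby in 2015."
pairs = candidate_pairs(old_version, new_version)
for target, reference in pairs:
    print("target:   ", target)
    print("reference:", reference)
```

Pure insertions and deletions are ignored here because they have no counterpart sentence; whether such cases belong in the dataset is a design decision this sketch does not settle.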

Automatic Approach for Building Dataset of Citation Functions for COVID-19 Academic Papers
Setio Basuki | Masatoshi Tsuchiya
Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022

This paper develops a new dataset of citation functions for COVID-19-related academic papers. Because preparing new citation function labels and building a new dataset require much human effort and are time-consuming, this paper reuses our previous citation function scheme built for the Computer Science (CS) domain, which consists of five coarse-grained labels and 21 fine-grained labels. This paper uses the COVID-19 Open Research Dataset (CORD-19) and extracts 99.6k random citing sentences from 10.1k papers. These citing sentences are categorized using the classification models built from the CS domain. A manual check of 475 random samples yielded accuracies of 76.6% and 70.2% on coarse-grained and fine-grained labels, respectively. The evaluation reveals three findings. First, two fine-grained labels underwent a shift in meaning while retaining the same underlying idea. Second, the COVID-19 domain is dominated by statements that highlight the importance, cruciality, usefulness, benefit, consideration, etc. of certain topics in order to build sensible arguments. Third, discussing state-of-the-art (SOTA) methods in terms of outperforming previous work is less common in the COVID-19 domain than in the CS domain. Our results will be used for further dataset development by classifying citing sentences in all papers from CORD-19.

2020

Developing Dataset of Japanese Slot Filling Quizzes Designed for Evaluation of Machine Reading Comprehension
Takuto Watarai | Masatoshi Tsuchiya
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes a dataset, currently under development, of Japanese slot filling quizzes designed for evaluating machine reading comprehension. The dataset consists of quizzes automatically generated from Aozora Bunko, and each quiz is defined as a 4-tuple: a context passage, a query holding a slot, an answer character, and a set of possible answer characters. The query is generated from the original sentence that appears immediately after the context passage in the target book, by replacing the answer character with the slot. The set of possible answer characters consists of the answer character and the other characters who appear in the context passage. Because the context passage and the query share the same context, a machine that precisely understands the context should be able to select the correct answer from the set of possible answer characters. The unique point of our approach is that we focus on the characters of target books as slots when generating queries from original sentences, because characters play important roles in narrative texts and precisely understanding their relationships is necessary for reading comprehension. To extract characters from target books, manually created dictionaries of characters are employed, because some characters appear as common nouns rather than as named entities.
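The 4-tuple described above can be sketched as a small data structure plus a generator function. This is an illustrative sketch only, assuming plain substring matching against a character dictionary and an arbitrary rule that the earliest-mentioned character becomes the answer; the `Quiz` class, `make_quiz`, the `[SLOT]` token, and the sample text are all hypothetical and not taken from the dataset.

```python
from dataclasses import dataclass

SLOT = "[SLOT]"  # hypothetical placeholder token


@dataclass(frozen=True)
class Quiz:
    context: str           # context passage
    query: str             # sentence following the context, with the answer replaced by SLOT
    answer: str            # answer character
    candidates: frozenset  # answer plus other characters appearing in the context


def make_quiz(context, next_sentence, characters):
    """Build one quiz, or return None if no dictionary character occurs in the sentence."""
    in_sentence = [c for c in characters if c in next_sentence]
    if not in_sentence:
        return None
    # Assumption: the character mentioned earliest in the sentence becomes the answer.
    answer = min(in_sentence, key=next_sentence.index)
    query = next_sentence.replace(answer, SLOT, 1)
    candidates = frozenset(c for c in characters if c in context) | {answer}
    return Quiz(context, query, answer, candidates)


# Invented example; the dictionary mixes proper names and a common noun.
characters = ["Taro", "Hanako", "the teacher"]
context = "Taro and Hanako waited outside the classroom. Taro was holding a letter."
next_sentence = "Hanako asked what the letter said."
quiz = make_quiz(context, next_sentence, characters)
print(quiz.query, sorted(quiz.candidates))
```

Substring matching is deliberately naive: it would need tokenization for real Japanese text, and it is only meant to show why a manually built dictionary lets common-noun characters such as "the teacher" be treated as answer candidates.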

2018

Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment
Masatoshi Tsuchiya
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Developing Corpus of Lecture Utterances Aligned to Slide Components
Ryo Minamiguchi | Masatoshi Tsuchiya
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

The approach that formulates automatic text summarization as a maximum coverage problem with a knapsack constraint over a set of textual units and a set of weighted conceptual units is promising. However, determining the appropriate granularity of conceptual units for this formulation is both important and difficult. To resolve this problem, we are examining the use of presentation slide components as conceptual units for generating a summary of lecture utterances, instead of other possible conceptual units such as base noun phrases or important nouns. This paper describes the corpus we are developing to evaluate the proposed approach; it consists of presentation slides and lecture utterances aligned to presentation slide components.
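To make the formulation concrete, the sketch below selects textual units (utterances) so that the total weight of covered conceptual units (slide components) is as large as possible without exceeding a length budget. The abstract does not say how the problem is solved; this uses a simple greedy gain-per-cost heuristic, which is an assumption and is not guaranteed to find the optimal selection. The function name `greedy_summary`, the component identifiers, the weights, and the costs are all hypothetical.

```python
def greedy_summary(units, weights, budget):
    """Greedy heuristic for maximum coverage under a knapsack (length) constraint.

    units:   list of (text, cost, concepts), with cost > 0 and concepts a set of ids
    weights: dict mapping concept id -> weight
    budget:  maximum total cost of the selected units
    """
    chosen, covered, spent = [], set(), 0
    remaining = list(range(len(units)))
    while remaining:
        best, best_ratio = None, 0.0
        for i in remaining:
            _, cost, concepts = units[i]
            if spent + cost > budget:
                continue  # this unit no longer fits in the budget
            # Only concepts that are not yet covered contribute to the gain.
            gain = sum(weights.get(c, 0.0) for c in concepts - covered)
            ratio = gain / cost
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break  # nothing fits or nothing adds new coverage
        chosen.append(best)
        covered |= units[best][2]
        spent += units[best][1]
        remaining.remove(best)
    return [units[i][0] for i in sorted(chosen)], covered


# Invented utterances aligned to hypothetical slide components.
weights = {"title": 3.0, "bullet1": 2.0, "bullet2": 1.0}
units = [
    ("Today we cover sorting.", 4, {"title"}),
    ("Quicksort picks a pivot.", 4, {"bullet1"}),
    ("Sorting, um, quicksort, pivot and so on.", 8, {"title", "bullet1"}),
    ("Merge sort splits the list.", 5, {"bullet2"}),
]
summary, covered = greedy_summary(units, weights, budget=9)
print(summary)
```

With slide components as conceptual units, the granularity question raised in the abstract becomes a question of what the concept ids denote (a whole slide, a bullet, a figure), while the selection procedure itself stays unchanged.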

2012

Detecting Japanese Compound Functional Expressions using Canonical/Derivational Relation
Takafumi Suzuki | Yusuke Abe | Itsuki Toyota | Takehito Utsuro | Suguru Matsuyoshi | Masatoshi Tsuchiya
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The Japanese language has various types of functional expressions. To organize Japanese functional expressions with their various surface forms, a hierarchically organized lexicon of Japanese functional expressions was compiled. This paper proposes a framework for identifying more than 16,000 functional expressions in Japanese texts by utilizing the hierarchical organization of the lexicon. In our framework, these functional expressions are roughly divided into canonical and derived functional expressions. Each derived functional expression is identified by referring to the most similar occurrence of its canonical expression. Contextual occurrence information for the far fewer canonical expressions is expanded to all forms of the derived expressions and is used when identifying those derived expressions. We also show empirically that the proposed method can correctly identify more than 80% of the functional / content usages with fewer than 38,000 training instances of manually identified canonical expressions.

Developing Partially-Transcribed Speech Corpus from Edited Transcriptions
Kengo Ohta | Masatoshi Tsuchiya | Seiichi Nakagawa
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Large-scale spontaneous speech corpora are a crucial resource for various domains of spoken language processing. However, the available corpora are usually limited because they are expensive to construct, especially when speech must be transcribed precisely. On the other hand, loosely transcribed corpora such as shorthand notes, meeting records, and closed captions are more widely available than precisely transcribed ones, because their imperfection reduces their construction cost. Because these corpora contain both precisely transcribed regions and edited regions, it is difficult to use them directly as speech corpora for learning acoustic models. Against this background, we have been considering building an efficient semi-automatic framework to convert loose transcriptions into precise ones. This paper describes an improved method for automatically detecting precise regions in loosely transcribed corpora for this framework. Our detection method consists of two steps: the first step is a forced alignment between loose transcriptions and their utterances to find the utterance corresponding to a given loose transcription, and the second step is a detector of precise regions based on a support vector machine using several features obtained from the first step. Our experimental results show that our method detects precise regions with high accuracy and that the precise regions extracted by our method are effective as training labels for lightly supervised speaker adaptation.

2009

Analysis and Robust Extraction of Changing Named Entities
Masatoshi Tsuchiya | Shoko Endo | Seiichi Nakagawa
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

2008

Developing Corpus of Japanese Classroom Lecture Speech Contents
Masatoshi Tsuchiya | Satoru Kogure | Hiromitsu Nishizaki | Kengo Ohta | Seiichi Nakagawa
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes the Corpus of Japanese classroom Lecture speech Contents (henceforth CJLC), which we are developing. The growing volume of e-Learning content demands a sophisticated interactive browsing system; however, existing tools do not satisfy this requirement. Much research, including large vocabulary continuous speech recognition and extraction of important sentences from lecture content, is necessary to realize such a system. CJLC is designed as a foundation for this research and consists of speech, transcriptions, and slides collected in real university classroom lectures. This paper also explains the differences in disfluency acts between classroom lectures and academic presentations.

Robust Extraction of Named Entity Including Unfamiliar Word
Masatoshi Tsuchiya | Shinya Hida | Seiichi Nakagawa
Proceedings of ACL-08: HLT, Short Papers
