Shigeki Matsubara

2024

pdf abs
Negation Scope Conversion: Towards a Unified Negation-Annotated Dataset
Asahi Yoshida | Yoshihide Kato | Shigeki Matsubara
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Negation scope resolution is the task that identifies the part of a sentence affected by the negation cue. The three major corpora used for this task, the BioScope corpus, the SFU review corpus and the Sherlock dataset, have different annotation schemes for negation scope. Due to the different annotations, the negation scope resolution models based on pre-trained language models (PLMs) perform worse when fine-tuned on the simply combined dataset consisting of the three corpora. To address this issue, we propose a method for automatically converting the scopes of BioScope and SFU to those of Sherlock and merge them into a unified dataset. To verify the effectiveness of the proposed method, we conducted experiments using the unified dataset for fine-tuning PLM-based models. The experimental results demonstrate that the performances of the models increase when fine-tuned on the unified dataset unlike the simply combined one. In the token-level metric, the model fine-tuned on the unified dataset archived the state-of-the-art performance on the Sherlock dataset.

pdf abs
On an Intermediate Task for Classifying URL Citations on Scholarly Papers
Kazuhiro Wada | Masaya Tsunokake | Shigeki Matsubara
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Citations using URL (URL citations) that appear in scholarly papers can be used as an information source for the research resource search engines. In particular, the information about the types of cited resources and reasons for their citation is crucial to describe the resources and their relations in the search services. To obtain this information, previous studies proposed some methods for classifying URL citations. However, their methods trained the model using a simple fine-tuning strategy and exhibited insufficient performance. We propose a classification method using a novel intermediate task. Our method trains the model on our intermediate task of identifying whether sample pairs belong to the same class before being fine-tuned on the target task. In the experiment, our method outperformed previous methods using the simple fine-tuning strategy with higher macro F-scores for different model sizes and architectures. Our analysis results indicate that the model learns the class boundaries of the target task by training our intermediate task. Our intermediate task also demonstrated higher performance and computational efficiency than an alternative intermediate task using triplet loss. Finally, we applied our method to other text classification tasks and confirmed the effectiveness when a simple fine-tuning strategy does not stably work.

2023

pdf
Paper Recommendation Using Citation Contexts in Scholarly Documents
Tomoki Ikoma | Shigeki Matsubara
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

pdf
Automatic Insertion of Commas and Linefeeds into Lecture Transcripts based on Multi-Task Learning
Zhicheng Fang | Masaki Murata | Shigeki Matsubara
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

pdf abs
Revisiting Syntax-Based Approach in Negation Scope Resolution
Asahi Yoshida | Yoshihide Kato | Shigeki Matsubara
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

Negation scope resolution is the process of detecting the negated part of a sentence. Unlike the syntax-based approach employed in previous research, state-of-the-art methods performed better without the explicit use of syntactic structure. This work revisits the syntax-based approach and re-evaluates the effectiveness of syntactic structure in negation scope resolution. We replace the parser utilized in the prior works with state-of-the-art parsers and modify the syntax-based heuristic rules. The experimental results demonstrate that the simple modifications enhance the performance of the prior syntax-based method to the same level as state-of-the-art end-to-end neural-based methods.

pdf
On the Use of Language Models for Function Identification of Citations in Scholarly Papers
Tomoki Ikoma | Shigeki Matsubara
Proceedings of the Second Workshop on Information Extraction from Scientific Publications

2022

pdf abs
Construction of Responsive Utterance Corpus for Attentive Listening Response Production
Koichiro Ito | Masaki Murata | Tomohiro Ohno | Shigeki Matsubara
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In Japan, the number of single-person households, particularly among the elderly, is increasing. Consequently, opportunities for people to narrate are being reduced. To address this issue, conversational agents, e.g., communication robots and smart speakers, are expected to play the role of the listener. To realize these agents, this paper describes the collection of conversational responses by listeners that demonstrate attentive listening attitudes toward narrative speakers, and a method to annotate existing narrative speech with responsive utterances is proposed. To summarize, 148,962 responsive utterances by 11 listeners were collected in a narrative corpus comprising 13,234 utterance units. The collected responsive utterances were analyzed in terms of response frequency, diversity, coverage, and naturalness. These results demonstrated that diverse and natural responsive utterances were collected by the proposed method in an efficient and comprehensive manner. To demonstrate the practical use of the collected responsive utterances, an experiment was conducted, in which response generation timings were detected in narratives.

pdf
A Model-Theoretic Formalization of Natural Language Inference Using Neural Network and Tableau Method
Ayahito Saji | Yoshihide Kato | Shigeki Matsubara
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

pdf bib abs
Classification of URL Citations in Scholarly Papers for Promoting Utilization of Research Artifacts
Masaya Tsunokake | Shigeki Matsubara
Proceedings of the first Workshop on Information Extraction from Scientific Publications

Utilizing citations for research artifacts (e.g., dataset, software) in scholarly papers contributes to efficient expansion of research artifact repositories and various applications e.g., the search, recommendation, and evaluation of such artifacts. This study focuses on citations using URLs (URL citations) and aims to identify and analyze research artifact citations automatically. This paper addresses the classification task for each URL citation to identify (1) the role that the referenced resources play in research activities, (2) the type of referenced resources, and (3) the reason why the author cited the resources. This paper proposes the classification method using section titles and footnote texts as new input features. We extracted URL citations from international conference papers as experimental data. We performed 5-fold cross-validation using the data and computed the classification performance of our method. The results demonstrate that our method is effective in all tasks. An additional experiment demonstrates that using cited URLs as input features is also effective.

2021

pdf abs
A New Representation for Span-based CCG Parsing
Yoshihide Kato | Shigeki Matsubara
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This paper proposes a new representation for CCG derivations. CCG derivations are represented as trees whose nodes are labeled with categories strictly restricted by CCG rule schemata. This characteristic is not suitable for span-based parsing models because they predict node labels independently. In other words, span-based models may generate invalid CCG derivations that violate the rule schemata. Our proposed representation decomposes CCG derivations into several independent pieces and prevents the span-based parsing models from violating the schemata. Our experimental result shows that an off-the-shelf span-based parser with our representation is comparable with previous CCG parsers.

pdf
Natural Language Inference using Neural Network and Tableau Method
Ayahito Saji | Daiki Takao | Yoshihide Kato | Shigeki Matsubara
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

2020

pdf abs
Parsing Gapping Constructions Based on Grammatical and Semantic Roles
Yoshihide Kato | Shigeki Matsubara
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

A gapping construction consists of a coordinated structure where redundant elements are elided from all but one conjuncts. This paper proposes a method of parsing sentences with gapping to recover elided elements. The proposed method is based on constituent trees annotated with grammatical and semantic roles that are useful for identifying elided elements. Our method outperforms the previous method in terms of F-measure and recall.

pdf abs
Relation between Degree of Empathy for Narrative Speech and Type of Responsive Utterance in Attentive Listening
Koichiro Ito | Masaki Murata | Tomohiro Ohno | Shigeki Matsubara
Proceedings of the Twelfth Language Resources and Evaluation Conference

Nowadays, spoken dialogue agents such as communication robots and smart speakers listen to narratives of humans. In order for such an agent to be recognized as a listener of narratives and convey the attitude of attentive listening, it is necessary to generate responsive utterances. Moreover, responsive utterances can express empathy to narratives and showing an appropriate degree of empathy to narratives is significant for enhancing speaker’s motivation. The degree of empathy shown by responsive utterances is thought to depend on their type. However, the relation between responsive utterances and degrees of the empathy has not been explored yet. This paper describes the classification of responsive utterances based on the degree of empathy in order to explain that relation. In this research, responsive utterances are classified into five levels based on the effect of utterances and literature on attentive listening. Quantitative evaluations using 37,995 responsive utterances showed the appropriateness of the proposed classification.

2019

pdf abs
PTB Graph Parsing with Tree Approximation
Yoshihide Kato | Shigeki Matsubara
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The Penn Treebank (PTB) represents syntactic structures as graphs due to nonlocal dependencies. This paper proposes a method that approximates PTB graph-structured representations by trees. By our approximation method, we can reduce nonlocal dependency identification and constituency parsing into single tree-based parsing. An experimental result demonstrates that our approximation method with an off-the-shelf tree-based constituency parser significantly outperforms the previous methods in nonlocal dependency identification.

2018

pdf
Statistical Analysis of Missing Translation in Simultaneous Interpretation Using A Large-scale Bilingual Speech Corpus
Zhongxi Cai | Koichiro Ryu | Shigeki Matsubara
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
Model-Theoretic Incremental Interpretation Based on Discourse Representation Theory
Yoshihide Kato | Shigeki Matsubara
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

2016

pdf abs
Correcting Errors in a Treebank Based on Tree Mining
Kanta Suzuki | Yoshihide Kato | Shigeki Matsubara
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper provides a new method to correct annotation errors in a treebank. The previous error correction method constructs a pseudo parallel corpus where incorrect partial parse trees are paired with correct ones, and extracts error correction rules from the parallel corpus. By applying these rules to a treebank, the method corrects errors. However, this method does not achieve wide coverage of error correction. To achieve wide coverage, our method adopts a different approach. In our method, we consider that an infrequent pattern which can be transformed to a frequent one is an annotation error pattern. Based on a tree mining technique, our method seeks such infrequent tree patterns, and constructs error correction rules each of which consists of an infrequent pattern and a corresponding frequent pattern. We conducted an experiment using the Penn Treebank. We obtained 1,987 rules which are not constructed by the previous method, and the rules achieved good precision.

pdf
Transition-Based Left-Corner Parsing for Identifying PTB-Style Nonlocal Dependencies
Yoshihide Kato | Shigeki Matsubara
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In spoken dialogues, if a spoken dialogue system does not respond at all during users utterances, the user might feel uneasy because the user does not know whether or not the system has recognized the utterances. In particular, back-channel utterances, which the system outputs as voices such as yeah and uh huh in English have important roles for a driver in in-car speech dialogues because the driver does not look owards a listener while driving. This paper describes construction of a back-channel utterance corpus and its analysis to develop the system which can output back-channel utterances at the proper timing in the responsive in-car speech dialogue. First, we constructed the back-channel utterance corpus by integrating the back-channel utterances that four subjects provided for the drivers utterances in 60 dialogues in the CIAIR in-car speech dialogue corpus. Next, we analyzed the corpus and revealed the relation between back-channel utterance timings and information on bunsetsu, clause, pause and rate of speech. Based on the analysis, we examined the possibility of detecting back-channel utterance timings by machine learning technique. As the result of the experiment, we confirmed that our technique achieved as same detection capability as a human.

pdf abs
Construction of Chunk-Aligned Bilingual Lecture Corpus for Simultaneous Machine Translation
Masaki Murata | Tomohiro Ohno | Shigeki Matsubara | Yasuyoshi Inagaki
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

With the development of speech and language processing, speech translation systems have been developed. These studies target spoken dialogues, and employ consecutive interpretation, which uses a sentence as the translation unit. On the other hand, there exist a few researches about simultaneous interpreting, and recently, the language resources for promoting simultaneous interpreting research, such as the publication of an analytical large-scale corpus, has been prepared. For the future, it is necessary to make the corpora more practical toward realization of a simultaneous interpreting system. In this paper, we describe the construction of a bilingual corpus which can be used for simultaneous lecture interpreting research. Simultaneous lecture interpreting systems are required to recognize translation units in the middle of a sentence, and generate its translation at the proper timing. We constructed the bilingual lecture corpus by the following steps. First, we segmented sentences in the lecture data into semantically meaningful units for the simultaneous interpreting. And then, we assigned the translations to these units from the viewpoint of the simultaneous interpreting. In addition, we investigated the possibility of automatically detecting the simultaneous interpreting timing from our corpus.

pdf abs
Collection of Usage Information for Language Resources from Academic Articles
Shunsuke Kozawa | Hitomi Tohyama | Kiyotaka Uchimoto | Shigeki Matsubara
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus.

pdf
Correcting Errors in a Treebank Based on Synchronous Tree Substitution Grammar
Yoshihide Kato | Shigeki Matsubara
Proceedings of the ACL 2010 Conference Short Papers

pdf
Coherent Back-Channel Feedback Tagging of In-Car Spoken Dialogue Corpus
Yuki Kamiya | Tomohiro Ohno | Shigeki Matsubara
Proceedings of the SIGDIAL 2010 Conference

2009

pdf
Linefeed Insertion into Japanese Spoken Monologue for Captioning
Tomohiro Ohno | Masaki Murata | Shigeki Matsubara
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf
Incremental Parsing with Monotonic Adjoining Operation
Yoshihide Kato | Shigeki Matsubara
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2008

pdf
Construction of an Infrastructure for Providing Users with Suitable Language Resources
Hitomi Tohyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shigeki Matsubara | Hitoshi Isahara
Coling 2008: Companion volume: Posters

pdf abs
Automatic Acquisition of Usage Information for Language Resources
Shunsuke Kozawa | Hitomi Tohyama | Kiyotaka Uchimoto | Shigeki Matsubara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Recently, language resources (LRs) are becoming indispensable for linguistic research. Unfortunately, it is not easy to find their usages by searching the web even though they must be described in the Internet or academic articles. This indicates that the intrinsic value of LRs is not recognized very well. In this research, therefore, we extract a list of usage information for each LR to promote the efficient utilization of LRs. In this paper, we proposed a method for extracting a list of usage information from academic articles by using rules based on syntactic information. The rules are generated by focusing on the syntactic features that are observed in the sentences describing usage information. As a result of experiments, we achieved 72.9% in recall and 78.4% in precision for the closed test and 60.9% in recall and 72.7% in precision for the open test.

pdf abs
Construction of a Metadata Database for Efficient Development and Use of Language Resources
Hitomi Tohyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shigeki Matsubara | Hitoshi Isahara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The National Institute of Information and Communications Technology (NICT) and Nagoya University have been jointly constructing a large scale database named SHACHI by collecting detailed meta-information on language resources (LRs) in Asia and Western countries, for the purpose of effectively combining LRs. The purpose of this project is to investigate languages, tag sets, and formats compiled in LRs throughout the world, to systematically store LR metadata, to create a search function for this information, and to ultimately utilize all this for a more efficient development of LRs. This metadata database contains more than 2,000 compiled LRs such as corpora, dictionaries, thesauruses and lexicons, forming a large scale metadata of LRs archive. Its metadata, an extended version of OLAC metadata set conforming to Dublin Core, which contain detailed meta-information, have been collected semi-automatically. This paper explains the design and the structure of the metadata database, as well as the realization of the catalogue search tool. Additionally, the website of this database is now open to the public and accessible to all Internet users.

pdf abs
Construction and Analysis of Word-level Time-aligned Simultaneous Interpretation Corpus
Takahiro Ono | Hitomi Tohyama | Shigeki Matsubara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, quantitative analyses of the delay in Japanese-to-English (J-E) and English-to-Japanese (E-J) interpretations are described. The Simultaneous Interpretation Database of Nagoya University (SIDB) was used for the analyses. Beginning time and end time of each word were provided to the corpus using HMM-based phoneme segmentation, and the time lag between the corresponding words was calculated as the word-level delay. Word-level delay was calculated for 3,722 pairs and 4,932 pairs of words for J-E and E-J interpretations, respectively. The analyses revealed that J-E interpretation has much larger delay than E-J interpretation and that the difference of word order between Japanese and English affect the degree of delay.

2006

pdf abs
Layered Speech-Act Annotation for Spoken Dialogue Corpus
Yuki Irie | Shigeki Matsubara | Nobuo Kawaguchi | Yukiko Yamaguchi | Yasuyoshi Inagaki
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the design of speech act tags for spoken dialogue corpora and its evaluation. Compared with the tags used for conventional corpus annotation, the proposed speech intention tag is specialized enough to determine system operations. However, detailed information description increases tag types. This causes an ambiguous tag selection. Therefore, we have designed an organization of tags, with focusing attention on layered tagging and context-dependent tagging. Over 35,000 utterance units in the CIAIR corpus have been tagged by hand. To evaluate the reliability of the intention tag, a tagging experiment was conducted. The reliability of tagging is evaluated by comparing the tagging among some annotators using kappa value. As a result, we confirmed that reliable data could be built. This corpus with speech intention tag could be widely used from basic research to applications of spoken dialogue. In particular, this would play an important role from the viewpoint of practical use of spoken dialogue corpora.

pdf abs
A Syntactically Annotated Corpus of Japanese Spoken Monologue
Tomohiro Ohno | Shigeki Matsubara | Hideki Kashioka | Naoto Kato | Yasuyoshi Inagaki
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Recently, monologue data such as lecture and commentary by professionals have been considered as valuable intellectual resources, and have been gathering attention. On the other hand, in order to use these monologue data effectively and efficiently, it is necessary for the monologue data not only just to be accumulated but also to be structured. This paper describes the construction of a Japanese spoken monologue corpus in which dependency structure is given to each utterance. Spontaneous monologue includes a lot of very long sentences composed of two or more clauses. In these sentences, there may exist the subject or the adverb common to multi-clauses, and it may be considered that the subject or adverb depend on multi-predicates. In order to give the dependency information in a real fashion, our research allows that a bunsetsu depends on multiple bunsetsus.

pdf abs
Collection of Simultaneous Interpreting Patterns by Using Bilingual Spoken Monologue Corpus
Hitomi Tohyama | Shigeki Matsubara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The manual quantitative analysis of CIAIR simultaneous interpretation corpus and the collection of interpreting patterns This paper provides an investigation of simultaneous interpreting patterns using a bilingual spoken monologue corpus. 4,578 pairs of English-Japanese aligned utterances in CIAIR simultaneous interpretation database were used. This investigation is the largest scale as the observation of simultaneous interpreting speech. The simultaneous interpreters are required to generate the target speech simultaneously with the source speech. Therefore, they have various kinds of strategies to raise simultaneity. In this investigation, the simultaneous interpreting patterns with high frequency and high flexibility were extracted from the corpus. As a result, we collected 203 cases out of aligned utterances in which simultaneous interpretersf strategies for raising simultaneity were observed. These 203 cases could be categorized into 12 types of interpreting pattern. It was clarified that 4.5 percent of the English-Japanese monologue data were fitted in those interpreting patterns. These interpreting patterns can be expected to be used as interpreting rules of simultaneous machine interpretation.

pdf abs
A Corpus Search System Utilizing Lexical Dependency Structure
Yoshihide Kato | Shigeki Matsubara | Yasuyoshi Inagaki
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents a corpus search system utilizing lexical dependency structure. The user's query consists of lexical dependency structure. The user's query consists of a sequence of keywords. For a given query, the system automatically generates the dependency structure patterns which consist of keywords in the query, and returns the sentences whose dependency structures match the generated patterns. The dependency structure patterns are generated by using two operations: combining and interpolation, which utilize dependency structures in the searched corpus. The operations enable the system to generate only the dependency structure patterns that occur in the corpus. The system achieves simple and intuitive corpus search and it is enough linguistically sophisticated to utilize structural information.

pdf
Dependency Parsing of Japanese Spoken Monologue Based on Clause Boundaries
Tomohiro Ohno | Shigeki Matsubara | Hideki Kashioka | Takehiko Maruyama | Yasuyoshi Inagaki
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf
Simultaneous English-Japanese Spoken Language Translation Based on Incremental Dependency Parsing and Transfer
Koichiro Ryu | Shigeki Matsubara | Yasuyoshi Inagaki
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions