Satoshi Sekine

2021

Co-Teaching Student-Model through Submission Results of Shared Task
Kouta Nakayama | Shuhei Kurita | Akio Kobayashi | Yukino Baba | Satoshi Sekine
Findings of the Association for Computational Linguistics: EMNLP 2021

Shared tasks have a long history and have become the mainstream of NLP research. Most of the shared tasks require participants to submit only system outputs and descriptions. It is uncommon for the shared task to request submission of the system itself because of the license issues and implementation differences. Therefore, many systems are abandoned without being used in real applications or contributing to better systems. In this research, we propose a scheme to utilize all those systems which participated in the shared tasks. We use the outputs of all participating systems as task teachers in this scheme and develop a new model as a student aiming to learn the characteristics of each system. We call this scheme “Co-Teaching.” This scheme creates a unified system that performs better than the task’s single best system. It requires only the system outputs, and only slight extra effort is needed from the participants and organizers. We apply this scheme to the “SHINRA2019-JP” shared task, which has nine participants with various output accuracies, confirming that the unified system outperforms the best system. Moreover, the code used in our experiments has been released.

Ranking the user comments posted on a news article is important for online news services because comment visibility directly affects the user experience. Research on ranking comments with different metrics to measure the comment quality has shown “constructiveness” used in argument analysis is promising from a practical standpoint. In this paper, we report a case study in which this constructiveness is examined in the real world. Specifically, we examine an in-house competition to improve the performance of ranking constructive comments and demonstrate the effectiveness of the best obtained model for a commercial service.

2020

Multi-class Multilingual Classification of Wikipedia Articles Using Extended Named Entity Tag Set
Hassan S. Shavarani | Satoshi Sekine
Proceedings of the 12th Language Resources and Evaluation Conference

Wikipedia is a great source of general world knowledge which can help NLP models better understand their motivation to make predictions. Structuring Wikipedia is the initial step towards this goal which can facilitate fine-grained classification of articles. In this work, we introduce the Shinra 5-Language Categorization Dataset (SHINRA-5LDS), a large multi-lingual and multi-labeled set of annotated Wikipedia articles in Japanese, English, French, German, and Farsi using the Extended Named Entity (ENE) tag set. We evaluate the dataset using the best models provided for ENE label set classification and show that the currently available classification models struggle with large datasets using fine-grained tag sets.

2019

Analytic Score Prediction and Justification Identification in Automated Short Answer Scoring
Tomoya Mizumoto | Hiroki Ouchi | Yoriko Isobe | Paul Reisert | Ryo Nagata | Satoshi Sekine | Kentaro Inui
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

This paper provides an analytical assessment of student short answer responses with a view to potential benefits in pedagogical contexts. We first propose and formalize two novel analytical assessment tasks: analytic score prediction and justification identification, and then provide the first dataset created for analytic short answer scoring research. Subsequently, we present a neural baseline model and report our extensive empirical results to demonstrate how our dataset can be used to explore new and intriguing technical challenges in short answer scoring. The dataset is publicly available for research purposes.

Monotonicity reasoning is one of the important reasoning skills for any intelligent natural language inference (NLI) model in that it requires the ability to capture the interaction between lexical and syntactic structures. Since no test set has been developed for monotonicity reasoning with wide coverage, it is still unclear whether neural models can perform monotonicity reasoning in a proper way. To investigate this issue, we introduce the Monotonicity Entailment Dataset (MED). Performance by state-of-the-art NLI models on the new test set is substantially worse, under 55%, especially on downward reasoning. In addition, analysis using a monotonicity-driven data augmentation method showed that these models might be limited in their generalization ability in upward and downward reasoning.

HELP: A Dataset for Identifying Shortcomings of Neural Models in Monotonicity Reasoning
Hitomi Yanaka | Koji Mineshima | Daisuke Bekki | Kentaro Inui | Satoshi Sekine | Lasha Abzianidze | Johan Bos
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Large crowdsourced datasets are widely used for training and evaluating neural models on natural language inference (NLI). Despite these efforts, neural models have a hard time capturing logical inferences, including those licensed by phrase replacements, so-called monotonicity reasoning. Since no large dataset has been developed for monotonicity reasoning, it is still unclear whether the main obstacle is the size of datasets or the model architectures themselves. To investigate this issue, we introduce a new dataset, called HELP, for handling entailments with lexical and logical phenomena. We add it to training data for the state-of-the-art neural models and evaluate them on test sets for monotonicity phenomena. The results showed that our data augmentation improved the overall accuracy. We also find that the improvement is better on monotonicity inferences with lexical replacements than on downward inferences with disjunction and modification. This suggests that some types of inferences can be improved by our data augmentation while others are immune to it.

Select and Attend: Towards Controllable Content Selection in Text Generation
Xiaoyu Shen | Jun Suzuki | Kentaro Inui | Hui Su | Dietrich Klakow | Satoshi Sekine
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Many text generation tasks naturally contain two steps: content selection and surface realization. Current neural encoder-decoder models conflate both steps into a black-box architecture. As a result, the content to be described in the text cannot be explicitly controlled. This paper tackles this problem by decoupling content selection from the decoder. The decoupled content selection is human interpretable, whose value can be manually manipulated to control the content of generated text. The model can be trained end-to-end without human annotations by maximizing a lower bound of the marginal likelihood. We further propose an effective way to trade-off between performance and controllability with a single adjustable hyperparameter. In both data-to-text and headline generation tasks, our model achieves promising results, paving the way for controllable content selection in text generation.

Bridging the Defined and the Defining: Exploiting Implicit Lexical Semantic Relations in Definition Modeling
Koki Washio | Satoshi Sekine | Tsuneaki Kato
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Definition modeling includes acquiring word embeddings from dictionary definitions and generating definitions of words. While the meanings of defining words are important in dictionary definitions, it is crucial to capture the lexical semantic relations between defined words and defining words. However, thus far, the utilization of such relations has not been explored for definition modeling. In this paper, we propose definition modeling methods that use lexical semantic relations. To utilize implicit semantic relations in definitions, we use unsupervisedly obtained pattern-based word-pair embeddings that represent semantic relations of word pairs. Experimental results indicate that our methods improve the performance in learning embeddings from definitions, as well as definition generation.

2018

What Makes Reading Comprehension Questions Easier?
Saku Sugawara | Kentaro Inui | Satoshi Sekine | Akiko Aizawa
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

A challenge in creating a dataset for machine reading comprehension (MRC) is to collect questions that require a sophisticated understanding of language to answer beyond using superficial cues. In this work, we investigate what makes questions easier across 12 recent MRC datasets with three question styles (answer extraction, description, and multiple choice). We propose to employ simple heuristics to split each dataset into easy and hard subsets and examine the performance of two baseline models for each of the subsets. We then manually annotate questions sampled from each subset with both validity and requisite reasoning skills to investigate which skills explain the difference between easy and hard questions. From this study, we observed that (i) the baseline performances for the hard subsets remarkably degrade compared to those of entire datasets, (ii) hard questions require knowledge inference and multiple-sentence reasoning in comparison with easy questions, and (iii) multiple-choice questions tend to require a broader range of reasoning skills than answer extraction and description questions. These results suggest that one might overestimate recent advances in MRC.

Named entity recognition (NER) has attracted a substantial amount of research. Recently, several neural network-based models have been proposed and achieved high performance. However, there is little research on fine-grained NER (FG-NER), in which hundreds of named entity categories must be recognized, especially for non-English languages. It is still an open question whether there is a model that is robust across various settings or whether the proper model varies depending on the language, the number of named entity categories, and the size of training datasets. This paper first presents an empirical comparison of FG-NER models for English and Japanese and demonstrates that LSTM+CNN+CRF (Ma and Hovy, 2016), one of the state-of-the-art methods for English NER, also works well for English FG-NER but does not work well for Japanese, a language that has a large number of character types. To tackle this problem, we propose a method to improve the neural network-based Japanese FG-NER performance by removing the CNN layer and utilizing dictionary and category embeddings. Experimental results show that the proposed method improves Japanese FG-NER F-score from 66.76% to 75.18%.

2017

2016

Name Variation in Community Question Answering Systems
Anietie Andy | Satoshi Sekine | Mugizi Rwebangira | Mark Dredze
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

Community question answering systems are forums where users can ask and answer questions in various categories. Examples are Yahoo! Answers, Quora, and Stack Overflow. A common challenge with such systems is that a significant percentage of asked questions are left unanswered. In this paper, we propose an algorithm to reduce the number of unanswered questions in Yahoo! Answers by reusing the answer to the past resolved question on the site that is most similar to the unanswered question. Semantically similar questions could be worded differently, thereby making it difficult to find questions that have shared needs. For example, “Who is the best player for the Reds?” and “Who is currently the biggest star at Manchester United?” have a shared need but are worded differently; also, “Reds” and “Manchester United” are used to refer to the soccer team Manchester United football club. In this research, we focus on question categories that contain a large number of named entities and entity name variations. We show that in these categories, entity linking can be used to identify relevant past resolved questions that share needs with a given question by disambiguating named entities and matching these questions based on the disambiguated entities, identified entities, and knowledge base information related to these entities. We evaluated our algorithm on a new dataset constructed from Yahoo! Answers. The dataset contains annotated question pairs, (Qgiven, [Qpast, Answer]). We carried out experiments on several question categories and show that an entity-based approach gives good performance when searching for similar questions in entity-rich categories.

An Entity-Based approach to Answering Recurrent and Non-Recurrent Questions with Past Answers
Anietie Andy | Mugizi Rwebangira | Satoshi Sekine
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)

Community question answering (CQA) systems such as Yahoo! Answers allow registered users to ask and answer questions in various question categories. However, a significant percentage of asked questions in Yahoo! Answers are unanswered. In this paper, we propose to reduce this percentage by reusing answers to past resolved questions from the site. Specifically, we propose to satisfy unanswered questions in entity-rich categories by searching for and reusing the best answers to past resolved questions with shared needs. For unanswered questions that do not have a past resolved question with a shared need, we propose to use the best answer to a past resolved question with similar needs. Our experiments on a Yahoo! Answers dataset show that our approach retrieves most of the past resolved questions that have shared and similar needs to unanswered questions.

Neural Joint Learning for Classifying Wikipedia Articles into Fine-grained Named Entity Types
Masatoshi Suzuki | Koji Matsuda | Satoshi Sekine | Naoaki Okazaki | Kentaro Inui
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Posters

We developed a search tool for ngrams extracted from a very large corpus (the current system uses the entire Wikipedia, which has 1.7 billion tokens). The tool supports queries with an arbitrary number of wildcards and/or specification by a combination of token, POS, chunk (such as NP, VP, PP) and Named Entity (NE). The previous system (Sekine 08) can only handle tokens and unrestricted wildcards in the query, such as * was established in *. However, being able to constrain the wildcards by POS, chunk or NE is quite useful to filter out noise. For example, the new system can search for NE=COMPANY was established in POS=CD. This finer specification reduces the number of outputs to less than half and avoids the ngrams which have a comma or a common noun at the first position or location information at the last position. It outputs the matched ngrams with their frequencies as well as all the contexts (i.e. sentences, KWIC lists and document ID information) where the matched ngrams occur in the corpus. It takes a fraction of a second for a search on a single CPU Linux-PC (1GB memory and 500GB disk) environment.

While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges.

2008

Extended Named Entity Ontology with Attribute Information
Satoshi Sekine
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Named Entities (NE) are regarded as an important type of semantic knowledge in many natural language processing (NLP) applications. Originally, a limited number of NE categories were proposed. In MUC, it was 7 categories - people, organization, location, time, date, money and percentage expressions. However, it was noticed that such a limited number of NE categories is too small for many applications. The author has proposed Extended Named Entity (ENE), which has about 200 categories (Sekine and Nobata 04). During the development of ENE, we noticed that many ENE categories have specific attributes, and those provide very important information for the entities. For example, rivers have attributes like source location, outflow, and length. Some such information is essential to knowing about the river, while the name is only a label which can be used to refer to the river. Also, such attributes are important information for many NLP applications. In this paper, we report on the design of a set of attributes for ENE categories. We used a bottom up approach to creating the knowledge using a Japanese encyclopedia, which contains abundant descriptions of ENE instances.

Sentiment Analysis Based on Probabilistic Models Using Inter-Sentence Information
Kugatsu Sadamitsu | Satoshi Sekine | Mikio Yamamoto
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper proposes a new method of sentiment analysis that utilizes inter-sentence structures, especially for coping with the reversal phenomenon of word polarity, such as the quotation of others' opinions on the opposite side. We model these phenomena using Hidden Conditional Random Fields (HCRFs) with three kinds of features: transition features, polarity features, and reversal (of polarity) features. Polarity features and reversal features are doubly added to each word, and the feature weights are trained on the common structure of the positive and negative corpora, assuming, for example, that the reversal phenomenon occurs for the same reasons (features) in corpora of both polarities. Our method achieved better accuracy than the Naive Bayes method and accuracy comparable to that of SVMs.

A Linguistic Knowledge Discovery Tool: Very Large Ngram Database Search with Arbitrary Wildcards
Satoshi Sekine
Coling 2008: Companion volume: Demonstrations

2007

System Demonstration of On-Demand Information Extraction
Satoshi Sekine | Akira Oda
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

The SemEval-2007 WePS Evaluation: Establishing a benchmark for the Web People Search Task
Javier Artiles | Julio Gonzalo | Satoshi Sekine
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing
Satoshi Sekine | Kentaro Inui | Ido Dagan | Bill Dolan | Danilo Giampiccolo | Bernardo Magnini
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing

2006

Preemptive Information Extraction using Unrestricted Relation Discovery
Yusuke Shinyama | Satoshi Sekine
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

Using Phrasal Patterns to Identify Discourse Relations
Manami Saito | Kazuhide Yamamoto | Satoshi Sekine
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

On-Demand Information Extraction
Satoshi Sekine
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

A System to Solve Language Tests for Second Grade Students
Manami Saito | Kazuhide Yamamoto | Satoshi Sekine | Hitoshi Isahara
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

Automatic Paraphrase Discovery based on Context and Keywords between NE Pairs
Satoshi Sekine
Proceedings of the Third International Workshop on Paraphrasing (IWP2005)

2004

Named Entity Discovery Using Comparable News Articles
Yusuke Shinyama | Satoshi Sekine
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Cross-lingual Information Extraction System Evaluation
Kiyoshi Sudo | Satoshi Sekine | Ralph Grishman
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Automatic Construction of Japanese KATAKANA Variant List from Large Corpus
Takeshi Masuyama | Satoshi Sekine | Hiroshi Nakagawa
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Multilingual Aligned Parallel Treebank Corpus Reflecting Contextual Information and Its Applications
Kiyotaka Uchimoto | Yujie Zhang | Kiyoshi Sudo | Masaki Murata | Satoshi Sekine | Hitoshi Isahara
Proceedings of the Workshop on Multilingual Linguistic Resources

Definition, Dictionaries and Tagger for Extended Named Entity Hierarchy
Satoshi Sekine | Chikashi Nobata
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Automatic Extraction of Hyponyms from Japanese Newspapers. Using Lexico-syntactic Patterns
Maya Ando | Satoshi Sekine | Shun Ishizaki
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Discovering Relations among Named Entities from Large Corpora
Takaaki Hasegawa | Satoshi Sekine | Ralph Grishman
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

2003

An Improved Extraction Pattern Representation Model for Automatic IE Pattern Acquisition
Kiyoshi Sudo | Satoshi Sekine | Ralph Grishman
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

Morphological Analysis of a Large Spontaneous Speech Corpus in Japanese
Kiyotaka Uchimoto | Chikashi Nobata | Atsushi Yamada | Satoshi Sekine | Hitoshi Isahara
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

pre-CODIE–Crosslingual On-Demand Information Extraction
Kiyoshi Sudo | Satoshi Sekine | Ralph Grishman
Companion Volume of the Proceedings of HLT-NAACL 2003 - Demonstrations

A survey for Multi-Document Summarization
Satoshi Sekine | Chikashi Nobata
Proceedings of the HLT-NAACL 03 Text Summarization Workshop

Evaluation of Features for Sentence Extraction on Different Types of Corpora
Chikashi Nobata | Satoshi Sekine | Hitoshi Isahara
Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering

Paraphrase Acquisition for Information Extraction
Yusuke Shinyama | Satoshi Sekine
Proceedings of the Second International Workshop on Paraphrasing

2002

Text Generation from Keywords
Kiyotaka Uchimoto | Satoshi Sekine | Hitoshi Isahara
COLING 2002: The 19th International Conference on Computational Linguistics

Morphological Analysis of the Spontaneous Speech Corpus
Kiyotaka Uchimoto | Chikashi Nobata | Atsushi Yamada | Satoshi Sekine | Hitoshi Isahara
COLING 2002: The 19th International Conference on Computational Linguistics: Project Notes

Summarization System Integrated with Named Entity Tagging and IE pattern Discovery
Chikashi Nobata | Satoshi Sekine | Hitoshi Isahara | Ralph Grishman
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

Extended Named Entity Hierarchy
Satoshi Sekine | Kiyoshi Sudo | Chikashi Nobata
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2001

Word Translation Based on Machine Learning Models Using Translation Memory and Corpora
Kiyotaka Uchimoto | Satoshi Sekine | Masaki Murata | Hitoshi Isahara
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

Automatic Pattern Acquisition for Japanese Information Extraction
Kiyoshi Sudo | Satoshi Sekine | Ralph Grishman
Proceedings of the First International Conference on Human Language Technology Research

The Unknown Word Problem: a Morphological Analysis of Japanese Using Maximum Entropy Aided by a Dictionary
Kiyotaka Uchimoto | Satoshi Sekine | Hitoshi Isahara
Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing

2000

Difficulty Indices for the Named Entity Task in Japanese
Chikashi Nobata | Satoshi Sekine | Jun’ichi Tsujii
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

Backward Beam Search Algorithm for Dependency Analysis of Japanese
Satoshi Sekine | Kiyotaka Uchimoto | Hitoshi Isahara
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

Japanese Dependency Analysis using a Deterministic Finite State Transducer
Satoshi Sekine
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

Word Order Acquisition from Corpora
Kiyotaka Uchimoto | Masaki Murata | Qing Ma | Satoshi Sekine | Hitoshi Isahara
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

Japanese Named Entity Extraction Evaluation - Analysis of Results -
Satoshi Sekine | Yoshio Eriguchi
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

Dependency Model using Posterior Context
Kiyotaka Uchimoto | Masaki Murata | Satoshi Sekine | Hitoshi Isahara
Proceedings of the Sixth International Workshop on Parsing Technologies

We describe a new model for dependency structure analysis. This model learns the relationship between two phrasal units called bunsetsus as three categories: ‘between’, ‘dependent’, and ‘beyond’, and estimates the dependency likelihood by considering not only the relationship between two bunsetsus but also the relationship between the left bunsetsu and all of the bunsetsus to its right. We implemented this model based on the maximum entropy model. When using the Kyoto University corpus, the dependency accuracy of our model was 88%, which is about 1% higher than that of the conventional model using exactly the same features.

IREX: IR & IE Evaluation Project in Japanese
Satoshi Sekine | Hitoshi Isahara
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

A Treebank of Spanish and its Application to Parsing
Antonio Moreno | Ralph Grishman | Susana López | Fernando Sánchez | Satoshi Sekine
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1999

Statistical Matching of Two Ontologies
Satoshi Sekine | Kiyoshi Sudo | Takano Ogino
SIGLEX99: Standardizing Lexical Resources

Japanese Dependency Structure Analysis Based on Maximum Entropy Models
Kiyotaka Uchimoto | Satoshi Sekine | Hitoshi Isahara
Ninth Conference of the European Chapter of the Association for Computational Linguistics

1998

A Decision Tree Method for Finding and Classifying Names in Japanese Texts
Satoshi Sekine | Ralph Grishman | Hiroyuki Shinnou
Sixth Workshop on Very Large Corpora

Description of the Japanese NE System Used for MET-2
Satoshi Sekine
Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998

Japanese IE System and Customization Tool
Chikashi Nobata | Satoshi Sekine | Roman Yangarber
TIPSTER TEXT PROGRAM PHASE III: Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998

1997

The Domain Dependence of Parsing
Satoshi Sekine
Fifth Conference on Applied Natural Language Processing

1996

Modeling Topic Coherence for Speech Recognition
Satoshi Sekine
COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics

1995

A Corpus-based Probabilistic Grammar with Only Two Non-terminals
Satoshi Sekine | Ralph Grishman
Proceedings of the Fourth International Workshop on Parsing Technologies

The availability of large, syntactically-bracketed corpora such as the Penn Tree Bank affords us the opportunity to automatically build or train broad-coverage grammars, and in particular to train probabilistic grammars. A number of recent parsing experiments have also indicated that grammars whose production probabilities are dependent on the context can be more effective than context-free grammars in selecting a correct parse. To make maximal use of context, we have automatically constructed, from the Penn Tree Bank version 2, a grammar in which the symbols S and NP are the only real nonterminals, and the other non-terminals or grammatical nodes are in effect embedded into the right-hand-sides of the S and NP rules. For example, one of the rules extracted from the tree bank would be S -> NP VBX JJ CC VBX NP [1] (where NP is a non-terminal and the other symbols are terminals – part-of-speech tags of the Tree Bank). The most common structure in the Tree Bank associated with this expansion is (S NP (VP (VP VBX (ADJ JJ) CC (VP VBX NP)))) [2]. So if our parser uses rule [1] in parsing a sentence, it will generate structure [2] for the corresponding part of the sentence. Using 94% of the Penn Tree Bank for training, we extracted 32,296 distinct rules (23,386 for S, and 8,910 for NP). We also built a smaller version of the grammar based on higher frequency patterns for use as a back-up when the larger grammar is unable to produce a parse due to memory limitation. We applied this parser to 1,989 Wall Street Journal sentences (separate from the training set and with no limit on sentence length). Of the parsed sentences (1,899), the percentage of no-crossing sentences is 33.9%, and Parseval recall and precision are 73.43% and 72.61%.