Yuta Koreeda


Capturing Logical Structure of Visually Structured Documents with Multimodal Transition Parser
Yuta Koreeda | Christopher Manning
Proceedings of the Natural Legal Language Processing Workshop 2021

While many NLP pipelines assume raw, clean texts, many texts we encounter in the wild, including a vast majority of legal documents, are not so clean, with many of them being visually structured documents (VSDs) such as PDFs. Conventional preprocessing tools for VSDs mainly focused on word segmentation and coarse layout analysis, whereas fine-grained logical structure analysis (such as identifying paragraph boundaries and their hierarchies) of VSDs is underexplored. To that end, we proposed to formulate the task as prediction of “transition labels” between text fragments that maps the fragments to a tree, and developed a feature-based machine learning system that fuses visual, textual and semantic cues. Our system is easily customizable to different types of VSDs and it significantly outperformed baselines in identifying different structures in VSDs. For example, our system obtained a paragraph boundary detection F1 score of 0.953 which is significantly better than a popular PDF-to-text tool with an F1 score of 0.739.

ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
Yuta Koreeda | Christopher Manning
Findings of the Association for Computational Linguistics: EMNLP 2021

Reviewing contracts is a time-consuming procedure that incurs large expenses to companies and social inequality to those who cannot afford it. In this work, we propose “document-level natural language inference (NLI) for contracts”, a novel, real-world application of NLI that addresses such problems. In this task, a system is given a set of hypotheses (such as “Some obligations of Agreement may survive termination.”) and a contract, and it is asked to classify whether each hypothesis is “entailed by”, “contradicting to” or “not mentioned by” (neutral to) the contract as well as identifying “evidence” for the decision as spans in the contract. We annotated and release the largest corpus to date consisting of 607 annotated contracts. We then show that existing models fail badly on our task and introduce a strong baseline, which (a) models evidence identification as multi-label classification over spans instead of trying to predict start and end tokens, and (b) employs more sophisticated context segmentation for dealing with long documents. We also show that linguistic characteristics of contracts, such as negations by exceptions, are contributing to the difficulty of this task and that there is much room for improvement.


Hitachi at SemEval-2020 Task 12: Offensive Language Identification with Noisy Labels Using Statistical Sampling and Post-Processing
Manikandan Ravikiran | Amin Ekant Muljibhai | Toshinori Miyoshi | Hiroaki Ozaki | Yuta Koreeda | Sakata Masayuki
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we present our participation in SemEval-2020 Task-12 Subtask-A (English Language) which focuses on offensive language identification from noisy labels. To this end, we developed a hybrid system with the BERT classifier trained with tweets selected using Statistical Sampling Algorithm (SA) and Post-Processed (PP) using an offensive wordlist. Our developed system achieved 34th position with Macro-averaged F1-score (Macro-F1) of 0.90913 over both offensive and non-offensive classes. We further show comprehensive results and error analysis to assist future research in offensive language identification with noisy labels.

Towards Better Non-Tree Argument Mining: Proposition-Level Biaffine Parsing with Task-Specific Parameterization
Gaku Morio | Hiroaki Ozaki | Terufumi Morishita | Yuta Koreeda | Kohsuke Yanai
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

State-of-the-art argument mining studies have advanced the techniques for predicting argument structures. However, the technology for capturing non-tree-structured arguments is still in its infancy. In this paper, we focus on non-tree argument mining with a neural model. We jointly predict proposition types and edges between propositions. Our proposed model incorporates (i) task-specific parameterization (TSP) that effectively encodes a sequence of propositions and (ii) a proposition-level biaffine attention (PLBA) that can predict a non-tree argument consisting of edges. Experimental results show that both TSP and PLBA boost edge prediction performance compared to baselines.

Hitachi at MRP 2020: Text-to-Graph-Notation Transducer
Hiroaki Ozaki | Gaku Morio | Yuta Koreeda | Terufumi Morishita | Toshinori Miyoshi
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

This paper presents our proposed parser for the shared task on Meaning Representation Parsing (MRP 2020) at CoNLL, where participant systems were required to parse five types of graphs in different languages. We propose to unify these tasks as a text-to-graph-notation transduction in which we convert an input text into a graph notation. To this end, we designed a novel Plain Graph Notation (PGN) that handles various graphs universally. Then, our parser predicts a PGN-based sequence by leveraging Transformers and biaffine attentions. Notably, our parser can handle any PGN-formatted graphs with fewer framework-specific modifications. As a result, ensemble versions of the parser tied for 1st place in both cross-framework and cross-lingual tracks.


Hitachi at MRP 2019: Unified Encoder-to-Biaffine Network for Cross-Framework Meaning Representation Parsing
Yuta Koreeda | Gaku Morio | Terufumi Morishita | Hiroaki Ozaki | Kohsuke Yanai
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

This paper describes the proposed system of the Hitachi team for the Cross-Framework Meaning Representation Parsing (MRP 2019) shared task. In this shared task, the participating systems were asked to predict nodes, edges and their attributes for five frameworks, each with different order of “abstraction” from input tokens. We proposed a unified encoder-to-biaffine network for all five frameworks, which effectively incorporates a shared encoder to extract rich input features, decoder networks to generate anchorless nodes in UCCA and AMR, and biaffine networks to predict edges. Our system was ranked fifth with the macro-averaged MRP F1 score of 0.7604, and outperformed the baseline unified transition-based MRP. Furthermore, post-evaluation experiments showed that we can boost the performance of the proposed system by incorporating multi-task learning, whereas the baseline could not. These imply efficacy of incorporating the biaffine network to the shared architecture for MRP and that learning heterogeneous meaning representations at once can boost the system performance.


bunji at SemEval-2017 Task 3: Combination of Neural Similarity Features and Comment Plausibility Features
Yuta Koreeda | Takuya Hashito | Yoshiki Niwa | Misa Sato | Toshihiko Yanase | Kenzo Kurotsuchi | Kohsuke Yanai
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper describes a text-ranking system developed by bunji team in SemEval-2017 Task 3: Community Question Answering, Subtask A and C. The goal of the task is to re-rank the comments in a question-and-answer forum such that useful comments for answering the question are ranked high. We proposed a method that combines neural similarity features and hand-crafted comment plausibility features, and we modeled inter-comments relationship using conditional random field. Our approach obtained the fifth place in the Subtask A and the second place in the Subtask C.

StruAP: A Tool for Bundling Linguistic Trees through Structure-based Abstract Pattern
Kohsuke Yanai | Misa Sato | Toshihiko Yanase | Kenzo Kurotsuchi | Yuta Koreeda | Yoshiki Niwa
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present a tool for developing tree structure patterns that makes it easy to define the relations among textual phrases and create a search index for these newly defined relations. By using the proposed tool, users develop tree structure patterns through abstracting syntax trees. The tool features (1) intuitive pattern syntax, (2) unique functions such as recursive call of patterns and use of lexicon dictionaries, and (3) whole workflow support for relation development and validation. We report the current implementation of the tool and its effectiveness.


Neural Attention Model for Classification of Sentences that Support Promoting/Suppressing Relationship
Yuta Koreeda | Toshihiko Yanase | Kohsuke Yanai | Misa Sato | Yoshiki Niwa
Proceedings of the Third Workshop on Argument Mining (ArgMining2016)