Takenobu Tokunaga

2024

pdf abs
SIERA: An Evaluation Metric for Text Simplification using the Ranking Model and Data Augmentation by Edit Operations
Hikaru Yamanaka | Takenobu Tokunaga
Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI) @ LREC-COLING 2024

Automatic evaluation metrics are indispensable for text simplification (TS) research. The past TS research adopts three evaluation aspects: fluency, meaning preservation and simplicity. However, there is little consensus on a metric to measure simplicity, a unique aspect of TS compared with other text generation tasks. In addition, many of the existing metrics require reference simplified texts for evaluation. Thus, the cost of collecting reference texts is also an issue. This study proposes a new automatic evaluation metric, SIERA, for sentence simplification. SIERA employs a ranking model for the order relation of simplicity, which is trained by pairs of the original and simplified sentences. It does not require reference sentences for either training or evaluation. The sentence pairs for training are further augmented by the proposed method that utlizes edit operations to generate intermediate sentences with the simplicity between the original and simplified sentences. Using three evaluation datasets for text simplification, we compare SIERA with other metrics by calculating the correlations between metric values and human ratings. The results showed SIERA’s superiority over other metrics with a reservation that the quality of evaluation sentences is consistent with that of the training data.

pdf abs
Analyzing Interpretability of Summarization Model with Eye-gaze Information
Fariz Ikhwantri | Hiroaki Yamada | Takenobu Tokunaga
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Interpretation methods provide saliency scores indicating the importance of input words for neural summarization models. Prior work has analyzed models by comparing them to human behavior, often using eye-gaze as a proxy for human attention in reading tasks such as classification. This paper presents a framework to analyze the model behavior in summarization by comparing it to human summarization behavior using eye-gaze data. We examine two research questions: RQ1) whether model saliency conforms to human gaze during summarization and RQ2) how model saliency and human gaze affect summarization performance. For RQ1, we measure conformity by calculating the correlation between model saliency and human fixation counts. For RQ2, we conduct ablation experiments removing words/sentences considered important by models or humans. Experiments on two datasets with human eye-gaze during summarization partially confirm that model saliency aligns with human gaze (RQ1). However, ablation experiments show that removing highly-attended words/sentences from the human gaze does not significantly degrade performance compared with the removal by the model saliency (RQ2).

2022

pdf abs
Cross-domain Analysis on Japanese Legal Pretrained Language Models
Keisuke Miyazaki | Hiroaki Yamada | Takenobu Tokunaga
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

This paper investigates the pretrained language model (PLM) specialised in the Japanese legal domain. We create PLMs using different pretraining strategies and investigate their performance across multiple domains. Our findings are (i) the PLM built with general domain data can be improved by further pretraining with domain-specific data, (ii) domain-specific PLMs can learn domain-specific and general word meanings simultaneously and can distinguish them, (iii) domain-specific PLMs work better on its target domain; still, the PLMs retain the information learnt in the original PLM even after being further pretrained with domain-specific data, (iv) the PLMs sequentially pretrained with corpora of different domains show high performance for the later learnt domains.

pdf abs
Annotation Study of Japanese Judgments on Tort for Legal Judgment Prediction with Rationales
Hiroaki Yamada | Takenobu Tokunaga | Ryutaro Ohara | Keisuke Takeshita | Mihoko Sumida
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper describes a comprehensive annotation study on Japanese judgment documents in civil cases. We aim to build an annotated corpus designed for Legal Judgment Prediction (LJP), especially for torts. Our annotation scheme contains annotations of whether tort is accepted by judges as well as its corresponding rationales for explainability purpose. Our annotation scheme extracts decisions and rationales at character-level. Moreover, the scheme can capture the explicit causal relation between judge’s decisions and their corresponding rationales, allowing multiple decisions in a document. To obtain high-quality annotation, we developed an annotation scheme with legal experts, and confirmed its reliability by agreement studies with Krippendorff’s alpha metric. The result of the annotation study suggests the proposed annotation scheme can produce a dataset of Japanese LJP at reasonable reliability.

pdf abs
Automating Idea Unit Segmentation and Alignment for Assessing Reading Comprehension via Summary Protocol Analysis
Marcello Gecchele | Hiroaki Yamada | Takenobu Tokunaga | Yasuyo Sawaki | Mika Ishizuka
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we approach summary evaluation from an applied linguistics (AL) point of view. We provide computational tools to AL researchers to simplify the process of Idea Unit (IU) segmentation. The IU is a segmentation unit that can identify chunks of information. These chunks can be compared across documents to measure the content overlap between a summary and its source text. We propose a full revision of the annotation guidelines to allow machine implementation. The new guideline also improves the inter-annotator agreement, rising from 0.547 to 0.785 (Cohen’s Kappa). We release L2WS 2021, a IU gold standard corpus composed of 40 manually annotated student summaries. We propose IUExtract; i.e. the first automatic segmentation algorithm based on the IU. The algorithm was tested over the L2WS 2021 corpus. Our results are promising, achieving a precision of 0.789 and a recall of 0.844. We tested an existing approach to IU alignment via word embeddings with the state of the art model SBERT. The recorded precision for the top 1 aligned pair of IUs was 0.375. We deemed this result insufficient for effective automatic alignment. We propose “SAT”, an online tool to facilitate the collection of alignment gold standards for future training.

2021

pdf abs
Parsing Argumentative Structure in English-as-Foreign-Language Essays
Jan Wira Gotama Putra | Simone Teufel | Takenobu Tokunaga
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

This paper presents a study on parsing the argumentative structure in English-as-foreign-language (EFL) essays, which are inherently noisy. The parsing process consists of two steps, linking related sentences and then labelling their relations. We experiment with several deep learning architectures to address each task independently. In the sentence linking task, a biaffine model performed the best. In the relation labelling task, a fine-tuned BERT model performed the best. Two sentence encoders are employed, and we observed that non-fine-tuning models generally performed better when using Sentence-BERT as opposed to BERT encoder. We trained our models using two types of parallel texts: original noisy EFL essays and those improved by annotators, then evaluate them on the original essays. The experiment shows that an end-to-end in-domain system achieved an accuracy of .341. On the other hand, the cross-domain system achieved 94% performance of the in-domain system. This signals that well-written texts can also be useful to train argument mining system for noisy texts.

This paper describes the system of our team (NHK) for the WAT 2021 Japanese-English restricted machine translation task. In this task, the aim is to improve quality while maintaining consistent terminology for scientific paper translation. This task has a unique feature, where some words in a target sentence are given in addition to a source sentence. In this paper, we use a lexically-constrained neural machine translation (NMT), which concatenates the source sentence and constrained words with a special token to input them into the encoder of NMT. The key to the successful lexically-constrained NMT is the way to extract constraints from a target sentence of training data. We propose two extraction methods: proper-noun constraint and mistranslated-word constraint. These two methods consider the importance of words and fallibility of NMT, respectively. The evaluation results demonstrate the effectiveness of our lexical-constraint method.

pdf bib abs
Multi-task and Multi-corpora Training Strategies to Enhance Argumentative Sentence Linking Performance
Jan Wira Gotama Putra | Simone Teufel | Takenobu Tokunaga
Proceedings of the 8th Workshop on Argument Mining

Argumentative structure prediction aims to establish links between textual units and label the relationship between them, forming a structured representation for a given input text. The former task, linking, has been identified by earlier works as particularly challenging, as it requires finding the most appropriate structure out of a very large search space of possible link combinations. In this paper, we improve a state-of-the-art linking model by using multi-task and multi-corpora training strategies. Our auxiliary tasks help the model to learn the role of each sentence in the argumentative structure. Combining multi-corpora training with a selective sampling strategy increases the training data size while ensuring that the model still learns the desired target distribution well. Experiments on essays written by English-as-a-foreign-language learners show that both strategies significantly improve the model’s performance; for instance, we observe a 15.8% increase in the F1-macro for individual link predictions.

2020

pdf abs
Effective Use of Target-side Context for Neural Machine Translation
Hideya Mino | Hitoshi Ito | Isao Goto | Ichiro Yamada | Takenobu Tokunaga
Proceedings of the 28th International Conference on Computational Linguistics

In this paper, we deal with two problems in Japanese-English machine translation of news articles. The first problem is the quality of parallel corpora. Neural machine translation (NMT) systems suffer degraded performance when trained with noisy data. Because there is no clean Japanese-English parallel data for news articles, we build a novel parallel news corpus consisting of Japanese news articles translated into English in a content-equivalent manner. This is the first content-equivalent Japanese-English news corpus translated specifically for training NMT systems. The second problem involves the domain-adaptation technique. NMT systems suffer degraded performance when trained with mixed data having different features, such as noisy data and clean data. Though the existing methods try to overcome this problem by using tags for distinguishing the differences between corpora, it is not sufficient. We thus extend a domain-adaptation method using multi-tags to train an NMT model effectively with the clean corpus and existing parallel news corpora with some types of noise. Experimental results show that our corpus increases the translation quality, and that our domain-adaptation method is more effective for learning with the multiple types of corpora than existing domain-adaptation methods are.

pdf abs
Content-Equivalent Translated Parallel News Corpus and Extension of Domain Adaptation for NMT
Hideya Mino | Hideki Tanaka | Hitoshi Ito | Isao Goto | Ichiro Yamada | Takenobu Tokunaga
Proceedings of the Twelfth Language Resources and Evaluation Conference

pdf abs
TIARA: A Tool for Annotating Discourse Relations and Sentence Reordering
Jan Wira Gotama Putra | Simone Teufel | Kana Matsumura | Takenobu Tokunaga
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper introduces TIARA, a new publicly available web-based annotation tool for discourse relations and sentence reordering. Annotation tasks such as these, which are based on relations between large textual objects, are inherently hard to visualise without either cluttering the display and/or confusing the annotators. TIARA deals with the visual complexity during the annotation process by systematically simplifying the layout, and by offering interactive visualisation, including coloured links, indentation, and dual-view. TIARA’s text view allows annotators to focus on the analysis of logical sequencing between sentences. A separate tree view allows them to review their analysis in terms of the overall discourse structure. The dual-view gives it an edge over other discourse annotation tools and makes it particularly attractive as an educational tool (e.g., for teaching students how to argue more effectively). As it is based on standard web technologies and can be easily customised to other annotation schemes, it can be easily used by anybody. Apart from the project it was originally designed for, in which hundreds of texts were annotated by three annotators, TIARA has already been adopted by a second discourse annotation study, which uses it in the teaching of argumentation.

pdf abs
Gamification Platform for Collecting Task-oriented Dialogue Data
Haruna Ogawa | Hitoshi Nishikawa | Takenobu Tokunaga | Hikaru Yokono
Proceedings of the Twelfth Language Resources and Evaluation Conference

Demand for massive language resources is increasing as the data-driven approach has established a leading position in Natural Language Processing. However, creating dialogue corpora is still a difficult task due to the complexity of the human dialogue structure and the diversity of dialogue topics. Though crowdsourcing is majorly used to assemble such data, it presents problems such as less-motivated workers. We propose a platform for collecting task-oriented situated dialogue data by using gamification. Combining a video game with data collection benefits such as motivating workers and cost reduction. Our platform enables data collectors to create their original video game in which they can collect dialogue data of various types of tasks by using the logging function of the platform. Also, the platform provides the annotation function that enables players to annotate their own utterances. The annotation can be gamified aswell. We aim at high-quality annotation by introducing such self-annotation method. We implemented a prototype of the proposed platform and conducted a preliminary evaluation to obtain promising results in terms of both dialogue data collection and self-annotation.

2019

pdf abs
Neural Machine Translation System using a Content-equivalently Translated Parallel Corpus for the Newswire Translation Tasks at WAT 2019
Hideya Mino | Hitoshi Ito | Isao Goto | Ichiro Yamada | Hideki Tanaka | Takenobu Tokunaga
Proceedings of the 6th Workshop on Asian Translation

This paper describes NHK and NHK Engineering System (NHK-ES)’s submission to the newswire translation tasks of WAT 2019 in both directions of Japanese→English and English→Japanese. In addition to the JIJI Corpus that was officially provided by the task organizer, we developed a corpus of 0.22M sentence pairs by manually, translating Japanese news sentences into English content- equivalently. The content-equivalent corpus was effective for improving translation quality, and our systems achieved the best human evaluation scores in the newswire translation tasks at WAT 2019.

pdf abs
Supporting content evaluation of student summaries by Idea Unit embedding
Marcello Gecchele | Hiroaki Yamada | Takenobu Tokunaga | Yasuyo Sawaki
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

This paper discusses the computer-assisted content evaluation of summaries. We propose a method to make a correspondence between the segments of the source text and its summary. As a unit of the segment, we adopt “Idea Unit (IU)” which is proposed in Applied Linguistics. Introducing IUs enables us to make a correspondence even for the sentences that contain multiple ideas. The IU correspondence is made based on the similarity between vector representations of IU. An evaluation experiment with two source texts and 20 summaries showed that the proposed method is more robust against rephrased expressions than the conventional ROUGE-based baselines. Also, the proposed method outperformed the baselines in recall. We im-plemented the proposed method in a GUI tool“Segment Matcher” that aids teachers to estab-lish a link between corresponding IUs acrossthe summary and source text.

2018

pdf
Effectiveness of Domain Adaptation in Japanese Predicate-Argument Structure Analysis
Mizuki Sango | Hitoshi Nishikawa | Takenobu Tokunaga
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf
Neural Japanese Zero Anaphora Resolution using Smoothed Large-scale Case Frames with Word Embedding
Souta Yamashiro | Hitoshi Nishikawa | Takenobu Tokunaga
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf abs
Interpretation of Implicit Conditions in Database Search Dialogues
Shunya Fukunaga | Hitoshi Nishikawa | Takenobu Tokunaga | Hikaru Yokono | Tetsuro Takahashi
Proceedings of the 27th International Conference on Computational Linguistics

Targeting the database search dialogue, we propose to utilise information in the user utterances that do not directly mention the database (DB) field of the backend database system but are useful for constructing database queries. We call this kind of information implicit conditions. Interpreting the implicit conditions enables the dialogue system more natural and efficient in communicating with humans. We formalised the interpretation of the implicit conditions as classifying user utterances into the related DB field while identifying the evidence for that classification at the same time. Introducing this new task is one of the contributions of this paper. We implemented two models for this task: an SVM-based model and an RCNN-based model. Through the evaluation using a corpus of simulated dialogues between a real estate agent and a customer, we found that the SVM-based model showed better performance than the RCNN-based model.

pdf
Analysis of Implicit Conditions in Database Search Dialogues
Shun-ya Fukunaga | Hitoshi Nishikawa | Takenobu Tokunaga | Hikaru Yokono | Tetsuro Takahashi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
Key-value Attention Mechanism for Neural Machine Translation
Hideya Mino | Masao Utiyama | Eiichiro Sumita | Takenobu Tokunaga
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

In this paper, we propose a neural machine translation (NMT) with a key-value attention mechanism on the source-side encoder. The key-value attention mechanism separates the source-side content vector into two types of memory known as the key and the value. The key is used for calculating the attention distribution, and the value is used for encoding the context representation. Experiments on three different tasks indicate that our model outperforms an NMT model with a conventional attention mechanism. Furthermore, we perform experiments with a conventional NMT framework, in which a part of the initial value of a weight matrix is set to zero so that the matrix is as the same initial-state as the key-value attention mechanism. As a result, we obtain comparable results with the key-value attention mechanism without changing the network structure.

pdf abs
An Eye-tracking Study of Named Entity Annotation
Takenobu Tokunaga | Hitoshi Nishikawa | Tomoya Iwakura
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

Utilising effective features in machine learning-based natural language processing (NLP) is crucial in achieving good performance for a given NLP task. The paper describes a pilot study on the analysis of eye-tracking data during named entity (NE) annotation, aiming at obtaining insights into effective features for the NE recognition task. The eye gaze data were collected from 10 annotators and analysed regarding working time and fixation distribution. The results of the preliminary qualitative analysis showed that human annotators tend to look at broader contexts around the target NE than recent state-of-the-art automatic NE recognition systems and to use predicate argument relations to identify the NE categories.

pdf abs
Evaluating text coherence based on semantic similarity graph
Jan Wira Gotama Putra | Takenobu Tokunaga
Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing

Coherence is a crucial feature of text because it is indispensable for conveying its communication purpose and meaning to its readers. In this paper, we propose an unsupervised text coherence scoring based on graph construction in which edges are established between semantically similar sentences represented by vertices. The sentence similarity is calculated based on the cosine similarity of semantic vectors representing sentences. We provide three graph construction methods establishing an edge from a given vertex to a preceding adjacent vertex, to a single similar vertex, or to multiple similar vertices. We evaluated our methods in the document discrimination task and the insertion task by comparing our proposed methods to the supervised (Entity Grid) and unsupervised (Entity Graph) baselines. In the document discrimination task, our method outperformed the unsupervised baseline but could not do the supervised baseline, while in the insertion task, our method outperformed both baselines.

pdf abs
Evaluation of Automatically Generated Pronoun Reference Questions
Arief Yudha Satria | Takenobu Tokunaga
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This study provides a detailed analysis of evaluation of English pronoun reference questions which are created automatically by machine. Pronoun reference questions are multiple choice questions that ask test takers to choose an antecedent of a target pronoun in a reading passage from four options. The evaluation was performed from two perspectives: the perspective of English teachers and that of English learners. Item analysis suggests that machine-generated questions achieve comparable quality with human-made questions. Correlation analysis revealed a strong correlation between the scores of machine-generated questions and that of human-made questions.

pdf abs
Annotation of argument structure in Japanese legal documents
Hiroaki Yamada | Simone Teufel | Takenobu Tokunaga
Proceedings of the 4th Workshop on Argument Mining

We propose a method for the annotation of Japanese civil judgment documents, with the purpose of creating flexible summaries of these. The first step, described in the current paper, concerns content selection, i.e., the question of which material should be extracted initially for the summary. In particular, we utilize the hierarchical argument structure of the judgment documents. Our main contributions are a) the design of an annotation scheme that stresses the connection between legal points (called issue topics) and argument structure, b) an adaptation of rhetorical status to suit the Japanese legal system and c) the definition of a linked argument structure based on legal sub-arguments. In this paper, we report agreement between two annotators on several aspects of the overall task.

2016

pdf bib abs
An extension of ISO-Space for annotating object direction
Daiki Gotou | Hitoshi Nishikawa | Takenobu Tokunaga
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

In this paper, we extend an existing annotation scheme ISO-Space for annotating necessary spatial information for the task placing an specified object at a specified location with a specified direction according to a natural language instruction. We call such task the spatial placement problem. Our extension particularly focuses on describing the object direction, when the object is placed on the 2D plane. We conducted an annotation experiment in which a corpus of 20 situated dialogues were annotated. The annotation result showed the number of newly introduced tags by our proposal is not negligible. We also implemented an analyser that automatically assigns the proposed tags to the corpus and evaluated its performance. The result showed that the performance for entity tag was quite high ranging from 0.68 to 0.99 in F-measure, but not the case for relation tags, i.e. less than 0.4 in F-measure.

pdf abs
Solving the AL Chicken-and-Egg Corpus and Model Problem: Model-free Active Learning for Phenomena-driven Corpus Construction
Dain Kaplan | Neil Rubens | Simone Teufel | Takenobu Tokunaga
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Active learning (AL) is often used in corpus construction (CC) for selecting “informative” documents for annotation. This is ideal for focusing annotation efforts when all documents cannot be annotated, but has the limitation that it is carried out in a closed-loop, selecting points that will improve an existing model. For phenomena-driven and exploratory CC, the lack of existing-models and specific task(s) for using it make traditional AL inapplicable. In this paper we propose a novel method for model-free AL utilising characteristics of phenomena for applying AL to select documents for annotation. The method can also supplement traditional closed-loop AL-based CC to extend the utility of the corpus created beyond a single task. We introduce our tool, MOVE, and show its potential with a real world case-study.

pdf abs
Parameter estimation of Japanese predicate argument structure analysis model using eye gaze information
Ryosuke Maki | Hitoshi Nishikawa | Takenobu Tokunaga
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper, we propose utilising eye gaze information for estimating parameters of a Japanese predicate argument structure (PAS) analysis model. We employ not only linguistic information in the text, but also the information of annotator eye gaze during their annotation process. We hypothesise that annotator’s frequent looks at certain candidates imply their plausibility of being the argument of the predicate. Based on this hypothesis, we consider annotator eye gaze for estimating the model parameters of the PAS analysis. The evaluation experiment showed that introducing eye gaze information increased the accuracy of the PAS analysis by 0.05 compared with the conventional methods.

2015

pdf
Incrementally Tracking Reference in Human/Human Dialogue Using Linguistic and Extra-Linguistic Information
Casey Kennington | Ryu Iida | Takenobu Tokunaga | David Schlangen
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

pdf abs
Building a Corpus of Manually Revised Texts from Discourse Perspective
Ryu Iida | Takenobu Tokunaga
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents building a corpus of manually revised texts which includes both before and after-revision information. In order to create such a corpus, we propose a procedure for revising a text from a discourse perspective, consisting of dividing a text to discourse units, organising and reordering groups of discourse units and finally modifying referring and connective expressions, each of which imposes limits on freedom of revision. Following the procedure, six revisers who have enough experience in either teaching Japanese or scoring Japanese essays revised 120 Japanese essays written by Japanese native speakers. Comparing the original and revised texts, we found some specific manual revisions frequently occurred between the original and revised texts, e.g. thesis statements were frequently placed at the beginning of a text. We also evaluate text coherence using the original and revised texts on the task of pairwise information ordering, identifying a more coherent text. The experimental results using two text coherence models demonstrated that the two models did not outperform the random baseline.

2013

pdf
Annotation for annotation - Toward eliciting implicit linguistic knowledge through annotation - (Project Note)
Takenobu Tokunaga | Ryu Iida | Koh Mitsuda
Proceedings of the 9th Joint ISO - ACL SIGSEM Workshop on Interoperable Semantic Annotation

pdf
Automatic Voice Selection in Japanese based on Various Linguistic Information
Ryu Iida | Takenobu Tokunaga
Proceedings of the 14th European Workshop on Natural Language Generation

pdf
Investigation of annotator’s behaviour using eye-tracking data
Ryu Iida | Koh Mitsuda | Takenobu Tokunaga
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

pdf
Detecting Missing Annotation Disagreement using Eye Gaze Information
Koh Mitsuda | Ryu Iida | Takenobu Tokunaga
Proceedings of the 11th Workshop on Asian Language Resources

2012

pdf
A Unified Probabilistic Approach to Referring Expressions
Kotaro Funakoshi | Mikio Nakano | Takenobu Tokunaga | Ryu Iida
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf
A Metric for Evaluating Discourse Coherence based on Coreference Resolution
Ryu Iida | Takenobu Tokunaga
Proceedings of COLING 2012: Posters

pdf abs
The REX corpora: A collection of multimodal corpora of referring expressions in collaborative problem solving dialogues
Takenobu Tokunaga | Ryu Iida | Asuka Terai | Naoko Kuriyama
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes a collection of multimodal corpora of referring expressions, the REX corpora. The corpora have two notable features, namely (1) they include time-aligned extra-linguistic information such as participant actions and eye-gaze on top of linguistic information, (2) dialogues were collected with various configurations in terms of the puzzle type, hinting and language. After describing how the corpora were constructed and sketching out each, we present an analysis of various statistics for the corpora with respect to the various configurations mentioned above. The analysis showed that the corpora have different characteristics in the number of utterances and referring expressions in a dialogue, the task completion time and the attributes used in the referring expressions. In this respect, we succeeded in constructing a collection of corpora that included a variety of characteristics by changing the configurations for each set of dialogues, as originally planned. The corpora are now under preparation for publication, to be used for research on human reference behaviour.

pdf abs
Effects of Document Clustering in Modeling Wikipedia-style Term Descriptions
Atsushi Fujii | Yuya Fujii | Takenobu Tokunaga
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Reflecting the rapid growth of science, technology, and culture, it has become common practice to consult tools on the World Wide Web for various terms. Existing search engines provide an enormous volume of information, but retrieved information is not organized. Hand-compiled encyclopedias provide organized information, but the quantity of information is limited. In this paper, aiming to integrate the advantages of both tools, we propose a method to organize a search result based on multiple viewpoints as in Wikipedia. Because viewpoints required for explanation are different depending on the type of a term, such as animal and disease, we model articles in Wikipedia to extract a viewpoint structure for each term type. To identify a set of term types, we independently use manual annotation and automatic document clustering for Wikipedia articles. We also propose an effective feature for clustering of Wikipedia articles. We experimentally show that the document clustering reduces the cost for the manual annotation while maintaining the accuracy for modeling Wikipedia articles.

2011

pdf
Multi-modal Reference Resolution in Situated Dialogue by Integrating Linguistic and Extra-Linguistic Clues
Ryu Iida | Masaaki Yasuhara | Takenobu Tokunaga
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf
Incorporating Extra-Linguistic Information into Reference Resolution in Collaborative Task Dialogue
Ryu Iida | Syumpei Kobayashi | Takenobu Tokunaga
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf
Construction of bilingual multimodal corpora of referring expressions in collaborative problem solving
Takenobu Tokunaga | Ryu Iida | Masaaki Yasuhara | Asuka Terai | David Morris | Anja Belz
Proceedings of the Eighth Workshop on Asian Language Resouces

pdf
Towards an Extrinsic Evaluation of Referring Expressions in Situated Dialogs
Philipp Spanger | Ryu Iida | Takenobu Tokunaga | Asuka Terai | Naoko Kuriyama
Proceedings of the 6th International Natural Language Generation Conference

pdf abs
Annotation Process Management Revisited
Dain Kaplan | Ryu Iida | Takenobu Tokunaga
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Proper annotation process management is crucial to the construction of corpora, which are in turn indispensable to the data-driven techniques that have come to the forefront in NLP during the last two decades. It is still common to see ad-hoc tools created for a specific annotation project, but it is time this changed; creation of such tools is labor and time expensive, and is secondary to corpus creation. In addition, such tools likely lack proper annotation process management, increasingly more important as corpora sizes grow in size and complexity. This paper first raises a list of ten needs that any general purpose annotation system should address moving forward, such as user & role management, delegation & monitoring of work, diffing & merging annotators work, versioning of corpora, multilingual support, import/export format flexibility, and so on. A framework to address these needs is then proposed, and how having proper annotation process management can be beneficial to the creation and maintenance of corpora explained. The paper then introduces SLATE (Segment and Link-based Annotation Tool Enhanced), the second iteration of a web-based annotation tool, which is being rewritten to implement the proposed framework.

2009

pdf
Obituaries: Hozumi Tanaka
Timothy Baldwin | Takenobu Tokunaga | Jun’ichi Tsujii
Computational Linguistics, Volume 35, Number 4, December 2009

pdf
A Japanese Corpus of Referring Expressions Used in a Situated Collaboration Task
Philipp Spanger | Masaaki Yasuhara | Ryu Iida | Takenobu Tokunaga
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf
A Probabilistic Model of Referring Expressions for Complex Objects
Kotaro Funakoshi | Philipp Spanger | Mikio Nakano | Takenobu Tokunaga
Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

pdf
Automatic Extraction of Citation Contexts for Research Paper Summarization: A Coreference-chain based Approach
Dain Kaplan | Ryu Iida | Takenobu Tokunaga
Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries (NLPIR4DL)

2008

Corpus-based approaches and statistical approaches have been the main stream of natural language processing research for the past two decades. Language resources play a key role in such approaches, but there is an insufficient amount of language resources in many Asian languages. In this situation, standardisation of language resources would be of great help in developing resources in new languages. This paper presents the latest development efforts of our project which aims at creating a common standard for Asian language resources that is compatible with an international standard. In particular, the paper focuses on i) lexical specification and data categories relevant for building multilingual lexical resources for Asian languages; ii) a core upper-layer ontology needed for ensuring multilingual interoperability and iii) the evaluation platform used to test the entire architectural framework.

pdf
On “Redundancy” in Selecting Attributes for Generating Referring Expressions
Philipp Spanger | Takehiro Kurosawa | Takenobu Tokunaga
Coling 2008: Companion volume: Posters

2007

pdf
Extracting phrasal alignments from comparable corpora by using joint probability SMT model
Tadashi Kumano | Hideki Tanaka | Takenobu Tokunaga
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

2006

pdf abs
A new approach to syntactic annotation
Masaki Noguchi | Hiroshi Ichikawa | Taiichi Hashimoto | Takenobu Tokunaga
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Many systems have been developed for creating syntactically annotated corpora. However, they mainly focus on interface usability and hardly pay attention toknowledge sharing among annotators in the task. In order to incorporate the functionality of knowledge sharing, we emphasized the importance of normalizingthe annotation process. As a first step toward knowledge sharing, this paper proposes a method of system initiative annotation in which the system suggests annotators the order of ambiguities to solve. To be more concrete, the system forces annotators to solve ambiguity of constituent structure in a top-down and depth-first manner, and then to solve ambiguity of grammatical category in a bottom-up and breadth-first manner. We implemented the system on top of eBonsai, our annotation tool, and conducted experiments to compare eBonsai and the proposed system in terms of annotation accuracy and efficiency. We found that at least for novice annotators, the proposed system is more efficient while keeping annotation accuracy comparable with eBonsai.

pdf
Efficient Sentence Retrieval Based on Syntactic Structure
Hiroshi Ichikawa | Keita Hakoda | Taiichi Hashimoto | Takenobu Tokunaga
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf
Identifying Repair Targets in Action Control Dialogue
Kotaro Funakoshi | Takenobu Tokunaga
11th Conference of the European Chapter of the Association for Computational Linguistics

pdf
Group-Based Generation of Referring Expressions
Kotaro Funakoshi | Satoru Watanabe | Takenobu Tokunaga
Proceedings of the Fourth International Natural Language Generation Conference

2005

pdf
eBonsai: An Integrated Environment for Annotating Treebanks
Hiroshi Ichikawa | Masaki Noguchi | Taiichi Hashimoto | Takenobu Tokunaga | Hozumi Tanaka
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

pdf bib
Evaluation of a Japanese CFG Derived from a Syntactically Annotated Corpus with Respect to Dependency Measures
Tomoya Noro | Chimato Koike | Taiichi Hashimoto | Takenobu Tokunaga | Hozumi Tanaka
Proceedings of the Fifth Workshop on Asian Language Resources (ALR-05) and First Symposium on Asian Language Resources Network (ALRN)

2004

pdf
Generation of Relative Referring Expressions based on Perceptual Grouping
Kotaro Funakoshi | Satoru Watanabe | Naoko Kuriyama | Takenobu Tokunaga
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf abs
Retrieving Annotated Corpora for Corpus Annotation
Kyôsuke Yoshida | Taiichi Hashimoto | Takenobu Tokunaga | Hozumi Tanaka
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper introduces a tool \Bonsai which supports human in annotating corpora with morphosyntactic information, and in retrieving syntactic structures stored in the database. Integrating annotation and retrieval enables users to annotate a new instance while looking back at the already annotated sentences which share the similar morphosyntactic structure. We focus on the retrieval part of the system, and describe a method to decompose a large input query into smaller ones in order to gain retrieval efficiency. The proposed method is evaluated with the Penn Treebank corpus, showing significant improvements.

pdf abs
Classification of Japanese Spatial Nouns
Takenobu Tokunaga | Tomofumi Koyama | Suguru Saito | Masayuki Nakajima
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

We have already proposed a framework to represent a location in terms of both symbolic and numeric aspects. In order to deal with vague linguistic expressions of a location, the representation adopts a potential function mapping a location to its plausibility. This paper proposes classification of Japanese spatial nouns and potential functions corresponding to each class. We focused on a common Japanese spatial expression ``X no Y (Y of X)'' where X is a reference object and Y is a spatial noun. For example, ``tukue no migi (the right of the desk)'' denotes a location with reference to the desk. This expression were collected from corpora, and spatial nouns appearing in the Y position were classified into two major classes; designating a part of the reference object and designating a location apart from the reference object . And the latter class were further classified into two subclasses; direction-oriented and distance-oriented. For each class, a potential function were designed for providing meaning of spatial nouns.

In Japanese constructions of the form [N1 no Adj N2], the adjective Adj modifies either N1 or N2. Determing the semantic dependencies of adjective in such phrase is an important task for machine translation. This paper describes a method for determining the adjective dependency in such constructions using decision lists, and inducing decision lists from training contexts with correct semantic dependencies and without. Based on evaluation, our method is able to determine adjective dependency with an precision of about 94%. We further analyze rules in the induced decision lists and examine effective features to determine the semantic dependencies of adjectives.

pdf
The Japanese Translation Task: Lexical and Structural Perspectives
Timothy Baldwin | Atsushi Okazaki | Takenobu Tokunaga | Hozumi Tanaka
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

2000

pdf
Semi-automatic Construction of a Tree-annotated Corpus Using an Iterative Learning Statistical Language Model
Kiyoaki Shirai | Hozumi Tanaka | Takenobu Tokunaga
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1999

pdf abs
Sharing syntactic structures
Masahiro Ueki | Takenobu Tokunaga | Hozumi Tanaka
Proceedings of Machine Translation Summit VII

Bracketed corpora are a very useful resource for natural language processing, but hard to build efficiently, leading to quantitative insufficiency for practical use. Disparities in morphological information, such as word segmentation and part-of-speech tag sets, are also troublesome. An application specific to a particular corpus often cannot be applied to another corpus. In this paper, we sketch out a method to build a corpus that has a fixed syntactic structure but varying morphological annotation based on the different tag set schemes utilized. Our system uses a two layered grammar, one layer of which is made up of replaceable tag-set-dependent rules while the other has no such tag set dependency. The input sentences of our system are bracketed corresponding to structural information of corpus. The parser can work using any tag set and grammar, and using the same input bracketing, we obtain corpus that shares partial syntactic structure.

pdf
Complementing WordNet with Roget’s and Corpus-based Thesauri for Information Retrieval
Rila Mandala | Takenobu Tokunaga | Hozumi Tanaka
Ninth Conference of the European Chapter of the Association for Computational Linguistics

1998

pdf bib
Selective Sampling for Example-based Word Sense Disambiguation
Atsushi Fujii | Kentaro Inui | Takenobu Tokunaga | Hozumi Tanaka
Computational Linguistics, Volume 24, Number 4, December 1998

pdf
The Use of WordNet in Information Retrieval
Mandala Rila | Takenobu Tokunaga | Hozumi Tanaka
Usage of WordNet in Natural Language Processing Systems

pdf
An Empirical Evaluation on Statistical Parsing of Japanese Sentences Using Lexical Association Statistics
Kiyoaki Shirai | Kentaro Inui | Takenobu Tokunaga | Hozumi Tanaka
Proceedings of the Third Conference on Empirical Methods for Natural Language Processing

1997

pdf
Extending a thesaurus by classifying words
Takenobu Tokunaga | Atsushi Fujii | Naoyuki Sakurai | Hozumi Tanaka
Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications

pdf
Integration of Hand-Crafted and Statistical Resources in Measuring Word Similarity
Atsushi Fujii | Toshihiro Hasegawa | Takenobu Tokunaga
Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications

pdf
Book Reviews: The Balancing Act: Combining Symbolic and Statistical Approaches to Language
Takenobu Tokunaga
Computational Linguistics, Volume 23, Number 4, December 1997

pdf abs
A New Formalization of Probabilistic GLR Parsing
Kentaro Unui | Virach Sornlertlamvanich | Hozumi Tanaka | Takenobu Tokunaga
Proceedings of the Fifth International Workshop on Parsing Technologies

This paper presents a new formalization of probabilistic GLR language modeling for statistical parsing. Our model inherits its essential features from Briscoe and Carroll’s generalized probabilistic LR model, which obtains context-sensitivity by assigning a probability to each LR parsing action according to its left and right context. Briscoe and Carroll’s model, however, has a drawback in that it is not formalized in any probabilistically well-founded way, which may degrade its parsing performance. Our formulation overcomes this drawback with a few significant refinements, while maintaining all the advantages of Briscoe and Carroll’s modeling.