2024
Search Query Refinement for Japanese Named Entity Recognition in E-commerce Domain
Yuki Nakayama | Ryutaro Tatsushima | Erick Mendieta | Koji Murakami | Keiji Shinzato
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
In the E-Commerce domain, search query refinement reformulates malformed queries into canonicalized forms through preprocessing operations such as “term splitting” and “term merging”. Unfortunately, most relevant research is limited to English; in particular, there is a severe lack of studies on search query refinement for Japanese. Furthermore, no attempt has been made to apply refinement methods to data improvement for downstream NLP tasks in real-world scenarios. This paper presents a novel query refinement approach for the Japanese language. Experimental results show that our method achieves a significant improvement of 3.5 points over a BERT-CRF baseline. Further experiments measure the beneficial impact of query refinement on named entity recognition (NER) as the downstream task. Evaluations indicate that the proposed query refinement method contributes to better data quality, boosting performance on E-Commerce-specific NER by 11.7 points compared to search query data preprocessed by MeCab, a widely adopted Japanese tokenizer.
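The refinement operations mentioned above (“term splitting” and “term merging”) can be pictured as edits applied at token boundaries once a tagger has predicted an operation for each boundary. Below is a minimal Python sketch of the term-merging case only, assuming a hypothetical KEEP/MERGE label set and a whitespace-tokenized query; it is an illustration, not the paper's actual scheme or model.

def refine_query(tokens, ops):
    """Apply per-boundary refinement operations to a whitespace-tokenized query.
    ops[i] describes the boundary between tokens[i] and tokens[i + 1]:
    "MERGE" joins the two terms (term merging), "KEEP" leaves the boundary as-is."""
    assert len(ops) == len(tokens) - 1, "one operation per token boundary"
    refined = [tokens[0]]
    for token, op in zip(tokens[1:], ops):
        if op == "MERGE":
            refined[-1] += token      # term merging: glue to the previous term
        else:
            refined.append(token)     # KEEP: boundary stays as-is
    return " ".join(refined)

# Toy example: an over-segmented brand name is merged back into one term.
print(refine_query(["アイ", "フォン", "ケース"], ["MERGE", "KEEP"]))  # -> "アイフォン ケース"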
2022
A Stacking-based Efficient Method for Toxic Language Detection on Live Streaming Chat
Yuto Oikawa | Yuki Nakayama | Koji Murakami
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
In a live streaming chat on a video streaming service, it is crucial to filter out toxic comments through online processing so that users do not read them in real time. However, recent toxic language detection methods rely on deep learning, which is not scalable in terms of inference speed. These methods also ignore the computational-resource constraints of the deployed system (e.g., no GPU resource). This paper presents an efficient method for toxic language detection that is aware of such real-world scenarios. Our proposed architecture is based on partial stacking, which feeds initial predictions with low confidence to a meta-classifier. Experimental results show that our method achieves much faster inference than BERT-based models with comparable performance.
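As a rough illustration of the partial-stacking idea, the sketch below routes only low-confidence predictions from a cheap base classifier to a heavier meta-classifier. The models, features, confidence threshold, and toy data are assumptions, not the paper's configuration, and the out-of-fold training that proper stacking requires is omitted for brevity.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["nice stream!", "go away idiot", "great play", "trash streamer", "love this", "you are garbage"]
labels = [0, 1, 0, 1, 0, 1]   # 1 = toxic

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(texts)

base = LogisticRegression(max_iter=1000).fit(X, labels)            # fast first-stage model
meta_features = np.hstack([X.toarray(), base.predict_proba(X)])    # original features + base outputs
meta = GradientBoostingClassifier().fit(meta_features, labels)     # heavier second-stage model

def classify(comment, threshold=0.8):
    x = vectorizer.transform([comment])
    proba = base.predict_proba(x)[0]
    if proba.max() >= threshold:                        # confident: answer immediately
        return int(np.argmax(proba))
    x_meta = np.hstack([x.toarray(), proba.reshape(1, -1)])
    return int(meta.predict(x_meta)[0])                 # low confidence: escalate

print(classify("go away idiot"))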
A Large-Scale Japanese Dataset for Aspect-based Sentiment Analysis
Yuki Nakayama | Koji Murakami | Gautam Kumar | Sudha Bhingardive | Ikuko Hardaway
Proceedings of the Thirteenth Language Resources and Evaluation Conference
There has been significant progress in the field of sentiment analysis. However, aspect-based sentiment analysis (ABSA) has not been explored for the Japanese language, even though it has a huge scope in many natural language processing applications, such as 1) tracking sentiment towards products, movies, politicians, etc., and 2) improving customer relationship models. The main reason is that no standard Japanese dataset is available for the ABSA task. In this paper, we present the first standard Japanese dataset for the hotel review domain. The proposed dataset contains 53,192 review sentences with seven aspect categories and two polarity labels. We perform experiments on this dataset using popular ABSA approaches and report an error analysis. Our experiments show that contextual models such as BERT work very well for the ABSA task in the Japanese language, and our error analysis shows the need to focus on other NLP tasks for better performance.
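For readers unfamiliar with the setup, aspect-category sentiment classification with BERT is commonly cast as sentence-pair classification over a (review sentence, aspect category) pair. A minimal sketch along those lines follows; the checkpoint name, its Japanese-tokenization dependencies (fugashi/unidic), and the two-label mapping are assumptions for illustration, and the model shown is not fine-tuned for the task.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cl-tohoku/bert-base-japanese"   # assumed Japanese BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

review = "部屋はきれいでしたが、朝食は期待外れでした。"   # "The room was clean, but breakfast was disappointing."
aspect = "食事"                                            # aspect category: "meals"

inputs = tokenizer(review, aspect, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits            # fine-tuning on labeled (review, aspect) pairs is omitted
print("negative" if logits.argmax(-1).item() == 1 else "positive")   # arbitrary label mapping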
2020
ILP-based Opinion Sentence Extraction from User Reviews for Question DB Construction
Masakatsu Hamashita | Takashi Inui | Koji Murakami | Keiji Shinzato
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation
2016
Large-scale Multi-class and Hierarchical Product Categorization for an E-commerce Giant
Ali Cevahir | Koji Murakami
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
In order to organize the large number of products listed on e-commerce sites, each product is usually assigned to one of the multi-level categories in a taxonomy tree. It is a time-consuming and difficult task for merchants to select the proper category from thousands of options for the products they sell. In this work, we propose an automatic classification tool that predicts the matching category for a given product title and description. We used a combination of two different neural models, i.e., deep belief nets and deep autoencoders, for both titles and descriptions. We implemented a selective reconstruction approach for the input layer during the training of the deep neural networks in order to scale out to large, sparse feature vectors. GPUs are utilized to train the neural networks in a reasonable time. We trained our models on around 150 million products with a taxonomy tree of at most five levels containing 28,338 leaf categories. Tests with millions of products show that our first predictions match 81% of merchants’ assignments when “others” categories are excluded.
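One plausible reading of the “selective reconstruction” trick for large sparse inputs is to compute the reconstruction loss only over the dimensions that are active in the input. The PyTorch sketch below illustrates that reading with made-up layer sizes and a squared-error loss; it is a guess at the idea, not the paper's DBN/autoencoder configuration.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, vocab_size=10000, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = SparseAutoencoder()
x = torch.zeros(2, 10000)
x[0, [3, 42, 999]] = 1.0       # sparse bag-of-words features from a product title/description
x[1, [7, 888, 4321]] = 1.0

reconstruction = model(x)
mask = (x > 0).float()                                        # reconstruct only the active dimensions
loss = ((reconstruction - x) ** 2 * mask).sum() / mask.sum()
loss.backward()
print(float(loss))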
2011
Recognizing Confinement in Web Texts
Megumi Ohki | Eric Nichols | Suguru Matsuyoshi | Koji Murakami | Junta Mizuno | Shouko Masuda | Kentaro Inui | Yuji Matsumoto
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)
Safety Information Mining — What can NLP do in a disaster —
Graham Neubig | Yuichiroh Matsubayashi | Masato Hagiwara | Koji Murakami
Proceedings of 5th International Joint Conference on Natural Language Processing
2010
Automatic Classification of Semantic Relations between Facts and Opinions
Koji Murakami | Eric Nichols | Junta Mizuno | Yotaro Watanabe | Hayato Goto | Megumi Ohki | Suguru Matsuyoshi | Kentaro Inui | Yuji Matsumoto
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)
Annotating Event Mentions in Text with Modality, Focus, and Source Information
Suguru Matsuyoshi | Megumi Eguchi | Chitose Sao | Koji Murakami | Kentaro Inui | Yuji Matsumoto
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Many natural language processing tasks, including information extraction, question answering, and recognizing textual entailment, require analysis of the polarity, focus of polarity, tense, aspect, mood, and source of the event mentions in a text, in addition to predicate-argument structure analysis. We refer to modality, polarity, and other associated information as extended modality. In this paper, we propose a new annotation scheme for representing the extended modality of event mentions in a sentence. Our extended modality consists of the following seven components: Source, Time, Conditional, Primary modality type, Actuality, Evaluation, and Focus. We reviewed the literature on extended modality in linguistics and natural language processing (NLP) and defined appropriate labels for each component. In the proposed annotation scheme, the extended modality information of an event mention is summarized at its core predicate for immediate use in NLP applications. We also report on the current progress of our manual annotation of a Japanese corpus of about 50,000 event mentions, showing reasonably high inter-annotator agreement.
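A simple way to picture the scheme is as a record attached to the core predicate of each event mention, with one field per component. The field types and example values in the sketch below are illustrative guesses, not the scheme's actual label inventory.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtendedModality:
    predicate: str                  # core predicate the annotation is attached to
    source: str                     # who the event is attributed to
    time: str                       # temporal status of the event
    conditional: bool               # whether the event is stated under a condition
    primary_modality_type: str      # e.g. assertion, volition, wish
    actuality: str                  # whether the event actually happened
    evaluation: str                 # speaker's evaluation of the event
    focus: Optional[str] = None     # element under the focus of polarity, if any

annotation = ExtendedModality(
    predicate="行く",               # "go"
    source="writer",
    time="non-past",
    conditional=False,
    primary_modality_type="volition",
    actuality="not_yet_happened",
    evaluation="neutral",
    focus=None,
)
print(annotation)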
2009
Annotating Semantic Relations Combining Facts and Opinions
Koji Murakami | Shouko Masuda | Suguru Matsuyoshi | Eric Nichols | Kentaro Inui | Yuji Matsumoto
Proceedings of the Third Linguistic Annotation Workshop (LAW III)
2002
Evaluation of Direct Speech Translation Method Using Inductive Learning for Conversations in the Travel Domain
Koji Murakami | Makoto Hiroshige | Kenji Araki | Koji Tochinai
Proceedings of the ACL-02 Workshop on Speech-to-Speech Translation: Algorithms and Systems