2024
Search Query Refinement for Japanese Named Entity Recognition in E-commerce Domain
Yuki Nakayama | Ryutaro Tatsushima | Erick Mendieta | Koji Murakami | Keiji Shinzato
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)
In the E-Commerce domain, search query refinement reformulates malformed queries into canonicalized forms through preprocessing operations such as “term splitting” and “term merging”. Unfortunately, most relevant research is limited to English; in particular, search query refinement for Japanese has been severely understudied. Furthermore, no attempt has been made to apply refinement methods to data improvement for downstream NLP tasks in real-world scenarios. This paper presents a novel query refinement approach for the Japanese language. Experimental results show that our method achieves a significant improvement of 3.5 points over a BERT-CRF baseline. Further experiments measure the beneficial impact of query refinement on named entity recognition (NER) as the downstream task. Evaluations indicate that the proposed query refinement method contributes to better data quality, boosting performance on E-Commerce-specific NER tasks by 11.7 points compared to search query data preprocessed with MeCab, a widely adopted Japanese tokenizer.
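A minimal sketch of the refinement operations mentioned in the abstract above: token-level labels decide whether adjacent query terms should be merged into one canonical term (term splitting would analogously insert boundaries inside a token). The label names (KEEP, MERGE_NEXT), the whitespace tokenization, and the toy query are illustrative assumptions, not the paper's actual tagging scheme or model.

```python
# Apply per-token refinement labels to a malformed query.
# labels[i] == "MERGE_NEXT" joins tokens[i] with tokens[i + 1];
# labels[i] == "KEEP" leaves the token as-is.
def refine_query(tokens, labels):
    refined, buffer = [], ""
    for tok, lab in zip(tokens, labels):
        buffer += tok
        if lab != "MERGE_NEXT":      # flush the current (possibly merged) term
            refined.append(buffer)
            buffer = ""
    if buffer:
        refined.append(buffer)
    return refined

# Toy example: "i phone ケース" -> ["iphone", "ケース"] by merging "i" + "phone".
print(refine_query(["i", "phone", "ケース"], ["MERGE_NEXT", "KEEP", "KEEP"]))
```

In a learned setting, a sequence labeler such as the BERT-CRF baseline named above would predict these labels; the function here only applies them.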
2023
A Unified Generative Approach to Product Attribute-Value Identification
Keiji Shinzato | Naoki Yoshinaga | Yandi Xia | Wei-Te Chen
Findings of the Association for Computational Linguistics: ACL 2023
Product attribute-value identification (PAVI) has been studied to link products on e-commerce sites with their attribute values (e.g., ⟨Material, Cotton⟩) using product text as clues. Technical demands from real-world e-commerce platforms require PAVI methods to handle unseen values, multi-attribute values, and canonicalized values, which are only partly addressed by existing extraction- and classification-based approaches. Motivated by this, we explore a generative approach to the PAVI task. We fine-tune a pre-trained generative model, T5, to decode a set of attribute-value pairs as a target sequence from the given product text. Since the attribute-value pairs are unordered set elements, how to linearize them matters; we thus explore methods of composing an attribute-value pair and ordering the pairs for the task. Experimental results confirm that our generation-based approach outperforms the existing extraction- and classification-based methods on large-scale real-world datasets meant for those methods.
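A minimal sketch of the generative formulation described above: linearize the set of attribute-value pairs into a single target string and fine-tune T5 to decode it from the product text. The separator tokens, the alphabetical ordering of pairs, the checkpoint name, and the toy example are illustrative assumptions; the paper explores several composition and ordering strategies.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def linearize(pairs):
    # e.g. [("Material", "Cotton"), ("Color", "White")]
    #  -> "Color : White ; Material : Cotton"   (alphabetical order assumed)
    return " ; ".join(f"{a} : {v}" for a, v in sorted(pairs))

product_text = "White 100% cotton crew-neck t-shirt"
target = linearize([("Material", "Cotton"), ("Color", "White")])

inputs = tokenizer(product_text, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss   # standard seq2seq fine-tuning loss
loss.backward()
```

At inference time, the decoded sequence would be split back on the assumed separators to recover the predicted attribute-value pairs.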
Does Named Entity Recognition Truly Not Scale Up to Real-world Product Attribute Extraction?
Wei-Te Chen | Keiji Shinzato | Naoki Yoshinaga | Yandi Xia
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
The key challenge in the attribute-value extraction (AVE) task from e-commerce sites is scalability to the diverse attributes of the large number of products on real-world e-commerce sites. To make AVE scalable to diverse attributes, recent researchers have adopted a question-answering (QA)-based approach that additionally takes the target attribute as a query to extract its values, and confirmed its advantage over a classical approach based on named-entity recognition (NER) on real-world e-commerce datasets. In this study, we reexamine the scalability of the NER-based approach compared to the QA-based approach, since prior work has compared BERT-based QA models only against a weak BiLSTM-based NER baseline trained from scratch, in terms of accuracy alone, on datasets designed to evaluate the QA-based approach. Experimental results using a publicly available real-world dataset reveal that, under a fair setting, BERT-based NER models rival BERT-based QA models in accuracy, and their inference is faster than that of the QA model, which must process the same product text several times to handle multiple target attributes.
2022
Extreme Multi-Label Classification with Label Masking for Product Attribute Value Extraction
Wei-Te Chen | Yandi Xia | Keiji Shinzato
Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)
Although most studies have treated attribute value extraction (AVE) as named entity recognition, these approaches are not practical on real-world e-commerce platforms because they perform poorly and require canonicalization of extracted values. Furthermore, since the values needed for actual services are static for many attributes, extraction of new values is not always necessary. Given the above, we formalize AVE as extreme multi-label classification (XMC). A major problem in solving AVE as XMC is that the distribution between positive and negative labels for products is heavily imbalanced. To mitigate the negative impact of such a biased distribution, we propose label masking, a simple and effective method to reduce the number of negative labels in training. We exploit the attribute taxonomy designed for e-commerce platforms to determine which labels are negative for products. Experimental results using a dataset collected from a Japanese e-commerce platform demonstrate that label masking improves micro and macro F1 scores by 3.38 and 23.20 points, respectively.
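A minimal sketch of label masking as described above: labels outside the product category's taxonomy are excluded from the binary cross-entropy loss, so the classifier is not trained on the huge set of irrelevant negatives. The tiny label vocabulary, the taxonomy entries, and the loss formulation are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn.functional as F

label_vocab = ["Material=Cotton", "Material=Leather", "Size=S", "Voltage=100V"]
taxonomy = {"T-Shirts": {"Material=Cotton", "Material=Leather", "Size=S"}}

def masked_bce(logits, gold_labels, category):
    """BCE computed only over labels allowed for the product's category."""
    allowed = taxonomy[category]
    mask = torch.tensor([float(l in allowed) for l in label_vocab])
    target = torch.tensor([float(l in gold_labels) for l in label_vocab])
    loss = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (loss * mask).sum() / mask.sum()

logits = torch.zeros(len(label_vocab), requires_grad=True)
# "Voltage=100V" is never penalized for a T-shirt, since the taxonomy excludes it.
print(masked_bce(logits, {"Material=Cotton"}, "T-Shirts"))
```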
Cross-Encoder Data Annotation for Bi-Encoder Based Product Matching
Justin Chiu | Keiji Shinzato
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track
Matching a seller-listed item to an appropriate product is an important step for an e-commerce platform. With recent advances in deep learning, various encoder-based approaches have been proposed as solutions. When textual data for two products are available, cross-encoder approaches encode them jointly, while bi-encoder approaches encode them separately. Since cross-encoders are computationally heavy, approaches based on bi-encoders are common practice for this task. In this paper, we propose cross-encoder data annotation, a technique that uses a cross-encoder model to annotate, or to refine human-annotated, training data for bi-encoder models. This technique enables us to build a robust model without annotating newly collected training data, or to further improve model performance when annotated training data are available. We evaluate cross-encoder data annotation on the product matching task using a real-world e-commerce dataset containing 104 million products. Experimental results show that cross-encoder data annotation improves accuracy by 4% absolute when no annotation for the training data is available, and by 2% absolute when annotation is available.
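A minimal sketch of the annotation-then-training flow described above: an existing cross-encoder scores (item, product) pairs, the scores are turned into labels, and those labels supervise a bi-encoder. The model checkpoints, the 0.5 threshold, and the cosine-similarity training objective are illustrative assumptions, not the paper's setup.

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

cross = CrossEncoder("cross-encoder/stsb-roberta-base")
pairs = [("apple iphone 13 128gb case", "iPhone 13 Silicone Case"),
         ("apple iphone 13 128gb case", "Samsung Galaxy S21 Cover")]

scores = cross.predict(pairs)                        # one relevance score per pair
train = [InputExample(texts=list(p), label=float(s > 0.5))
         for p, s in zip(pairs, scores)]             # pseudo-labels for the bi-encoder

bi = SentenceTransformer("all-MiniLM-L6-v2")
loader = DataLoader(train, batch_size=2, shuffle=True)
bi.fit(train_objectives=[(loader, losses.CosineSimilarityLoss(bi))], epochs=1)
```

The same recipe can instead refine existing human labels by replacing or re-weighting them with the cross-encoder scores before bi-encoder training.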
Simple and Effective Knowledge-Driven Query Expansion for QA-Based Product Attribute Extraction
Keiji Shinzato | Naoki Yoshinaga | Yandi Xia | Wei-Te Chen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
A key challenge in attribute value extraction (AVE) from e-commerce sites is how to handle a large number of attributes for diverse products. Although this challenge is partially addressed by a question answering (QA) approach that finds a value in product data for a given query (attribute), it does not work effectively for rare and ambiguous queries. We thus propose a simple knowledge-driven query expansion based on the possible answers (values) of a query (attribute) for QA-based AVE. We retrieve values of a query (attribute) from the training data to expand the query. We train a model with two tricks, knowledge dropout and knowledge token mixing, which mimic the imperfection of the value knowledge at test time. Experimental results on our cleaned version of the AliExpress dataset show that our method improves AVE performance (+6.08 macro F1), especially for rare and ambiguous attributes (+7.82 and +6.86 macro F1, respectively).
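A minimal sketch of the expansion and the two training tricks named above: the attribute query is expanded with values seen for that attribute in the training data, values are randomly dropped (knowledge dropout), and values of other attributes are occasionally mixed in (knowledge token mixing) to simulate imperfect value knowledge at test time. The probabilities, the "[SEP]"-style formatting, and the tiny value dictionary are illustrative assumptions.

```python
import random

value_knowledge = {
    "material": ["cotton", "polyester", "leather"],
    "color": ["white", "black", "red"],
}

def expand_query(attribute, p_dropout=0.3, p_mix=0.1, training=True):
    values = list(value_knowledge[attribute])
    if training:
        # knowledge dropout: randomly hide known values
        values = [v for v in values if random.random() > p_dropout]
        # knowledge token mixing: occasionally inject values of other attributes
        for other, vs in value_knowledge.items():
            if other != attribute and random.random() < p_mix:
                values.append(random.choice(vs))
    return f"{attribute} [SEP] {' '.join(values)}"

print(expand_query("material"))   # e.g. "material [SEP] cotton leather"
```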
2020
ILP-based Opinion Sentence Extraction from User Reviews for Question DB Construction
Masakatsu Hamashita | Takashi Inui | Koji Murakami | Keiji Shinzato
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation
2017
Large-Scale Categorization of Japanese Product Titles Using Neural Attention Models
Yandi Xia | Aaron Levine | Pradipto Das | Giuseppe Di Fabbrizio | Keiji Shinzato | Ankur Datta
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers
We propose a variant of Convolutional Neural Network (CNN) models, the Attention CNN (ACNN), for large-scale categorization of millions of Japanese items into thirty-five product categories. Compared to a state-of-the-art Gradient Boosted Tree (GBT) classifier, the proposed model reduces training time from three weeks to three days while maintaining more than 96% accuracy. Additionally, our proposed model characterizes products by placing attentive focus on word tokens in a language-agnostic way. The attention words have been observed to be highly semantically correlated with the predicted categories, and offer a means of automatic feature extraction for downstream processing.
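A minimal sketch in the spirit of the attention CNN described above: a convolution over token embeddings, attention weights over token positions, and a softmax classifier over thirty-five categories, with the attention weights exposed so the most influential word tokens can be inspected. All hyperparameters and the architecture details here are illustrative assumptions, not the paper's ACNN.

```python
import torch
import torch.nn as nn

class AttentionCNN(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=128, channels=64, n_classes=35):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, channels, kernel_size=3, padding=1)
        self.attn = nn.Linear(channels, 1)       # scores each token position
        self.out = nn.Linear(channels, n_classes)

    def forward(self, token_ids):                 # (batch, seq_len)
        h = self.conv(self.emb(token_ids).transpose(1, 2)).relu()  # (batch, ch, seq)
        h = h.transpose(1, 2)                     # (batch, seq, ch)
        weights = self.attn(h).softmax(dim=1)     # attention over token positions
        pooled = (weights * h).sum(dim=1)         # attention-weighted pooling
        return self.out(pooled), weights.squeeze(-1)

logits, attention = AttentionCNN()(torch.randint(0, 30000, (2, 20)))
print(logits.shape, attention.shape)              # (2, 35) and (2, 20)
```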
2013
Precise Information Retrieval Exploiting Predicate-Argument Structures
Daisuke Kawahara | Keiji Shinzato | Tomohide Shibata | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing
Unsupervised Extraction of Attributes and Their Values from Product Description
Keiji Shinzato | Satoshi Sekine
Proceedings of the Sixth International Joint Conference on Natural Language Processing
2010
Exploiting Term Importance Categories and Dependency Relations for Natural Language Search
Keiji Shinzato | Sadao Kurohashi
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)
2008
A Large-Scale Web Data Collection as a Natural Language Processing Infrastructure
Keiji Shinzato | Daisuke Kawahara | Chikara Hashimoto | Sadao Kurohashi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In recent years, language resources acquired from the Web have been released, and these data have improved the performance of applications in several NLP tasks. Although language resources based on the web-page unit are useful for NLP tasks and applications such as knowledge acquisition, document retrieval, and document summarization, such resources have not been released so far. In this paper, we propose a data format for the results of web page processing, and a search engine infrastructure that makes it possible to share approximately 100 million Japanese web pages. By obtaining these web data, NLP researchers can begin their own processing immediately without analyzing web pages themselves.
TSUBAKI: An Open Search Engine Infrastructure for Developing New Information Access Methodology
Keiji Shinzato | Tomohide Shibata | Daisuke Kawahara | Chikara Hashimoto | Sadao Kurohashi
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I
2004
Acquiring Hyponymy Relations from Web Documents
Keiji Shinzato | Kentaro Torisawa
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004
Extracting Hyponyms of Prespecified Hypernyms from Itemizations and Headings in Web Documents
Keiji Shinzato | Kentaro Torisawa
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics