Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

Young-bum Kim, Yunyao Li, Owen Rambow (Editors)


Anthology ID:
2021.naacl-industry
Month:
June
Year:
2021
Address:
Online
Venue:
NAACL
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2021.naacl-industry
DOI:
Bib Export formats:
BibTeX
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2021.naacl-industry.pdf

pdf bib
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers
Young-bum Kim | Yunyao Li | Owen Rambow

pdf bib
When does text prediction benefit from additional context? An exploration of contextual signals for chat and email messages
Stojan Trajanovski | Chad Atalla | Kunho Kim | Vipul Agarwal | Milad Shokouhi | Chris Quirk

Email and chat communication tools are increasingly important for completing daily tasks. Accurate real-time phrase completion can save time and bolster productivity. Modern text prediction algorithms are based on large language models which typically rely on the prior words in a message to predict a completion. We examine how additional contextual signals (from previous messages, time, and subject) affect the performance of a commercial text prediction model. We compare contextual text prediction in chat and email messages from two of the largest commercial platforms Microsoft Teams and Outlook, finding that contextual signals contribute to performance differently between these scenarios. On emails, time context is most beneficial with small relative gains of 2% over baseline. Whereas, in chat scenarios, using a tailored set of previous messages as context yields relative improvements over the baseline between 9.3% and 18.6% across various critical service-oriented text prediction metrics.

pdf bib
Identifying and Resolving Annotation Changes for Natural Language Understanding
Jose Garrido Ramas | Giorgio Pessot | Abdalghani Abujabal | Martin Rajman

Annotation conflict resolution is crucial towards building machine learning models with acceptable performance. Past work on annotation conflict resolution had assumed that data is collected at once, with a fixed set of annotators and fixed annotation guidelines. Moreover, previous work dealt with atomic labeling tasks. In this paper, we address annotation conflict resolution for Natural Language Understanding (NLU), a structured prediction task, in a real-world setting of commercial voice-controlled personal assistants, where (1) regular data collections are needed to support new and existing functionalities, (2) annotation guidelines evolve over time, and (3) the pool of annotators change across data collections. We devise an approach combining information-theoretic measures and a supervised neural model to resolve conflicts in data annotation. We evaluate our approach both intrinsically and extrinsically on a real-world dataset with 3.5M utterances of a commercial dialog system in German. Our approach leads to dramatic improvements over a majority baseline especially in contentious cases. On the NLU task, our approach achieves 2.75% error reduction over a no-resolution baseline.

pdf
Optimizing NLU Reranking Using Entity Resolution Signals in Multi-domain Dialog Systems
Tong Wang | Jiangning Chen | Mohsen Malmir | Shuyan Dong | Xin He | Han Wang | Chengwei Su | Yue Liu | Yang Liu

In dialog systems, the Natural Language Understanding (NLU) component typically makes the interpretation decision (including domain, intent and slots) for an utterance before the mentioned entities are resolved. This may result in intent classification and slot tagging errors. In this work, we propose to leverage Entity Resolution (ER) features in NLU reranking and introduce a novel loss term based on ER signals to better learn model weights in the reranking framework. In addition, for a multi-domain dialog scenario, we propose a score distribution matching method to ensure scores generated by the NLU reranking models for different domains are properly calibrated. In offline experiments, we demonstrate our proposed approach significantly outperforms the baseline model on both single-domain and cross-domain evaluations.

pdf
Entity Resolution in Open-domain Conversations
Mingyue Shang | Tong Wang | Mihail Eric | Jiangning Chen | Jiyang Wang | Matthew Welch | Tiantong Deng | Akshay Grewal | Han Wang | Yue Liu | Yang Liu | Dilek Hakkani-Tur

In recent years, incorporating external knowledge for response generation in open-domain conversation systems has attracted great interest. To improve the relevancy of retrieved knowledge, we propose a neural entity linking (NEL) approach. Different from formal documents, such as news, conversational utterances are informal and multi-turn, which makes it more challenging to disambiguate the entities. Therefore, we present a context-aware named entity recognition model (NER) and entity resolution (ER) model to utilize dialogue context information. We conduct NEL experiments on three open-domain conversation datasets and validate that incorporating context information improves the performance of NER and ER models. The end-to-end NEL approach outperforms the baseline by 62.8% relatively in F1 metric. Furthermore, we verify that using external knowledge based on NEL benefits the neural response generation model.

pdf
Pretrain-Finetune Based Training of Task-Oriented Dialogue Systems in a Real-World Setting
Manisha Srivastava | Yichao Lu | Riley Peschon | Chenyang Li

One main challenge in building task-oriented dialogue systems is the limited amount of supervised training data available. In this work, we present a method for training retrieval-based dialogue systems using a small amount of high-quality, annotated data and a larger, unlabeled dataset. We show that pretraining using unlabeled data can bring better model performance with a 31% boost in Recall@1 compared with no pretraining. The proposed finetuning technique based on a small amount of high-quality, annotated data resulted in 26% offline and 33% online performance improvement in Recall@1 over the pretrained model. The model is deployed in an agent-support application and evaluated on live customer service contacts, providing additional insights into the real-world implications compared with most other publications in the domain often using asynchronous transcripts (e.g. Reddit data). The high performance of 74% Recall@1 shown in the customer service example demonstrates the effectiveness of this pretrain-finetune approach in dealing with the limited supervised data challenge.

pdf
Contextual Domain Classification with Temporal Representations
Tzu-Hsiang Lin | Yipeng Shi | Chentao Ye | Yang Fan | Weitong Ruan | Emre Barut | Wael Hamza | Chengwei Su

In commercial dialogue systems, the Spoken Language Understanding (SLU) component tends to have numerous domains thus context is needed to help resolve ambiguities. Previous works that incorporate context for SLU have mostly focused on domains where context is limited to a few minutes. However, there are domains that have related context that could span up to hours and days. In this paper, we propose temporal representations that combine wall-clock second difference and turn order offset information to utilize both recent and distant context in a novel large-scale setup. Experiments on the Contextual Domain Classification (CDC) task with various encoder architectures show that temporal representations combining both information outperforms only one of the two. We further demonstrate that our contextual Transformer is able to reduce 13.04% of classification errors compared to a non-contextual baseline. We also conduct empirical analyses to study recent versus distant context and opportunities to lower deployment costs.

pdf
Bootstrapping a Music Voice Assistant with Weak Supervision
Sergio Oramas | Massimo Quadrana | Fabien Gouyon

One of the first building blocks to create a voice assistant relates to the task of tagging entities or attributes in user queries. This can be particularly challenging when entities are in the tenth of millions, as is the case of e.g. music catalogs. Training slot tagging models at an industrial scale requires large quantities of accurately labeled user queries, which are often hard and costly to gather. On the other hand, voice assistants typically collect plenty of unlabeled queries that often remain unexploited. This paper presents a weakly-supervised methodology to label large amounts of voice query logs, enhanced with a manual filtering step. Our experimental evaluations show that slot tagging models trained on weakly-supervised data outperform models trained on hand-annotated or synthetic data, at a lower cost. Further, manual filtering of weakly-supervised data leads to a very significant reduction in Sentence Error Rate, while allowing us to drastically reduce human curation efforts from weeks to hours, with respect to hand-annotation of queries. The method is applied to successfully bootstrap a slot tagging system for a major music streaming service that currently serves several tens of thousands of daily voice queries.

pdf
Continuous Model Improvement for Language Understanding with Machine Translation
Abdalghani Abujabal | Claudio Delli Bovi | Sungho Ryu | Turan Gojayev | Fabian Triefenbach | Yannick Versley

Scaling conversational personal assistants to a multitude of languages puts high demands on collecting and labelling data, a setting in which cross-lingual learning techniques can help to reconcile the need for well-performing Natural Language Understanding (NLU) with a desideratum to support many languages without incurring unacceptable cost. In this work, we show that automatically annotating unlabeled utterances using Machine Translation in an offline fashion and adding them to the training data can improve performance for existing NLU features for low-resource languages, where a straightforward translate-test approach as considered in existing literature would fail the latency requirements of a live environment. We demonstrate the effectiveness of our method with intrinsic and extrinsic evaluation using a real-world commercial dialog system in German. Beyond an intrinsic evaluation, where 56% of the resulting automatically labeled utterances had a perfect match with ground-truth labels, we see significant performance improvements in an extrinsic evaluation settings when manual labeled data is available in small quantities.

pdf
A Hybrid Approach to Scalable and Robust Spoken Language Understanding in Enterprise Virtual Agents
Ryan Price | Mahnoosh Mehrabani | Narendra Gupta | Yeon-Jun Kim | Shahab Jalalvand | Minhua Chen | Yanjie Zhao | Srinivas Bangalore

Spoken language understanding (SLU) extracts the intended mean- ing from a user utterance and is a critical component of conversational virtual agents. In enterprise virtual agents (EVAs), language understanding is substantially challenging. First, the users are infrequent callers who are unfamiliar with the expectations of a pre-designed conversation flow. Second, the users are paying customers of an enterprise who demand a reliable, consistent and efficient user experience when resolving their issues. In this work, we describe a general and robust framework for intent and entity extraction utilizing a hybrid of statistical and rule-based approaches. Our framework includes confidence modeling that incorporates information from all components in the SLU pipeline, a critical addition for EVAs to en- sure accuracy. Our focus is on creating accurate and scalable SLU that can be deployed rapidly for a large class of EVA applications with little need for human intervention.

pdf
Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems
Shubhi Tyagi | Antonio Bonafonte | Jaime Lorenzo-Trueba | Javier Latorre

Developing Text Normalization (TN) systems for Text-to-Speech (TTS) on new languages is hard. We propose a novel architecture to facilitate it for multiple languages while using data less than 3% of the size of the data used by the state of the art results on English. We treat TN as a sequence classification problem and propose a granular tokenization mechanism that enables the system to learn majority of the classes and their normalizations from the training data itself. This is further combined with minimal precoded linguistic knowledge for other classes. We publish the first results on TN for TTS in Spanish and Tamil and also demonstrate that the performance of the approach is comparable with the previous work done on English. All annotated datasets used for experimentation will be released.

pdf
Addressing the Vulnerability of NMT in Input Perturbations
Weiwen Xu | Ai Ti Aw | Yang Ding | Kui Wu | Shafiq Joty

Neural Machine Translation (NMT) has achieved significant breakthrough in performance but is known to suffer vulnerability to input perturbations. As real input noise is difficult to predict during training, robustness is a big issue for system deployment. In this paper, we improve the robustness of NMT models by reducing the effect of noisy words through a Context-Enhanced Reconstruction (CER) approach. CER trains the model to resist noise in two steps: (1) perturbation step that breaks the naturalness of input sequence with made-up words; (2) reconstruction step that defends the noise propagation by generating better and more robust contextual representation. Experimental results on Chinese-English (ZH-EN) and French-English (FR-EN) translation tasks demonstrate robustness improvement on both news and social media text. Further fine-tuning experiments on social media text show our approach can converge at a higher position and provide a better adaptation.

pdf
Cross-lingual Supervision Improves Unsupervised Neural Machine Translation
Mingxuan Wang | Hongxiao Bai | Hai Zhao | Lei Li

We propose to improve unsupervised neural machine translation with cross-lingual supervision (), which utilizes supervision signals from high resource language pairs to improve the translation of zero-source languages. Specifically, for training En-Ro system without parallel corpus, we can leverage the corpus from En-Fr and En-De to collectively train the translation from one language into many languages under one model. % is based on multilingual models which require no changes to the standard unsupervised NMT. Simple and effective, significantly improves the translation quality with a big margin in the benchmark unsupervised translation tasks, and even achieves comparable performance to supervised NMT. In particular, on WMT’14 -tasks achieves 37.6 and 35.18 BLEU score, which is very close to the large scale supervised setting and on WMT’16 -tasks achieves 35.09 BLEU score which is even better than the supervised Transformer baseline.

pdf
Should we find another model?: Improving Neural Machine Translation Performance with ONE-Piece Tokenization Method without Model Modification
Chanjun Park | Sugyeong Eo | Hyeonseok Moon | Heuiseok Lim

Most of the recent Natural Language Processing(NLP) studies are based on the Pretrain-Finetuning Approach (PFA), but in small and medium-sized enterprises or companies with insufficient hardware there are many limitations to servicing NLP application software using such technology due to slow speed and insufficient memory. The latest PFA technologies require large amounts of data, especially for low-resource languages, making them much more difficult to work with. We propose a new tokenization method, ONE-Piece, to address this limitation that combines the morphology-considered subword tokenization method and the vocabulary method used after probing for an existing method that has not been carefully considered before. Our proposed method can also be used without modifying the model structure. We experiment by applying ONE-Piece to Korean, a morphologically-rich and low-resource language. We derive an optimal subword tokenization result for Korean-English machine translation by conducting a case study that combines the subword tokenization method, morphological segmentation, and vocabulary method. Through comparative experiments with all the tokenization methods currently used in NLP research, ONE-Piece achieves performance comparable to the current Korean-English machine translation state-of-the-art model.

pdf
Autocorrect in the Process of Translation — Multi-task Learning Improves Dialogue Machine Translation
Tao Wang | Chengqi Zhao | Mingxuan Wang | Lei Li | Deyi Xiong

Automatic translation of dialogue texts is a much needed demand in many real life scenarios. However, the currently existing neural machine translation delivers unsatisfying results. In this paper, we conduct a deep analysis of a dialogue corpus and summarize three major issues on dialogue translation, including pronoun dropping (), punctuation dropping (), and typos (). In response to these challenges, we propose a joint learning method to identify omission and typo, and utilize context to translate dialogue utterances. To properly evaluate the performance, we propose a manually annotated dataset with 1,931 Chinese-English parallel utterances from 300 dialogues as a benchmark testbed for dialogue translation. Our experiments show that the proposed method improves translation quality by 3.2 BLEU over the baselines. It also elevates the recovery rate of omitted pronouns from 26.09% to 47.16%. We will publish the code and dataset publicly at https://xxx.xx.

pdf
LightSeq: A High Performance Inference Library for Transformers
Xiaohui Wang | Ying Xiong | Yang Wei | Mingxuan Wang | Lei Li

Transformer and its variants have achieved great success in natural language processing. Since Transformer models are huge in size, serving these models is a challenge for real industrial applications. In this paper, we propose , a highly efficient inference library for models in the Transformer family. includes a series of GPU optimization techniques to both streamline the computation of Transformer layers and reduce memory footprint. supports models trained using PyTorch and Tensorflow. Experimental results on standard machine translation benchmarks show that achieves up to 14x speedup compared with TensorFlow and 1.4x speedup compared with , a concurrent CUDA implementation. The code will be released publicly after the review.

pdf
Practical Transformer-based Multilingual Text Classification
Cindy Wang | Michele Banko

Transformer-based methods are appealing for multilingual text classification, but common research benchmarks like XNLI (Conneau et al., 2018) do not reflect the data availability and task variety of industry applications. We present an empirical comparison of transformer-based text classification models in a variety of practical monolingual and multilingual pretraining and fine-tuning settings. We evaluate these methods on two distinct tasks in five different languages. Departing from prior work, our results show that multilingual language models can outperform monolingual ones in some downstream tasks and target languages. We additionally show that practical modifications such as task- and domain-adaptive pretraining and data augmentation can improve classification performance without the need for additional labeled data.

pdf
An Emotional Comfort Framework for Improving User Satisfaction in E-Commerce Customer Service Chatbots
Shuangyong Song | Chao Wang | Haiqing Chen | Huan Chen

E-commerce has grown substantially over the last several years, and chatbots for intelligent customer service are concurrently drawing attention. We presented AliMe Assist, a Chinese intelligent assistant designed for creating an innovative online shopping experience in E-commerce. Based on question answering (QA), AliMe Assist offers assistance service, customer service, and chatting service. According to the survey of user studies and the real online testing, emotional comfort of customers’ negative emotions, which make up more than 5% of whole number of customer visits on AliMe, is a key point for providing considerate service. In this paper, we propose a framework to obtain proper answer to customers’ emotional questions. The framework takes emotion classification model as a core, and final answer selection is based on topic classification and text matching. Our experiments on real online systems show that the framework is very promising.

pdf
Language Scaling for Universal Suggested Replies Model
Qianlan Ying | Payal Bajaj | Budhaditya Deb | Yu Yang | Wei Wang | Bojia Lin | Milad Shokouhi | Xia Song | Yang Yang | Daxin Jiang

We consider the problem of scaling automated suggested replies for a commercial email application to multiple languages. Faced with increased compute requirements and low language resources for language expansion, we build a single universal model for improving the quality and reducing run-time costs of our production system. However, restricted data movement across regional centers prevents joint training across languages. To this end, we propose a multi-lingual multi-task continual learning framework, with auxiliary tasks and language adapters to train universal language representation across regions. The experimental results show positive cross-lingual transfer across languages while reducing catastrophic forgetting across regions. Our online results on real user traffic show significant CTR and Char-saved gain as well as 65% training cost reduction compared with per-language models. As a consequence, we have scaled the feature in multiple languages including low-resource markets.

pdf
Graph-based Multilingual Product Retrieval in E-Commerce Search
Hanqing Lu | Youna Hu | Tong Zhao | Tony Wu | Yiwei Song | Bing Yin

Nowadays, with many e-commerce platforms conducting global business, e-commerce search systems are required to handle product retrieval under multilingual scenarios. Moreover, comparing with maintaining per-country specific e-commerce search systems, having an universal system across countries can further reduce the operational and computational costs, and facilitate business expansion to new countries. In this paper, we introduce an universal end-to-end multilingual retrieval system, and discuss our learnings and technical details when training and deploying the system to serve billion-scale product retrieval for e-commerce search. In particular, we propose a multilingual graph attention based retrieval network by leveraging recent advances in transformer-based multilingual language models and graph neural network architectures to capture the interactions between search queries and items in e-commerce search. Offline experiments on five countries data show that our algorithm outperforms the state-of-the-art baselines by 35% recall and 25% mAP on average. Moreover, the proposed model shows significant increase of conversion/revenue in online A/B experiments and has been deployed in production for multiple countries.

pdf
Query2Prod2Vec: Grounded Word Embeddings for eCommerce
Federico Bianchi | Jacopo Tagliabue | Bingqing Yu

We present Query2Prod2Vec, a model that grounds lexical representations for product search in product embeddings: in our model, meaning is a mapping between words and a latent space of products in a digital shop. We leverage shopping sessions to learn the underlying space and use merchandising annotations to build lexical analogies for evaluation: our experiments show that our model is more accurate than known techniques from the NLP and IR literature. Finally, we stress the importance of data efficiency for product search outside of retail giants, and highlight how Query2Prod2Vec fits with practical constraints faced by most practitioners.

pdf
An Architecture for Accelerated Large-Scale Inference of Transformer-Based Language Models
Amir Ganiev | Colton Chapin | Anderson De Andrade | Chen Liu

This work demonstrates the development process of a machine learning architecture for inference that can scale to a large volume of requests. We used a BERT model that was fine-tuned for emotion analysis, returning a probability distribution of emotions given a paragraph. The model was deployed as a gRPC service on Kubernetes. Apache Spark was used to perform inference in batches by calling the service. We encountered some performance and concurrency challenges and created solutions to achieve faster running time. Starting with 200 successful inference requests per minute, we were able to achieve as high as 18 thousand successful requests per minute with the same batch job resource allocation. As a result, we successfully stored emotion probabilities for 95 million paragraphs within 96 hours.

pdf
When and Why a Model Fails? A Human-in-the-loop Error Detection Framework for Sentiment Analysis
Zhe Liu | Yufan Guo | Jalal Mahmud

Although deep neural networks have been widely employed and proven effective in sentiment analysis tasks, it remains challenging for model developers to assess their models for erroneous predictions that might exist prior to deployment. Once deployed, emergent errors can be hard to identify in prediction run-time and impossible to trace back to their sources. To address such gaps, in this paper we propose an error detection framework for sentiment analysis based on explainable features. We perform global-level feature validation with human-in-the-loop assessment, followed by an integration of global and local-level feature contribution analysis. Experimental results show that, given limited human-in-the-loop intervention, our method is able to identify erroneous model predictions on unseen data with high precision.

pdf
Technical Question Answering across Tasks and Domains
Wenhao Yu | Lingfei Wu | Yu Deng | Qingkai Zeng | Ruchi Mahindru | Sinem Guven | Meng Jiang

Building automatic technical support system is an important yet challenge task. Conceptually, to answer a user question on a technical forum, a human expert has to first retrieve relevant documents, and then read them carefully to identify the answer snippet. Despite huge success the researchers have achieved in coping with general domain question answering (QA), much less attentions have been paid for investigating technical QA. Specifically, existing methods suffer from several unique challenges (i) the question and answer rarely overlaps substantially and (ii) very limited data size. In this paper, we propose a novel framework of deep transfer learning to effectively address technical QA across tasks and domains. To this end, we present an adjustable joint learning approach for document retrieval and reading comprehension tasks. Our experiments on the TechQA demonstrates superior performance compared with state-of-the-art methods.

pdf
Cost-effective Deployment of BERT Models in Serverless Environment
Marek Suppa | Katarína Benešová | Andrej Švec

In this study, we demonstrate the viability of deploying BERT-style models to AWS Lambda in a production environment. Since the freely available pre-trained models are too large to be deployed in this environment, we utilize knowledge distillation and fine-tune the models on proprietary datasets for two real-world tasks: sentiment analysis and semantic textual similarity. As a result, we obtain models that are tuned for a specific domain and deployable in the serverless environment. The subsequent performance analysis shows that this solution does not only report latency levels acceptable for production use but that it is also a cost-effective alternative to small-to-medium size deployments of BERT models, all without any infrastructure overhead.

pdf
Noise Robust Named Entity Understanding for Voice Assistants
Deepak Muralidharan | Joel Ruben Antony Moniz | Sida Gao | Xiao Yang | Justine Kao | Stephen Pulman | Atish Kothari | Ray Shen | Yinying Pan | Vivek Kaul | Mubarak Seyed Ibrahim | Gang Xiang | Nan Dun | Yidan Zhou | Andy O | Yuan Zhang | Pooja Chitkara | Xuan Wang | Alkesh Patel | Kushal Tayal | Roger Zheng | Peter Grasch | Jason D Williams | Lin Li

Named Entity Recognition (NER) and Entity Linking (EL) play an essential role in voice assistant interaction, but are challenging due to the special difficulties associated with spoken user queries. In this paper, we propose a novel architecture that jointly solves the NER and EL tasks by combining them in a joint reranking module. We show that our proposed framework improves NER accuracy by up to 3.13% and EL accuracy by up to 3.6% in F1 score. The features used also lead to better accuracies in other natural language understanding tasks, such as domain classification and semantic parsing.

pdf
Goodwill Hunting: Analyzing and Repurposing Off-the-Shelf Named Entity Linking Systems
Karan Goel | Laurel Orr | Nazneen Fatema Rajani | Jesse Vig | Christopher Ré

Named entity linking (NEL) or mapping “strings” to “things” in a knowledge base is a fundamental preprocessing step in systems that require knowledge of entities such as information extraction and question answering. In this work, we lay out and investigate two challenges faced by individuals or organizations building NEL systems. Can they directly use an off-the-shelf system? If not, how easily can such a system be repurposed for their use case? First, we conduct a study of off-the-shelf commercial and academic NEL systems. We find that most systems struggle to link rare entities, with commercial solutions lagging their academic counterparts by 10%+. Second, for a use case where the NEL model is used in a sports question-answering (QA) system, we investigate how to close the loop in our analysis by repurposing the best off-the-shelf model (Bootleg) to correct sport-related errors. We show how tailoring a simple technique for patching models using weak labeling can provide a 25% absolute improvement in accuracy of sport-related errors.

pdf
Intent Features for Rich Natural Language Understanding
Brian Lester | Sagnik Ray Choudhury | Rashmi Prasad | Srinivas Bangalore

Complex natural language understanding modules in dialog systems have a richer understanding of user utterances, and thus are critical in providing a better user experience. However, these models are often created from scratch, for specific clients and use cases and require the annotation of large datasets. This encourages the sharing of annotated data across multiple clients. To facilitate this we introduce the idea of intent features: domain and topic agnostic properties of intents that can be learnt from the syntactic cues only, and hence can be shared. We introduce a new neural network architecture, the Global-Local model, that shows significant improvement over strong baselines for identifying these features in a deployed, multi-intent natural language understanding module, and more generally in a classification setting where a part of an utterance has to be classified utilizing the whole context.

pdf
Development of an Enterprise-Grade Contract Understanding System
Arvind Agarwal | Laura Chiticariu | Poornima Chozhiyath Raman | Marina Danilevsky | Diman Ghazi | Ankush Gupta | Shanmukha Guttula | Yannis Katsis | Rajasekar Krishnamurthy | Yunyao Li | Shubham Mudgal | Vitobha Munigala | Nicholas Phan | Dhaval Sonawane | Sneha Srinivasan | Sudarshan R. Thitte | Mitesh Vasa | Ramiya Venkatachalam | Vinitha Yaski | Huaiyu Zhu

Contracts are arguably the most important type of business documents. Despite their significance in business, legal contract review largely remains an arduous, expensive and manual process. In this paper, we describe TECUS: a commercial system designed and deployed for contract understanding and used by a wide range of enterprise users for the past few years. We reflect on the challenges and design decisions when building TECUS. We also summarize the data science life cycle of TECUS and share lessons learned.

pdf
Discovering Better Model Architectures for Medical Query Understanding
Wei Zhu | Yuan Ni | Xiaoling Wang | Guotong Xie

In developing an online question-answering system for the medical domains, natural language inference (NLI) models play a central role in question matching and intention detection. However, which models are best for our datasets? Manually selecting or tuning a model is time-consuming. Thus we experiment with automatically optimizing the model architectures on the task at hand via neural architecture search (NAS). First, we formulate a novel architecture search space based on the previous NAS literature, supporting cross-sentence attention (cross-attn) modeling. Second, we propose to modify the ENAS method to accelerate and stabilize the search results. We conduct extensive experiments on our two medical NLI tasks. Results show that our system can easily outperform the classical baseline models. We compare different NAS methods and demonstrate our approach provides the best results.

pdf
OodGAN: Generative Adversarial Network for Out-of-Domain Data Generation
Petr Marek | Vishal Ishwar Naik | Anuj Goyal | Vincent Auvray

Detecting an Out-of-Domain (OOD) utterance is crucial for a robust dialog system. Most dialog systems are trained on a pool of annotated OOD data to achieve this goal. However, collecting the annotated OOD data for a given domain is an expensive process. To mitigate this issue, previous works have proposed generative adversarial networks (GAN) based models to generate OOD data for a given domain automatically. However, these proposed models do not work directly with the text. They work with the text’s latent space instead, enforcing these models to include components responsible for encoding text into latent space and decoding it back, such as auto-encoder. These components increase the model complexity, making it difficult to train. We propose OodGAN, a sequential generative adversarial network (SeqGAN) based model for OOD data generation. Our proposed model works directly on the text and hence eliminates the need to include an auto-encoder. OOD data generated using OodGAN model outperforms state-of-the-art in OOD detection metrics for ROSTD (67% relative improvement in FPR 0.95) and OSQ datasets (28% relative improvement in FPR 0.95)

pdf
Coherent and Concise Radiology Report Generation via Context Specific Image Representations and Orthogonal Sentence States
Litton J Kurisinkel | Ai Ti Aw | Nancy F Chen

Neural models for text generation are often designed in an end-to-end fashion, typically with zero control over intermediate computations, limiting their practical usability in downstream applications. In this work, we incorporate explicit means into neural models to ensure topical continuity, informativeness and content diversity of generated radiology reports. For the purpose we propose a method to compute image representations specific to each sentential context and eliminate redundant content by exploiting diverse sentence states. We conduct experiments to generate radiology reports from medical images of chest x-rays using MIMIC-CXR. Our model outperforms baselines by up to 18% and 29% respective in the evaluation for informativeness and content ordering respectively, relative on objective metrics and 16% on human evaluation.

pdf
An Empirical Study of Generating Texts for Search Engine Advertising
Hidetaka Kamigaito | Peinan Zhang | Hiroya Takamura | Manabu Okumura

Although there are many studies on neural language generation (NLG), few trials are put into the real world, especially in the advertising domain. Generating ads with NLG models can help copywriters in their creation. However, few studies have adequately evaluated the effect of generated ads with actual serving included because it requires a large amount of training data and a particular environment. In this paper, we demonstrate a practical use case of generating ad-text with an NLG model. Specially, we show how to improve the ads’ impact, deploy models to a product, and evaluate the generated ads.

pdf
Ad Headline Generation using Self-Critical Masked Language Model
Yashal Shakti Kanungo | Sumit Negi | Aruna Rajan

For any E-commerce website it is a nontrivial problem to build enduring advertisements that attract shoppers. It is hard to pass the creative quality bar of the website, especially at a large scale. We thus propose a programmatic solution to generate product advertising headlines using retail content. We propose a state of the art application of Reinforcement Learning (RL) Policy gradient methods on Transformer (Vaswani et al., 2017) based Masked Language Models (Devlin et al., 2019). Our method creates the advertising headline by jointly conditioning on multiple products that a seller wishes to advertise. We demonstrate that our method outperforms existing Transformer and LSTM + RL methods in overlap metrics and quality audits. We also show that our model generated headlines outperform human submitted headlines in terms of both grammar and creative quality as determined by audits.

pdf
LATEX-Numeric: Language Agnostic Text Attribute Extraction for Numeric Attributes
Kartik Mehta | Ioana Oprea | Nikhil Rasiwasia

In this paper, we present LATEX-Numeric - a high-precision fully-automated scalable framework for extracting E-commerce numeric attributes from unstructured product text like product description. Most of the past work on attribute extraction is not scalable as they rely on manually curated training data, either with or without use of active learning. We rely on distant supervision for training data generation, removing dependency on manual labels. One issue with distant supervision is that it leads to incomplete training annotation due to missing attribute values while matching. We propose a multi-task learning architecture to deal with missing labels in the training data, leading to F1 improvement of 9.2% for numeric attributes over state-of-the-art single-task architecture. While multi-task architecture benefits both numeric and non-numeric attributes, we present automated techniques to further improve the numeric attributes extraction models. Numeric attributes require a list of units (or aliases) for better matching with distant supervision. We propose an automated algorithm for alias creation using unstructured text and attribute values, leading to a 20.2% F1 improvement. Extensive experiments on real world datasets for 20 numeric attributes across 5 product categories and 3 English marketplaces show that LATEX-numeric achieves a high F1-score, without any manual intervention, making it suitable for practical applications. Finally we show that the improvements are language-agnostic and LATEX-Numeric achieves 13.9% F1 improvement for 3 non-English languages.

pdf
Training Language Models under Resource Constraints for Adversarial Advertisement Detection
Eshwar Shamanna Girishekar | Shiv Surya | Nishant Nikhil | Dyut Kumar Sil | Sumit Negi | Aruna Rajan

Advertising on e-commerce and social media sites deliver ad impressions at web scale on a daily basis driving value to both shoppers and advertisers. This scale necessitates programmatic ways of detecting unsuitable content in ads to safeguard customer experience and trust. This paper focusses on techniques for training text classification models under resource constraints, built as part of automated solutions for advertising content moderation. We show how weak supervision, curriculum learning and multi-lingual training can be applied effectively to fine-tune BERT and its variants for text classification tasks in conjunction with different data augmentation strategies. Our extensive experiments on multiple languages show that these techniques detect adversarial ad categories with a substantial gain in precision at high recall threshold over the baseline.

pdf
Combining Weakly Supervised ML Techniques for Low-Resource NLU
Victor Soto | Konstantine Arkoudas

Recent advances in transfer learning have improved the performance of virtual assistants considerably. Nevertheless, creating sophisticated voice-enabled applications for new domains remains a challenge, and meager training data is often a key bottleneck. Accordingly, unsupervised learning and SSL (semi-supervised learning) techniques continue to be of vital importance. While a number of such methods have been explored previously in isolation, in this paper we investigate the synergistic use of a number of weakly supervised techniques with a view to improving NLU (Natural Language Understanding) accuracy in low-resource settings. We explore three different approaches incorporating anonymized, unlabeled and automatically transcribed user utterances into the training process, two focused on data augmentation via SSL and another one focused on unsupervised and transfer learning. We show promising results, obtaining gains that range from 4.73% to 7.65% relative improvements on semantic error rate for each individual approach. Moreover, the combination of all three methods together yields a relative improvement of 11.77% over our current baseline model. Our methods are applicable to any new domain with minimal training data, and can be deployed over time into a cycle of continual learning.

pdf
Label-Guided Learning for Item Categorization in e-Commerce
Lei Chen | Hirokazu Miyake

Item categorization is an important application of text classification in e-commerce due to its impact on the online shopping experience of users. One class of text classification techniques that has gained attention recently is using the semantic information of the labels to guide the classification task. We have conducted a systematic investigation of the potential benefits of these methods on a real data set from a major e-commerce company in Japan. Furthermore, using a hyperbolic space to embed product labels that are organized in a hierarchical structure led to better performance compared to using a conventional Euclidean space embedding. These findings demonstrate how label-guided learning can improve item categorization systems in the e-commerce domain.

pdf
Benchmarking Commercial Intent Detection Services with Practice-Driven Evaluations
Haode Qi | Lin Pan | Atin Sood | Abhishek Shah | Ladislav Kunc | Mo Yu | Saloni Potdar

Intent detection is a key component of modern goal-oriented dialog systems that accomplish a user task by predicting the intent of users’ text input. There are three primary challenges in designing robust and accurate intent detection models. First, typical intent detection models require a large amount of labeled data to achieve high accuracy. Unfortunately, in practical scenarios it is more common to find small, unbalanced, and noisy datasets. Secondly, even with large training data, the intent detection models can see a different distribution of test data when being deployed in the real world, leading to poor accuracy. Finally, a practical intent detection model must be computationally efficient in both training and single query inference so that it can be used continuously and re-trained frequently. We benchmark intent detection methods on a variety of datasets. Our results show that Watson Assistant’s intent detection model outperforms other commercial solutions and is comparable to large pretrained language models while requiring only a fraction of computational resources and training data. Watson Assistant demonstrates a higher degree of robustness when the training and test distributions differ.

pdf
Industry Scale Semi-Supervised Learning for Natural Language Understanding
Luoxin Chen | Francisco Garcia | Varun Kumar | He Xie | Jianhua Lu

This paper presents a production Semi-Supervised Learning (SSL) pipeline based on the student-teacher framework, which leverages millions of unlabeled examples to improve Natural Language Understanding (NLU) tasks. We investigate two questions related to the use of unlabeled data in production SSL context: 1) how to select samples from a huge unlabeled data pool that are beneficial for SSL training, and 2) how does the selected data affect the performance of different state-of-the-art SSL techniques. We compare four widely used SSL techniques, Pseudo-label (PL), Knowledge Distillation (KD), Virtual Adversarial Training (VAT) and Cross-View Training (CVT) in conjunction with two data selection methods including committee-based selection and submodular optimization based selection. We further examine the benefits and drawbacks of these techniques when applied to intent classification (IC) and named entity recognition (NER) tasks, and provide guidelines specifying when each of these methods might be beneficial to improve large scale NLU systems.