Avijit Thawani


BPE beyond Word Boundary: How NOT to use Multi Word Expressions in Neural Machine Translation
Dipesh Kumar | Avijit Thawani
Proceedings of the Third Workshop on Insights from Negative Results in NLP

BPE tokenization merges characters into longer tokens by finding frequently occurring contiguous patterns within the word boundary. An intuitive relaxation would be to extend a BPE vocabulary with multi-word expressions (MWEs): bigrams (in_a), trigrams (out_of_the), and skip-grams (he . his). In the context of Neural Machine Translation (NMT), we replace the least frequent subword/whole-word tokens with the most frequent MWEs. We find that these modifications to BPE end up hurting the model, resulting in a net drop of BLEU and chrF scores across two language pairs. We observe that naively extending BPE beyond word boundaries results in incoherent tokens which are themselves better represented as individual words. Moreover, we find that Pointwise Mutual Information (PMI) instead of frequency finds better MWEs (e.g., New_York, Statue_of_Liberty, neither . nor) which consistently improves translation performance.We release all code at https://github.com/pegasus-lynx/mwe-bpe.


Representing Numbers in NLP: a Survey and a Vision
Avijit Thawani | Jay Pujara | Filip Ilievski | Pedro Szekely
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

NLP systems rarely give special consideration to numbers found in text. This starkly contrasts with the consensus in neuroscience that, in the brain, numbers are represented differently from words. We arrange recent NLP work on numeracy into a comprehensive taxonomy of tasks and methods. We break down the subjective notion of numeracy into 7 subtasks, arranged along two dimensions: granularity (exact vs approximate) and units (abstract vs grounded). We analyze the myriad representational choices made by over a dozen previously published number encoders and decoders. We synthesize best practices for representing numbers in text and articulate a vision for holistic numeracy in NLP, comprised of design trade-offs and a unified evaluation.

Numeracy enhances the Literacy of Language Models
Avijit Thawani | Jay Pujara | Filip Ilievski
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Specialized number representations in NLP have shown improvements on numerical reasoning tasks like arithmetic word problems and masked number prediction. But humans also use numeracy to make better sense of world concepts, e.g., you can seat 5 people in your ‘room’ but not 500. Does a better grasp of numbers improve a model’s understanding of other concepts and words? This paper studies the effect of using six different number encoders on the task of masked word prediction (MWP), as a proxy for evaluating literacy. To support this investigation, we develop Wiki-Convert, a 900,000 sentence dataset annotated with numbers and units, to avoid conflating nominal and ordinal number occurrences. We find a significant improvement in MWP for sentences containing numbers, that exponent embeddings are the best number encoders, yielding over 2 points jump in prediction accuracy over a BERT baseline, and that these enhanced literacy skills also generalize to contexts without annotated numbers. We release all code at https://git.io/JuZXn.


SWOW-8500: Word Association task for Intrinsic Evaluation of Word Embeddings
Avijit Thawani | Biplav Srivastava | Anil Singh
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP

Downstream evaluation of pretrained word embeddings is expensive, more so for tasks where current state of the art models are very large architectures. Intrinsic evaluation using word similarity or analogy datasets, on the other hand, suffers from several disadvantages. We propose a novel intrinsic evaluation task employing large word association datasets (particularly the Small World of Words dataset). We observe correlations not just between performances on SWOW-8500 and previously proposed intrinsic tasks of word similarity prediction, but also with downstream tasks (eg. Text Classification and Natural Language Inference). Most importantly, we report better confidence intervals for scores on our word association task, with no fall in correlation with downstream performance.


IJCNLP-2017 Task 3: Review Opinion Diversification (RevOpiD-2017)
Anil Kumar Singh | Avijit Thawani | Mayank Panchal | Anubhav Gupta | Julian McAuley
Proceedings of the IJCNLP 2017, Shared Tasks

Unlike Entity Disambiguation in web search results, Opinion Disambiguation is a relatively unexplored topic. RevOpiD shared task at IJCNLP-2107 aimed to attract attention towards this research problem. In this paper, we summarize the first run of this task and introduce a new dataset that we have annotated for the purpose of evaluating Opinion Mining, Summarization and Disambiguation methods.