Shubhanshu Mishra


2021

pdf bib
LMSOC: An Approach for Socially Sensitive Pretraining
Vivek Kulkarni | Shubhanshu Mishra | Aria Haghighi
Findings of the Association for Computational Linguistics: EMNLP 2021

While large-scale pretrained language models have been shown to learn effective linguistic representations for many NLP tasks, there remain many real-world contextual aspects of language that current approaches do not capture. For instance, consider a cloze test “I enjoyed the _____ game this weekend”: the correct answer depends heavily on where the speaker is from, when the utterance occurred, and the speaker’s broader social milieu and preferences. Although language depends heavily on the geographical, temporal, and other social contexts of the speaker, these elements have not been incorporated into modern transformer-based language models. We propose a simple but effective approach to incorporate speaker social context into the learned representations of large-scale language models. Our method first learns dense representations of social contexts using graph representation learning algorithms and then primes language model pretraining with these social context representations. We evaluate our approach on geographically-sensitive language modeling tasks and show a substantial improvement (more than 100% relative lift on MRR) compared to baselines.

pdf bib
Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction
Shubhanshu Mishra | Aria Haghighi
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

We evaluate a simple approach to improving zero-shot multilingual transfer of mBERT on social media corpus by adding a pretraining task called translation pair prediction (TPP), which predicts whether a pair of cross-lingual texts are a valid translation. Our approach assumes access to translations (exact or approximate) between source-target language pairs, where we fine-tune a model on source language task data and evaluate the model in the target language. In particular, we focus on language pairs where transfer learning is difficult for mBERT: those where source and target languages are different in script, vocabulary, and linguistic typology. We show improvements from TPP pretraining over mBERT alone in zero-shot transfer from English to Hindi, Arabic, and Japanese on two social media tasks: NER (a 37% average relative improvement in F1 across target languages) and sentiment classification (12% relative improvement in F1) on social media text, while also benchmarking on a non-social media task of Universal Dependency POS tagging (6.7% relative improvement in accuracy). Our results are promising given the lack of social media bitext corpus. Our code can be found at: https://github.com/twitter-research/multilingual-alignment-tpp.

2020

pdf bib
Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020
Sudhanshu Mishra | Shivangi Prasad | Shubhanshu Mishra
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

We present our team ‘3Idiots’ (referred as ‘sdhanshu’ in the official rankings) approach for the Trolling, Aggression and Cyberbullying (TRAC) 2020 shared tasks. Our approach relies on fine-tuning various Transformer models on the different datasets. We also investigated the utility of task label marginalization, joint label classification, and joint training on multilingual datasets as possible improvements to our models. Our team came second in English sub-task A, a close fourth in the English sub-task B and third in the remaining 4 sub-tasks. We find the multilingual joint training approach to be the best trade-off between computational efficiency of model deployment and model’s evaluation performance. We open source our approach at https://github.com/socialmediaie/TRAC2020.

pdf bib
Scubed at 3C task A - A simple baseline for citation context purpose classification
Shubhanshu Mishra | Sudhanshu Mishra
Proceedings of the 8th International Workshop on Mining Scientific Publications

We present our team Scubed’s approach in the ‘3C’ Citation Context Classification Task, Subtask A, citation context purpose classification. Our approach relies on text based features transformed via tf-idf features followed by training a variety of models which are capable of capturing non-linear features. Our best model on the leaderboard is a multi-layer perceptron which also performs best during our rerun. Our submission code for replicating experiments is at: https://github.com/napsternxg/Citation_Context_Classification.

pdf bib
Scubed at 3C task B - A simple baseline for citation context influence classification
Shubhanshu Mishra | Sudhanshu Mishra
Proceedings of the 8th International Workshop on Mining Scientific Publications

We present our team Scubed’s approach in the 3C Citation Context Classification Task, Subtask B, citation context influence classification. Our approach relies on text based features transformed via tf-idf features followed by training a variety of simple models resulting in a strong baseline. Our best model on the leaderboard is a random forest classifier using only the citation context text. A replication of our analysis finds logistic regression and gradient boosted tree classifier to be the best performing model. Our submission code can be found at: https://github.com/napsternxg/Citation_Context_Classification.

2016

pdf bib
Semi-supervised Named Entity Recognition in noisy-text
Shubhanshu Mishra | Jana Diesner
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

Many of the existing Named Entity Recognition (NER) solutions are built based on news corpus data with proper syntax. These solutions might not lead to highly accurate results when being applied to noisy, user generated data, e.g., tweets, which can feature sloppy spelling, concept drift, and limited contextualization of terms and concepts due to length constraints. The models described in this paper are based on linear chain conditional random fields (CRFs), use the BIEOU encoding scheme, and leverage random feature dropout for up-sampling the training data. The considered features include word clusters and pre-trained distributed word representations, updated gazetteer features, and global context predictions. The latter feature allows for ingesting the meaning of new or rare tokens into the system via unsupervised learning and for alleviating the need to learn lexicon based features, which usually tend to be high dimensional. In this paper, we report on the solution [ST] we submitted to the WNUT 2016 NER shared task. We also present an improvement over our original submission [SI], which we built by using semi-supervised learning on labelled training data and pre-trained resourced constructed from unlabelled tweet data. Our ST solution achieved an F1 score of 1.2% higher than the baseline (35.1% F1) for the task of extracting 10 entity types. The SI resulted in an increase of 8.2% in F1 score over the base-line (7.08% over ST). Finally, the SI model’s evaluation on the test data achieved a F1 score of 47.3% (~1.15% increase over the 2nd best submitted solution). Our experimental setup and results are available as a standalone twitter NER tool at https://github.com/napsternxg/TwitterNER.