Saurabh Garg


2023

Downstream Datasets Make Surprisingly Good Pretraining Corpora
Kundan Krishna | Saurabh Garg | Jeffrey Bigham | Zachary Lipton
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent these gains are attributable to the massive background corpora employed for pretraining versus to the pretraining objectives themselves. This paper introduces a large-scale study of self-pretraining, where the same (downstream) training data is used for both pretraining and finetuning. In experiments addressing both ELECTRA and RoBERTa models and 10 distinct downstream classification datasets, we observe that self-pretraining rivals standard pretraining on the BookWiki corpus (despite using around 10x–500x less data), outperforming the latter on 7 and 5 datasets, respectively. Surprisingly, these task-specific pretrained models often perform well on other tasks, including the GLUE benchmark. Besides classification tasks, self-pretraining also provides benefits on structured output prediction tasks such as span-based question answering and commonsense inference, often providing more than 50% of the performance boosts provided by pretraining on the BookWiki corpus. Our results hint that in many scenarios, performance gains attributable to pretraining are driven primarily by the pretraining objective itself and are not always attributable to the use of external pretraining data in massive amounts. These findings are especially relevant in light of concerns about intellectual property and offensive content in web-scale pretraining data.

2018

Code-switched Language Models Using Dual RNNs and Same-Source Pretraining
Saurabh Garg | Tanmay Parekh | Preethi Jyothi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This work focuses on building language models (LMs) for code-switched text. We propose two techniques that significantly improve these LMs: 1) a novel recurrent neural network unit with dual components that focus on each language in the code-switched text separately, and 2) pretraining the LM using synthetic text from a generative model estimated using the training data. We demonstrate the effectiveness of our proposed techniques by reporting perplexities on a Mandarin-English task, where they yield significant reductions in perplexity.

2004

Evaluation of Transcription and Annotation Tools for a Multi-modal, Multi-party Dialogue Corpus
Saurabh Garg | Bilyana Martinovski | Susan Robinson | Jens Stephan | Joel Tetreault | David R. Traum
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Issues in Corpus Development for Multi-party Multi-modal Task-oriented Dialogue
Susan Robinson | Bilyana Martinovski | Saurabh Garg | Jens Stephan | David Traum
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)