Md. Rizwan Parvez

Also published as: Md Rizwan Parvez


2023

pdf
Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies
Md Rizwan Parvez | Jianfeng Chi | Wasi Uddin Ahmad | Yuan Tian | Kai-Wei Chang
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Prior studies in privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in this domain. In this paper, we develop a data augmentation framework based on ensembling retriever models that captures the relevant text segments from unlabeled policy documents and expand the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascaded them with noise reduction oracles. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%. Our ablation studies provide further insights into the effectiveness of our approach.

2021

pdf
Retrieval Augmented Code Generation and Summarization
Md Rizwan Parvez | Wasi Ahmad | Saikat Chakraborty | Baishakhi Ray | Kai-Wei Chang
Findings of the Association for Computational Linguistics: EMNLP 2021

Software developers write a lot of source code and documentation during software development. Intrinsically, developers often recall parts of source code or code summaries that they had written in the past while implementing software or documenting them. To mimic developers’ code or summary generation behavior, we propose a retrieval augmented framework, REDCODER, that retrieves relevant code or summaries from a retrieval database and provides them as a supplement to code generation or summarization models. REDCODER has a couple of uniqueness. First, it extends the state-of-the-art dense retrieval technique to search for relevant code or summaries. Second, it can work with retrieval databases that include unimodal (only code or natural language description) or bimodal instances (code-description pairs). We conduct experiments and extensive analysis on two benchmark datasets of code generation and summarization in Java and Python, and the promising results endorse the effectiveness of our proposed retrieval augmented framework.

pdf
Evaluating the Values of Sources in Transfer Learning
Md Rizwan Parvez | Kai-Wei Chang
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Transfer learning that adapts a model trained on data-rich sources to low-resource targets has been widely applied in natural language processing (NLP). However, when training a transfer model over multiple sources, not every source is equally useful for the target. To better transfer a model, it is essential to understand the values of the sources. In this paper, we develop , an efficient source valuation framework for quantifying the usefulness of the sources (e.g., ) in transfer learning based on the Shapley value method. Experiments and comprehensive analyses on both cross-domain and cross-lingual transfers demonstrate that our framework is not only effective in choosing useful transfer sources but also the source values match the intuitive source-target similarity.

2019

pdf
Robust Text Classifier on Test-Time Budgets
Md Rizwan Parvez | Tolga Bolukbasi | Kai-Wei Chang | Venkatesh Saligrama
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We design a generic framework for learning a robust text classification model that achieves high accuracy under different selection budgets (a.k.a selection rates) at test-time. We take a different approach from existing methods and learn to dynamically filter a large fraction of unimportant words by a low-complexity selector such that any high-complexity state-of-art classifier only needs to process a small fraction of text, relevant for the target task. To this end, we propose a data aggregation method to train the classifier, allowing it to achieve competitive performance on fractured sentences. On four benchmark text classification tasks, we demonstrate that the framework gains consistent speedup with little degradation in accuracy on various selection budgets.

2018

pdf
Building Language Models for Text with Named Entities
Md Rizwan Parvez | Saikat Chakraborty | Baishakhi Ray | Kai-Wei Chang
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text in many domains involves a significant amount of named entities. Predicting the entity names is often challenging for a language model as they appear less frequent on the training corpus. In this paper, we propose a novel and effective approach to building a language model which can learn the entity names by leveraging their entity type information. We also introduce two benchmark datasets based on recipes and Java programming codes, on which we evaluate the proposed model. Experimental results show that our model achieves 52.2% better perplexity in recipe generation and 22.06% on code generation than state-of-the-art language models.

pdf
A Corpus of Drug Usage Guidelines Annotated with Type of Advice
Sarah Masud Preum | Md. Rizwan Parvez | Kai-Wei Chang | John Stankovic
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)