M Saiful Bari


PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts
Stephen Bach | Victor Sanh | Zheng Xin Yong | Albert Webson | Colin Raffel | Nihal V. Nayak | Abheesht Sharma | Taewoon Kim | M Saiful Bari | Thibault Fevry | Zaid Alyafeai | Manan Dey | Andrea Santilli | Zhiqing Sun | Srulik Ben-david | Canwen Xu | Gunjan Chhablani | Han Wang | Jason Fries | Maged Al-shaibani | Shanya Sharma | Urmish Thakker | Khalid Almubarak | Xiangru Tang | Dragomir Radev | Mike Tian-jian Jiang | Alexander Rush
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

PromptSource is a system for creating, sharing, and using natural language prompts. Prompts are functions that map an example from a dataset to a natural language input and target output. Using prompts to train and query language models is an emerging area in NLP that requires new tools that let users develop and refine these prompts collaboratively. PromptSource addresses the emergent challenges in this new setting with (1) a templating language for defining data-linked prompts, (2) an interface that lets users quickly iterate on prompt development by observing outputs of their prompts on many examples, and (3) a community-driven set of guidelines for contributing new prompts to a common pool. Over 2,000 prompts for roughly 170 datasets are already available in PromptSource. PromptSource is available at https://github.com/bigscience-workshop/promptsource.

What Language Model to Train if You Have One Million GPU Hours?
Teven Le Scao | Thomas Wang | Daniel Hesslow | Stas Bekman | M Saiful Bari | Stella Biderman | Hady Elsahar | Niklas Muennighoff | Jason Phang | Ofir Press | Colin Raffel | Victor Sanh | Sheng Shen | Lintang Sutawika | Jaesung Tae | Zheng Xin Yong | Julien Launay | Iz Beltagy
Findings of the Association for Computational Linguistics: EMNLP 2022

The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameters models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale alone.In the process of building BLOOM–the Big Science Large Open-science Open-access Multilingual language model–our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget.Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization.In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience.


Nearest Neighbour Few-Shot Learning for Cross-lingual Classification
M Saiful Bari | Batool Haider | Saab Mansour
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Even though large pre-trained multilingual models (e.g. mBERT, XLM-R) have led to significant performance gains on a wide range of cross-lingual NLP tasks, success on many downstream tasks still relies on the availability of sufficient annotated data. Traditional fine-tuning of pre-trained models using only a few target samples can cause over-fitting. This can be quite limiting as most languages in the world are under-resourced. In this work, we investigate cross-lingual adaptation using a simple nearest-neighbor few-shot (<15 samples) inference technique for classification tasks. We experiment using a total of 16 distinct languages across two NLP tasks- XNLI and PAWS-X. Our approach consistently improves traditional fine-tuning using only a handful of labeled samples in target locales. We also demonstrate its generalization capability across tasks.

AugVic: Exploiting BiText Vicinity for Low-Resource NMT
Tasnim Mohiuddin | M Saiful Bari | Shafiq Joty
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

UXLA: A Robust Unsupervised Data Augmentation Framework for Zero-Resource Cross-Lingual NLP
M Saiful Bari | Tasnim Mohiuddin | Shafiq Joty
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Transfer learning has yielded state-of-the-art (SoTA) results in many supervised NLP tasks. However, annotated data for every target task in every target language is rare, especially for low-resource languages. We propose UXLA, a novel unsupervised data augmentation framework for zero-resource transfer learning scenarios. In particular, UXLA aims to solve cross-lingual adaptation problems from a source language task distribution to an unknown target language task distribution, assuming no training label in the target language. At its core, UXLA performs simultaneous self-training with data augmentation and unsupervised sample selection. To show its effectiveness, we conduct extensive experiments on three diverse zero-resource cross-lingual transfer tasks. UXLA achieves SoTA results in all the tasks, outperforming the baselines by a good margin. With an in-depth framework dissection, we demonstrate the cumulative contributions of different components to its success.


LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space
Tasnim Mohiuddin | M Saiful Bari | Shafiq Joty
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Most of the successful and predominant methods for Bilingual Lexicon Induction (BLI) are mapping-based, where a linear mapping function is learned with the assumption that the word embedding spaces of different languages exhibit similar geometric structures (i.e. approximately isomorphic). However, several recent studies have criticized this simplified assumption showing that it does not hold in general even for closely related languages. In this work, we propose a novel semi-supervised method to learn cross-lingual word embeddings for BLI. Our model is independent of the isomorphic assumption and uses non-linear mapping in the latent space of two independently pre-trained autoencoders. Through extensive experiments on fifteen (15) different language pairs (in both directions) comprising resource-rich and low-resource languages from two different datasets, we demonstrate that our method outperforms existing models by a good margin. Ablation studies show the importance of different model components and the necessity of non-linear mapping.


A Unified Linear-Time Framework for Sentence-Level Discourse Parsing
Xiang Lin | Shafiq Joty | Prathyusha Jwalapuram | M Saiful Bari
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We propose an efficient neural framework for sentence-level discourse analysis in accordance with Rhetorical Structure Theory (RST). Our framework comprises a discourse segmenter to identify the elementary discourse units (EDU) in a text, and a discourse parser that constructs a discourse tree in a top-down fashion. Both the segmenter and the parser are based on Pointer Networks and operate in linear time. Our segmenter yields an F1 score of 95.4%, and our parser achieves an F1 score of 81.7% on the aggregated labeled (relation) metric, surpassing previous approaches by a good margin and approaching human agreement on both tasks (98.3 and 83.0 F1).