Mahsa Yarmohammadi


2023

The Effect of Alignment Correction on Cross-Lingual Annotation Projection
Shabnam Behzad | Seth Ebner | Marc Marone | Benjamin Van Durme | Mahsa Yarmohammadi
Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)

Cross-lingual annotation projection is a practical method for improving performance on low-resource structured prediction tasks. An important step in annotation projection is obtaining alignments between the source and target texts, which enables the mapping of annotations across the texts. By manually correcting automatically generated alignments, we examine the impact of alignment quality (automatic, manual, and mixed) on downstream performance for two information extraction tasks and quantify the trade-off between annotation effort and model performance.

2021

Gradual Fine-Tuning for Low-Resource Domain Adaptation
Haoran Xu | Seth Ebner | Mahsa Yarmohammadi | Aaron Steven White | Benjamin Van Durme | Kenton Murray
Proceedings of the Second Workshop on Domain Adaptation for NLP

Fine-tuning is known to improve NLP models by adapting an initial model trained on more plentiful but less domain-salient examples to data in a target domain. Such domain adaptation is typically done using one stage of fine-tuning. We demonstrate that gradually fine-tuning in a multi-step process can yield substantial further gains and can be applied without modifying the model or learning objective.

Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction
Mahsa Yarmohammadi | Shijie Wu | Marc Marone | Haoran Xu | Seth Ebner | Guanghui Qin | Yunmo Chen | Jialiang Guo | Craig Harman | Kenton Murray | Aaron Steven White | Mark Dredze | Benjamin Van Durme
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Zero-shot cross-lingual information extraction (IE) describes the construction of an IE model for some target language, given existing annotations exclusively in some other language, typically English. While the advance of pretrained multilingual encoders suggests an easy optimism of “train on English, run on any language”, we find through a thorough exploration and extension of techniques that a combination of approaches, both new and old, leads to better performance than any one cross-lingual strategy in particular. We explore techniques including data projection and self-training, and how different pretrained encoders impact them. We use English-to-Arabic IE as our initial example, demonstrating strong performance in this setting for event extraction, named entity recognition, part-of-speech tagging, and dependency parsing. We then apply data projection and self-training to three tasks across eight target languages. Because no single set of techniques performs the best across all tasks, we encourage practitioners to explore various configurations of the techniques described in this work when seeking to improve on zero-shot training.

2020

CopyNext: Explicit Span Copying and Alignment in Sequence to Sequence Models
Abhinav Singh | Patrick Xia | Guanghui Qin | Mahsa Yarmohammadi | Benjamin Van Durme
Proceedings of the Fourth Workshop on Structured Prediction for NLP

Copy mechanisms are employed in sequence-to-sequence (seq2seq) models to generate reproductions of words from the input to the output. These frameworks, operating at the lexical type level, fail to provide an explicit alignment that records where each token was copied from. Further, they require contiguous token sequences from the input (spans) to be copied individually. We present a model with an explicit token-level copy operation and extend it to copying entire spans. Our model provides hard alignments between spans in the input and output, allowing for nontraditional applications of seq2seq, like information extraction. We demonstrate the approach on Nested Named Entity Recognition, achieving near state-of-the-art accuracy with an order of magnitude increase in decoding speed.

Collecting Verified COVID-19 Question Answer Pairs
Adam Poliak | Max Fleming | Cash Costello | Kenton Murray | Mahsa Yarmohammadi | Shivani Pandya | Darius Irani | Milind Agarwal | Udit Sharma | Shuo Sun | Nicola Ivanov | Lingxi Shang | Kaushik Srinivasan | Seolhwa Lee | Xu Han | Smisha Agarwal | João Sedoc
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

We release a dataset of over 2,100 COVID-19-related frequently asked question-answer pairs scraped from over 40 trusted websites. We include an additional 24,000 questions pulled from online sources that have been aligned by experts with existing answered questions from our dataset. This paper describes our efforts in collecting the dataset and summarizes the resulting data. Our dataset is automatically updated daily and available at https://github.com/JHU-COVID-QA/scraping-qas. So far, this data has been used to develop a chatbot providing users information about COVID-19. We encourage others to build analytics and tools upon this dataset as well.

2019

Robust Document Representations for Cross-Lingual Information Retrieval in Low-Resource Settings
Mahsa Yarmohammadi | Xutai Ma | Sorami Hisamoto | Muhammad Rahman | Yiming Wang | Hainan Xu | Daniel Povey | Philipp Koehn | Kevin Duh
Proceedings of Machine Translation Summit XVII: Research Track

2014

Transforming trees into hedges and parsing with “hedgebank” grammars
Mahsa Yarmohammadi | Aaron Dunlop | Brian Roark
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Applications of Lexicographic Semirings to Problems in Speech and Language Processing
Richard Sproat | Mahsa Yarmohammadi | Izhak Shafran | Brian Roark
Computational Linguistics, Volume 40, Issue 4 - December 2014

2013

Incremental Segmentation and Decoding Strategies for Simultaneous Translation
Mahsa Yarmohammadi | Vivek Kumar Rangarajan Sridhar | Srinivas Bangalore | Baskaran Sankaran
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

Harvesting Parallel Text in Multiple Languages with Limited Supervision
Luciano Barbosa | Vivek Kumar Rangarajan Sridhar | Mahsa Yarmohammadi | Srinivas Bangalore
Proceedings of COLING 2012