Prasenjit Mitra


2022

STAPI: An Automatic Scraper for Extracting Iterative Title-Text Structure from Web Documents
Nan Zhang | Shomir Wilson | Prasenjit Mitra
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Formal documents are often organized into sections of text, each with a title, and extracting this structure remains an under-explored aspect of natural language processing. This iterative title-text structure is valuable data for building models for headline generation and section title generation, but there is no corpus that contains web documents annotated with titles and prose texts. Therefore, we propose the first title-text dataset on web documents that incorporates a wide variety of domains to facilitate downstream training. We also introduce STAPI (Section Title And Prose text Identifier), a two-step system for labeling section titles and prose text in HTML documents. To filter out unrelated content like document footers, its first step involves a filter that reads HTML documents and proposes a set of textual candidates. In the second step, a typographic classifier takes the candidates from the filter and categorizes each one into one of three pre-defined classes (title, prose text, and miscellany). We show that STAPI significantly outperforms two baseline models in terms of title-text identification. We release our dataset along with a web application to facilitate supervised and semi-supervised training in this domain.

2021

Extractive Research Slide Generation Using Windowed Labeling Ranking
Athar Sefid | Prasenjit Mitra | Jian Wu | C Lee Giles
Proceedings of the Second Workshop on Scholarly Document Processing

Presentation slides generated from original research papers provide an efficient way to present research innovations. Manually generating presentation slides is labor-intensive. We propose a method to automatically generate slides for scientific articles based on a corpus of 5000 paper-slide pairs compiled from conference proceedings websites. The sentence labeling module of our method is based on SummaRuNNer, a neural sequence model for extractive summarization. Instead of ranking sentences based on semantic similarities in the whole document, our algorithm measures the importance and novelty of sentences by combining semantic and lexical features within a sentence window. Our method outperforms several baseline methods, including SummaRuNNer, by a significant margin in terms of ROUGE score.

Are BERTs Sensitive to Native Interference in L2 Production?
Zixin Tang | Prasenjit Mitra | David Reitter
Proceedings of the Second Workshop on Insights from Negative Results in NLP

Using the essay portion of the International Corpus Network of Asian Learners of English (ICNALE) and the TOEFL11 corpus, we fine-tuned neural language models based on BERT to predict English learners’ native languages. Results showed that neural models can learn to represent and detect such native language impacts, but multilingually trained models have no advantage in doing so.

2020

Recognition of Implicit Geographic Movement in Text
Scott Pezanowski | Prasenjit Mitra
Proceedings of the Twelfth Language Resources and Evaluation Conference

Analyzing the geographic movement of humans, animals, and other phenomena is a growing field of research. This research has benefited urban planning, logistics, animal migration understanding, and much more. Typically, the movement is captured as precise geographic coordinates and time stamps with Global Positioning Systems (GPS). Although some research uses computational techniques to take advantage of implicit movement in descriptions of route directions, hiking paths, and historical exploration routes, innovation would accelerate with a large and diverse corpus. We created a corpus of sentences labeled as describing geographic movement or not, including the type of entity moving. Creating this corpus proved difficult because there were no comparable corpora to start with, human labeling costs were high, and movement can at times be interpreted differently. To overcome these challenges, we developed an iterative process employing hand labeling, crowd voting for confirmation, and machine learning to predict more labels. By merging advances in word embeddings with traditional machine learning models and model ensembling, prediction accuracy is at an acceptable level to produce a large silver-standard corpus despite the small gold-standard corpus training set. Our corpus will likely benefit computational processing of geography in text and spatial cognition, in addition to detection of movement.

2016

Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages
Muhammad Imran | Prasenjit Mitra | Carlos Castillo
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Microblogging platforms such as Twitter provide active communication channels during mass convergence and emergency events such as earthquakes and typhoons. During the sudden onset of a crisis situation, affected people post useful information on Twitter that can be used for situational awareness and other humanitarian disaster response efforts, if processed in a timely and effective manner. Processing social media information poses multiple challenges, such as parsing noisy, brief, and informal messages, learning information categories from the incoming stream of messages, and classifying them into different classes, among others. One of the basic necessities of many of these tasks is the availability of data, in particular human-annotated data. In this paper, we present human-annotated Twitter corpora collected during 19 different crises that took place between 2013 and 2015. To demonstrate the utility of the annotations, we train machine learning classifiers. Moreover, we publish the first and largest word2vec word embeddings trained on 52 million crisis-related tweets. To deal with language issues in tweets, we present human-annotated normalized lexical resources for different lexical variations.

2015

WikiKreator: Improving Wikipedia Stubs Automatically
Siddhartha Banerjee | Prasenjit Mitra
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

Identifying Emotional and Informational Support in Online Health Communities
Prakhar Biyani | Cornelia Caragea | Prasenjit Mitra | John Yen
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Summarizing Online Forum Discussions – Can Dialog Acts of Individual Messages Help?
Sumit Bhatia | Prakhar Biyani | Prasenjit Mitra
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2012

Thread Specific Features are Helpful for Identifying Subjectivity Orientation of Online Forum Threads
Prakhar Biyani | Sumit Bhatia | Cornelia Caragea | Prasenjit Mitra
Proceedings of COLING 2012