Koustuv Dasgupta


2023

Financial Numeric Extreme Labelling: A dataset and benchmarking
Soumya Sharma | Subhendu Khatuya | Manjunath Hegde | Afreen Shaikh | Koustuv Dasgupta | Pawan Goyal | Niloy Ganguly
Findings of the Association for Computational Linguistics: ACL 2023

The U.S. Securities and Exchange Commission (SEC) requires all public companies to file periodic financial statements in which numerals are annotated with a particular label from a taxonomy. In this paper, we formulate the task of automatically assigning a label to a particular numeral span in a sentence, where the label is drawn from an extremely large label set. For this task, we release a dataset, Financial Numeric Extreme Labelling (FNXL), annotated with 2,794 labels. We benchmark performance on the FNXL dataset by formulating the task as (a) a sequence labelling problem and (b) a pipeline with span extraction followed by Extreme Classification. Although the two approaches perform comparably, the pipeline solution provides a slight edge for the least frequent labels.

2022

ECTSum: A New Benchmark Dataset For Bullet Point Summarization of Long Earnings Call Transcripts
Rajdeep Mukherjee | Abhinav Bohra | Akash Banerjee | Soumya Sharma | Manjunath Hegde | Afreen Shaikh | Shivani Shrivastava | Koustuv Dasgupta | Niloy Ganguly | Saptarshi Ghosh | Pawan Goyal
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Despite tremendous progress in automatic summarization, state-of-the-art methods are predominantly trained to excel at summarizing short newswire articles, or documents with strong layout biases such as scientific articles or government reports. Efficient techniques for summarizing financial documents, which discuss facts and figures, remain largely unexplored, mainly due to the unavailability of suitable datasets. In this work, we present ECTSum, a new dataset that pairs transcripts of earnings calls (ECTs) hosted by publicly traded companies, as documents, with expert-written short telegram-style bullet point summaries derived from corresponding Reuters articles. ECTs are long unstructured documents without any prescribed length limit or format. We benchmark our dataset with state-of-the-art summarization methods across various metrics evaluating the content quality and factual consistency of the generated summaries. Finally, we present a simple yet effective approach, ECT-BPS, to generate a set of bullet points that precisely capture the important facts discussed in the calls.