Tamali Banerjee

2024

pdf bib abs
Standardizing Genomic Reports: A Dataset, A Standardized Format, and A Prompt-Based Technique for Structured Data Extraction
Tamali Banerjee | Akshit Varmora | Jay J. Gorakhiya | Sanand Sasidharan | Anuradha Kanamarlapudi | Pushpak Bhattacharyya
Proceedings of the 21st International Conference on Natural Language Processing (ICON)

Extracting information from genomic reports of cancer patients is crucial for both healthcare professionals and cancer research. While Large Language Models (LLMs) have shown promise in extracting information, their potential for handling genomic reports remains unexplored. These reports are complex, multi-page documents that feature a variety of visually rich, structured layouts and contain many domain-specific terms. Two primary challenges complicate the process: (i) extracting data from PDFs with intricate layouts and domain-specific terminology and (ii) dealing with variations in report layouts from different laboratories, making extraction layout-dependent and posing challenges for subsequent data processing. To tackle these issues, we propose GR-PROMPT, a prompt-based technique, and GR-FORMAT, a standardized format. Together, these two convert a genomic report in PDF format into GR-FORMAT as a JSON file using a multimodal LLM. To address the lack of available datasets for this task, we introduce GR-DATASET, a synthetic collection of 100 cancer genomic reports in PDF format. Each report is accompanied by key-value information presented in a layout-specific format, as well as structured key-value information in GR-FORMAT. This is the first dataset in this domain to promote further research for the task. We performed our experiment on this dataset.

2023

pdf bib abs
Comparing DAE-based and MASS-based UNMT: Robustness to Word-Order Divergence in English–>Indic Language Pairs
Tamali Banerjee | Rudra Murthy | Pushpak Bhattacharyya
Proceedings of the 20th International Conference on Natural Language Processing (ICON)

The proliferation of fake news poses a significant challenge in the digital era. Detecting false information, especially in non-English languages, is crucial to combating misinformation effectively. In this research, we introduce a novel approach for Dravidian fake news detection by harnessing the capabilities of the MuRIL transformer model, further enhanced by gradient accumulation techniques. Our study focuses on the Dravidian languages, a diverse group of languages spoken in South India, which are often underserved in natural language processing research. We optimize memory usage, stabilize training, and improve the model’s overall performance by accumulating gradients over multiple batches. The proposed model exhibits promising results in terms of both accuracy and efficiency. Our findings underline the significance of adapting state-of-the-art techniques, such as MuRIL-based models and gradient accumulation, to non-English language.

2021

pdf bib abs
Crosslingual Embeddings are Essential in UNMT for distant languages: An English to IndoAryan Case Study
Tamali Banerjee | Rudra V Murthy | Pushpak Bhattacharya
Proceedings of Machine Translation Summit XVIII: Research Track

Recent advances in Unsupervised Neural Machine Translation (UNMT) has minimized the gap between supervised and unsupervised machine translation performance for closely related language-pairs. However and the situation is very different for distant language pairs. Lack of overlap in lexicon and low syntactic similarity such as between English and IndoAryan languages leads to poor translation quality in existing UNMT systems. In this paper and we show that initialising the embedding layer of UNMT models with cross-lingual embeddings leads to significant BLEU score improvements over existing UNMT models where the embedding layer weights are randomly initialized. Further and freezing the embedding layer weights leads to better gains compared to updating the embedding layer weights during training. We experimented using Masked Sequence to Sequence (MASS) and Denoising Autoencoder (DAE) UNMT approaches for three distant language pairs. The proposed cross-lingual embedding initialization yields BLEU score improvement of as much as ten times over the baseline for English-Hindi and English-Bengali and English-Gujarati. Our analysis shows that initialising embedding layer with static cross-lingual embedding mapping is essential for training of UNMT models for distant language-pairs.

pdf bib abs
Scrambled Translation Problem: A Problem of Denoising UNMT
Tamali Banerjee | Rudra V Murthy | Pushpak Bhattacharya
Proceedings of Machine Translation Summit XVIII: Research Track

In this paper and we identify an interesting kind of error in the output of Unsupervised Neural Machine Translation (UNMT) systems like Undreamt1. We refer to this error type as Scrambled Translation problem. We observe that UNMT models which use word shuffle noise (as in case of Undreamt) can generate correct words and but fail to stitch them together to form phrases. As a result and words of the translated sentence look scrambled and resulting in decreased BLEU. We hypothesise that the reason behind scrambled translation problem is ’shuffling noise’ which is introduced in every input sentence as a denoising strategy. To test our hypothesis and we experiment by retraining UNMT models with a simple retraining strategy. We stop the training of the Denoising UNMT model after a pre-decided number of iterations and resume the training for the remaining iterations- which number is also pre-decided- using original sentence as input without adding any noise. Our proposed solution achieves significant performance improvement UNMT models that train conventionally. We demonstrate these performance gains on four language pairs and viz. and English-French and English-German and English-Spanish and Hindi-Punjabi. Our qualitative and quantitative analysis shows that the retraining strategy helps achieve better alignment as observed by attention heatmap and better phrasal translation and leading to statistically significant improvement in BLEU scores.

2018

pdf bib abs
Meaningless yet meaningful: Morphology grounded subword-level NMT
Tamali Banerjee | Pushpak Bhattacharyya
Proceedings of the Second Workshop on Subword/Character LEvel Models

We explore the use of two independent subsystems Byte Pair Encoding (BPE) and Morfessor as basic units for subword-level neural machine translation (NMT). We show that, for linguistically distant language-pairs Morfessor-based segmentation algorithm produces significantly better quality translation than BPE. However, for close language-pairs BPE-based subword-NMT may translate better than Morfessor-based subword-NMT. We propose a combined approach of these two segmentation algorithms Morfessor-BPE (M-BPE) which outperforms these two baseline systems in terms of BLEU score. Our results are supported by experiments on three language-pairs: English-Hindi, Bengali-Hindi and English-Bengali.

pdf bib
Multilingual Indian Language Translation System at WAT 2018: Many-to-one Phrase-based SMT
Tamali Banerjee | Anoop Kunchukuttan | Pushpak Bhattacharya
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation: 5th Workshop on Asian Translation