Yashvardhan Sharma


2024

Empowering Low-Resource Language Translation: Methodologies for Bhojpuri-Hindi and Marathi-Hindi ASR and MT
Harpreet Singh Anand | Amulya Ratna Dash | Yashvardhan Sharma
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)

This paper describes our submission to the unconstrained track of the ‘Dialectal and Low-Resource Task’ at IWSLT 2024. We designed cascaded Speech Translation systems for the Marathi-Hindi and Bhojpuri-Hindi language pairs, fine-tuning different pre-trained models for Automatic Speech Recognition (ASR) and Machine Translation (MT).

Impact of Decoding Methods on Human Alignment of Conversational LLMs
Shaz Furniturewala | Kokil Jaidka | Yashvardhan Sharma
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

To be included in chatbot systems, large language models (LLMs) must be aligned with human conversational conventions. However, being trained mainly on web-scraped data gives existing LLMs a voice closer to informational text than actual human speech. In this paper, we examine the effect of decoding methods, including Beam Search, Top K Sampling, and Nucleus Sampling, on the alignment between LLM-generated and human conversations. We present new measures of alignment in substance, style, and psychometric orientation, and experiment with two conversation datasets. Our results provide subtle insights: better alignment is attributed to fewer beams in Beam Search and lower values of P in Nucleus Sampling. We also find that task-oriented and open-ended datasets perform differently in terms of alignment, indicating the significance of taking into account the context of the interaction.
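
As a rough illustration of the two sampling-based decoding methods named in the abstract above, the sketch below filters a next-token distribution by Top K and by nucleus (top-p) mass before sampling. It is not the authors' code: the function names `top_k_filter`, `nucleus_filter`, and `sample`, and the toy distribution, are hypothetical and chosen only for this example.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {tok: p / total for tok, p in ranked}


def nucleus_filter(probs, top_p):
    """Keep the smallest set of top-ranked tokens whose cumulative
    probability reaches top_p, then renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def sample(probs, rng):
    """Draw one token according to the (already filtered) distribution."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution, for illustration only.
toy = {"yes": 0.5, "sure": 0.3, "indeed": 0.15, "affirmative": 0.05}

rng = random.Random(0)
narrow = nucleus_filter(toy, top_p=0.7)   # lower P: fewer candidate tokens
wide = nucleus_filter(toy, top_p=0.99)    # higher P: nearly the full vocabulary
next_token = sample(narrow, rng)
```

A lower `top_p` shrinks the candidate set, so the sampled text stays closer to the model's most likely continuations; this is the knob the abstract refers to as "values of P".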

BITS Pilani at SemEval-2024 Task 10: Fine-tuning BERT and Llama 2 for Emotion Recognition in Conversation
Dilip Venkatesh | Pasunti Prasanjith | Yashvardhan Sharma
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Emotion Recognition in Conversation (ERC) aims to assign an emotion to a dialogue in a conversation between people. The first subtask of the EDiReF shared task aims to assign an emotion to a Hindi-English code-mixed conversation. For this, our team proposes a system to identify the emotion based on fine-tuning large language models on the MaSaC dataset. For our study we have fine-tuned two LLMs, BERT and Llama 2, to perform sequence classification to identify the emotion of the text.

BITS Pilani at SemEval-2024 Task 9: Prompt Engineering with GPT-4 for Solving Brainteasers
Dilip Venkatesh | Yashvardhan Sharma
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Solving brainteasers is a task that requires complex reasoning prowess. The increase of research in natural language processing has led to the development of massive large language models with billions (or trillions) of parameters that are able to solve difficult questions due to their advanced reasoning capabilities. The SemEval BRAINTEASER shared task consists of sentence and word puzzles along with options containing the answer for the puzzle. Our team uses OpenAI’s GPT-4 model along with prompt engineering to solve these brainteasers.

2023

Steno AI at SemEval-2023 Task 6: Rhetorical Role Labelling of Legal Documents using Transformers and Graph Neural Networks
Anshika Gupta | Shaz Furniturewala | Vijay Kumari | Yashvardhan Sharma
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

A legal document is usually long and dense, requiring human effort to parse it. It also contains significant amounts of jargon, which makes deriving insights from it using existing models a poor approach. This paper presents the approaches undertaken to perform the task of rhetorical role labelling on Indian Court Judgements. We experiment with graph-based approaches like Graph Convolutional Networks and the Label Propagation Algorithm, and transformer-based approaches including variants of BERT, to improve accuracy scores on text classification of complex legal documents.

BITS-P at WAT 2023: Improving Indic Language Multimodal Translation by Image Augmentation using Diffusion Models
Amulya Dash | Hrithik Raj Gupta | Yashvardhan Sharma
Proceedings of the 10th Workshop on Asian Translation

This paper describes the proposed system for multimodal machine translation. We have participated in multimodal translation tasks for English into three Indic languages: Hindi, Bengali, and Malayalam. We leverage the inherent richness of multimodal data to bridge the gap of ambiguity in translation. We fine-tuned the ‘No Language Left Behind’ (NLLB) machine translation model for multimodal translation, further enhancing the model accuracy by image data augmentation using latent diffusion. Our submission achieves the best BLEU score for English-Hindi, English-Bengali, and English-Malayalam language pairs for both Evaluation and Challenge test sets.

2022

BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers
Shaz Furniturewala | Vijay Kumari | Amulya Ratna Dash | Hriday Kedia | Yashvardhan Sharma
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

Code-Mixed text data consists of sentences having words or phrases from more than one language. Most multi-lingual communities worldwide communicate using multiple languages, with English usually one of them. Hinglish is a Code-Mixed text composed of Hindi and English but written in Roman script. This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system. For the HinglishEval task, the proposed model uses multilingual BERT to find the similarity between synthetically generated and human-generated sentences to predict the quality of synthetically generated Hinglish sentences.

2021

Open Machine Translation for Low Resource South American Languages (AmericasNLP 2021 Shared Task Contribution)
Shantipriya Parida | Subhadarshi Panda | Amulya Dash | Esau Villatoro-Tello | A. Seza Doğruöz | Rosa M. Ortega-Mendoza | Amadeo Hernández | Yashvardhan Sharma | Petr Motlicek
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper describes the submission of team “Tamalli” to the AmericasNLP2021 shared task on Open Machine Translation for low resource South American languages. Our goal was to evaluate different Machine Translation (MT) techniques, statistical and neural-based, under several configuration settings. We obtained the second-best results for the language pairs “Spanish-Bribri”, “Spanish-Asháninka”, and “Spanish-Rarámuri” in the category “Development set not used for training”. Our experiments will serve as a point of reference for researchers working on MT with low-resource languages.

NLPHut’s Participation at WAT2021
Shantipriya Parida | Subhadarshi Panda | Ketan Kotwal | Amulya Ratna Dash | Satya Ranjan Dash | Yashvardhan Sharma | Petr Motlicek | Ondřej Bojar
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

This paper describes the submissions of our team “NLPHut” to the WAT 2021 shared tasks. We have participated in the English→Hindi Multimodal translation task, the English→Malayalam Multimodal translation task, and the Indic Multilingual translation task. We have used the state-of-the-art Transformer model with language tags in different settings for the translation task and proposed a novel “region-specific” caption generation approach using a combination of image CNN and LSTM for the Hindi and Malayalam image captioning. Our submission ranks first in the English→Malayalam Multimodal translation task (text-only translation, and Malayalam caption), and second in the English→Hindi Multimodal translation task (text-only translation, and Hindi caption). Our submissions have also performed well in the Indic Multilingual translation tasks.

Towards Offensive Language Identification for Dravidian Languages
Siva Sai | Yashvardhan Sharma
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

Offensive speech identification in countries like India poses several challenges due to the usage of code-mixed and romanized variants of multiple languages by users in their posts on social media. The challenge of offensive language identification on social media for Dravidian languages is harder, considering the low resources available for the same. In this paper, we explored the zero-shot learning and few-shot learning paradigms based on multilingual language models for offensive speech detection in code-mixed and romanized variants of three Dravidian languages: Malayalam, Tamil, and Kannada. We propose a novel and flexible approach of selective translation and transliteration to reap better results from fine-tuning and ensembling multilingual transformer networks like XLM-RoBERTa and mBERT. We implemented pretrained, fine-tuned, and ensembled versions of XLM-RoBERTa for offensive speech classification. Further, we experimented with inter-language, inter-task, and multi-task transfer learning techniques to leverage the rich resources available for offensive speech identification in the English language and to enrich the models with knowledge transfer from related tasks. The proposed models yielded good results and are promising for effective offensive speech identification in low resource settings.

Sentiment Analysis of Dravidian Code Mixed Data
Asrita Venkata Mandalam | Yashvardhan Sharma
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

This paper presents the methodologies implemented while classifying Dravidian code-mixed comments according to their polarity. With datasets of code-mixed Tamil and Malayalam available, three methods are proposed: a sub-word level model, a word embedding based model and a machine learning based architecture. The sub-word and word embedding based models utilized a Long Short-Term Memory (LSTM) network along with language-specific preprocessing, while the machine learning model used term frequency–inverse document frequency (TF-IDF) vectorization along with a Logistic Regression model. The sub-word level model was submitted to the track ‘Sentiment Analysis for Dravidian Languages in Code-Mixed Text’ proposed by Forum of Information Retrieval Evaluation in 2020 (FIRE 2020). Although it received a rank of 5 and 12 for the Tamil and Malayalam tasks respectively in the FIRE 2020 track, this paper improves upon the results by a margin to attain final weighted F1-scores of 0.65 for the Tamil task and 0.68 for the Malayalam task. The former score is equivalent to that attained by the highest ranked team of the Tamil track.
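
To make the TF-IDF vectorization step mentioned in the abstract above concrete, here is a minimal sketch of the textbook weighting (term frequency times log inverse document frequency). The abstract does not state which TF-IDF variant or settings the authors used, so the unsmoothed formula, the function name `tfidf`, and the toy tokens are assumptions for illustration only; library implementations typically add smoothing and normalisation.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised, non-empty document.

    weight = (count of term in doc / doc length) * log(N / document frequency)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical pre-tokenised comments (made-up tokens, not from the datasets).
comments = [
    ["film", "super", "super"],
    ["film", "bore"],
]
vectors = tfidf(comments)
```

Terms that occur in every comment (here `film`) get weight zero, while terms specific to one comment keep a positive weight; vectors of this kind would then be passed to a classifier such as Logistic Regression.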

2020

Character aware models with similarity learning for metaphor detection
Tarun Kumar | Yashvardhan Sharma
Proceedings of the Second Workshop on Figurative Language Processing

Recent work on automatic sequential metaphor detection has involved recurrent neural networks that are initialized with different pre-trained word embeddings and sometimes combined with hand-engineered features. To capture lexical and orthographic information automatically, in this paper we propose to add character-based word representation. Also, to contrast the difference between literal and contextual meaning, we utilize a similarity network. We explore these components via two different architectures, a BiLSTM model and a Transformer Encoder model similar to BERT, to perform metaphor identification. We participate in the Second Shared Task on Metaphor Detection on both the VUA and TOEFL datasets with the above models. The experimental results demonstrate the effectiveness of our method as it outperforms all the systems which participated in the previous shared task.