Ketaki Shetye


2026

Product reviews on e-commerce platforms are a critical form of user-generated content that influences consumer decisions. However, these reviews are predominantly in English, creating a significant accessibility barrier for users who are not fluent in English. When such reviews are translated into major Indian languages with current models, the outputs often fail to capture domain-specific features and colloquial style, resulting in stylistically unnatural text. To address this gap, we introduce **STAR-IL**, a human-annotated, multilingual, parallel corpus for style-aware translation of product reviews. We evaluate several state-of-the-art models on our dataset for the task of product review translation. Our experiments show that models fine-tuned on STAR-IL achieve significant average gains of **5.77** BLEU points and **3.78** COMET points over their baselines across all languages. Our dataset provides a valuable benchmark for future research in style-aware product review translation. The STAR-IL dataset is publicly available at https://github.com/ltrc/STAR-IL-Corpus.
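
The abstract reports corpus-level BLEU and COMET. For context, here is a minimal scoring sketch using the sacrebleu and unbabel-comet libraries; the file names and the COMET checkpoint are illustrative assumptions, not details from the paper.

```python
# Minimal evaluation sketch for BLEU and COMET (the metrics reported above).
# File names and the COMET checkpoint are illustrative assumptions.
import sacrebleu
from comet import download_model, load_from_checkpoint

# One sentence per line; hypotheses are the model's translations of the test set.
sources = open("test.en").read().splitlines()
hypotheses = open("model_output.hi").read().splitlines()
references = open("test.hi").read().splitlines()

# Corpus-level BLEU; sacrebleu takes a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# COMET is a learned metric that also conditions on the source sentence.
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print(f"COMET: {comet_model.predict(data, batch_size=8).system_score:.4f}")
```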

2025

This study addresses the critical challenge of data scarcity in machine translation for Indian languages, which combine morphological complexity with limited parallel data. We investigate a strategy for maximizing the utility of existing data: generating negative samples from positive training instances with a progressive perturbation approach, then aligning the model on the resulting preference data using Kahneman-Tversky Optimization (KTO). We demonstrate that generating negative samples and leveraging KTO improves data efficiency over traditional Supervised Fine-Tuning (SFT). By creating rejected samples as progressively perturbed translations from the available dataset, we fine-tune the Llama 3.1 8B Instruct model using QLoRA across 16 language directions involving English, Hindi, Bangla, Tamil, Telugu, and Santali. Our results show that KTO-based preference alignment with progressive perturbation consistently outperforms SFT, achieving significant gains in translation quality: average BLEU increases of 1.84 to 2.47 and chrF increases of 2.85 to 4.01 over SFT for the selected languages, while using the same positive training samples under similar computational constraints. This highlights the potential of our negative-sample generation strategy within KTO, especially in low-resource scenarios.
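
The exact perturbation operations are not specified in the abstract, so the sketch below assumes simple token-level deletions and swaps applied at increasing rates; the prompt/completion/label format matches what TRL's KTOTrainer expects, and the prompt template is hypothetical.

```python
# Sketch: progressive perturbation of a reference translation to build
# KTO preference data. Perturbation ops (deletion/swap) and the rates are
# assumptions; the output format follows TRL's KTOTrainer.
import random

def perturb(tokens, rate, rng):
    """Corrupt roughly a fraction `rate` of positions via deletion or swap."""
    tokens = tokens[:]
    for _ in range(max(1, int(len(tokens) * rate))):
        i = rng.randrange(len(tokens))
        if rng.random() < 0.5 and len(tokens) > 1:
            del tokens[i]                                 # random deletion
        else:
            j = rng.randrange(len(tokens))
            tokens[i], tokens[j] = tokens[j], tokens[i]   # random swap
    return tokens

def make_kto_examples(source, reference, rates=(0.1, 0.2, 0.4), seed=0):
    """One desirable example (the reference) plus progressively worse rejects."""
    rng = random.Random(seed)
    prompt = f"Translate to Hindi: {source}"  # hypothetical prompt template
    examples = [{"prompt": prompt, "completion": reference, "label": True}]
    for rate in rates:
        bad = " ".join(perturb(reference.split(), rate, rng))
        examples.append({"prompt": prompt, "completion": bad, "label": False})
    return examples
```

Unlike paired preference methods such as DPO, KTO needs only these unpaired binary desirable/undesirable labels, which is what makes rejected samples generated from existing positives sufficient for alignment.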

2024

Machine translation for low-resource languages presents significant challenges, primarily due to limited data availability. Our approach comprises a baseline model and a primary model. For the baseline, we fine-tune the mBART model (mbart-large-50-many-to-many-mmt) on the language pairs English-Khasi, Khasi-English, English-Manipuri, and Manipuri-English. We then augment the dataset by back-translating from the Indic languages into English. To enhance data quality, we fine-tune the LaBSE model specifically for Khasi and Manipuri, generate sentence embeddings, and apply a cosine similarity threshold of 0.84 to filter out low-quality back-translations. The filtered data is combined with the original training data and used to further fine-tune the mBART model, yielding our primary model. The results show that the primary model slightly outperforms the baseline, with the best performance achieved by the English-to-Khasi (en-kh) primary model: a BLEU score of 0.0492, a chrF score of 0.3316, and a METEOR score of 0.2589 (all on a scale of 0 to 1), with similar results for the other language pairs.
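
A minimal sketch of the similarity-filtering step follows, using the off-the-shelf LaBSE checkpoint from sentence-transformers (the paper first fine-tunes LaBSE for Khasi and Manipuri, which is omitted here); the 0.84 threshold is the one reported above.

```python
# Sketch: filter back-translated pairs by LaBSE cosine similarity.
# Uses the stock LaBSE checkpoint; the paper's fine-tuning of LaBSE
# for Khasi and Manipuri is omitted here.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")
THRESHOLD = 0.84  # cosine-similarity cutoff reported in the paper

def filter_pairs(src_sents, backtrans_sents, threshold=THRESHOLD):
    """Keep (source, back-translation) pairs with similar LaBSE embeddings."""
    # With normalize_embeddings=True, the dot product equals cosine similarity.
    src_emb = model.encode(src_sents, normalize_embeddings=True)
    bt_emb = model.encode(backtrans_sents, normalize_embeddings=True)
    sims = np.sum(src_emb * bt_emb, axis=1)  # row-wise cosine similarity
    return [(s, b) for s, b, sim in zip(src_sents, backtrans_sents, sims)
            if sim >= threshold]
```

Surviving pairs are then merged with the original parallel data for the second round of mBART fine-tuning.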