Satya Lokam

2026

Exploring Two-Phase Continual Instruction Fine-tuning for Multilingual Adaptation in Large Language Models
Divyanshu Aggarwal | Sankarshan Damle | Navin Goyal | Satya Lokam | Sunayana Sitaram
Findings of the Association for Computational Linguistics: ACL 2026

A key challenge for Large Language Models (LLMs) is improving their multilingual instruction-following ability over time without deteriorating their ability in languages they already excel at, typically English. In this paper, we study a two-phase Continual Fine-tuning (CFT) setup toward improving a model’s multilingual adaptability. Concretely, we consider a two-phase CFT process in which an English-only end-to-end instruction fine-tuned LLM (Phase 1) is sequentially fine-tuned on a multilingual instruction dataset (Phase 2). Across MISTRAL-7B and LLAMA-3-8B and multiple dataset pairs, we show that instructional similarity between phases is critical: aligned datasets preserve or improve English while boosting multilingual ability, whereas misaligned datasets cause English degradation. We show that this degradation arises from representation shift during CFT, and that targeted mitigation strategies, including generative replay and heuristic-based layer freezing, reduce this shift and improve multilingual adaptation.

2025

CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models
Pavan Kalyan Tankala | Shubhra Mishra | Satya Lokam | Navin Goyal
Proceedings of the First BabyLM Workshop

We introduce CurLL, a comprehensive continual learning dataset and benchmark grounded in human developmental trajectories from ages 5–10, enabling systematic and fine-grained assessment of models’ ability to progressively acquire new skills. CurLL spans five developmental stages (0–4) covering ages 5–10, with a skill graph of 32 high-level skills, 128 sub-skills, 350+ goals, and 1,300+ indicators explicitly modeling prerequisite relationships. We generate a 23.4B-token synthetic dataset with controlled skill progression, vocabulary complexity, and format diversity, comprising paragraphs, comprehension-based QA (CQA), skill-testing QA (CSQA), and instruction–response (IR) pairs. Stage-wise token counts range from 2.12B to 6.78B tokens, supporting precise analysis of forgetting, forward transfer, and backward transfer. Using a 135M-parameter transformer trained under independent, joint, and sequential (continual) setups, we show trade-offs in skill retention and transfer efficiency. By mirroring human learning patterns and providing fine-grained control over skill dependencies, this work advances continual learning evaluations for language models.

Co-authors

Pavan Kalyan Tankala 1

Venues

BabyLM 1
Findings 1
