Eeham Khan
2026
Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study
Eeham Khan | Firas Saidani | Owen Van Esbroeck | Richard Khoury | Leila Kosseim
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Despite the widespread adoption of Large Language Models (LLMs), their strongest capabilities remain largely confined to a small number of high-resource languages for which there is abundant training data. Recently, continual pre-training (CPT) has emerged as a means to adapt these models to low-resource regional dialects. In this paper, we study the use of CPT for dialect learning under tight data and compute budgets. Using low-rank adaptation (LoRA) and compute-efficient continual pre-training, we adapt three LLMs to the Québec French dialect using a very small dataset and benchmark them on the COLE suite. Our experiments demonstrate an improvement on the minority-dialect benchmarks, with minimal regression on the prestige-language benchmarks, while updating only around 1% of model parameters. Analysis of the results demonstrates that gains are highly contingent on corpus composition. These findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can narrow the dialect gap by providing cost-effective and sustainable language resource creation, expanding high-quality LLM access to minority linguistic communities. To support reproducibility and broaden access, we release the first Québec French LLMs on Hugging Face.
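As a rough illustration of why LoRA touches only about 1% of parameters, the sketch below builds the low-rank update W + (α/r)·BA for a single weight matrix and computes the trainable fraction. The dimensions, rank, and scaling here are hypothetical, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: hidden dim d, LoRA rank r, scaling alpha
d, r, alpha = 2048, 8, 16

W = rng.standard_normal((d, d)) * 0.02   # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialized

# Effective weight after adaptation: W + (alpha / r) * B @ A
# With B zero-initialized, the adapted model starts identical to the base model.
W_eff = W + (alpha / r) * (B @ A)

# Fraction of parameters that are actually trained for this matrix: 2*r/d
trainable_fraction = (A.size + B.size) / W.size
print(f"{trainable_fraction:.2%}")
```

With d=2048 and r=8 the trainable fraction is 2r/d ≈ 0.78%, on the order of the ~1% reported; the fraction shrinks further as the hidden dimension grows.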
2025
CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification
Nawar Turk | Eeham Khan | Leila Kosseim
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
This paper presents our approach to the PromiseEval task at SemEval-2025, which focuses on verifying promises in corporate ESG (Environmental, Social, and Governance) reports. We explore three model architectures to address the four subtasks of promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Our first model utilizes ESG-BERT with task-specific classifier heads, while our second model enhances this architecture with linguistic features tailored for each subtask. Our third approach implements a combined subtask model with attention-based sequence pooling, transformer representations augmented with document metadata, and multi-objective learning. Experiments on the English portion of the ML-Promise dataset demonstrate progressive improvement across our models, with our combined subtask approach achieving a private leaderboard score of 0.5268, outperforming the provided baseline of 0.5227. Our work highlights the effectiveness of linguistic feature extraction, attention pooling, and multi-objective learning in promise verification tasks, despite challenges posed by class imbalance and limited training data.
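Attention-based sequence pooling, mentioned in the combined-subtask model, collapses a variable-length sequence of token representations into one fixed-size vector by learning a weighting over tokens. A minimal numpy sketch, with hypothetical dimensions and an assumed tanh-scored attention form (the paper's exact formulation is not specified here):

```python
import numpy as np

rng = np.random.default_rng(1)

seq_len, hidden = 10, 64
H = rng.standard_normal((seq_len, hidden))  # token representations from the encoder
w = rng.standard_normal(hidden)             # learned attention query vector (hypothetical)

scores = np.tanh(H) @ w                     # one unnormalized score per token
weights = np.exp(scores - scores.max())
weights /= weights.sum()                    # softmax over the sequence positions

pooled = weights @ H                        # weighted sum -> fixed-size sequence vector
print(pooled.shape)
```

Unlike taking only the [CLS] token, this lets every token contribute to the pooled representation in proportion to its learned relevance, which can help when the decisive evidence for a promise sits mid-document.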