Soumedhik Bharati

2026

FROST: Factual Reasoning via Optimized Stochastic Trajectories in Large Language Models during Inference
Soumedhik Bharati | Ebad Shabbir | Jiechao Gao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Large language models face a trade-off between factual consistency and reasoning diversity: deterministic decoding prioritizes reliability but may miss alternative solution paths, while high-temperature sampling increases exploration at the cost of accuracy. We present FROST (Factual Reasoning via Optimized Stochastic Trajectories), an inference-time framework that balances exploration and exploitation without additional training or context augmentation. FROST combines deterministic inference from a large model with targeted stochastic sampling from a smaller model, selecting outputs via multi-criteria validation over coherence, factual grounding, and semantic novelty. Across HotpotQA, CommonsenseQA, and MMLU, FROST achieves 2–5 percentage point improvements over standard chain-of-thought prompting and reduces unsupported outputs by 40% relative to Standard CoT. Compared to Self-Consistency ensembles, FROST delivers comparable accuracy at 28% lower inference cost through strategic delegation to smaller models. On an adversarial subset with unanswerable queries, FROST abstains on 34% of cases versus 8% for standard chain-of-thought, reducing false positives by 45%. Task-stratified evaluation shows that exploration benefits scale with problem ambiguity. Generalization to mathematical reasoning, code generation, and multimodal domains remains future work.


Morphological Feature Extraction for Fine-Grained Sorani Kurdish Dialect Identification: A Hybrid Transformer-Linguistic Approach
Soumedhik Bharati | Shibam Mandal | Subham Majumdar | Swarup Kr Ghosh | Sayani Mondal
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script

Sorani Kurdish is reportedly spoken by approximately 6 million people in Iraq and Iran. It exhibits substantial regional variation but lacks computational resources for dialect identification. We present the first fine-grained sub-dialect classification system for six Sorani varieties: Sulaymaniyah, Erbil, Iranian Sorani, Ardalani, Babani, and Mukriani. Our approach combines cross-lingual contextual embeddings (XLM-RoBERTa) with morphological features derived from explicit linguistic rules, including 24 patterns capturing verb prefixes, pronominal clitics, and definite markers. The proposed morphology-augmented XLM-R model is trained on a unified dataset of 16,409 sentences without manual annotation and achieves 91.91% accuracy, outperforming pure transformers (91.79%) and traditional machine learning baselines (SVM 86.41%). Key ablation studies reveal that morphological features serve as effective regularizers for geographically proximate dialects.


AjamiMorph: Zero-Annotation Morphological Discovery for Hausa Ajami via Multi-Method Consensus
Soumedhik Bharati | Shibam Mandal | Prithwish Ghosh | Swarup Kr Ghosh | Sayani Mondal
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script

Hausa Ajami (Hausa written in Arabic script) remains severely under-resourced for computational morphology. We present AjamiMorph, a zero-annotation framework that discovers morphemes through consensus among three unsupervised methods: Byte Pair Encoding (BPE), transition-based boundary detection using Pointwise Mutual Information (PMI), and computational-linguistics-based Distributional Affix Mining (DAM). Using a Hausa Ajami Bible corpus consisting of 637,414 tokens, AjamiMorph identifies 1,611 high-confidence morphemes, achieving 99.9% coverage. The inventory exhibits a linguistically realistic distribution (66.0% stems, 22.6% suffixes, 11.4% prefixes) and recovers 77.8% of known Hausa affixes. A permutation test that shuffles method assignments (preserving per-method selection sizes) confirms that the observed agreement is above chance; a chi-square test serves as a secondary check. A lightweight 5-gram LM comparison (characters vs. consensus morphemes) provides an extrinsic signal. We also report negative results for script-driven Arabic assumptions and LLM-first annotation. This work provides the first unsupervised morpheme inventory for Hausa Ajami and demonstrates consensus as a robust strategy for zero-resource morphology.
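
The permutation test mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the agreement statistic or the exact shuffling procedure, so everything below is an assumption made for illustration: agreement is taken to be the number of candidates selected by at least two methods, the null hypothesis redraws each method's selection as a random subset of the same size from a shared candidate pool, and the toy data and the names `consensus_count` and `permutation_test` are hypothetical rather than taken from the paper.

```python
import random


def consensus_count(selections, min_votes=2):
    """Count candidates chosen by at least `min_votes` of the methods."""
    votes = {}
    for selected in selections:
        for item in selected:
            votes[item] = votes.get(item, 0) + 1
    return sum(1 for v in votes.values() if v >= min_votes)


def permutation_test(universe, selections, n_perm=2000, min_votes=2, seed=0):
    """Estimate how often random selections agree as much as the observed ones.

    Each permutation redraws every method's selection uniformly at random
    from `universe`, keeping that method's selection size fixed.
    """
    rng = random.Random(seed)
    pool = sorted(universe)  # fixed order so results are reproducible
    sizes = [len(s) for s in selections]
    observed = consensus_count(selections, min_votes)

    hits = 0
    for _ in range(n_perm):
        shuffled = [set(rng.sample(pool, k)) for k in sizes]
        if consensus_count(shuffled, min_votes) >= observed:
            hits += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Toy data (hypothetical): 200 candidate segments, three methods picking 30 each.
universe = [f"m{i}" for i in range(200)]
bpe = {f"m{i}" for i in range(30)}
pmi = {f"m{i}" for i in range(25)} | {f"m{i}" for i in range(100, 105)}
dam = {f"m{i}" for i in range(20)} | {f"m{i}" for i in range(150, 160)}

observed, p_value = permutation_test(universe, [bpe, pmi, dam])
print(f"observed consensus = {observed}, p = {p_value:.4f}")
```

A small p-value here means only that the three toy selections overlap more than size-matched random selections would. A different agreement statistic, such as one requiring all three methods to agree, can be tried by changing `min_votes`.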

Co-authors

Subham Majumdar 1

Ebad Shabbir 1
