Mehak Gupta
2026
Multilingual Language Models Encode Script Over Linguistic Structure
Aastha A K Verma | Anwoy Chatterjee | Mehak Gupta | Tanmoy Chakraborty
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multilingual language models (LMs) organize representations for typologically and orthographically diverse languages into a shared parameter space, yet the nature of this internal organization remains elusive. In this work, we investigate which linguistic properties — abstract language identity or surface-form cues — shape multilingual representations. To do so, we analyze language-associated units across different model families and scales using the Language Activation Probability Entropy (LAPE) metric, and further decompose activations with Sparse Autoencoders. We find that these units are strongly conditioned on orthography: romanization induces near-disjoint representations that align with neither native-script inputs nor English, while word-order shuffling has limited effect on unit identity. Probing shows that typological structure becomes increasingly accessible in deeper layers, while causal interventions indicate that generation is most sensitive to units that are invariant to surface-form perturbations rather than to units identified by typological alignment alone. Overall, our results suggest that multilingual LMs organize representations around surface form, with linguistic abstraction emerging gradually without collapsing into a unified interlingua.
2025
AI Assistant for Socioeconomic Empowerment Using Federated Learning
Nahed Abdelgaber | Labiba Jahan | Nino Castellano | Joshua Oltmanns | Mehak Gupta | Jia Zhang | Akshay Pednekar | Ashish Basavaraju | Ian Velazquez | Zerui Ma
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Socioeconomic status (SES) reflects an individual’s standing in society, as determined by a holistic set of factors including income, education level, and occupation. Identifying individuals in low-SES groups is crucial to ensuring they receive necessary support. However, many individuals may be hesitant to disclose their SES directly. This study introduces a federated learning-powered framework capable of verifying individuals’ SES levels by analyzing their natural-language communications. We propose to study language usage patterns among individuals from different SES groups using clustering and topic modeling techniques. An empirical study leveraging life narrative interviews demonstrates the effectiveness of our proposed approach.