Harshit Surana


2025

Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision
Zhouhang Xie | Tushar Khot | Bhavana Dalvi Mishra | Harshit Surana | Julian McAuley | Peter Clark | Bodhisattwa Prasad Majumder
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Instruction-following LLMs have recently allowed systems to discover hidden concepts from a collection of unstructured documents based on a natural language description of the purpose of the discovery (i.e., the goal). Still, the quality of the discovered concepts remains mixed, as it depends heavily on the LLM’s reasoning ability and drops when the data is noisy or beyond the LLM’s knowledge. We present Instruct-LF, a goal-oriented latent factor discovery system that integrates the LLM’s instruction-following ability with statistical models to handle large, noisy datasets where LLM reasoning alone falls short. Instruct-LF uses LLMs to propose fine-grained, goal-related properties from documents, estimates their presence across the dataset, and applies gradient-based optimization to uncover hidden factors, where each factor is represented by a cluster of co-occurring properties. We evaluate latent factors produced by Instruct-LF on movie recommendation, text-world navigation, and legal document categorization tasks. These interpretable representations improve downstream task performance by 5-52% over the best baselines and were preferred 1.8 times as often as the best alternative, on average, in human evaluation.

2008

A More Discerning and Adaptable Multilingual Transliteration Mechanism for Indian Languages
Harshit Surana | Anil Kumar Singh
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition
Karthik Gali | Harshit Surana | Ashwini Vaidya | Praneeth Shishtla | Dipti Misra Sharma
Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages

Estimating the Resource Adaption Cost from a Resource Rich Language to a Similar Resource Poor Language
Anil Kumar Singh | Kiran Pala | Harshit Surana
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Developing resources that can be used for Natural Language Processing is an extremely difficult task for any language, and even more so for less privileged (or less computerized) languages. One way to overcome this difficulty is to adapt the resources of a linguistically close, resource-rich language. In this paper we discuss how the cost of such adaption can be estimated, using subjective and objective measures of linguistic similarity, for the purpose of allocating financial resources, time, manpower, etc. Since this is the first work of its kind, the method described in this paper should be seen as only a preliminary method, indicative of how better methods can be developed. Corpora of several less computerized languages had to be collected for the work described in the paper, which was difficult because not much electronic data is available for many of these varieties. Even when such data is available, it is in non-standard encodings, which meant that we had to build encoding converters for these varieties. The varieties we have focused on are some of those spoken in the South Asian region.

2007

Can Corpus Based Measures be Used for Comparative Study of Languages?
Anil Kumar Singh | Harshit Surana
Proceedings of Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology