Ashish Kulkarni
The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets: Pragyaan-IT (22.5K) and Pragyaan-Align (100K) across 10 Indian languages covering 13 broad and 56 sub-categories, leveraging 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.
Tool-augmented Large Language Models (TALMs) are known to enhance the skillset of large language models (LLMs), thereby improving their reasoning abilities across many tasks. While TALMs have been successfully employed in different question-answering benchmarks, their efficacy on complex mathematical reasoning benchmarks and the potential complementary benefits offered by tools for knowledge retrieval and mathematical equation solving are open research questions. In this work, we present MathSensei, a tool-augmented large language model for mathematical reasoning. We study the complementary benefits of three tools: a knowledge retriever (Bing Web Search), a program generator and executor (Python), and a symbolic equation solver (Wolfram-Alpha API), through evaluations on mathematical reasoning datasets. We perform exhaustive ablations on MATH, a popular dataset for evaluating mathematical reasoning across diverse mathematical disciplines. We also conduct experiments involving well-known tool planners to study the impact of tool sequencing on model performance. MathSensei achieves 13.5% higher accuracy than gpt-3.5-turbo with Chain-of-Thought on the MATH dataset. We further observe that TALMs are not as effective for simpler math word problems (in GSM-8K), and that the benefit increases as the complexity and required knowledge increase (progressively over AQuA, MMLU-Math, and higher-level complex questions in MATH). The code and data are available at https://github.com/Debrup-61/MathSensei.
In this work, we provide the system description of our submission to the English-to-Lowres Multimodal Translation Task at the Workshop on Asian Translation (WAT2024). We introduce Chitranuvad, a multimodal model that effectively integrates a multilingual LLM and a vision module for multimodal translation. Our method uses a ViT image encoder to extract visual representations as visual token embeddings, which are projected to the LLM space by an adapter layer, and generates translations in an autoregressive fashion. We participated in all three tracks (Image Captioning, Text-only, and Multimodal translation tasks) for Indic languages (i.e., English translation to Hindi, Bengali, and Malayalam) and achieved SOTA results for Hindi in all of them on the Challenge set while remaining competitive for the other languages in the shared task.
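The projection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the adapter is a single linear layer and that projected visual tokens are prepended to the text embeddings, neither of which the abstract specifies. The function names, dimensions, and random weights are all hypothetical.

```python
import random

def make_adapter(vision_dim, llm_dim, seed=0):
    """Build a hypothetical linear adapter: a weight matrix and a zero bias.

    Assumption: a single linear layer; the actual adapter design is not
    given in the abstract.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(llm_dim)]
               for _ in range(vision_dim)]
    bias = [0.0] * llm_dim
    return weights, bias

def project_tokens(visual_tokens, adapter):
    """Map each visual token embedding from the vision space to the LLM space."""
    weights, bias = adapter
    projected = []
    for token in visual_tokens:
        row = [
            sum(token[i] * weights[i][j] for i in range(len(token))) + bias[j]
            for j in range(len(bias))
        ]
        projected.append(row)
    return projected

def build_input_sequence(visual_tokens, text_embeddings, adapter):
    """Concatenate projected visual tokens with text embeddings.

    Assumption: visual tokens come first; the ordering is illustrative only.
    """
    return project_tokens(visual_tokens, adapter) + text_embeddings
```

In this sketch, an autoregressive decoder would consume the combined sequence and emit the translation token by token; that decoding step is omitted here.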
We study the problem of ontology population for a domain ontology and present solutions based on semi-automatic techniques. A domain ontology for an organization often consists of classes whose instances are either specific to, or independent of, the organization. For example, in an academic domain ontology, classes like Professor and Department could be organization (university) specific, while Conference and Programming languages are organization independent. This distinction allows us to leverage data sources both within the organization and on the Internet to extract entities and populate an ontology. We propose techniques that build on those for open domain IE. We show through comprehensive evaluation how these semi-automatic techniques, together with user input, achieve high precision. We experimented with the academic domain and built an ontology comprising over 220 classes. Intranet documents from five universities formed our organization-specific corpora, and we used open domain knowledge bases like Wikipedia, Linked Open Data, and web pages from the Internet as the organization-independent data sources. The populated ontology that we built for one of the universities comprised over 75,000 instances. We adhere to the semantic web standards and tools and make the resources available in the OWL format. These could be useful for applications such as information extraction, text annotation, and information retrieval.