Anurag Acharya

2025

Wind energy project assessments present significant challenges for decision-makers, who must navigate and synthesize hundreds of pages of environmental and scientific documentation. These documents often span different regions and project scales, covering multiple domains of expertise. This process traditionally demands immense time and specialized knowledge from decision-makers. The advent of Large Language Models (LLM) and Retrieval Augmented Generation (RAG) approaches offer a transformative solution, enabling rapid, accurate cross-document information retrieval and synthesis. As the landscape of Natural Language Processing (NLP) and text generation continues to evolve, benchmarking becomes essential to evaluate and compare the performance of different RAG-based LLMs. In this paper, we present a comprehensive framework to generate a domain relevant RAG benchmark. Our framework is based on automatic question-answer generation with Human (domain experts)-AI (LLM) teaming. As a case study, we demonstrate the framework by introducing WeQA, a first-of-its-kind benchmark on the wind energy domain which comprises of multiple scientific documents/reports related to environmental aspects of wind energy projects. Our framework systematically evaluates RAG performance using diverse metrics and multiple question types with varying complexity level, providing a foundation for rigorous assessment of RAG-based systems in complex scientific domains and enabling researchers to identify areas for improvement in domain-specific applications.

2024

pdf bib abs
GOLEM: GOld Standard for Learning and Evaluation of Motifs
W. Victor Yarlott | Anurag Acharya | Diego Castro Estrada | Diana Gomez | Mark Finlayson
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Motifs are distinctive, recurring, widely used idiom-like words or phrases, often originating from folklore, whose meaning are anchored in a narrative. Motifs have significance as communicative devices because they concisely imply a constellation of culturally relevant information. Their broad usage suggests their cognitive importance as touchstones of cultural knowledge. We present GOLEM, the first dataset annotated for motific information. The dataset comprises 7,955 English articles (2,039,424 words). The corpus identifies 26,078 motif candidates across 34 motif types from three cultural or national groups: Jewish, Irish, and Puerto Rican. Each motif candidate is labeled with the type of usage (Motific, Referential, Eponymic, or Unrelated), resulting in 1,723 actual motific instances. Annotation was performed by individuals identifying as members of each group and achieved a Fleiss’ kappa of >0.55. We demonstrate that classification of candidate type is a challenging task for LLMs using a few-shot approach; recent models such as T5, FLAN-T5, GPT-2, and Llama 2 (7B) achieved a performance of 41% accuracy at best. These data will support development of new models and approaches for detecting (and reasoning about) motific information in text. We release the corpus, the annotation guide, and the code to support other researchers building on this work.

pdf bib abs
Discovering Implicit Meanings of Cultural Motifs from Text
Anurag Acharya | Diego Estrada | Shreeja Dahal | W. Victor H. Yarlott | Diana Gomez | Mark Finlayson
Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024)

Motifs are distinctive, recurring, widely used idiom-like words or phrases, often originating in folklore and usually strongly anchored to a particular cultural or national group. Motifs are significant communicative devices across a wide range of media—including news, literature, and propaganda—because they can concisely imply a large set of culturally relevant associations. One difficulty of understanding motifs is that their meaning is usually implicit, so for an out-group person the meaning is inaccessible. We present the Motif Implicit Meaning Extractor (MIME), a proof-of-concept system designed to automatically identify a motif’s implicit meaning, as evidenced by textual uses of the motif across a large set data. MIME uses several sources (including motif indices, Wikipedia pages on the motifs, explicit explanations of motifs from in-group informants, and news/social media posts where the motif is used) and can generate a structured report of information about a motif understandable to an out-group person. In addition to a variety of examples and information drawn from structured sources, the report includes implicit information about a motif such as the type of reference (e.g., a person, an organization, etc.), it’s general connotation (strongly negative, slightly negative, neutral, etc.), and it’s associations (typically adjectives). We describe how MIME works and demonstrate its operation on a small set of manually curated motifs. We perform a qualitative evaluation of the output, and assess the difficulty of the problem, showing that explicit motif information provided by cultural informants is critical to high quality output, although mining motif usages in news and social media provides useful additional depth. A system such as MIME, appropriately scaled up, would potentially be quite useful to an out-group person trying to understand in-group usages of motifs, and has wide potential applications in domains such as literary criticism, cultural heritage, marketed and branding, and intelligence analysis.

pdf bib abs
Evaluating the Effectiveness of Retrieval-Augmented Large Language Models in Scientific Document Reasoning
Sai Munikoti | Anurag Acharya | Sridevi Wagle | Sameera Horawalavithana
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)

Despite the dramatic progress in Large Language Model (LLM) development, LLMs often provide seemingly plausible but not factual information, often referred to as hallucinations. Retrieval-augmented LLMs provide a non-parametric approach to solve these issues by retrieving relevant information from external data sources and augment the training process. These models help to trace evidence from an externally provided knowledge base allowing the model predictions to be better interpreted and verified. In this work, we critically evaluate these models in their ability to perform in scientific document reasoning tasks. To this end, we tuned multiple such model variants with science-focused instructions and evaluated them on a scientific document reasoning benchmark for the usefulness of the retrieved document passages. Our findings suggest that models justify predictions in science tasks with fabricated evidence and leveraging scientific corpus as pretraining data does not alleviate the risk of evidence fabrication.