Abstention Ability (AA) is a critical aspect of Large Language Model (LLM) reliability, referring to an LLM’s capability to withhold responses when uncertain or lacking a definitive answer, without compromising performance. Although previous studies have attempted to improve AA, they lack a standardized evaluation method and remain unsuitable for black-box models where token prediction probabilities are inaccessible. This makes comparative analysis challenging, especially for state-of-the-art closed-source commercial LLMs. This paper bridges this gap by introducing a black-box evaluation approach and a new dataset, Abstain-QA, crafted to rigorously assess AA across varied question types (answerable and unanswerable), domains (well-represented and under-represented), and task types (fact-centric and reasoning). We also propose a new confusion matrix, the “Answerable-Unanswerable Confusion Matrix” (AUCM), which serves as the basis for a structured and precise evaluation of AA. Finally, we explore how three prompting strategies, namely Strict Prompting, Verbal Confidence Thresholding, and Chain-of-Thought (CoT), affect AA. Our results indicate that even powerful models such as GPT-4 and Mixtral 8x22b encounter difficulties with abstention; however, strategic approaches such as Strict Prompting and CoT can enhance this capability.
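As a rough illustration of how a confusion matrix over answerable and unanswerable questions might be tallied, the sketch below counts model outcomes per question type. The abstract does not define the AUCM cells, so the row and column names, the record layout, and the helper functions `build_matrix` and `abstention_rate` are all assumptions made for this example, not the paper's definitions.

```python
from collections import Counter

def build_matrix(records):
    """Count (question_type, outcome) pairs.

    Each record is a tuple (is_answerable, abstained, is_correct).
    This layout is hypothetical, chosen only for illustration.
    """
    counts = Counter()
    for is_answerable, abstained, is_correct in records:
        row = "answerable" if is_answerable else "unanswerable"
        if abstained:
            col = "abstained"
        elif is_correct:
            col = "answered_correctly"
        else:
            col = "answered_incorrectly"
        counts[(row, col)] += 1
    return counts

def abstention_rate(counts, row):
    """Fraction of questions in the given row on which the model abstained."""
    total = sum(v for (r, _), v in counts.items() if r == row)
    return counts[(row, "abstained")] / total if total else 0.0

records = [
    (True, False, True),    # answerable, answered correctly
    (True, True, False),    # answerable, but the model abstained
    (False, True, False),   # unanswerable, model abstained (desired)
    (False, False, False),  # unanswerable, model answered anyway
]
counts = build_matrix(records)
```

Under this assumed layout, a high abstention rate on the unanswerable row is desirable, while abstentions on the answerable row cost performance.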
Multilingual LLMs have achieved remarkable benchmark performance, but we find that they continue to underperform on non-Latin script languages across contemporary LLM families. This discrepancy arises because LLMs are pretrained on orthographic scripts, which are dominated by Latin characters that obscure their shared phonology with non-Latin scripts. We propose leveraging phonemic transcriptions as complementary signals to induce script-invariant representations. Our study demonstrates that integrating phonemic signals improves performance across both non-Latin and Latin script languages, with a particularly significant impact on closing the performance gap between the two. Through detailed experiments, we show that phonemic and orthographic scripts retrieve distinct examples for in-context learning (ICL). This motivates our proposed Mixed-ICL retrieval strategy, where further aggregation leads to significant performance improvements for both Latin script languages (up to 12.6%) and non-Latin script languages (up to 15.1%) compared to randomized ICL retrieval.
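One simple way to combine examples retrieved under the two scripts is to interleave the two ranked lists and drop duplicates. The sketch below does exactly that. The abstract does not specify how Mixed-ICL aggregates its results, so the round-robin merge, the function name `mixed_icl`, and the example identifiers are assumptions for illustration only.

```python
from itertools import zip_longest

def mixed_icl(ortho_ranked, phon_ranked, k):
    """Interleave two ranked lists of example IDs, keeping first occurrences.

    ortho_ranked: IDs retrieved using orthographic text (best first).
    phon_ranked:  IDs retrieved using phonemic transcriptions (best first).
    Returns at most k distinct IDs. Round-robin merging is an assumed
    aggregation rule, not necessarily the one used in the paper.
    """
    merged, seen = [], set()
    for pair in zip_longest(ortho_ranked, phon_ranked):
        for example_id in pair:
            if example_id is not None and example_id not in seen:
                seen.add(example_id)
                merged.append(example_id)
    return merged[:k]

selected = mixed_icl(["a", "b", "c"], ["b", "d", "e"], k=4)
```

Because the two retrievers tend to surface distinct examples, a merge of this kind gives the prompt a more varied set of demonstrations than either list alone.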
Graph databases like Neo4j are gaining popularity over traditional relational databases for modeling and querying relationships in complex, interconnected data. While translating natural language into SQL queries is well-researched, generating Cypher queries for Neo4j remains relatively underexplored. In this work, we present an automated, LLM-supervised pipeline to generate high-quality synthetic data for Text2Cypher. Our Cypher data generation pipeline introduces LLM-As-Database-Filler, a novel strategy for ensuring Cypher query correctness, thus resulting in high-quality generations. Using our pipeline, we generate a high-quality Text2Cypher dataset, SynthCypher, containing 29.8k instances across various domains and queries of varying complexity. Training open-source LLMs like LLaMa-3.1-8B, Mistral-7B, and QWEN7B on SynthCypher results in performance gains of up to 40% on the Text2Cypher test split and 30% on the SPIDER benchmark adapted for graph databases.
Existing benchmarks are becoming saturated and less effective in evaluating model performance due to factors such as data contamination and the advancing capabilities of Large Language Models (LLMs). This paper introduces EMDM (Enhanced Model Differentiation Metric), a novel weighted metric designed to revitalize existing benchmarks. EMDM implements a weighting schema for samples based on their complexity and requisite knowledge, utilizing the performance of a baseline LLM in two experimental setups: “Unguided”, where the model has no prior exposure to test samples, and “Guided”, where the model has prior knowledge about the desired answer. This schema is leveraged in an optimization objective to assign weights to test samples, distinguishing instances of varying complexity. EMDM accounts for both answer correctness and the depth and accuracy of reasoning, offering a more nuanced evaluation of model performance. By weighting test examples based on their required reasoning and knowledge, EMDM achieves a distinguishing range of evaluation scores of 46% among various LLMs, compared to just 17% with traditional exact match (EM) metrics, thereby highlighting the saturation of current evaluation methods.
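To make the idea of a weighted metric concrete, the sketch below scores a model with a weighted exact match in which each test sample's weight depends on how a baseline LLM fared on it in the Unguided and Guided setups. EMDM derives its weights from an optimization objective that the abstract does not spell out, so the fixed weight table `ILLUSTRATIVE_WEIGHTS` and the function `weighted_exact_match` are hypothetical stand-ins, not the paper's method.

```python
# Hypothetical weights keyed by
# (baseline correct when Unguided, baseline correct when Guided).
# These values are made up; EMDM obtains its weights by optimization.
ILLUSTRATIVE_WEIGHTS = {
    (True, True): 1.0,    # baseline succeeds without help: likely easy
    (True, False): 1.0,   # rare case; treated like an easy sample here
    (False, True): 2.0,   # baseline needs guidance: moderately hard
    (False, False): 4.0,  # baseline fails even with guidance: likely hard
}

def weighted_exact_match(baseline_outcomes, model_correct,
                         weights=ILLUSTRATIVE_WEIGHTS):
    """Weighted share of samples the evaluated model answered correctly.

    baseline_outcomes: per-sample (unguided_correct, guided_correct) pairs.
    model_correct:     per-sample booleans for the model being evaluated.
    """
    total = sum(weights[outcome] for outcome in baseline_outcomes)
    earned = sum(
        weights[outcome]
        for outcome, is_correct in zip(baseline_outcomes, model_correct)
        if is_correct
    )
    return earned / total if total else 0.0

baseline_outcomes = [(True, True), (False, True), (False, False)]
model_correct = [True, True, False]
score = weighted_exact_match(baseline_outcomes, model_correct)
plain_em = sum(model_correct) / len(model_correct)
```

In this toy case the model misses only the hardest sample, so its weighted score (3/7) falls below its plain exact match (2/3). Weighting hard samples more heavily in this way is what spreads out the scores of models that plain exact match would rate as nearly identical.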