Shashank Mujumdar
2025
Group, Embed and Reason: A Hybrid LLM and Embedding Framework for Semantic Attribute Alignment
Shramona Chakraborty | Shashank Mujumdar | Nitin Gupta | Sameep Mehta | Ronen Kat | Itay Etelis | Mohamed Mahameed | Itai Guez | Rachel Tzoref-Brill
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
In enterprise systems, tasks like API integration, ETL pipeline creation, customer record merging, and data consolidation rely on accurately aligning attributes that refer to the same real-world concept but differ across schemas. This semantic attribute alignment is critical for enabling schema unification, reporting, and analytics. The challenge is amplified in schema-only settings, where no instance data is available, by ambiguous names, inconsistent descriptions, and varied naming conventions. We propose a hybrid, unsupervised framework that combines the contextual reasoning of Large Language Models (LLMs) with the stability of embedding-based similarity and schema grouping to address token limitations and hallucinations. Our method operates solely on metadata and scales to large schemas by grouping attributes and refining LLM outputs through embedding-based enhancement, justification filtering, and ranking. Experiments on real-world healthcare schemas show strong performance, highlighting the effectiveness of the framework in privacy-constrained scenarios.
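As a rough illustration of the grouping and similarity-ranking idea in the abstract above, the following minimal sketch groups attributes by table and ranks target attributes against a source attribute by cosine similarity. It is not the authors' implementation: the function names (`trigram_vector`, `rank_candidates`, `group_by_table`) are hypothetical, and character-trigram counts are used only as a toy stand-in for a real embedding model. The LLM reasoning and justification-filtering stages are not shown.

```python
from collections import Counter
from math import sqrt


def trigram_vector(text):
    """Toy stand-in for an embedding: counts of character trigrams (assumption)."""
    s = f"  {text.lower().replace('_', ' ')} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_by_table(attrs):
    """Group (table, attribute) pairs so each group can be processed separately."""
    groups = {}
    for table, name in attrs:
        groups.setdefault(table, []).append(name)
    return groups


def rank_candidates(source_attr, target_attrs, top_k=3):
    """Return the top_k target attributes, most similar first."""
    src = trigram_vector(source_attr)
    scored = [(t, cosine(src, trigram_vector(t))) for t in target_attrs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


if __name__ == "__main__":
    targets = ["date_of_birth", "invoice_total", "patient_id"]
    for name, score in rank_candidates("patient_birth_date", targets):
        print(f"{name}: {score:.3f}")
```

With purely lexical vectors like these, `patient_id` and `date_of_birth` score almost identically against `patient_birth_date`, which hints at why a name-only similarity signal benefits from additional contextual reasoning.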
Mind the Query: A Benchmark Dataset towards Text2Cypher Task
Vashu Chauhan | Shobhit Raj | Shashank Mujumdar | Avirup Saha | Anannay Jain
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
We present a high-quality, multi-domain dataset for the Text2Cypher task, which translates natural language (NL) questions into executable Cypher queries over graph databases. The dataset comprises 27,529 NL queries and corresponding Cypher queries spanning 11 real-world graph datasets, each accompanied by its graph database for grounded query execution. To ensure correctness, the queries are validated through a rigorous pipeline combining automated schema, runtime, and value checks, along with manual review for logical correctness. Queries are further categorized by complexity to support fine-grained evaluation. We have released our benchmark dataset and the code to replicate our data synthesis pipeline on new graph datasets, supporting extensibility and future research on Text2Cypher.
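To make the automated schema check mentioned in the abstract above more concrete, here is a minimal sketch of one way such a check could look. It is an assumption, not the released pipeline: `schema_violations` is a hypothetical helper, the regular expressions cover only simple single-label patterns such as `(p:Person)` and `[:ACTED_IN]`, and the runtime and value checks (which need a live graph database) are not shown.

```python
import re

# Simplified patterns (assumption): one label per node, one type per relationship.
NODE_LABEL = re.compile(r"\(\s*\w*\s*:\s*(\w+)")
REL_TYPE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")


def schema_violations(query, node_labels, rel_types):
    """Return sorted names used in the query that the schema does not define."""
    unknown_labels = set(NODE_LABEL.findall(query)) - set(node_labels)
    unknown_rels = set(REL_TYPE.findall(query)) - set(rel_types)
    return sorted(unknown_labels | unknown_rels)


if __name__ == "__main__":
    labels = {"Person", "Movie"}
    rels = {"ACTED_IN"}
    ok = "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name"
    bad = "MATCH (d:Director)-[:DIRECTED]->(m:Movie) RETURN d.name"
    print(schema_violations(ok, labels, rels))
    print(schema_violations(bad, labels, rels))
```

A production check would more likely rely on a proper Cypher parser than on regular expressions, since patterns with multiple labels or variable-length relationships are not handled here.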