Group, Embed and Reason: A Hybrid LLM and Embedding Framework for Semantic Attribute Alignment

Shramona Chakraborty, Shashank Mujumdar, Nitin Gupta, Sameep Mehta, Ronen Kat, Itay Etelis, Mohamed Mahameed, Itai Guez, Rachel Tzoref-Brill


Abstract
In enterprise systems, tasks like API integration, ETL pipeline creation, customer record merging, and data consolidation rely on accurately aligning attributes that refer to the same real-world concept but differ across schemas. This semantic attribute alignment is critical for enabling schema unification, reporting, and analytics. The challenge is amplified in schema-only settings, where no instance data is available and alignment must cope with ambiguous names, inconsistent descriptions, and varied naming conventions. We propose a hybrid, unsupervised framework that combines the contextual reasoning of Large Language Models (LLMs) with the stability of embedding-based similarity and schema grouping to address token limitations and hallucinations. Our method operates solely on metadata and scales to large schemas by grouping attributes and refining LLM outputs through embedding-based enhancement, justification filtering, and ranking. Experiments on real-world healthcare schemas show strong performance, highlighting the effectiveness of the framework in privacy-constrained scenarios.
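The grouping and embedding-similarity steps mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the grouping criterion, the embedding model, or the LLM prompts, so everything below is an assumption made for illustration. The function names (`embed`, `cosine`, `group_by_table`, `shortlist`) are hypothetical, grouping by table name is an assumed criterion, and a character-trigram count vector stands in for a real embedding model. The sketch only shows how metadata alone (names and descriptions) could be used to shortlist a few candidate targets per source attribute, so that a later LLM step would see a small group rather than the whole schema.

```python
import math
from collections import Counter, defaultdict


def embed(text, n=3):
    """Toy stand-in for an embedding model: character n-gram counts.

    Assumption: a real system would call a sentence-embedding model here.
    """
    s = f" {text.lower().replace('_', ' ')} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_by_table(attributes):
    """Group (table, name, description) triples by table name.

    Assumption: table membership is used as the grouping criterion.
    """
    groups = defaultdict(list)
    for table, name, description in attributes:
        groups[table].append((name, description))
    return dict(groups)


def shortlist(source_attr, target_group, top_k=2):
    """Rank the targets in one group against a source (name, description).

    Returns up to top_k (score, target_name) pairs, best first.
    """
    src_vec = embed(" ".join(source_attr))
    scored = [
        (cosine(src_vec, embed(" ".join(target))), target[0])
        for target in target_group
    ]
    scored.sort(reverse=True)
    return scored[:top_k]


# Hypothetical metadata-only schemas; no instance data is involved.
target_schema = [
    ("patient", "birth_date", "date of birth of the patient"),
    ("patient", "gender", "administrative gender"),
    ("visit", "admit_time", "time of admission"),
]
groups = group_by_table(target_schema)
candidates = shortlist(("dob", "patient date of birth"), groups["patient"])
```

In a pipeline of the kind the abstract describes, `candidates` would then be passed to an LLM for reasoning, with the embedding scores available for the later filtering and ranking stages. How those stages are actually combined is not stated in the abstract.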
Anthology ID:
2025.emnlp-industry.120
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
November
Year:
2025
Address:
Suzhou (China)
Editors:
Saloni Potdar, Lina Rojas-Barahona, Sebastien Montella
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1703–1710
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.120/
Cite (ACL):
Shramona Chakraborty, Shashank Mujumdar, Nitin Gupta, Sameep Mehta, Ronen Kat, Itay Etelis, Mohamed Mahameed, Itai Guez, and Rachel Tzoref-Brill. 2025. Group, Embed and Reason: A Hybrid LLM and Embedding Framework for Semantic Attribute Alignment. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1703–1710, Suzhou (China). Association for Computational Linguistics.
Cite (Informal):
Group, Embed and Reason: A Hybrid LLM and Embedding Framework for Semantic Attribute Alignment (Chakraborty et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-industry.120.pdf