Naïm Es-Sebbani

Also published as: Naïm Es-sebbani


2026

Learning faithful embeddings for long documents remains challenging, especially in domains like law and medicine where inputs are long, structured, and semantically heterogeneous. We introduce the Chunk Prediction Encoder (CPE), a self-supervised framework that treats chunk–context compatibility as an unsupervised NLI problem. Given a document, CPE masks a chunk and learns (i) a contrastive objective that aligns the masked document with its held-out chunk against in-batch negatives, and (ii) a binary entailment head that predicts whether a candidate chunk belongs to the document. This joint objective encourages both geometric smoothness and directional semantic consistency, yielding robust document-level embeddings. We evaluate CPE with hierarchical and sparse-attention backbones on five benchmarks spanning legal and biomedical domains under frozen-embedding and end-to-end fine-tuning protocols. CPE consistently outperforms baselines, and is more compute-efficient than prompt-only LLM baselines under matched token budgets. Ablations demonstrate the effect of chunk length, the contrastive-vs-entailment balance, and skimming strategies.

2025

Modeling relationships between concepts and entities is essential for many applications. While Large Language Models (LLMs) capture relational and commonsense knowledge effectively, they are computationally expensive and often underperform in tasks requiring efficient relational encoding, such as relation induction, extraction, and information retrieval. Despite advancements in learning relational embeddings, existing methods often fail to capture nuanced representations and the rich semantics needed for high-quality embeddings. In this work, we propose different relational encoders designed to capture diverse relational aspects and semantic properties of entity pairs. Although several datasets exist for training such encoders, they often rely on structured knowledge bases or predefined schemas, which primarily encode simple and static relations. To overcome this limitation, we also introduce a novel dataset generation method leveraging LLMs to create a diverse spectrum of relationships. Our experiments demonstrate the effectiveness of our proposed encoders and the benefits of our generated dataset.