SGG-R³: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation

Jiaye Feng, Qixiang Yin, Yuankun Liu, Tong Mo, Weiping Li


Abstract
Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by a lack of task-specific structured reasoning and by sparse, long-tailed relation distributions, which lead to incomplete scene graphs with low recall and biased predictions. To address these issues, we introduce SGG-R³, a structured reasoning framework that combines task-specific Chain-of-Thought (CoT)-guided supervised fine-tuning (SFT) with reinforcement learning (RL) based on group sequence policy optimization (GSPO), and that reasons in three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy that leverages an MLLM and is refined via embedding similarity filtering to alleviate relation sparsity. Subsequently, a stage-aligned reward scheme optimizes the procedural reasoning during RL. Specifically, we propose a novel dual-granularity reward that integrates fine-grained and coarse-grained relation rewards, mitigating the long-tail issue via frequency-based adaptive weighting of predicates while improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R³ outperforms existing methods, demonstrating the effectiveness and generalization of the framework.
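As a rough illustration of the dual-granularity reward idea mentioned in the abstract, the following Python sketch combines a fine-grained term (exact triplet matches, weighted so that rare predicates count more) with a coarse-grained term (matches at the level of predicate clusters). Everything specific here is an assumption for illustration and is not taken from the paper: the inverse-log-frequency weighting, the mixing coefficient `alpha`, the hand-written `cluster_of` mapping, the pairing of weighting with the fine term and clustering with the coarse term, and the function names `predicate_weights` and `dual_granularity_reward`.

```python
import math
from collections import Counter


def predicate_weights(train_predicates):
    """Assumed weighting: inverse log-frequency, rescaled to a mean of 1."""
    counts = Counter(train_predicates)
    raw = {p: 1.0 / math.log(c + math.e) for p, c in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {p: w / mean for p, w in raw.items()}


def dual_granularity_reward(pred, gold, weights, cluster_of, alpha=0.5):
    """Score predicted triplets against gold triplets.

    pred, gold: sets of (subject, predicate, object) tuples.
    cluster_of: hypothetical predicate -> semantic cluster mapping.
    """
    if not gold:
        return 0.0
    # Fine-grained term: exact triplet matches, weighted by predicate rarity.
    total = sum(weights.get(p, 1.0) for _, p, _ in gold)
    hit = sum(weights.get(p, 1.0) for (s, p, o) in gold if (s, p, o) in pred)
    fine = hit / total
    # Coarse-grained term: same subject/object and same predicate cluster.
    # Predicates without a cluster fall back to themselves.
    pred_coarse = {(s, cluster_of.get(p, p), o) for s, p, o in pred}
    coarse = sum(
        (s, cluster_of.get(p, p), o) in pred_coarse for s, p, o in gold
    ) / len(gold)
    return alpha * fine + (1.0 - alpha) * coarse


# Toy data (hypothetical): "on" is frequent, "riding" is rare.
weights = predicate_weights(["on"] * 8 + ["riding"] * 2)
cluster_of = {"riding": "action", "sitting on": "action", "on": "spatial"}
gold = {("man", "riding", "horse"), ("horse", "on", "grass")}
pred = {("man", "sitting on", "horse"), ("horse", "on", "grass")}
reward = dual_granularity_reward(pred, gold, weights, cluster_of)
```

In this toy case the prediction misses the rare predicate "riding" at the fine-grained level but still earns coarse-grained credit because "sitting on" falls in the same assumed cluster, so the reward lies between that of a fully wrong and a fully correct prediction.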
Anthology ID:
2026.findings-acl.992
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
19811–19830
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.992/
Cite (ACL):
Jiaye Feng, Qixiang Yin, Yuankun Liu, Tong Mo, and Weiping Li. 2026. SGG-R³: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 19811–19830, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
SGG-R³: From Next-Token Prediction to End-to-End Unbiased Scene Graph Generation (Feng et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.992.pdf
Checklist:
 2026.findings-acl.992.checklist.pdf