Michael Katz


2026

For large language models (LLMs), reasoning over graphs can help solve many problems. Prior work has tried to improve LLM graph reasoning through different training methods, but the merits of such approaches remain unclear and the limitations of each approach with respect to generalizability of reasoning are often not thoroughly explored. In this paper we systematically compare the ability of LLMs to learn fundamental graph tasks across a variety of training methods and their ability to generalize out of distribution across various dimensions. We highlight key tradeoffs between training methods, e.g., training specialized graph encoders and fusing their embeddings with LLMs consistently collapses in terms of generalizability; however, no single method shows clear superiority across all dimensions of generalizability, regardless of the size of the model.