Zhe Pan


2026

In response to the increasing demand for largescale machine learning training jobs, many organizations have deployed GPU clusters across geographically distributed regions. However, existing ILP- or genetic-based cross-cluster training approaches largely overlook the topology of decentralized clusters, lacking both topologyaware task scheduling mechanisms and automated model parallelization strategies. As a result, naively applying these optimization-based methods in cross-cluster settings leads to prohibitive scheduling overhead, due to the drastically enlarged search space induced by complex inter-cluster topologies. To address these challenges, we propose SpiderFlow, a topologyaware scheduling system specifically designed for decentralized GPU clusters. We formulate cross-cluster task scheduling as a graph optimization problem and introduce SpinSearch, a low-overhead topology-aware scheduling algorithm. In addition, for automated model parallelization, we propose TPA, a two-level scheduling framework that combines heuristic methods at the inter-cluster level with ILP-based optimization within clusters, effectively reducing the search space while maintaining high training throughput with substantially lower scheduling overhead. We evaluate SpiderFlow on a physical platform comprising 8 decentralized clusters, as well as on a simulation platform with up to 64 decentralized clusters. Experimental results demonstrate that SpiderFlow reduces job completion time (JCT) by 1.2-1.3×, improves throughput by 1.12-1.25×, and reduces scheduling overhead by 20-90× on average compared to state-of-the-art scheduling systems.

2021

Knowledge graph embedding (KGE) using low-dimensional representations to predict missing information is widely applied in knowledge completion. Existing embedding methods are mostly built on Euclidean space, which are difficult to handle hierarchical structures. Hyperbolic embedding methods have shown the promise of high fidelity and concise representation for hierarchical data. However, the logical patterns in knowledge graphs are not considered well in these methods. To address this problem, we propose a novel KGE model with extended Poincaré Ball and polar coordinate system to capture hierarchical structures. We use the tangent space and exponential transformation to initialize and map the corresponding vectors to the Poincaré Ball in hyperbolic space. To solve the boundary conditions, the boundary is stretched and zoomed by expanding the modulus length in the Poincaré Ball. We optimize our model using polar coordinate and changing operators in the extended Poincaré Ball. Experiments achieve new state-of-the-art results on part of link prediction tasks, which demonstrates the effectiveness of our method.