Bryan Dai

2026

Although the Universal Transformer (UT) mitigates the diminishing returns of standard LLM scaling by decoupling parameter count from depth, it remains constrained by linear computational costs and rigid weight-sharing mechanisms. These limitations lead to severe functional homogeneity, which subsequently induces over-smoothing, representation rank collapse, and degraded reasoning performance. In this work, we present the first systematic study of Compute Distribution Skew, identifying it as the primary driver of extrapolation failure. This is a pathological phenomenon in ultra-deep recurrent Transformers characterized by a disproportionate distribution of contributions across recurrent steps, resulting in distinct functional states during prefix and suffix processing phases. To address this challenge, we propose the Polymorphic Transformer, which aims to achieve functional polymorphism and depth sparsity within a shared-parameter framework. By integrating conditional sparse subspaces, SiLU Attention, and an uncertainty-aware depth scheduler, our architecture mitigates power-method collapse and effectively decouples logical depth from computational cost. Experiments demonstrate that our model significantly enhances representation rank and robustness, achieving complex reasoning performance comparable to baseline while reducing computation by 64.7%.

pdf bib abs

Large language models (LLMs) are powerful but costly to train, with scaling laws predicting performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance prediction. Existing works focus on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. Therefore, it is first necessary to investigate the scaling laws of different PLs, and then consider their mutual influences to arrive at the final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting over 1000+ experiments (Equivalent to 336,000+ H800 hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (1T tokens). We establish scaling laws for code LLMs across multiple programming languages, showing that interpreted languages benefit more from increased scale than compiled ones. Multilingual pre-training provides synergistic benefits, especially between syntactically similar languages, with parallel pairing (concatenating code with translations) significantly enhancing cross-lingual abilities. We propose a proportion-dependent multilingual scaling law that optimally allocates training tokens by prioritizing high-utility languages (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing allocation to fast-saturating languages (e.g., Rust), achieving superior performance across all languages compared to uniform distribution.

pdf bib abs

While large language models show promise in medical applications, achieving expert-level clinical reasoning efficiently remains challenging due to the need for massive amounts of manually labeled data and large-scale models. To address this challenge, we propose Clinical-Oriented Reinforcement Learning (CORL), the first fully open-source, end-to-end reinforcement learning training pipeline in the clinical reasoning domain, incorporating a Reasoning-Oriented Data Strategy (RODS) based on topological synthesis, CoT cold-start, and two-stage reinforcement learning. Through CORL, we trained the Fleming-R1 series of models. Among them, Fleming-R1-7B significantly outperforms models of comparable size while approaching or even surpassing certain 32B and 72B models. Fleming-R1-32B achieves near-parity with GPT-4o and outperforms the strongest open-source alternatives up to 671B in MedXpertQA. This demonstrates that in clinical reasoning field, a meticulously designed training pipeline holds greater importance than scaling model size alone. Data and Models are available at https://github.com/UbiquantAI/Fleming-R1 and https://huggingface.co/collections/IQuestLab/fleming.

pdf bib abs

Agents based on large language models have recently shown strong potential on real-world software engineering (SWE) tasks that require long-horizon interaction with repository-scale codebases. However, most existing agents rely on append-only context maintenance or passively triggered compression heuristics, which often lead to context explosion, semantic drift, and degraded reasoning in long-running interactions. We propose Cat, a new context management paradigm that elevates context maintenance to a callable tool integrated into the decision-making process of agents. Cat formalizes a structured context workspace consisting of stable task semantics, condensed long-term memory, and high-fidelity short-term interactions, and enables agents to proactively compress historical trajectories into actionable summaries at appropriate milestones. To support context management for SWE-agents, we propose a trajectory-level supervision framework, CaT-Generator, based on an offline data construction pipeline that injects context-management actions into complete interaction trajectories. Using this framework, we train a context-aware model, SWE-Compressor. Experiments on SWE-Bench-Verified demonstrate that SWE-Compressor reaches a 57.6% solved rate and significantly outperforms ReAct-based agents and static compression baselines, while maintaining stable and scalable long-horizon reasoning under a bounded context budget.

While large language models (LLMs) have mastered syntax-level code generation, complex algorithmic reasoning remains a challenge, typically addressed by scaling model depth and parameter count. Universal Transformers (UT) offer a compelling alternative by introducing a recurrent inductive bias that aligns with the recursive nature of programming logic. However, training looped architectures at scale has historically been hindered by severe instability and optimization difficulties associated with backpropagation through time (BPTT). We present LoopCoder (40B-A80B) pre-trained on 12T+ code and general tokens, along with LoopCoder-Thinking and LoopCoder-Instruct variants—the first large-scale looped transformer for code, achieving comparable performance to standard dense architectures with more parameters. Unlike prior approaches that restrict recurrence to small-scale tasks, we implement a comprehensive looped training protocol spanning both pre-training and post-training phases. We initiate the model via dense-to-loop transformation, folding a pre-trained dense checkpoint to initialize a recurrent block, followed by rigorous looped pre-training and specialized post-training for instruction following and reasoning. Our results establish a robust recipe for scaling coding intelligence via recurrent computation, proving that dense checkpoints serve as an optimal foundation for evolving into dynamic, looped reasoners.