Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Jingdi Lei, Di Zhang, Soujanya Poria
Abstract
Linear-time attention and State Space Models (SSMs) promise to remove the quadratic-cost bottleneck of softmax attention in long-context language models. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution, which effectively corresponds to the infinite-order limit of the Runge-Kutta method. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance compared to DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.
One-sentence Summary
Jingdi Lei, Di Zhang, and Soujanya Poria from Nanyang Technological University and Fudan University propose Error-Free Linear Attention (EFLA), a theoretically exact, numerically stable linear-time attention mechanism derived from solving the continuous-time ODE governing linear attention via an infinite-order Runge–Kutta method. By exploiting the rank-1 structure of the dynamics matrix, EFLA achieves closed-form, error-free updates with full parallelism and linear complexity, outperforming DeltaNet in robustness to noise and downstream tasks while eliminating discretization errors inherent in prior Euler-based approximations.
Key Contributions
- Existing linear attention methods suffer from inherent numerical instability and truncation errors due to their reliance on first-order Euler discretization of continuous-time dynamics, which limits their accuracy and robustness in long-context scenarios despite their computational efficiency.
- The paper reformulates linear attention as a continuous-time dynamical system governed by a first-order ODE, revealing that the standard approach corresponds to a low-order numerical integration scheme that fails to capture the true evolution of the state.
- By exploiting the rank-1 structure of the dynamics matrix, the authors derive an exact closed-form solution equivalent to the infinite-order Runge–Kutta limit, achieving error-free integration with linear time complexity, full parallelism, and consistent performance gains over DeltaNet and other baselines (the rank-1 identity is illustrated in the sketch after this list).
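The rank-1 identity underlying this closed form is easy to check numerically. The following minimal sketch (not from the authors' codebase; variable names are illustrative) verifies that for $A = kk^\top$ with $\lambda = k^\top k$, the matrix exponential $e^{-\beta A}$ equals $I - \frac{1-e^{-\beta\lambda}}{\lambda}A$:

```python
import torch

# Illustrative check (not the paper's code): for a rank-1 matrix A = k k^T with
# lambda = k^T k, powers satisfy A^n = lambda^(n-1) A, so the exponential series
# collapses to exp(-beta * A) = I - (1 - exp(-beta * lambda)) / lambda * A.
torch.manual_seed(0)
d, beta = 8, 0.7
k = torch.randn(d, 1, dtype=torch.float64)
A = k @ k.T                                   # rank-1 dynamics matrix
lam = (k.T @ k).squeeze()                     # lambda = k^T k (scalar)

reference = torch.linalg.matrix_exp(-beta * A)                  # generic matrix exponential
closed_form = torch.eye(d, dtype=torch.float64) \
    - (1 - torch.exp(-beta * lam)) / lam * A                    # rank-1 closed form

print(torch.allclose(reference, closed_form, atol=1e-10))       # True
```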
Introduction
The authors are motivated by the growing role of large language models as autonomous agents in complex, long-context tasks, such as reasoning and tool use, where standard attention mechanisms become computationally prohibitive due to their quadratic time complexity. Prior linear attention methods, while efficient, rely on low-order numerical approximations such as Euler integration to solve the underlying continuous-time dynamics, introducing truncation errors and instability, especially under long sequences or high decay rates. These approximations are inherently limited, and heuristic fixes like gating or adaptive coefficients only mitigate symptoms rather than eliminating the root cause. The authors' main contribution is EFLA, a principled reformulation of linear attention as a continuous-time dynamical system governed by a first-order ODE. By exploiting the rank-1 structure of the system, they derive an exact closed-form solution equivalent to the infinite-order Runge–Kutta limit, achieving error-free integration while preserving linear time complexity. This approach not only ensures numerical stability and robustness in noisy settings but also outperforms existing methods like DeltaNet across benchmarks, offering a theoretically sound and practically scalable foundation for high-fidelity attention.

Method
The authors leverage a continuous-time dynamical systems perspective to derive an exact, error-free solution for linear attention, addressing the numerical instability and error accumulation inherent in low-order discretization schemes. The core insight is to model the online learning update of the associative memory state $S_t$ as a first-order ordinary differential equation (ODE). This ODE is defined as $\frac{dS(t)}{dt} = -A_t S(t) + b_t$, where the dynamics matrix $A_t = k_t k_t^\top$ and the forcing term $b_t = k_t v_t^\top$ are derived from the key and value vectors at time $t$. This formulation generalizes the delta rule update, which corresponds to a first-order explicit Euler discretization of this ODE. By recognizing that the dynamics matrix $A_t$ is rank-1, the authors exploit its algebraic properties to compute the exact analytical solution of the ODE. This solution, which corresponds to the infinite-order limit of the Runge-Kutta family of methods, is given by $S_t = e^{-\beta_t A_t} S_{t-1} + \int_0^{\beta_t} e^{-(\beta_t - \tau) A_t} b_t \, d\tau$. The rank-1 structure allows the matrix exponential $e^{-\beta_t A_t}$ to be computed in closed form as $I - \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} A_t$, where $\lambda_t = k_t^\top k_t$. Similarly, the integral term simplifies to $\frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} b_t$. Substituting these closed-form expressions yields the final update rule for the Error-Free Linear Attention (EFLA) mechanism. This update maintains linear time complexity with respect to sequence length, enabling efficient computation while capturing the exact continuous dynamics.
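For completeness, a short derivation of these closed forms, using only the rank-1 identities $A_t^n = \lambda_t^{n-1} A_t$ for $n \geq 1$ and $A_t b_t = \lambda_t b_t$ (which follow from $A_t = k_t k_t^\top$ and $b_t = k_t v_t^\top$):

$$
e^{-\beta_t A_t} = I + \sum_{n \geq 1} \frac{(-\beta_t)^n}{n!} A_t^n
= I + \frac{A_t}{\lambda_t} \sum_{n \geq 1} \frac{(-\beta_t \lambda_t)^n}{n!}
= I - \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} A_t,
$$

$$
\int_0^{\beta_t} e^{-(\beta_t - \tau) A_t} b_t \, d\tau
= \int_0^{\beta_t} e^{-(\beta_t - \tau) \lambda_t} \, d\tau \; b_t
= \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} b_t.
$$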
[[IMG:|The framework diagram illustrates the continuous-time dynamical system formulation of linear attention, showing the state $S(t)$ evolving according to the ODE $\frac{dS(t)}{dt} = -A_t S(t) + b_t$. The figure highlights the transition from the discrete delta rule update to the continuous-time model, emphasizing the role of the dynamics matrix $A_t$ and the forcing term $b_t$. The exact solution of this ODE, derived using the rank-1 structure of $A_t$, is the foundation of the EFLA mechanism.]]
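Substituting the closed forms into the exact solution shows that the state update collapses to a delta-rule-style step with a saturating effective learning rate $\gamma_t = \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t}$, namely $S_t = S_{t-1} + \gamma_t k_t (v_t^\top - k_t^\top S_{t-1})$. The sketch below implements this recurrence sequentially for illustration only; the paper's actual implementation is chunkwise-parallel, and the readout convention $o_t = q_t^\top S_t$ is an assumption.

```python
import torch

def efla_recurrent(q, k, v, beta):
    """Minimal sequential sketch of the EFLA state update (illustrative names; the
    paper's chunkwise-parallel implementation is not reproduced here).

    q, k: (T, d_k); v: (T, d_v); beta: (T,) step sizes.
    The state S_t in R^{d_k x d_v} follows the exact ODE solution, which reduces to
        S_t = S_{t-1} + gamma_t * k_t (v_t^T - k_t^T S_{t-1}),
    with lambda_t = k_t^T k_t and gamma_t = (1 - exp(-beta_t * lambda_t)) / lambda_t.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v, dtype=q.dtype)
    outputs = []
    for t in range(T):
        k_t, v_t, q_t = k[t:t + 1], v[t:t + 1], q[t:t + 1]   # (1, d_k), (1, d_v), (1, d_k)
        lam = (k_t @ k_t.T).clamp_min(1e-12)                  # lambda_t = k_t^T k_t
        gamma = (1 - torch.exp(-beta[t] * lam)) / lam         # exact, saturating step size
        S = S + gamma * k_t.T @ (v_t - k_t @ S)               # error-free rank-1 update
        outputs.append(q_t @ S)                               # assumed readout o_t = q_t^T S_t
    return torch.cat(outputs, dim=0)                          # (T, d_v)

# Example usage with random inputs
T, d_k, d_v = 16, 8, 8
o = efla_recurrent(torch.randn(T, d_k), torch.randn(T, d_k),
                   torch.randn(T, d_v), torch.rand(T))
print(o.shape)  # torch.Size([16, 8])
```

Note that $\gamma_t$ grows with $\beta_t$ but saturates at $1/\lambda_t$, so the effective step size cannot overshoot; this is the saturation behavior referenced in the robustness experiments below.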
Experiment
- Numerical Stability and Robustness Verification: On sMNIST with pixel dropout, OOD intensity scaling, and additive Gaussian noise, EFLA outperforms DeltaNet in convergence speed and robustness, maintaining high accuracy under severe interference. EFLA achieves significantly better performance at high input scales and with larger learning rates, validating that its exact saturation mechanism mitigates error accumulation and state explosion.
- Language Modeling: On Wikitext and zero-shot reasoning tasks (LAMBADA, PiQA, HellaSwag, WinoGrande, ARC-e, ARC-c, BoolQ, OpenBookQA, SciQ), EFLA with 340M parameters achieves lower perplexity (37.01 vs. 38.09) and higher accuracy (23.9% vs. 22.5% on LAMBADA), with a +7.4% absolute improvement on BoolQ. At 1.3B parameters, EFLA maintains a performance lead even at 16B tokens, indicating superior long-sequence fidelity and scalability.
The authors train models at 340M and 1.3B parameters to compare EFLA against DeltaNet on language modeling and reasoning tasks, with results shown in Table 1. EFLA consistently outperforms DeltaNet across most metrics, achieving lower perplexity on Wikitext and LAMBADA and higher accuracy on multiple reasoning benchmarks, with the performance gap widening at larger model sizes.
