Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics
Jingdi Lei, Di Zhang, Soujanya Poria
Abstract
Linear-time attention and State Space Models (SSMs) promise to remove the quadratic-cost bottleneck of softmax attention in long-context language models. We introduce Error-Free Linear Attention (EFLA), a numerically stable, fully parallel, and generalized formulation of the delta rule. Specifically, we formulate the online learning update as a continuous-time dynamical system and prove that its exact solution is not only attainable but also computable in linear time with full parallelism. By leveraging the rank-1 structure of the dynamics matrix, we directly derive the exact closed-form solution, which effectively corresponds to the infinite-order limit of the Runge-Kutta method. This attention mechanism is theoretically free from error accumulation, perfectly capturing the continuous dynamics while preserving linear-time complexity. Through an extensive suite of experiments, we show that EFLA enables robust performance in noisy environments, achieving lower language modeling perplexity and superior downstream benchmark performance compared to DeltaNet without introducing additional parameters. Our work provides a new theoretical foundation for building high-fidelity, scalable linear-time attention models.
One-sentence Summary
Jingdi Lei, Di Zhang, and Soujanya Poria from Nanyang Technological University and Fudan University propose Error-Free Linear Attention (EFLA), a theoretically exact, numerically stable linear-time attention mechanism derived from solving the continuous-time ODE governing linear attention via an infinite-order Runge–Kutta method. By exploiting the rank-1 structure of the dynamics matrix, EFLA achieves closed-form, error-free updates with full parallelism and linear complexity, outperforming DeltaNet in robustness to noise and downstream tasks while eliminating discretization errors inherent in prior Euler-based approximations.
Key Contributions
- Existing linear attention methods suffer from inherent numerical instability and truncation errors due to their reliance on first-order Euler discretization of continuous-time dynamics, which limits their accuracy and robustness in long-context scenarios despite their computational efficiency.
- The paper reformulates linear attention as a continuous-time dynamical system governed by a first-order ODE, revealing that the standard approach corresponds to a low-order numerical integration scheme that fails to capture the true evolution of the state.
- By exploiting the rank-1 structure of the dynamics matrix, the authors derive an exact closed-form solution equivalent to the infinite-order Runge–Kutta limit, achieving error-free integration with linear time complexity, full parallelism, and consistent performance gains over DeltaNet and other baselines (the rank-1 identity is illustrated in the sketch after this list).
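The rank-1 identity underlying this closed form is easy to check numerically. The following minimal sketch (not from the authors' codebase; variable names are illustrative) verifies that for $A = kk^\top$ with $\lambda = k^\top k$, the matrix exponential $e^{-\beta A}$ equals $I - \frac{1-e^{-\beta\lambda}}{\lambda}A$:

```python
import torch

# Illustrative check (not the paper's code): for a rank-1 matrix A = k k^T with
# lambda = k^T k, powers satisfy A^n = lambda^(n-1) A, so the exponential series
# collapses to exp(-beta * A) = I - (1 - exp(-beta * lambda)) / lambda * A.
torch.manual_seed(0)
d, beta = 8, 0.7
k = torch.randn(d, 1, dtype=torch.float64)
A = k @ k.T                                   # rank-1 dynamics matrix
lam = (k.T @ k).squeeze()                     # lambda = k^T k (scalar)

reference = torch.linalg.matrix_exp(-beta * A)                  # generic matrix exponential
closed_form = torch.eye(d, dtype=torch.float64) \
    - (1 - torch.exp(-beta * lam)) / lam * A                    # rank-1 closed form

print(torch.allclose(reference, closed_form, atol=1e-10))       # True
```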
Introduction
The authors are motivated by the growing role of large language models as autonomous agents in complex, long-context tasks, such as reasoning and tool use, where standard attention mechanisms become computationally prohibitive due to their quadratic time complexity. Prior linear attention methods, while efficient, rely on low-order numerical approximations such as Euler integration to solve the underlying continuous-time dynamics, introducing truncation errors and instability, especially under long sequences or high decay rates. These approximations are inherently limited, and heuristic fixes like gating or adaptive coefficients only mitigate symptoms rather than eliminating the root cause. The authors' main contribution is EFLA, a principled reformulation of linear attention as a continuous-time dynamical system governed by a first-order ODE. By exploiting the rank-1 structure of the system, they derive an exact closed-form solution equivalent to the infinite-order Runge–Kutta limit, achieving error-free integration while preserving linear time complexity. This approach not only ensures numerical stability and robustness in noisy settings but also outperforms existing methods like DeltaNet across benchmarks, offering a theoretically sound and practically scalable foundation for high-fidelity attention.

Method
The authors leverage a continuous-time dynamical systems perspective to derive an exact, error-free solution for linear attention, addressing the numerical instability and error accumulation inherent in low-order discretization schemes. The core insight is to model the online learning update of the associative memory state $S_t$ as a first-order ordinary differential equation (ODE). This ODE is defined as $\frac{dS(t)}{dt} = -A_t S(t) + b_t$, where the dynamics matrix $A_t = k_t k_t^\top$ and the forcing term $b_t = k_t v_t^\top$ are derived from the key and value vectors at time $t$. This formulation generalizes the delta rule update, which corresponds to a first-order explicit Euler discretization of this ODE. By recognizing that the dynamics matrix $A_t$ is rank-1, the authors exploit its algebraic properties to compute the exact analytical solution of the ODE. This solution, which corresponds to the infinite-order limit of the Runge-Kutta family of methods, is given by $S_t = e^{-\beta_t A_t} S_{t-1} + \int_0^{\beta_t} e^{-(\beta_t - \tau) A_t} b_t \, d\tau$. The rank-1 structure allows the matrix exponential $e^{-\beta_t A_t}$ to be computed in closed form as $I - \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} A_t$, where $\lambda_t = k_t^\top k_t$. Similarly, the integral term simplifies to $\frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} b_t$. Substituting these closed-form expressions yields the final update rule for the Error-Free Linear Attention (EFLA) mechanism. This update maintains linear time complexity with respect to sequence length, enabling efficient computation while capturing the exact continuous dynamics.
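For completeness, a short derivation of these closed forms, using only the rank-1 identities $A_t^n = \lambda_t^{n-1} A_t$ for $n \geq 1$ and $A_t b_t = \lambda_t b_t$ (which follow from $A_t = k_t k_t^\top$ and $b_t = k_t v_t^\top$):

$$
e^{-\beta_t A_t} = I + \sum_{n \geq 1} \frac{(-\beta_t)^n}{n!} A_t^n
= I + \frac{A_t}{\lambda_t} \sum_{n \geq 1} \frac{(-\beta_t \lambda_t)^n}{n!}
= I - \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} A_t,
$$

$$
\int_0^{\beta_t} e^{-(\beta_t - \tau) A_t} b_t \, d\tau
= \int_0^{\beta_t} e^{-(\beta_t - \tau) \lambda_t} \, d\tau \; b_t
= \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t} b_t.
$$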
[[IMG:|The framework diagram illustrates the continuous-time dynamical system formulation of linear attention, showing the state $S(t)$ evolving according to the ODE $\frac{dS(t)}{dt} = -A_t S(t) + b_t$. The figure highlights the transition from the discrete delta rule update to the continuous-time model, emphasizing the role of the dynamics matrix $A_t$ and the forcing term $b_t$. The exact solution of this ODE, derived using the rank-1 structure of $A_t$, is the foundation of the EFLA mechanism.]]
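Substituting the closed forms into the exact solution shows that the state update collapses to a delta-rule-style step with a saturating effective learning rate $\gamma_t = \frac{1 - e^{-\beta_t \lambda_t}}{\lambda_t}$, namely $S_t = S_{t-1} + \gamma_t k_t (v_t^\top - k_t^\top S_{t-1})$. The sketch below implements this recurrence sequentially for illustration only; the paper's actual implementation is chunkwise-parallel, and the readout convention $o_t = q_t^\top S_t$ is an assumption.

```python
import torch

def efla_recurrent(q, k, v, beta):
    """Minimal sequential sketch of the EFLA state update (illustrative names; the
    paper's chunkwise-parallel implementation is not reproduced here).

    q, k: (T, d_k); v: (T, d_v); beta: (T,) step sizes.
    The state S_t in R^{d_k x d_v} follows the exact ODE solution, which reduces to
        S_t = S_{t-1} + gamma_t * k_t (v_t^T - k_t^T S_{t-1}),
    with lambda_t = k_t^T k_t and gamma_t = (1 - exp(-beta_t * lambda_t)) / lambda_t.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v, dtype=q.dtype)
    outputs = []
    for t in range(T):
        k_t, v_t, q_t = k[t:t + 1], v[t:t + 1], q[t:t + 1]   # (1, d_k), (1, d_v), (1, d_k)
        lam = (k_t @ k_t.T).clamp_min(1e-12)                  # lambda_t = k_t^T k_t
        gamma = (1 - torch.exp(-beta[t] * lam)) / lam         # exact, saturating step size
        S = S + gamma * k_t.T @ (v_t - k_t @ S)               # error-free rank-1 update
        outputs.append(q_t @ S)                               # assumed readout o_t = q_t^T S_t
    return torch.cat(outputs, dim=0)                          # (T, d_v)

# Example usage with random inputs
T, d_k, d_v = 16, 8, 8
o = efla_recurrent(torch.randn(T, d_k), torch.randn(T, d_k),
                   torch.randn(T, d_v), torch.rand(T))
print(o.shape)  # torch.Size([16, 8])
```

Note that $\gamma_t$ grows with $\beta_t$ but saturates at $1/\lambda_t$, so the effective step size cannot overshoot; this is the saturation behavior referenced in the robustness experiments below.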
Experiment
- Numerical Stability and Robustness Verification: On sMNIST with pixel dropout, OOD intensity scaling, and additive Gaussian noise, EFLA outperforms DeltaNet in convergence speed and robustness, maintaining high accuracy under severe interference. EFLA achieves significantly better performance at high input scales and with larger learning rates, validating that its exact saturation mechanism mitigates error accumulation and state explosion.
- Language Modeling: On Wikitext and zero-shot reasoning tasks (LAMBADA, PiQA, HellaSwag, WinoGrande, ARC-e, ARC-c, BoolQ, OpenBookQA, SciQ), EFLA with 340M parameters achieves lower perplexity (37.01 vs. 38.09) and higher accuracy (23.9% vs. 22.5% on LAMBADA), with a +7.4% absolute improvement on BoolQ. At 1.3B parameters, EFLA maintains a performance lead even at 16B tokens, indicating superior long-sequence fidelity and scalability.
The authors train models at 340M and 1.3B parameters to compare EFLA against DeltaNet on language modeling and reasoning tasks, with results shown in Table 1. EFLA consistently outperforms DeltaNet across most metrics, achieving lower perplexity on Wikitext and LAMBADA and higher accuracy on multiple reasoning benchmarks, with the performance gap widening at larger model sizes.
