Shattering the Memory Wall: O(1) Inference and Causal Monoid State Compression in Spartacus-1B

Community Article · Published February 25, 2026

Author: Zixi Li (Oz) / NoesisLab

[Figures: architecture diagram (ARCH); stress benchmark (benchmark_stress); accuracy and loss curves (ACC_SPAR, LOSS_SPAR)]

The generative AI landscape has been dominated by Transformer stacks and their reliance on Softmax Attention. While powerful, this paradigm carries a fatal flaw: the KV-Cache bottleneck. As context length grows, the memory and compute required to store and attend to all previous keys and values scale linearly in O(T), erecting a massive "Memory Wall" that cripples deployment efficiency.

At NoesisLab, we believe scaling intelligence should not mean endlessly scaling memory.

Today, we are thrilled to introduce Spartacus-1B-Instruct (1.3B parameters) — a foundational architecture that completely replaces Softmax Attention with Causal Monoid State Compression. Spartacus achieves true O(1) inference time and O(1) memory per token, decoupling sequence length from computational complexity.

The Core Engine: Monoid Recurrence

Instead of keeping a sprawling cache of every historical token, Spartacus compresses the entire causal prefix into a fixed-size state matrix $S_t \in \mathbb{R}^{d \times d}$ for each attention head.

We define the causal history through a strict mathematical monoid recurrence:

$$S_t = \mathrm{diag}(\alpha_t) \cdot S_{t-1} + k_t \otimes v_t$$

$$o_t = q_t \cdot S_t$$
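The recurrence and readout above can be sketched in a few lines of NumPy. This is an illustrative toy, not the model's actual implementation; the names `monoid_step` and `readout` are ours, and the random gates stand in for the learned ones:

```python
import numpy as np

def monoid_step(S, alpha, k, v):
    """One recurrence step: S_t = diag(alpha_t) @ S_{t-1} + outer(k_t, v_t)."""
    return alpha[:, None] * S + np.outer(k, v)

def readout(q, S):
    """o_t = q_t . S_t -- a single (d,) @ (d, d) mat-vec per head."""
    return q @ S

d = 4
S = np.zeros((d, d))                    # fixed-size state: O(d^2), never O(T)
rng = np.random.default_rng(0)
for _ in range(1_000):                  # generate 1,000 tokens; memory stays flat
    q, k, v = rng.normal(size=(3, d))
    alpha = 1.0 / (1.0 + np.exp(-rng.normal(size=d)))   # per-dim decay in (0, 1)
    S = monoid_step(S, alpha, k, v)
    o = readout(q, S)
```

Per generated token the cost is one elementwise scale of the $d \times d$ state, one outer product, and one mat-vec, regardless of whether it is token 10 or token 100,000.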

The technical magic lies in the associativity of the monoid operator: it lets us completely transform how the model operates across training and inference.

  • Training (Parallel Prefix Scan): We bypass the sequential curse of traditional RNNs. Using our custom Triton-accelerated JIT kernels (monoid_scan_cuda), Spartacus computes all prefix states simultaneously. This yields O(T) training efficiency, fully saturating GPU memory bandwidth.
  • Inference (True O(1) Sequential Updates): During generation, the model executes a single monoid_op step. It folds the new token's outer product into the existing matrix and reads it out via a single matrix multiplication. Whether you are generating the 10th token or the 100,000th token, the memory footprint and latency remain absolutely constant.
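The parallel-scan claim rests entirely on associativity: each token's update can be packed as a pair (decay vector, write matrix), and two such pairs compose into one regardless of how the sequence is grouped. Below is a minimal NumPy sketch of that composition rule, standing in for the `monoid_scan_cuda` Triton kernel (which this post does not show); the function names are ours:

```python
import numpy as np

def combine(u1, u2):
    """Associative monoid op: apply update u1, then u2.
    Each u = (a, B) represents the map S -> diag(a) @ S + B."""
    a1, B1 = u1
    a2, B2 = u2
    return (a2 * a1, a2[:, None] * B1 + B2)

def sequential(updates, S0):
    """Reference: apply the T updates one by one."""
    S = S0
    for a, B in updates:
        S = a[:, None] * S + B
    return S

def tree_reduce(updates):
    """Pairwise reduction, as a parallel scan level would do it.
    The grouping differs from the sequential order, but associativity
    guarantees the same total update."""
    while len(updates) > 1:
        nxt = [combine(updates[i], updates[i + 1])
               for i in range(0, len(updates) - 1, 2)]
        if len(updates) % 2:
            nxt.append(updates[-1])
        updates = nxt
    return updates[0]

d, T = 4, 8
rng = np.random.default_rng(1)
ups = [(1.0 / (1.0 + np.exp(-rng.normal(size=d))),   # sigmoid decay gate
        np.outer(*rng.normal(size=(2, d))))          # k (outer) v write
       for _ in range(T)]
S0 = rng.normal(size=(d, d))

a_tot, B_tot = tree_reduce(list(ups))
assert np.allclose(sequential(ups, S0), a_tot[:, None] * S0 + B_tot)
```

Because the pairwise combines at each tree level are independent, a GPU kernel can evaluate them in parallel, turning the sequential recurrence into an O(T) scan with O(log T) depth.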

Explicit Causality & Vector Decay

In standard Transformer stacks, causality is a hack: it is enforced artificially through lower-triangular attention masks, while positional information is injected via RoPE.

Spartacus discards both RoPE and attention masks. Instead, causality is elevated to a first-class citizen, explicitly modeled through learned, content-dependent Vector Decay Gates. Each dimension of the state matrix possesses an independent memory lifetime governed by a sigmoid activation.

  • Fast-decaying dimensions naturally learn to track local syntax and punctuation.
  • Slow-decaying dimensions act as a robust global memory for entities, facts, and long-range logic.

When the model encounters a PAD token, the architecture gracefully assigns it the monoid identity element (unit decay, zero write), rendering it completely invisible to the state recurrence.
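The PAD-as-identity behavior follows directly from the recurrence: a decay of 1 combined with a zero outer product leaves $S$ unchanged. A toy check, where the `decay_gate` weight `W` is a hypothetical stand-in for the learned gate projection:

```python
import numpy as np

def decay_gate(x, W):
    """Content-dependent per-dimension decay alpha_t = sigmoid(W @ x_t).
    Values near 1 -> slow-decaying global memory; near 0 -> fast local decay."""
    return 1.0 / (1.0 + np.exp(-(W @ x)))

def step(S, alpha, k, v):
    return alpha[:, None] * S + np.outer(k, v)

d = 4
rng = np.random.default_rng(2)
S = rng.normal(size=(d, d))

# A real token writes into the state through its gate...
x = rng.normal(size=d)
alpha = decay_gate(x, rng.normal(size=(d, d)))
S = step(S, alpha, rng.normal(size=d), rng.normal(size=d))

# ...while a PAD token is mapped to the monoid identity:
# alpha = 1 (no decay) and k (outer) v = 0 (no write).
S_after_pad = step(S, np.ones(d), np.zeros(d), np.zeros(d))
assert np.array_equal(S, S_after_pad)   # the state is untouched
```

Because the sigmoid keeps each decay strictly inside (0, 1) for real tokens, every dimension forgets at its own learned rate; only the hard-coded identity for PAD produces an exactly unchanged state.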

Beyond Sub-Quadratic: The 75% Reasoning Milestone

Replacing Softmax Attention usually incurs a heavy penalty on zero-shot capabilities. However, the vector-decay monoid architecture preserves the expressiveness required for complex reasoning.

Current zero-shot benchmarks show Spartacus-1B-Instruct outperforming established sub-quadratic architectures such as Mamba-1.4B and RWKV-6-1.6B: for instance, it scores 0.3063 on ARC-Challenge and 0.5518 on ARC-Easy.

More importantly, our recent integration of structured Chain-of-Thought (CoT) data during the SFT phase has pushed reasoning accuracy to 75%. Because Spartacus excels at implicit state compression, this high-quality CoT data is distilled directly into the S matrix's transition dynamics. The model learns the logic of step-by-step reasoning and internalizes it into its continuous ODE flow, delivering highly accurate conclusions without the agonizing verbosity of traditional models.

The Spartacus

With Spartacus-1B-Instruct, we have proven that the memory wall is a software problem, not an inescapable hardware limitation. By returning to pure, associative linear algebra, we unlock infinite context scaling with zero memory degradation.

Spartacus is about scaling intelligence, not scaling the memory wall.

Explore the code, Triton kernels, and model weights on Hugging Face: [NoesisLab/Spartacus-1B-Instruct](https://huggingface.co/NoesisLab/Spartacus-1B-Instruct)
