When Attention Becomes a Bottleneck: How Mamba Is Rethinking Long-Context AI

By Ankit Gubrani

In the previous post, we walked through how transformers work and why attention is such a powerful mechanism. We ended by noting that attention has a cost, and that cost grows fast as your sequences get longer. This post is about what that cost looks like in practice, why it matters for the systems you build, and what the industry has come up with to solve it.

The short version: a fundamentally different kind of model called Mamba has emerged that sidesteps the attention bottleneck entirely. To understand why that matters, we need to first understand exactly what problem it solves.

The Quadratic Problem, Made Concrete

When a transformer processes a sequence, every token attends to every other token. That is the whole point of attention. It gives the model a global view of the context. But that global view comes with a price tag.

The compute required grows quadratically with sequence length. If you double the number of tokens, you quadruple the compute. At 4,000 tokens, this is fine. At 32,000 tokens, things get noticeably heavier. At 128,000 tokens, you are looking at memory and compute requirements that are orders of magnitude larger. This is the O(n²) attention problem.
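A quick back-of-the-envelope sketch makes the gap concrete. The absolute numbers are illustrative; only the ratios relative to a 4,000-token baseline matter:

```python
# Relative cost of quadratic attention vs a linear-time model,
# normalized to a 4k-token baseline. Illustrative arithmetic only.
baseline = 4_000

for n in [4_000, 32_000, 128_000]:
    quadratic = (n / baseline) ** 2  # attention: double the tokens, quadruple the compute
    linear = n / baseline            # SSM-style: double the tokens, double the compute
    print(f"{n:>7} tokens: attention ~{quadratic:.0f}x, linear ~{linear:.0f}x")

# 4,000 tokens -> 1x vs 1x
# 32,000 tokens -> 64x vs 8x
# 128,000 tokens -> 1,024x vs 32x
```

At 128k tokens, quadratic attention costs roughly a thousand times the 4k baseline, while a linear model costs thirty-two times. That gap is the entire motivation for what follows.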

[Figure: Compute cost vs sequence length (4k to 128k tokens). Transformer cost grows as O(n²), Mamba as O(n), with roughly a 4x gap at 128k.]

Think about what this means for a production RAG system. If you are indexing long legal documents, processing full codebases, or ingesting multi-turn conversation histories, you are not working with 4,000 tokens. You are regularly hitting 50,000 to 200,000 tokens. And every time you push that context window further, the transformer pays an increasingly steep price in compute, latency, and GPU memory.

This is not a hypothetical scaling concern. It is a real architectural constraint today. So the question is: is there a smarter way to process long sequences?

[Figure: How each architecture processes tokens. Transformer attention: every token talks to every other token, n × n connections, O(n²) cost. SSM/Mamba: each token updates one fixed-size hidden state, and output is read from that state, n updates to one state, O(n) cost.]

Enter State Space Models: A Different Mental Model

Before Mamba, there was a class of models called State Space Models (SSMs). The idea behind SSMs comes from control theory and signal processing, but you do not need that background to grasp the key insight.

Instead of looking at every token in relation to every other token, an SSM maintains a compressed "hidden state." Think of it as a rolling summary of everything the model has seen so far. As each new token arrives, the model updates that summary. When it needs to produce an output, it reads from that summary.

A useful analogy: imagine reading a long book. A transformer re-reads every page before writing each new sentence. An SSM keeps a running set of notes and updates them as it reads. It never re-reads the whole book, just updates and refers to its notes.

This makes SSMs naturally linear in compute cost. No matter how long the sequence gets, you are always just updating a fixed-size hidden state. That is a fundamental architectural advantage over attention.
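In code, the core recurrence is tiny. Here is a minimal, non-selective SSM scan for intuition; the shapes and parameter names are my own, and real SSMs use learned, discretized parameters rather than hand-picked matrices:

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Run a minimal linear state space recurrence over a sequence.

    h_t = A @ h_{t-1} + B * x_t   (update the rolling summary)
    y_t = C @ h_t                 (read an output from the summary)

    Cost is O(n): one fixed-size state update per token,
    no matter how long the sequence gets.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:           # single pass; earlier tokens are never revisited
        h = A @ h + B * x  # fixed update weights: every token treated equally
        ys.append(C @ h)
    return np.array(ys)

# Toy example: 2-dim hidden state, scalar inputs and outputs.
A = np.array([[0.9, 0.0], [0.0, 0.5]])  # how the old summary decays
B = np.array([1.0, 1.0])                # how a new token writes into the state
C = np.array([0.5, 0.5])                # how the output reads the state
print(ssm_scan(A, B, C, [1.0, 0.0, 0.0]))  # the impulse fades through the state
```

Notice the weakness baked into this sketch: `A` and `B` are the same for every token, which is exactly the "all tokens treated equally" problem described next.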

The catch? Early SSMs had a problem: they treated all tokens equally. Every word, whether it was a critical decision point or a filler word, got the same weight when updating the hidden state. That made them weaker than transformers at tasks that require selectively remembering specific pieces of information far back in a sequence.

Mamba: Teaching the Model to Be Selective

Mamba, introduced by Albert Gu and Tri Dao in late 2023, took SSMs and added one critical upgrade: selectivity. Instead of using fixed parameters to update the hidden state, Mamba makes those parameters input-dependent. The model learns, based on what it is currently reading, what to hold on to and what to let go.

[Figure: Vanilla SSM vs Mamba, the selectivity difference. A vanilla SSM applies fixed update weights to the hidden state for every token (no selectivity); Mamba computes input-dependent weights per token, making the hidden state selective and context-aware.]

Think of it like an experienced journalist taking notes at a long press conference. A junior reporter writes down everything equally. The experienced journalist knows which statements matter, which are just filler, and edits their notes in real time. Mamba is the experienced journalist.

This change makes Mamba much more competitive with transformers on tasks that require nuanced understanding over long sequences, while still processing in linear time.
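Here is a hedged sketch of what "input-dependent parameters" means in code. The projection matrices (`W_B`, `W_C`, `W_dt`) are stand-ins of my own naming, and this omits real Mamba details like the diagonal state matrix, discretization, and the hardware-aware parallel scan; it only shows the shape of the idea:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_ssm_step(h, x, W_B, W_C, W_dt):
    """One schematic step of a selective (Mamba-style) state update.

    Unlike a vanilla SSM, the update weights are computed FROM the
    current token, so the model chooses per token how much to write
    into the state and how strongly to decay what is already there.
    """
    dt = sigmoid(W_dt @ x)        # per-token gate: how much this token matters
    B = W_B @ x                   # input-dependent "write" direction
    C = W_C @ x                   # input-dependent "read" direction
    h = (1.0 - dt) * h + dt * B   # filler tokens (small gate) barely touch the state
    y = C @ h
    return h, y

# Toy usage: 4-dim hidden state, 3-dim token embeddings.
rng = np.random.default_rng(0)
d, e = 4, 3
W_B, W_C, W_dt = (rng.standard_normal((d, e)) for _ in range(3))
h = np.zeros(d)
for x in rng.standard_normal((5, e)):
    h, y = selective_ssm_step(h, x, W_B, W_C, W_dt)
```

The single line that matters is the gated update: because `dt` depends on `x`, the model can learn to let a critical token overwrite the state while a filler token leaves it almost untouched.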

In benchmarks, Mamba matched or outperformed similarly sized transformers on language tasks while using significantly less memory and running faster at inference, especially as sequence lengths grew beyond 32k tokens.

Mamba 2 and 3: Closing the Gap with Transformers

The original Mamba was a strong proof of concept. But it still had gaps. Transformers had years of optimization behind them, hardware-efficient kernels, and techniques like FlashAttention. Mamba needed to catch up.

Mamba 2 introduced the idea of Structured State Space Duality (SSD). This sounds complex, but the practical implication is meaningful: the authors showed that Mamba's selective state space mechanism is mathematically related to a restricted form of attention. That meant Mamba 2 could be optimized using the same kinds of hardware-level tricks that made transformers fast. The result was a model that was up to eight times faster to train than Mamba 1, with better performance on language modeling tasks.

Mamba 3 went further by focusing on training stability and hybrid architectures. One of the most interesting directions here is mixing Mamba layers with a small number of full attention layers. The insight is that you do not have to choose one or the other. Use Mamba for the bulk of the long-context processing, and sprinkle in attention heads where precise recall from specific positions is critical. This hybrid approach gets the best of both worlds.
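A hybrid stack is easy to picture as a layer schedule: mostly Mamba-style blocks, with an attention block every few layers. The ratio and block names below are illustrative, not taken from any specific model:

```python
def hybrid_schedule(n_layers, attention_every=6):
    """Build a layer plan: mostly SSM blocks, with periodic attention blocks.

    The SSM layers do the cheap O(n) long-context sweep; the sparse
    attention layers provide precise positional recall where it counts.
    """
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]

print(hybrid_schedule(12))
# layers 6 and 12 are attention; the other ten are mamba
```

Because only a handful of layers pay the quadratic price, the overall cost stays close to linear while the model keeps a mechanism for exact lookups.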

Several production-leaning models today, including some from AI labs experimenting with efficient inference, use exactly this kind of hybrid design. It is a sign that the industry has stopped thinking of these as competing approaches and started thinking of them as complementary tools.

Mamba vs Transformers: Where Each Actually Wins

This is not a "Mamba is better" or "transformers are better" conversation. Both are genuinely useful depending on the task. Here is an honest breakdown.

Mamba vs Transformers: Head-to-Head

  Criterion                   Mamba                     Transformer
  Long-context processing     ✓ Wins (linear cost)      Expensive at 100k+
  Inference memory usage      ✓ Wins (fixed state)      Grows with context
  Streaming / real-time       ✓ Wins (O(1) per step)    Re-attends each step
  In-context learning         Weaker                    ✓ Wins
  Precise positional recall   Can lose detail           ✓ Wins
  Tooling and ecosystem       Still maturing            ✓ Wins (years of work)

  Hybrid architectures combine both to get the best of each.

The key trade-off to internalize: Mamba is excellent at processing long sequences efficiently, but transformers are better at precisely locating specific information within a context. If your task is "understand the general meaning of this 200-page document," Mamba has an edge. If your task is "find the exact clause in paragraph 47 that contradicts paragraph 112," transformers are still more reliable.

This is why hybrid models are gaining traction. They use Mamba-style layers for the long-range context sweep, and targeted attention layers for the precise recall tasks.

What This Means for Your Architecture

If you are building production AI systems today, Mamba is probably not a drop-in replacement for the transformer-based models you are already using. The ecosystem, tooling, and pretrained model availability are still catching up. But there are real situations where you should be paying attention to SSM-based or hybrid models.

Consider a Mamba-based or hybrid model when:

  • You are processing very long documents, codebases, or conversation histories regularly (50k to 200k tokens).
  • Inference latency and memory cost are constraints, especially in streaming or real-time response scenarios.
  • You are building on edge devices or memory-constrained infrastructure.
  • Your task is more about summarization, synthesis, or general comprehension over long context rather than precise fact retrieval.

Stick with transformer-based models when:

  • Your context windows are moderate (under 32k tokens) and cost is not a primary concern.
  • You need strong few-shot learning, precise retrieval from prompts, or reliable positional accuracy.
  • You are relying on a rich ecosystem of fine-tuned models, integrations, and evaluation tooling.
  • You need a battle-tested model with broad benchmark coverage and community support.

For most web engineering teams building AI features today, the most practical takeaway is this: understanding that the architecture underneath an LLM shapes its cost and capability profile helps you ask better questions when evaluating models. Not every 128k-context model is equal. How it handles that context, whether through quadratic attention or something more efficient, changes the economics significantly.

Wrapping Up

Let us take a step back and look at the arc of this series so far.

In the previous post, we covered how transformers work. Attention is what made them so powerful: the ability to look at every token in relation to every other token, in a single pass, gave the model a rich global understanding of context. That was a genuine leap forward, and it is why transformers became the default architecture for almost everything in AI.

In this post, we looked at the cost that comes with that power. Attention scales quadratically, and at long context lengths that becomes a real constraint in terms of memory, compute, and latency. State Space Models offered a different approach: a compressed rolling hidden state that processes sequences in linear time. Mamba took that further by making the state selective, so the model learns what to remember and what to drop. Mamba 2 and 3 closed the performance gap with transformers, and today hybrid architectures are emerging that combine the strengths of both.

The pattern here is worth noting. Each generation of architecture did not throw out what came before. It identified a specific limitation and addressed it directly. Transformers fixed the memory problem of RNNs. SSMs fixed the quadratic cost of attention. Mamba fixed the selectivity problem of vanilla SSMs. This is how architectural progress actually works in practice.

In the next post, we continue that thread. We will look at another foundational building block of the transformer that has sat quietly in every model since 2017, rarely explained and rarely questioned, until researchers started prodding at it and found something unexpected. That building block is the residual connection, and it turns out the story behind it is more interesting than most engineers realise.

Connect with me on LinkedIn if you want to be notified when the next post goes live.