The Same Trick That Made Transformers Great Just Made Them Better
In the first post of this series, we walked through how transformers work and why attention was such a breakthrough. In the second post, we looked at what happens when attention itself becomes the bottleneck, and how architectures like Mamba are rethinking long-context processing.
This third post closes the series by going back inside the transformer and poking at something most people have never thought to question: the residual connection.
Every transformer since the original 2017 paper has used them. Every LLM you have worked with has them. They are in GPT, LLaMA, Claude, Gemini. They are so standard that they barely get mentioned in architecture discussions. They are just... there.
In March 2026, the Kimi team published a paper called Attention Residuals that asked a question nobody thought to ask: what if the way we have been doing residual connections is subtly wrong, and there is a better way?
The answer turns out to be more interesting than you might expect.
A Quick Refresher: What Residual Connections Actually Do
Before we get to the problem, let us make sure the foundation is solid.
When a transformer processes your input, it runs it through a stack of layers. Each layer does some work, usually attention followed by a feed-forward network, and produces an output. That output then becomes the input for the next layer. Simple enough.
But there is a problem with very deep networks: vanishing gradients. During training, the model learns by computing how wrong its output was and then propagating that error signal backwards through all the layers. In very deep networks, that error signal tends to get smaller and smaller as it travels back. By the time it reaches the early layers, there is almost nothing left to learn from. Those layers get stuck.
ResNet solved this in 2015 with a beautifully simple idea: instead of just passing the transformed output to the next layer, also add back the original input. So if a layer takes in x and produces F(x), the next layer receives F(x) + x rather than just F(x).
Think of it like a highway running alongside a series of processing stations. The main flow goes through each station and gets transformed. But there is always a direct lane bypassing each station. If a station is broken or unhelpful, the signal can just flow through the bypass unchanged. Residual connections are gradient highways. They give the error signal a direct path back through the network and prevent it from disappearing.
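In code, the trick is a single addition. Here is a minimal numpy sketch; the toy `layer` function and shapes are illustrative, not any real model's:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W):
    """A stand-in for a sub-layer's transformation F(x)."""
    return np.tanh(x @ W)

def residual_block(x, W):
    # The residual connection: pass on F(x) PLUS the original input x.
    return layer(x, W) + x

x = rng.normal(size=(4, 8))            # batch of 4 vectors, dimension 8
W = 0.1 * rng.normal(size=(8, 8))
out = residual_block(x, W)

# If the layer is "broken" (contributes nothing), the signal still flows
# through the bypass unchanged: tanh(x @ 0) == 0, so the output is just x.
bypass = residual_block(x, np.zeros((8, 8)))
```

Because the identity path is always present, the gradient of the output with respect to x contains a direct identity term, which is exactly the "gradient highway" described above.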
Transformers adopted this immediately and it worked brilliantly. Training became more stable, models could be made much deeper, and performance improved. Residual connections became part of the standard recipe. Nobody questioned them much after that.
Until now.
The Hidden Problem: Uniform Accumulation
Here is what happens when you stack many residual layers on top of each other. At each layer, you add the current layer's output back to the running total. By the time you reach layer 50 in a deep model, the hidden state is the sum of contributions from all 49 layers before it, each added with equal weight.
Every layer contributes equally. Layer 1 gets as much say as layer 49. There is no mechanism for a later layer to say "I need more from layer 3 and less from layer 47." It is just a straight addition, every time, all the way down.
The Kimi team identified the consequence of this: in very deep networks, early-layer signals get progressively buried. As you go deeper, each layer's output gets added to a growing pile. The pile gets bigger. The relative contribution of any single early layer shrinks as a fraction of the whole. By the later layers, the signal that layer 1 put in is a tiny whisper beneath fifty layers of noise.
Imagine a cocktail party where everyone is talking at the same volume. The people who spoke first are still technically in the conversation, but their voices are getting drowned out by every new person who joins. The conversation keeps growing, and the earliest contributions get buried under the weight of everything that came after them.
That is what is happening inside a deep transformer with standard residual connections. The model has no way to selectively listen to earlier layers. It just accumulates everything equally and hopes the relevant signals survive.
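A toy calculation makes the dilution concrete. Below, each "layer" adds a random unit-scale contribution to a running sum, and we track how much of layer 1's signal remains. This is a sketch of the effect, not the paper's analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, depth = 64, 50

layer1 = rng.normal(size=dim)        # layer 1's contribution
h = layer1.copy()
shares = []
for _ in range(2, depth + 1):
    h = h + rng.normal(size=dim)     # each later layer adds its own output
    # Layer 1's share of the hidden state, measured as a norm ratio
    shares.append(np.linalg.norm(layer1) / np.linalg.norm(h))

print(f"layer 1 share after layer 2:  {shares[0]:.2f}")
print(f"layer 1 share after layer {depth}: {shares[-1]:.2f}")
```

For roughly independent contributions the share shrinks like 1/sqrt(n); when later contributions are correlated with each other, as they often are in practice, it can shrink even faster.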
PreNorm Does Not Fix This
You might be thinking: "Wait, don't modern transformers use layer normalization? Doesn't that fix the signal dilution problem?"
It is a fair question. Most LLMs today use what is called Pre-Layer Normalization (PreNorm), where the input to each sub-layer gets normalized before going through attention or the feed-forward network. This was specifically introduced to stabilize training in deeper models.
But here is the thing: normalization stabilizes the values, it does not fix the underlying structure of how layers accumulate. PreNorm addresses the scale of the signals. It does not address the selection problem. Even after normalization, every prior layer still contributes with equal structural weight to the next. The accumulation is still uniform. The burial of early signals is still happening.
The Kimi paper validates this directly. Their experiments show that even with PreNorm, hidden-state magnitudes grow with depth, and the gradient distribution across layers becomes increasingly uneven. Later layers receive much stronger gradients than earlier ones. The model effectively learns unevenly: the deeper layers do the heavy lifting while the earlier layers contribute less and less.
Normalization smooths the ride. It does not fix the route.
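One way to see the scale-versus-selection distinction: normalization divides the whole hidden state by a single scalar, so it cannot change how the mixture is composed. A small sketch using an RMSNorm-style rescaling (illustrative, not any particular model's implementation):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Rescale the vector to unit RMS; one scalar divides every component.
    return x / np.sqrt(np.mean(x**2) + eps)

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(2)
early = rng.normal(size=16)                          # an early layer's signal
pile = sum(rng.normal(size=16) for _ in range(49))   # everything added after it
h = early + pile

# The early signal's alignment with the hidden state is identical before and
# after normalization: the scale changed, the mixture did not.
before = cos(h, early)
after = cos(rms_norm(h), early)
```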
The Fix: Apply Attention Across Depth, Not Just Sequence
This is where the Kimi team's insight gets elegant. Transformers already have a proven mechanism for selective aggregation: attention. We use it to selectively pull information from different tokens in the sequence. Why not use the same principle to selectively pull information from different layers in the depth?
That is exactly what Attention Residuals (AttnRes) do. Instead of summing all prior layer outputs with equal weight, AttnRes applies a softmax attention over the outputs of all preceding layers. Each layer computes a pseudo-query vector that represents "what kind of prior representation do I need right now?" That query is then compared against the outputs of every earlier layer, and the most relevant ones are weighted more heavily.
Going back to our cocktail party analogy: instead of everyone's voice contributing equally, imagine you now have noise-canceling headphones that you can tune. You dial up the voices that are relevant to what you are trying to understand and dial down the ones that are not. The earlier speakers do not get buried anymore. If what they said is relevant, the model can actively surface it.
The mechanism mirrors the Q/K/V attention you already know from the sequence dimension, just rotated to work across the depth dimension. The input to each layer does not just get the previous layer's output added to it. It gets a dynamically weighted blend of all previous layer outputs, where the weights are computed based on the current input.
This means the model can, for example, reach back to layer 3's output when processing in layer 30 if that is where the most relevant representation lives. Depth becomes just another dimension that attention can navigate.
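Here is a deliberately simplified sketch of the mechanism: a pseudo-query from the current input, keys from each prior layer's output, and a softmax blend across depth. The projections, shapes, and toy data are hypothetical; the paper's actual parameterization will differ:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attn_residual(x, prior_outputs, Wq, Wk):
    """Blend prior LAYER outputs with attention weights (toy version)."""
    q = Wq @ x                                  # pseudo-query: "what do I need?"
    keys = np.stack([Wk @ p for p in prior_outputs])
    scores = keys @ q / np.sqrt(len(q))         # scaled dot-product over depth
    weights = softmax(scores)                   # one weight per earlier layer
    blend = weights @ np.stack(prior_outputs)   # selective depth-wise mix
    return x + blend, weights

dim = 8
# Toy orthogonal "layer outputs" so the example is deterministic
priors = [np.eye(dim)[i] for i in range(5)]
x = priors[1].copy()       # the current input most resembles layer 2's output

Wq = Wk = np.eye(dim)
h, w = attn_residual(x, priors, Wq, Wk)
# Layer 2's output wins the largest weight; a plain residual connection
# would have weighted all five prior layers identically.
```

The only structural change versus a standard residual is that the fixed, uniform sum over prior outputs becomes a learned, input-dependent weighted sum.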
Block Attention Residuals: Making It Practical at Scale
Full AttnRes has one practical problem: attending over every prior layer output in a very deep model gets expensive. If you have 96 layers and every layer attends over all 95 before it, you are adding a significant memory and compute overhead on top of an already large model. That is fine in a research paper but hard to deploy in practice.
The Kimi paper introduces a smarter variant called Block Attention Residuals (Block AttnRes) to handle exactly this.
The idea is simple: instead of attending over every individual layer, the model groups layers into blocks and attends over block-level representations. Think of it like chapters in a book. Instead of referencing every individual page to write a new paragraph, you reference the chapter summaries. You still get selective depth-wise attention, but at a much lower memory cost.
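A sketch of the blocking step, with mean-pooled summaries standing in for whatever block-level representation the paper actually uses (the pooling choice here is my assumption):

```python
import numpy as np

def block_summaries(layer_outputs, block_size):
    """Collapse groups of layer outputs into one summary vector per block."""
    blocks = [layer_outputs[i:i + block_size]
              for i in range(0, len(layer_outputs), block_size)]
    # Mean-pooling is a placeholder summary; the paper's choice may differ.
    return [np.mean(np.stack(b), axis=0) for b in blocks]

dim = 8
outs = [np.eye(dim)[i % dim] for i in range(24)]   # 24 toy layer outputs
summaries = block_summaries(outs, block_size=4)

# Depth attention now runs over 6 block summaries instead of 24 layers:
# the number of keys and values to keep around drops by the block size.
```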
The paper also introduces two implementation techniques to make this production-ready: cache-based pipeline communication (so the block summaries do not need to be recomputed every time) and a two-phase computation strategy that sequences the attention operations to avoid memory spikes during training.
The result is something the paper explicitly calls a "drop-in replacement" for standard residual connections. You do not need to redesign your architecture. You swap in Block AttnRes, retrain, and the model gets better without growing significantly in cost.
What the Benchmarks Actually Show
The Kimi team tested this at real scale. They integrated AttnRes into their Kimi Linear architecture, a 48-billion-parameter model (with 3B activated parameters, meaning it is a Mixture-of-Experts style model) and pre-trained it on 1.4 trillion tokens.
The headline results are worth stating clearly:
- AttnRes produces more uniform output magnitudes across depth. The hidden states no longer grow uncontrollably as you go deeper. The signal from early layers stays proportionally meaningful throughout the whole network.
- Gradient distribution becomes more even across all layers. Earlier layers actually receive meaningful gradient signal during training, which means they learn more effectively. The whole model trains as a team rather than having the deeper layers carry most of the load.
- Downstream task performance improves across the board. Every evaluated benchmark showed gains, not just a cherry-picked subset.
- The scaling law experiments show consistent improvement across model sizes. This is not a trick that only works at a specific scale. The benefit holds as the model grows.
The compute efficiency story is also compelling. The paper's scaling-law analysis suggests that AttnRes models reach the same performance as standard residual models with roughly 1.25x less compute, that is, about 80 percent of the budget. Put differently: for the same training budget, you get a better model. Or for the same model quality, you spend less.
Why This Matters Beyond the Numbers
Step back and look at the pattern across this series.
In post one, we looked at how transformers use attention across the sequence dimension to let every token selectively pull context from every other token. That was the core insight that made transformers so powerful.
In post two, we looked at how that attention mechanism has a quadratic cost, and how architectures like Mamba are rethinking how to process long sequences by replacing global attention with a selective, compressed hidden state.
Now in this post, the Kimi team has taken the same principle of selective, learned aggregation and applied it to an entirely different dimension of the transformer: depth.
The transformer already knew how to selectively attend across tokens. AttnRes teaches it to selectively attend across layers. The same tool that solved the sequence aggregation problem turns out to also solve the depth aggregation problem.
This is what good architectural thinking looks like. Not inventing an entirely new mechanism, but recognizing that a mechanism you already have is solving a structurally similar problem elsewhere, and applying it there too.
It also suggests that residual connections, despite being in every transformer for nearly a decade, were never really solving the problem they appeared to solve. They were a workaround for vanishing gradients that introduced a different problem: uniform, undifferentiated accumulation across depth. AttnRes is the cleaner solution.
What Should You Take Away From This?
If you are an engineer building on top of LLMs rather than training them from scratch, you will not be swapping in AttnRes yourself today. But there are a few things worth keeping in your mental model:
Architecture details shape model behavior in ways that benchmarks alone do not reveal. A model trained with AttnRes has a fundamentally different internal signal structure than one trained with standard residuals. If Kimi's results hold up across the industry, expect to see AttnRes adopted broadly over the next year or two, the same way PreNorm got quietly adopted after the original transformer papers.
Efficiency gains compound. A 1.25x compute improvement might sound modest. But at the scale modern LLMs are trained, and at the volume of inference calls a production system handles, that is a very large number. Architectural improvements that seem small in percentage terms translate to enormous savings at scale.
The "standard" parts of an architecture are worth questioning. Residual connections had been in every transformer for nearly a decade before anyone seriously asked whether uniform accumulation was the right design. AttnRes is proof that interrogating the parts you assumed were solved can yield real gains.
Wrapping Up the Series
Let us close the loop on this three-part arc.
We started with how transformers work: tokens, embeddings, attention, and the repeating block structure that makes deep learning powerful.
We moved to how transformers get challenged: the quadratic cost of attention at long contexts, and how Mamba proposes a fundamentally different way to handle long sequences.
And we finished with how the internals of transformers are still being refined: residual connections, a mechanism so embedded in the architecture it became invisible, turning out to have a real flaw that a well-targeted application of attention can fix.
The common thread across all three posts is this: these architectures are not finished. They are actively evolving. The engineers and researchers working on them are still finding things that were assumed to be solved but were not quite. Understanding the foundations, the way attention works, what residuals are doing, where the costs come from, puts you in a much better position to reason about this evolution as it happens.
If you want to read the original Kimi paper, you can find it here: Attention Residuals (arXiv:2603.15031).
Connect with me on LinkedIn if you have questions or want to discuss where this goes next.