Page Status: Draft ★☆☆
I mentioned way back in the introduction that I find it useful to think about LLMs first in terms of the fundamental concepts, and then in terms of the algebraic reformulations of those concepts. Until now, I’ve been focusing exclusively on the conceptual layers. In this chapter, I’ll describe how those get bundled into mathematical objects that are more efficient to compute.
There are two major parts to this:
Turning vectors-of-vectors into matrices
Increasing the rank of all tensors by 1, so that we can add a batching dimension
The architecture’s conceptual shape¶
Before we dive into the algebraic reformulations, let’s take a look at the LLM’s architecture once more, this time focusing on the shapes of the learned parameters and activations. I’ll skip the tokenization phase, since that’s effectively a preparation step that happens before the LLM itself runs.
For most of the LLM, the activations are in the form of vectors, each size . The final output is still vectors, but each sized (the vocabulary size).
Vectors of vectors → matrices¶
The basic “lifting” we’ll do is to turn vectors of vectors into matrices. This will let us turn the various “for each outer vector, do some stuff” loops that we’ve been working with into matrix multiplications (I’ll describe each of these in detail below). This doesn’t change what’s going on conceptually, but it lets us do the math on GPUs, which can process it much more quickly.
All we need to do is turn each “outer” vector into a row in a matrix:
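For instance, here's a minimal sketch of that stacking in PyTorch (the library choice and the sizes, 4 tokens with 8-dimensional embeddings, are purely illustrative):

```python
import torch

n, d_model = 4, 8                           # illustrative sizes

# A "vector of vectors": one embedding vector per token.
embeddings = [torch.randn(d_model) for _ in range(n)]

# Stack each outer vector as a row of a single n-by-d_model matrix.
X = torch.stack(embeddings, dim=0)
print(X.shape)                              # torch.Size([4, 8])
```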
Let’s work through what that means for the calculations I’ve described in the previous chapters.
Calculating attention¶
Recall that we calculated attention by doing a nested loop over the input embeddings:
For each input embedding (there are of them):
Calculate the query vector . This vector has size .
For each input embedding (the same embeddings as for the query vector), calculate the attention score of against :
Calculate the key vector . This vector has size .
Calculate the dot product to get the attention score (a scalar).
Treat those attention scores as a vector; scale and softmax that vector to get the attention weight vector (size ).
Calculate value vectors:
For every input embedding , calculate a value vector . There are such vectors, each size .
Multiply each value vector by the corresponding attention weight (the scalars from the previous step). The result is still vectors, each size .
Sum the value vectors to get the context vector. This vector has size .
There are inputs (that is, iterations of the loop), so we ended up with context vectors, each of size .
Let’s see how much of this we can turn into matrix math. (Spoiler alert: almost all of it.) Instead of a nested loop that generates vectors of size , we’ll use matrix math to generate an matrix.
Calculating the query matrix¶
I’ll start with step 1.1 above. We’ll focus on one iteration of the loop — call it — and calculate the query vector for input . (Remember that is a -sized embedding vector.)
If we do this for each embedding, we get a matrix that we’ll call . This matrix represents the 1.1 step executed across all of the top-level iterations:
Let’s put each of those iterations into a row of a matrix:
This looks like matrix multiplication — and it is! Specifically, the elements make up a matrix whose rows are the inputs and whose columns are each input’s embedding dimensions; and the elements represent the weight matrix.
This means we can calculate with just one matrix multiplication:
This is really powerful! It means the first part of the nested loop (steps 1 → 1.1) can be reduced to a single matrix multiplication, which GPUs are extremely efficient at processing. We’ll be doing similar things for the key and value vectors, so I’d suggest taking the time to work through the above and make sure it makes sense to you.
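If it helps to see this concretely, here is a minimal PyTorch sketch; the names (X for the stacked inputs, W_q for the query weights) and the sizes are my own illustrative choices:

```python
import torch

n, d_model = 4, 8
X = torch.randn(n, d_model)                 # rows are the input embeddings
W_q = torch.randn(d_model, d_model)         # learned query weights (illustrative shape)

# One matrix multiplication replaces the "for each input, compute its query" loop.
Q = X @ W_q                                 # shape: (n, d_model)
```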
Calculating attention scores matrix¶
Now we can move on to the raw attention scores. This corresponds to step 1.2 above.
First, let’s calculate the key matrix . This is exactly the same as the query matrix , except that it uses instead of . Because the progression from vectors-of-vectors to matrix is the same, I won’t spell it out in full.
Next, we’ll calculate all the attention scores as a matrix. Each row will correspond to a query token, and each column will be the attention score between that query token and the corresponding key token:
Once again, this looks like matrix multiplication! The one problem is the key vectors. In that matrix multiplication for the attention scores, each needs to be a -sized vector corresponding to a horizontal row within . But, if we calculated this matrix as , then the thing that should be -sized key vectors would instead be the -sized vertical slices of :
Not only is this not what we want, but the math isn’t even defined: we’re taking the dot products of -sized vectors and -sized vectors.
What we need is to replace the vertical slicing of with horizontal slicing. To do that, we just need to transpose . This turns its rows into columns — meaning that when we take vertical slices of the transposed matrix during multiplication, what we actually get are the rows of .
Now we can just multiply by . For example, the first cell in this matrix would be:
Now the math works out: we’re multiplying by to get an matrix, the raw attention scores:
Just to belabor the point: we’ve turned all of the nested looping in steps 1 → (1.1 - 1.2) into just a few matrix operations:
Transpose (this doesn’t even require moving any memory: it’s just a bit of metadata to tell the computer to treat as )
Causal attention, scale, and softmax¶
Next, we just need to apply the causal mask, scale each element in the attention scores by dividing it by , and then apply softmax. This corresponds to step 1.3 above.
Note:
To apply the causal mask, you can create an matrix with 0s in the bottom-left and in the top-right:
Then, just add this to the attention scores matrix.
Dividing a matrix by a scalar () just divides each of its cells by that scalar.
Softmax operates on vectors. When we apply it to a matrix, this really just means applying it to each row of that matrix. Each of those rows has softmax calculated independently, but GPUs can parallelize the work efficiently across those rows.
None of these operations change the dimensions of the matrix, so it’s still .
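Here's a minimal PyTorch sketch of the score, mask, scale, and softmax steps; the names and sizes are illustrative assumptions, and the key dimension here is just the embedding size since we're still in the single-head setting:

```python
import math
import torch

n, d_model = 4, 8
X = torch.randn(n, d_model)
W_q = torch.randn(d_model, d_model)         # illustrative learned weights
W_k = torch.randn(d_model, d_model)

Q = X @ W_q                                 # (n, d_model)
K = X @ W_k                                 # (n, d_model)
scores = Q @ K.T                            # raw attention scores, (n, n)

# Causal mask: 0 on and below the diagonal, -inf strictly above it.
mask = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)

# Add the mask, scale by the square root of the key dimension, softmax each row.
weights = torch.softmax((scores + mask) / math.sqrt(d_model), dim=-1)   # still (n, n)
```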
Context matrix¶
Finally, we’ll apply our weights against the value vectors, and sum the results. This corresponds to step 1 → (1.4, 1.5) above.
First, we’ll get the value matrix , similar to the above. This is step 1.4.1.
Each row in this matrix is one value vector.
Before we go further, let’s step back and compute just a single context vector (that is, just a single token’s attention) using the matrices we’ve computed so far.
Just to recap, here’s what we need to do:
For each input :
...
Calculate value vectors:
For every input embedding , calculate a value vector . There are such vectors, each size .
Multiply each value vector by the corresponding attention weight (the scalars from the previous step). The result is still vectors, each size .
Sum the value vectors to get the context vector. This vector has size .
This means that within the context of a single query token , we need to:
Take all the value vectors (step 1.4.1):
(Remember that is an matrix; each row is a -vector.)
Multiply each value vector by its corresponding attention weight for query token (step 1.4.2):
Sum the weighted vectors to get the context vector for query (step 1.5):
We’ll call this vector , the context vector for the -th query token. Let’s see what all the s look like stacked as rows of a matrix:
Each cell is a sum of terms: each of row ’s columns multiplied by column 's rows. In other words, each cell is a dot product:
This may look familiar: it’s just the matrix multiplication .
The full attention calculation¶
So, attention is . If we substitute with the expression from Causal attention, scale, and softmax above, we get:
This is the canonical representation of attention, and it is somewhat famous within the LLM literature.
This means we’ve now turned all of the attention calculation — logically a deeply nested loop — into a few matrix multiplications and a bit of parallelizable manipulation:
divide these by
apply softmax to each row to get , the attention weight matrix
A GPU is going to eat this for breakfast!
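Putting the whole single-head calculation together, here's a hedged end-to-end sketch in PyTorch; the function name and shapes are illustrative, not anything canonical:

```python
import math
import torch

def single_head_attention(X, W_q, W_k, W_v):
    """Causal self-attention for one head. X is (n, d_model)."""
    n, d_k = X.shape[0], W_k.shape[1]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                            # each (n, d_k)
    mask = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)
    A = torch.softmax((Q @ K.T + mask) / math.sqrt(d_k), dim=-1)   # (n, n) weights
    return A @ V                                                   # context matrix, (n, d_k)

n, d_model = 4, 8
X = torch.randn(n, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
context = single_head_attention(X, W_q, W_k, W_v)
print(context.shape)                                               # torch.Size([4, 8])
```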
Multi-head attention¶
Back in the chapter on attention, I talked about how LLMs use multiple heads within a single attention layer, each learning a different relationship. The attention layer concatenates these heads, and then uses a final projection to combine them.
Described as such, this would require looping over each of the heads to perform the attention function we just saw. It may not surprise you that this can be done without looping, using tensor math.
First, let’s refresh the multi-head ideas:
is the input length, in tokens
overall dimensionality is still (for both input and output)
heads
each head is
To illustrate everything, I’ll pick , , and .
First, for each of , , and , we’ll concatenate the heads’ weights to create a single, matrix. For example:
(Remember, each head is — so in our example, , a.k.a .) Note that this doesn’t happen at runtime, during inference: these are learned parameters, so we can lay them out this way as we build the model.
At inference, we’ll multiply the input by these matrices, just as we did in the single-head description above. So for example:
Remember that in matrix multiplication, each cell in the result combines the corresponding row from the left matrix (the input, for us) with the corresponding column from the right matrix (the weights). Since our weights were split up column-wise by heads, the corresponding matrix products are, too:
At this point, we need to do actual looping — not just clever matrix math. For each of the heads, we’ll compute attention just as we did above:
(Remember to divide the scaling factor by to account for each head’s smaller embedding dimension!) Let’s look at the shape of this head attention. We can disregard softmax and (they don’t change the shape of vectors or matrices), in which case we get:
So with all that, we now have head attentions, each sized . Now we reverse the process that we took with the weights: we take these head attentions, concatenate them by columns, and treat them as a single attention output:
In this figure, each “side” of the attention represents the output from one head, sized . The concatenated heads form a single, matrix.
If you recall, the last step in the multi-head process was to multiply the output by a matrix. This is just a matrix, so there’s nothing special to do here: we just apply the matrix multiplication.
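Here's a sketch of that multi-head flow, with the one real loop over heads; the PyTorch framing, names, and sizes (8-dimensional embeddings split across 2 heads) are illustrative assumptions:

```python
import math
import torch

n, d_model, h = 4, 8, 2
d_k = d_model // h                                   # each head's embedding size

X = torch.randn(n, d_model)
# Per-head weights already concatenated column-wise into d_model-by-d_model matrices.
W_q, W_k, W_v, W_o = (torch.randn(d_model, d_model) for _ in range(4))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                  # each (n, d_model)

head_outputs = []
for i in range(h):                                   # the one real loop: over heads
    cols = slice(i * d_k, (i + 1) * d_k)             # this head's column slice
    Qh, Kh, Vh = Q[:, cols], K[:, cols], V[:, cols]  # each (n, d_k)
    mask = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)
    Ah = torch.softmax((Qh @ Kh.T + mask) / math.sqrt(d_k), dim=-1)
    head_outputs.append(Ah @ Vh)                     # each head's context, (n, d_k)

# Concatenate the heads column-wise, then apply the final output projection.
multi_head = torch.cat(head_outputs, dim=1) @ W_o    # (n, d_model)
```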
Combined QKV matrix¶
In the above (and back in our original chapter on attention), we treated the , , and weight matrices as three separate matrices. To calculate the query, key, and value matrices, we did:
In practice, these are usually concatenated into one matrix, :
(Note that for brevity, I’m being a bit informal in my notation here: in particular, I’m writing the various values as just , and similarly for and ). We apply matrix multiplication and addition to this:
... and then just split the matrix into three slices:
This lets us do all three matrix multiplications (, , and ) in a single operation. GPUs have some fixed overhead for any given matrix multiplication, so this optimization amortizes that overhead across all three matrices.
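A minimal sketch of the fused multiplication and split, assuming a single concatenated weight matrix (called W_qkv here for illustration) and ignoring any bias terms:

```python
import torch

n, d_model = 4, 8
X = torch.randn(n, d_model)

# The three weight matrices laid out side by side: shape (d_model, 3 * d_model).
W_qkv = torch.randn(d_model, 3 * d_model)

QKV = X @ W_qkv                            # one matmul instead of three
Q, K, V = QKV.split(d_model, dim=1)        # slice back into three (n, d_model) matrices
```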
KV caching¶
In all of the above, we’ve been calculating the full attention over all of the input tokens. When we process the user’s initial prompt, this is great. But as we generate tokens, we can calculate attention incrementally.
For example, let’s say the user entered The quick brown fox. This prompt is 4 tokens, so our attention is . We generate the next token, jumps, and then loop back for another round of inference. The naive, full-attention calculation I’ve been describing so far would require a attention, for The quick brown fox jumps. We can short-circuit much of this calculation.
Let’s quickly review everything we did above. To make things concrete, I’ll pick , and we’ll look at a 3-token input ().
First, we calculate , , . These are all matrices (the products of the input and the weight matrices):
Next, we’ll calculate , which is an matrix:
(Note that I’m using informal, nonstandard notation here: to represent row 1 of , and to represent column 1. Also, for simplicity, I’m omitting causal attention, scaling, and softmax — we don’t need them right now. That means isn’t quite the we’ve been using above.)
Finally, we calculate , which is :
Remember that when our round of inference is done, we’re only going to use the last logit, which will be derived just from the last row of this attention (after passing it through various FFNs and other transformer blocks). So, let’s focus on the last row of .
As a reminder, is:
This means that contains:
all of ’s data ( for every token )
all of ’s data ( for every dimension )
only the row
So far, this is just a rehash of everything we’ve already seen. Here’s where it gets interesting! We can make some observations:
Each row in is independently calculated, based on the weights and the th token’s embedding. In other words, each token stays within its row in .
This means we can cache at every round of inference. In the next round, we don’t need to calculate all of from scratch: the cache gives us the first rows. All we need to calculate is the th row.
Similarly for .
For , we don’t need the first rows at all: all we need is the th row.
We do still need to build the full and matrices; we just don’t need to compute most of them, since all but the last row are cached. For , we don’t even need to build the full matrix.
Also, since we’re now only computing the last row of attention, we don’t need to account for causal attention. Remember that the attention mask only zeroed out weights for rows before the last row; the last row is unaffected, and that’s the only one we’re generating.
Putting it all together, we have essentially the same attention formula as before, but tweaked to only generate the last row:
is a matrix, derived from just the most recent token (a embedding) and the weights
is constructed by taking the cached — an matrix — and appending the matrix that’s the most recent token multiplied by .
is similarly constructed
So:
is
softmax and scaling maintain these dimensions
is
And there we have it! We’ve calculated just the last row of attention, which will then snake through the FFN and other transformer blocks to produce the logits for the next predicted token.
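Here's a hedged sketch of one such incremental decode step, with K and V cached across rounds; the names, shapes, and PyTorch framing are illustrative assumptions:

```python
import math
import torch

d_model = 8
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))

# Caches grow by one row per generated token; start empty for simplicity.
K_cache = torch.empty(0, d_model)
V_cache = torch.empty(0, d_model)

def decode_step(x_new, K_cache, V_cache):
    """x_new is the newest token's embedding, shape (1, d_model)."""
    q = x_new @ W_q                                  # only the newest row of Q
    K_cache = torch.cat([K_cache, x_new @ W_k])      # append the newest row of K
    V_cache = torch.cat([V_cache, x_new @ W_v])      # append the newest row of V

    # Only the last row of attention; no causal mask is needed for the newest token.
    A = torch.softmax(q @ K_cache.T / math.sqrt(d_model), dim=-1)   # (1, t)
    return A @ V_cache, K_cache, V_cache             # context row: (1, d_model)

for x_new in torch.randn(5, 1, d_model):             # five decode rounds
    context, K_cache, V_cache = decode_step(x_new, K_cache, V_cache)
```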
Implementation details¶
To do all of the matrix concatenations efficiently, we need to get into the nitty-gritty of standard matrix libraries in software. This book doesn’t cover any particular library, but they’ll all work pretty similarly.
We’ll pick up where we split the matrix by head:
Tensor libraries will let you reinterpret one tensor as another, differently-shaped one. In our case, we’re going to reinterpret the matrix (a rank-2 tensor) as an tensor (rank 3), which splits it up exactly as we just visualized.
Why does the reinterpretation work like that?
Internally, tensor libraries typically store the matrix as a single, contiguous array of values:
The first dimension splits this evenly according to that dimension’s size (in this case, , so three groups):
The next dimension splits each of these groups, again evenly depending on size (in this case, , so two sub-groups):
If we had even higher dimensions (4-rank tensors and above), we would keep going.
The last dimension is always just “the rest of the elements at this level” — or more colloquially, we can think of it as the column dimension.
This is all a bit of a simplification: things like transposition and optimization details can complicate the picture. But at a high level, it’s a good way of understanding what’s going on.
When tensor libraries perform matrix operations on higher-order tensors (rank 3 or above), they treat the leftmost dimensions as “batching dimensions” — basically, dimensions to loop over. They can offload this batching to GPUs, where it happens very efficiently.
Unfortunately, this approach doesn’t quite work for our tensor: we want to loop over each of the heads, not each of the rows. To solve this, we’ll first transpose the tensor to . This doesn’t change its in-memory layout at all: it just changes how the library indexes into the tensor, and thus how it batches.
To summarize, we’ve done three things:
multiplied to get an matrix
reinterpreted that as an tensor
transposed that to a tensor
Now we just apply the attention formula:
This time, , , and are each those 3-rank tensors, with as the batch dimension. When they’re multiplied together, the libraries will match each batch up. For example, to multiply :
is
is (you may have to do this transposition explicitly)
When the library multiplies the two, it’ll first multiply batch 0 from with batch 0 from to produce batch 0 of the result. Then it multiplies batch 1 with batch 1 to produce result batch 1, and so on.
When you apply the function, you’ll explicitly tell the library which dimension to apply it against (in our case, the columns — that is, the last dimension). only applies to scalars, so it doesn’t need any dimension or batch handling; it just applies independently to each of the values.
The result of all that is an attention tensor, which is . Now we just reverse the reshaping: we transpose this to and then reinterpret it as a rank-2, matrix.
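Here's a sketch of that reinterpret, transpose, and batched-multiply sequence in PyTorch terms (view, transpose, and the @ operator's built-in batching); the names and sizes are illustrative assumptions:

```python
import math
import torch

n, d_model, h = 4, 8, 2
d_k = d_model // h

X = torch.randn(n, d_model)
W_q, W_k, W_v, W_o = (torch.randn(d_model, d_model) for _ in range(4))

def split_heads(M):
    # (n, d_model) -> reinterpret as (n, h, d_k) -> transpose to (h, n, d_k),
    # so that the head index becomes the batch dimension.
    return M.view(n, h, d_k).transpose(0, 1)

Q, K, V = (split_heads(X @ W) for W in (W_q, W_k, W_v))

mask = torch.triu(torch.full((n, n), float('-inf')), diagonal=1)

# Batched matmul: every head's (n, d_k) @ (d_k, n) product happens in one call.
A = torch.softmax((Q @ K.transpose(1, 2) + mask) / math.sqrt(d_k), dim=-1)
context = A @ V                                      # (h, n, d_k)

# Reverse the reshaping: back to (n, h, d_k), flatten to (n, d_model), then project.
output = context.transpose(0, 1).reshape(n, d_model) @ W_o
```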
FFNs¶
As I mentioned in the previous chapter, each FFN in an LLM typically consists of the input sized , one hidden layer sized , and an output layer sized . The FFN’s input and output correspond to a single token embedding; this gets evaluated separately for each token, though GPUs are able to do those separate evaluations efficiently in parallel.
Let’s look at the FFN from the perspective of one layer. Remember from the chapter on FFNs that each layer has:
an input vector of scalars, sized
neurons, each containing a -sized vector of weights
for each neuron, we:
calculate the dot product of the input and that neuron’s weights; this gives us a scalar
add a scalar bias, one per neuron
pass that through an activation function to get one scalar per neuron, which is that neuron’s activation
We can visualize the neuron weights as column vectors, each with elements:
You may already see where this is going: we can treat this as a single matrix. I’ll call this matrix .
We can also treat the layer’s -vector as a matrix, which I’ll call X. If we do, we see that the matrix multiplication gives us the right shape:
Furthermore, each column in the output is the right value for the pre-bias neuron activation. For every column in , its value is:
Now we need to add the biases. There are of them, one per neuron. Instead of treating them as separate values and adding them one at a time, we’ll treat them as a single matrix, and add this to the result from . I’ll call this bias matrix .
After that, we just need to apply the activation function. This does have to be applied to each value separately, but GPUs can efficiently parallelize that work.
This gives the full representation of each FFN layer:
To create the full FFN, we just apply each layer serially.
One crucial optimization we can make is to do all of the tokens’ calculations at once. Remember that is a matrix, corresponding to a single token in the prompt. If we consider the whole prompt, this is an matrix:
If we multiply this by the matrix , the result will have rows, each corresponding to one row from the input , and representing that row multiplied by the weight matrix .
We still need to conceptually loop over each of those rows to add , and then over every value to apply the activation function. GPUs can handle both of those efficiently, though.
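Here's a minimal sketch of the full two-layer FFN applied to every token at once; the hidden size of four times the embedding size and the GELU activation are common choices I'm assuming for illustration:

```python
import torch
import torch.nn.functional as F

n, d_model = 4, 8
d_ff = 4 * d_model                          # assumed hidden-layer size

X  = torch.randn(n, d_model)                # one row per token
W1 = torch.randn(d_model, d_ff)             # hidden weights: one column per neuron
b1 = torch.randn(d_ff)                      # one bias per hidden neuron
W2 = torch.randn(d_ff, d_model)             # output-layer weights
b2 = torch.randn(d_model)

# Every token's FFN evaluation happens in the same two matrix multiplications;
# adding b1 and b2 broadcasts the bias row across every token's row.
hidden = F.gelu(X @ W1 + b1)                # (n, d_ff)
output = hidden @ W2 + b2                   # (n, d_model)
```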
Normalization¶
Recall that for each embedding token, the normalization layer is calculated as:
We need to apply this to each input embedding separately. Unfortunately, there’s nothing clever we can do with matrix math here: not only does each embedding need to be evaluated separately, but calculating the mean and variance requires per-element calculations.
Luckily, the calculations themselves are pretty simple. And, as before, GPUs can handle these efficiently and in parallel.
The token embeddings do still get treated as a single matrix. This is partially because GPUs know how to parallelize work efficiently across the rows of matrices. It’s also convenient for feeding the normalization into the attention layer, which, as we’ve seen, does benefit from seeing the whole input as a single matrix.
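Here's a hedged sketch of that per-row normalization; the learned scale and shift vectors and the small epsilon term are standard layer-norm details that I'm assuming rather than taking from the formula above:

```python
import torch

n, d_model = 4, 8
X = torch.randn(n, d_model)

gamma = torch.ones(d_model)                 # assumed learned scale
beta  = torch.zeros(d_model)                # assumed learned shift
eps   = 1e-5                                # small constant for numerical stability

# Mean and variance are computed independently for each row (each token embedding).
mean = X.mean(dim=-1, keepdim=True)                    # (n, 1)
var  = ((X - mean) ** 2).mean(dim=-1, keepdim=True)    # (n, 1)
normalized = (X - mean) / torch.sqrt(var + eps) * gamma + beta   # (n, d_model)
```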
Batching¶
Up until now, we’ve been working with one input at a time. In practice, GPUs can process multiple inputs in parallel.
This doesn’t affect the learned parameters at all, just the activations. Basically, we just lift them into a tensor of one higher rank. Instead of representing the input as an matrix, we’ll represent it as a tensor.
The rest of the math is exactly the same. At the hardware level, this will result in the same operations (including the same weights) being applied to different inputs at the same time. GPUs are highly optimized for this.
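As a small sketch, the same multiplication by the learned weights works unchanged once a batch dimension is added in front; the sizes are illustrative:

```python
import torch

batch, n, d_model = 3, 4, 8
W_q = torch.randn(d_model, d_model)         # learned weights: unaffected by batching

X_single = torch.randn(n, d_model)          # one input: a rank-2 tensor
X_batch  = torch.randn(batch, n, d_model)   # several inputs at once: rank 3

# The same expression works for both; the library loops over the leading
# batch dimension for us.
Q_single = X_single @ W_q                   # (n, d_model)
Q_batch  = X_batch  @ W_q                   # (batch, n, d_model)
```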
The final architecture¶
Our LLM now has essentially the same architecture as before: the only real difference is that we’re treating the inputs not as a collection of vectors, but as a single matrix. Similarly, the output is an matrix. This lets us reformulate the operations we’ve already seen as matrix operations instead of logical loops, so we can compute them far more efficiently on GPUs.
This diagram elides some of the complication, especially in the attention layer (and specifically, its multi-head architecture, as described above).
That’s it! You have an LLM!
If someone were to provide you good values for all the weights throughout the architecture (and a lot of AWS credits ;-)), you’d have enough to build an LLM that would have been competitive in early 2020. You’re not about to take down OpenAI or Anthropic, but that’s still pretty neat!