Backpropagation - Yuval's Intro to LLMs

Page Status: Draft ★☆☆

Basics are in place, but needs review.

Introduction¶

At the heart of our LLM’s training is backpropagation, or “backprop”. This is often described either in simple terms, as in Wikipedia’s introduction to the topic...

It is an efficient application of the chain rule to neural networks.

...or in complex math terms, as in the rest of that Wikipedia article. I’ll try to hit an in-between. In particular, this chapter will start with a quick refresher on some level-1 calculus (including what the chain rule is), and then work through how it applies to backprop.

At its core, backprop tries to answer a conceptually simple question:

We have a model with a bunch of learned parameters, each of which has some value.
We take some input for which we know the expected result, and we run that input through the model.
We compare the predicted result with the actual result, and notice that the two don’t quite match up.
Backprop then asks: how can we wiggle each parameter such that when we hold all the other parameters constant, but wiggle just that one parameter, the prediction will get closer to the expected value? (The actual mechanism is more efficient than literally wiggling each parameter and recomputing the prediction, as we’ll see below.)

The model has parameters a, b, c, ... n. After it makes a prediction, backprop wiggles a to get the answer closer, then b, then c, and so on.

To do this, we define a loss function, which takes the predicted and actual values, and compares them. We then apply a bunch of math, which does this wiggling. The loss function always produces a scalar (typically non-negative, so that zero represents a perfect prediction), and backprop will wiggle the model’s parameters to get the loss closer to zero.

To get an understanding of how backprop works, I’ll start exceedingly simple and build up from there:

Backprop on a single-layer, scalar model
Backprop on a multi-layer, but still scalar model
Backprop on a single-layer, matrix-based model
Backprop on a multi-layer, matrix-based model

By the last of those, we’ll have a “full” understanding of backprop. After that, the only difference between what we’ve built and a real model is that the real model is bigger.

The math you’ll need¶

This chapter assumes you’re decently familiar with derivatives; if you’re not, it may be tough. If you’re familiar with them but just need a quick refresher, the following sections should help. If you’re already comfortable with these, feel free to jump ahead to the meat of it.

Derivatives¶

If we have some function $y = f(x)$ , then its derivative $y' = f'(x)$ is how fast $y$ changes at any given point $x$ .

We can also express the derivative using what’s called Leibniz notation: $\frac{dy}{dx}$ . This notation makes it explicit that we’re differentiating with respect to $x$ .

To differentiate a polynomial, bring each exponent down as a factor and lower it by one:

\begin{array}{rccccl} y & = & ax^n & + & bx^m & + \, \dots \\[0.3em] & & \downarrow & & \downarrow & \\[0.3em] y' & = & n \; ax^{n-1} & + & m \; bx^{m-1} & + \, \dots \end{array}
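As a quick sanity check of that rule, here is a minimal Python sketch. The polynomial $y = 3x^2 + 2x^3$ and the helper names are hypothetical, chosen only for illustration; the sketch compares the hand-derived derivative against a numerical estimate.

```python
def poly(x):
    # y = 3x^2 + 2x^3 (coefficients and exponents are arbitrary example values)
    return 3 * x**2 + 2 * x**3

def poly_derivative(x):
    # Power rule: y' = (2)(3)x^1 + (3)(2)x^2
    return 6 * x + 6 * x**2

def numeric_derivative(f, x, h=1e-6):
    # Central difference: slope of f measured over a tiny interval around x
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.5
analytic = poly_derivative(x0)
numeric = numeric_derivative(poly, x0)
print(analytic, numeric)  # the two values should agree closely
```

The numerical estimate is only there as a cross-check; backprop itself relies on the analytic form.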

Chain rule¶

The chain rule lets you deconstruct a function that’s the composition of two functions — in other words, a function that takes the output of one function and passes it to another:

h(x) = z( \; y(x) \; )

To compute $h$ ’s derivative, we:

take $y$ ’s derivative at $x$ : $y'(x)$
take $z$ ’s derivative at $y(x)$ : $z'(y(x))$
multiply them:
$h'(x) = z'(y(x)) \, y'(x)$

$h'$ is the derivative of $z$ with respect to $x$ , so the Leibniz notation for that is:

h' \longleftrightarrow \frac{dz}{dx}

Using that, we can write the chain rule in a fraction-like way:

\def\t#1{\textit{\scriptsize #1}} \def\tt#1#2{\begin{array}{c}\t{#1}\\\t{#2}\end{array}} \begin{array}{ccccc} \frac{dz}{dx} & = & \frac{dz}{dy} & \cdot & \frac{dy}{dx} \\ \tt{``The derivative}{of z wrt x} & \t{is} & \tt{the derivative}{of z wrt y} & \t{times} & \tt{the derivative}{of y wrt x.''} \end{array}

Note that this isn’t actually a fraction; the Leibniz notation just illustrates (by way of analogy) how the $dy$ elements “cancel out”.

We can intuit why the chain rule works by going back to $h$ ’s definition: $z$ evaluated at $y(x)$ . So, to see how fast $h$ changes as $x$ changes, we take how fast $y(x)$ changes as $x$ changes, and multiply it by how fast $z$ changes as $y(x)$ changes.
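To make that concrete, here is a small Python sketch of the chain rule on a made-up composition. The functions $y(x) = 3x + 1$ and $z(u) = u^2$ are assumptions picked for illustration, and the result is compared against a numerical slope.

```python
def y(x):
    return 3 * x + 1      # inner function; y'(x) = 3

def z(u):
    return u ** 2         # outer function; z'(u) = 2u

def h(x):
    return z(y(x))        # the composition h(x) = z(y(x))

def h_prime(x):
    # Chain rule: z'(y(x)) * y'(x)
    return (2 * y(x)) * 3

x0 = 2.0
step = 1e-6
numeric = (h(x0 + step) - h(x0 - step)) / (2 * step)
print(h_prime(x0), numeric)  # both should be close to 42
```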

Partial derivatives¶

In the above sections, $y$ was defined in terms of a single variable, $x$ . But what if there are two variables, or more?

y = f(x, u, v, \dots)

To handle this, we use partial derivatives. The concept is simple: treat all but one of the variables as a constant, and then take the (ordinary) derivative with respect to that one remaining variable. The Leibniz notation for this is $\pdv{y}{x}$ if $x$ is the “with-respect-to” variable. We can define as many partial derivatives as there are variables:

\pdv{y}{x} \quad,\quad \pdv{y}{u} \quad,\quad \pdv{y}{v} \quad,\quad \dots
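The "hold everything else constant" idea can be sketched numerically. In this Python example, the function `f` and the evaluation point are hypothetical; `partial` wiggles exactly one argument and leaves the others alone.

```python
def f(x, u, v):
    # Hypothetical example function of three variables
    return x * u + v ** 2

def partial(func, args, index, h=1e-6):
    # Wiggle only args[index]; every other argument is held constant
    up = list(args)
    up[index] += h
    down = list(args)
    down[index] -= h
    return (func(*up) - func(*down)) / (2 * h)

point = (2.0, 3.0, 4.0)          # (x, u, v)
df_dx = partial(f, point, 0)     # analytically: u = 3
df_du = partial(f, point, 1)     # analytically: x = 2
df_dv = partial(f, point, 2)     # analytically: 2v = 8
print(df_dx, df_du, df_dv)
```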

Backprop on a simple, scalar model¶

Now that we have our math refreshed, let’s get to the fun stuff! To build our intuition for how backprop works, let’s start with the simplest possible model: a scalar, linear function:

y = ax + b

This is just a plain old line, like you learned about in middle school. We’re going to use machine learning to figure out its slope and $y$ -intercept. Our training data will be a bunch of $(x, y)$ pairs:

plot of data showing points more or less along a line

Since we’re assuming (as the model designers) that the points form a line $y = ax + b$ , our job will be to figure out $a$ and $b$ from the various $(x, y)$ data points. In other words, $a$ and $b$ are the model’s learned parameters.

Our first step is to define a loss function $L$ , which defines how wrong a given prediction is from the true value. A common one is mean squared error (MSE), which we’ll adapt for our scalar model:

L(x) = (y(x) - y_{true})^2

Our simple model will:

Take an $(x, y_{true})$ pair from the training data
Run $x$ through the model (with whatever $a$ and $b$ we currently have) to produce a prediction, $y_{pred}$
Calculate the loss $L = (y_{pred} - y_{true})^2$
Use the chain rule to compute the two partial derivatives, $\pdv{L}{a}$ and $\pdv{L}{b}$ . These give us the gradients for $a$ and $b$ (I’ll explain this in just a second)
Use the gradients to nudge $a$ and $b$ towards their true values

The gradient for each learned parameter ( $a$ and $b$ ) is the partial derivative of the loss function with respect to that parameter. In other words, it answers just the mechanical, mathematical question of “as that parameter grows, how fast does the loss grow?” Of course, we want the loss to shrink, since it represents how wrong the prediction was. So, we just nudge the parameter in the opposite direction of the gradient.

Visual representation of the steps described above

The first three steps in the list above are trivial (remember that in this example, “run $x$ through the model” is just $y_{pred} = ax + b$ ). Let’s focus on the fourth step, the chain rule.

We’ll focus on $a$ first. What we want is the partial derivative of the loss $L$ with respect to $a$ :

\dpdv{L}{a}

We can think of $L$ as a composed function $L(x) = ( \, y(x) \, - y_{true} )^2$ . That means we can use the chain rule:

\dpdv{L}{a} = \dpdv{L}{y} \cdot \dpdv{y}{a}

Let’s start by calculating the right term, $\pdv{y}{a}$ :

y(x) = ax + b \\[0.3em] \downarrow \\[0.3em] \pdv{y}{a} = x

Now the left term, $\pdv{L}{y}$ :

L(x) = (y(x) - y_{true})^2\\[0.3em] \downarrow \\[0.3em] \pdv{L}{y} = 2(y(x) - y_{true})

Putting it all together:

\begin{align} \dpdv{L}{a} & = \dpdv{L}{y} & \cdot & \; \dpdv{y}{a} \\[1em] & = 2(y(x) - y_{true}) & \cdot & \; x \end{align}

And here’s where the “efficient application of” starts to kick in: during our inference phase, we already calculated $y(x) = y_{pred}$ . If we just store that value during that forward pass, $\pdv{L}{a}$ becomes a trivial calculation: $y(x)$ comes from that stored lookup, and $x$ and $y_{true}$ were our given arguments. We call this value $a$ ’s gradient.

We can do the same thing to calculate $\pdv{L}{b}$ . I’ll go a bit faster, since it’s basically the same work.

\begin{array}{rccc} \pdv{L}{b} = & \underbrace{\pdv{L}{y}} & \cdot & \underbrace{\pdv{y}{b}} \\[1em] & \textit{\footnotesize same as $\pdv{L}{y}$ above} & & \footnotesize \pdv{}{b} (ax + b)\\[1.5em] = & 2(y(x) - y_{true}) & \cdot & 1 \end{array}

Notice that the left term is exactly the same as it was for $a$ ’s gradient.
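Here is a minimal Python sketch of both gradient calculations. The parameter values and the data point are arbitrary assumptions. It computes the shared left-hand term once, reuses it for both parameters, and then cross-checks the results by literally wiggling each parameter.

```python
def forward(a, b, x):
    return a * x + b

def loss(a, b, x, y_true):
    return (forward(a, b, x) - y_true) ** 2

def gradients(a, b, x, y_true):
    y_pred = forward(a, b, x)        # value stored during the forward pass
    dL_dy = 2 * (y_pred - y_true)    # shared left-hand term, computed once
    grad_a = dL_dy * x               # dy/da = x
    grad_b = dL_dy * 1               # dy/db = 1
    return grad_a, grad_b

a, b, x, y_true = 0.5, -1.0, 3.0, 4.0   # arbitrary example values
grad_a, grad_b = gradients(a, b, x, y_true)

# Cross-check: wiggle one parameter at a time and measure the change in loss
h = 1e-6
numeric_a = (loss(a + h, b, x, y_true) - loss(a - h, b, x, y_true)) / (2 * h)
numeric_b = (loss(a, b + h, x, y_true) - loss(a, b - h, x, y_true)) / (2 * h)
print(grad_a, numeric_a)
print(grad_b, numeric_b)
```

The wiggling version needs two extra loss evaluations per parameter, while the chain-rule version reuses one stored value for all of them.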

With that, we’ve calculated our two gradients, for $a$ and $b$ . Now we just apply each one to its respective parameter ( $a$ and $b$ ) to update them. As I mentioned before, we subtract the gradients, because we want to reduce the loss. Before we do that, we scale the gradients down by $\eta$ , which is a learning rate. This is some small number, like 0.01, and it means that each round of learning only nudges the values towards a 0-loss, instead of lurching them there. This prevents over-fitting any one data point, which can cause the model to overshoot and oscillate around the desired value, or worse, shoot off to infinity.

\eta = 0.01 \\[1.5em] a_{updated} = a - (\eta \; a_{gradient}) \\ b_{updated} = b - (\eta \; b_{gradient})

That’s all there is to it! If we churn this training through a large enough data set, $a$ and $b$ will eventually converge to the right values.
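Putting the whole loop together, here is a hedged Python sketch. The "true" line ($a = 2$, $b = -1$), the noise-free synthetic data, the zero starting values, and the number of passes over the data are all assumptions made for this example.

```python
import random

random.seed(0)

# Hypothetical "true" line used only to generate training data
TRUE_A, TRUE_B = 2.0, -1.0
data = []
for _ in range(200):
    x = random.uniform(-1.0, 1.0)
    data.append((x, TRUE_A * x + TRUE_B))   # noise-free, for simplicity

a, b = 0.0, 0.0   # arbitrary starting parameters
eta = 0.01        # learning rate

for _ in range(50):                     # several passes over the data
    for x, y_true in data:
        y_pred = a * x + b              # forward pass
        dL_dy = 2 * (y_pred - y_true)   # shared left-hand term
        a_gradient = dL_dy * x
        b_gradient = dL_dy * 1
        a = a - eta * a_gradient        # step against the gradient
        b = b - eta * b_gradient

print(a, b)   # should end up very close to TRUE_A and TRUE_B
```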

Terminology: residual and local derivative¶

Before we go further, let’s introduce two useful names for the concepts we’ve already learned. Remember that the gradients for $a$ and $b$ each used the chain rule, and in both cases their left-hand term was the same:

\begin{align} \pdv{L}{a} & = 2(y(x) - y_{true}) & \cdot & \; x \\[1em] \pdv{L}{b} & = 2(y(x) - y_{true}) & \cdot & \; 1 \end{align}

Let’s ask where these various terms come from, and do so within the framing of the layer that contains $a$ and $b$ (that is, $y = ax + b$ ).

$2(y(x) - y_{true})$ comes purely from the layer below us ( $\, L(v) = (v - y_{true})^2 \,$ , where $y_{true}$ can be thought of as a constant)
- $v$ got stored during forward inference
- The fact that we need to $2 \times$ the value is due to the derivative of $L$ — irrespective of what anything else in the model is doing.
The $x$ and 1 each come from partial derivatives local to the $y$ layer. Again, these only depend on the $y$ layer, irrespective of what anything else is doing.

The distinction between information coming from the layer below, and information computed at this layer, is reflected in terminology:

The residual is the left-hand term in the chain rule: the signal from the layer below
The local derivatives are the right-hand term in the chain rule: the partial derivatives applied at this layer

We can think of this for any parameter $p$ as:

\begin{array}{lll} \dpdv{L}{p} & = \text{(signal from lower level)} & \cdot \; \text{(partial derivative of $p$)} \\[0.5em] & = \text{(residual)} & \cdot \; \text{(local derivative)} \\[1em] & = \boxed{r \cdot \dpdv{y}{p}} & \end{array}

...where:

$p$ is a parameter defined at layer $y$
$r$ is the residual, which comes from the layer below $y$

Note that $r$ isn’t an equation, but an actual, concrete value. Each layer gets this value, and then uses it as-is for all of that layer’s parameters. This is a lot of what’s behind the “efficient” in “efficient application of the chain rule”.

The lowest layer, $L$ , is a special case: it doesn’t have a lower layer to provide a residual, so we need to calculate it by figuring out its derivative and plugging in $y_{pred}$ and $y_{true}$ .

Backprop on a multi-layer, scalar model¶

Now that we have backprop working on a single-layer model, let’s add a second layer. For now, we won’t have an activation function between the two:

y_1 = a_1x + b_1 \\ y_2 = a_2 (y_1) + b_2

We’ll use the same loss function as before:

L(x) = (y_2(x) - y_{true})^2

Let’s start by keeping in mind our objectives:

We want to figure out how much to nudge $a_1$ , $b_1$ , $a_2$ , and $b_2$ .
To do that, we need to calculate their four gradients.
Each gradient is a partial derivative: $\pdv{L}{a_1}$ , $\pdv{L}{b_1}$ , $\pdv{L}{a_2}$ , $\pdv{L}{b_2}$ .

We’ll start at the bottom of the model, the layer closest to $L$ : $y_2$ . This means we’ll be calculating the gradients for $a_2$ and $b_2$ , which are $\pdv{L}{a_2}$ and $\pdv{L}{b_2}$ . Let’s start with $a_2$ . As before, we’ll use the chain rule:

L(x) = (y_2(x) - y_{true})^2 \\[0.3em] \downarrow \\[0.3em] \pdv{L}{a_2} = \pdv{L}{y_2} \cdot \pdv{y_2}{a_2}

This turns out to be almost the same as the single-layer example above: add a $_2$ subscript to $y$ , $a$ , and $b$ , and note that this layer’s input is $y_1$ rather than $x$ :

\pdv{L}{a_2} = 2(y_2(x) - y_{true}) \cdot y_1 \\[0.3em] \pdv{L}{b_2} = 2(y_2(x) - y_{true}) \cdot 1

So far, this is all just a review of the previous two sections. Now comes the new wrinkle: calculating the gradients for the $y_1$ layer.

There are two ways to approach this: by working everything out piece by piece, or by relying on the residual-based pattern we established in the previous section. I’m not sure which is more helpful, so I’ll provide both. If one doesn’t make sense, try the other!

Note the general pattern: we start at the bottom of the model, and then work our way up, with each layer providing the residual for the one before it. This is the “back” in “backpropagation.”
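One way to sketch the full two-layer calculation in code, assuming the residual pattern from the previous section: the $y_2$ layer multiplies its incoming residual by its local derivative with respect to its input ( $\pdv{y_2}{y_1} = a_2$ ) and hands the result up to the $y_1$ layer. The parameter values below are arbitrary, and the function names are hypothetical.

```python
def forward(params, x):
    a1, b1, a2, b2 = params
    y1 = a1 * x + b1
    y2 = a2 * y1 + b2
    return y1, y2

def loss(params, x, y_true):
    _, y2 = forward(params, x)
    return (y2 - y_true) ** 2

def backward(params, x, y_true):
    a1, b1, a2, b2 = params
    y1, y2 = forward(params, x)    # values stored during the forward pass
    r2 = 2 * (y2 - y_true)         # residual arriving at the y_2 layer
    grad_a2 = r2 * y1              # local derivative dy2/da2 = y1
    grad_b2 = r2 * 1               # local derivative dy2/db2 = 1
    r1 = r2 * a2                   # residual passed up: dy2/dy1 = a2
    grad_a1 = r1 * x               # local derivative dy1/da1 = x
    grad_b1 = r1 * 1               # local derivative dy1/db1 = 1
    return [grad_a1, grad_b1, grad_a2, grad_b2]

params = [0.5, 1.0, -2.0, 0.25]    # a1, b1, a2, b2 (arbitrary example values)
x, y_true = 3.0, 1.0
analytic = backward(params, x, y_true)

def numeric_grad(index, h=1e-6):
    # Wiggle a single parameter and measure the change in loss
    up = list(params)
    up[index] += h
    down = list(params)
    down[index] -= h
    return (loss(up, x, y_true) - loss(down, x, y_true)) / (2 * h)

numeric = [numeric_grad(i) for i in range(4)]
print(analytic)
print(numeric)
```

Note that `r2` is computed once and reused for three things: both of the $y_2$ layer's gradients, and the residual for the layer above.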

Adding an activation function¶

Now that we have our nice two-layer model, we’ll want to add an activation function:

y_1 = a_1x + b_1 \\ y_2 = \text{GELU}(y_1) \\ y_3 = a_3 (y_2) + b_3

But instead of getting straight into that, I want to revisit the one-layer model. This will seem like a (possibly pedantic) side discussion, but I promise, it’ll get us to the activation-enabled model.

Let’s rewrite our one-layer model, with the loss function — but this time expand it into individual operations:

\begin{align} & y = ax + b \\ & L = (y - y_{true})^2 \\[0.3em] & \downarrow \\[0.3em] & y_1 = ax \\ & y_2 = y_1 + b \\ & y_3 = y_2 - y_{true} \\ & y_4 = {y_3}^2 \end{align}

Just like I described in the previous section, we’ll work from the bottom of the model up: each layer will calculate the gradients for its parameter (if it has one), and then calculate its local derivative with respect to its input and pass that residual up to that input.

\begin{array}{l|l|l|l|l} \textbf{Layer} & \textbf{Local deriv.} & \textbf{Incoming} & \textbf{Outgoing} & \textbf{Gradient} \\ & \textbf{wrt input} & \textbf{residual} & \textbf{residual} & \\ \hline y_4 = {y_3}^2 & 2 \, y_3 & \text{---} & r_4 = \underline{2 \, y_3} & \text{---} \\ y_3 = y_2-y_{true} & 1 & r_4 & r_3 = r_4 \cdot \underline{1} & \text{---} \\ y_2 = y_1+b & 1 & r_3 & r_2 = r_3 \cdot \underline{1} & \ipdv{L}{b} = r_3 \cdot 1 \\ y_1 = a \cdot x & a & r_2 & r_1 = r_2 \cdot \underline{a} & \ipdv{L}{a} = r_2 \cdot x \\ \end{array}

Hopefully this table drives home everything from the above. You can see that:

Each layer $y_n$ calculates its outgoing residual as the product of its incoming residual and its local derivatives $\ipdv{(y_n)}{(y_{n-1})}$
For the layers that have learned parameters, those layers separately calculate their gradients as the product of the incoming residual and $\ipdv{(y_n)}{p}$ , where $p$ is the layer’s learned parameter.
Crucially, note that some layers don’t calculate any gradient, and that’s fine. They still calculate an outgoing residual, and that’s all the rest of the layers need.
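The table translates almost line for line into code. This is a minimal Python sketch with arbitrary input values; the function name `backward_table` is hypothetical, and the variable names mirror the table's rows.

```python
def backward_table(a, b, x, y_true):
    # Forward pass: one operation per layer, storing every intermediate
    y1 = a * x
    y2 = y1 + b
    y3 = y2 - y_true
    y4 = y3 ** 2          # the loss

    # Backward pass: walk the table from the top row (y4) to the bottom (y1)
    r4 = 2 * y3           # y4 row: no incoming residual, just the local derivative
    r3 = r4 * 1           # y3 row: local derivative wrt y2 is 1
    grad_b = r3 * 1       # y2 row: gradient uses the incoming residual r3
    r2 = r3 * 1           # y2 row: local derivative wrt y1 is 1
    grad_a = r2 * x       # y1 row: gradient uses the incoming residual r2
    r1 = r2 * a           # y1 row: residual wrt x (nothing above uses it here)
    return y4, grad_a, grad_b

result = backward_table(a=0.5, b=-1.0, x=3.0, y_true=4.0)
print(result)
```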

What would happen if we slid a function $f$ between $y_1$ and $y_2$ ? This would correspond to a model:

\begin{array}{cl} & y = f(ax) + b \\[0.3em] & \downarrow \\[0.3em] & y_1 = ax \\ \bigstar & y_f = f(y_1) \\ & y_2 = y_f + b \\ & y_3 = y_2 - y_{true} \\ & y_4 = {y_3}^2 \end{array}

I’ve named the layer $y_f$ so that most of the other layers don’t have to change; you can see that other than $y_1$ referring to the new residual $r_f$ , everything else is exactly the same.

Note that in a real LLM, the activation function doesn’t go between the weight and the bias, as I’ve done here. But as we’re about to see, the whole point is that backprop doesn’t actually care where it goes. In fact, as far as backprop is concerned, there’s no such thing as a layer at all: everything is just a sequence of operations. Here’s the same table as above, this time with one more layer for $y_f$ :

\begin{array}{cl|l|l|l|l} & \textbf{Layer} & \textbf{Local deriv.} & \textbf{Incoming} & \textbf{Outgoing} & \textbf{Gradient} \\ & & \textbf{wrt input} & \textbf{residual} & \textbf{residual} & \\ \hline & y_4 = {y_3}^2 & 2 \, y_3 & \text{---} & r_4 = \underline{2 \, y_3} & \text{---} \\ & y_3 = y_2-y_{true} & 1 & r_4 & r_3 = r_4 \cdot \underline{1} & \text{---} \\ & y_2 = y_f+b & 1 & r_3 & r_2 = r_3 \cdot \underline{1} & \ipdv{L}{b} = r_3 \cdot 1 \\ \bigstar & y_f = f(y_1) & f'(y_1) & r_2 & r_f = r_2 \cdot \underline{f'(y_1)} & \text{---} \\ & y_1 = a \cdot x & a & r_f & r_1 = r_f \cdot \underline{a} & \ipdv{L}{a} = r_f \cdot x \\ \end{array}

And that’s it! If $f$ is our activation function — GELU, ReLU, or anything else — then that’s all there is to it. The machinery needs to know what the activation function’s local derivative with respect to its input is, but the implementation can just hard-code that. This also applies to any other operation in the real LLM, like softmax.

The actual equations for these derivatives can be a handful. For example, the derivative of GELU (in its common tanh approximation) is $0.5(1 + \tanh(u)) + 0.5x \cdot \text{sech}^2(u) \cdot \sqrt{\frac{2}{\pi}}(1 + 0.134145x^2)$ , where $u = \sqrt{\frac{2}{\pi}}(x + 0.044715x^3)$ . But the implementation just hard-codes them, as it does for the functions’ forward-pass definition.
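As an illustration of what "hard-coding" might look like, here is a Python sketch of GELU and its derivative. It assumes the common tanh approximation of GELU, with $u = \sqrt{2/\pi}\,(x + 0.044715x^3)$ ; the function names are hypothetical, and the derivative is cross-checked against a numerical slope.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def gelu(x):
    # Forward pass (tanh approximation, assumed for this sketch)
    u = SQRT_2_OVER_PI * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(u))

def gelu_derivative(x):
    # Hard-coded local derivative of the approximation above
    u = SQRT_2_OVER_PI * (x + 0.044715 * x**3)
    tanh_u = math.tanh(u)
    sech2_u = 1.0 - tanh_u**2          # sech^2(u) = 1 - tanh^2(u)
    du_dx = SQRT_2_OVER_PI * (1.0 + 0.134145 * x**2)
    return 0.5 * (1.0 + tanh_u) + 0.5 * x * sech2_u * du_dx

def activation_backward(r_in, y1):
    # Outgoing residual: incoming residual times the local derivative at y1
    return r_in * gelu_derivative(y1)

# Cross-check the hard-coded derivative against a numerical slope
h = 1e-5
max_error = 0.0
for x in [-2.0, -0.5, 0.0, 0.7, 3.0]:
    numeric = (gelu(x + h) - gelu(x - h)) / (2 * h)
    max_error = max(max_error, abs(numeric - gelu_derivative(x)))
print(max_error)   # should be tiny
```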

Using vectors instead of scalars¶

Until now, our simple backprop model has used scalars. In an LLM, of course, everything is a matrix.

Vectors? Tensors?

For purposes of calculating derivatives in backprop, we’ll treat everything as a matrix.

We’ll treat $n$ -vectors as $(n \times 1)$ matrices
We’ll treat higher-ranked tensors as having a bunch of batching dimensions, and then two matrix dimensions at the end.

The concepts are exactly the same as above: the only difference is that when we take the [partial] derivative of functions, those functions will have matrices as their inputs and parameters, instead of scalars. Likewise, the result will be a matrix:

\begin{array}{ccccc} y_1 & = & A & \times & X \\ \mm{n}{v} & & \mm{n}{m} & & \mm{m}{v} \end{array}

As with the scalar model, each layer (that is, operation) in the matrix-based model will have to do two things:

Compute its outgoing residual, which is its incoming residual combined with the layer’s local derivative with respect to its input.
- This residual will be used by the layer above this layer as that layer’s incoming residual, so it needs to be the same shape as this layer’s input.
Compute its learned parameter’s gradient, if any.
- This will eventually be subtracted from the learned parameter, so it needs to be of the same shape as that parameter.

At the core of it, that’s all we need to know for now. Just to round it out, let’s look back at the $a$ -layer of our scalar model:

\begin{align} y(x) & = ax \\[0.3em] & \downarrow \\[0.3em] \text{gradient}_a & = r_{\text{in}} \cdot x \\[0.3em] r_{\text{out}} & = r_{\text{in}} \cdot \pdv{}{x}(ax) \end{align}

It’s going to be almost the same for the matrix version, with one gotcha on the gradient:

\begin{align} y(X) & = AX \\[0.3em] & \downarrow \\[0.3em] \text{gradient}_A & = r_{\text{in}} \cdot X^T \\[0.3em] r_{\text{out}} & = \underbrace{\pdv{}{X} (AX)}_{\text{\scriptstyle more on this below}} \\ \end{align}

Let’s look at the gradient. Where did that transpose come from? Without getting too into the weeds of matrix derivatives, let’s at least look at the shapes of the matrices.

Let’s say $A$ is an $(n \times m)$ matrix.
This means our layer (in the forward, inference pass) is:
$\begin{array}{ccccc} y & = & A & \, & X \\ \mm{n}{v} & = & \mm{n}{m} & & \mm{m}{v} \end{array}$
...for some dimension $v$ .
We know $\textit{gradient}_A$ also has to be $(n \times m)$ , as mentioned above.
We know $r_{\text{in}}$ has the same shape as this layer’s output, which is $(n \times v)$ .

So we have:

\begin{array}{ccccc} \text{gradient}_A & = & r_{\text{in}} & \, & \unknown \\ \mm{n}{m} & & \mm{n}{v} & & \mm{?}{?} \end{array}

In order for the matrix multiplication to work, the unknown bit must have shape $(v \times m)$ . And wouldn’t you know it, that’s exactly the shape of $X^T$ .

We’re just randomly transposing now?

If you’re like me, adding that transpose feels a bit like cheating. Like, yes, it makes the matrix math shape work, but we can’t just add math operations willy-nilly just because it’s convenient! After all, the gradient comes from the chain rule, and we got that definition right at the top of this chapter:

\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}

There are actually two things going on here.

First, the scalar chain rule above is a simplification: The full, general version involves something called a Jacobian matrix, and is more complex.
Second, the shape argument we just made only tells us that the mystery factor must be $(v \times m)$ . That happens to be the same shape as $X^T$ , so that’s a plausible guess as to what the factor might be; but the shape isn’t the full proof or reasoning for why $X^T$ is the right value. For that, you’d need to work through the full Jacobian derivation, which I don’t know how to do, and is out of scope for this book.

What I can say is that the answer was to transpose one of the terms. We just didn’t see it in the scalar case, because in the generalized, Jacobian world, scalars are just $(1 \times 1)$ matrices. The transpose of a $(1 \times 1)$ matrix is just the matrix itself, so when we write the chain rule in scalar-land, we can omit the transpose; but it was always there, hiding.

Note that even though most of backprop works against matrices, the loss function still produces a scalar. The derivative of this function with respect to its input is a matrix of the same shape as that input. This becomes the first residual, and from there everything works as above.
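While this doesn't replace the Jacobian derivation, a numerical check can at least build confidence that $r_{\text{in}} \cdot X^T$ has the right values and not just the right shape. The Python sketch below makes several assumptions: a sum-of-squared-errors loss, small hand-picked matrices, and hypothetical helper names (`matmul`, `transpose`). It uses plain lists so it runs without any libraries.

```python
def matmul(P, Q):
    # Plain-Python matrix product of P (r x k) and Q (k x c)
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(M):
    return [list(row) for row in zip(*M)]

def loss(A, X, Y_true):
    # Scalar loss: sum of squared errors over every element (an assumption)
    Y = matmul(A, X)
    return sum((Y[i][j] - Y_true[i][j]) ** 2
               for i in range(len(Y)) for j in range(len(Y[0])))

A = [[1.0, 2.0, 0.5],
     [0.0, -1.0, 3.0]]                 # (n x m) = (2 x 3)
X = [[1.0, 0.0, 2.0, -1.0],
     [0.5, 1.0, 0.0, 2.0],
     [-2.0, 1.0, 1.0, 0.0]]            # (m x v) = (3 x 4)
Y_true = [[0.0] * 4 for _ in range(2)] # (n x v) = (2 x 4)

# First residual: derivative of the scalar loss wrt the layer output, (n x v)
Y = matmul(A, X)
r_in = [[2 * (Y[i][j] - Y_true[i][j]) for j in range(4)] for i in range(2)]

# Gradient for A: (n x v) times (v x m) gives (n x m), the same shape as A
grad_A = matmul(r_in, transpose(X))

def numeric_grad_A(i, j, h=1e-6):
    # Wiggle a single element of A and measure the change in loss
    up = [row[:] for row in A]
    up[i][j] += h
    down = [row[:] for row in A]
    down[i][j] -= h
    return (loss(up, X, Y_true) - loss(down, X, Y_true)) / (2 * h)

max_error = max(abs(grad_A[i][j] - numeric_grad_A(i, j))
                for i in range(2) for j in range(3))
print(max_error)   # should be tiny
```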

Expanding to computation graphs¶

In all of the above, our model was a single sequence of operations:

\begin{align} & y_1 = ax \\ & y_2 = y_1 + b \\ & y_3 = y_2 - y_{true} \\ & y_4 = {y_3}^2 \end{align}

But if you’ll recall, our LLM has two ways in which operations branch off and reconnect.

Residual connections (unrelated to the “residuals” in this chapter):
We take a layer $h_1$ , perform some operations on it to produce $h_2$ , and then add the original layer to get $h_3 = h_1 + h_2$ .
Attention layer:
We took our input, applied it to each of the $W_q$ , $W_k$ , $W_v$ matrices, and then recombined them via the attention operation.

In both cases, this branching and rejoining forms a graph, not a simple sequence. For example, residual connections look like:

Layer 1 flows into layer 2, which flows into layer 3. Layer 1 also flows into layer 3 directly.