In the previous chapter, I gave a very high-level overview of how LLMs work. In the next few chapters, I’ll describe this architecture in much more detail. I’ll cover each component of an LLM — what it does, why it’s needed, and the nitty-gritty math that drives it.

How I organize my thinking about LLMs

I mentioned in the introduction that I approach the pedagogy of LLMs from three perspectives:

  1. The fundamental concepts

  2. Algebraic reformulations of those concepts

  3. The actual implementation

Conceptual perspective

In the first perspective (the conceptual perspective), data flows through the LLM in the form of vectors. In particular, we’ll often work with vectors of vectors, like $\begin{bmatrix}\small\,\begin{bmatrix}1 \; 2 \; 3\end{bmatrix}\;\begin{bmatrix}4 \; 5 \; 6\end{bmatrix}\,\end{bmatrix}$. To transform these vectors, we’ll often use matrices.
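To make this concrete, here’s a minimal sketch in Python. The token vectors and the matrix below are made-up numbers, purely for illustration:

```python
import numpy as np

# The "vector of vectors" from above: one small vector per token.
tokens = [np.array([1.0, 2.0, 3.0]),
          np.array([4.0, 5.0, 6.0])]

# A made-up 3x3 matrix that transforms a single token vector.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])

# Conceptually, we transform each vector one at a time.
transformed = [W @ v for v in tokens]
print(transformed)  # [array([1., 4., 9.]), array([ 4., 10., 18.])]
```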

Algebraic reformulations

The second perspective (algebraic reformulations) batches the conceptual vectors into matrices, and then these matrices into tensors. The underlying concepts are exactly the same: the reformulations just let us represent the data in a way that GPUs can crunch more efficiently than a CPU can.
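Continuing the sketch above, the same made-up numbers can be batched: stack the token vectors into one matrix and apply a single matrix multiplication instead of a Python loop. This is only an illustration of the idea, not how any particular LLM lays out its data:

```python
import numpy as np

# Same made-up transformation matrix as before.
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])

# Stack the token vectors into a single 2x3 matrix...
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# ...and transform them all with one matrix multiplication.
# Same results as the per-vector loop, but one operation instead of many,
# which is the kind of work a GPU can parallelize.
transformed = X @ W.T
print(transformed)  # [[ 1.  4.  9.]
                    #  [ 4. 10. 18.]]
```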

This book’s approach

In the following chapters, I’ll explain LLMs in terms of that first perspective, the conceptual one. This will hopefully help you understand not just what’s going on, but what motivates each part of the architecture. Each major component will be a separate chapter.

Once I’ve described all of the components, I’ll spend one chapter describing how all the bits get boiled down to the algebraic reformulations.

Components of an LLM

An LLM consists of a few key components:

  • The tokenizer and embedding layer, which turn the input text into vectors that the LLM can reason about (remember the “dog” example from earlier)

  • Self-attention, which tells the LLM how those token vectors relate to each other (this is the main innovation of LLMs as compared to previous AI models)

  • A feedforward network (FFN) for processing the vectors

The self-attention and FFN together form a transformer block, and these blocks are the core of the LLM.

It’s fine if you don’t know what these terms mean. I’ll explain them as we go.
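To give a rough, preview-level sense of how these pieces fit together, here’s a deliberately tiny sketch in Python with random, made-up weights. Every detail (the exact formulas, the sizes, the residual additions) is a placeholder for things later chapters explain properly; the only point here is the overall shape: token vectors flow through a stack of blocks, each block being self-attention followed by a feedforward network.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4           # size of each token vector (tiny, just for illustration)
num_blocks = 2  # number of transformer blocks

# Made-up stand-ins for learned parameters (in a real model, training sets
# these, and each block has its own).
Wq, Wk, Wv = rng.normal(size=(3, d, d))
W1 = rng.normal(size=(d, 4 * d))
W2 = rng.normal(size=(4 * d, d))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    # Lets each token vector mix in information from the other token vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def feed_forward(X):
    # Processes each token vector independently.
    return np.maximum(0, X @ W1) @ W2

# Pretend a tokenizer and embedding layer already turned "The quick brown fox"
# into four vectors (here: random numbers).
X = rng.normal(size=(4, d))

for _ in range(num_blocks):
    X = X + self_attention(X)   # a transformer block: self-attention...
    X = X + feed_forward(X)     # ...followed by a feedforward network

print(X.shape)  # (4, 4): still one vector per token
# A real LLM would finish by turning these vectors into a probability
# distribution over its whole vocabulary.
```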

The output of all this is a probability distribution over every token (every “word”, very roughly) that the LLM knows about, representing how likely that token is to be the correct next token. The LLM then picks the most likely token, adds it to the text, and repeats the process with the new token added.

"The quick brown fox" flows into the LLM, which predicts the next word is "jumps". Then, "the quick brown fox jumps" flows into the LLM, and so on.

Hyperparameters, learned parameters, and activations

In addition to the components, it’s important to keep separate in your head the three kinds of data an LLM works with: hyperparameters, learned parameters and activations.

hyperparameter
A value decided by a human as part of the model’s design, which determines the structure of the model. Examples include how many hidden layers the feedforward network has and how big the input embeddings are. (Again, it’s fine if you don’t yet know what a hidden layer or input embedding is!)
learned parameter
A number learned during training and then fixed when the model is used. These parameters encode what the model has learned.
activation
A value computed from the user’s input and the learned parameters. This is what the model is figuring out about your prompt specifically.

By way of analogy, if an LLM were a simple equation like $y = kx^2$ for some fixed $k$, then:

  • the exponent 2 would be a hyperparameter: a human fixed it as part of the equation’s design.

  • $k$ would be a learned parameter: a value settled during training and then held fixed.

  • $y$ would be an activation: a value computed from the user’s input $x$ and the learned parameter $k$.
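To make the same distinction concrete in code, here’s a minimal sketch of the $y = kx^2$ analogy (the particular value of $k$ and the input are arbitrary):

```python
exponent = 2   # hyperparameter: a human fixed this as part of the model's design

k = 3.7        # learned parameter: in a real model, training would have set
               # this, and it stays fixed when the model is used

def model(x):
    y = k * x ** exponent   # activation: computed from the user's input x
    return y                # and the learned parameter k

print(model(5.0))  # 92.5
```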

As I introduce various parts of the LLM, I’ll be explicit about which kind of value each one is.