Page Status: Pretty much ready ★★☆
Overview
As I’ve mentioned before, vectors are how LLMs encode the nuance of human language. So, the first thing we need to do is to turn each part of the text input into a vector. In the end, we’ll have one vector per token in the input text.
We’re going to process the input in three steps:
Tokenizing the input
Looking up embedding vectors for each token in the input
Combining those with positional embeddings to generate the input embeddings
Each of these steps is pretty simple, so if you read the following and think “I must be missing something”, you probably aren’t.
Tokenization
We start with the input text, which we parse into tokens — essentially, the atoms of the input text.
This doesn’t involve any AI: it’s basically just “a” → <1>, “aardvark” → <2>, etc.
I actually won’t cover the tokenization algorithm itself, because it’s not really an ML/AI topic (it was originally invented for compression!). Suffice it to say that the most common form of tokenization is byte pair encoding (BPE), which basically looks for words, common sub-words like the “de” in “demystify”, and punctuation. You can read more about it on Wikipedia if you want. OpenAI has a page that lets you see how text tokenizes: https://
The important thing to remember is that it’s not actually working with whole words, but with sub-words and punctuation: “we’ll retry!” is parsed as we 'll ret ry ! in GPT-3, for example.
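If you want to poke at this yourself, here’s a minimal sketch using OpenAI’s tiktoken library (my choice for illustration, not something this page depends on); r50k_base is the encoding GPT-3 used:

```python
# A peek at BPE in practice, using OpenAI's tiktoken library
# (pip install tiktoken); "r50k_base" is the encoding GPT-3 used.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

token_ids = enc.encode("we'll retry!")
print(token_ids)                             # a list of integer token IDs
print([enc.decode([t]) for t in token_ids])  # the sub-word piece behind each ID
```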
Token embeddings
All of the tokens our model knows about form its vocabulary, and each one is associated with a vector called the token embedding. This embedding’s values are learned parameters that encode what the LLM knows about that token. The size of each vector is a hyperparameter, often written d_model in the literature.
Every token has exactly one embedding that’s used throughout the model. If the token appears multiple times in the input, each one will use the same token embedding. (There’ll be other things, in particular the Self-attention described in the next chapter, to differentiate between input tokens.)
Since we’ve already tokenized the input, now we just need to create a vector of vectors: the outer vector has one entry per token in the input, and each inner vector is that token’s embedding:
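If it’s easier to see as code, here’s a rough NumPy sketch of that lookup. The sizes and token IDs below are made up, and the random numbers are just stand-ins for learned parameters:

```python
import numpy as np

# Made-up sizes; a real LLM's vocabulary size and embedding width are
# fixed choices made by its designers.
vocab_size = 50257   # how many tokens the model knows about
d_model = 8          # the size of each embedding vector (a hyperparameter)

# One learned row per token in the vocabulary.  Random numbers here are
# just standing in for learned parameters.
token_embeddings = np.random.randn(vocab_size, d_model)

# A made-up tokenized input: five token IDs.
token_ids = [1135, 1183, 1005, 563, 0]

# The "vector of vectors": one row per input token, each row being that
# token's embedding.  A repeated token ID would get an identical row.
input_token_embeddings = token_embeddings[token_ids]
print(input_token_embeddings.shape)   # (5, 8): one d_model-sized vector per token
```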
Adding positions to get to input embeddings
A word’s meaning may change depending on where in a sentence it appears. That could be because it has an entirely different meaning, and the different usages correlate with position; or it could have the same meaning, but with different nuance or tone. To capture this additional information, we’re going to add a positional embedding to each token in the input.
Just as we defined a unique embedding for each token in the vocabulary — “be” always has the same token embedding, for example — we’ll now define a unique embedding for each position. For example, the first token in an input always uses the same positional embedding, that of position 0. These embeddings are learned vectors, with the same dimension as the token embeddings.
For each token in the parsed text, we just sum its token embedding and positional embedding to get its input embedding:
(Note that I picked the token and positional embedding values so that it’d be easier to follow them through the flow. In an actual LLM, these would all be just random-looking numbers.)
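In code, continuing the same rough NumPy sketch as before (made-up sizes, random numbers standing in for learned parameters), that sum looks like this:

```python
import numpy as np

# Same made-up sizes as the earlier sketch; random numbers stand in for
# learned parameters.
vocab_size, max_positions, d_model = 50257, 1024, 8
token_embeddings = np.random.randn(vocab_size, d_model)
positional_embeddings = np.random.randn(max_positions, d_model)

token_ids = [1135, 1183, 1005, 563, 0]   # made-up tokenized input

# Input embedding = token embedding + the embedding for that token's position.
input_embeddings = (
    token_embeddings[token_ids]
    + positional_embeddings[: len(token_ids)]
)
print(input_embeddings.shape)   # (5, 8): one input embedding per input token
```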
Now we have the input tokenized, and each token translated into an input embedding. In the next chapter, I’ll show how the LLM contextualizes these embeddings relative to each other.