Page Status: Pretty much ready ★★☆
Overview¶
As I’ve mentioned before, vectors are how LLMs encode the nuance of human language. So, the first thing we need to do is to turn each part of the text input into a vector. In the end, we’ll have one vector per token in the input text.
We’re going to process the input in three steps:
Tokenizing the input
Looking up embedding vectors for each token in the input
Adding positional information to those embeddings to generate the input embeddings
Each of these steps is pretty simple, so if you read the following and think “I must be missing something”, you probably aren’t.
Tokenization¶
We start with the input text, which we parse into tokens — essentially, the atoms of the input text.
Tokenization doesn’t involve “real” AI: it’s basically just “a” → <1>, “aardvark” → <2>, and so on. The most common form of tokenization is byte-pair encoding (BPE), which basically looks for words, common sub-words like the “de” in “demystify”, and punctuation. OpenAI has a page that lets you see how text tokenizes: https://
BPE isn’t really an ML/AI topic: it was originally invented for compression. As such, feel free to skip the details if you like.
BPE details
BPE tokenization is relatively simple, at least in its unoptimized form. We start with two configurations:
a priority-ordered list of merge pairs (for example, [ B ] [ o ] → [ Bo ])
a mapping from token to ID (for example, [ Hi ] → 1)
Both of these configurations operate on bytes, not ASCII characters or Unicode code points (hence the name byte-pair encoding).
These configurations are generated during the LLM’s training. As with other training-related parameters, I won’t discuss how they’re generated; during inference, we just assume they’re provided.
At a high level, the BPE steps are:
Encode the incoming text as UTF-8
Merge byte sequences using the merge list
Convert the resulting sequences to IDs using the token mapping
Let’s look at each of these. To keep things simple, I’ll keep all the characters as ASCII, and represent them by their ASCII letters instead of bytes; so, ‘H’ instead of 0x48. Just remember that this is really operating on bytes, not characters.
UTF-8 encoding
This is just what it sounds like: we encode the text to bytes using UTF-8. We then treat each byte as a 1-byte sequence:
(In this example, all of the characters in “Hi Bob!” translate to single-byte UTF-8 sequences. If the input text had any multi-byte code points, we’d still treat each byte as a single-byte sequence. For example, “☃” would translate to three single-byte sequences: [ 0xE2 ] [ 0x98 ] [ 0x83 ].)
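This step can be sketched in a few lines of Python (a minimal illustration, not a real tokenizer’s implementation):

```python
def to_byte_sequences(text: str) -> list[bytes]:
    # Encode the text as UTF-8, then treat each byte as its own
    # 1-byte sequence.
    return [bytes([b]) for b in text.encode("utf-8")]

print(to_byte_sequences("Hi Bob!"))
# [b'H', b'i', b' ', b'B', b'o', b'b', b'!']

print(to_byte_sequences("☃"))  # a multi-byte code point
# [b'\xe2', b'\x98', b'\x83']
```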
Merge sequences
At this point, we have a list of byte sequences (so, a list of lists). Each of the inner lists has exactly 1 element, but that’s about to change.
Now, we go through the merge pairs in priority order. For each merge pair, we look for consecutive sequences that match that pair; if we find them, we merge them into a single sequence.
For example, if the merge list is:
[ B ] [ o ]
[ H ] [ i ]
[ Bo ] [ b ]

...then we’ll merge:

[ H ] [ i ] [ ] [ B ] [ o ] [ b ] [ ! ]
→ [ H ] [ i ] [ ] [ Bo ] [ b ] [ ! ]
→ [ Hi ] [ ] [ Bo ] [ b ] [ ! ]
→ [ Hi ] [ ] [ Bob ] [ ! ]
Finally, we’ll use the token mappings to convert each of these sequences to an ID.
For example, if the token mappings are:
[ Hi ] → 1
[ Bob ] → 2
[ ! ] → 3
[ ] → 4

...then we’ll map:

[ Hi ] [ ] [ Bob ] [ ! ] → 1 4 2 3
The only wrinkle is that as we go through the priority list (in step 2), we may create sequences whose merge pairs we already passed. For example, imagine if the merge pairings above had had a different priority:
[ Bo ] [ b ]
[ B ] [ o ]
[ H ] [ i ]

In this case, we’d do:

[ H ] [ i ] [ ] [ B ] [ o ] [ b ] [ ! ]
→ [ H ] [ i ] [ ] [ Bo ] [ b ] [ ! ]
→ [ Hi ] [ ] [ Bo ] [ b ] [ ! ]

...and [ Bo ] [ b ] would never merge, because we’d already passed that pair by the time it appeared.
To solve this, once we apply a merge pair, we reset to the top of the merge pairs list and scan again:

[ H ] [ i ] [ ] [ B ] [ o ] [ b ] [ ! ]
→ [ H ] [ i ] [ ] [ Bo ] [ b ] [ ! ] (merge [ B ] [ o ], then reset)
→ [ H ] [ i ] [ ] [ Bob ] [ ! ] (merge [ Bo ] [ b ], then reset)
→ [ Hi ] [ ] [ Bob ] [ ! ] (merge [ H ] [ i ])
There are optimization tricks we can do to make this more efficient, but in terms of the core logic, that’s it!
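The whole unoptimized flow can be sketched in Python. The merge list and token mapping below are the toy values from this page (a real tokenizer ships its own, learned during training), and the reset-to-the-top trick is implemented directly:

```python
# Toy configurations using this page's example values; a real model's
# tokenizer provides its own merge list and token mapping.
MERGES = [(b"B", b"o"), (b"H", b"i"), (b"Bo", b"b")]  # priority order
TOKEN_IDS = {b"Hi": 1, b"Bob": 2, b"!": 3, b" ": 4}

def bpe_encode(text: str, merges=MERGES, token_ids=TOKEN_IDS) -> list[int]:
    # Step 1: encode as UTF-8, one 1-byte sequence per byte.
    seqs = [bytes([b]) for b in text.encode("utf-8")]
    # Step 2: apply merge pairs in priority order. After any merge,
    # reset to the top of the list, since the newly created sequence
    # may complete a higher-priority pair we already passed.
    i = 0
    while i < len(merges):
        first, second = merges[i]
        merged = False
        j = 0
        while j < len(seqs) - 1:
            if seqs[j] == first and seqs[j + 1] == second:
                seqs[j:j + 2] = [first + second]  # merge in place
                merged = True
            else:
                j += 1
        i = 0 if merged else i + 1
    # Step 3: convert each sequence to its token ID.
    return [token_ids[s] for s in seqs]

print(bpe_encode("Hi Bob!"))  # [1, 4, 2, 3] — "Hi", " ", "Bob", "!"
```

Because of the reset, even the badly ordered merge list from the wrinkle above produces the same result.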
Token embeddings¶
All of the tokens our model knows about form its vocabulary, and each one is associated with a vector called the token embedding. This embedding is used throughout the model. If a token appears multiple times in the input, each occurrence uses the same token embedding. (There’ll be other things, in particular the self-attention described in the next chapter, to differentiate between input tokens.)
Since we’ve already tokenized the input, now we just need to create a vector of vectors: each outer vector corresponds to one token in the input, and the inner vector is that token’s embedding:
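As a sketch, the lookup is just indexing into a table. The embedding values and dimensions below are made up for illustration; a real model has tens of thousands of vocabulary entries and much wider vectors:

```python
# Hypothetical 4-dimensional embeddings for a tiny 5-token vocabulary.
TOKEN_EMBEDDINGS = [
    [0.00, 0.00, 0.00, 0.00],  # ID 0 (unused in this example)
    [0.10, 0.20, 0.30, 0.40],  # ID 1: "Hi"
    [0.50, 0.60, 0.70, 0.80],  # ID 2: "Bob"
    [0.90, 1.00, 1.10, 1.20],  # ID 3: "!"
    [1.30, 1.40, 1.50, 1.60],  # ID 4: " "
]

def lookup_embeddings(token_ids: list[int]) -> list[list[float]]:
    # One inner vector per input token; repeated tokens reuse the same row.
    return [TOKEN_EMBEDDINGS[t] for t in token_ids]

print(lookup_embeddings([1, 4, 2, 3]))  # vectors for "Hi", " ", "Bob", "!"
```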
Adding positions to get to input embeddings¶
A word’s meaning may change depending on where in a sentence it appears. That could be because it has an entirely different meaning, and the different usages correlate with position; or it could have the same meaning, but with different nuance or tone. To capture this additional information, we’re going to add a positional embedding to each input.
Just as we defined a unique embedding for each token in the vocabulary — “be” always has the same token embedding, for example — we’ll now define a unique embedding for each position. For example, the first token in an input always uses the same embedding, that of position 0. These embeddings are learned vectors, with the same dimension as the token embeddings.
For each token in the parsed text, we just sum its token embedding and positional embedding to get its input embedding:
(Note that I picked the token and positional embedding values so that it’d be easier to follow them through the flow. In an actual LLM, these would all be just random-looking numbers.)
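The sum itself is simple element-wise addition. Here’s a sketch with made-up values (both the positional embeddings and the token embeddings below are hypothetical):

```python
# Hypothetical learned positional embeddings, one per position, with the
# same dimension as the token embeddings.
POSITIONAL_EMBEDDINGS = [
    [0.01, 0.02, 0.03, 0.04],  # position 0
    [0.05, 0.06, 0.07, 0.08],  # position 1
]

def input_embeddings(token_embs: list[list[float]]) -> list[list[float]]:
    # Input embedding = token embedding + positional embedding, summed
    # element-wise based on where the token sits in the input.
    return [
        [t + p for t, p in zip(tok, POSITIONAL_EMBEDDINGS[pos])]
        for pos, tok in enumerate(token_embs)
    ]

# Two made-up token embeddings for a two-token input:
print(input_embeddings([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]))
```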
Now we have the input tokenized, and each token translated into an input embedding. In the next chapter, I’ll show how the LLM contextualizes these embeddings relative to each other.