Page Status: Pretty much ready ★★☆
Overview¶
As I’ve mentioned before, vectors are how LLMs encode the nuance of human language. So, the first thing we need to do is to turn each part of the text input into a vector. In the end, we’ll have one vector per token in the input text.
We’re going to process the input in three steps:
Tokenizing the input
Looking up embedding vectors for each token in the input
Adding positional information to those embeddings to generate the input embeddings
Each of these steps is pretty simple, so if you read the following and think “I must be missing something”, you probably aren’t.
Tokenization¶
We start with the input text, which we parse into tokens — essentially, the atoms of the input text.
Tokenization doesn’t involve “real” AI: it’s basically just “a” → <1>, “aardvark” → <2>, and so on. The most common form of tokenization is byte-pair encoding (BPE), which basically looks for words, common sub-words like the “de” in “demystify”, and punctuation. OpenAI has a page that lets you see how text tokenizes: https://
BPE isn’t really an ML/AI topic: it was originally invented for compression. As such, feel free to skip the details if you like.
BPE details
BPE tokenization is relatively simple, at least in its unoptimized form. We start with two configurations:
a priority-ordered list of merge pairs (for example, [ B ] [ o ] → [ Bo ])
a mapping from token to ID (for example, [ Hi ] → 1)
Both of these configurations operate on bytes, not ASCII characters or Unicode code points (hence the name byte-pair encoding).
These configurations are generated during the LLM’s training. As with other training-related parameters, I won’t discuss how they’re generated; during inference, we just assume they’re provided.
At a high level, the BPE steps are:
Encode the incoming text as UTF-8
Merge byte sequences using the merge list
Convert the resulting sequences to IDs using the token mapping
Let’s look at each of these. To keep things simple, I’ll keep all the characters as ASCII, and represent them by their ASCII letters instead of bytes; so, ‘H’ instead of 0x48. Just remember that this is really operating on bytes, not characters.
UTF-8 encoding
This is just what it sounds like: we encode the text to bytes using UTF-8. We then treat each byte as a 1-byte sequence:
(In this example, all of the characters in “Hi Bob!” translate to single-byte UTF-8 sequences. If the input text had any multi-byte code points, we’d still treat each byte as a single-byte sequence. For example, “☃” would translate to three single-byte sequences: [ 0xE2 ] [ 0x98 ] [ 0x83 ].)
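This step can be sketched in a few lines of Python (a minimal illustration, not a real tokenizer’s implementation):

```python
def to_byte_sequences(text: str) -> list[bytes]:
    # Encode the text as UTF-8, then treat each byte as its own
    # 1-byte sequence.
    return [bytes([b]) for b in text.encode("utf-8")]

print(to_byte_sequences("Hi Bob!"))
# [b'H', b'i', b' ', b'B', b'o', b'b', b'!']

print(to_byte_sequences("☃"))  # a multi-byte code point
# [b'\xe2', b'\x98', b'\x83']
```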
Merge sequences
At this point, we have a list of byte sequences (so, a list of lists). Each of the inner lists has exactly 1 element, but that’s about to change.
Now, we go through the merge pairs in priority order. For each merge pair, we look for consecutive sequences that match that pair; if we find them, we merge them into a single sequence.
For example, if the merge list is:
[ B ] [ o ]
[ H ] [ i ]
[ Bo ] [ b ]

...then we’ll merge:

[ H ] [ i ] [ ] [ B ] [ o ] [ b ] [ ! ]
→ [ H ] [ i ] [ ] [ Bo ] [ b ] [ ! ]
→ [ Hi ] [ ] [ Bo ] [ b ] [ ! ]
→ [ Hi ] [ ] [ Bob ] [ ! ]
Finally, we’ll use the token mappings to convert each of these sequences to an ID.
For example, if the token mappings are:
[ Hi ] → 1
[ Bob ] → 2
[ ! ] → 3
[ ] → 4

...then we’ll map:

[ Hi ] [ ] [ Bob ] [ ! ] → 1 4 2 3
The only wrinkle is that as we go through the priority list (in step 2), we may create sequences whose merge pairs we already passed. For example, imagine if the merge pairings above had had a different priority:
[ Bo ] [ b ]
[ B ] [ o ]
[ H ] [ i ]

In this case, we’d do:

[ H ] [ i ] [ ] [ B ] [ o ] [ b ] [ ! ]
→ [ H ] [ i ] [ ] [ Bo ] [ b ] [ ! ]
→ [ Hi ] [ ] [ Bo ] [ b ] [ ! ]

...and [ Bo ] [ b ] would never merge, because we’d already passed that pair by the time it appeared.
To solve this, once we apply a merge pair, we reset to the top of the merge pairs list and scan again:

[ H ] [ i ] [ ] [ B ] [ o ] [ b ] [ ! ]
→ [ H ] [ i ] [ ] [ Bo ] [ b ] [ ! ] (merge [ B ] [ o ], then reset)
→ [ H ] [ i ] [ ] [ Bob ] [ ! ] (merge [ Bo ] [ b ], then reset)
→ [ Hi ] [ ] [ Bob ] [ ! ] (merge [ H ] [ i ])
There are optimization tricks we can do to make this more efficient, but in terms of the core logic, that’s it!
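The whole unoptimized flow can be sketched in Python. The merge list and token mapping below are the toy values from this page (a real tokenizer ships its own, learned during training), and the reset-to-the-top trick is implemented directly:

```python
# Toy configurations using this page's example values; a real model's
# tokenizer provides its own merge list and token mapping.
MERGES = [(b"B", b"o"), (b"H", b"i"), (b"Bo", b"b")]  # priority order
TOKEN_IDS = {b"Hi": 1, b"Bob": 2, b"!": 3, b" ": 4}

def bpe_encode(text: str, merges=MERGES, token_ids=TOKEN_IDS) -> list[int]:
    # Step 1: encode as UTF-8, one 1-byte sequence per byte.
    seqs = [bytes([b]) for b in text.encode("utf-8")]
    # Step 2: apply merge pairs in priority order. After any merge,
    # reset to the top of the list, since the newly created sequence
    # may complete a higher-priority pair we already passed.
    i = 0
    while i < len(merges):
        first, second = merges[i]
        merged = False
        j = 0
        while j < len(seqs) - 1:
            if seqs[j] == first and seqs[j + 1] == second:
                seqs[j:j + 2] = [first + second]  # merge in place
                merged = True
            else:
                j += 1
        i = 0 if merged else i + 1
    # Step 3: convert each sequence to its token ID.
    return [token_ids[s] for s in seqs]

print(bpe_encode("Hi Bob!"))  # [1, 4, 2, 3] — "Hi", " ", "Bob", "!"
```

Because of the reset, even the badly ordered merge list from the wrinkle above produces the same result.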
Token embeddings¶
All of the tokens our model knows about form its vocabulary, and each one is associated with a vector called the token embedding. This embedding is used throughout the model. If a token appears multiple times in the input, each occurrence uses the same token embedding. (There’ll be other things, in particular the self-attention described in the next chapter, to differentiate between input tokens.)
Since we’ve already tokenized the input, now we just need to create a vector of vectors: each outer vector corresponds to one token in the input, and the inner vector is that token’s embedding:
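As a sketch, the lookup is just indexing into a table. The embedding values and dimensions below are made up for illustration; a real model has tens of thousands of vocabulary entries and much wider vectors:

```python
# Hypothetical 4-dimensional embeddings for a tiny 5-token vocabulary.
TOKEN_EMBEDDINGS = [
    [0.00, 0.00, 0.00, 0.00],  # ID 0 (unused in this example)
    [0.10, 0.20, 0.30, 0.40],  # ID 1: "Hi"
    [0.50, 0.60, 0.70, 0.80],  # ID 2: "Bob"
    [0.90, 1.00, 1.10, 1.20],  # ID 3: "!"
    [1.30, 1.40, 1.50, 1.60],  # ID 4: " "
]

def lookup_embeddings(token_ids: list[int]) -> list[list[float]]:
    # One inner vector per input token; repeated tokens reuse the same row.
    return [TOKEN_EMBEDDINGS[t] for t in token_ids]

print(lookup_embeddings([1, 4, 2, 3]))  # vectors for "Hi", " ", "Bob", "!"
```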
Adding positions to get to input embeddings¶
A word’s meaning may change depending on where in a sentence it appears. That could be because it has an entirely different meaning, and the different usages correlate with position; or it could have the same meaning, but with different nuance or tone. To capture this additional information, we’re going to add a positional embedding to each input.
Just as we defined a unique embedding for each token in the vocabulary — “be” always has the same token embedding, for example — we’ll now define a unique embedding for each position. For example, the first token in an input always uses the same embedding, that of position 0. These embeddings are learned vectors, with the same dimension as the token embeddings.
For each token in the parsed text, we just sum its token embedding and positional embedding to get its input embedding:
(Note that I picked the token and positional embedding values so that it’d be easier to follow them through the flow. In an actual LLM, these would all be just random-looking numbers.)
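The sum itself is simple element-wise addition. Here’s a sketch with made-up values (both the positional embeddings and the token embeddings below are hypothetical):

```python
# Hypothetical learned positional embeddings, one per position, with the
# same dimension as the token embeddings.
POSITIONAL_EMBEDDINGS = [
    [0.01, 0.02, 0.03, 0.04],  # position 0
    [0.05, 0.06, 0.07, 0.08],  # position 1
]

def input_embeddings(token_embs: list[list[float]]) -> list[list[float]]:
    # Input embedding = token embedding + positional embedding, summed
    # element-wise based on where the token sits in the input.
    return [
        [t + p for t, p in zip(tok, POSITIONAL_EMBEDDINGS[pos])]
        for pos, tok in enumerate(token_embs)
    ]

# Two made-up token embeddings for a two-token input:
print(input_embeddings([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]))
```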
Now we have the input tokenized, and each token translated into an input embedding. In the next chapter, I’ll show how the LLM contextualizes these embeddings relative to each other.