TODO
- “Why we focused on this”: modern LLMs (GPT, Claude, Llama) all use decoder-only architectures.
- “If you want to learn more about X”: pointers to other architectures and their use cases.
This would also be a good place to mention:
- Why encoder-decoder exists (translation and summarization, where the input and output are separate texts); see the cross-attention sketch after this list.
- Why BERT exists (bidirectional context for classification/understanding tasks).
- How your decoder-only model is optimized specifically for generation; the mask sketch after this list contrasts the two attention patterns.
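
A minimal sketch of the encoder-decoder point above, assuming PyTorch's `nn.MultiheadAttention` (the shapes and variable names here are illustrative, not from the post): the decoder's cross-attention lets every output position attend over the whole encoded input, which is why the architecture fits tasks where input and output are separate sequences.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 4
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Encoded source sentence (e.g. the French input to a translator).
src = torch.randn(1, 10, d_model)
# Decoder states for the target tokens generated so far.
tgt = torch.randn(1, 7, d_model)

# Queries come from the target, keys/values from the source, so every
# target position can look at all 10 source positions.
out, attn = cross_attn(query=tgt, key=src, value=src)
print(out.shape)   # torch.Size([1, 7, 64])
print(attn.shape)  # torch.Size([1, 7, 10]) -- target-by-source attention map
```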
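And a companion sketch contrasting the attention patterns from the last two bullets (again an assumption-laden illustration, not the post's code): BERT attends bidirectionally over the whole input, while a decoder-only model applies a causal mask so each token sees only its past, which is exactly the property that makes autoregressive generation work.

```python
import torch

seq_len = 5

# BERT-style (bidirectional): no mask at all -- every token attends to
# every other token, left and right. Ideal for classification and other
# understanding tasks where the whole input is available at once.
bidirectional_mask = torch.zeros(seq_len, seq_len)

# Decoder-only (GPT-style): -inf above the diagonal, so token i can only
# attend to positions <= i. The model never sees "the future", which is
# what lets it generate text one token at a time.
causal_mask = torch.triu(
    torch.full((seq_len, seq_len), float("-inf")), diagonal=1
)

print(causal_mask)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0., 0., -inf, -inf, -inf],
#         [0., 0., 0., -inf, -inf],
#         [0., 0., 0., 0., -inf],
#         [0., 0., 0., 0., 0.]])
```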
It’s the bookend to your introduction: the intro sets the scope at the start, and this section provides the wider context at the end.