Tokens & Transformers
From breaking language into pieces to seeing the big picture: the architecture that rewired how machines read
Neural networks gave us the machinery to learn patterns in sequences, but there was still a question: what exactly should the sequence be made of?
As we mentioned earlier, early models predicted one character at a time. Simple, but inefficient. We needed a middle ground: a unit small enough to cover any text, yet big enough to capture meaning quickly.
That middle ground was tokens.
Tokens
When we moved from predicting characters to predicting tokens, we gave models a more efficient “unit” of text to work with. A token is just a small chunk of text — often a word, part of a word, or even punctuation — chosen by a tokenizer. The tokenizer breaks text into these chunks before it goes into the model.
But how does this help? And why bother?
- Characters are too small. Writing “unbreakable” would take 11 steps, and the model wastes time learning spelling patterns.
- Whole words are too big. A model would need to memorize every word it has ever seen — and fail on new words like unbreakability.
Tokens are the middle ground:
- Frequent words like `dog` or `computer` become single tokens.
- Rare or complex words are split into smaller pieces, like `un`, `break`, `able`.
- This lets the model understand new words it hasn't seen before by combining known tokens.
Why this helps: fewer steps than characters, almost no “I've never seen this word” failures, and a vocabulary that stays manageable. Our sequence models now predict the next token, not the next character, which speeds up learning and lets them focus on meaning sooner.
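To see this in practice, here's a minimal sketch using OpenAI's open-source tiktoken library (one real tokenizer among many; the exact splits depend on the encoding, so they may not match the `un`/`break`/`able` example above):

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

for word in ["dog", "computer", "unbreakable", "unbreakability"]:
    ids = enc.encode(word)                   # text -> token IDs
    pieces = [enc.decode([i]) for i in ids]  # token IDs -> text chunks
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")
```

Common words usually come back as a single token; rarer ones come back as several smaller pieces.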
The tokenizer maps each chunk to a token ID (just a number under the hood), and these IDs are what get turned into embeddings before being processed by the model.
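Conceptually, the embedding step is just a table lookup: each token ID indexes a row of a learned matrix. A toy sketch (all sizes and IDs below are made up for illustration):

```python
import numpy as np

vocab_size, d_model = 50_000, 768  # illustrative vocabulary and embedding sizes
rng = np.random.default_rng(42)

# In a real model this table is learned during training, not random.
embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

token_ids = [464, 2068, 7586]            # hypothetical IDs from a tokenizer
embeddings = embedding_table[token_ids]  # row lookup: one vector per token
print(embeddings.shape)                  # (3, 768)
```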
Different model providers (OpenAI, Anthropic, Google, etc.) use their own tokenizers, each with slightly different rules for splitting text. That's why the exact token count for the same sentence can vary between models — and why token limits aren't directly interchangeable.
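You can check this yourself for OpenAI's own encodings, since tiktoken ships several generations of them (a sketch; other providers' tokenizers aren't publicly comparable in the same way):

```python
import tiktoken  # pip install tiktoken

sentence = "Tokenizers split the same text differently."

# Two generations of OpenAI encodings: same sentence, different counts.
for name in ["gpt2", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(sentence))} tokens")
```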
Playground: Tokenizer
Type anything in the box and pick a model from the dropdown. You'll see exactly how your text gets split into tokens — the chunks the model actually works with.
With tokens in place, we had a way to feed sequences into our models efficiently. But to truly handle long-range dependencies in language, we needed a new kind of architecture: one that could "pay attention" to the right tokens in the sequence, no matter where they appeared.
The breakthrough came from a simple question: what if, instead of carrying everything forward step-by-step, the model could just look back at the parts that matter?
At first, attention was bolted onto RNN-based models (like in sequence-to-sequence translation). But then researchers at Google realized: if attention works this well… do we even need the RNN part at all?
Then, in 2017, the paper "Attention Is All You Need" introduced the transformer: a model built entirely around self-attention. No recurrence, no fixed window. Transformers process all tokens in parallel, which makes them faster to train and better at learning long-range relationships.
What’s inside a Transformer
You can think of a Transformer as a stack of identical layers. Each layer has two main parts:
- Self‑Attention — every token looks at every other token and assigns a weight to each connection based on relevance.
- Feed‑Forward Network (FFN) — after attention chooses what to focus on, a small neural network processes each token independently.
Between these steps, you’ll see residual connections (shortcuts for gradient flow) and layer normalization (keeps training stable).
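To make those pieces concrete, here's a minimal single-head, post-norm transformer layer in NumPy. It's a toy sketch of the original design, with random weights standing in for learned ones; real models add multiple attention heads, masking, and biases:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def self_attention(X, Wq, Wk, Wv):
    # Every token compares itself against every other token.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise relevance
    weights = softmax(scores)                # each row sums to 1
    return weights @ V                       # weighted mix of the other tokens

def transformer_layer(X, Wq, Wk, Wv, W1, W2):
    # Attention, wrapped in a residual shortcut and layer norm...
    X = layer_norm(X + self_attention(X, Wq, Wk, Wv))
    # ...then the feed-forward network, applied to each token independently.
    ffn = np.maximum(0, X @ W1) @ W2         # ReLU between two linear maps
    return layer_norm(X + ffn)

# Toy run: 4 tokens, model dimension 8.
rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))                  # 4 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, 4 * d)) * 0.1       # FFN expands, then...
W2 = rng.normal(size=(4 * d, d)) * 0.1       # ...projects back down
out = transformer_layer(X, Wq, Wk, Wv, W1, W2)
print(out.shape)                             # (4, 8): same shape in and out
```

Notice that each token comes out the same shape it went in, which is exactly what lets you stack these layers on top of each other.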
[Work In Progress]