Embeddings & Neural Networks

From mapping meaning in space to wiring up thought: the leap from raw words to machines that learn


Simple predictors could find patterns, but they saw the world in fragments. N-grams remembered only a handful of words. Bag-of-Words and TF-IDF counted them but lost their order.

None of them understood that dog and puppy are closer in meaning than dog and table. To move forward, we needed models that could capture meaning, not just frequency, and remember context over longer spans.

That's where embeddings came in.


Embeddings

Instead of treating words as separate labels, we mapped them to points in space: vectors. Words that show up in similar contexts land near each other in that space.

For example, doctor and nurse end up close together, while dog and cat sit near each other, far from router.

The key is that these vectors aren't hand-crafted. They're learned from data. By training a model to predict a word from its context (or vice versa), the positions in this space shift until words used in similar ways end up close together.

This gives us a dense, low-dimensional representation of meaning that works far better than the sparse, orderless counts from Bag-of-Words or TF-IDF.
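
To make this concrete, here is a minimal sketch of learning embeddings from a tiny toy corpus with gensim's Word2Vec. The corpus, dimensions, and hyperparameters are illustrative only; real embeddings are trained on millions of sentences, so the numbers a run this small produces will be noisy.

```python
# A minimal sketch: learn word vectors from co-occurrence with gensim's Word2Vec.
# The toy corpus and hyperparameters are illustrative, not production settings.
from gensim.models import Word2Vec

corpus = [
    ["the", "doctor", "examined", "the", "patient"],
    ["the", "nurse", "helped", "the", "doctor"],
    ["the", "dog", "chased", "the", "cat"],
    ["the", "cat", "slept", "near", "the", "dog"],
]

# Skip-gram (sg=1): train by predicting surrounding words from the current word.
model = Word2Vec(corpus, vector_size=16, window=2, min_count=1, sg=1, epochs=200)

# Every word is now a dense vector; words used in similar contexts drift closer.
print(model.wv["doctor"][:4])                  # first few dimensions of the vector
print(model.wv.similarity("doctor", "nurse"))  # higher than...
print(model.wv.similarity("doctor", "dog"))    # ...this, given enough training data
```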

These embeddings became the bridge between raw text and the neural networks that could process it, opening the door to systems that could learn richer patterns and relationships.

Playground: Embeddings

Interactive

Interactions: Click a word on the map to make it the focus, then observe its nearby vectors.

Words on the map: doctor, nurse, hospital, patient, teacher, school. Click any point to focus, or click and drag to explore.

Analogy

Let's see how accurate these analogy predictions can be.

A − B + C ≈ queen (with A = king, B = man, C = woman)

Try: king − man + woman, paris − france + japan

Nearest neighbors for "doctor"

school
patient
hospital
nurse
teacher

As you can see, while it's not the most accurate, this is still a significant jump from the sparse, orderless counts of Bag-of-Words or TF-IDF. Now we can find words with similar meanings and even solve rough analogies.
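
Under the hood, both the analogy game and the nearest-neighbor list come down to vector arithmetic plus cosine similarity. Here is a minimal numpy sketch; the vectors are made up for illustration, standing in for the ones a trained model would produce.

```python
import numpy as np

# Pretend these came from a trained embedding model; the values are made up.
vectors = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "man":   np.array([0.70, 0.10, 0.05]),
    "woman": np.array([0.70, 0.15, 0.80]),
    "queen": np.array([0.75, 0.70, 0.85]),
    "table": np.array([0.10, 0.90, 0.20]),
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 means "pointing the same way".
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The analogy A - B + C, here: king - man + woman.
target = vectors["king"] - vectors["man"] + vectors["woman"]

# Nearest neighbor by cosine similarity, excluding the input words themselves.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # "queen" with these toy vectors
```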


Embeddings gave us a way to represent meaning in a mathematical space. But to unlock their true potential, we needed something more than simple calculations on raw numbers: a system that could use these representations, recognize intricate patterns, and make sense of the relationships between them. Enter neural networks.


Neural Networks

In case you wondered: yes, the name is a direct nod to the brain's neural networks. Just as biological neurons receive signals, process them, and transmit output, artificial neurons in a network follow a similar process.

They take inputs, modify them through weights and biases, pass them through a non-linear activation function, and send them to the next layer of neurons. The magic happens when many such neurons are connected in layers. Individually, they're simple, but when stacked together, they can solve problems that are otherwise impossible for hand-written algorithms.
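
Here is what a single one of those neurons looks like in code: a minimal numpy sketch where the inputs, weights, and bias are made up for illustration (in a real network they are learned).

```python
import numpy as np

def relu(x):
    # A common non-linear activation: pass positives through, clamp negatives to 0.
    return np.maximum(0, x)

# Three inputs feeding one neuron. Weights and bias are illustrative;
# in a trained network they are adjusted automatically during learning.
inputs  = np.array([0.5, -0.2, 0.8])
weights = np.array([0.9,  0.3, -0.5])
bias    = 0.1

# Weighted sum of the inputs, plus the bias, passed through the activation.
output = relu(inputs @ weights + bias)
print(output)  # 0.09 with these numbers; this value feeds the next layer
```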

To understand how this works, let's look at an example of recognizing numbers. Imagine an 8×8 grid of pixels representing an ASCII digit, like the one below:

ASCII digit

The task is easy for our brain: we instantly recognize the digits 2, 4, 3 despite differences in thickness, edges, or even the pixelated style. This is where neural networks come in. Like our brain, a neural network can learn to see past the noise and recognize that all these different images still represent the same underlying concepts: the digits 2, 4, 3, and so on.

However, here's where it gets interesting:

A machine has no inherent ability to recognize this. If we handed it an 8×8 grid of pixel values, our first reaction might be, "How do we tell the machine what this is?"

But if we built a neural network, it could learn to do this, not by hard-coding each specific number but by discovering patterns within the data. It learns that a number might have a loop or a line, and it builds complex rules by recognizing simpler components like edges, curves, or intersections.

Just as when we identify the digit 3, for example, our brains break it down into simpler features: a curved line at the top and a sharp edge at the bottom.

The blue curve at the top, the amber middle segment, and the green bottom edge show how a network could decompose a "3" into simpler features.

The network's layers also deconstruct the input into simpler, identifiable patterns. This makes it possible for a neural network to tackle various other tasks, from recognizing faces to classifying images or predicting sequences.

The network architecture is where the magic really happens. The input layer consists of one neuron per pixel in the image: 64 neurons for an 8×8 grid. The hidden layers contain neurons that "learn" patterns in the input data.

And finally, the output layer represents the network's decision — in this case, a number from 0 to 9, which is the predicted digit. The network "learns" by adjusting weights and biases, refining its connections to make the predictions more accurate over time.
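
scikit-learn happens to ship an 8×8 handwritten-digits dataset, so the pipeline described above (64 input pixels, a couple of hidden layers, 10 output classes) can be sketched in a few lines. The layer sizes and iteration count below are arbitrary illustrative choices, not tuned values.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 8x8 grayscale digits, already flattened to 64 input features per image.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# 64 inputs -> two hidden layers -> 10 outputs (the digits 0-9).
net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
net.fit(X_train, y_train)          # "learning" = adjusting weights and biases

print(net.score(X_test, y_test))   # accuracy on digits the network never saw
```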

Playground: Neural Network

Interactive

Interactions: Hover over a node to see its connections. Left to right: Input layer → Hidden 1 → Hidden 2 → Output layer.


This process also highlights an important point. The "learning" aspect of neural networks is about adjusting all these weights and biases, not setting them manually.
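
A sketch of what that adjustment looks like for a single weight: plain gradient descent on a one-input neuron with no activation. The numbers are illustrative; a real network applies the same idea to millions of weights at once via backpropagation.

```python
# One neuron, one weight, no activation: it predicts y = w * x.
x, y_true = 2.0, 6.0   # we want the neuron to end up with w close to 3
w = 0.0                # start from an arbitrary weight
lr = 0.1               # learning rate: how big each nudge is

for step in range(20):
    y_pred = w * x
    grad = 2 * (y_pred - y_true) * x   # gradient of the squared error w.r.t. w
    w -= lr * grad                     # nudge the weight; never set it by hand

print(round(w, 3))  # close to 3.0
```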

The magical part is that just as networks of neurons in your brain can recognize a friend's face or understand a sentence, artificial neural networks can learn to recognize patterns in data, from images and audio to the vector representations of words.


This solved the meaning problem. But what about memory?

To solve that, we came up with sequence models: first RNNs (Recurrent Neural Networks), then LSTMs (Long Short-Term Memory networks) and GRUs, which read text one piece at a time and carry a hidden state forward.

You can think of that state as a running summary of what's been read so far. This helped with things like agreement across a sentence and basic long-range links. It wasn't perfect, but it was a lot better than n-grams.
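
That running summary can be sketched in a few lines of numpy: a minimal recurrent step that folds each new input vector into a hidden state. The weights here are random placeholders; a real RNN learns them from data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions are arbitrary for illustration: 4-dim word vectors, 8-dim hidden state.
W_x = rng.normal(size=(8, 4)) * 0.1   # how the current input updates the summary
W_h = rng.normal(size=(8, 8)) * 0.1   # how the previous summary carries forward
b   = np.zeros(8)

def rnn_step(h_prev, x_t):
    # New state = non-linear mix of the previous summary and the new input.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# "Read" a three-word sentence (made-up word vectors), one step at a time.
sentence = [rng.normal(size=4) for _ in range(3)]
h = np.zeros(8)                       # empty summary before reading anything
for x_t in sentence:
    h = rnn_step(h, x_t)

print(h.round(2))                     # the running summary after the whole sentence
```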

But there’s a catch:

Traditional neural networks, even sequence models like early RNNs and LSTMs, struggle to remember information over long stretches. Language isn't just about the last few words; meaning often depends on context spread across entire paragraphs. We needed a way for models to focus on the right parts of the input, no matter where they appeared.

Before we get to that breakthrough, a quick note on what these models were actually predicting.

Early on, many models predicted the next character. That keeps the vocabulary tiny, which is nice, but it makes sequences painfully long. For example, writing the word “unbreakable” takes 11 steps, one per character, and the model has to learn spelling before it can learn meaning.

The solution? Predict tokens instead — small chunks of text that could be words, subwords, or even punctuation. This shift made training faster, helped models generalize to new words, and set the stage for transformers, which we'll explore next.
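
A quick illustration of the difference, using a hypothetical subword split; real tokenizers (BPE, WordPiece, and friends) learn their own splits from data, so the exact pieces below are just an example.

```python
word = "unbreakable"

# Character-level: one prediction step per character.
chars = list(word)
print(len(chars), chars)        # 11 steps

# Token-level: a hypothetical subword split (real tokenizers learn these).
tokens = ["un", "break", "able"]
print(len(tokens), tokens)      # 3 steps, and "break" / "able" are reusable pieces
```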


Next up: Transformers & Tokenizers