Simple Predictors
From straight lines to word counts: the humble models that taught machines their first guesses
Before transformers and deep learning, machines learned in far humbler ways. They spotted patterns in numbers, tallied word counts, and drew straight lines through data.
These early “simple predictors” couldn't write essays or hold conversations, but they could forecast prices, filter spam, and guess the next word in a sentence, and they laid the groundwork for everything that came after.
Linear Regression
You might've come across this in a class or a blog post. If you're into data science or ML, you've probably used it. If not, it's simple enough to get in one sitting.
Think about a taxi fare. The longer the ride, the higher the bill. If we plot distance on the x-axis and fare on the y-axis, the points roughly form a line. Linear regression draws the best-fit line through those points and gives us a tiny formula we can use to predict a fare for a new trip.
At the simplest level it's just:
\( \text{predicted value} \approx \text{weight} \times \text{input} + \text{bias} \)

So how do we know if that prediction is any good?
(We'll skim this since it isn't core to the course. If you want, dive in later.)
- The difference between our predicted fare and the actual fare is called the residual.
- If we square each residual and average the results, we get the Mean Squared Error (MSE). Squaring stops positives and negatives from canceling and makes big mistakes count more.
Then comes R² (the coefficient of determination). This is not the square of a residual. It asks: how much better is my model than just predicting the average fare for every ride?
A simple way to think about it:
\( R^2 = 1 - \frac{\text{model error}}{\text{error of “always predict the mean”}} \)

R² ≈ 1 → great fit
R² ≈ 0 → no better than the average
R² < 0 → worse than the average
Rule of thumb for plain linear regression: aim for a small MSE and an R² as close to 1 as practical on a held-out test set.
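If you'd like to see those pieces in code, here's a minimal sketch using scikit-learn. The distances and fares are made-up toy numbers, not real taxi data; it fits the line, then reports the learned weight and bias along with MSE and R² on a held-out split.

```python
# A minimal sketch: fit fare ~ distance and evaluate on a held-out split.
# The distances/fares below are made-up toy numbers, not real taxi data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

distance_km = np.array([[1.2], [3.5], [5.0], [7.8], [10.1], [2.4], [6.3], [8.9]])
fare = np.array([6.0, 11.5, 15.0, 22.0, 27.5, 9.0, 18.5, 24.0])

X_train, X_test, y_train, y_test = train_test_split(
    distance_km, fare, test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)  # learns the weight (slope) and bias (intercept)

predictions = model.predict(X_test)
print("weight:", model.coef_[0], "bias:", model.intercept_)
print("MSE:", mean_squared_error(y_test, predictions))
print("R^2:", r2_score(y_test, predictions))

# Predict the fare for a new 4 km trip.
print("fare for 4 km:", model.predict([[4.0]])[0])
```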
Playground: Linear Regression
Linear regression might be the “hello world” of prediction models, but it's not just a teaching tool. There are plenty of use cases where it still fits well, such as:
- Housing prices → Predicting price from square footage, number of bedrooms, and location.
- Business forecasting → Estimating sales based on ad spend or seasonal trends.
It's not just linear regression though. We have other quick wins that still show up in real products today.
Naive Bayes → A quick, surprisingly effective way to classify things, from spam vs. not spam to sentiment in reviews. It works by looking at how often words appear in different categories. (There's a small code sketch after this list.)
N-grams → Simple models that guess the next word by looking at the last n words. They powered early autocomplete and speech recognition long before deep learning took over.
Bag-of-Words (BoW) → One of the simplest ways to turn text into numbers. We count how often each word appears in a document and represent it as a vector. This works well with models like Naive Bayes, but treats every word as independent and ignores order.
TF-IDF → Short for “Term Frequency–Inverse Document Frequency.” It improves on BoW by down-weighting very common words (like “the” or “and”) and giving more importance to words that are unique to a document. This helps models focus on the terms that actually carry meaning.
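To see how these pieces click together, here's a small sketch, assuming scikit-learn and a handful of invented “spam”/“ham” messages. It turns text into bag-of-words and TF-IDF vectors and hands them to a Naive Bayes classifier.

```python
# A tiny sketch: bag-of-words and TF-IDF features feeding a Naive Bayes classifier.
# The handful of "spam"/"ham" messages below are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win a free prize now",                     # spam
    "limited offer claim your free reward",     # spam
    "are we still meeting for lunch tomorrow",  # ham
    "can you send me the report by friday",     # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words: each message becomes a vector of raw word counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(messages)

# Naive Bayes only needs those per-class word counts to make a guess.
clf = MultinomialNB()
clf.fit(X_bow, labels)
print(clf.predict(bow.transform(["claim your free prize"])))  # likely ['spam']

# TF-IDF: same shape of matrix, but very common words are down-weighted.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(messages)
clf_tfidf = MultinomialNB().fit(X_tfidf, labels)
print(clf_tfidf.predict(tfidf.transform(["claim your free prize"])))
```

With a corpus this small both versions agree; on real data the TF-IDF features usually help the classifier ignore filler words like “the” and “and”.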
Playground: N-gram-powered keyboard
You must've used auto-correct on your phone's keyboard. Here's a simple one, powered by n-grams :)
Interactions: Type a word or click on a word suggestion to see the predictions. You can also edit the corpus (i.e. the training data) to try it out with your own text.
Corpus (Training Data)
Edit the corpus to try it out with your own text.
How it works
We estimate \( P(\text{next} \mid \text{previous } n-1 \text{ tokens}) \) from counts. Add-one smoothing avoids zero probabilities when a word was never seen after a given context.
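For the curious, here's a rough sketch of the same idea in plain Python: a bigram (n = 2) model with add-one smoothing over a made-up corpus. The playground's actual corpus and code may differ.

```python
# A rough sketch of a bigram (n = 2) next-word predictor with add-one smoothing.
# The tiny corpus is made up; the playground's real corpus and code may differ.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()
vocab = set(corpus)

# Count how often each word follows each context word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_probs(context):
    """P(next | context) with add-one smoothing over the vocabulary."""
    counts = follows[context]
    total = sum(counts.values()) + len(vocab)  # +1 for every vocab word
    return {w: (counts[w] + 1) / total for w in vocab}

# Top suggestions after the word "the".
probs = next_word_probs("the")
for word, p in sorted(probs.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(p, 3))
```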
All of these are small and tidy, and they still earn their keep in specific corners: pricing, filtering, ranking. Under the hood, they all share the same move: predict something about the future from patterns in the past.
These ideas worked because the world has structure. But they forget quickly. N-grams only see a short window. Rare phrases cause trouble. There is no real sense of meaning, just tables of counts and some smoothing.
So we needed two upgrades:
1. A way to represent meaning
2. A way to remember over longer spans