Session 9
Finding the Bottom
Loss Functions and Optimization Intuition
Concept Lesson
Imagine you are training a model to predict house prices in Lagos for a real estate platform. Your model looks at a 3-bedroom apartment in Lekki and predicts ₦38 million. The actual sale price turns out to be ₦45 million. Your model is off by ₦7 million. But how wrong is that, really? Is it a little wrong, or catastrophically wrong? That depends entirely on how you define "wrong," and in machine learning, that definition is called the loss function. A loss function takes every prediction your model makes, compares it to the true answer, and produces a single number that summarizes how badly the model is performing overall. The goal of training is to make this number as low as possible. Think of the loss as a scorecard pinned on the wall: lower means your model is getting closer to reality, and zero would mean perfect predictions on every data point. Different loss functions define "wrong" differently, and this choice is not academic — it directly shapes what your model prioritizes fixing.
The two most common loss functions for regression problems (Regression means predicting a number — like a price, a temperature, or a salary. The opposite is classification, which means predicting a category — like 'spam' or 'not spam.') are Mean Squared Error (MSE) and Mean Absolute Error (MAE). Let's work through a concrete example to see how they differ. Suppose your model predicts house prices of ₦45M, ₦30M, and ₦60M for three properties, but the true prices are ₦42M, ₦35M, and ₦50M. The individual errors are +3M (overpredicted), −5M (underpredicted), and +10M (overpredicted). MSE squares each error first: 3² = 9 (multiplying the number by itself: 3 × 3 = 9), 5² = 25, 10² = 100, then averages them: (9 + 25 + 100) / 3 = 44.67. MAE takes absolute values: |3| = 3, |5| = 5, |10| = 10, then averages: (3 + 5 + 10) / 3 = 6.0. Notice what happened: MSE gave 100 points of penalty to that ₦10M error while MAE only gave it 10. MSE punishes big mistakes far more harshly because squaring makes a large error's penalty grow quadratically — doubling an error quadruples its penalty. This means if you use MSE as your loss function, your model will work extra hard to fix its worst predictions, sometimes at the expense of being slightly worse on the majority of data points. MAE, by contrast, penalizes errors in direct proportion to their size — a ₦20M error gets exactly twenty times the penalty of a ₦1M error, no more. The choice between them matters: if your Lagos real estate platform can tolerate a few big misses but needs generally reliable predictions across the board, MAE might serve you better. If a single wildly wrong price estimate damages your reputation, MSE will push harder to prevent that.
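The arithmetic above is short enough to verify yourself. Here is a minimal sketch in plain Python using the three-house example from the lesson (prices in millions of naira):

```python
# Predictions and true prices for the three properties, in millions of naira.
predictions = [45, 30, 60]
actuals = [42, 35, 50]

# Individual errors: prediction minus truth.
errors = [p - a for p, a in zip(predictions, actuals)]  # [3, -5, 10]

# MSE: square each error, then average.
mse = sum(e ** 2 for e in errors) / len(errors)   # (9 + 25 + 100) / 3

# MAE: take each error's absolute value, then average.
mae = sum(abs(e) for e in errors) / len(errors)   # (3 + 5 + 10) / 3

print(f"MSE: {mse:.2f}")  # 44.67
print(f"MAE: {mae:.2f}")  # 6.00
```

Running it reproduces the numbers worked out above, and swapping in your own predictions is an easy way to build intuition for how a single large error dominates MSE but not MAE.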
Optimization is the process of finding the specific model parameters — the weights and biases (Weights are numbers that control how much each input feature influences the prediction — like turning dials up or down. A bias is a constant added to shift the final output (like the intercept b in y = mx + b). Together, they are the knobs the model adjusts during training.) — that minimize the loss. Imagine you are standing on a vast hilly landscape in Jos, blindfolded, and your goal is to find the lowest valley. You cannot see the whole terrain, but you can feel the slope of the ground beneath your feet. At each step, you feel which direction goes downhill, plant your foot there, feel the new slope, and step again. This is gradient descent: at each point in the parameter space (Imagine every possible combination of weight and bias values as a point on a vast landscape — that landscape is the parameter space.), you compute the slope (called the gradient), and you move your parameters in the direction that reduces the loss. The landscape itself is determined by two things: your data and your model architecture (the overall structure of the model — how many layers it has, what operations each layer performs, how they connect). Change the training data, and the hills shift. Change the model from linear regression to a neural network, and the entire topography transforms. The key insight is that you never see the whole landscape — you only ever know the slope at your current position, which is why training can get stuck in local valleys (Some valleys are shallow dips — local minima — where the loss is low but not the lowest possible. The deepest valley is the global minimum. A good optimization algorithm tries to find the global minimum without getting stuck in local ones.) instead of finding the true global minimum.
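The blindfolded-walker loop can be written in a few lines. This is a sketch, not a production optimizer: it fits a single-parameter model y = w · x with MSE loss, and the data (a clean y = 2x relationship) and starting point are illustrative assumptions chosen so you can see convergence:

```python
# Tiny dataset with a known answer: y = 2x, so the best w is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0                # start at an arbitrary point on the landscape
learning_rate = 0.05   # step size

for step in range(200):
    n = len(xs)
    # Slope of the MSE loss with respect to w:
    # d/dw (1/n) Σ (w*x - y)²  =  (2/n) Σ (w*x - y) * x
    grad = (2 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
    # Feel the slope, step downhill.
    w -= learning_rate * grad

print(round(w, 4))  # 2.0 — the bottom of this (one-dimensional) valley
```

Each pass through the loop is one "feel the slope, take a step" cycle from the analogy; real training does exactly this, just with millions of parameters instead of one.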
The learning rate controls how big each step is along this landscape, and it is the single most important hyperparameter (A parameter is something the model learns automatically from data (like weights). A hyperparameter is something YOU set before training begins — like the learning rate. The prefix 'hyper' means 'above' — it's a setting above the model, not inside it.) you will tune. Set it too large, say 1.0, and you overshoot the valley entirely. Your model's parameters leap from one side of the valley to the other, bouncing back and forth with each epoch (one complete pass through all the training data), and the loss either stays flat or even increases — this is called divergence. Set it too small, say 0.00001, and you crawl downhill so slowly that training a model that should take 10 minutes instead takes 6 hours. You waste GPU time, compute budget, and your own patience. The sweet spot varies by problem, but common starting values are 0.01 or 0.001, and many frameworks include learning rate schedulers (tools that automatically reduce the learning rate as training progresses, so you take big steps at the start and tiny steps as you get close to the bottom). A common beginner mistake is setting a large learning rate, watching the loss bounce around, and concluding the model itself is broken when really it is the step size that needs adjustment. Always check your learning rate first when training goes wrong.
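You can watch both failure modes on the same toy problem. The sketch below reuses the one-parameter fit of y = w · x with MSE loss; the two learning rates are illustrative assumptions picked so that one descends steadily and the other overshoots on every step:

```python
# Same tiny dataset as before: the best w is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

def final_loss(learning_rate, steps=20):
    """Run gradient descent and return the MSE after the last step."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        grad = (2 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= learning_rate * grad
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n

print(final_loss(0.05))  # tiny loss: each step settles deeper into the valley
print(final_loss(0.25))  # enormous loss: each step leaps past the bottom and diverges
```

With the larger rate, every update jumps across the valley to a point farther from the minimum than where it started, so the loss grows with each step — exactly the bouncing behavior described above.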
Guided Exercises
Discussion Prompt
Training a model is called "optimization," which implies finding the best solution. But are we always finding the absolute best solution? Think about the hilly landscape analogy: what happens if there are multiple valleys? Could your model settle in a shallow valley when a deeper one exists nearby? What strategies do practitioners use to deal with this problem?
Key Takeaway
Training is a search through parameter space for the settings that make the model least wrong. The loss function defines what "wrong" means, and gradient descent is how the search proceeds. Understanding both gives you the intuition to diagnose why training fails and how to fix it.