Week 5 of 8

How Models Learn

Understand the mechanics of model training (loss and optimization) and the metrics used to evaluate whether training worked. These two sessions complete the quantitative reasoning arc of the course.

Before this week
Bayesian thinking
Vectors & matrices

Session 9

Finding the Bottom

Loss Functions and Optimization Intuition

90 Minutes
Objective Understand what a loss function measures, why optimization means minimization, and develop intuition for gradient descent.

Concept Lesson

Imagine you are training a model to predict house prices in Lagos for a real estate platform. Your model looks at a 3-bedroom apartment in Lekki and predicts ₦38 million. The actual sale price turns out to be ₦45 million. Your model is off by ₦7 million. But how wrong is that, really? Is it a little wrong, or catastrophically wrong? That depends entirely on how you define "wrong," and in machine learning, that definition is called the loss function. A loss function takes every prediction your model makes, compares it to the true answer, and produces a single number that summarizes how badly the model is performing overall. The goal of training is to make this number as low as possible. Think of the loss as a scorecard pinned on the wall: lower means your model is getting closer to reality, and zero would mean perfect predictions on every data point. Different loss functions define "wrong" differently, and this choice is not academic — it directly shapes what your model prioritizes fixing.

The two most common loss functions for regression problems (Regression means predicting a number — like a price, a temperature, or a salary. The opposite is classification, which means predicting a category — like 'spam' or 'not spam.') are Mean Squared Error (MSE) and Mean Absolute Error (MAE). Let's work through a concrete example to see how they differ. Suppose your model predicts house prices of ₦45M, ₦30M, and ₦60M for three properties, but the true prices are ₦42M, ₦35M, and ₦50M. The individual errors are +3M (overpredicted), −5M (underpredicted), and +10M (overpredicted). MSE squares each error first: 3² = 9 (multiplying the number by itself: 3 × 3 = 9), 5² = 25, 10² = 100, then averages them: (9 + 25 + 100) / 3 = 44.67. MAE takes absolute values: |3| = 3, |5| = 5, |10| = 10, then averages: (3 + 5 + 10) / 3 = 6.0. Notice what happened: MSE gave 100 points of penalty to that ₦10M error while MAE only gave it 10. MSE punishes big mistakes far more harshly because squaring makes a large error disproportionately larger: an error 10 times bigger gets 100 times the penalty. This means if you use MSE as your loss function, your model will work extra hard to fix its worst predictions, sometimes at the expense of being slightly worse on the majority of data points. MAE, by contrast, weights every error in direct proportion to its size: a ₦20M error gets twenty times the attention of a ₦1M error, no more. The choice between them matters: if your Lagos real estate platform can tolerate a few big misses but needs generally reliable predictions across the board, MAE might serve you better. If a single wildly wrong price estimate damages your reputation, MSE will push harder to prevent that.
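The arithmetic above is easy to check for yourself. Here is a minimal sketch in plain Python using the price figures from the example (in millions of naira):

```python
# True prices and model predictions, in millions of naira (from the example above)
true_prices = [42, 35, 50]
predictions = [45, 30, 60]

# Signed errors: prediction minus truth
errors = [p - t for p, t in zip(predictions, true_prices)]  # [3, -5, 10]

mse = sum(e ** 2 for e in errors) / len(errors)   # (9 + 25 + 100) / 3
mae = sum(abs(e) for e in errors) / len(errors)   # (3 + 5 + 10) / 3

print(f"MSE = {mse:.2f}")  # MSE = 44.67
print(f"MAE = {mae:.2f}")  # MAE = 6.00
```

Try changing the ₦10M error to ₦20M and rerunning: MAE grows modestly, while MSE jumps dramatically. That difference in sensitivity is exactly why the two losses steer training differently.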

Optimization is the process of finding the specific model parameters — the weights and biases (Weights are numbers that control how much each input feature influences the prediction — like turning dials up or down. A bias is a constant added to shift the final output (like the intercept b in y = mx + b). Together, they are the knobs the model adjusts during training.) — that minimize the loss. Imagine you are standing on a vast hilly landscape in Jos, blindfolded, and your goal is to find the lowest valley. You cannot see the whole terrain, but you can feel the slope of the ground beneath your feet. At each step, you feel which direction goes downhill, plant your foot there, feel the new slope, and step again. This is gradient descent: at each point in the parameter space (Imagine every possible combination of weight and bias values as a point on a vast landscape — that landscape is the parameter space.), you compute the slope (called the gradient), and you move your parameters in the direction that reduces the loss. The landscape itself is determined by two things: your data and your model architecture (the overall structure of the model — how many layers it has, what operations each layer performs, how they connect). Change the training data, and the hills shift. Change the model from linear regression to a neural network, and the entire topography transforms. The key insight is that you never see the whole landscape — you only ever know the slope at your current position, which is why training can get stuck in local valleys (Some valleys are shallow dips — local minima — where the loss is low but not the lowest possible. The deepest valley is the global minimum. A good optimization algorithm tries to find the global minimum without getting stuck in local ones.) instead of finding the true global minimum.

The learning rate controls how big each step is along this landscape, and it is the single most important hyperparameter (A parameter is something the model learns automatically from data (like weights). A hyperparameter is something YOU set before training begins — like the learning rate. The prefix 'hyper' means 'above' — it's a setting above the model, not inside it.) you will tune. Set it too large, say 1.0, and you overshoot the valley entirely. Your model's parameters leap from one side of the valley to the other, bouncing back and forth with each epoch (one complete pass through all the training data), and the loss either stays flat or even increases — this is called divergence. Set it too small, say 0.00001, and you crawl downhill so slowly that training a model that should take 10 minutes instead takes 6 hours. You waste GPU time, compute budget, and your own patience. The sweet spot varies by problem, but common starting values are 0.01 or 0.001, and many frameworks include learning rate schedulers (a tool that automatically reduces the learning rate as training progresses, so you take big steps at the start and tiny steps as you get close to the bottom) that automatically decrease the step size as training progresses. A common beginner mistake is setting a large learning rate, watching the loss bounce around, and concluding the model itself is broken when really it is the step size that needs adjustment. Always check your learning rate first when training goes wrong.
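You can watch the learning rate's effect on the simplest possible loss landscape, a one-dimensional parabola loss(w) = w², whose slope (gradient) at any point is 2w. Each gradient descent step is then w ← w − lr × 2w. This is an illustrative sketch, not a training framework:

```python
def gradient_descent(lr, start=10.0, steps=20):
    """Minimize loss(w) = w**2, whose gradient at w is 2*w."""
    w = start
    for _ in range(steps):
        w = w - lr * (2 * w)  # step downhill along the slope
    return w

# Moderate learning rate: w shrinks steadily toward the minimum at 0
print(gradient_descent(lr=0.1))

# Too large: each step overshoots the valley and |w| grows — divergence
print(gradient_descent(lr=1.1))

# Too small: after 20 steps w has barely moved from the start
print(gradient_descent(lr=0.0001))
```

Running this shows the three behaviors from the text: smooth convergence, divergence, and a crawl. Real loss landscapes have millions of dimensions, but the arithmetic of overshooting and undershooting is the same.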

Guided Exercises

Exercise 1: You have three houses with true sale prices: ₦10M, ₦20M, ₦30M. Your model predicts ₦12M, ₦18M, and ₦25M respectively. (a) Calculate the error for each prediction. (b) Compute the Mean Squared Error. (c) Compute the Mean Absolute Error. (d) Which metric penalizes the ₦25M prediction (off by ₦5M) more harshly? Show all arithmetic. (e) If you were building a property valuation tool for a mortgage lender who needs to avoid wildly wrong estimates, which loss function would you prefer and why?
Exercise 2: On paper, sketch a U-shaped curve representing a loss landscape. Mark a starting point high on the left slope. Draw a sequence of arrows showing gradient descent steps toward the bottom using a moderate learning rate. Now redraw with a learning rate that is too large: show your arrows bouncing past the valley and climbing up the opposite slope. Finally, redraw with a learning rate that is too small: show tiny steps that barely move after 10 iterations. Label each scenario and note what the loss curve over epochs would look like in each case.
Exercise 3: Your model's loss after each training epoch (one complete pass through all the training data) is: 5.0, 3.2, 2.1, 1.8, 1.75, 1.74, 1.74, 1.74. (a) Plot this on an x-y graph with epochs on the x-axis and loss on the y-axis. (b) At which epoch would you stop training, and why? (c) What would it mean if the loss at epoch 9 jumped to 1.9 and epoch 10 to 2.4? Name this phenomenon. (This is called overfitting — the model has memorized the training data so well that it starts performing worse on new, unseen data. It is like a student who memorizes practice test answers instead of learning the underlying concepts.) (d) What practical action would you take to address it?

Discussion Prompt

Training a model is called "optimization," which implies finding the best solution. But are we always finding the absolute best solution? Think about the hilly landscape analogy: what happens if there are multiple valleys? Could your model settle in a shallow valley when a deeper one exists nearby? What strategies do practitioners use to deal with this problem?

Key Takeaway

Training is a search through parameter space for the settings that make the model least wrong. The loss function defines what "wrong" means, and gradient descent is how the search proceeds. Understanding both gives you the intuition to diagnose why training fails and how to fix it.

Quick Check

True prices: 10, 20, 30. Predictions: 12, 18, 25. MSE = (4+4+25)/3 = 11. MAE = (2+2+5)/3 = 3. Which loss penalizes the error of 5 more?

  • MAE
  • MSE (squares the error, giving 25 instead of 5)
  • Both equally

Your learning rate is too high. What happens during training?

  • The loss bounces up and down or diverges — parameters overshoot the minimum
  • Training converges faster than normal
  • Nothing — learning rate doesn't matter

Loss decreases for 8 epochs, then starts increasing at epoch 9. This is likely:

  • Normal — loss always goes up eventually
  • Underfitting — the model hasn't trained enough
  • Overfitting — the model memorized training data and now performs worse on new data

Key Terms

What is a loss function?

A measure of how wrong the model's predictions are. The goal of training is to minimize the loss. Lower loss = better model.

What is MSE?

Mean Squared Error: average of squared prediction errors. Squaring punishes big errors more harshly than small ones. Use when large mistakes are costly.

What is gradient descent?

The algorithm that finds the best model parameters by following the slope (gradient) downhill on the loss landscape, one step at a time.

What is the learning rate?

The step size in gradient descent. Too large = overshoot and diverge. Too small = painfully slow convergence. Common starting values: 0.01 or 0.001.

Session 10

Measuring What Matters

Evaluation Metrics — Beyond Accuracy

90 Minutes
Objective Understand precision, recall, F1, and confusion matrices. Know when each metric matters and why accuracy alone is dangerous.

Concept Lesson

You deploy a fraud detection model at a Nigerian bank. The model processes 50,000 transactions per day, and on your test set it achieved 99.2% accuracy. The engineering team celebrates. Three weeks later, the bank's risk department calls an emergency meeting: they have audited the model's actual performance and discovered it is catching only 60% of real fraud cases. Forty percent of fraudulent transactions are sailing through undetected, costing the bank ₦180 million in the first month alone. How did a 99.2% accurate model miss 40% of fraud? The answer is that accuracy is a dangerously misleading metric when classes (In ML, a 'class' is one of the categories your model predicts — for example, 'fraud' and 'not fraud' are two classes. When one class has far more examples than the other, the data is imbalanced. The rare class is called the minority class.) are imbalanced. Out of 50,000 daily transactions, roughly 49,500 are legitimate and only 500 are fraudulent. A model that simply predicts "legitimate" for every single transaction achieves 99% accuracy without catching a single case of fraud. Accuracy measures the percentage of all predictions that were correct, but it does not distinguish between the easy task of identifying legitimate transactions and the hard task of catching fraud. When one class vastly outnumbers the other — as fraud always does — accuracy rewards the model for ignoring the minority class entirely.
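The accuracy trap is easy to reproduce. A sketch using the transaction counts from the example (49,500 legitimate, 500 fraudulent) and a useless "model" that predicts "legitimate" for everything:

```python
# Imbalanced labels from the example: 49,500 legitimate, 500 fraud
labels = ["legit"] * 49_500 + ["fraud"] * 500

# A "model" that calls every single transaction legitimate
predictions = ["legit"] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
fraud_caught = sum(p == "fraud" == y for p, y in zip(predictions, labels))

print(f"Accuracy: {accuracy:.1%}")            # Accuracy: 99.0%
print(f"Fraud cases caught: {fraud_caught}")  # Fraud cases caught: 0
```

Ninety-nine percent accuracy, zero fraud caught. Any metric that can award a near-perfect score to a model that does nothing is not a metric you can rely on alone.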

Precision and recall are sharper tools that dissect different aspects of model performance. Precision answers this question: of everything your model flagged as fraudulent, what percentage actually was fraud? If your model flags 100 transactions as fraud and 85 of them genuinely are, your precision is 85%. The remaining 15% are false alarms — legitimate customers whose cards get frozen, whose transactions get blocked, and who now call your customer service line furious. High precision means few false alarms. Recall asks a completely different question: of all the transactions that were genuinely fraudulent, what percentage did your model catch? If there were 500 actual fraud cases and your model identified 300 of them, your recall is 60%. The remaining 200 cases slipped through undetected. High recall means few missed threats. Here is the critical tension: there is almost always a trade-off between precision and recall. If you lower the threshold for flagging fraud to catch more cases (increasing recall), you will inevitably flag more legitimate transactions too (decreasing precision). Raising the threshold does the reverse. The bank must decide: is it worse to freeze a legitimate customer's card for an hour, or to let ₦2 million in fraudulent charges go through? That decision determines where you set the threshold (Most classifiers output a probability — say, 0.72 chance this is fraud. The threshold is the cutoff you pick: above it, you flag the transaction; below it, you let it through. Set the threshold at 0.5 and you flag anything above 50% probability. Set it at 0.9 and you only flag cases the model is very confident about.) and which metric you prioritize.
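The precision–recall trade-off can be seen directly by sweeping the threshold over a classifier's probability outputs. The scores below are invented for illustration (1 = fraud, 0 = legitimate); only the pattern matters:

```python
# Hypothetical fraud probabilities paired with true labels (1 = fraud, 0 = legit).
# These scores are invented for illustration.
scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
          (0.55, 1), (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0)]

def precision_recall(threshold):
    """Flag anything scored at or above the threshold, then measure both metrics."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)  # fraud, flagged
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)  # legit, flagged
    fn = sum(1 for s, y in scored if s < threshold and y == 1)   # fraud, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.90, 0.50, 0.25):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

As the threshold drops, recall climbs (more fraud caught) while precision falls (more false alarms). Picking the operating point on that curve is the business decision the paragraph describes.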

F1 score provides a single balanced number by taking the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). (The harmonic mean is a special kind of average that penalizes imbalance. A regular average of 90% and 10% is 50% — which hides the fact that one number is terrible. The harmonic mean of 90% and 10% is only about 18% — much more honest. The formula is: 2 × (precision × recall) / (precision + recall). Here is a worked example: if precision is 90% (0.9) and recall is 70% (0.7), then F1 = 2 × (0.9 × 0.7) / (0.9 + 0.7) = 2 × 0.63 / 1.6 = 0.7875, or about 79%.) Unlike a simple average, the harmonic mean punishes extreme imbalances. If your precision is 99% but recall is only 10%, your simple average is 54.5% — which sounds acceptable. But the F1 score is only 18.2%, which honestly reflects that your model is catching almost nothing. This property makes F1 useful when you need one number to compare models but cannot afford to ignore either type of error. The confusion matrix is the master document from which all other metrics are derived. It is a 2×2 table (a 2-by-2 table — two rows and two columns — that shows four numbers: what the model got right and what it got wrong, broken down by type) with four cells: true positives (correctly caught fraud), true negatives (correctly identified legitimate transactions), false positives (legitimate transactions wrongly flagged), and false negatives (fraud that slipped through). Every metric — accuracy, precision, recall, F1 — is a simple arithmetic combination of these four numbers. Once you can build and read a confusion matrix, you have full visibility into what your model is doing right and what it is getting wrong.
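Since every metric is simple arithmetic on the four confusion-matrix cells, it is worth computing them all once by hand. The counts below are hypothetical, chosen only to make the formulas concrete:

```python
# Hypothetical confusion-matrix counts for a fraud classifier
tp, fp, fn, tn = 300, 60, 200, 9_440  # true pos, false pos, false neg, true neg

total = tp + fp + fn + tn
accuracy  = (tp + tn) / total          # fraction of all predictions correct
precision = tp / (tp + fp)             # of everything flagged, how much was fraud
recall    = tp / (tp + fn)             # of all fraud, how much was caught
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")   # accuracy  = 0.974
print(f"precision = {precision:.3f}")  # precision = 0.833
print(f"recall    = {recall:.3f}")     # recall    = 0.600
print(f"f1        = {f1:.3f}")         # f1        = 0.698

# The harmonic mean punishes imbalance far more than a simple average does:
p, r = 0.99, 0.10
print(round((p + r) / 2, 3))           # simple average: 0.545
print(round(2 * p * r / (p + r), 3))   # F1: 0.182
```

Note how high accuracy (97.4%) coexists with a recall of only 60%: the same pattern as the bank story at the start of this session.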

A common mistake is to optimize for a metric without understanding the real-world cost of each type of error. Consider a disease screening model deployed in a hospital in Abuja. False negatives — telling a sick patient they are healthy — could delay treatment by months, allowing a treatable disease to become fatal. False positives — telling a healthy patient they might be sick — cause anxiety and additional testing, but the patient ultimately walks away unharmed after confirmation. In this context, recall matters far more than precision: you would rather flag 100 healthy patients for follow-up testing than miss one actual cancer case. Now consider spam email filtering at a Nigerian tech company. A false positive means a legitimate email from a client gets silently deleted. A false negative means one spam email reaches your inbox, and you delete it in two seconds. Here, precision matters more: you would rather let ten spam emails through than accidentally block one important business email. The metric you choose to optimize is not a technical decision — it is an ethical and business decision about which mistakes you are willing to make.

Guided Exercises

Exercise 1: Your fraud detection model ran on 10,000 transactions. Of these, 9,800 were legitimate and 200 were fraudulent. The model correctly identified 9,700 legitimate transactions, wrongly flagged 100 legitimate transactions as fraud, correctly caught 150 fraud cases, and missed 50 fraud cases. (a) Build the full 2×2 confusion matrix labeling all four cells. (b) Calculate accuracy, precision, recall, and F1 score step by step. (c) The bank reports your model as "98.5% accurate." Explain in one sentence why this number hides a serious problem. (d) Which metric should the bank's risk team monitor daily, and why?
Exercise 2: A cancer screening tool is being evaluated for deployment in Lagos State hospitals. Scenario A has precision 95% and recall 60%. Scenario B has precision 60% and recall 95%. (a) In plain language, what does each scenario mean for patients? (b) Which scenario would the hospital's chief medical officer prefer, and why? (c) Which scenario would a patient who just received a positive result prefer? (d) Are the doctor's and patient's preferences aligned or conflicting? Explain the real-world consequences of choosing each scenario.
Exercise 3: Your company builds a hiring model that screens resumes. Overall accuracy is 92%. However, when you break down performance by demographic group, you find that recall is 85% for candidates from public universities but only 52% for candidates from private universities. (a) What does this mean practically for applicants from each group? (b) Even though overall accuracy looks strong, why should this disparity concern you? (c) What steps would you take to investigate and address this issue before deploying the model?

Discussion Prompt

If someone tells you "our model is 97% accurate," what are the first three questions you would now ask? Think about class distribution, what the model is being used for, and what types of errors are being made. Share your questions and discuss why each one matters.

Key Takeaway

Choose your evaluation metric based on what mistake costs more in your specific context. Accuracy is the default, but it can be dangerously misleading when classes are imbalanced. Precision, recall, F1, and the confusion matrix give you the full picture you need to make responsible decisions.

Quick Check

Precision answers the question:

  • Of everything the model flagged, how many were actually correct?
  • Of all actual positives, how many did the model catch?
  • What percentage of all predictions were correct?

Recall answers the question:

  • Of everything the model flagged, how many were correct?
  • Of all actual positive cases, how many did the model catch?
  • How many predictions were true negatives?

F1 score is useful because it:

  • Ignores recall and focuses only on precision
  • Is always higher than accuracy
  • Balances precision and recall, punishing extreme imbalances between them

Key Terms

What is precision?

Of everything the model flagged as positive, what fraction was actually positive? High precision = few false alarms. Critical for spam filters.

What is recall?

Of all actual positive cases, what fraction did the model catch? High recall = few missed cases. Critical for disease screening.

What is F1 score?

The harmonic mean of precision and recall: 2 × (P × R) / (P + R). Punishes extreme imbalance — 99% precision with 1% recall gives a very low F1.

What is a confusion matrix?

A 2×2 table showing TP, TN, FP, FN. The master document from which all other metrics (accuracy, precision, recall, F1) are calculated.