Week 4 of 8

Probabilistic Reasoning

Deepen probabilistic thinking with Bayes’ theorem, then begin building intuition for the mathematical structures that underpin ML models — vectors and matrices.

Before this week
Reading a dataset
Basic probability

Session 7

Bayes’ Brain

Bayesian Thinking and Updating Beliefs

90 Minutes
Objective Understand Bayes’ theorem intuitively and apply it to reason about how evidence should change beliefs.

Concept Lesson

Imagine you are a doctor at a clinic in Abuja. A patient walks in, takes a screening test for a rare disease, and the result comes back positive. How worried should you actually be? Your instinct might say 95% worried if the test is 95% accurate, but as you learned in Session 6, the real answer can be shockingly lower when the disease is rare. Bayes’ theorem is the formal tool that gives you the correct answer every time, and it works by forcing you to separate three distinct pieces of information: your prior belief (how common is the disease before the test?), the evidence (how likely is a positive result if the patient is actually sick?), and the total probability of seeing the evidence at all (how many positive results come from sick people versus healthy people who got false alarms?). The formula ties these together: P(disease | positive) = P(positive | disease) × P(disease) ⁄ P(positive). Do not memorize this mechanically — understand the logic. Your updated belief equals how well the evidence supports the hypothesis, scaled by how plausible the hypothesis was to begin with, divided by how surprising the evidence is overall.

Let us trace through the formula step by step with concrete numbers. You know P(disease) = 0.01 — the disease affects 1 in 100 people in Abuja. This is your prior: before any test, there is a 1% chance any given patient has it. The test sensitivity is P(positive | disease) = 0.95 — if the patient is sick, the test says positive 95% of the time. Sensitivity — also called the true positive rate — is the probability that the test correctly identifies someone who actually has the condition. If sensitivity is 95%, the test catches 95 out of 100 sick people. The false positive rate is P(positive | no disease) = 0.05 — if the patient is healthy, the test still says positive 5% of the time. The false positive rate is the chance that a healthy person gets a wrong positive result. If the false positive rate is 5%, then 5 out of every 100 healthy people will be incorrectly told they are sick. Now you need P(positive), the total probability of a positive result. This comes from the law of total probability: P(positive) = P(positive | disease) × P(disease) + P(positive | no disease) × P(no disease) = (0.95 × 0.01) + (0.05 × 0.99) = 0.0095 + 0.0495 = 0.059. Plugging into Bayes: P(disease | positive) = (0.95 × 0.01) ⁄ 0.059 = 0.0095 ⁄ 0.059 = 0.161, or about 16.1%. The 95% accurate test only gives you a 16% chance of actually being sick. The reason is that the flood of false positives from the 99% healthy population drowns out the true positives from the 1% who are actually sick.
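These numbers are easy to check with a few lines of Python. Here is a minimal sketch (the function name is illustrative, not a standard library call):

```python
def bayes_posterior(prior, sensitivity, false_positive_rate):
    """P(hypothesis | positive evidence), via Bayes' theorem.

    P(positive) comes from the law of total probability.
    """
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return (sensitivity * prior) / p_positive

# The Abuja clinic numbers from the lesson:
posterior = bayes_posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # 0.161
```

Changing `prior` is the fastest way to feel how much the base rate drives the answer.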

This is not just a medical curiosity — it is the exact challenge every ML system faces when dealing with rare events. A fraud detector monitoring ₦4 billion in daily mobile money transactions across Nigeria might face a fraud rate of 0.5%. Even with a model that correctly catches 90% of fraud (sensitivity) and has only a 2% false positive rate, Bayes’ theorem tells you that a flagged transaction actually has only about an 18.4% chance of being real fraud: P(fraud | flagged) = (0.90 × 0.005) ⁄ [(0.90 × 0.005) + (0.02 × 0.995)] = 0.0045 ⁄ (0.0045 + 0.0199) = 0.0045 ⁄ 0.0244 = 18.4%. That means over 80% of flagged transactions are legitimate customers being wrongly accused. This is why banks use multi-stage screening: the first filter catches broad patterns, then a second system reviews only the flagged cases, progressively updating the belief at each stage. Each stage is a Bayesian update, and the posterior (your updated belief) from one stage becomes the prior (your starting belief) for the next. This is similar to how a doctor updates their diagnosis: they start with a prior belief about what disease the patient might have (based on symptoms and prevalence), then update that belief when lab results come in (the evidence). New test results can shift the diagnosis dramatically — or barely at all — depending on how consistent the evidence is with each possible condition. Bayesian thinking is not a niche academic topic; it is the logical backbone of how intelligent systems refine their understanding of the world.
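The fraud calculation works out the same way in code (a minimal sketch using the lesson's numbers):

```python
sensitivity = 0.90          # P(flagged | fraud)
false_positive_rate = 0.02  # P(flagged | legitimate)
prior = 0.005               # fraud base rate: 0.5% of transactions

# Law of total probability: flags come from fraud AND from false alarms.
p_flagged = sensitivity * prior + false_positive_rate * (1 - prior)
p_fraud_given_flagged = (sensitivity * prior) / p_flagged
print(round(p_fraud_given_flagged, 3))  # 0.184
```

Notice that the false-alarm term (0.02 × 0.995) is more than four times larger than the true-positive term (0.90 × 0.005) — that is the flood of false positives in numerical form.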

The practical takeaway is this: always think about the base rate. The base rate is simply how common something is before you test for it. If 1% of the population has the disease, the base rate is 1%. This number matters enormously — a test for a rare condition will produce far more false alarms than a test for a common one, even if both tests are equally accurate. How common or rare is the thing you are trying to detect? When the base rate is very low — rare diseases, rare fraud, rare security threats — even a highly accurate model will produce mostly false alarms. This is why accuracy alone is dangerously misleading for imbalanced problems. Metrics like precision (of all flagged items, what fraction are real?) and recall (of all real cases, what fraction did we catch?) exist precisely because they separate the two types of errors that Bayes’ theorem makes visible. A model with 99% accuracy that never catches fraud has 0% recall. A model with 80% accuracy that catches 90% of fraud is far more valuable to the bank. Bayesian thinking forces you to see past surface-level accuracy and understand what a model is actually doing in context.

Guided Exercises

Exercise 1: Revisit the medical test from Session 6 using Bayes’ formula explicitly. Prior: P(disease) = 0.01. Likelihood of positive if sick: P(positive | disease) = 0.95. False positive rate: P(positive | no disease) = 0.05. Step 1: Calculate P(positive) using the law of total probability. Step 2: Plug into Bayes’ formula to find P(disease | positive). Step 3: Now suppose the disease is 10 times more common, affecting 10% of the population. Recalculate. How much does the base rate change your conclusion? Write out every arithmetic step.
Exercise 2: An email contains the word "free." Your prior belief that any email is spam is P(spam) = 0.20. Among spam emails, 60% contain "free," so P("free" | spam) = 0.60. Among legitimate emails, only 5% contain "free," so P("free" | not spam) = 0.05. Step 1: Calculate P("free") = P("free" | spam) × P(spam) + P("free" | not spam) × P(not spam). Step 2: Apply Bayes to find P(spam | "free"). Step 3: What if the email contains "free" AND "winner"? Would the posterior probability increase or decrease compared to just "free" alone? Explain your reasoning.
Exercise 3: You are evaluating a new model architecture. Your prior belief that it will beat your baseline is 50%. You run an experiment and it wins. But small-dataset experiments are noisy: P(wins | truly better) = 0.70, and P(wins | not truly better) = 0.30. Step 1: Use Bayes to calculate P(truly better | wins). Step 2: You run a second independent experiment and it also wins, with the same likelihoods. Update your posterior from step 1 to get a new posterior. How much more confident are you after two wins versus one? This is sequential Bayesian updating — the same principle behind how models improve with more training data.
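Sequential updating is mechanical once you have a single update step. Here is a sketch (the function name is illustrative), checked against the lesson's medical-test numbers rather than the exercise's, so you can still work the exercises yourself:

```python
def update(prior, p_e_given_h, p_e_given_not_h):
    """One Bayesian update. The returned posterior can serve as
    the prior for the next piece of evidence."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return (p_e_given_h * prior) / p_e

belief = 0.01                        # prior: 1% base rate
belief = update(belief, 0.95, 0.05)  # first positive test
print(round(belief, 3))              # 0.161
belief = update(belief, 0.95, 0.05)  # independent second positive test
print(round(belief, 3))              # 0.785
```

Two positive results move you from 1% to about 78.5% — evidence compounds, but only through the prior.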

Discussion Prompt

How does Bayesian thinking apply to debugging an ML pipeline? Suppose your model’s accuracy drops from 94% to 71% after deployment. You suspect data drift. How would you use Bayesian reasoning to evaluate this hypothesis — what is your prior for data drift being the cause, what evidence would change your belief, and what alternative explanations (software bugs, label errors, infrastructure issues) compete as hypotheses?

Key Takeaway

Always consider the base rate. Flashy evidence is less impressive when the thing you are looking for is rare. A 95% accurate test for a 1-in-100 disease gives you only a 16% chance of being sick. This principle applies to medical tests, fraud detection, anomaly detection, and every ML system that deals with imbalanced data.

Quick Check

Using Bayes' theorem with P(disease)=0.01, P(positive|disease)=0.95, and P(positive|no disease)=0.05, P(disease|positive) is approximately:

  • 95%
  • 16%
  • 50%

Why does the base rate (how common the thing is) matter when interpreting test results?

  • It doesn't matter — only the test accuracy matters
  • It determines the test sensitivity
  • A rare condition produces more false positives than true positives, even with a good test

In Bayesian updating, your posterior from one stage becomes:

  • The prior for the next stage
  • The final answer — no more updates needed
  • The likelihood for the next stage

Key Terms

What is Bayes' theorem?

A formula for updating beliefs with evidence: P(H|E) = P(E|H) × P(H) / P(E). It combines prior belief, evidence likelihood, and total evidence probability.

What is a prior?

Your belief before seeing the evidence. If 1% of the population has a disease, the prior P(disease) = 0.01. The prior shapes how much new evidence moves your belief.

What is a posterior?

Your updated belief after seeing evidence. If your prior was 1% and a test result moves you to 16%, the posterior is 16%. It becomes the prior for the next round of evidence.

Session 8

Vectors and Matrices — No Fear

Linear Algebra Intuition (Not Proofs)

90 Minutes
Objective Understand what vectors and matrices are, what operations on them mean intuitively, and why dimensions matter in ML.

Concept Lesson

Your team is building a house price prediction model for Lagos. Each house in your dataset has three features: number of bedrooms, size in square meters, and price in millions of Naira. A single house — say, 3 bedrooms, 120 square meters, ₦45M — can be written as a list of three numbers: [3, 120, 45]. That list is a vector. A vector is simply an ordered list of numbers, and in machine learning, every data point you ever feed into a model is represented this way. A customer profile becomes [age, income, purchase_count]. An image becomes a vector of pixel values. A sentence becomes a vector of word embeddings (in modern AI, each word is converted into a list of numbers that captures its meaning, so words with similar meanings get similar lists; this conversion is called an embedding). Each number in the list is one feature, and the total number of features is the dimension of the vector. Once you see data as vectors, you realize that ML is fundamentally about finding patterns across large collections of these lists.
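In code, a vector is just an ordered list (the feature names in the comments are illustrative):

```python
# Each data point becomes an ordered list of numbers — a vector.
house_a = [3, 120, 45]        # [bedrooms, size_sqm, price_millions]
customer = [34, 250_000, 12]  # [age, income_naira, purchase_count]

# The dimension of a vector is simply how many features it holds:
print(len(house_a))   # 3
print(len(customer))  # 3
```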

Now take your three Lagos houses: House A is [3, 120, 45], House B is [5, 250, 120], and House C is [2, 75, 28]. Stack them vertically and you get a grid with 3 rows and 3 columns — that grid is a matrix. Each row is one house (one data point), and each column is one feature (bedrooms, size, price). This is exactly how a spreadsheet of data looks, and it is exactly how datasets are represented inside ML frameworks. A dataset with 1,000 houses and 5 features is a 1,000 × 5 matrix. The shape — rows by columns — is something you will check constantly when building ML pipelines. If your model expects 5 features but you accidentally pass it 4, you get a dimension mismatch error. This is one of the most common sources of bugs in practice, and it happens because someone did not check that the matrix shape matched what the model expected. Getting comfortable with the idea of matrix shape now will save you hours of frustrating debugging later.
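Here is the same dataset as a NumPy array, with the shape check you will do constantly (the guard at the end is an illustrative pattern, not a required API):

```python
import numpy as np

# The three Lagos houses stacked into a dataset matrix:
X = np.array([
    [3, 120, 45],    # House A: [bedrooms, size_sqm, price_millions]
    [5, 250, 120],   # House B
    [2, 75, 28],     # House C
])
print(X.shape)       # (3, 3): 3 rows (houses) x 3 columns (features)

# A simple guard against dimension-mismatch bugs:
expected_features = 3
assert X.shape[1] == expected_features, "feature count mismatch"
```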

The most important operation in all of ML is matrix multiplication, and you do not need to memorize the mechanical procedure — you need to understand what it means. When you multiply a data vector by a weight vector, each element of the result is a weighted sum of the input features. Consider a single neuron in a neural network that receives three inputs: number of bedrooms, size in square meters, and number of bathrooms. The neuron has three learned weights: [0.5, 0.3, 0.2]. To compute its output, you take the dot product — multiply each weight by the corresponding feature and add the results: (0.5 × 3) + (0.3 × 120) + (0.2 × 2) = 1.5 + 36.0 + 0.4 = 37.9. The weights tell you which features the neuron pays attention to. Here, the size feature (weight 0.3 applied to a large value of 120) dominates the output. During training, the model adjusts these weights to amplify features that help predict the target and suppress features that add noise. When you stack many such neurons together — each with its own weight vector — and arrange them in layers, you get a neural network. Each layer is a matrix multiplication that transforms the input into a new representation, and the non-linear activation function between layers allows the network to learn curved, complex patterns rather than just straight lines. An activation function is a simple rule applied after the weighted sum — it decides whether a neuron ‘fires’ or not.
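The neuron's weighted sum from this paragraph is a single dot product. A minimal NumPy sketch with the lesson's numbers:

```python
import numpy as np

weights = np.array([0.5, 0.3, 0.2])  # the neuron's learned weights
features = np.array([3, 120, 2])     # [bedrooms, size_sqm, bathrooms]

# Dot product: multiply element-wise, then sum.
output = np.dot(weights, features)   # (0.5*3) + (0.3*120) + (0.2*2)
print(round(float(output), 1))       # 37.9
```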

To build intuition for what the dot product measures, think of it as a similarity score. If two vectors point in the same direction, their dot product is large and positive. If they point in opposite directions, it is negative. If they are perpendicular, it is zero. In a recommendation system, each user and each movie can be represented as vectors in the same space. The dot product of a user vector and a movie vector measures how well the movie matches the user’s preferences — a high dot product means the user is likely to enjoy it. In a search engine, the dot product between a query vector and a document vector measures relevance. This single operation — multiplying corresponding elements and summing them up — appears in every corner of ML: linear regression, logistic regression, support vector machines, transformers, and convolutional neural networks all rely on it at their core. Once you understand that the dot product is a weighted combination measuring alignment between two vectors, 80% of the linear algebra you need for ML is in place.
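A small sketch of the dot product as a similarity score — the user and movie vectors below are made-up illustrations, not real learned embeddings:

```python
import numpy as np

def dot_similarity(a, b):
    """Dot product as an alignment score between two vectors."""
    return float(np.dot(a, b))

user = np.array([1.0, 0.5, -0.2])      # hypothetical preference vector
movie_a = np.array([0.9, 0.6, -0.1])   # points roughly the same way
movie_b = np.array([-1.0, -0.5, 0.2])  # points the opposite way

print(dot_similarity(user, movie_a))   # positive: likely a good match
print(dot_similarity(user, movie_b))   # negative: likely a poor match
```

A recommender would surface movie_a over movie_b for this user, purely from the sign and size of these scores.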

Guided Exercises

Exercise 1: Represent these three houses as vectors: House A (3 beds, 120 sqm, ₦45M), House B (5 beds, 250 sqm, ₦120M), House C (2 beds, 75 sqm, ₦28M). Write out all three vectors. Now stack them into a matrix. What are the dimensions of this matrix? What does each row represent? What does each column represent? If you added a fourth feature — distance to the nearest BRT station in km — what would the new dimensions be?
Exercise 2: You have a data vector with 10 features, but your model’s input layer expects exactly 8 features. When you try to run the model, you get a shape mismatch error. Why does this happen? Describe two specific ways to fix this problem. One fix should remove information (dimensionality reduction — combining or transforming your features to reduce how many there are, e.g. turning 10 related measurements into 3 summary scores — or feature selection — choosing only the most important features and dropping the rest), and the other should change the model architecture (add more input neurons). What are the tradeoffs of each approach?
Exercise 3: Given weights w = [0.5, 0.3, 0.2] and house features x = [4, 8, 10] (4 bedrooms, 8 bathrooms, 10 years of age), calculate the dot product: w₁x₁ + w₂x₂ + w₃x₃. Show every multiplication step. This single number is what one neuron computes. Which feature contributes the most to the output? If the model learned these weights during training on a price prediction task, what does it tell you about which features the model considers most predictive of house price?
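If you want to check your arithmetic, a plain-Python dot product helper looks like this — verified below against the worked example from the lesson, not the exercise, so the exercise answers stay yours to find:

```python
def dot(w, x):
    """Plain-Python dot product: sum of element-wise products."""
    return sum(wi * xi for wi, xi in zip(w, x))

# The lesson's neuron: (0.5*3) + (0.3*120) + (0.2*2)
print(round(dot([0.5, 0.3, 0.2], [3, 120, 2]), 1))  # 37.9
```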

Discussion Prompt

When someone says "a neural network is just matrix multiplication plus some non-linearity," does that make more sense now? A single layer takes an input vector, multiplies it by a weight matrix (many dot products in parallel), adds a bias, and passes the result through a non-linear function like ReLU (Rectified Linear Unit). A bias is a constant number added to the weighted sum; it shifts the output up or down, like adjusting the intercept in y = mx + b (this is different from statistical bias, which means systematic error). ReLU has a beautifully simple rule: if the input is positive, pass it through unchanged; if it is negative, output zero. That simple non-linearity is what lets the network learn curved, complex patterns; without it, stacking 100 layers would still just produce a straight line. Can you now trace what happens at each step when data flows through one layer of a network?
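Tracing one layer in code makes the recipe concrete. This is a minimal NumPy sketch; the weights, biases, and inputs are illustrative values, not a trained model:

```python
import numpy as np

def relu(z):
    """ReLU: pass positive values through, clip negatives to zero."""
    return np.maximum(0, z)

# One layer: 2 neurons, each with 3 weights (one per input feature).
W = np.array([[0.5, 0.3, 0.2],    # neuron 1's weights
              [-0.4, 0.1, 0.6]])  # neuron 2's weights (illustrative)
b = np.array([1.0, -0.5])         # one bias per neuron
x = np.array([3, 120, 2])         # [bedrooms, size_sqm, bathrooms]

z = W @ x + b   # many dot products in parallel, plus bias
a = relu(z)     # non-linearity applied element-wise
print(a.shape)  # (2,): one activation per neuron
```

Stacking more layers just repeats this recipe, feeding each layer's activations `a` in as the next layer's input `x`.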

Key Takeaway

Vectors are data points. Matrices are datasets (or weight tables). The dot product is a weighted sum that measures alignment. Matrix multiplication is many dot products run in parallel. That is 80% of the linear algebra you need to understand what is happening inside every ML model you will ever use.

Quick Check

A vector is:

  • An ordered list of numbers representing features
  • A single number
  • A grid of numbers with rows and columns

A matrix with 1,000 rows (houses) and 5 columns (features) has dimensions:

  • 5 × 1,000
  • 1,000 × 5
  • 1,005

The dot product of [0.5, 0.3, 0.2] and [4, 10, 8] is:

  • 4.1
  • 6.6
  • 22

Key Terms

What is a vector?

An ordered list of numbers. In ML, every data point is a vector: a house = [bedrooms, size, price]. The number of elements is the dimension.

What is a matrix?

A grid of numbers with rows and columns. A dataset with 1,000 houses and 5 features is a 1,000 × 5 matrix. Each row is a data point; each column is a feature.

What is a dot product?

Multiply corresponding elements and sum the results. It measures alignment between two vectors. A neural network computes many dot products in parallel to transform data.