
Week 3 of 8

Data in Context

Apply everything from Weeks 1–2 to a real dataset, then begin the shift into probabilistic thinking — the mathematical language of ML.

Before this week
Mean, median, mode & spread
Distributions

Session 5

Integration Day — Putting It All Together

Reading a Dataset

90 Minutes
Objective: Apply all Weeks 1–2 concepts to analyze a real dataset from scratch. Build the habit of exploratory reasoning before modeling.

Concept Lesson

Imagine you are a data analyst at a Nigerian bank that processes ₦4.2 billion in mobile money transactions every day. Your team lead drops a CSV file on your desk and says, "Take a look at this before we build anything." You open the file and see 10,000 rows with four columns: amount in Naira, timestamp, merchant_category, and is_fraud coded as 0 or 1. No Python scripts, no Jupyter notebooks — just your pen, a calculator, and the reasoning skills you have built over the last two weeks. This is how every real ML project should start: with a human reading the data before a machine ever touches it.

Your first instinct should be to count. How many rows are there? The file says 10,000 — that is your dataset size, and it matters because small datasets lead to unreliable patterns. Next, you split the data by the is_fraud column. Suppose you find 9,400 legitimate transactions and 600 fraudulent ones. That is a fraud rate of 600 / 10,000 = 6%, or roughly 1 fraud case for every 15.7 legitimate transactions. This class imbalance is critical. If you ignored it and built a model that always predicted "not fraud," the model would appear to be 94% accurate — an impressive number that completely fails at its job. Your team lead would see a high accuracy score, approve deployment, and the bank would lose money every single day because the model never catches the fraud it was designed to detect. Understanding class balance before modeling is not optional; it is the difference between a useful system and an expensive paperweight.
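The counting above can be sketched in a few lines of Python, using the lesson's supposed counts (9,400 legitimate, 600 fraudulent) rather than the real file:

```python
# Hypothetical counts from the lesson: 9,400 legitimate, 600 fraudulent.
legit, fraud = 9_400, 600
total = legit + fraud                      # 10,000 transactions

fraud_rate = fraud / total                 # fraction of transactions that are fraud
ratio = legit / fraud                      # legitimate cases per fraud case

# A lazy baseline that always predicts "not fraud" is correct
# exactly on the legitimate transactions and nowhere else.
baseline_accuracy = legit / total

print(f"fraud rate: {fraud_rate:.1%}")                # 6.0%
print(f"1 fraud per {ratio:.1f} legitimate")          # 15.7
print(f"baseline accuracy: {baseline_accuracy:.1%}")  # 94.0%
```

Note that the baseline accuracy is just the majority-class fraction: no learning happens at all, which is exactly why 94% here is meaningless.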

Now look at the amount column. Calculate the overall mean and median transaction amount. Suppose the mean is ₦18,500 and the median is ₦4,200. That gap is a red flag — it tells you the distribution is right-skewed, pulled upward by a few very large transactions. Now split by fraud status. Perhaps legitimate transactions have a mean of ₦12,000 and a median of ₦3,800, while fraudulent transactions have a mean of ₦95,000 and a median of ₦72,000. The fraud amounts are dramatically higher, but the median is the more honest measure here because it is not distorted by a single ₦2,000,000 transfer. If you set a threshold for suspicious transactions based on the mean, you would flag every transaction above ₦95,000 — but that misses the cluster of fraud cases in the ₦50,000 to ₦80,000 range. The median-based approach would catch more real fraud. Choosing mean versus median is not a technicality; it determines whether your alert system actually works or just looks good on a report.
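To see the mean-versus-median effect concretely, here is a small sketch with invented fraud amounts (these are illustrative values, not the lesson's actual dataset), where a single ₦2,000,000 transfer drags the mean far above the median:

```python
import statistics

# Invented fraud amounts (₦) mimicking the lesson's right-skewed pattern;
# one huge transfer inflates the mean but barely moves the median.
fraud_amounts = [55_000, 60_000, 72_000, 80_000, 95_000, 2_000_000]

mean_amt = statistics.mean(fraud_amounts)      # ≈ ₦393,667, pulled up by the outlier
median_amt = statistics.median(fraud_amounts)  # ₦76,000, robust to the outlier

# Count how many fraud cases each threshold would actually flag.
caught_by_mean = sum(a >= mean_amt for a in fraud_amounts)      # only the ₦2M outlier
caught_by_median = sum(a >= median_amt for a in fraud_amounts)  # 80k, 95k, and 2M
```

With these numbers the mean-based threshold flags 1 of 6 fraud cases while the median-based threshold flags 3, which is the lesson's point in miniature.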

Finally, examine time. Sort the timestamps into weekly buckets and count the fraud cases per week. Suppose Week 1 had 80 fraud cases, Week 2 had 95, Week 3 had 110, Week 4 had 130. That is a consistent upward trend — roughly 15–20 additional fraud cases each week. If you calculate the rate of change from Week 1 to Week 4, fraud increased by (130 − 80) / 80 = 62.5% over four weeks. This tells you the problem is getting worse, not better, and any model you build must be retrained regularly or it will drift out of date within a month. Drift means the patterns in the real world change over time (fraud tactics evolve, for example), so a model trained on last month's data starts making wrong predictions on this month's data. A model trained on Week 1 data alone would miss the evolving fraud patterns appearing in later weeks. All of this analysis — ratios, percentages, central tendency, spread, and rate of change — comes directly from the first two weeks of this course. You are now applying those tools to make a real business decision.
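The trend arithmetic above can be checked in two lines, using the weekly counts supposed in the lesson:

```python
weekly_fraud = [80, 95, 110, 130]   # fraud cases per week, from the lesson

# Overall growth from Week 1 to Week 4.
growth = (weekly_fraud[-1] - weekly_fraud[0]) / weekly_fraud[0]   # 0.625 = 62.5%

# Week-over-week increases confirm the steady upward trend.
deltas = [b - a for a, b in zip(weekly_fraud, weekly_fraud[1:])]  # [15, 15, 20]
```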

Guided Exercises

Exercise 1: From the dataset, count the total number of fraud cases and legitimate cases. Calculate the fraud ratio (fraud / total). Now imagine a lazy baseline model that always predicts "not fraud" for every transaction. What accuracy would it achieve? Express your answer as a percentage. Then explain in two sentences why this high accuracy is completely meaningless for a fraud detection system — what would happen to the bank if they deployed this model and fraud cases slipped through every day?
Exercise 2: Calculate the mean and median transaction amount for fraudulent transactions separately from legitimate ones. Suppose you find fraud mean = ₦95,000, fraud median = ₦72,000, legitimate mean = ₦12,000, legitimate median = ₦3,800. Which measure — mean or median — should you use to set a "suspicious transaction" threshold, and why? Consider what happens if you set the threshold at the mean: you would flag only transactions above ₦95,000, but there are fraud cases at ₦55,000 that you would miss entirely.
Exercise 3: Write a 3-sentence summary of this dataset for a non-technical bank executive sitting in Lagos. Include the fraud rate, the typical fraud amount (use the median, not the mean), and the time trend. Your executive needs to decide whether to invest in a fraud detection model — what number will convince them this is urgent? This exercise bridges quantitative analysis with the verbal reasoning you will use every day in your career.

Discussion Prompt

Before this course, if someone handed you a CSV with 10,000 rows, you might have jumped straight into training a model. What would you do differently now? List at least five specific questions you would ask about the data before writing a single line of model code. How would the class imbalance you discovered today change your choice of model, your evaluation metric, and your communication with stakeholders?

Key Takeaway

Data analysis is reasoning, not just computation. The best ML practitioners spend more time understanding their data than tuning their models. A 94% accurate model that never catches fraud is worse than useless — it is dangerous because it creates false confidence.

Quick Check

You have a fraud dataset with 600 fraud cases out of 10,000 total. The fraud rate is 6%. A model that always predicts "not fraud" achieves:

  • 6% accuracy
  • 94% accuracy
  • 50% accuracy

Fraud transaction amounts have mean = ₦95,000 and median = ₦72,000. To set a threshold for flagging suspicious transactions, which measure is more appropriate?

  • Median, because the mean is inflated by a few very large transactions
  • Mean, because it accounts for all values
  • It doesn't matter — they give the same result

Key Terms


What is class imbalance?

When one category has far more examples than the other. A 98/2 fraud split means the model can get 98% accuracy by always predicting "not fraud" — and never catch a single fraud case.

What is exploratory analysis?

The process of examining data before modeling: counting rows, checking class balance, computing statistics, visualizing distributions. Every ML project should start here.

What is data drift?

When patterns in real-world data change over time, making a trained model less accurate. A model trained on Week 1 data may miss fraud patterns emerging in Week 4.

Session 6

Entering Probability

Thinking in Probabilities

90 Minutes
Objective: Understand probability as a measure of uncertainty, calculate basic probabilities, and interpret probabilistic ML outputs.

Concept Lesson

Imagine you are designing a spam filter for an email service used by 2 million Nigerians. Every day, about 800,000 emails flow through the system, and roughly 120,000 of those are spam. That means the probability that a randomly selected email is spam is 120,000 / 800,000 = 0.15, or 15%. A probability is simply a number between 0 and 1 that describes how likely something is. A value of 0 means the event is impossible; 1 means it is certain; 0.5 means it is a coin flip. Every ML classifier you will ever use outputs probabilities under the hood, even if the final prediction you see is just a class label. When your spam filter says "spam," it first computed something like P(spam | email content) = 0.87 and then decided that 87% confidence was high enough to flag the message. Understanding what those numbers mean is the foundation of building systems that make good decisions under uncertainty.
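As a sketch of that idea, here is the spam prior from the lesson's counts plus a toy labeling rule; the 0.5 cutoff is an assumption for illustration, not a universal default:

```python
# Spam prior from the lesson's counts: 120,000 of 800,000 emails are spam.
total_emails, spam_emails = 800_000, 120_000
p_spam = spam_emails / total_emails      # 0.15

# A classifier's final label is just its probability compared to a cutoff.
# The 0.5 default here is an illustrative assumption, not a fixed rule.
def label(p_spam_given_email: float, cutoff: float = 0.5) -> str:
    return "spam" if p_spam_given_email >= cutoff else "legitimate"

print(label(0.87))   # the filter's P(spam | content) = 0.87 example -> "spam"
```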

Now consider two things happening together — this is called joint probability. Suppose there is a 30% chance of rain in Lagos tomorrow, and a 50% chance you forget your umbrella on the bus. If these two events are independent, meaning rain does not affect whether you forget your umbrella, then the probability of both happening is 0.3 × 0.5 = 0.15, or 15%. You multiply the individual probabilities because you need both conditions to be true simultaneously. This seems simple, but it matters enormously in ML. When a fraud detector considers multiple suspicious features at once — a large transaction amount AND an unusual merchant AND a new device — the joint probability of all three happening by coincidence is much smaller than any single feature alone. Models learn to exploit these joint probabilities. The danger is that in practice features are often not independent: if a customer's spending increases, their transaction count usually increases too, and "unusual merchant" and "new device" tend to appear together. When events are correlated, you cannot simply multiply their probabilities, and treating them as independent inflates your confidence. Ignoring correlation between features is one of the most common ways ML models produce badly calibrated predictions.
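A minimal sketch of the umbrella example, including the extreme correlated case to show why the product rule fails when independence breaks:

```python
p_rain = 0.30      # chance of rain in Lagos tomorrow
p_forget = 0.50    # chance of forgetting the umbrella

# Independent events: multiply the individual probabilities.
p_both_independent = p_rain * p_forget          # 0.15

# Extreme opposite case: the events overlap as much as possible
# (every rainy day is also a forget-the-umbrella day). Then the true
# joint probability is min(p_rain, p_forget) = 0.30, double what the
# independence assumption predicts.
p_both_max_overlap = min(p_rain, p_forget)
```

The real joint probability of correlated events lies somewhere between these extremes, which is exactly why multiplying correlated feature probabilities miscalibrates a model.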

Conditional probability is the engine that makes ML work. It asks a targeted question: given that I have already observed event X, what is the probability of outcome Y? This is written P(Y | X). For your spam filter, P(spam | email contains "winner") might be 0.72, while P(spam | email contains "meeting") might be only 0.03. The overall probability of spam is 15%, but the word inside the email changes that probability dramatically. Your filter works by computing conditional probabilities for every word and feature in the email, then combining them to produce a final score. This is exactly how a fraud detection system works too: P(fraud | amount > ₦100,000) is much higher than P(fraud | amount = ₦2,000). Conditional probability is the bridge between raw data and meaningful predictions.
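The counting behind a conditional probability can be made explicit with an invented toy corpus, sized so the estimates match the lesson's 0.72 and 0.03 figures:

```python
# Invented toy corpus of (trigger_word, is_spam) pairs, with counts
# chosen so the conditional probabilities match the lesson's examples.
emails = ([("winner", True)] * 72 + [("winner", False)] * 28
          + [("meeting", True)] * 3 + [("meeting", False)] * 97)

def p_spam_given(word: str) -> float:
    """Estimate P(spam | email contains `word`) by counting matching emails."""
    matches = [is_spam for w, is_spam in emails if w == word]
    return sum(matches) / len(matches)

print(p_spam_given("winner"))    # 0.72
print(p_spam_given("meeting"))   # 0.03
```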

To see why conditional probability can be deeply counterintuitive, consider this medical screening scenario. A hospital in Abuja screens 10,000 patients for a disease that affects 1% of the population. The test is 95% accurate: if you are sick, it says positive 95% of the time; if you are healthy, it says negative 95% of the time. Sensitivity (also called recall) is the chance the test catches you if you are actually sick — in this case, 95%. Specificity is the chance the test correctly clears you if you are healthy — also 95% here. Out of 10,000 people, 100 have the disease and 9,900 do not. Of the 100 sick people, 95 test positive (true positives). Of the 9,900 healthy people, 5% test positive by accident — that is 495 false positives. So out of 590 total positive results, only 95 are real. The probability of actually having the disease given a positive test is 95 / 590 = 16.1%. Your gut says 95%, but the math says 16%. This gap between intuition and calculation is exactly why ML practitioners must think in probabilities rather than in gut feelings. The consequence of getting this wrong in a real hospital is that hundreds of healthy patients receive unnecessary treatment, wasting resources and causing anxiety.
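The screening arithmetic above translates directly into code, which is a useful sanity check when your gut insists on 95%:

```python
population = 10_000
prevalence = 0.01        # 1% of people have the disease
sensitivity = 0.95       # P(test positive | sick)
specificity = 0.95       # P(test negative | healthy)

sick = population * prevalence                  # 100 people
healthy = population - sick                     # 9,900 people
true_positives = sensitivity * sick             # 95
false_positives = (1 - specificity) * healthy   # 495

# P(sick | positive test) = true positives / all positives
p_sick_given_positive = true_positives / (true_positives + false_positives)
print(f"{p_sick_given_positive:.1%}")           # 16.1%
```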

Two metrics matter most when evaluating a classifier's predictions. Precision answers: of everything the model flagged as positive, how many actually were positive? (If the model flags 100 transactions and 85 are real fraud, precision is 85%.) Recall answers: of everything that actually was positive, how many did the model catch? (If there are 200 fraud cases and the model catches 150, recall is 75%.) High precision means few false alarms; high recall means few missed cases. In fraud detection, a bank typically prefers high recall — it is better to investigate a few extra transactions than to let a thief steal from customers.
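Both definitions reduce to one-line ratios over the confusion counts; here they are applied to the lesson's two example scenarios:

```python
def precision(tp: int, fp: int) -> float:
    """Of everything flagged as positive, the fraction that really was positive."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of everything actually positive, the fraction the model caught."""
    return tp / (tp + fn)

# Lesson's scenarios: 85 of 100 flags were real fraud; 150 of 200 frauds caught.
print(precision(tp=85, fp=15))   # 0.85
print(recall(tp=150, fn=50))     # 0.75
```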

Guided Exercises

Exercise 1: A medical test for a disease is 95% accurate (sensitivity = 95%, specificity = 95%). The disease affects 1% of the population. You test positive. Work this out with a population of 10,000: how many people have the disease? How many true positives? How many false positives? What fraction of all people who test positive actually have the disease? Show every step of your calculation. Then explain in one sentence why the result is so much lower than 95%.
Exercise 2: Your fraud detection model outputs a probability of 0.7 that a transaction is fraudulent. A threshold is a cutoff point: most ML classifiers output a probability, and the threshold is where you draw the line. If the probability is above the threshold, you say "fraud"; below it, you say "legitimate." Choosing the threshold is a business decision, not just a mathematical one. If you set the threshold at 0.5, you flag this transaction as fraud; if you set it at 0.9, you let it pass. A false positive is a legitimate transaction that the model wrongly flags as fraud (the customer's card gets frozen for no reason). A false negative is a real fraud case the model misses (the thief gets away with the money). For each threshold, describe what happens to false positives and false negatives. Now consider a hospital diagnostic model: would you want a lower or higher threshold? What about a spam filter? Explain how the cost of each type of error should guide your threshold choice.
Exercise 3: In your training data, 80% of emails are not spam. A model that always predicts "not spam" would be 80% accurate on this dataset. Compute this number yourself: 800 out of 1,000 emails are not spam, so predicting "not spam" every time gets 800 correct = 80% accuracy. Explain why this is misleading. What happens to the 200 spam emails? What metric would better capture whether the model is actually learning something useful, and why would a bank care about that metric when building a fraud detector?

Discussion Prompt

Why do you think many ML models output probabilities rather than just yes/no decisions? Imagine two fraud detectors: one says "fraud" or "not fraud," and the other says "78% chance of fraud." Which gives the bank more flexibility to decide what to do next? How does probabilistic output let different stakeholders — risk teams, customer service, executives — each set their own thresholds based on their own cost-benefit analysis?

Key Takeaway

ML is fundamentally about uncertainty. Probability gives us a precise language to reason about what we do not know, rather than relying on gut feeling. A model that says "I am 87% confident" is far more useful than one that just says "yes," because it lets humans make informed decisions with full knowledge of the risk.

Quick Check

A disease affects 1% of the population. A test with 95% sensitivity and 95% specificity gives a positive result. The probability the patient actually has the disease is approximately:

  • 16%
  • 95%
  • 50%

You lower your fraud detection threshold from 0.7 to 0.5. What happens?

  • Fewer legitimate transactions are flagged, but more fraud is missed
  • More fraud is caught (higher recall), but more legitimate transactions are wrongly flagged (lower precision)
  • Nothing changes — only accuracy matters

80% of emails are not spam. A model that always predicts "not spam" achieves 80% accuracy. Why is this misleading?

  • The accuracy is wrong — it should be 100%
  • 80% is a good number, so the model is fine
  • The model never catches any spam — 0% recall on the important class

Key Terms


What is conditional probability?

The probability of one event given that another has already happened. Written P(Y|X). "What is the chance of fraud given a large transaction amount?"

What is a threshold?

The cutoff point for converting a probability into a decision. A threshold of 0.5 means "flag as fraud if probability > 50%." Lower thresholds catch more but cause more false alarms.

What is the base rate?

How common something is before testing. A disease affecting 1% has a low base rate, meaning even a 95% accurate test produces mostly false positives.