
Week 2 of 8

Describing Data

Learn to summarize and visualize data meaningfully. These are the skills you will use every single time you work with a dataset — before you ever train a model.

Session 3

Summarizing Data

Mean, Median, Mode, and Spread

90 Minutes
Objective Summarize datasets using central tendency and spread, and understand when each measure is appropriate.

Concept Lesson

Imagine you are a data scientist at a Nairobi-based agritech company. Your team just trained five versions of a crop disease detection model and ran each on the same test set. The accuracy scores come back: 0.81, 0.83, 0.82, 0.80, 0.84. You calculate the mean by adding them up and dividing by five: (0.81 + 0.83 + 0.82 + 0.80 + 0.84) ÷ 5 = 4.10 ÷ 5 = 0.82, or 82%. Simple enough. But now suppose one of the five runs suffered a GPU memory error mid-training and produced a score of 0.45 instead of 0.84. The new mean is (0.81 + 0.83 + 0.82 + 0.80 + 0.45) ÷ 5 = 3.71 ÷ 5 = 0.742, or 74.2%. A single faulty run dragged the average down by nearly 8 percentage points. This is the mean’s critical weakness: it is sensitive to outliers. In machine learning you will constantly encounter datasets where a small number of extreme values—a sensor glitch, a data entry error, a rare edge case—distort the average. If you rely blindly on the mean to report model performance or summarize features, you will draw wrong conclusions that ripple into production decisions, from choosing the wrong model to setting incorrect thresholds.
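The arithmetic in this scenario is easy to verify in Python. The sketch below uses the standard library's statistics module and the exact scores from the example:

```python
from statistics import mean

# Five clean runs from the example above
clean_runs = [0.81, 0.83, 0.82, 0.80, 0.84]
print(round(mean(clean_runs), 3))   # 0.82

# Same runs, but one faulty GPU run scored 0.45 instead of 0.84
buggy_runs = [0.81, 0.83, 0.82, 0.80, 0.45]
print(round(mean(buggy_runs), 3))   # 0.742: one bad run drags the mean down nearly 8 points
```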

The median solves the outlier problem by looking at the middle value. Sort the data from lowest to highest, and the median is whatever sits in the center. Using the buggy run from above, the sorted scores are 0.45, 0.80, 0.81, 0.82, 0.83. The middle value is 0.81—far closer to reality than the mean of 0.742. This is why many machine learning papers report median performance across runs rather than mean: it gives a truer picture of what a typical experiment produces. The mode—the most frequently occurring value—is useful for categorical data. Suppose your fraud detection model labels transactions as “fraud” or “not fraud,” and across 10,000 predictions the mode is “not fraud” (appearing 9,700 times). That tells you something fundamental about the class distribution. Or consider a confusion matrix where the most common error type is what ML calls a false negative — a real case that the model missed. (We will explore false negatives and their counterpart, false positives, in detail in Week 5.) The mode immediately tells you where to focus your debugging effort. Knowing when to reach for mean, median, or mode is a foundational skill: mean works well for symmetric, outlier-free data; median shines when outliers or skewed distributions are present; and mode is indispensable when you are counting categories or finding the most common prediction.
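The same standard-library module covers median and mode. This sketch reuses the buggy scores from above, plus label counts mirroring the 9,700-out-of-10,000 fraud example:

```python
from statistics import mean, median, mode

# Scores including the faulty 0.45 run
scores = [0.81, 0.83, 0.82, 0.80, 0.45]
print(median(scores))           # 0.81: sorted order is 0.45, 0.80, 0.81, 0.82, 0.83
print(round(mean(scores), 3))   # 0.742: the outlier pulls the mean well below the median

# Mode on categorical predictions, mirroring the fraud example
predictions = ["not fraud"] * 9700 + ["fraud"] * 300
print(mode(predictions))        # not fraud
```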

Spread measures how much the values in a dataset vary from one another, and without it a summary statistic like the mean tells you almost nothing. Suppose two crop disease models both achieve a mean F1 score of 0.82. Model A’s scores across 10 runs are tightly clustered: 0.80, 0.81, 0.81, 0.82, 0.82, 0.82, 0.83, 0.83, 0.84, 0.84. Model B’s scores are all over the place: 0.55, 0.65, 0.72, 0.78, 0.82, 0.86, 0.90, 0.95, 0.98, 0.99. Both have the same mean, but you would never trust Model B in production—it might score 0.99 one day and 0.55 the next. The range (max minus min) is the simplest spread measure: Model A’s range is 0.84 − 0.80 = 0.04, while Model B’s is 0.99 − 0.55 = 0.44. But range is crude because it ignores everything between the extremes. Variance and standard deviation go deeper. Variance is the average of the squared differences from the mean. Here is what that means step by step. Say your five values are 78, 80, 81, 82, 85. The mean is 81.2. The differences from the mean are −3.2, −1.2, −0.2, 0.8, 3.8. Square each: 10.24, 1.44, 0.04, 0.64, 14.44. Average those: 5.36. That is the variance. The standard deviation is the square root of 5.36, which is about 2.3 — meaning the typical run deviates about 2.3 percentage points from the mean. (The square root is the reverse of squaring: since 5 × 5 = 25, the square root of 25 is 5; it converts the squared units back to the original scale.) For Model A, most squared differences are tiny (around 0.0001), so the variance is small and the standard deviation (its square root) might be around 0.013. For Model B, the squared differences are much larger, yielding a standard deviation of perhaps 0.13—ten times higher. A model with accuracy 85% ± 2% is dramatically more reliable than one with 85% ± 15%. Always report a measure of spread alongside your central tendency; a number without context is just a number, and spread is the context that tells you whether to trust it.
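The variance walkthrough above maps directly to code. This sketch reproduces each step for the five values 78, 80, 81, 82, 85, then checks the result against the standard library:

```python
from statistics import pstdev

# The five values from the worked example
values = [78, 80, 81, 82, 85]
n = len(values)

mu = sum(values) / n               # 81.2
diffs = [x - mu for x in values]   # -3.2, -1.2, -0.2, 0.8, 3.8
squared = [d ** 2 for d in diffs]  # 10.24, 1.44, 0.04, 0.64, 14.44
variance = sum(squared) / n        # 5.36 (this is the population variance)
std_dev = variance ** 0.5          # about 2.32

print(round(variance, 2), round(std_dev, 2))
print(round(pstdev(values), 2))    # statistics.pstdev computes the same thing in one call
```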

Guided Exercises

Exercise 1: Five trained models produce F1 scores of 0.72, 0.75, 0.73, 0.71, and 0.99. Step 1: Calculate the mean by summing all five and dividing by 5. Step 2: Sort the values and find the median. Step 3: Which value better represents typical performance? Why is the 0.99 score suspicious, and what might have caused it? Think: data leakage (when information from the test data accidentally sneaks into the training process, making the model look better than it really is), an evaluation bug, or a lucky random seed. If you were writing a report for a client, which measure would you present and why?
Exercise 2: Model A has test accuracies across 10 independent runs that all fall between 84% and 86%. Model B has accuracies ranging from 70% to 95%. Both models have the same mean accuracy of 85%. Which model would you deploy in production, and why? Estimate the spread of each model by finding the average distance of each value from the mean (a quick stand-in for the standard deviation). How does spread influence your confidence in a model’s real-world reliability? Consider a scenario where the model must work 24/7 for a bank—would you tolerate Model B’s variance?
Exercise 3: You have monthly income data for a city: most people earn ₦300,000–₦600,000, but a handful of individuals earn over ₦10,000,000. Sketch what this distribution looks like on paper. Should you use the mean or the median to describe “typical” income? Consider how this choice affects feature engineering: if you normalize this feature by subtracting the mean, what happens to the majority of the data? A log transformation applies a mathematical function called a logarithm to each value. What you need to know: the base-10 log of 10 is 1, the log of 100 is 2, the log of 1,000 is 3. It compresses large numbers dramatically while stretching out small ones — so a ₦500 million mansion and a ₦30 million apartment become much closer together. How might a log transformation help compress the tail and bring extreme values closer to the bulk of the distribution?

Discussion Prompt

A research paper reports “our model achieves 94% mean accuracy across five folds.” What additional information would you need to properly evaluate this claim? Think about spread, sample size, class balance, and what “accuracy” actually measures. Could a 94% mean accuracy still hide a poorly performing model? Draft a list of five follow-up questions you would ask the authors.

Key Takeaway

A single number never tells the whole story. Always ask about spread, always consider outliers, and choose the right summary statistic for your data. The difference between a good data scientist and a great one is knowing which number to trust and when.

Quick Check

A dataset has values: 3, 5, 7, 9, 100. The mean is 24.8. The median is 7. Which measure better represents the typical value?

  • Mean, because it uses all values
  • Median, because the outlier (100) distorts the mean
  • They are equally good

Variance measures:

  • How spread out the data is from the mean
  • The most common value
  • The average of all values

Standard deviation is:

  • The average of all values
  • The difference between max and min
  • The square root of the variance

Key Terms


What is the mean?

The average: add all values and divide by the count. Sensitive to outliers — one extreme value can drag it far from the typical value.

What is the median?

The middle value when data is sorted. More resistant to outliers than the mean. Often preferred when data is skewed.

What is variance?

The average of squared differences from the mean. Squaring makes all differences positive and punishes large deviations more. Units are squared (e.g., squared Naira).

What is standard deviation?

The square root of the variance. Converts units back to the original scale. A model with 85% accuracy ± 2% SD is far more reliable than 85% ± 15%.

Session 4

Seeing Distributions

Distributions and What Data Looks Like

90 Minutes
Objective Recognize and interpret common data distributions, and understand why distribution shape matters for machine learning.

Concept Lesson

A distribution shows how values are spread across a range. Imagine you are a data scientist at a PropTech company in Lagos, and you have just scraped listing prices for 10,000 properties across the city. If you plot a histogram of those prices, you will see that most properties cluster somewhere between ₦15 million and ₦50 million, with a thick band of listings in the ₦25–₦35 million range. But to the right, a long thin tail stretches out toward ₦500 million and beyond—those are the luxury mansions in Ikoyi and Victoria Island. The shape of that histogram is the distribution, and it encodes vital information about the underlying data-generating process. In machine learning, you will rarely work with data you can fully understand by staring at a spreadsheet of numbers—visualizing the distribution gives you an immediate, intuitive grasp of what your data actually looks like. Is it clumped in one place? Is it spread out evenly? Are there strange spikes at certain values (maybe round numbers like ₦30 million, because agents round their listings)? Are there gaps where no data exists at all? These questions are almost impossible to answer without seeing the shape of the data, and getting the shape wrong is one of the most common causes of silent model failure.
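You can get a feel for this shape without any plotting library. The sketch below simulates hypothetical listing prices (a log-normal sample is a stand-in for the right-skewed shape described above, not real Lagos data) and prints a crude text histogram:

```python
import random

random.seed(0)

# Hypothetical stand-in for listing prices in millions: a log-normal
# sample gives a right-skewed shape (a big bulk of moderate prices,
# a long thin tail of luxury listings). Not real market data.
prices = [random.lognormvariate(3.3, 0.6) for _ in range(10_000)]

# Crude text histogram: count prices in each band
bins = [0, 20, 40, 60, 80, 100, 1_000]
for lo, hi in zip(bins, bins[1:]):
    count = sum(lo <= p < hi for p in prices)
    print(f"{lo:>4}-{hi:<5} {'#' * (count // 250)} ({count})")
```

The bars pile up in the low bands and thin out toward the tail, which is exactly the asymmetry the histogram of real prices would reveal.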

The normal distribution, also called the bell curve or Gaussian distribution (after the mathematician Carl Friedrich Gauss), is symmetric: most values cluster near the center, and the frequency tapers off equally on both sides. Measurement errors, human heights, and the residuals of a well-fitted regression model tend to follow this pattern. Consider a model that predicts delivery times for a logistics company in Johannesburg: if the model is well-calibrated, the errors—the differences between predicted and actual delivery times—should form a normal distribution centered at zero. That means the model is not systematically over- or under-predicting; the errors cancel out. Many classical methods lean on normality: linear regression assumes normally distributed errors, and algorithms such as logistic regression and support vector machines tend to perform best when features are roughly symmetric and well scaled. When the normality assumption holds, these methods produce reliable results — confidence intervals (a range of values that likely contains the true answer), well-calibrated probabilities (when the model says 70%, the outcome actually happens about 70% of the time), and stable predictions. When it does not hold, the models can silently degrade: predictions become biased, error estimates become meaningless, and you may not even realize it until a customer complains in production. Knowing when a normal distribution is a good approximation—and when it is dangerously wrong—is a critical modeling skill that separates beginners from practitioners.
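A minimal sketch of the delivery-time idea, with the caveat that the residuals here are simulated draws rather than real model errors:

```python
import random
from statistics import mean, stdev

random.seed(1)

# Simulated residuals for a well-calibrated delivery-time model:
# draws from a normal distribution centered at zero, in minutes.
# (Illustrative only; real residuals are predictions minus actuals.)
errors = [random.gauss(0, 5) for _ in range(10_000)]

print(round(mean(errors), 2))    # close to 0: no systematic over- or under-prediction
print(round(stdev(errors), 2))   # close to 5: the typical size of an error

# Symmetry check: roughly as many positive as negative errors
print(sum(e > 0 for e in errors), sum(e < 0 for e in errors))
```

If the mean of your real residuals sits far from zero, or the positive and negative counts are badly lopsided, the model is systematically biased in one direction.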

Skewed distributions lean one way: a right-skewed distribution has a long tail pointing to the right (most values are low, a few are very high), while a left-skewed distribution has a long tail pointing to the left. Income data is the classic right-skewed example: most people earn modest amounts, but a few earn orders of magnitude more, pulling the tail out. Transaction amounts in mobile money platforms like M-Pesa follow the same pattern—millions of small transactions and a handful of massive transfers. In machine learning, skewed features can cause serious problems because a few extreme values dominate the learning process, especially in distance-based algorithms like K-Nearest Neighbors (a simple algorithm that classifies new data points by looking at the K closest data points in the training set and going with the majority label) and gradient-based algorithms like neural networks. This is why we frequently apply transformations—log transforms, power transforms, or simple normalization—to compress the tail and bring the data closer to a symmetric shape. For example, applying a log transform to the Lagos property prices would pull the ₦500 million mansions much closer to the ₦30 million bulk, making the feature more useful for a model. Understanding distribution shape is always the first step to effective feature engineering: before you can fix a problem, you have to see it, and distributions make the invisible visible.
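A small illustration of how a log transform compresses the tail, using hypothetical prices in millions that echo the property example above:

```python
import math

# Hypothetical listing prices in millions: a bulk in the twenties
# and thirties plus one 500-million luxury outlier
prices = [18, 25, 28, 30, 32, 35, 42, 500]
logged = [math.log10(p) for p in prices]

print(max(prices) - min(prices))             # raw spread: 482
print(round(max(logged) - min(logged), 2))   # log10 spread: 1.44
print([round(x, 2) for x in logged])         # the outlier no longer dominates
```

On the raw scale the outlier is hundreds of units away from everything else; on the log scale it sits barely over one unit from the bulk, so it no longer swamps distance or gradient calculations.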

Guided Exercises

Exercise 1: For each of the following scenarios, predict the shape of the distribution (symmetric, left-skewed, or right-skewed) and briefly explain your reasoning:

  • (a) Ages of university undergraduates in a single intake year.
  • (b) Monthly mobile data usage among smartphone users in Nigeria.
  • (c) Time taken to complete an online checkout form.
  • (d) Prediction errors from a well-calibrated regression model.

For each answer, consider what the “typical” case looks like and whether extreme values are more likely on one side or the other. Sketch a rough histogram for each on paper.

Exercise 2: You have a feature with the following ten values: 1, 2, 2, 3, 3, 3, 4, 4, 50, 200. Step 1: Draw a rough histogram on paper. Step 2: Calculate the mean (sum ÷ 10) and the median (middle value after sorting). Step 3: What would happen if you fed this feature directly into a distance-based algorithm such as K-Nearest Neighbors? Identify the outliers and suggest at least two transformations (for example, log transform, clipping, or winsorization (capping extreme values at a chosen threshold — for example, replacing anything above 50 with 50)) that could mitigate the problem. Which transformation would you choose, and why?
Exercise 3: Two features exist in your dataset: Feature A ranges from 0 to 1 (e.g., a normalized probability), and Feature B ranges from 0 to 100,000 (e.g., annual revenue in Naira). You are training a K-Means clustering model that calculates Euclidean distances (straight-line distance between two points — the kind you'd measure with a ruler on a map) between data points. What problem arises from the difference in scale? Describe the specific distortion that occurs—Feature B will dominate the distance calculation, making Feature A virtually invisible. Propose two solutions (for example, min-max normalization (rescales every value to fall between 0 and 1 by subtracting the minimum and dividing by the range), z-score standardization (converts each value to show how many standard deviations it sits above or below the mean)) and explain which you would prefer in practice and under what circumstances.
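The two rescaling techniques named in Exercise 3 can be sketched in a few lines. The revenue figures below are toy values for illustration, not a worked solution:

```python
from statistics import mean, pstdev

# Toy stand-in for Feature B: annual revenue on a 0-100,000 scale
revenue = [12_000, 35_000, 50_000, 80_000, 100_000]

# Min-max normalization: subtract the minimum, divide by the range
lo, hi = min(revenue), max(revenue)
min_max = [(x - lo) / (hi - lo) for x in revenue]
print([round(x, 3) for x in min_max])    # every value now falls between 0 and 1

# Z-score standardization: subtract the mean, divide by the standard deviation
mu, sigma = mean(revenue), pstdev(revenue)
z_scores = [(x - mu) / sigma for x in revenue]
print([round(x, 2) for x in z_scores])   # values measured in standard deviations from the mean
```

After either rescaling, Feature A and Feature B contribute comparable amounts to a Euclidean distance; which to prefer depends on whether you need a bounded range (min-max) or a transform less distorted by extreme values (z-scores).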

Discussion Prompt

Why do you think so many machine learning tutorials begin with “first, normalize your data”? Now that you understand distributions, can you explain what normalization actually does—at a mathematical level—and why it matters? Is normalization always necessary, or are there algorithms and scenarios where it is unnecessary or even harmful? Consider tree-based algorithms like Random Forest (an algorithm that builds many decision trees and averages their predictions): do they care about feature scale?

Key Takeaway

The shape of your data determines which algorithms will work, what preprocessing you need, and how to interpret your results. Always look at your distributions before modeling. The most common mistake in machine learning is not bad algorithms—it is feeding the algorithm data you never bothered to understand.

Quick Check

The normal distribution is:

  • Symmetric, with most values near the center
  • Always skewed to the right
  • Flat and uniform

Skewed data (like income or transaction amounts) causes problems for which type of algorithm?

  • Decision trees
  • Distance-based algorithms like K-Nearest Neighbors
  • All algorithms equally

Normalization rescales features to:

  • Remove all outliers
  • Make data normally distributed
  • Bring features to a comparable scale

Key Terms


What is a distribution?

A description of how values are spread across a range. A histogram shows the distribution visually — is data clumped, spread out, or skewed?

What is a skewed distribution?

A distribution that leans one way. Right-skewed data (like income) has a long tail pointing right — most values are low, a few are very high.

What is normalization?

Rescaling features to a common range (usually 0–1 or a standard distribution). Prevents features with large values from dominating distance calculations.