Session 3
Summarizing Data
Mean, Median, Mode, and Spread
Concept Lesson
Imagine you are a data scientist at a Nairobi-based agritech company. Your team just trained five versions of a crop disease detection model and ran each on the same test set. The accuracy scores come back: 0.81, 0.83, 0.82, 0.80, 0.84. You calculate the mean by adding them up and dividing by five: (0.81 + 0.83 + 0.82 + 0.80 + 0.84) ÷ 5 = 4.10 ÷ 5 = 0.82, or 82%. Simple enough. But now suppose one of the five runs suffered a GPU memory error mid-training and produced a score of 0.45 instead of 0.84. The new mean is (0.81 + 0.83 + 0.82 + 0.80 + 0.45) ÷ 5 = 3.71 ÷ 5 = 0.742, or 74.2%. A single faulty run dragged the average down by nearly 8 percentage points. This is the mean’s critical weakness: it is sensitive to outliers. In machine learning you will constantly encounter datasets where a small number of extreme values—a sensor glitch, a data entry error, a rare edge case—distort the average. If you rely blindly on the mean to report model performance or summarize features, you will draw wrong conclusions that ripple into production decisions, from choosing the wrong model to setting incorrect thresholds.
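The arithmetic above is easy to verify in a few lines of Python. This is a minimal sketch using the lesson's illustrative scores:

```python
# Mean of five model runs: clean scores vs. one run corrupted by a GPU error.
clean_scores = [0.81, 0.83, 0.82, 0.80, 0.84]
buggy_scores = [0.81, 0.83, 0.82, 0.80, 0.45]  # 0.84 replaced by the faulty 0.45

def mean(values):
    """Add the values up and divide by how many there are."""
    return sum(values) / len(values)

print(round(mean(clean_scores), 3))  # 0.82
print(round(mean(buggy_scores), 3))  # 0.742
```

A single corrupted run moves the mean by nearly 8 percentage points, which is exactly the outlier sensitivity described above.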
The median solves the outlier problem by looking at the middle value. Sort the data from lowest to highest, and the median is whatever sits in the center. (With an even number of values, the median is the average of the two middle ones.) Using the buggy run from above, the sorted scores are 0.45, 0.80, 0.81, 0.82, 0.83. The middle value is 0.81—far closer to reality than the mean of 0.742. This is why many machine learning papers report median performance across runs rather than mean: it gives a truer picture of what a typical experiment produces. The mode—the most frequently occurring value—is useful for categorical data. Suppose your fraud detection model labels transactions as “fraud” or “not fraud,” and across 10,000 predictions the mode is “not fraud” (appearing 9,700 times). That tells you something fundamental about the class distribution. Or consider a confusion matrix where the most common error type is what ML calls a false negative — a real case that the model missed. (We will explore false negatives and their counterpart, false positives, in detail in Week 5.) The mode immediately tells you where to focus your debugging effort. Knowing when to reach for mean, median, or mode is a foundational skill: mean works well for symmetric, outlier-free data; median shines when outliers or skewed distributions are present; and mode is indispensable when you are counting categories or finding the most common prediction.
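Python's standard `statistics` module implements both measures. A short sketch, using the buggy scores from above and a hypothetical prediction list built to match the fraud counts in the lesson:

```python
from statistics import median, mode

# The five runs, including the faulty 0.45.
scores = [0.81, 0.83, 0.82, 0.80, 0.45]
print(median(scores))  # sorts internally; the middle value is 0.81

# Hypothetical predictions matching the lesson's counts:
# 9,700 "not fraud" labels and 300 "fraud" labels.
predictions = ["not fraud"] * 9700 + ["fraud"] * 300
print(mode(predictions))  # not fraud
```

Note how the median shrugs off the 0.45 outlier that dragged the mean down, and how `mode` works directly on category labels, where a mean would be meaningless.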
Spread measures how much the values in a dataset vary from one another, and without it a summary statistic like the mean tells you almost nothing. Suppose two crop disease models both achieve a mean F1 score of 0.82. Model A’s scores across 10 runs are tightly clustered: 0.80, 0.81, 0.81, 0.82, 0.82, 0.82, 0.83, 0.83, 0.84, 0.84. Model B’s scores are all over the place: 0.55, 0.65, 0.72, 0.78, 0.82, 0.86, 0.90, 0.95, 0.98, 0.99. Both have the same mean, but you would never trust Model B in production—it might score 0.99 one day and 0.55 the next. The range (max minus min) is the simplest spread measure: Model A’s range is 0.84 − 0.80 = 0.04, while Model B’s is 0.99 − 0.55 = 0.44. But range is crude because it ignores everything between the extremes. Variance and standard deviation go deeper. Variance is the average of the squared differences from the mean. Here is what that means step by step. Say your five values are 78, 80, 81, 82, 85. The mean is 81.2. The differences from the mean are −3.2, −1.2, −0.2, 0.8, 3.8. Square each: 10.24, 1.44, 0.04, 0.64, 14.44. Average those: 5.36. That is the variance. The standard deviation is the square root of 5.36, which is about 2.3 — meaning the typical run deviates about 2.3 percentage points from the mean. (The square root is the reverse of squaring: since 5 × 5 = 25, the square root of 25 is 5; it converts the squared units back to the original scale.) For Model A, most squared differences are tiny (around 0.0001), so the variance is small and the standard deviation (its square root) might be around 0.013. For Model B, the squared differences are much larger, yielding a standard deviation of perhaps 0.13—ten times higher. A model with accuracy 85% ± 2% is dramatically more reliable than one with 85% ± 15%. Always report a measure of spread alongside your central tendency; a number without context is just a number, and spread is the context that tells you whether to trust it.
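The step-by-step variance walkthrough translates directly into code. This sketch divides by n (the population variance, matching the lesson's arithmetic); be aware that many libraries default to dividing by n − 1 (the sample variance), which gives slightly different numbers:

```python
from math import sqrt

values = [78, 80, 81, 82, 85]

avg = sum(values) / len(values)          # mean: 81.2
diffs = [v - avg for v in values]        # deviations from the mean
squared = [d ** 2 for d in diffs]        # square each deviation
variance = sum(squared) / len(values)    # average of the squares: 5.36
std_dev = sqrt(variance)                 # back to original units: about 2.32

print(round(variance, 2), round(std_dev, 2))  # 5.36 2.32
```

The standard library's `statistics.pstdev(values)` computes the same population standard deviation in one call, and `statistics.stdev(values)` is its n − 1 counterpart.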
Guided Exercises
Discussion Prompt
A research paper reports “our model achieves 94% mean accuracy across five folds.” What additional information would you need to properly evaluate this claim? Think about spread, sample size, class balance, and what “accuracy” actually measures. Could a 94% mean accuracy still hide a poorly performing model? Draft a list of five follow-up questions you would ask the authors.
Key Takeaway
A single number never tells the whole story. Always ask about spread, always consider outliers, and choose the right summary statistic for your data. The difference between a good data scientist and a great one is knowing which number to trust and when.