Session 5
Integration Day — Putting It All Together
Reading a Dataset
Concept Lesson
Imagine you are a data analyst at a Nigerian bank that processes ₦4.2 billion in mobile money transactions every day. Your team lead drops a CSV file on your desk and says, "Take a look at this before we build anything." You open the file and see 10,000 rows with four columns: amount in Naira, timestamp, merchant_category, and is_fraud coded as 0 or 1. No Python scripts, no Jupyter notebooks — just your pen, a calculator, and the reasoning skills you have built over the last two weeks. This is how every real ML project should start: with a human reading the data before a machine ever touches it.
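The lesson deliberately keeps this first look pen-and-paper, but the same questions translate directly to code. A minimal sketch using Python's csv module, with a tiny inline sample standing in for the real file (the two rows here are made up for illustration, not taken from any real dataset):

```python
import csv
import io

# Tiny inline sample with the same four columns the lesson describes.
# In practice you would open the real CSV file instead of this string.
sample = io.StringIO(
    "amount,timestamp,merchant_category,is_fraud\n"
    "4200,2024-01-03T09:15:00,groceries,0\n"
    "95000,2024-01-03T11:42:00,electronics,1\n"
)

rows = list(csv.DictReader(sample))

print(len(rows))                               # how many rows?
print(list(rows[0]))                           # which columns?
print(sum(int(r["is_fraud"]) for r in rows))   # how many fraud cases?
```

The point is the order of operations: count rows and columns and tally the label before any modeling, exactly as the analyst does by hand.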
Your first instinct should be to count. How many rows are there? The file says 10,000 — that is your dataset size, and it matters because small datasets produce unreliable patterns. Next, split the data by the is_fraud column. Suppose you find 9,400 legitimate transactions and 600 fraudulent ones. That is a 600 / 10,000 = 6% fraud rate, or roughly 1 fraud case for every 15.7 legitimate transactions (9,400 / 600 ≈ 15.7). This class imbalance is critical. If you ignored it and built a model that always predicted "not fraud," the model would appear to be 94% accurate — an impressive number that completely fails at its job. Your team lead would see a high accuracy score, approve deployment, and the bank would lose money every single day because the model never catches the fraud it was designed to detect. Understanding class balance before modeling is not optional; it is the difference between a useful system and an expensive paperweight.
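The class-balance arithmetic above fits in a few lines. The counts are the lesson's illustrative figures, not real data:

```python
# Class-balance check from raw counts, mirroring the pen-and-paper
# arithmetic: fraud rate, imbalance ratio, and the misleading accuracy
# of a model that never predicts fraud.
legit, fraud = 9_400, 600
total = legit + fraud

fraud_rate = fraud / total        # 0.06 -> 6% of all transactions
legit_per_fraud = legit / fraud   # ~15.7 legitimate rows per fraud case

# A model that always predicts "not fraud" scores this accuracy
# while catching zero fraud:
always_legit_accuracy = legit / total

print(f"fraud rate: {fraud_rate:.0%}")                    # 6%
print(f"legit per fraud: {legit_per_fraud:.1f}")          # 15.7
print(f"'always legit' accuracy: {always_legit_accuracy:.0%}")  # 94%
```

Note that the "accurate" baseline model and the fraud rate are two views of the same two numbers, which is why accuracy alone cannot be the evaluation metric here.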
Now look at the amount column. Calculate the overall mean and median transaction amount. Suppose the mean is ₦18,500 and the median is ₦4,200. That gap is a red flag: it tells you the distribution is right-skewed, pulled upward by a few very large transactions. Now split by fraud status. Perhaps legitimate transactions have a mean of ₦12,000 and a median of ₦3,800, while fraudulent transactions have a mean of ₦95,000 and a median of ₦72,000. The fraud amounts are dramatically higher, but the median is the more honest measure here because it is not distorted by a single ₦2,000,000 transfer. If you set a threshold for suspicious transactions at the fraudulent mean, you would flag only transactions above ₦95,000 and miss the cluster of fraud cases in the ₦50,000 to ₦80,000 range. A threshold at the fraud median of ₦72,000 is lower, and by the definition of a median it flags at least half of all fraud cases. Choosing mean versus median is not a technicality; it determines whether your alert system actually works or just looks good on a report.
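The mean-versus-median effect is easy to demonstrate on a toy sample. The amounts below are hypothetical, chosen only to show one large transfer dragging the mean far above the median, as described above:

```python
import statistics

# Hypothetical amounts in Naira: mostly small transfers plus one
# very large one, mimicking the right-skewed distribution in the lesson.
amounts = [2_000, 3_500, 4_200, 5_000, 6_300, 2_000_000]

mean = statistics.mean(amounts)      # dragged up by the ₦2,000,000 outlier
median = statistics.median(amounts)  # stays near the typical transaction

print(f"mean:   ₦{mean:,.0f}")
print(f"median: ₦{median:,.0f}")     # ₦4,600
```

One extreme value moves the mean by orders of magnitude while barely touching the median, which is why a median-based threshold is the more robust choice for skewed transaction data.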
Finally, examine time. Sort the timestamps into weekly buckets and count the fraud cases per week. Suppose Week 1 had 80 fraud cases, Week 2 had 95, Week 3 had 110, and Week 4 had 130. That is a consistent upward trend: roughly 15–20 additional fraud cases each week. If you calculate the rate of change from Week 1 to Week 4, fraud increased by (130 − 80) / 80 = 62.5% over four weeks. This tells you the problem is getting worse, not better, and any model you build must be retrained regularly or it will drift out of date within a month. (Drift means the patterns in the real world change over time — for example, fraud tactics evolve — so a model trained on last month's data starts making wrong predictions on this month's data.) A model trained on Week 1 data alone would miss the evolving fraud patterns that appear in later weeks. All of this analysis — ratios, percentages, central tendency, spread, and rate of change — comes directly from the first two weeks of this course. You are now applying those tools to make a real business decision.
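The trend calculation can be sketched directly from the weekly counts given above:

```python
# Weekly fraud counts from the lesson's example (Weeks 1-4).
weekly_fraud = [80, 95, 110, 130]

# Week-over-week increases in fraud cases.
deltas = [b - a for a, b in zip(weekly_fraud, weekly_fraud[1:])]

# Overall rate of change from Week 1 to Week 4: (130 - 80) / 80.
growth = (weekly_fraud[-1] - weekly_fraud[0]) / weekly_fraud[0]

print(deltas)           # [15, 15, 20]
print(f"{growth:.1%}")  # 62.5%
```

The steadily positive deltas, not just the overall 62.5% figure, are what signal that a model trained once and never retrained will fall behind.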
Guided Exercises
Discussion Prompt
Before this course, if someone handed you a CSV with 10,000 rows, you might have jumped straight into training a model. What would you do differently now? List at least five specific questions you would ask about the data before writing a single line of model code. How would the class imbalance you discovered today change your choice of model, your evaluation metric, and your communication with stakeholders?
Key Takeaway
Data analysis is reasoning, not just computation. The best ML practitioners spend more time understanding their data than tuning their models. A 94% accurate model that never catches fraud is worse than useless — it is dangerous because it creates false confidence.