
Week 8 of 8

Capstone

Bring everything together. Students demonstrate mastery of quantitative reasoning, probabilistic thinking, technical reading, and clear communication in a single integrated exercise.

Before this week
Technical communication
Technical writing

Session 15

Capstone — Read, Reason, Communicate

Critical Paper Review

90 Minutes
Objective: apply all eight weeks of learning by reading a real ML paper, reasoning about its claims quantitatively and logically, and communicating your assessment clearly.

Concept Lesson

It is the final session. You receive a 4-page machine learning paper titled "Gradient-Boosted Credit Scoring for Sub-Saharan African Markets" — a study that trained an XGBoost model (XGBoost is a popular machine learning algorithm that builds a series of decision trees, where each new tree tries to correct the mistakes of the ones before it — think of it as a team of advisors where each new advisor focuses on the cases the previous advisors got wrong) on 45,000 loan applications from three microfinance institutions in Nigeria, Kenya, and Ghana, claiming an AUC of 0.91 (AUC stands for Area Under the Curve — specifically, the ROC curve, which plots the true positive rate against the false positive rate at every possible threshold. What you need to know: AUC ranges from 0.5 (no better than a coin flip) to 1.0 (perfect). An AUC of 0.91 means the model is quite good at distinguishing between the two classes — in 91% of randomly chosen pairs (one positive, one negative), the model gives the positive case a higher score. But AUC can be misleading with imbalanced data, so always check it alongside precision and recall) and a 23% reduction in default rates compared to the institutions' existing rule-based systems. On the surface, the numbers look impressive. But you have spent seven weeks learning that impressive numbers are exactly when you should slow down and ask harder questions. Your task: read the paper carefully, identify what the authors actually proved versus what they merely claimed, and present your honest assessment to a partner in three minutes flat. This is not an academic exercise. This is what happens every time a vendor pitches an AI product to your organization, every time a colleague shares a paper in Slack, every time a startup claims their model outperforms the state of the art.
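The pairwise interpretation of AUC described above can be computed directly. A minimal sketch with toy default-probability scores (invented for illustration, not data from the paper): the model should assign defaulters higher scores than repayers, and AUC is the fraction of defaulter/repayer pairs it ranks correctly.

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs where the
    positive case gets the higher score; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Toy scores: predicted default probability for 7 loans.
defaulted = [0.9, 0.8, 0.6]          # positives: loans that defaulted
repaid    = [0.7, 0.4, 0.3, 0.2]     # negatives: loans that were repaid

# 11 of the 12 pairs are ranked correctly, so AUC = 11/12 ≈ 0.917.
print(pairwise_auc(defaulted, repaid))
```

Note that nothing in this number depends on a classification threshold, which is exactly why AUC alone says nothing about precision or recall at the operating point you would actually deploy.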

The capstone exercise integrates every skill from the bootcamp into a single, realistic workflow. Start with quantitative reasoning: check the numbers. The paper claims AUC 0.91, but on what dataset? If they evaluated on the same distribution they trained on, the number is inflated. What was their train-test split? (Before training a model, you split your data into two parts: a training set (usually 80%) that the model learns from, and a test set (usually 20%) that you hold back and use only to evaluate performance. This is like studying 80% of the material and saving 20% for the final exam. If you train and test on the same data, the model can memorize the answers and look great while actually being useless.) Did they use cross-validation (cross-validation is a more thorough version of the train-test split. Instead of one split, you split the data into, say, 5 equal parts. You train on 4 parts and test on the 1 remaining part, then rotate which part is the test set. You end up with 5 test results, which you average. This gives a more reliable estimate of performance because every piece of data gets to be in the test set once) or a single holdout? If it was a single 80/20 split, how confident can we be that the 0.91 is not a lucky draw on one particular fold? Then move to probabilistic thinking: the paper claims a 23% reduction in default rates. But from what baseline? If the rule-based system had a 40% default rate, a 23% reduction means 30.8% — an improvement, certainly, but not the revolution the headline suggests. And what is the confidence interval on that 23%? (A confidence interval is a range of values that likely contains the true answer. If a paper reports '23% improvement (95% CI: 10%–35%),' it means: we are 95% confident the true improvement is somewhere between 10% and 35%. The range exists because results from a sample are always somewhat uncertain — if you repeated the experiment with different data, you would get a slightly different number. 
Wider intervals mean more uncertainty, which usually means the sample was small or the data was noisy.) If the test set had only 2,000 loans, the true reduction could easily be anywhere from 10% to 35%. A 23% reduction that might actually be 10% is a very different business case. Finally, apply your communication skills: can you explain this in 250 words to someone who has never read an ML paper? Can you do it without hedging so much that your review says nothing, and without being so aggressive that you dismiss legitimate work?
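The baseline arithmetic above is worth running once yourself. A minimal sketch using the lesson's hypothetical numbers (a 40% baseline default rate, a 2,000-loan test set); the normal-approximation interval here captures only sampling noise in the new system's measured rate, so the true uncertainty is wider still, once you account for uncertainty in the baseline and in how the data was collected.

```python
import math

# Illustrative numbers from the lesson, not from the paper itself.
baseline_rate      = 0.40    # hypothetical default rate of the rule-based system
relative_reduction = 0.23    # the paper's headline claim

new_rate = baseline_rate * (1 - relative_reduction)
print(f"New default rate: {new_rate:.1%}")   # 30.8% -- an improvement, not a revolution

# Rough 95% CI on the measured rate via the normal approximation
# to the binomial, assuming a test set of n loans.
n = 2000
se = math.sqrt(new_rate * (1 - new_rate) / n)
rate_low, rate_high = new_rate - 1.96 * se, new_rate + 1.96 * se

# Translate the rate interval back into a relative-reduction interval.
red_low  = 1 - rate_high / baseline_rate
red_high = 1 - rate_low / baseline_rate
print(f"Reduction consistent with the data: {red_low:.0%} to {red_high:.0%}")
```

Even this narrow, best-case interval spans roughly ten percentage points of relative reduction, which is the point: a single headline number without an interval hides how different the plausible business cases are.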

The structure of your critical review should mirror the analytical process. First, extract the paper's main claim: what are the authors saying they achieved? State it in one sentence, in your own words, without copying their abstract. Second, examine the evidence: what data did they use, how much, from where, and how did they split it? Third, scrutinize the metrics: are they reporting the right ones for the problem? For credit scoring, accuracy is misleading if the dataset is imbalanced — did they report precision, recall, or just AUC? Fourth, identify weaknesses: is the dataset representative of the market they claim to serve? Did they test on data from a different time period, or only on a random split from the same period? Did they compare against a strong enough baseline (a baseline is the simplest possible model or existing system you compare your new model against. For fraud detection, a baseline might be a set of hand-written rules. For classification, it might be a model that always predicts the most common class. If your fancy model cannot beat the baseline, it is not worth deploying), or just against manual rule-based systems that were never designed to be competitive? Each of these questions draws on a different week of the bootcamp: Weeks 1-2 for the quantitative checks, Weeks 3-4 for the probabilistic reasoning about uncertainty and error, Weeks 5-6 for reading the numbers critically, and Week 7 for communicating your assessment clearly.
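The point about accuracy on imbalanced data can be made in a few lines. A sketch with illustrative counts (not from the paper): a trivial majority-class "model" that never predicts default looks 95% accurate while catching zero defaults, which is why a paper reporting accuracy alone for credit scoring deserves suspicion.

```python
# Illustrative imbalanced dataset: 5% of 2,000 loans default (1 = default).
labels = [1] * 100 + [0] * 1900
preds  = [0] * 2000              # majority-class baseline: always "no default"

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
recall   = sum(p == 1 and y == 1 for p, y in zip(preds, labels)) / labels.count(1)

# 95% accuracy, 0% recall: impressive-looking and operationally useless.
print(f"accuracy = {accuracy:.0%}, recall = {recall:.0%}")
```

This is also the cheapest baseline a paper should have to beat; if the reported model's advantage over it is not quantified, the comparison against "manual rule-based systems" tells you very little.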

The presentation format — 3 minutes per person, followed by on-the-spot questions — is deliberately designed to simulate real pressure. In practice, you will almost never have unlimited time to present your analysis. A product manager interrupts you in a standup. An executive asks you to "just give me the bottom line" during a hallway conversation. A client's CTO challenges your model's accuracy during a demo. The ability to distill a 4-page paper into a clear, evidence-backed, 3-minute assessment — and then defend it when questioned — is one of the most valuable professional skills you can develop. Do not aim for perfection in your review. Aim for clarity, honesty, and evidence. If the paper's methodology is sound but its claims are overreaching, say so. If the results are genuinely impressive but the dataset is too small to generalize, say that. A nuanced, honest assessment is always more credible than a blanket endorsement or a blanket dismissal.

Guided Exercises

  1. Exercise 1 — Individual (15 min): Read the assigned paper carefully from start to finish — do not skim. As you read, extract and write brief notes on four things: (a) the main claim the authors make — write it in one sentence in your own words, (b) the evidence they present to support it — what data, how much, from where, (c) the evaluation metrics they report — list each metric and its value, and (d) at least two potential weaknesses you can identify in their methodology, data, or conclusions. Use the critical reading strategies from Week 5: check for missing baselines, unreported confidence intervals, and train-test leakage (when information from the test data accidentally leaks into the training process — for example, by normalizing the data using statistics from both sets, or by including duplicate records in both train and test. This makes the model look better than it really is because it has already seen hints about the test data). This extraction is the foundation for everything that follows — take your time and be thorough.
  2. Exercise 2 — Individual (15 min): Write a 250-word critical review. Structure it in three paragraphs: the first paragraph summarizes the paper's main claim and methodology in your own words (no jargon, no copying from the abstract); the second paragraph evaluates its strengths — what did the authors do well, and what do the numbers actually support; the third paragraph identifies limitations or unanswered questions — what should the reader be skeptical about, and what would you want to see before trusting the results. Ground every statement in a specific number, method, or quote from the paper. Avoid vague phrases like "the methodology seems reasonable" — instead, say exactly what is reasonable and what is not.
  3. Exercise 3 — Pairs, then Group (30 min): Find a partner. Present your critical review in exactly 3 minutes — set a timer. Your partner's job is to listen actively and then ask one challenging question you must answer on the spot. Examples: "You said the dataset is too small — how small is too small, and how did you decide that?" or "You praised the metric choice, but did you check whether the metric aligns with the business objective?" Then swap roles. After both presentations, the full group reconvenes to discuss: did the paper's claims hold up to collective scrutiny? Where did different reviewers agree, and where did they disagree? What does that disagreement tell you about the subjectivity inherent in evaluating ML research?
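Exercise 1 asks you to check for train-test leakage via normalization; a minimal sketch of that exact failure mode, with made-up income values, shows why fitting the scaler on all the data flatters the model.

```python
import statistics

# Made-up applicant incomes; the last record is held out as the "test set".
incomes = [30_000.0, 42_000.0, 55_000.0, 61_000.0, 250_000.0]
train, test = incomes[:4], incomes[4:]

def scale(xs, mean, std):
    """Standardize values using a given mean and standard deviation."""
    return [(x - mean) / std for x in xs]

# WRONG: statistics computed over train AND test leak test information.
mu_leaky, sd_leaky = statistics.mean(incomes), statistics.pstdev(incomes)

# RIGHT: fit the scaler on the training set only, then apply it everywhere.
mu_train, sd_train = statistics.mean(train), statistics.pstdev(train)

print(scale(test, mu_leaky, sd_leaky))   # outlier looks tame: its own value shaped the scaler
print(scale(test, mu_train, sd_train))   # outlier stays extreme, as it would in deployment
```

The leaky version shrinks the test outlier toward the pack because the outlier itself inflated the mean and standard deviation; a model evaluated on such preprocessed data has effectively been shown hints about the test set.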

Discussion Prompt

You have spent eight weeks building skills in quantitative reasoning, probabilistic thinking, technical reading, and clear communication. Reflect honestly: how has your relationship with numbers and technical writing changed since Week 1? What is one specific habit or mental framework you will carry forward into your ML work — whether it is questioning metrics more critically, reading papers more carefully, writing experiment reports more precisely, or explaining results more clearly to non-technical stakeholders?

Key Takeaway

Quantitative and verbal reasoning are not separate from ML work — they ARE ML work. Building a model is one part of the job. Understanding whether a model's claims are justified, evaluating its real-world performance, and communicating your findings to people who need to make decisions based on them — that is the rest. These are the skills that make you not just a technician, but a practitioner who can be trusted with consequential decisions.

Quick Check

When reading an ML paper that claims "23% reduction in default rates," the first thing to check is:

  • The baseline — 23% reduction from what starting point?
  • The number of authors on the paper
  • Whether the paper uses deep learning

When evaluating a paper's methodology, you should check:

  • Only whether the results look impressive
  • Data split, evaluation metrics, confidence intervals, and whether baselines are strong enough
  • The writing style and grammar

A critical review should be:

  • Entirely negative — the goal is to tear the paper apart
  • Entirely positive — the authors worked hard on it
  • Balanced — identify strengths and limitations with specific evidence from the paper

Key Terms


What is a baseline?

The simplest model or existing system you compare against. If your fancy model can't beat the baseline, it's not worth deploying. Always check: is the baseline strong enough?

What is a confidence interval?

A range of values that likely contains the true answer. "23% improvement (95% CI: 10%–35%)" means the true improvement is probably between 10% and 35%.

What makes a good critical review?

Extracting the main claim, examining the evidence, scrutinizing metrics, identifying weaknesses, and communicating your assessment clearly and honestly — with specific numbers, not vague opinions.