Predictive Modeling · Competition

I2DB Datathon:
Diabetes Risk Prediction

Will this patient's A1c become uncontrolled in the next year? A machine learning model built on 62,425 patients' EHR data to predict diabetes decompensation before it happens.

I2DB Symposium 2026 · WashU Datathon

0.852 · AUC-ROC
62,425 · Patients
97% · NPV (rule-out power)
20 · Engineered features
THE CHALLENGE

Predict who loses control before it happens

10.6% of diabetes patients in this EHR dataset will have uncontrolled A1c next year. The challenge: using 12 months of prior clinical data (labs, medications, utilization, demographics), build a model that identifies these patients in advance so care teams can intervene. The data is messy, imbalanced (1:8 ratio), and riddled with missing values and potential leakage.

Why it's hard
Real EHR, real mess
76% missing BMI. 97% missing later A1c readings. A leakage column that perfectly predicts the outcome. 67% missing insurance. This is what real clinical data looks like.
Why it matters
Proactive, not reactive
If you can identify high-risk patients 12 months early, you can escalate medication, schedule closer follow-up, and connect them with diabetes educators before A1c spirals.
DATA DEEP DIVE

What the data told me before I built anything

I spent significant time understanding the data before touching a model. Five exploration steps revealed critical patterns that shaped every downstream decision.

A1c
Current A1c is the dominant signal. Risk rises from 1% (A1c <5.7) to 32% (A1c >9). But 65% of future-uncontrolled patients have A1c <9 now. A1c alone isn't enough.
Meds
Patients on 4+ drug classes have a 20-37% uncontrolled rate vs 8.4% for those on no meds. But this is confounding: sicker patients get more meds AND are harder to control.
Missing
Missingness is informative: uncontrolled patients have more A1c tests on record (less missing). One column was pure leakage, with 100% of its missing values mapping to controlled patients. Dropped it immediately (see the audit sketch below).
Risk by A1c level
Uncontrolled rate by current A1c level. Risk plateaus around A1c 9-10.
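As a concrete illustration, here is a minimal missingness audit of the kind that surfaces a leakage column like the one above. The DataFrame layout and the target column name (`uncontrolled_next_year`) are assumptions for the sketch, not the dataset's real schema:

```python
import pandas as pd

def missingness_audit(df: pd.DataFrame, target: str = "uncontrolled_next_year") -> pd.DataFrame:
    """For each column, compare the outcome rate among rows where the value
    is missing vs present. A rate of exactly 0% or 100% in either group is
    a leakage red flag."""
    rows = []
    for col in df.columns.drop(target):
        miss = df[col].isna()
        rows.append({
            "column": col,
            "pct_missing": miss.mean(),
            "rate_when_missing": df.loc[miss, target].mean(),
            "rate_when_present": df.loc[~miss, target].mean(),
        })
    return pd.DataFrame(rows).sort_values("pct_missing", ascending=False)
```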
FEATURE ENGINEERING

20 features built from clinical reasoning

Raw A1c values carry most of the signal, but the engineered features capture clinical nuance that raw columns miss. Three of these became top predictors.

Clinical insight
treatment_resistant
High A1c despite 2+ medication classes. Captures patients who are pharmacologically managed but still failing. Correlation with target: +0.20, one of the strongest non-A1c features.
Clinical insight
a1c_variability
Standard deviation across serial A1c readings. Unstable glycemic control predicts future decompensation even when the mean looks acceptable. Correlation: +0.21.
The confounding insight that changed the model
In Step 3, I noticed more medications correlated with worse outcomes. Naive interpretation: meds are harmful. Clinical interpretation: sicker patients get more meds. This led to engineering "treatment_resistant" and "undertreated" features that disentangle disease severity from treatment intensity. That's the difference between a pure data scientist building this model and a clinician building it.
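A minimal pandas sketch of how features like these can be derived. All column names and thresholds below are illustrative assumptions; the project's actual schema and cutoffs differ:

```python
import pandas as pd

def engineer_severity_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Pharmacologically managed but still failing: high A1c despite
    # being on 2+ medication classes (threshold is an assumption).
    out["treatment_resistant"] = (
        (out["a1c_mean"] >= 8.0) & (out["n_med_classes"] >= 2)
    ).astype(int)
    # The flip side: elevated A1c with no diabetes medication on record.
    out["undertreated"] = (
        (out["a1c_mean"] >= 8.0) & (out["n_med_classes"] == 0)
    ).astype(int)
    # Instability across serial readings predicts decompensation even
    # when the mean looks acceptable.
    a1c_readings = ["a1c_q1", "a1c_q2", "a1c_q3", "a1c_q4"]
    out["a1c_variability"] = out[a1c_readings].std(axis=1)
    return out
```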
MODEL BUILDING

From logistic regression to tuned XGBoost

I built progressively: baseline first, then advanced models, then hyperparameter tuning. Each step justified by clear improvement.

Model · AUC-ROC
Logistic Regression · 0.828
Random Forest · 0.847
XGBoost (default) · 0.827
XGBoost (tuned) · 0.852

Tuning: RandomizedSearchCV with 80 parameter combinations across 5-fold CV (400 fits). Key insight: smaller learning rate (0.01 vs 0.1) + more trees (800) + regularization (gamma=0.3, subsample=0.7) produced the best generalization.
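A sketch of that tuning setup. The winning values above come from the project; the rest of the search space, and the `X_train`/`y_train` names for the stratified 80% split, are assumptions for illustration:

```python
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

# Search space is assumed beyond the named winners
# (learning_rate=0.01, n_estimators=800, gamma=0.3, subsample=0.7).
param_dist = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 400, 800],
    "max_depth": [3, 4, 5, 6],
    "gamma": [0.0, 0.1, 0.3],
    "subsample": [0.7, 0.85, 1.0],
    "colsample_bytree": [0.7, 0.85, 1.0],
}

search = RandomizedSearchCV(
    xgb.XGBClassifier(eval_metric="auc", scale_pos_weight=8),  # ~1:8 imbalance
    param_distributions=param_dist,
    n_iter=80,            # 80 combinations x 5 folds = 400 fits
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```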

ROC curve
ROC curve for final tuned XGBoost model (AUC = 0.852)
FINAL RESULTS

97% NPV: the model's superpower is ruling out risk

The model catches 81% of patients who will become uncontrolled (sensitivity). But its strongest metric is NPV: if the model says a patient is low risk, there's a 97% chance they truly stay controlled. In a screening context, that's the metric that matters most.

Metric · Value · What it means
AUC-ROC · 0.852 · Strong discrimination between controlled and uncontrolled
Sensitivity · 80.6% · Catches 4 out of 5 patients who will lose control
NPV · 97.0% · If cleared by the model, 97% truly stay controlled
Specificity · 74.2% · 1 in 4 controlled patients gets flagged (acceptable for screening)
PPV · 27.0% · 1 in 4 flagged patients is truly uncontrolled
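For reference, every metric in the table falls out of the 0.5-threshold confusion matrix. A short sketch, with placeholder arrays (`y_true`, `y_score`) standing in for the held-out labels and predicted probabilities:

```python
from sklearn.metrics import confusion_matrix

# Threshold 0.5 matches the table above.
tn, fp, fn, tp = confusion_matrix(y_true, y_score >= 0.5).ravel()
sensitivity = tp / (tp + fn)  # 0.806: catches 4 of 5 future-uncontrolled
specificity = tn / (tn + fp)  # 0.742
ppv = tp / (tp + fp)          # 0.270
npv = tn / (tn + fn)          # 0.970
```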
Confusion matrix
Confusion matrix at threshold 0.5
Performance metrics
Comprehensive performance metrics
EXPLAINABILITY

What's driving the predictions?

SHAP values reveal that mean A1c is by far the strongest driver, followed by max A1c and the most recent reading. But clinically meaningful engineered features like treatment resistance and A1c variability earn their place in the top 10.
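A minimal sketch of how these attributions are computed with SHAP's tree explainer, assuming the tuned model and test matrix from the tuning sketch above:

```python
import shap

# Explain the tuned XGBoost model on the held-out feature matrix.
explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_test)

# Global importance: mean |SHAP value| per feature, as a bar chart.
shap.summary_plot(shap_values, X_test, plot_type="bar")
```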

SHAP values
SHAP feature importance: what drives the model's predictions
Clinically interpretable, not a black box
Every top feature maps to clinical reasoning a physician would recognize: higher A1c = higher risk, worsening trajectory = higher risk, more meds without improvement = treatment resistance. The model learned what clinicians already know, then quantified it at scale.
EQUITY

No fairness red flags across subgroups

The model performs consistently (AUC 0.83 to 0.88) across all demographic subgroups. It actually catches more uncontrolled patients among Black (85% recall) and Asian (87%) patients than among White patients (79%). Medicaid patients have the highest recall at 91-93%.
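A sketch of the subgroup evaluation loop, assuming a test-set DataFrame with illustrative column names for labels, scores, and demographics (and both classes present in each subgroup):

```python
from sklearn.metrics import roc_auc_score, recall_score

# `test` holds held-out labels, model scores, and demographic columns.
for col in ["race", "gender", "age_group", "insurance"]:
    for group, sub in test.groupby(col):
        auc = roc_auc_score(sub["y_true"], sub["y_score"])
        rec = recall_score(sub["y_true"], sub["y_score"] >= 0.5)
        print(f"{col}={group}: AUC={auc:.2f}, recall={rec:.2f}")
```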

Fairness analysis
Model performance across race, gender, age, and insurance subgroups
CLINICAL APPLICATION

Adjustable threshold for different use cases

The model's operating point can be tuned depending on the clinical context. Lower thresholds catch more patients (better for population-level screening), higher thresholds reduce false alarms (better for resource-intensive interventions).

Threshold · Sensitivity · Patients flagged · Use case
0.15 · 96.4% · 7,697 · Population screening, automated nudges
0.30 · 90.6% · 5,552 · Proactive outreach, care coordination
0.50 · 80.6% · 3,953 · Clinical review, medication escalation
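The table comes from sweeping the operating point over the held-out scores. A minimal sketch with the same placeholder arrays as above:

```python
from sklearn.metrics import recall_score

# Lower thresholds flag more patients (higher sensitivity);
# higher thresholds trade coverage for fewer false alarms.
for t in (0.15, 0.30, 0.50):
    flagged = y_score >= t
    sens = recall_score(y_true, flagged)
    print(f"threshold {t:.2f}: sensitivity {sens:.1%}, flagged {int(flagged.sum()):,}")
```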
WHAT I LEARNED

A clinician building ML from scratch

I came into this with a clinical and public health background, not CS. Every step was learned and earned. Three lessons stand out.

01
Domain knowledge is the real feature engineering. The confounding insight about medications and treatment resistance came from clinical training, not a Kaggle tutorial. That single insight produced a top-10 feature.
02
Explore before you model. Five exploration steps before writing a single line of modeling code. The leakage detection alone saved the entire project from producing a meaningless submission.
03
The right metric depends on the use case. Accuracy (89.4%) is useless when you're trying to catch the 10.6% who fail. NPV (97%) is what makes this model deployable as a screening tool.
METHODS

16-step reproducible pipeline

From raw CSVs to tuned submission in 16 documented steps. Every decision recorded in a living README. Data exploration (5 steps), cleaning, feature engineering (20 features from 41 raw columns), stratified 80/20 split, progressive modeling (logistic regression, random forest, XGBoost), RandomizedSearchCV tuning (400 fits), SHAP explainability, subgroup fairness analysis, threshold optimization, and presentation-ready figures.

Python · XGBoost · scikit-learn · SHAP · pandas · Feature Engineering · EHR Data · Class Imbalance · Hyperparameter Tuning · Fairness Analysis
Tech Stack

Built with

XGBoost over 20 engineered features from EHR data. SHAP for feature attribution, fairness metrics stratified by race, sex, age, and insurance. Full model card shipped alongside.

Python · XGBoost · scikit-learn · SHAP · pandas · matplotlib · Jupyter

Interested in predictive modeling, clinical ML, or health data science?

Get in touch