Predictive Modeling · Competition

I2DB Datathon:
Diabetes Risk Prediction

Will this patient's A1c become uncontrolled in the next year? A machine learning model built on 62,425 patients' EHR data to predict diabetes decompensation before it happens.

I2DB Symposium 2026 · WashU Datathon

0.852 · AUC-ROC
62,425 · Patients
97% · NPV (rule-out power)
20 · Engineered features
THE CHALLENGE

Predict who loses control before it happens

10.6% of diabetes patients in this EHR dataset will have uncontrolled A1c next year. The challenge: using 12 months of prior clinical data (labs, medications, utilization, demographics), build a model that identifies these patients in advance so care teams can intervene. The data is messy, imbalanced (1:8 ratio), and riddled with missing values and potential leakage.

Why it's hard
Real EHR, real mess
76% missing BMI. 97% missing later A1c readings. A leakage column that perfectly predicts the outcome. 67% missing insurance. This is what real clinical data looks like.
Why it matters
Proactive, not reactive
If you can identify high-risk patients 12 months early, you can escalate medication, schedule closer follow-up, and connect them with diabetes educators before A1c spirals.
DATA DEEP DIVE

What the data told me before I built anything

I spent significant time understanding the data before touching a model. Five exploration steps revealed critical patterns that shaped every downstream decision.

A1c
Current A1c is the dominant signal. Risk rises from 1% (A1c <5.7) to 32% (A1c >9). But 65% of future-uncontrolled patients have A1c <9 now. A1c alone isn't enough.
Meds
Patients on 4+ drug classes have a 20-37% uncontrolled rate vs 8.4% for those on no meds. But this is confounding: sicker patients get more meds AND are harder to control.
Missing
Missingness is informative: uncontrolled patients have more A1c tests on record (less missing). One column was pure leakage, with 100% of its missing values mapping to controlled patients. Dropped it immediately (see the audit sketch below).
Risk by A1c level
Uncontrolled rate by current A1c level. Risk plateaus around A1c 9-10.
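As a concrete illustration, here is a minimal missingness audit of the kind that surfaces a leakage column like the one above. The DataFrame layout and the target column name (`uncontrolled_next_year`) are assumptions for the sketch, not the dataset's real schema:

```python
import pandas as pd

def missingness_audit(df: pd.DataFrame, target: str = "uncontrolled_next_year") -> pd.DataFrame:
    """For each column, compare the outcome rate among rows where the value
    is missing vs present. A rate of exactly 0% or 100% in either group is
    a leakage red flag."""
    rows = []
    for col in df.columns.drop(target):
        miss = df[col].isna()
        rows.append({
            "column": col,
            "pct_missing": miss.mean(),
            "rate_when_missing": df.loc[miss, target].mean(),
            "rate_when_present": df.loc[~miss, target].mean(),
        })
    return pd.DataFrame(rows).sort_values("pct_missing", ascending=False)
```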
FEATURE ENGINEERING

20 features built from clinical reasoning

Raw A1c values carry most of the signal, but the engineered features capture clinical nuance that raw columns miss. Three of these became top predictors.

Clinical insight
treatment_resistant
High A1c despite 2+ medication classes. Captures patients who are pharmacologically managed but still failing. Correlation with target: +0.20, one of the strongest non-A1c features.
Clinical insight
a1c_variability
Standard deviation across serial A1c readings. Unstable glycemic control predicts future decompensation even when the mean looks acceptable. Correlation: +0.21.
The confounding insight that changed the model
In Step 3, I noticed more medications correlated with worse outcomes. Naive interpretation: meds are harmful. Clinical interpretation: sicker patients get more meds. This led to engineering "treatment_resistant" and "undertreated" features that disentangle disease severity from treatment intensity. That's the difference between a pure data scientist building this model and a clinician building it.
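A minimal pandas sketch of how features like these can be derived. All column names and thresholds below are illustrative assumptions; the project's actual schema and cutoffs differ:

```python
import pandas as pd

def engineer_severity_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Pharmacologically managed but still failing: high A1c despite
    # being on 2+ medication classes (threshold is an assumption).
    out["treatment_resistant"] = (
        (out["a1c_mean"] >= 8.0) & (out["n_med_classes"] >= 2)
    ).astype(int)
    # The flip side: elevated A1c with no diabetes medication on record.
    out["undertreated"] = (
        (out["a1c_mean"] >= 8.0) & (out["n_med_classes"] == 0)
    ).astype(int)
    # Instability across serial readings predicts decompensation even
    # when the mean looks acceptable.
    a1c_readings = ["a1c_q1", "a1c_q2", "a1c_q3", "a1c_q4"]
    out["a1c_variability"] = out[a1c_readings].std(axis=1)
    return out
```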
MODEL BUILDING

From logistic regression to tuned XGBoost

I built progressively: baseline first, then advanced models, then hyperparameter tuning. Each step justified by clear improvement.

Model · AUC-ROC
Logistic Regression · 0.828
Random Forest · 0.847
XGBoost (default) · 0.827
XGBoost (tuned) · 0.852

Tuning: RandomizedSearchCV with 80 parameter combinations across 5-fold CV (400 fits). Key insight: smaller learning rate (0.01 vs 0.1) + more trees (800) + regularization (gamma=0.3, subsample=0.7) produced the best generalization.
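A sketch of that tuning setup. The winning values above come from the project; the rest of the search space, and the `X_train`/`y_train` names for the stratified 80% split, are assumptions for illustration:

```python
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

# Search space is assumed beyond the named winners
# (learning_rate=0.01, n_estimators=800, gamma=0.3, subsample=0.7).
param_dist = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [200, 400, 800],
    "max_depth": [3, 4, 5, 6],
    "gamma": [0.0, 0.1, 0.3],
    "subsample": [0.7, 0.85, 1.0],
    "colsample_bytree": [0.7, 0.85, 1.0],
}

search = RandomizedSearchCV(
    xgb.XGBClassifier(eval_metric="auc", scale_pos_weight=8),  # ~1:8 imbalance
    param_distributions=param_dist,
    n_iter=80,            # 80 combinations x 5 folds = 400 fits
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```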

ROC curve
ROC curve for final tuned XGBoost model (AUC = 0.852)
FINAL RESULTS

97% NPV: the model's superpower is ruling out risk

The model catches 81% of patients who will become uncontrolled (sensitivity). But its strongest metric is NPV: if the model says a patient is low risk, there's a 97% chance they truly stay controlled. In a screening context, that's the metric that matters most.

Metric · Value · What it means
AUC-ROC · 0.852 · Strong discrimination between controlled and uncontrolled
Sensitivity · 80.6% · Catches 4 out of 5 patients who will lose control
NPV · 97.0% · If cleared by the model, 97% truly stay controlled
Specificity · 74.2% · 1 in 4 controlled patients gets flagged (acceptable for screening)
PPV · 27.0% · 1 in 4 flagged patients is truly uncontrolled
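For reference, every metric in the table falls out of the 0.5-threshold confusion matrix. A short sketch, with placeholder arrays (`y_true`, `y_score`) standing in for the held-out labels and predicted probabilities:

```python
from sklearn.metrics import confusion_matrix

# Threshold 0.5 matches the table above.
tn, fp, fn, tp = confusion_matrix(y_true, y_score >= 0.5).ravel()
sensitivity = tp / (tp + fn)  # 0.806: catches 4 of 5 future-uncontrolled
specificity = tn / (tn + fp)  # 0.742
ppv = tp / (tp + fp)          # 0.270
npv = tn / (tn + fn)          # 0.970
```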
Confusion matrix
Confusion matrix at threshold 0.5
Performance metrics
Comprehensive performance metrics
EXPLAINABILITY

What's driving the predictions?

SHAP values reveal that mean A1c is by far the strongest driver, followed by max A1c and the most recent reading. But clinically meaningful engineered features like treatment resistance and A1c variability earn their place in the top 10.
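A minimal sketch of how these attributions are computed with SHAP's tree explainer, assuming the tuned model and test matrix from the tuning sketch above:

```python
import shap

# Explain the tuned XGBoost model on the held-out feature matrix.
explainer = shap.TreeExplainer(search.best_estimator_)
shap_values = explainer.shap_values(X_test)

# Global importance: mean |SHAP value| per feature, as a bar chart.
shap.summary_plot(shap_values, X_test, plot_type="bar")
```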

SHAP values
SHAP feature importance: what drives the model's predictions
Clinically interpretable, not a black box
Every top feature maps to clinical reasoning a physician would recognize: higher A1c = higher risk, worsening trajectory = higher risk, more meds without improvement = treatment resistance. The model learned what clinicians already know, then quantified it at scale.
EQUITY

No fairness red flags across subgroups

The model performs consistently (AUC 0.83 to 0.88) across all demographic subgroups. It actually catches more uncontrolled patients among Black (85% recall) and Asian (87%) patients than among White patients (79%). Medicaid patients have the highest recall at 91-93%.
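A sketch of the subgroup evaluation loop, assuming a test-set DataFrame with illustrative column names for labels, scores, and demographics (and both classes present in each subgroup):

```python
from sklearn.metrics import roc_auc_score, recall_score

# `test` holds held-out labels, model scores, and demographic columns.
for col in ["race", "gender", "age_group", "insurance"]:
    for group, sub in test.groupby(col):
        auc = roc_auc_score(sub["y_true"], sub["y_score"])
        rec = recall_score(sub["y_true"], sub["y_score"] >= 0.5)
        print(f"{col}={group}: AUC={auc:.2f}, recall={rec:.2f}")
```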

Fairness analysis
Model performance across race, gender, age, and insurance subgroups
CLINICAL APPLICATION

Adjustable threshold for different use cases

The model's operating point can be tuned depending on the clinical context. Lower thresholds catch more patients (better for population-level screening), higher thresholds reduce false alarms (better for resource-intensive interventions).

Threshold · Sensitivity · Patients flagged · Use case
0.15 · 96.4% · 7,697 · Population screening, automated nudges
0.30 · 90.6% · 5,552 · Proactive outreach, care coordination
0.50 · 80.6% · 3,953 · Clinical review, medication escalation
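The table comes from sweeping the operating point over the held-out scores. A minimal sketch with the same placeholder arrays as above:

```python
from sklearn.metrics import recall_score

# Lower thresholds flag more patients (higher sensitivity);
# higher thresholds trade coverage for fewer false alarms.
for t in (0.15, 0.30, 0.50):
    flagged = y_score >= t
    sens = recall_score(y_true, flagged)
    print(f"threshold {t:.2f}: sensitivity {sens:.1%}, flagged {int(flagged.sum()):,}")
```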
WHAT I LEARNED

A clinician building ML from scratch

I came into this with a clinical and public health background, not CS. Every step was learned and earned. Three lessons stand out.

01
Domain knowledge is the real feature engineering. The confounding insight about medications and treatment resistance came from clinical training, not a Kaggle tutorial. That single insight produced a top-10 feature.
02
Explore before you model. Five exploration steps before writing a single line of modeling code. The leakage detection alone saved the entire project from producing a meaningless submission.
03
The right metric depends on the use case. Accuracy (89.4%) is useless when you're trying to catch the 10.6% who fail. NPV (97%) is what makes this model deployable as a screening tool.
METHODS

16-step reproducible pipeline

From raw CSVs to tuned submission in 16 documented steps. Every decision recorded in a living README. Data exploration (5 steps), cleaning, feature engineering (20 features from 41 raw columns), stratified 80/20 split, progressive modeling (logistic regression, random forest, XGBoost), RandomizedSearchCV tuning (400 fits), SHAP explainability, subgroup fairness analysis, threshold optimization, and presentation-ready figures.

Python · XGBoost · scikit-learn · SHAP · pandas · Feature Engineering · EHR Data · Class Imbalance · Hyperparameter Tuning · Fairness Analysis
Tech Stack

Built with

XGBoost over 20 engineered features from EHR data. SHAP for feature attribution, fairness metrics stratified by race, sex, age, and insurance. Full model card shipped alongside.

Python · XGBoost · scikit-learn · SHAP · pandas · matplotlib · Jupyter

Interested in predictive modeling, clinical ML, or health data science?

Get in touch