The business question
Forget the dataset for a second. The question a real credit risk team actually asks is something closer to: are we declining mortgage applications in a way that's explainable by income and debt, or in a way that's explainable by geography we'd rather not be explaining at all? That second version of the question is the one that gets banks fined, and it's the one this project is actually about.
Why this matters
The Home Mortgage Disclosure Act (HMDA) requires most lenders to report loan-level data on every mortgage application they receive — approved, denied, or withdrawn — along with the applicant's income, loan amount, and census tract. It's one of the few places you can look directly at lending behavior without a bank's permission. Regulators use it to spot patterns that look like discriminatory lending. I wanted to see what that actually looks like from the inside of a SQL query, not just from a headline.
The approach
I pulled a year of HMDA loan application records for a handful of metro areas and started with the question regulators actually ask first: does the approval rate differ by census tract income level, holding loan amount and applicant income roughly constant? If it doesn't, that's a clean story. If it does, the next question is whether anything in the data explains the gap — or whether it's the kind of gap that shows up in an exam letter.
I didn't reach for a model first. The instinct to jump straight to a classifier here is exactly the instinct that buries the actual finding — a simple grouped approval-rate table, sorted and filtered correctly, tells you 80% of the story before any modeling starts.
The build
SQL — finding the pattern
The first real query is a straightforward aggregation, but the window function underneath it is doing the actual work — ranking tracts within each metro by approval rate so the comparison is relative to local norms, not a national average that hides regional cost-of-living differences.
SELECT
metro_area,
census_tract,
tract_income_level,
COUNT(*) AS total_applications,
SUM(CASE WHEN action_taken = 'approved' THEN 1 ELSE 0 END) AS approvals,
ROUND(100.0 * SUM(CASE WHEN action_taken = 'approved' THEN 1 ELSE 0 END)
/ COUNT(*), 1) AS approval_rate,
RANK() OVER (
PARTITION BY metro_area
ORDER BY 100.0 * SUM(CASE WHEN action_taken = 'approved' THEN 1 ELSE 0 END)
/ COUNT(*) ASC
) AS lowest_approval_rank
FROM hmda_applications
GROUP BY metro_area, census_tract, tract_income_level
HAVING COUNT(*) >= 30;
The HAVING COUNT(*) >= 30 matters more than it looks — without it,
tiny tracts with three applications produce approval rates of 0% or 100% that
have nothing to do with lending behavior and everything to do with small numbers.
Python — the cleanup nobody shows you
Here's the part most write-ups skip: HMDA's public data is genuinely messy. Income fields are reported in different units across years, some loan purposes are coded inconsistently, and a meaningful share of records have a denial reason field that's blank even when the application was denied. None of that is a modeling problem — it's a "read the field documentation twice and write defensive code" problem.
import pandas as pd
df = pd.read_csv("hmda_2025_metro_sample.csv")
# Income is reported in thousands in some extract years — normalize to dollars
df["applicant_income"] = df["applicant_income"].apply(
lambda x: x * 1000 if x < 1000 else x
)
# Denial reason is blank for ~6% of denied records — flag rather than drop,
# since dropping silently would understate the denial-reason-missing pattern itself
df["denial_reason_missing"] = (
(df["action_taken"] == "denied") & (df["denial_reason"].isna())
)
risk_flags = df.groupby("census_tract").agg(
denial_rate=("action_taken", lambda x: (x == "denied").mean()),
missing_reason_rate=("denial_reason_missing", "mean"),
median_income=("applicant_income", "median"),
).reset_index()
That denial_reason_missing flag turned out to be more interesting
than I expected — tracts with high missing-reason rates clustered geographically,
which is its own small finding before any approval-rate analysis even starts.
Findings
Across the metros I looked at, the lowest-approval-rate tracts weren't randomly distributed — they clustered in a way that correlated more strongly with median tract income than with applicant-level income or loan amount. That's the pattern that's worth a second look, not because it proves anything on its own, but because it's exactly the shape regulators are trained to flag.
A clean grouped query with the right window function surfaced a pattern worth investigating before a single model was trained. The lesson isn't "don't use machine learning" — it's that the cheapest, most explainable tool should get first crack at the question, especially when the answer needs to survive a regulator asking "explain this number to me."
What I'd do next
The natural next step is controlling for more variables simultaneously — a logistic regression with tract income, applicant income, loan-to-value ratio, and loan purpose as predictors, to see whether the geographic pattern survives once those are held constant. I'd also want at least two more years of data before drawing any real conclusion — one year of HMDA data is a snapshot, not a trend.