
Transform complex data into actionable insights
An end-to-end data platform that transforms raw, messy patient visit records into a clean, actionable Power BI dashboard — built on Databricks and the Medallion Architecture.
The Problem
The hospital's patient visit data was stored in a single raw CSV — inconsistent, incomplete, and impossible to report on. Leadership had no reliable source of truth for operations or costs.
Two conflicting date formats (MM/DD/YYYY and YYYY-MM-DD) scattered across 4 date columns, causing filter and aggregation failures downstream.
150 duplicate rows inflated counts. Critical fields like Patient ID, Treatment Cost, and Room Type held nulls that distorted KPI calculations.
“PAID”, “paid”, “Paid” treated as 3 separate values. Emergency Visit encoded as “y”, “YES”, “yes”, “n”, “NO” — unusable for group-by analysis.
Negative treatment costs, medication cost outliers (value: 999,999), and satisfaction scores outside the 1–10 valid range corrupted averages.
Solution Architecture
We designed a layered data pipeline following the Medallion pattern — Raw → Bronze → Silver → Gold — ensuring each layer adds quality and business context before surfacing to Power BI.
Data Engineering
Every transformation was systematically applied and validated. The pipeline was written in PySpark and runs inside Databricks notebooks, producing a certified Delta table.
Read dirty CSV from DBFS with schema inference and multiLine support.
Null counts, duplicate rows, summary statistics — baseline quality report established.
150 exact duplicate rows removed across all 20 columns.
All 20 columns renamed to snake_case for consistent downstream SQL and DAX usage.
Mixed MM/DD/YYYY and YYYY-MM-DD formats unified to ISO DateType across 4 columns.
Leading/trailing whitespace removed from all string columns to prevent silent JOIN mismatches.
Service type, payment status, referral source normalised to consistent Title Case.
Emergency visit column variants (y, YES, n, NO…) mapped to clean Yes/No values.
Column-specific strategies: drop on PK, median impute on costs, business defaults on categoricals.
Negative costs fixed with abs(). Outlier medication costs and out-of-range satisfaction scores corrected.
All 20 columns cast to correct final types (IntegerType, DoubleType, DateType, StringType).
Full quality re-run confirms 0 nulls on PKs, 0 negatives, 0 out-of-range scores, 0 duplicates.
Cleaned dataset written as a registered Delta Lake table with overwriteSchema support.
Final SQL query confirms the table is accessible and grouped aggregations return correct values.
Deliverable
The cleaned Delta table powers a live Power BI dashboard giving hospital leadership real-time insight into patient flow, costs, satisfaction, and department performance.
Results
Technology