Healthcare Data Intelligence

Hospital Operations
Optimization Dashboard

An end-to-end data platform that transforms raw, messy patient visit records into a clean, actionable Power BI dashboard — built on Databricks and the Medallion Architecture.

5,150
Patient Records Processed
14
Pipeline Steps
20
KPI Metrics Tracked
100%
Data Quality Score

The Problem

Fragmented data, zero visibility

The hospital's patient visit data was stored in a single raw CSV — inconsistent, incomplete, and impossible to report on. Leadership had no reliable source of truth for operations or costs.

Mixed date formats

Two conflicting date formats (MM/DD/YYYY and YYYY-MM-DD) scattered across 4 date columns, causing filter and aggregation failures downstream.

Duplicate & null records

150 duplicate rows inflated counts. Critical fields like Patient ID, Treatment Cost, and Room Type held nulls that distorted KPI calculations.

Inconsistent categoricals

“PAID”, “paid”, “Paid” treated as 3 separate values. Emergency Visit encoded as “y”, “YES”, “yes”, “n”, “NO” — unusable for group-by analysis.

Invalid numeric data

Negative treatment costs, medication cost outliers (value: 999,999), and satisfaction scores outside the 1–10 valid range corrupted averages.

Solution Architecture

Medallion Architecture on Databricks

We designed a layered data pipeline following the Medallion pattern — Raw → Bronze → Silver → Gold — ensuring each layer adds quality and business context before surfacing to Power BI.

Source
Raw CSV
Dirty patient visit data loaded from DBFS
5,150 rows 20 columns
Ingestion · Bronze
Apache Spark
Schema inference, deduplication, column rename
PySpark Databricks
Transform · Silver
Delta Lake
Null handling, type casting, outlier repair, validation
ACID Versioned
Serve · Gold
Power BI
Live dashboard with operational KPIs for hospital leadership
DirectQuery DAX

Data Engineering

14-step cleaning pipeline

Every transformation was systematically applied and validated. The pipeline was written in PySpark and runs inside Databricks notebooks, producing a certified Delta table.

Ingest

Load raw CSV

Read dirty CSV from DBFS with schema inference and multiLine support.

Audit

Data profiling

Null counts, duplicate rows, summary statistics — baseline quality report established.

Clean

Deduplication

150 exact duplicate rows removed across all 20 columns.

Standardise

Column rename

All 20 columns renamed to snake_case for consistent downstream SQL and DAX usage.

Standardise

Date normalisation

Mixed MM/DD/YYYY and YYYY-MM-DD formats unified to ISO DateType across 4 columns.

Clean

Whitespace trim

Leading/trailing whitespace removed from all string columns to prevent silent JOIN mismatches.

Standardise

Title case casing

Service type, payment status, referral source normalised to consistent Title Case.

Standardise

Yes/No encoding

Emergency visit column variants (y, YES, n, NO…) mapped to clean Yes/No values.

Clean

Null handling

Column-specific strategies: drop on PK, median impute on costs, business defaults on categoricals.

Clean

Invalid values

Negative costs fixed with abs(). Outlier medication costs and out-of-range satisfaction scores corrected.

Schema

Type enforcement

All 20 columns cast to correct final types (IntegerType, DoubleType, DateType, StringType).

Validate

Post-clean QA

Full quality re-run confirms 0 nulls on PKs, 0 negatives, 0 out-of-range scores, 0 duplicates.

Persist

Write to Delta

Cleaned dataset written as a registered Delta Lake table with overwriteSchema support.

Verify

Sanity check

Final SQL query confirms the table is accessible and grouped aggregations return correct values.

Deliverable

Power BI operations dashboard

The cleaned Delta table powers a live Power BI dashboard giving hospital leadership real-time insight into patient flow, costs, satisfaction, and department performance.

Hospital Operations — Overview Live · Delta Lake
5,000
Total visits
$3,240
Avg treatment cost
7.8
Avg satisfaction
18%
Emergency rate
Visits by service type
Outpatient Inpatient Emergency Follow-up Preventive Other
Payment status
Paid 60%
● Paid ● Pending ● Outstanding

Results

What the client gained

  • Single source of truth All departments now share one certified, versioned Delta table — no more conflicting reports from different data exports.
  • Real-time cost visibility Leadership can monitor treatment and medication costs by department, service type, and room type without waiting for manual reports.
  • Patient satisfaction tracking Satisfaction scores, filtered by provider and diagnosis, are now a live KPI powering quality improvement initiatives.
  • Emergency pattern detection Emergency vs. standard visit ratios are tracked over time, enabling resource planning and bed management decisions.
Records cleaned 5,000 After removing 150 duplicates
Issues resolved 100% Post-clean QA: 0 critical issues
Pipeline steps 14 Automated, repeatable, versioned
Columns tracked 20 All cast to correct schema types

Technology

Stack used

Apache Spark · PySpark
Databricks
Delta Lake
Power BI
DAX
DBFS
Medallion Architecture
Spark SQL
Shopping Basket