Portfolio
Projects
Production-grade data engineering systems built to handle enterprise-scale workloads. Each project solves real business problems with measurable impact.
10M Events/Day Kafka → Spark Streaming Pipeline
Production Kafka → Spark Structured Streaming pipeline processing 10M+ events/day with exactly-once delivery to Delta Lake. Uses watermark-based late-event handling, idempotent MERGE upserts, and a dead-letter queue with automatic replay. Reduced end-to-end latency from 8 minutes to under 5 seconds.
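To illustrate how watermarking, keyed upserts, and a dead-letter queue fit together, here is a minimal pure-Python sketch. It is not the production Spark code: the `Event` and `Sink` classes, the `event_id` key, and the 60-second lateness window in the usage example are all hypothetical, and the in-memory dict only stands in for a MERGE into a keyed table.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str    # hypothetical unique key used for the upsert
    event_time: int  # event timestamp, in seconds
    payload: str


@dataclass
class Sink:
    """Toy stand-in for a keyed table plus a dead-letter queue."""
    allowed_lateness: int                       # watermark delay, in seconds
    table: dict = field(default_factory=dict)   # event_id -> Event
    dead_letters: list = field(default_factory=list)
    max_event_time: int = 0

    @property
    def watermark(self) -> int:
        # Events with a timestamp below this value are treated as too late.
        return self.max_event_time - self.allowed_lateness

    def process_batch(self, batch: list) -> None:
        for event in batch:
            if event.event_time < self.watermark:
                # Too late to merge: park it for later inspection or replay.
                self.dead_letters.append(event)
                continue
            # Keyed upsert: re-processing the same event leaves the table unchanged.
            self.table[event.event_id] = event
        if batch:
            # Advance the watermark only after the whole batch is handled.
            newest = max(e.event_time for e in batch)
            self.max_event_time = max(self.max_event_time, newest)
```

Usage, under the same assumptions: create `Sink(allowed_lateness=60)`, call `process_batch` twice with the same batch, and the table still holds one row per `event_id`. That property is what makes a retried micro-batch safe. Replay of the dead-letter queue is not covered by this sketch.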
Enterprise Lakehouse — Databricks Medallion Architecture
Unified 50+ isolated AWS Glue jobs into a Databricks Delta Lake medallion architecture (Bronze/Silver/Gold). Unity Catalog for governance, dbt for schema contracts, Photon-powered Gold layer. Achieved 60% pipeline runtime reduction and eliminated schema conflicts across 8 engineering teams.
100TB Warehouse Migration — Redshift & Oracle → Snowflake + BigQuery
Led the migration of 100+ TB from on-premises Oracle and legacy AWS Redshift to Snowflake and BigQuery using a dual-write validation strategy. Re-modeled the physical layer with micro-partition clustering and incremental ELT using dbt. Achieved a 70% query performance improvement (p95: 42s → 11s) and a 40% cost reduction with zero-downtime cutover.
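One common way to validate a dual-write setup is to compare a row count and an order-independent checksum between the legacy and the new table. The sketch below shows that idea in plain Python. The function names and the use of `repr(row)` as the row serialization are assumptions for illustration, not the validation code used in the migration.

```python
import hashlib

MODULUS = 2 ** 256  # keeps the running checksum at a fixed size


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    Per-row SHA-256 digests are summed modulo 2**256, so the result
    does not depend on the order in which rows are read.
    """
    count = 0
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, checksum


def tables_match(source_rows, target_rows):
    """True when both sides have the same row count and checksum."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)
```

In practice the two systems must serialize values identically (numeric precision, timestamps, NULLs) before hashing, otherwise equal data produces different fingerprints.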
ML Feature Store — 1,000+ Features, p99 < 8ms Online Serving
Centralized dual-mode feature platform on Databricks: a Delta Lake offline store (point-in-time correct for training) and a Redis online store (p99 < 8ms for inference). Eliminated training-serving skew across 4 ML teams and reduced feature engineering time from days to hours.
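Point-in-time correctness means a training row only sees the feature value that was known at its own timestamp, never a later one. A minimal sketch of that lookup, assuming a hypothetical per-entity history of `(timestamp, value)` pairs sorted by timestamp:

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest value with timestamp <= as_of, or None.

    history: list of (timestamp, value) pairs sorted by timestamp ascending.
    """
    timestamps = [ts for ts, _ in history]
    # Index of the first entry strictly newer than as_of.
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None
```

For example, with `history = [(10, 0.2), (20, 0.5), (30, 0.9)]`, a label at time 25 gets the value recorded at time 20, and a label at time 5 gets no value at all. Whether the boundary is inclusive (`<=`, as here) or strict is a design choice that depends on when a feature is considered available.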
Real-Time Credit Decisioning — 48h Batch → < 2min Streaming
Replaced overnight batch credit scoring with a Kafka-driven real-time pipeline. PySpark micro-batch feature engineering computes 200+ credit risk signals in real time and is integrated with a REST model serving layer. Reduced decisioning latency from 48 hours to under 2 minutes while maintaining 95%+ model accuracy at 100K+ applications/day.
Cost Engineering Framework — 40% Platform Spend Reduction
Automated framework for Spark cluster rightsizing, S3 → Glacier storage tiering, and cross-workspace cost anomaly detection using an Isolation Forest model. Built centralized cost analytics aggregating usage data from AWS Cost Explorer, Databricks, and Snowflake. Achieved a 40% platform spend reduction in 90 days.
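The framework itself uses Isolation Forest. As a dependency-free illustration of the same goal (flagging days whose spend is far from normal), the sketch below swaps in a simpler technique, the modified z-score based on the median absolute deviation. The function name, the 3.5 threshold, and the sample figures in the usage note are assumptions for illustration only.

```python
from statistics import median


def cost_anomalies(daily_costs, threshold=3.5):
    """Return indices of days whose modified z-score exceeds the threshold.

    Uses median and median absolute deviation (MAD), which are less
    sensitive to the outliers being searched for than mean and stdev.
    """
    center = median(daily_costs)
    mad = median(abs(cost - center) for cost in daily_costs)
    if mad == 0:
        # No spread around the median: nothing can be scored reliably.
        return []
    return [
        i
        for i, cost in enumerate(daily_costs)
        if 0.6745 * abs(cost - center) / mad > threshold
    ]
```

With made-up daily costs of `[100, 102, 98, 101, 99, 250, 100]`, only the sixth day (index 5) is flagged. Unlike Isolation Forest, this works on a single metric at a time and does not capture interactions between cost dimensions.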