Client Work

Every project below delivered a measurable business outcome — cost savings, hours recovered, or revenue enabled. This is what "Future-Proof" data looks like in practice.

Featured Engagements

< 5s Latency · 0 Data Loss

10M+ events/day Kafka → Spark Structured Streaming Pipeline

Exactly-once · Late-event watermarks · Sub-5s end-to-end latency · AWS EMR

Built a production Kafka → Spark Structured Streaming pipeline processing 10M+ events/day with exactly-once delivery to Delta Lake. Implemented watermark-based late-event handling (30-minute tolerance), idempotent MERGE upserts, and a dead-letter queue with automatic replay. Reduced end-to-end data latency from 8 minutes to under 5 seconds.

Apache Kafka · Spark Structured Streaming · Delta Lake · AWS EMR · PySpark
[Diagram: exactly-once streaming pipeline. Kafka (10M+ events/day) → Spark Structured Streaming (late-event watermark, dedup) → Delta Lake (exactly-once writes) → Looker (< 5s fresh). Key guarantees: exactly-once, late events tolerated, auto-dedup. Result: minutes → < 5s latency, 10M+ events/day, 0 data loss.]
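To make the exactly-once mechanics concrete, here is a minimal PySpark sketch of the pattern described above: a 30-minute watermark plus an idempotent MERGE inside `foreachBatch`. Broker address, topic, table, and checkpoint path are placeholders, not the client's; the pure helper at the top just states the watermark rule.

```python
from datetime import datetime, timedelta

# 30-minute late-event tolerance, as in the engagement.
LATE_TOLERANCE = timedelta(minutes=30)

def is_within_watermark(event_time: datetime, max_event_time: datetime) -> bool:
    """An event is accepted if it is no older than the watermark:
    the maximum observed event time minus the tolerance."""
    return event_time >= max_event_time - LATE_TOLERANCE

def build_stream(spark):
    """Sketch of the Kafka → Delta exactly-once path (requires pyspark + delta-spark).
    All names below (broker, topic, table, checkpoint path) are illustrative."""
    from pyspark.sql import functions as F

    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder address
        .option("subscribe", "events")                      # placeholder topic
        .load()
        .select(F.col("value").cast("string"), F.col("timestamp").alias("event_time"))
        .withWatermark("event_time", "30 minutes")          # drop events past the tolerance
        .dropDuplicates(["value", "event_time"])            # dedup within the watermark window
    )

    def upsert(batch_df, batch_id):
        # Idempotent MERGE keyed on event identity: replaying a batch is a no-op,
        # which together with the checkpoint gives exactly-once end to end.
        from delta.tables import DeltaTable
        target = DeltaTable.forName(batch_df.sparkSession, "silver.events")
        (target.alias("t")
               .merge(batch_df.alias("s"),
                      "t.value = s.value AND t.event_time = s.event_time")
               .whenNotMatchedInsertAll()
               .execute())

    return (events.writeStream
                  .foreachBatch(upsert)
                  .option("checkpointLocation", "s3://bucket/checkpoints/events")  # placeholder
                  .start())
```

The checkpoint tracks Kafka offsets per micro-batch, so a restart replays the last batch; the MERGE condition makes that replay harmless.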
-60% Pipeline Runtime

Enterprise Lakehouse — Databricks Medallion Architecture

50+ siloed AWS Glue jobs → Bronze/Silver/Gold · Unity Catalog · 60% faster pipelines

Replaced a fragmented multi-warehouse topology (50+ isolated AWS Glue jobs, 8 engineering teams, no shared catalog) with a unified Delta Lake medallion architecture on Databricks: Unity Catalog for governance, automated schema contracts via dbt, and a Photon-powered Gold layer for BI and ML. Pipeline runtime dropped 60% and schema conflicts were eliminated.

Databricks · Delta Lake · PySpark · Unity Catalog · dbt · Apache Spark
[Diagram: Delta Lake medallion architecture. Sources (Kafka events, S3 raw, DB CDC, 50+ APIs) → Bronze (raw, CDC, ingest) → Silver (clean, typed, tested) → Gold (KPIs, features, marts), governed by Unity Catalog, dbt models, the Photon engine, and Databricks SQL. Result: −60% runtime, 50+ sources unified, 0 schema conflicts.]
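The "automated schema contracts" idea can be sketched in a few lines: a Bronze → Silver promotion that refuses to run if the incoming schema breaks the contract. In the engagement the contracts were enforced via dbt tests; the table name, column names, and types below are hypothetical.

```python
# Hypothetical Silver-layer contract (the real ones lived in dbt as schema tests).
SILVER_CONTRACT = {"order_id": "bigint", "amount": "double", "order_ts": "timestamp"}

def contract_violations(actual_schema: dict, contract: dict = SILVER_CONTRACT) -> list:
    """Columns that are missing from, or mistyped in, the incoming schema."""
    return sorted(
        col for col, dtype in contract.items()
        if actual_schema.get(col) != dtype
    )

def bronze_to_silver(spark):
    """Sketch of a Bronze → Silver promotion (requires pyspark + delta-spark).
    Table names are illustrative."""
    from pyspark.sql import functions as F

    bronze = spark.read.table("bronze.orders")
    schema = {f.name: f.dataType.simpleString() for f in bronze.schema.fields}
    bad = contract_violations(schema)
    if bad:
        # Fail fast instead of silently propagating a schema drift downstream.
        raise ValueError(f"schema contract broken for columns: {bad}")
    (bronze.dropDuplicates(["order_id"])
           .withColumn("order_ts", F.to_timestamp("order_ts"))
           .write.format("delta").mode("overwrite")
           .saveAsTable("silver.orders"))
```

Failing the promotion at the Bronze/Silver boundary is what turned cross-team schema conflicts into build-time errors rather than production incidents.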

More Client Work

[Diagram: before vs. after. Before: Redshift + Oracle, T+1 freshness, full scans, 100TB. After: dual-write migration with zero downtime to Snowflake (micro-partition clustering, auto-suspend, zero-copy clone) and BigQuery (columnar + partitioned, p95 queries 42s → 11s), orchestrated with dbt + Airflow (incremental ELT, dual-write validation). Result: 70% query performance gain, 40% cost saved, 100TB migrated with zero downtime.]
70% Faster Queries

100TB Warehouse Migration — Redshift & Oracle → Snowflake + BigQuery

Dual-write validation · Zero downtime · p95 query time 42s → 11s

Led the migration of 100+ TB from on-premises Oracle and legacy AWS Redshift to Snowflake and BigQuery using a dual-write validation strategy. Re-modelled the physical layer (micro-partition clustering, column ordering, incremental ELT with dbt). Achieved a 70% query performance improvement and 40% cost reduction with a zero-downtime cutover.

Snowflake · BigQuery · dbt · Airflow
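The heart of a dual-write cutover is the validation loop: write to both warehouses, then compare per-table reconciliation metrics until they agree. A minimal sketch of that comparison, assuming the metrics are per-table row counts already pulled from each side (in practice checksums and sampled column aggregates would be compared too):

```python
def validate_dual_write(legacy_counts: dict, target_counts: dict,
                        tolerance: float = 0.0) -> dict:
    """Compare per-table row counts from the legacy and new warehouses.

    Returns {table: (legacy, target)} for every table whose counts diverge
    by more than `tolerance` (a fraction of the larger count, to absorb
    in-flight writes during an active dual-write window)."""
    mismatches = {}
    for table in sorted(set(legacy_counts) | set(target_counts)):
        a = legacy_counts.get(table, 0)
        b = target_counts.get(table, 0)
        allowed = tolerance * max(a, b)
        if abs(a - b) > allowed:
            mismatches[table] = (a, b)
    return mismatches
```

Cutover proceeds only when this returns empty for every table over several consecutive runs; a nonzero tolerance is used mid-migration, tightened to zero before the final switch.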
[Diagram: dual-mode feature store on Databricks (1,000+ features, point-in-time correct). Offline store: Delta Lake + PySpark, feeding training jobs with MLflow tracking. Online store: Redis (p99 < 8ms), feeding the inference API. Result: 1,000+ features, 4 ML teams, p99 < 8ms online serving.]
p99 < 8ms · 4 Teams Served

ML Feature Store — 1,000+ Features, p99 < 8ms Online Serving

Dual-mode offline/online · Point-in-time correct · Zero training-serving skew

Built a centralised dual-mode feature platform on Databricks: a Delta Lake offline store (point-in-time correct for training) and a Redis online store (p99 < 8ms for inference), backed by identical feature definitions. Eliminated training-serving skew across 4 ML teams and cut feature engineering time from days to hours.

Databricks Feature Store · MLflow · PySpark · Redis
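"Point-in-time correct" has a precise meaning worth showing: when building a training row as of time T, only feature values observed at or before T may be joined in. A minimal sketch of that rule, plus the matching online write (the Redis key layout and function names are illustrative, not the platform's API):

```python
from datetime import datetime

def point_in_time_value(history, as_of):
    """Return the latest feature value observed at or before `as_of`.

    `history` is a list of (timestamp, value) pairs. Training rows built this
    way never see values from the future, which is what removes
    training-serving skew."""
    eligible = [(ts, v) for ts, v in history if ts <= as_of]
    return max(eligible, key=lambda p: p[0])[1] if eligible else None

def serve_online(redis_client, entity_id: str, feature: str, value):
    """Sketch: push the same computed value to Redis for low-latency reads
    (requires redis-py; the key layout is illustrative)."""
    redis_client.hset(f"feat:{entity_id}", feature, value)
```

Because the offline join and the online write both consume one feature definition, the value the model trained on and the value it sees at inference are computed by the same code path.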
[Diagram: before vs. after. Before: overnight batch scoring at 2am, 48h decision latency. After: real-time decisioning in < 2 minutes: applicants (100K+/day) → Kafka events → PySpark feature engineering → ML score (95% accuracy). Result: 48h → < 2 min, 100K+ apps/day, 95%+ model accuracy maintained.]
48h → < 2min Decisioning

Real-Time Credit Decisioning — 48h Batch → < 2min Streaming

Kafka + PySpark micro-batch feature engineering · 100K+ apps/day · 95%+ accuracy

Replaced overnight batch credit scoring with a Kafka-driven real-time pipeline. PySpark micro-batch feature engineering computes 200+ credit risk signals in real time, integrated with a REST model serving layer. Reduced decisioning latency from 48 hours to under 2 minutes while maintaining 95%+ model accuracy at 100K+ applications/day.

PySpark · Apache Kafka · PostgreSQL · AWS
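A sketch of the micro-batch scoring hop: per-application features computed in the stream, then posted to the model-serving REST layer. The two signals and the endpoint URL below are illustrative stand-ins for the 200+ real ones, not the client's feature set.

```python
def credit_features(app: dict) -> dict:
    """Two of the 200+ risk signals (names and formulas are illustrative)."""
    income = max(app["monthly_income"], 1.0)   # guard against divide-by-zero
    limit = max(app["credit_limit"], 1.0)
    return {
        "debt_to_income": round(app["monthly_debt"] / income, 4),
        "utilization": round(app["card_balance"] / limit, 4),
    }

def score_batch(batch_df, batch_id):
    """Sketch of a foreachBatch handler: compute features per application and
    POST them to the model-serving endpoint (requires pyspark + requests;
    the URL is a placeholder). At ~100K apps/day a per-row POST is ~1/s."""
    import requests
    for row in batch_df.toLocalIterator():
        feats = credit_features(row.asDict())
        requests.post("https://scoring.internal/v1/score", json=feats, timeout=2)
```

Wired into a Kafka `readStream` with `foreachBatch(score_batch)`, each micro-batch turns around in seconds, which is where the 48h → < 2 min figure comes from: the latency is now dominated by stream trigger intervals, not a nightly cron.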
[Diagram: cost governance across 20+ Databricks workspaces. Inputs (AWS API, GCP API, Databricks, BigQuery, Terraform) → anomaly engine (Isolation Forest, Python, Terraform) → auto-remediation: runaway-cluster auto-terminate, S3 → Glacier storage tiering, Slack alerts with chargeback. Result: −40% cloud spend in 90 days, 20+ workspaces automated.]
-40% Cloud Spend in 90 Days

Data Platform Cost Governance — −40% Spend in 90 Days

Isolation Forest anomaly detection · Terraform auto-remediation · 20+ Databricks workspaces

Built an automated cost governance framework that uses Isolation Forest models to detect runaway Spark clusters, enforce S3 → Glacier storage tiering, and rightsize compute across 20+ Databricks workspaces. Terraform-driven auto-remediation halts anomalous jobs and routes Slack/PagerDuty alerts with per-team chargeback attribution.

Python · Terraform · Databricks · AWS Cost Explorer
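The anomaly-detection step can be illustrated on a daily-spend series. The first function is a simple z-score stand-in that captures the idea (flag days whose spend deviates sharply from the norm); the second sketches the actual Isolation Forest approach via scikit-learn. All thresholds and the contamination rate are illustrative, not the tuned production values.

```python
from statistics import mean, stdev

def spend_anomalies(daily_costs, z_threshold=3.0):
    """Z-score stand-in for the Isolation Forest used in the engagement:
    flag indices whose spend is more than `z_threshold` sigmas from the mean."""
    mu, sigma = mean(daily_costs), stdev(daily_costs)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(daily_costs)
            if abs(c - mu) / sigma > z_threshold]

def isolation_forest_anomalies(daily_costs):
    """Sketch of the production approach (requires scikit-learn + numpy).
    Isolation Forest needs no distributional assumption, which matters for
    spiky multi-workspace cost data; contamination=0.05 is a placeholder."""
    import numpy as np
    from sklearn.ensemble import IsolationForest

    X = np.asarray(daily_costs, dtype=float).reshape(-1, 1)
    preds = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)
    return [i for i, p in enumerate(preds) if p == -1]  # -1 marks an outlier
```

Flagged indices feed the remediation layer: a Terraform-driven job terminates the offending cluster and posts the Slack alert with the owning team's chargeback tag attached.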