1,000+ features centralized, 4 ML teams served, p99 < 8ms online latency, 0 training-serving skew, feature dev time: days → hours

ML Feature Store — 1,000+ Features, p99 < 8ms Online Serving

Centralized dual-mode feature platform on Databricks: a Delta Lake offline store (point-in-time correct for training) and a Redis online store (p99 < 8ms for inference). Eliminated training-serving skew across 4 ML teams and reduced feature engineering time from days to hours.

Problem

Data scientists were recreating similar features across models, which caused inconsistency. The lack of feature versioning led to model reproducibility issues. Batch and online features diverged, producing training-serving skew. There was no central place to discover or reuse features.

Solution

Built the platform on Databricks Feature Store with dual-mode serving: Delta Lake for offline training (point-in-time correct lookups) and Redis for online inference (p99 < 8ms). MLflow tracks feature lineage and versioning, and PySpark pipelines automate materialization.
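
To make "point-in-time correctness" concrete, here is a minimal sketch in plain Python rather than PySpark, so it stays self-contained. It is not the project's pipeline code. The `history` layout, the `point_in_time_lookup` helper, and the sample values are all assumptions made for illustration. The idea is that each training label only sees the latest feature value recorded at or before the label's timestamp, never a later one.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history: entity_id -> list of (event_time, value),
# sorted by event_time. In the real platform this would live in Delta Lake.
history = {
    "user_1": [
        (datetime(2024, 1, 1), 3),
        (datetime(2024, 1, 5), 7),
        (datetime(2024, 1, 9), 12),
    ]
}


def point_in_time_lookup(history, entity_id, as_of):
    """Return the latest value observed at or before `as_of`, or None."""
    rows = history.get(entity_id, [])
    times = [event_time for event_time, _ in rows]
    # bisect_right counts how many rows have event_time <= as_of.
    idx = bisect_right(times, as_of)
    return rows[idx - 1][1] if idx else None


# Hypothetical training labels: (entity_id, label_timestamp).
labels = [
    ("user_1", datetime(2024, 1, 6)),
    ("user_1", datetime(2023, 12, 31)),
]

# Join each label to the feature value that was known at label time.
training_rows = [
    (entity_id, ts, point_in_time_lookup(history, entity_id, ts))
    for entity_id, ts in labels
]
```

The first label falls between the second and third feature updates, so it picks up the second value. The second label predates all feature history, so it gets `None` instead of leaking a future value.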

Architecture

Feature Pipelines (PySpark) → Offline Store (Delta Lake + point-in-time) → Online Store (Redis cluster) → ML Training + Inference
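
The offline-to-online step in this flow can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: a plain dict stands in for the Redis cluster, and the `feature:entity` key layout, the function names, and the sample rows are all hypothetical. It shows the latest offline value per entity being pushed online, followed by a parity check that reports entities whose online value has drifted.

```python
from datetime import datetime


def latest_per_entity(offline_rows):
    """Pick the most recent (event_time, value) per entity from offline rows."""
    latest = {}
    for entity_id, event_time, value in offline_rows:
        if entity_id not in latest or event_time > latest[entity_id][0]:
            latest[entity_id] = (event_time, value)
    return latest


def materialize(offline_rows, online_store, feature_name):
    """Write the latest offline value per entity into the online store."""
    for entity_id, (_, value) in latest_per_entity(offline_rows).items():
        # Hypothetical key layout: "<feature_name>:<entity_id>".
        online_store[f"{feature_name}:{entity_id}"] = value


def parity_mismatches(offline_rows, online_store, feature_name):
    """Return entity ids whose online value differs from the latest offline value."""
    mismatches = []
    for entity_id, (_, value) in latest_per_entity(offline_rows).items():
        if online_store.get(f"{feature_name}:{entity_id}") != value:
            mismatches.append(entity_id)
    return sorted(mismatches)


# Hypothetical offline rows: (entity_id, event_time, value).
offline_rows = [
    ("u1", datetime(2024, 1, 1), 3),
    ("u1", datetime(2024, 1, 5), 7),
    ("u2", datetime(2024, 1, 2), 4),
]

online_store = {}  # stand-in for Redis in this sketch
materialize(offline_rows, online_store, "orders_7d")
drift = parity_mismatches(offline_rows, online_store, "orders_7d")
```

Deriving both the write and the check from the same `latest_per_entity` function is the design point: the online value is defined as a function of the offline data, so any mismatch signals a failed or stale materialization and not a second, diverging feature definition.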

Key Challenges

  • Ensuring point-in-time correctness for historical feature lookups in training
  • Sub-10ms p99 latency requirements for online serving with Redis clustering
  • Automated materialization ensuring offline/online feature parity
  • Feature freshness monitoring and automated backfill pipelines
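
The freshness monitoring in the last item could look roughly like the sketch below. Everything in it is assumed for illustration: the feature names, the per-feature maximum age, and the `stale_features` and `backfill_windows` helpers are hypothetical, and a real deployment would read the timestamps from pipeline metadata instead of a literal dict.

```python
from datetime import datetime, timedelta


def stale_features(last_updated, max_age, now):
    """Return names of features whose last update is older than their allowed age."""
    return sorted(
        name for name, ts in last_updated.items() if now - ts > max_age[name]
    )


def backfill_windows(last_updated, max_age, now):
    """Map each stale feature to the (start, end) interval that needs recomputing."""
    return {
        name: (last_updated[name], now)
        for name in stale_features(last_updated, max_age, now)
    }


# Hypothetical monitoring inputs.
now = datetime(2024, 1, 10, 12, 0)
last_updated = {
    "orders_7d": datetime(2024, 1, 10, 11, 30),
    "avg_basket_30d": datetime(2024, 1, 8, 0, 0),
}
max_age = {
    "orders_7d": timedelta(hours=1),
    "avg_basket_30d": timedelta(days=1),
}

stale = stale_features(last_updated, max_age, now)
windows = backfill_windows(last_updated, max_age, now)
```

Here only the 30-day feature exceeds its allowed age, so it is the only one scheduled for backfill, with a window running from its last update to the current time.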

Tech Stack

  • Databricks Feature Store
  • MLflow
  • PySpark
  • Redis
  • Delta Lake
  • Apache Spark