ML Feature Store — 1,000+ Features, p99 < 8ms Online Serving
Centralized dual-mode feature platform on Databricks: Delta Lake offline store (point-in-time correct for training) and Redis online store (p99 < 8ms for inference). Eliminated training-serving skew across 4 ML teams and reduced feature engineering time from days to hours.
Problem
Data scientists were recreating similar features across models, which caused inconsistency. The lack of feature versioning led to model reproducibility issues. Batch and online features diverged, producing training-serving skew. There was no central place for feature discovery or reuse.
Solution
Built Databricks Feature Store with dual-mode serving: Delta Lake for offline training (point-in-time correctness) and Redis for online inference (< 8ms p99). MLflow for feature lineage and versioning. Automated materialization pipelines with PySpark.
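To make "point-in-time correctness" concrete, here is a minimal pure-Python sketch of the lookup rule: for each training label, use only the latest feature value recorded at or before the label's timestamp, so no future data leaks into training. The feature name, the sample history, and the helper `point_in_time_lookup` are hypothetical illustrations, not the project's actual code. On Databricks, this kind of as-of lookup would normally be done by the platform or a PySpark join rather than by hand.

```python
from bisect import bisect_right
from datetime import datetime


def point_in_time_lookup(history, as_of):
    """Return the latest value with timestamp <= as_of, or None if none exists.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    # bisect_right counts the entries with timestamp <= as_of.
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


# Hypothetical history of one feature (e.g. a purchase count) for one entity.
purchase_count_history = [
    (datetime(2024, 1, 1), 3),
    (datetime(2024, 1, 10), 5),
    (datetime(2024, 1, 20), 9),
]

# Timestamps of the label events that the training set is built from.
label_events = [
    datetime(2023, 12, 31),  # before any feature value exists
    datetime(2024, 1, 10),   # exactly on an update
    datetime(2024, 1, 15),   # between two updates
]

training_values = [
    point_in_time_lookup(purchase_count_history, t) for t in label_events
]
print(training_values)  # [None, 5, 5]
```

Whether a value stamped exactly at the label time counts as "known" is a design choice. This sketch includes it; a stricter pipeline might exclude it.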
Architecture
Feature Pipelines (PySpark) → Offline Store (Delta Lake + point-in-time) → Online Store (Redis cluster) → ML Training + Inference
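The hand-off from the offline store to the online store can be sketched as follows. This is an illustrative sketch under stated assumptions: a plain dict stands in for the Redis cluster, offline rows are plain dicts rather than Delta tables, and the key scheme `feat:{feature}:{entity_id}` and all function names are hypothetical. It shows two ideas from the flow above: materializing only the latest value per entity, and checking that the online store agrees with the offline store afterwards.

```python
def latest_per_entity(offline_rows):
    """Pick the most recent row for each (entity_id, feature) pair."""
    latest = {}
    for row in offline_rows:
        key = (row["entity_id"], row["feature"])
        if key not in latest or row["ts"] > latest[key]["ts"]:
            latest[key] = row
    return latest


def online_key(feature, entity_id):
    """Hypothetical key scheme for the online store."""
    return f"feat:{feature}:{entity_id}"


def materialize(offline_rows, online_store):
    """Write the latest offline values to the online store; return rows written."""
    latest = latest_per_entity(offline_rows)
    for (entity_id, feature), row in latest.items():
        online_store[online_key(feature, entity_id)] = row["value"]
    return len(latest)


def parity_mismatches(offline_rows, online_store):
    """Return online keys whose value differs from the latest offline value."""
    mismatches = []
    for (entity_id, feature), row in latest_per_entity(offline_rows).items():
        key = online_key(feature, entity_id)
        if online_store.get(key) != row["value"]:
            mismatches.append(key)
    return sorted(mismatches)


# Hypothetical offline history: two versions for u1, one for u2.
offline_rows = [
    {"entity_id": "u1", "feature": "purchase_count_7d", "ts": 1, "value": 3},
    {"entity_id": "u1", "feature": "purchase_count_7d", "ts": 2, "value": 5},
    {"entity_id": "u2", "feature": "purchase_count_7d", "ts": 1, "value": 1},
]

online_store = {}  # stand-in for a Redis client in this sketch
written = materialize(offline_rows, online_store)
print(written, parity_mismatches(offline_rows, online_store))  # 2 []
```

Running the parity check on a schedule, not only right after materialization, would also catch drift introduced by failed or partial writes.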
Key Challenges
- Ensuring point-in-time correctness for historical feature lookups in training
- Sub-10ms p99 latency requirements for online serving with Redis clustering
- Automated materialization ensuring offline/online feature parity
- Feature freshness monitoring and automated backfill pipelines
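The freshness-monitoring challenge can be illustrated with a small sketch: compare each feature's last successful update against a per-feature maximum age, and flag the ones that need a backfill. The feature names, the age thresholds, and the `stale_features` helper are assumptions made for illustration. The source does not describe how the real monitoring is implemented.

```python
from datetime import datetime, timedelta, timezone


def stale_features(last_updated, max_age, now):
    """Return the sorted names of features older than their allowed age."""
    return sorted(
        name for name, ts in last_updated.items()
        if now - ts > max_age[name]
    )


# Fixed "now" so the example is deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical last-materialization times per feature.
last_updated = {
    "purchase_count_7d": now - timedelta(minutes=30),
    "avg_basket_value_30d": now - timedelta(hours=30),
}

# Hypothetical freshness limits per feature.
max_age = {
    "purchase_count_7d": timedelta(hours=1),
    "avg_basket_value_30d": timedelta(hours=24),
}

to_backfill = stale_features(last_updated, max_age, now)
print(to_backfill)  # ['avg_basket_value_30d']
```

In a real pipeline, the returned list would feed whatever job triggers the backfill. Keeping the check separate from the trigger makes it easy to alert without automatically rerunning expensive jobs.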