10M Events/Day Kafka → Spark Streaming Pipeline
Production Kafka → Spark Structured Streaming pipeline processing 10M+ events/day with exactly-once delivery to Delta Lake. Watermark-based late-event handling, idempotent MERGE upserts, and dead-letter queue with automatic replay. Reduced end-to-end latency from 8 minutes to under 5 seconds.
View on GitHubProblem
Batch processing caused 8-minute data delays, making real-time analytics impossible. No exactly-once guarantees led to duplicate records. Late-arriving events were dropped, causing data loss. System had no fault tolerance for malformed messages.
Solution
Built Kafka → Spark Structured Streaming pipeline with 30-minute watermark tolerance for late events. Implemented exactly-once semantics via idempotent MERGE operations into Delta Lake. Added dead-letter queue for poison messages with automated replay. Deployed on AWS EMR with auto-scaling.
Architecture
Kafka (RF=3, 12 partitions) → Spark Structured Streaming (watermarks + micro-batches) → Delta Lake (MERGE upserts) → Looker dashboards
Key Challenges
- ▸Handling late-arriving events with 30-minute watermark while maintaining exactly-once semantics
- ▸Idempotent MERGE operations to prevent duplicates during Spark job restarts
- ▸Managing state checkpoints on S3 for fault tolerance and recovery
- ▸Auto-scaling EMR clusters based on Kafka consumer lag without dropping events