Designing a 10M Events per Day Kafka to Spark Streaming Pipeline
Context
In one of my recent projects, I had to design a real-time data pipeline that could:
- Handle 10M+ events/day
- Deliver end-to-end latency under 5 seconds
- Guarantee exactly-once processing
- Maintain high reliability (99.9% uptime)
This post breaks down how I approached it.
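To put those targets in perspective, here is a quick back-of-the-envelope calculation in Python. The peak factor is a hypothetical value chosen for illustration, not a figure from the project; real traffic is rarely uniform, so the average alone understates what the pipeline has to absorb.

```python
# Back-of-the-envelope sizing for the stated targets.
EVENTS_PER_DAY = 10_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Average throughput if events arrived uniformly over the day (about 116/s).
avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

# ASSUMPTION: hypothetical peak-to-average ratio; measure your own traffic.
ASSUMED_PEAK_FACTOR = 5
peak_events_per_sec = avg_events_per_sec * ASSUMED_PEAK_FACTOR

# Downtime budget implied by a 99.9% uptime target.
UPTIME_TARGET = 0.999
downtime_min_per_30_days = (1 - UPTIME_TARGET) * 30 * 24 * 60  # about 43 minutes
downtime_hours_per_year = (1 - UPTIME_TARGET) * 365 * 24       # about 8.8 hours

print(f"average: {avg_events_per_sec:.1f} events/s")
print(f"assumed peak: {peak_events_per_sec:.1f} events/s")
print(f"downtime budget: {downtime_min_per_30_days:.1f} min per 30 days")
```

The takeaway is that the average rate is modest, so the harder constraints are the latency bound, the exactly-once guarantee, and the small downtime budget rather than raw throughput.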
Problem
The existing system was batch-based:
- Roughly 8 minutes of delay before data became available
- Duplicate records due to retries
- No handling of late-arriving events
- Manual recovery on failures
Result: unreliable analytics and delayed decision-making
Architecture
Producers -> Kafka -> Spark Structured Streaming -> Delta Lake -> BI
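A minimal PySpark sketch of the Kafka -> Spark Structured Streaming -> Delta Lake part of that flow might look like the following. It is an illustration under assumptions, not the project's actual code: the event schema (`event_id`, `event_time`, `payload`), the 10-minute watermark, the 2-second trigger, and the helper names `kafka_source_options` and `build_pipeline` are all hypothetical. It also assumes the Kafka connector and Delta Lake packages are available to the Spark session.

```python
def kafka_source_options(bootstrap_servers: str, topic: str) -> dict:
    """Options for Spark's Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def build_pipeline(spark, options: dict, checkpoint_path: str, table_path: str):
    """Start a streaming query: Kafka -> parse -> dedupe -> Delta table."""
    # Imported lazily so the helper above works without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import (
        StringType, StructField, StructType, TimestampType,
    )

    # ASSUMPTION: a hypothetical JSON event schema, for illustration only.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_time", TimestampType()),
        StructField("payload", StringType()),
    ])

    raw = spark.readStream.format("kafka").options(**options).load()

    events = (
        raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*")
        # Tolerate late events up to an assumed 10 minutes.
        .withWatermark("event_time", "10 minutes")
        # Drop retried duplicates; including the watermarked column lets
        # Spark expire old deduplication state.
        .dropDuplicates(["event_id", "event_time"])
    )

    return (
        events.writeStream.format("delta")
        .outputMode("append")
        # The checkpoint stores offsets and state so a restart can resume.
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="2 seconds")  # assumed micro-batch interval
        .start(table_path)
    )
```

Calling `build_pipeline(spark, kafka_source_options("localhost:9092", "events"), "/tmp/checkpoints/events", "/tmp/delta/events")` would start the query; the broker address, topic, and paths here are placeholders. The checkpoint location is what lets the query resume from its recorded offsets after a failure instead of requiring manual recovery.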