Sub-5s end-to-end latency (from 8min), 10M+ events/day throughput, 0 data loss with exactly-once delivery, 99.9% uptime

10M Events/Day Kafka → Spark Streaming Pipeline

Production Kafka → Spark Structured Streaming pipeline processing 10M+ events/day with exactly-once delivery to Delta Lake. Watermark-based late-event handling, idempotent MERGE upserts, and dead-letter queue with automatic replay. Reduced end-to-end latency from 8 minutes to under 5 seconds.

View on GitHub

Problem

Batch processing caused 8-minute data delays, making real-time analytics impossible. No exactly-once guarantees led to duplicate records. Late-arriving events were dropped, causing data loss. System had no fault tolerance for malformed messages.

Solution

Built Kafka → Spark Structured Streaming pipeline with 30-minute watermark tolerance for late events. Implemented exactly-once semantics via idempotent MERGE operations into Delta Lake. Added dead-letter queue for poison messages with automated replay. Deployed on AWS EMR with auto-scaling.

Architecture

Kafka (RF=3, 12 partitions) → Spark Structured Streaming (watermarks + micro-batches) → Delta Lake (MERGE upserts) → Looker dashboards

Key Challenges

▸Handling late-arriving events with 30-minute watermark while maintaining exactly-once semantics
▸Idempotent MERGE operations to prevent duplicates during Spark job restarts
▸Managing state checkpoints on S3 for fault tolerance and recovery
▸Auto-scaling EMR clusters based on Kafka consumer lag without dropping events

Tech Stack

Apache KafkaSpark Structured StreamingDelta LakePySparkAWS EMRPython