Data Engineering System Design for Real-Time Applications
Introduction
Real-time data systems power some of the most critical applications today: fraud detection, recommendation engines, monitoring dashboards, and trading platforms. Designing these systems requires careful consideration of throughput, latency, reliability, and cost.
After building real-time systems processing billions of events daily, I'll share architectural patterns and best practices for designing production-grade real-time data platforms.
Real-Time vs Near Real-Time vs Batch
First, let's clarify terminology:
| Type | Latency | Use Cases | Examples |
|------|---------|-----------|----------|
| Real-Time | <100ms | Trading, gaming, fraud detection | Order matching, fraud scoring |
| Near Real-Time | 1s-5s | Dashboards, alerting, recommendations | Metrics dashboards, anomaly detection |
| Micro-Batch | 5s-5min | Analytics, ETL | Streaming aggregations |
| Batch | Hours/days | Historical analytics, reporting | Daily reports, data warehousing |
Most "real-time" systems are actually near real-time (1-5s latency), which is sufficient for most use cases.
Core Design Principles
1. Immutability
Events are immutable facts. Never update or delete; always append.
Benefits:
- Simple debugging (replay events)
- Audit trail out of the box
- Time travel capabilities
- Easier to reason about state
2. Event Sourcing
Store state as a sequence of events rather than current state.
# Bad: store only the current state
user_balance = 1000

# Good: store the events that produced it
events = [
    {"type": "deposit", "amount": 500, "timestamp": "..."},
    {"type": "withdrawal", "amount": 200, "timestamp": "..."},
    {"type": "deposit", "amount": 700, "timestamp": "..."}
]
# Current balance = sum(deposits) - sum(withdrawals)
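The comment above can be made concrete: a minimal sketch that derives the current balance by folding over the event log (field names match the `events` list above).

```python
def current_balance(events):
    """Replay the immutable event log to derive the current state."""
    balance = 0
    for e in events:
        if e["type"] == "deposit":
            balance += e["amount"]
        elif e["type"] == "withdrawal":
            balance -= e["amount"]
    return balance

events = [
    {"type": "deposit", "amount": 500},
    {"type": "withdrawal", "amount": 200},
    {"type": "deposit", "amount": 700},
]
print(current_balance(events))  # 500 - 200 + 700 = 1000
```

Because events are never mutated, replaying a truncated prefix of the log reproduces the balance as of any point in time.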
3. Idempotency
Operations should be safe to retry without side effects.
# Use unique event IDs for deduplication
def process_payment(event_id, amount):
    if not already_processed(event_id):
        charge_customer(amount)
        mark_as_processed(event_id)
4. At-Least-Once vs Exactly-Once
- At-least-once: Guaranteed delivery, possible duplicates
- Exactly-once: Each event processed exactly once (harder to achieve)
Most systems use at-least-once with idempotent processing.
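Combining the two ideas, here is a minimal in-memory sketch of that pattern; the `processed_ids` set and `apply_side_effect` helper are illustrative stand-ins (in production the dedup store would be durable, e.g. Redis or a database table).

```python
results = []  # side effects land here, for illustration only

def apply_side_effect(event):
    """Hypothetical business logic (e.g. charging a customer)."""
    results.append(event["amount"])

processed_ids = set()  # in production: a durable store, not process memory

def handle(event):
    """At-least-once delivery + idempotent handling = effectively-once results."""
    if event["id"] in processed_ids:
        return False  # duplicate redelivery, safely ignored
    apply_side_effect(event)
    processed_ids.add(event["id"])
    return True

event = {"id": "evt-1", "amount": 42}
handle(event)  # first delivery: side effect applied
handle(event)  # redelivery: deduplicated, no second side effect
```

Note that the dedup check and the side effect should be atomic in a real system; otherwise a crash between the two steps still produces a duplicate on retry.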
Reference Architecture
┌───────────────┐
│ Data Sources  │
│  (Apps, IoT)  │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│   Ingestion   │
│(Kafka/Kinesis)│
└───────┬───────┘
        │
        ▼
┌───────────────┐
│  Processing   │
│ (Flink/Spark) │
└───────┬───────┘
        │
        ├───────────────┐
        ▼               ▼
┌───────────┐    ┌───────────┐
│  Storage  │    │ Real-Time │
│(Delta/S3) │    │  Serving  │
└───────────┘    │(Redis/DB) │
                 └───────────┘
Component Deep Dive
Ingestion Layer: Apache Kafka
Kafka is the de facto standard for event streaming.
Why Kafka?
- High throughput (millions of messages/sec)
- Durable (replicated across brokers)
- Scalable (horizontal partitioning)
- Decoupled producers and consumers
Architecture:
┌──────────────────────────────────────┐
│            Kafka Cluster             │
│  ┌────────┐  ┌────────┐  ┌────────┐  │
│  │Broker 1│  │Broker 2│  │Broker 3│  │
│  └────────┘  └────────┘  └────────┘  │
│                                      │
│  Topic: events (3 partitions, RF=3)  │
│    ┌─────┐    ┌─────┐    ┌─────┐     │
│    │ P0  │    │ P1  │    │ P2  │     │
│    └─────┘    └─────┘    └─────┘     │
└──────────────────────────────────────┘
Configuration for Real-Time:
# Producer config for low latency
producer_config = {
    'bootstrap.servers': 'kafka:9092',
    'linger.ms': 0,              # Send immediately
    'batch.size': 16384,         # Small batches
    'compression.type': 'lz4',   # Fast compression
    'acks': 1,                   # Leader acknowledgment only
    'max.in.flight.requests.per.connection': 5
}

# Consumer config for low latency
consumer_config = {
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'realtime-processor',
    'auto.offset.reset': 'latest',
    'enable.auto.commit': False, # Manual commits
    'fetch.min.bytes': 1,        # Return immediately
    'fetch.max.wait.ms': 100     # Wait at most 100 ms
}
Processing Layer: Apache Flink
For true real-time (<100ms), Apache Flink is superior to Spark Streaming.
Flink vs Spark Streaming:
| Feature | Flink | Spark Streaming |
|---------|-------|-----------------|
| Model | True streaming | Micro-batching |
| Latency | <100ms | 500ms-5s |
| State Management | Better | Good |
| Exactly-once | Native | Achievable |
Example Flink Pipeline:
from pyflink.common import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer
from pyflink.datastream.window import TumblingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)

# Read from Kafka
kafka_consumer = FlinkKafkaConsumer(
    topics='events',
    deserialization_schema=...,
    properties={
        'bootstrap.servers': 'kafka:9092',
        'group.id': 'flink-processor'
    }
)

# Process stream
stream = env.add_source(kafka_consumer) \
    .map(lambda event: process_event(event)) \
    .filter(lambda event: event.is_valid()) \
    .key_by(lambda event: event.user_id) \
    .window(TumblingEventTimeWindows.of(Time.seconds(5))) \
    .aggregate(AggregateFunction()) \
    .add_sink(sink)

env.execute("Real-time Pipeline")
Storage Layer: Choosing the Right Database
Different use cases need different databases:
Time-Series Data:
- InfluxDB, TimescaleDB: Metrics, logs, sensor data
- Optimized for time-range queries
- Automatic downsampling
Key-Value Store:
- Redis: Session data, caching, leaderboards
- Sub-millisecond reads
- In-memory with persistence
Document Store:
- MongoDB, DynamoDB: User profiles, product catalogs
- Flexible schema
- Horizontal scaling
Analytical Store:
- ClickHouse, Druid: OLAP queries, dashboards
- Columnar storage
- Fast aggregations
Example: Multi-Store Architecture:
# Write to multiple stores based on access pattern
def process_event(event):
    # Hot path: real-time dashboard
    write_to_redis(event, ttl=3600)  # 1 hour
    # Warm path: analytics
    write_to_clickhouse(event)
    # Cold path: data lake
    write_to_delta_lake(event)
Design Patterns
1. Lambda Architecture
Combine batch and stream processing:
         ┌──────────┐
         │ Ingestion│
         └─────┬────┘
               │
       ┌───────┴────────┐
       ▼                ▼
┌─────────────┐  ┌─────────────┐
│ Batch Layer │  │ Speed Layer │
│(Spark Batch)│  │   (Flink)   │
└──────┬──────┘  └──────┬──────┘
       │                │
       └───────┬────────┘
               ▼
        ┌─────────────┐
        │Serving Layer│
        └─────────────┘
Pros:
- Batch handles historical data
- Stream handles real-time
- Fault-tolerant
Cons:
- Maintain two codebases
- Complexity
2. Kappa Architecture
Stream-only processing (simplified Lambda):
┌──────────┐
│Ingestion │
└────┬─────┘
     │
     ▼
┌─────────────┐
│Stream Layer │
│(Flink/Spark)│
└────┬────────┘
     │
     ▼
┌─────────────┐
│Serving Layer│
└─────────────┘
Pros:
- Single codebase
- Simpler to maintain
Cons:
- Reprocessing entire history can be slow
3. Event-Driven Microservices
Event: OrderPlaced
        ↓
┌────────────┬─────────────┬──────────────┐
│ Inventory  │  Shipping   │ Notification │
│  Service   │   Service   │   Service    │
└────────────┴─────────────┴──────────────┘
Each service:
- Consumes relevant events
- Maintains its own state
- Publishes new events
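The three bullets above can be sketched with a toy in-memory event bus; in production the bus would be a Kafka topic, and the service and event names here are illustrative.

```python
subscribers = {}  # event type -> list of handler functions

def subscribe(event_type, handler):
    subscribers.setdefault(event_type, []).append(handler)

def publish(event):
    for handler in subscribers.get(event["type"], []):
        handler(event)

reserved = []   # inventory service's own state
shipments = []  # shipping service's own state

def inventory_service(event):
    # Consumes OrderPlaced, updates its own state, publishes a new event
    reserved.append(event["order_id"])
    publish({"type": "InventoryReserved", "order_id": event["order_id"]})

subscribe("OrderPlaced", inventory_service)
subscribe("InventoryReserved", lambda e: shipments.append(e["order_id"]))

publish({"type": "OrderPlaced", "order_id": "o-1"})
```

The key property: the order service never calls inventory or shipping directly, so new consumers can be added without touching the producer.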
Data Quality in Real-Time
Schema Validation
Use Avro/Protobuf for schema enforcement:
from confluent_kafka.avro import AvroProducer

avro_producer = AvroProducer({
    'bootstrap.servers': 'kafka:9092',
    'schema.registry.url': 'http://schema-registry:8081'
}, default_value_schema=event_schema)

# Events that don't match the schema are rejected automatically
avro_producer.produce(topic='events', value=event_dict)
Data Quality Checks
def validate_event(event):
    checks = [
        event.timestamp is not None,
        event.user_id > 0,
        event.amount >= 0,
        event.event_type in VALID_TYPES
    ]
    if not all(checks):
        send_to_dlq(event)  # Dead-letter queue
        return False
    return True
Monitoring & Alerting
# Track key metrics (pseudocode; e.g. with prometheus_client)
metrics = {
    'events_processed': Counter(),
    'processing_latency': Histogram(),
    'errors': Counter(),
    'lag': Gauge()
}

# Alert on anomalies
if current_lag > SLA_THRESHOLD:  # lag value sampled from the consumer group
    send_alert("Consumer lag exceeds SLA")
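Consumer lag deserves special attention: it is the distance between the latest offset in each partition and the offset the consumer group has committed. A minimal sketch with illustrative offset numbers (in practice you would fetch them from the broker, e.g. via the consumer's watermark and committed-offset APIs):

```python
def total_lag(end_offsets, committed_offsets):
    """Sum of (latest offset - committed offset) across partitions."""
    return sum(end_offsets[p] - committed_offsets.get(p, 0)
               for p in end_offsets)

end = {0: 1500, 1: 1480, 2: 1510}        # latest offset per partition
committed = {0: 1450, 1: 1480, 2: 1400}  # consumer group's committed offsets
print(total_lag(end, committed))  # 50 + 0 + 110 = 160
```

A steadily growing total is the clearest signal that processing capacity no longer matches the incoming rate.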
Handling Late Data
Events can arrive out of order:
# Flink watermarking: tolerate events up to 5 minutes out of order
stream.assign_timestamps_and_watermarks(
    WatermarkStrategy
        .for_bounded_out_of_orderness(Duration.of_minutes(5))
        .with_timestamp_assigner(lambda event: event.timestamp)
)

# Handle late data: declare the side output before aggregating
stream.window(TumblingEventTimeWindows.of(Time.minutes(5))) \
    .allowed_lateness(Time.minutes(1)) \
    .side_output_late_data(late_data_tag) \
    .aggregate(aggregator)
Scaling Strategies
Horizontal Scaling
Scale by adding more partitions and consumers:
# Kafka topic with 12 partitions
kafka-topics --create --topic events \
  --bootstrap-server kafka:9092 \
  --partitions 12 \
  --replication-factor 3

# Consumer group with 12 consumers (one per partition);
# scale up/down by adjusting the consumer count.
# Beyond 12 consumers, the extras sit idle.
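The partition count also fixes the unit of ordering: records with the same key always land in the same partition. A rough sketch of key-based routing (CRC32 stands in here for the murmur2 hash the Java client actually uses):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic key -> partition routing, as a Kafka producer does."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# All of one user's events hit the same partition, preserving their order
p = partition_for("user-42", 12)
assert partition_for("user-42", 12) == p  # stable routing
assert 0 <= p < 12
```

One consequence worth planning for: increasing the partition count remaps keys, so per-key ordering guarantees only hold within a fixed partition count.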
Backpressure Handling
# Flink backpressure tuning
env.get_config().set_auto_watermark_interval(1000)
env.set_buffer_timeout(100)

# Rate limiting
from ratelimit import limits, sleep_and_retry

@sleep_and_retry
@limits(calls=1000, period=1)  # 1000 calls/sec
def process_event(event):
    # Processing logic
    pass
Cost Optimization
Right-Sizing Infrastructure
from math import ceil

# Calculate required throughput
events_per_second = 100_000
avg_event_size_kb = 1
required_throughput_mb_s = (events_per_second * avg_event_size_kb) / 1024  # ~98 MB/s

# Size the Kafka cluster
# Rule of thumb: one broker handles ~100 MB/s
required_brokers = ceil(required_throughput_mb_s / 100)  # before replication overhead
Auto-Scaling
# Auto-scale Flink jobs via the adaptive scheduler
flink_config = {
    'taskmanager.numberOfTaskSlots': 4,
    'jobmanager.scheduler': 'adaptive'  # parallelism adapts to available slots
}
Tiered Storage
# Kafka tiered storage: recent data on fast local disks, older segments in S3
kafka_config = {
    'log.retention.hours': 168,               # 7 days total retention
    'log.local.retention.ms': 86_400_000,     # 1 day on local disk
    'remote.log.storage.system.enable': True  # offload the rest to remote storage
}
Testing Real-Time Systems
1. Unit Tests
def test_event_processing():
    event = create_test_event()
    result = process_event(event)
    assert result.user_id == event.user_id
    assert result.amount == expected_amount
2. Integration Tests
# Test against a local Kafka broker (e.g. via testcontainers or docker-compose)
from kafka import KafkaProducer, KafkaConsumer

def test_pipeline_integration():
    producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
    consumer = KafkaConsumer('output-topic',
                             bootstrap_servers=['localhost:9092'],
                             auto_offset_reset='earliest')
    producer.send('input-topic', test_event)
    producer.flush()
    result = next(consumer)
    assert result.value == expected_output
3. Load Testing
# Simulate production load
from locust import User, task, between
class EventProducer(User):
wait_time = between(0.001, 0.01) # 100-1000 events/sec
@task
def send_event(self):
producer.send('events', generate_event())
Disaster Recovery
Checkpointing
# Flink checkpointing for exactly-once state consistency
env.enable_checkpointing(60000)  # every 60 s
env.get_checkpoint_config().set_checkpoint_storage_dir('s3://bucket/checkpoints')
Multi-Region Replication
   Region A                    Region B
┌─────────────┐            ┌─────────────┐
│Kafka Cluster│─replicate─→│Kafka Cluster│
│  (Active)   │            │  (Standby)  │
└─────────────┘            └─────────────┘
Conclusion
Designing real-time data systems requires:
- Clear SLAs: Define acceptable latency
- Right Technology: Choose based on requirements
- Scalability: Plan for 10x growth
- Monitoring: Observability is critical
- Cost Awareness: Balance performance and cost
Start simple, measure everything, and iterate based on real metrics.
Building real-time systems? Let's connect on LinkedIn to discuss architecture and best practices.