Data Engineering System Design for Real-Time Applications
Introduction
Real-time data systems power some of the most critical applications today: fraud detection, recommendation engines, monitoring dashboards, and trading platforms. Designing these systems requires careful consideration of throughput, latency, reliability, and cost.
After building real-time systems processing billions of events daily, I'll share architectural patterns and best practices for designing production-grade real-time data platforms.
Real-Time vs Near Real-Time vs Batch
First, let's clarify terminology:
| Type | Latency | Use Cases | Examples |
|------|---------|-----------|----------|
| Real-Time | <100ms | Trading, gaming, fraud detection | Order matching, fraud scoring |
| Near Real-Time | 1s-5s | Dashboards, alerting, recommendations | Metrics dashboards, anomaly detection |
| Micro-Batch | 5s-5min | Analytics, ETL | Streaming aggregations |
| Batch | Hours/days | Historical analytics, reporting | Daily reports, data warehousing |
Most "real-time" systems are actually near real-time (1-5s latency), which is sufficient for most use cases.
Core Design Principles
1. Immutability
Events are immutable facts. Never update or delete; always append.
Benefits:
- Simple debugging (replay events)
- Audit trail out of the box
- Time travel capabilities
- Easier to reason about state
2. Event Sourcing
Store state as a sequence of events rather than current state.
# Bad: store only the current state
user_balance = 1000

# Good: store the events that produced it
events = [
    {"type": "deposit", "amount": 500, "timestamp": "..."},
    {"type": "withdrawal", "amount": 200, "timestamp": "..."},
    {"type": "deposit", "amount": 700, "timestamp": "..."}
]
# Current balance = sum(deposits) - sum(withdrawals)
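The comment above can be made concrete: a minimal sketch that derives the current balance by folding over the event log (field names match the `events` list above).

```python
def current_balance(events):
    """Replay the immutable event log to derive the current state."""
    balance = 0
    for e in events:
        if e["type"] == "deposit":
            balance += e["amount"]
        elif e["type"] == "withdrawal":
            balance -= e["amount"]
    return balance

events = [
    {"type": "deposit", "amount": 500},
    {"type": "withdrawal", "amount": 200},
    {"type": "deposit", "amount": 700},
]
print(current_balance(events))  # 500 - 200 + 700 = 1000
```

Because events are never mutated, replaying a truncated prefix of the log reproduces the balance as of any point in time.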
3. Idempotency
Operations should be safe to retry without side effects.
# Use unique event IDs for deduplication
def process_payment(event_id, amount):
    if not already_processed(event_id):
        charge_customer(amount)
        mark_as_processed(event_id)
4. At-Least-Once vs Exactly-Once
- At-least-once: Guaranteed delivery, possible duplicates
- Exactly-once: Each event processed exactly once (harder to achieve)
Most systems use at-least-once with idempotent processing.
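Combining the two ideas, here is a minimal in-memory sketch of that pattern; the `processed_ids` set and `apply_side_effect` helper are illustrative stand-ins (in production the dedup store would be durable, e.g. Redis or a database table).

```python
results = []  # side effects land here, for illustration only

def apply_side_effect(event):
    """Hypothetical business logic (e.g. charging a customer)."""
    results.append(event["amount"])

processed_ids = set()  # in production: a durable store, not process memory

def handle(event):
    """At-least-once delivery + idempotent handling = effectively-once results."""
    if event["id"] in processed_ids:
        return False  # duplicate redelivery, safely ignored
    apply_side_effect(event)
    processed_ids.add(event["id"])
    return True

event = {"id": "evt-1", "amount": 42}
handle(event)  # first delivery: side effect applied
handle(event)  # redelivery: deduplicated, no second side effect
```

Note that the dedup check and the side effect should be atomic in a real system; otherwise a crash between the two steps still produces a duplicate on retry.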
Reference Architecture
┌───────────────┐
│ Data Sources  │
│  (Apps, IoT)  │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│   Ingestion   │
│(Kafka/Kinesis)│
└───────┬───────┘
        │
        ▼
┌───────────────┐
│  Processing   │
│ (Flink/Spark) │
└───────┬───────┘
        │
        ├───────────────┐
        ▼               ▼
┌───────────┐    ┌───────────┐
│  Storage  │    │ Real-Time │
│(Delta/S3) │    │  Serving  │
└───────────┘    │(Redis/DB) │
                 └───────────┘
Component Deep Dive
Ingestion Layer: Apache Kafka
Kafka is the de facto standard for event streaming.
Why Kafka?
- High throughput (millions of messages/sec)
- Durable (replicated across brokers)
- Scalable (horizontal partitioning)
- Decoupled producers and consumers
Architecture:
┌──────────────────────────────────────┐
│            Kafka Cluster             │
│  ┌────────┐  ┌────────┐  ┌────────┐  │
│  │Broker 1│  │Broker 2│  │Broker 3│  │
│  └────────┘  └────────┘  └────────┘  │
│                                      │
│  Topic: events (3 partitions, RF=3)  │
│    ┌─────┐    ┌─────┐    ┌─────┐     │
│    │ P0  │    │ P1  │    │ P2  │     │
│    └─────┘    └─────┘    └─────┘     │
└──────────────────────────────────────┘
Configuration for Real-Time:
# Producer config for low latency
producer_config = {
    'bootstrap.servers': 'kafka:9092',
    'linger.ms': 0,              # Send immediately
    'batch.size': 16384,         # Small batches
    'compression.type': 'lz4',   # Fast compression
    'acks': 1,                   # Leader acknowledgment only
    'max.in.flight.requests.per.connection': 5
}

# Consumer config for low latency
consumer_config = {
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'realtime-processor',
    'auto.offset.reset': 'latest',
    'enable.auto.commit': False, # Manual commits
    'fetch.min.bytes': 1,        # Return immediately
    'fetch.max.wait.ms': 100     # Wait at most 100 ms
}
Processing Layer: Apache Flink
For true real-time (<100ms), Apache Flink is superior to Spark Streaming.
Flink vs Spark Streaming:
| Feature | Flink | Spark Streaming |
|---------|-------|-----------------|
| Model | True streaming | Micro-batching |
| Latency | <100ms | 500ms-5s |
| State Management | Better | Good |
| Exactly-once | Native | Achievable |
Example Flink Pipeline:
from pyflink.common import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer
from pyflink.datastream.window import TumblingEventTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)

# Read from Kafka
kafka_consumer = FlinkKafkaConsumer(
    topics='events',
    deserialization_schema=...,
    properties={
        'bootstrap.servers': 'kafka:9092',
        'group.id': 'flink-processor'
    }
)

# Process stream
stream = env.add_source(kafka_consumer) \
    .map(lambda event: process_event(event)) \
    .filter(lambda event: event.is_valid()) \
    .key_by(lambda event: event.user_id) \
    .window(TumblingEventTimeWindows.of(Time.seconds(5))) \
    .aggregate(AggregateFunction()) \
    .add_sink(sink)

env.execute("Real-time Pipeline")
Storage Layer: Choosing the Right Database
Different use cases need different databases:
Time-Series Data:
- InfluxDB, TimescaleDB: Metrics, logs, sensor data
- Optimized for time-range queries
- Automatic downsampling
Key-Value Store:
- Redis: Session data, caching, leaderboards
- Sub-millisecond reads
- In-memory with persistence
Document Store:
- MongoDB, DynamoDB: User profiles, product catalogs
- Flexible schema
- Horizontal scaling
Analytical Store:
- ClickHouse, Druid: OLAP queries, dashboards
- Columnar storage
- Fast aggregations
Example: Multi-Store Architecture:
# Write to multiple stores based on access pattern
def process_event(event):
    # Hot path: real-time dashboard
    write_to_redis(event, ttl=3600)  # 1 hour
    # Warm path: analytics
    write_to_clickhouse(event)
    # Cold path: data lake
    write_to_delta_lake(event)
Design Patterns
1. Lambda Architecture
Combine batch and stream processing:
         ┌──────────┐
         │ Ingestion│
         └─────┬────┘
               │
       ┌───────┴────────┐
       ▼                ▼
┌─────────────┐  ┌─────────────┐
│ Batch Layer │  │ Speed Layer │
│(Spark Batch)│  │   (Flink)   │
└──────┬──────┘  └──────┬──────┘
       │                │
       └───────┬────────┘
               ▼
        ┌─────────────┐
        │Serving Layer│
        └─────────────┘
Pros:
- Batch handles historical data
- Stream handles real-time
- Fault-tolerant
Cons:
- Maintain two codebases
- Complexity
2. Kappa Architecture
Stream-only processing (simplified Lambda):
┌──────────┐
│Ingestion │
└────┬─────┘
     │
     ▼
┌─────────────┐
│Stream Layer │
│(Flink/Spark)│
└────┬────────┘
     │
     ▼
┌─────────────┐
│Serving Layer│
└─────────────┘
Pros:
- Single codebase
- Simpler to maintain
Cons:
- Reprocessing entire history can be slow
3. Event-Driven Microservices
Event: OrderPlaced
        ↓
┌────────────┬─────────────┬──────────────┐
│ Inventory  │  Shipping   │ Notification │
│  Service   │   Service   │   Service    │
└────────────┴─────────────┴──────────────┘
Each service:
- Consumes relevant events
- Maintains its own state
- Publishes new events
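The three bullets above can be sketched with a toy in-memory event bus; in production the bus would be a Kafka topic, and the service and event names here are illustrative.

```python
subscribers = {}  # event type -> list of handler functions

def subscribe(event_type, handler):
    subscribers.setdefault(event_type, []).append(handler)

def publish(event):
    for handler in subscribers.get(event["type"], []):
        handler(event)

reserved = []   # inventory service's own state
shipments = []  # shipping service's own state

def inventory_service(event):
    # Consumes OrderPlaced, updates its own state, publishes a new event
    reserved.append(event["order_id"])
    publish({"type": "InventoryReserved", "order_id": event["order_id"]})

subscribe("OrderPlaced", inventory_service)
subscribe("InventoryReserved", lambda e: shipments.append(e["order_id"]))

publish({"type": "OrderPlaced", "order_id": "o-1"})
```

The key property: the order service never calls inventory or shipping directly, so new consumers can be added without touching the producer.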
Data Quality in Real-Time
Schema Validation
Use Avro/Protobuf for schema enforcement:
from confluent_kafka.avro import AvroProducer

avro_producer = AvroProducer({
    'bootstrap.servers': 'kafka:9092',
    'schema.registry.url': 'http://schema-registry:8081'
}, default_value_schema=event_schema)

# Events that don't match the schema are rejected automatically
avro_producer.produce(topic='events', value=event_dict)
Data Quality Checks
def validate_event(event):
    checks = [
        event.timestamp is not None,
        event.user_id > 0,
        event.amount >= 0,
        event.event_type in VALID_TYPES
    ]
    if not all(checks):
        send_to_dlq(event)  # Dead-letter queue
        return False
    return True
Monitoring & Alerting
# Track key metrics (pseudocode; e.g. with prometheus_client)
metrics = {
    'events_processed': Counter(),
    'processing_latency': Histogram(),
    'errors': Counter(),
    'lag': Gauge()
}

# Alert on anomalies
if current_lag > SLA_THRESHOLD:  # lag value sampled from the consumer group
    send_alert("Consumer lag exceeds SLA")
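Consumer lag deserves special attention: it is the distance between the latest offset in each partition and the offset the consumer group has committed. A minimal sketch with illustrative offset numbers (in practice you would fetch them from the broker, e.g. via the consumer's watermark and committed-offset APIs):

```python
def total_lag(end_offsets, committed_offsets):
    """Sum of (latest offset - committed offset) across partitions."""
    return sum(end_offsets[p] - committed_offsets.get(p, 0)
               for p in end_offsets)

end = {0: 1500, 1: 1480, 2: 1510}        # latest offset per partition
committed = {0: 1450, 1: 1480, 2: 1400}  # consumer group's committed offsets
print(total_lag(end, committed))  # 50 + 0 + 110 = 160
```

A steadily growing total is the clearest signal that processing capacity no longer matches the incoming rate.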
Handling Late Data
Events can arrive out of order:
# Flink watermarking: tolerate events up to 5 minutes out of order
stream.assign_timestamps_and_watermarks(
    WatermarkStrategy
        .for_bounded_out_of_orderness(Duration.of_minutes(5))
        .with_timestamp_assigner(lambda event: event.timestamp)
)

# Handle late data: declare the side output before aggregating
stream.window(TumblingEventTimeWindows.of(Time.minutes(5))) \
    .allowed_lateness(Time.minutes(1)) \
    .side_output_late_data(late_data_tag) \
    .aggregate(aggregator)
Scaling Strategies
Horizontal Scaling
Scale by adding more partitions and consumers:
# Kafka topic with 12 partitions
kafka-topics --create --topic events \
  --bootstrap-server kafka:9092 \
  --partitions 12 \
  --replication-factor 3

# Consumer group with 12 consumers (one per partition);
# scale up/down by adjusting the consumer count.
# Beyond 12 consumers, the extras sit idle.
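The partition count also fixes the unit of ordering: records with the same key always land in the same partition. A rough sketch of key-based routing (CRC32 stands in here for the murmur2 hash the Java client actually uses):

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic key -> partition routing, as a Kafka producer does."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# All of one user's events hit the same partition, preserving their order
p = partition_for("user-42", 12)
assert partition_for("user-42", 12) == p  # stable routing
assert 0 <= p < 12
```

One consequence worth planning for: increasing the partition count remaps keys, so per-key ordering guarantees only hold within a fixed partition count.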
Backpressure Handling
# Flink backpressure tuning
env.get_config().set_auto_watermark_interval(1000)
env.set_buffer_timeout(100)

# Rate limiting
from ratelimit import limits, sleep_and_retry

@sleep_and_retry
@limits(calls=1000, period=1)  # 1000 calls/sec
def process_event(event):
    # Processing logic
    pass
Cost Optimization
Right-Sizing Infrastructure
from math import ceil

# Calculate required throughput
events_per_second = 100_000
avg_event_size_kb = 1
required_throughput_mb_s = (events_per_second * avg_event_size_kb) / 1024  # ~98 MB/s

# Size the Kafka cluster
# Rule of thumb: one broker handles ~100 MB/s
required_brokers = ceil(required_throughput_mb_s / 100)  # before replication overhead
Auto-Scaling
# Auto-scale Flink jobs via the adaptive scheduler
flink_config = {
    'taskmanager.numberOfTaskSlots': 4,
    'jobmanager.scheduler': 'adaptive'  # parallelism adapts to available slots
}
Tiered Storage
# Kafka tiered storage: recent data on fast local disks, older segments in S3
kafka_config = {
    'log.retention.hours': 168,               # 7 days total retention
    'log.local.retention.ms': 86_400_000,     # 1 day on local disk
    'remote.log.storage.system.enable': True  # offload the rest to remote storage
}
Testing Real-Time Systems
1. Unit Tests
def test_event_processing():
    event = create_test_event()
    result = process_event(event)
    assert result.user_id == event.user_id
    assert result.amount == expected_amount
2. Integration Tests
# Test against a local Kafka broker (e.g. via testcontainers or docker-compose)
from kafka import KafkaProducer, KafkaConsumer

def test_pipeline_integration():
    producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
    consumer = KafkaConsumer('output-topic',
                             bootstrap_servers=['localhost:9092'],
                             auto_offset_reset='earliest')
    producer.send('input-topic', test_event)
    producer.flush()
    result = next(consumer)
    assert result.value == expected_output
3. Load Testing
# Simulate production load
from locust import User, task, between
class EventProducer(User):
wait_time = between(0.001, 0.01) # 100-1000 events/sec
@task
def send_event(self):
producer.send('events', generate_event())
Disaster Recovery
Checkpointing
# Flink checkpointing for exactly-once state consistency
env.enable_checkpointing(60000)  # every 60 s
env.get_checkpoint_config().set_checkpoint_storage_dir('s3://bucket/checkpoints')
Multi-Region Replication
   Region A                    Region B
┌─────────────┐            ┌─────────────┐
│Kafka Cluster│─replicate─→│Kafka Cluster│
│  (Active)   │            │  (Standby)  │
└─────────────┘            └─────────────┘
Conclusion
Designing real-time data systems requires:
- Clear SLAs: Define acceptable latency
- Right Technology: Choose based on requirements
- Scalability: Plan for 10x growth
- Monitoring: Observability is critical
- Cost Awareness: Balance performance and cost
Start simple, measure everything, and iterate based on real metrics.
Building real-time systems? Let's connect on LinkedIn to discuss architecture and best practices.