Kafka vs Kinesis: Architectural Trade-offs
Every data engineering team hits this decision eventually. Both Kafka and Kinesis move event streams reliably at scale — but they make fundamentally different trade-offs around operational complexity, throughput ceilings, consumer flexibility, and cost. After running Kafka clusters processing 2 million events/second and Kinesis pipelines ingesting 400GB/hour on separate products, here is what actually matters when choosing between them.
The Decision Frame
There is no universally correct answer. The right choice depends on three questions:
- Who operates it? Kafka is infrastructure you own. Kinesis is infrastructure AWS owns.
- What are your throughput and latency requirements? The gap between them is real and measurable.
- How many consumers do you have, and how independently do they need to move?
Everything else — pricing, ordering guarantees, replay windows — flows from those three constraints.
Architecture Overview
Apache Kafka
Kafka is a distributed commit log. Producers write to topics partitioned across a broker cluster. Consumers read from partitions independently, and each consumer group tracks its own offsets. Cluster metadata is managed by ZooKeeper in older deployments or by KRaft, which became production-ready in Kafka 3.3 and is the only option from Kafka 4.0 onward.
Producers → Topics (N partitions) → Broker Cluster (M brokers)
                      ↓
          Consumer Groups (independent offsets)
          ├── Group A: real-time analytics
          ├── Group B: data lake ingestion
          └── Group C: ML feature pipeline
# Kafka producer with explicit partition control
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'kafka-broker-1:9092,kafka-broker-2:9092',
    'acks': 'all',               # Wait for all in-sync replicas (ISR)
    'retries': 5,
    'linger.ms': 10,             # Batch for up to 10 ms before sending
    'batch.size': 65536,         # 64 KB batch size
    'compression.type': 'lz4',
})

def delivery_callback(err, msg):
    # Invoked from poll()/flush() once the broker acks or the send fails
    if err:
        print(f"Delivery failed: {err}")

# user_id (str) and event_payload (bytes) come from your application
producer.produce(
    topic='user-events',
    key=user_id.encode(),        # Key determines partition
    value=event_payload,
    callback=delivery_callback,
)
producer.flush()                 # Block until outstanding messages are delivered

AWS Kinesis Data Streams
Kinesis is a managed shard-based stream. Producers write records to a stream divided into shards. Each shard handles 1MB/s write and 2MB/s read. AWS manages the infrastructure entirely — no brokers to provision, no ZooKeeper to babysit.
Producers → Stream (N shards) → Kinesis Service (AWS-managed)
                      ↓
          Consumers (shared 2MB/s read per shard)
          ├── Lambda trigger
          ├── KCL application
          └── Kinesis Firehose → S3
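To make the shard routing concrete: Kinesis takes the MD5 hash of the partition key, treats it as a 128-bit integer, and routes the record to the shard whose hash key range contains that value. The sketch below mimics that lookup for a stream whose shards split the hash space evenly. `shard_for_key` and the even split are illustrative assumptions, not an AWS API; a stream that has been split or merged by hand will have uneven ranges.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the index of the shard whose hash range would contain this key.

    Assumes the hash space is divided evenly across shards, which is what a
    freshly created stream looks like before any manual splits or merges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * num_shards // HASH_SPACE


# The same key always lands on the same shard index
for key in ("user_123", "user_456", "user_789"):
    print(key, "->", shard_for_key(key, num_shards=4))
```

A single very hot key still maps to one shard, so adding shards does not help a skewed key distribution.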
# Kinesis producer: the partition key is hashed to pick a shard
import boto3
import json

kinesis = boto3.client('kinesis', region_name='us-east-1')

# user_id (str), event_payload (dict) and event_batch (list of dicts) come from your application
response = kinesis.put_record(
    StreamName='user-events',
    Data=json.dumps(event_payload).encode(),
    PartitionKey=user_id,        # Hashed to determine shard
)
print(f"Shard: {response['ShardId']}, Seq: {response['SequenceNumber']}")

# Batch writes: up to 500 records and 5MB per call
records = [
    {'Data': json.dumps(e).encode(), 'PartitionKey': e['user_id']}
    for e in event_batch
]
result = kinesis.put_records(StreamName='user-events', Records=records)
# put_records can partially fail; retry the entries that were rejected
if result['FailedRecordCount'] > 0:
    print(f"{result['FailedRecordCount']} records need a retry")

Throughput: The Real Numbers
Kafka Throughput
Kafka throughput scales horizontally with partition count and broker count. There is no hard ceiling imposed by the service — only your hardware and network.
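One common rule of thumb for picking a partition count, sketched here under assumed numbers: measure what a single partition sustains on the producer side and on the consumer side, then provision enough partitions to hit the target on whichever side is slower. The function name and the per-partition figures are hypothetical placeholders; benchmark your own cluster for real values.

```python
import math


def partitions_needed(target_mb_s: float,
                      producer_mb_s_per_partition: float,
                      consumer_mb_s_per_partition: float) -> int:
    """Partitions required so neither producers nor consumers are the bottleneck."""
    by_producer = math.ceil(target_mb_s / producer_mb_s_per_partition)
    by_consumer = math.ceil(target_mb_s / consumer_mb_s_per_partition)
    return max(by_producer, by_consumer)


# Hypothetical measurements: 20 MB/s per partition producing, 10 MB/s consuming
print(partitions_needed(1000, 20, 10))  # 100, because the consumer side is the limit
```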
# Kafka throughput estimation
def kafka_throughput(num_brokers, disk_throughput_mb_per_broker,
                     replication_factor, num_partitions):
    # Write throughput limited by replication: every byte is written RF times
    write_throughput = (num_brokers * disk_throughput_mb_per_broker) / replication_factor
    # Read throughput not limited by replication
    read_throughput = num_brokers * disk_throughput_mb_per_broker
    # One consumer per partition is the most a single group can run in parallel
    parallelism = num_partitions
    print(f"Write throughput: ~{write_throughput:.0f} MB/s")
    print(f"Read throughput: ~{read_throughput:.0f} MB/s")
    print(f"Max parallelism: {parallelism} concurrent consumers")

# 10-broker cluster, 500MB/s disk each, RF=3, 120 partitions
kafka_throughput(10, 500, 3, 120)
# Write throughput: ~1667 MB/s
# Read throughput: ~5000 MB/s
# Max parallelism: 120 concurrent consumers

Kinesis Throughput
Kinesis throughput is strictly shard-bounded. Each shard accepts up to 1MB/s or 1,000 records/s of writes, and serves up to 2MB/s of reads across at most 5 GetRecords calls per second. These are hard limits enforced by the service.
# Kinesis throughput estimation
def kinesis_throughput(num_shards, consumers_sharing_read):
    write_mb_per_sec = num_shards * 1  # 1MB/s per shard
    write_records_per_sec = num_shards * 1000
    read_mb_per_sec = (num_shards * 2) / consumers_sharing_read  # Shared 2MB/s
    read_calls_per_sec = (num_shards * 5) / consumers_sharing_read
    print(f"Write: {write_mb_per_sec} MB/s, {write_records_per_sec:,} rec/s")
    print(f"Read/consumer: {read_mb_per_sec:.1f} MB/s, {read_calls_per_sec:.1f} calls/s")
    print("Note: Enhanced fan-out gives each consumer dedicated 2MB/s per shard")

# 100 shards, 3 consumers sharing standard reads
kinesis_throughput(100, 3)
# Write: 100 MB/s, 100,000 rec/s
# Read/consumer: 66.7 MB/s, 166.7 calls/s
# Note: Enhanced fan-out gives each consumer dedicated 2MB/s per shard

Enhanced fan-out solves the shared read limit: each registered consumer gets a dedicated 2MB/s per shard over a push connection. It costs more but eliminates read throttling with multiple consumers.
# Register enhanced fan-out consumer
kinesis.register_stream_consumer(
    StreamARN='arn:aws:kinesis:us-east-1:123456789:stream/user-events',
    ConsumerName='real-time-analytics'
)
# Each registered consumer gets 2MB/s per shard, independent of others

Impact: At 2 million events/second, Kafka handled it with a 20-broker cluster and 240 partitions. Kinesis would require at least 2,000 shards (2,000,000 records/s divided by the 1,000 records/s per-shard write limit), which we estimated at roughly $52,000/month in shard cost before data transfer.
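The shard count in that estimate follows from the per-shard write limits: a stream needs enough shards to satisfy both the record-rate limit and the byte-rate limit, whichever is larger. A minimal sketch of that arithmetic, where `shards_needed` and the 500-byte average record size are assumptions for illustration:

```python
import math

SHARD_WRITE_BYTES_PER_SEC = 1024 * 1024  # 1 MB/s per shard
SHARD_WRITE_RECORDS_PER_SEC = 1000       # 1,000 records/s per shard


def shards_needed(events_per_sec: int, avg_record_bytes: int) -> int:
    """Minimum shard count that satisfies both per-shard write limits."""
    by_records = math.ceil(events_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    by_bytes = math.ceil(events_per_sec * avg_record_bytes / SHARD_WRITE_BYTES_PER_SEC)
    return max(by_records, by_bytes)


# 2 million small events per second: the record-rate limit dominates
print(shards_needed(2_000_000, avg_record_bytes=500))  # 2000
```

With larger records the byte limit takes over, so the same event rate can need considerably more shards.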
Ordering Guarantees
Kafka: Partition-Level Ordering
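Kafka guarantees strict ordering within a partition. With the default partitioner, messages with the same key go to the same partition for as long as the partition count stays fixed, so per-key ordering holds across the topic. Two caveats: adding partitions changes the key-to-partition mapping, and producer retries can reorder messages unless idempotence is enabled or in-flight requests are limited to one.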
# Kafka: all events for user_123 land on the same partition in order
producer.produce(
    topic='user-events',
    key='user_123',   # Same key → same partition → ordered
    value=event_a
)
producer.produce(
    topic='user-events',
    key='user_123',
    value=event_b     # Always after event_a for this key
)

# Global ordering across partitions: not guaranteed
# If you need global order, use a single partition, at the cost of parallelism
producer.produce(
    topic='ordered-events',
    partition=0,      # Force all writes to one partition
    value=event
)
# Throughput ceiling: ~100MB/s; parallelism: 1 consumer

Kinesis: Shard-Level Ordering
Kinesis guarantees ordering within a shard. The same partition key always maps to the same shard, giving per-key ordering — identical to Kafka's model.
# Kinesis: same partition key → same shard → ordered
kinesis.put_record(
    StreamName='user-events',
    Data=event_a,
    PartitionKey='user_123'   # Hashed to shard; ordered within shard
)
kinesis.put_record(
    StreamName='user-events',
    Data=event_b,
    PartitionKey='user_123'   # After event_a on the same shard
)

The ordering model is equivalent. Both systems give you per-key ordering within a partition/shard. Global ordering requires a single partition/shard in both systems, and destroys parallelism in both.
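The shared ordering model can be illustrated without either system. The sketch below routes keyed events into per-partition logs and checks that each key's events stay in sequence inside its log. `partition_for` uses CRC32 purely as a stand-in; Kafka and Kinesis each use their own hash function.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append (key, seq) events to per-partition logs in arrival order."""
    logs = defaultdict(list)
    for key, seq in events:
        logs[partition_for(key, num_partitions)].append((key, seq))
    return logs


events = [("user_1", 1), ("user_2", 1), ("user_1", 2),
          ("user_3", 1), ("user_2", 2), ("user_1", 3)]
logs = route(events, num_partitions=3)

# Within each partition, every key's sequence numbers only ever increase
for log in logs.values():
    last_seen = {}
    for key, seq in log:
        assert seq > last_seen.get(key, 0)
        last_seen[key] = seq
```

Nothing relates the position of an event in one log to events in another, which is why cross-partition (global) order is not available.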
Consumer Model: The Biggest Practical Difference
This is where the systems diverge most meaningfully in day-to-day operations.
Kafka: Fully Independent Consumer Groups
Each consumer group maintains its own offsets independently. Adding a new consumer group carries no per-consumer fee: it can read the topic from any retained offset without affecting other groups' positions, though it does add read load on the brokers.
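As a mental model, independent consumer groups amount to one shared log plus one cursor per group. The toy class below is a sketch of that idea for a single partition; `TopicLog` and its methods are made up for illustration and are not part of any Kafka client.

```python
class TopicLog:
    """A single-partition append-only log with a separate offset per group."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # group id -> next offset to read

    def append(self, record):
        self.records.append(record)

    def poll(self, group, max_records=10, start="earliest"):
        # A group seen for the first time starts at the beginning or the tip
        if group not in self.offsets:
            self.offsets[group] = 0 if start == "earliest" else len(self.records)
        begin = self.offsets[group]
        batch = self.records[begin:begin + max_records]
        self.offsets[group] = begin + len(batch)
        return batch


log = TopicLog()
for i in range(5):
    log.append(f"event-{i}")

fast = log.poll("realtime-analytics", max_records=5)
slow = log.poll("data-lake-ingestion", max_records=2)
print(log.offsets)  # {'realtime-analytics': 5, 'data-lake-ingestion': 2}
```

Reading never removes records, so a slow group does not hold back a fast one and a new group can start anywhere in the retained log.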
# Three completely independent consumers of the same topic
# Each maintains its own offsets and moves at its own pace
import time
from confluent_kafka import Consumer, TopicPartition

# Consumer Group A: real-time, at the tip of the log
consumer_a = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'realtime-analytics',
    'auto.offset.reset': 'latest'
})

# Consumer Group B: batch, running 6 hours behind
consumer_b = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'data-lake-ingestion',
    'auto.offset.reset': 'earliest'
})

# Consumer Group C: reprocessing historical data from 30 days ago
consumer_c = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'ml-backfill',
})
# Resolve "30 days ago" to an offset, then start from it (partition 0 shown; repeat per partition)
ts_ms = int((time.time() - 30 * 24 * 3600) * 1000)
start = consumer_c.offsets_for_times([TopicPartition('user-events', 0, ts_ms)], timeout=10)
consumer_c.assign(start)

# All three consume without awareness of each other

Kinesis: Shared Shard Read Throughput (Standard) or Enhanced Fan-Out
Standard Kinesis consumers share the 2MB/s read limit per shard. Adding consumers degrades everyone's read throughput until you enable enhanced fan-out.
# Standard consumer: shares 2MB/s with all others on this shard
import time
import boto3

kinesis = boto3.client('kinesis')
shard_iterator = kinesis.get_shard_iterator(
    StreamName='user-events',
    ShardId='shardId-000000000000',
    ShardIteratorType='LATEST'
)['ShardIterator']

while shard_iterator:
    response = kinesis.get_records(ShardIterator=shard_iterator, Limit=1000)
    records = response['Records']
    shard_iterator = response.get('NextShardIterator')  # Missing once a closed shard is drained
    time.sleep(1)  # Poll about once a second; the shard allows 5 GetRecords calls/s in total
    # This consumer is consuming from the shared 2MB/s budget
    # A third consumer on this shard means each gets ~0.67MB/s

# Enhanced fan-out: dedicated 2MB/s per shard, push-based
# Shown with boto3's subscribe_to_shard rather than an async KCL-style wrapper
response = kinesis.subscribe_to_shard(
    ConsumerARN='arn:aws:kinesis:...:consumer/realtime-analytics',
    ShardId='shardId-000000000000',
    StartingPosition={'Type': 'LATEST'},
)
# Each registered consumer gets its own 2MB/s, with no sharing
# A subscription lasts up to 5 minutes; re-subscribe to keep reading
for event in response['EventStream']:
    for record in event['SubscribeToShardEvent']['Records']:
        process(record)  # process() is your own handler

Rule: If you have more than 2 consumers on a Kinesis stream, enable enhanced fan-out. Standard reads at scale are a latency and throttling trap.
Operational Overhead: The Honest Comparison
Kafka Self-Managed
# What you own with self-managed Kafka

# Broker provisioning and scaling
kafka-topics.sh --create --topic user-events \
  --partitions 120 --replication-factor 3 \
  --bootstrap-server kafka:9092

# Monitoring: broker JVM, disk usage, ISR shrinks, under-replicated partitions
# Alerting on: consumer lag, leader elections, network saturation
# Upgrades: rolling restarts across broker cluster
# Capacity planning: disk growth, partition reassignment as brokers are added

# Reassign partitions when adding brokers
kafka-reassign-partitions.sh \
  --bootstrap-server kafka:9092 \
  --reassignment-json-file reassignment.json \
  --execute

Running self-managed Kafka seriously requires a dedicated ops rotation, expertise in JVM tuning, ZooKeeper or KRaft operational knowledge, and a monitoring stack covering 30+ broker metrics. Budget 0.5–1 FTE of ops time per Kafka cluster.
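Consumer lag is usually the first alert worth wiring up. In essence it is the log end offset minus the group's committed offset, per partition. The sketch below shows only that arithmetic on plain dictionaries; the function names, offsets, and threshold are made up, and in practice the inputs would come from your Kafka client or monitoring exporter.

```python
def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag: log end offset minus the group's committed offset.

    Partitions with no committed offset are treated as lagging by the whole
    log (committed offset 0), which is a simplification for illustration.
    """
    return {p: end - committed_offsets.get(p, 0) for p, end in end_offsets.items()}


def partitions_over_threshold(lag: dict, threshold: int) -> list:
    """Partitions whose lag exceeds the alert threshold, in sorted order."""
    return sorted(p for p, value in lag.items() if value > threshold)


end_offsets = {0: 1_000, 1: 1_200, 2: 900}
committed = {0: 990, 1: 700}  # partition 2 has never been committed

lag = consumer_lag(end_offsets, committed)
print(lag)                                            # {0: 10, 1: 500, 2: 900}
print(partitions_over_threshold(lag, threshold=100))  # [1, 2]
```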
Kinesis Managed
# What you own with Kinesis
import boto3

kinesis = boto3.client('kinesis')

# Shard management: scaling up and down
kinesis.update_shard_count(
    StreamName='user-events',
    TargetShardCount=200,  # Scale up
    ScalingType='UNIFORM_SCALING'
)
# Note: resharding is asynchronous (the stream stays in UPDATING until it finishes),
# and a single call can at most double or halve the shard count.
# Provisioned mode has no built-in auto-scaling; on-demand mode scales for you.

# Monitor: PutRecords.ThrottledRecords, GetRecords.IteratorAgeMilliseconds,
# ReadProvisionedThroughputExceeded, WriteProvisionedThroughputExceeded
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_alarm(
    AlarmName='kinesis-write-throttle',
    MetricName='WriteProvisionedThroughputExceeded',
    Namespace='AWS/Kinesis',
    Statistic='Sum',
    Period=60,
    EvaluationPeriods=3,
    Threshold=100,
    ComparisonOperator='GreaterThanThreshold',
    Dimensions=[{'Name': 'StreamName', 'Value': 'user-events'}]
)

Kinesis operational burden is significantly lower: no brokers, no ZooKeeper, no rolling restarts. The ops surface is shard count management, IAM permissions, and CloudWatch alarms. Budget 0.1 FTE ops time.
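Shard management is most of that ops surface, so it helps to plan resharding ahead of a traffic change. The sketch below assumes the UpdateShardCount constraint that one call can at most double or halve the current shard count, and works out the intermediate targets needed to reach a goal. `scaling_steps` is a hypothetical helper; other quotas, such as the number of scaling calls allowed per day, are not modeled and should be checked against the current API documentation.

```python
import math


def scaling_steps(current: int, target: int) -> list:
    """Intermediate shard counts when each step may at most double or halve."""
    if current < 1 or target < 1:
        raise ValueError("shard counts must be at least 1")
    steps = []
    while current != target:
        if target > current:
            current = min(target, current * 2)
        else:
            current = max(target, math.ceil(current / 2))
        steps.append(current)
    return steps


print(scaling_steps(100, 200))   # [200]
print(scaling_steps(100, 1000))  # [200, 400, 800, 1000]
print(scaling_steps(1000, 100))  # [500, 250, 125, 100]
```

Each step would be its own `update_shard_count` call, issued only after the stream returns to ACTIVE.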
Impact: Migrating a 3-stream Kinesis setup to self-managed Kafka added one full-time platform engineer to our team. The throughput gains justified it at our scale. At lower scale, they would not have.
Retention and Replay
Kafka
# Kafka: configure retention per topic

# Time-based (default 7 days, configurable to years)
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics --entity-name user-events \
  --alter --add-config retention.ms=2592000000  # 30 days

# Size-based retention
kafka-configs.sh --bootstrap-server kafka:9092 \
  --entity-type topics --entity-name user-events \
  --alter --add-config retention.bytes=1099511627776  # 1TB per partition

# Infinite retention with Tiered Storage (early access in Kafka 3.6, production-ready in 3.9)
# Hot data on broker disks, cold data on S3, transparent to consumers

Kinesis
# Kinesis: 24-hour default retention, up to 365 days
kinesis.increase_stream_retention_period(
    StreamName='user-events',
    RetentionPeriodHours=168  # 7 days; costs extra beyond 24 hours
)
# Extended retention pricing (list prices at time of writing; check the current price sheet):
#   24 hours: included in shard-hour cost
#   Up to 7 days: +$0.020 per shard-hour
#   Beyond 7 days, up to 365 days: $0.023 per GB-month (long-term retention)

# Replay from a specific timestamp
import datetime

timestamp = datetime.datetime(2024, 9, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)
shard_iterator = kinesis.get_shard_iterator(
    StreamName='user-events',
    ShardId='shardId-000000000000',
    ShardIteratorType='AT_TIMESTAMP',
    Timestamp=timestamp
)['ShardIterator']

Kafka wins on retention flexibility. You can retain data indefinitely (hardware permitting) with no incremental cost beyond storage. Kinesis charges an extra per-shard-hour fee for retention between 24 hours and 7 days and a per-GB-month fee beyond that, which becomes significant at scale.
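The retention settings above are plain unit conversions, and "hardware permitting" is a simple product. A small sketch that reproduces the 30-day `retention.ms` and 7-day `RetentionPeriodHours` values and estimates raw Kafka disk for a retention window follows. The helper names are made up, and the disk estimate ignores compression, indexes, and headroom unless you pass a `headroom` fraction.

```python
def retention_ms(days: int) -> int:
    """Kafka retention.ms value for a number of days."""
    return days * 24 * 60 * 60 * 1000


def retention_hours(days: int) -> int:
    """Kinesis RetentionPeriodHours value for a number of days."""
    return days * 24


def kafka_disk_needed_gb(ingest_gb_per_day: float, retention_days: int,
                         replication_factor: int, headroom: float = 0.0) -> float:
    """Raw disk across the cluster: every retained byte is stored RF times."""
    return ingest_gb_per_day * retention_days * replication_factor * (1 + headroom)


print(retention_ms(30))                  # 2592000000
print(retention_hours(7))                # 168
print(kafka_disk_needed_gb(100, 30, 3))  # 9000.0
```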
Cost Comparison at Scale
Running 100GB/day of event data for 30 days with 3 consumers:
Kafka (self-managed on AWS, 6-broker cluster):
  EC2 (6x r5.4xlarge):       $4,320/month
  EBS storage (30TB):        $750/month
  Data transfer:             $200/month
  Total:                     ~$5,270/month
  Ops cost (0.5 FTE):        ~$8,000/month
  ─────────────────────────────────────────
  Total with ops:            ~$13,270/month

Kinesis (100 shards, enhanced fan-out, 7-day retention):
  Shard-hours (100 × 720h):  $1,800/month
  PUT payload units:         $360/month
  Enhanced fan-out (3):      $1,800/month
  Extended retention (7d):   $360/month
  Total:                     ~$4,320/month
  Ops cost (0.1 FTE):        ~$1,600/month
  ─────────────────────────────────────────
  Total with ops:            ~$5,920/month
At 100GB/day with 3 consumers, Kinesis is meaningfully cheaper. The crossover happens around 500GB/day to 1TB/day depending on consumer count and retention requirements — at that point Kafka's flat infrastructure cost becomes competitive with Kinesis's per-shard pricing.
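For anyone adapting this comparison to their own numbers, the totals are straightforward sums of the line items. The sketch below copies the figures from the breakdown above; only the helper name `monthly_total` is new, and you would swap in your own quotes and ops estimates.

```python
# Monthly line items in USD, copied from the breakdown above
kafka_items = {
    "EC2 (6x r5.4xlarge)": 4320,
    "EBS storage (30TB)": 750,
    "Data transfer": 200,
}
kinesis_items = {
    "Shard-hours (100 x 720h)": 1800,
    "PUT payload units": 360,
    "Enhanced fan-out (3)": 1800,
    "Extended retention (7d)": 360,
}


def monthly_total(line_items: dict, ops_cost: int) -> int:
    """Infrastructure line items plus the ops staffing estimate."""
    return sum(line_items.values()) + ops_cost


kafka_total = monthly_total(kafka_items, ops_cost=8000)
kinesis_total = monthly_total(kinesis_items, ops_cost=1600)
print(kafka_total, kinesis_total)  # 13270 5920
```

Ops cost is the largest single Kafka line item here, which is why the staffing assumption matters more than instance pricing.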
Decision Framework
def choose_streaming_platform(
    throughput_mb_per_sec,
    num_independent_consumers,
    has_dedicated_platform_team,
    already_on_aws,
    needs_custom_consumer_logic,
    retention_days_required
):
    score = {"kafka": 0, "kinesis": 0}

    if throughput_mb_per_sec > 500:
        score["kafka"] += 3    # Kinesis shard cost becomes prohibitive
    else:
        score["kinesis"] += 2

    if num_independent_consumers > 5:
        score["kafka"] += 2    # Independent offsets, no fan-out cost
    elif num_independent_consumers <= 2:
        score["kinesis"] += 1

    if not has_dedicated_platform_team:
        score["kinesis"] += 3  # Operational overhead is the biggest risk
    else:
        score["kafka"] += 1

    if already_on_aws and not needs_custom_consumer_logic:
        score["kinesis"] += 2  # Native Lambda, Firehose, Glue integration

    if retention_days_required > 7:
        score["kafka"] += 2    # Kinesis long-term retention gets expensive

    winner = max(score, key=score.get)
    print(f"Kafka: {score['kafka']} Kinesis: {score['kinesis']}")
    print(f"Recommendation: {winner.upper()}")
    return winner

Real-World Impact
After running both systems across two separate products over 18 months:
- Kafka peak throughput: 2.1M events/sec on 20-broker cluster
- Kinesis peak throughput: 95,000 events/sec on 100 shards
- Kafka ops incidents/month: 3–4 (broker restarts, partition rebalances, lag spikes)
- Kinesis ops incidents/month: 0.5 (shard throttling, IAM misconfigs)
- Kafka cost at 1TB/day: ~$13K/month (including ops)
- Kinesis cost at 100GB/day: ~$6K/month (including ops)
- Time to first message (new stream): Kafka 2–3 days, Kinesis 15 minutes
Key Takeaways
- Kinesis wins on operational simplicity — if you don't have a platform team, this gap is decisive
- Kafka wins on throughput ceiling and cost at high volume — above ~500MB/s Kinesis shard costs escalate fast
- Both systems give equivalent per-key ordering guarantees — this is not a differentiator
- Kinesis enhanced fan-out is non-negotiable with more than 2 consumers — budget for it upfront
- Kafka's consumer group model is more flexible — independent consumers moving at different speeds is natural; in Kinesis it requires more design
- The real cost of Kafka is the ops FTE, not the infrastructure — factor this in honestly
Neither platform is wrong. Kinesis is the right default for AWS-native teams without streaming expertise. Kafka is the right choice when throughput, consumer flexibility, or retention requirements outgrow what Kinesis can deliver economically.