Optimizing Spark Jobs to Reduce Costs by 60%
Our Spark infrastructure bill was $40K/month. After systematic optimization, we reduced it to $16K/month while improving performance. Here's how.
The Cost Problem
Initial State:
- 50+ Spark jobs running daily
- Average runtime: 4+ hours per job
- Cluster costs: $40K/month
- 70% wasted compute on idle resources
Optimization Strategy
1. Right-Sizing Clusters
Before:
# One-size-fits-all cluster
cluster_config = {
"node_type": "r5.4xlarge", # Memory-optimized for everything
"num_workers": 20,
"autoscale": False
}After:
# Workload-specific clusters
bronze_cluster = {
"node_type": "i3.2xlarge", # I/O optimized for ingestion
"num_workers": 10,
"autoscale": {"min": 5, "max": 15}
}
silver_cluster = {
"node_type": "r5.4xlarge", # Memory for transformations
"num_workers": 15,
"autoscale": {"min": 8, "max": 25}
}
gold_cluster = {
"node_type": "c5.2xlarge", # Compute for aggregations
"num_workers": 8,
"autoscale": {"min": 4, "max": 12}
}Savings: $12K/month (30% reduction)
2. Partition Optimization
Problem: Small files creating millions of tasks
# Before: 500K partitions = 500K tasks
df = spark.read.parquet("s3://data/events/")
print(df.rdd.getNumPartitions()) # 500,000
# After: Coalesce to optimal size
target_partition_size_mb = 128
num_partitions = calculate_optimal_partitions(df)
df = df.coalesce(num_partitions) # 2,000 partitionsFunction to calculate partitions:
def calculate_optimal_partitions(df, target_mb=128):
"""Calculate optimal partition count"""
total_size_mb = df._jdf.queryExecution().analyzed().stats().sizeInBytes() / (1024 * 1024)
return max(int(total_size_mb / target_mb), spark.sparkContext.defaultParallelism)Savings: 40% faster execution = $5K/month
3. Intelligent Caching
Before: Caching everything
# Bad: Cache unnecessarily
df1 = spark.read.parquet("s3://data/")
df1.cache() # Wastes memory
df2 = df1.filter(col("date") == "2024-01-01")
df2.cache() # Also not reusedAfter: Cache only multi-use dataframes
# Good: Cache strategically
dim_users = spark.read.parquet("s3://dims/users").cache() # Used 10x
# Don't cache single-use
fact_orders = spark.read.parquet("s3://facts/orders") # Used onceRule: Cache only if:
- Used 3+ times in same job
- Size < 20% of cluster memory
- Expensive to recompute
Savings: $3K/month (better memory utilization)
4. Dynamic Partition Pruning
Enable Spark 3.0+ optimization:
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
# Automatically skips partitions
fact_sales = spark.read.parquet("s3://facts/sales") # 1000 partitions
dim_product = spark.read.parquet("s3://dims/products")
# Spark automatically prunes irrelevant partitions
result = fact_sales.join(
dim_product.filter(col("category") == "Electronics"),
"product_id"
)
# Reads only 50 partitions instead of 1000Savings: $4K/month (80% less data scanned)
5. Spot Instances
Configuration:
cluster_config = {
"node_type_id": "r5.4xlarge",
"driver_node_type_id": "r5.xlarge", # On-demand driver
"num_workers": 20,
"aws_attributes": {
"availability": "SPOT_WITH_FALLBACK", # Use spot, fallback to on-demand
"spot_bid_price_percent": 100,
"first_on_demand": 2 # Keep 2 on-demand for stability
}
}Spot Instance Strategy:
- Driver: Always on-demand (stability)
- Workers: 80% spot, 20% on-demand
- Multiple AZs for availability
Savings: $8K/month (60% discount on workers)
6. Cluster Auto-Termination
Before: Clusters running 24/7
After:
cluster_config = {
"autotermination_minutes": 15, # Terminate after 15 min idle
"cluster_log_conf": {
"s3": {
"destination": "s3://logs/",
"enable_encryption": True
}
}
}Savings: $6K/month (eliminated idle time)
7. Job Scheduling Optimization
Before: All jobs run during business hours
After: Spread across 24 hours
# High-priority: Business hours
critical_jobs = ["revenue_report", "user_metrics"]
schedule_time = "08:00"
# Low-priority: Off-peak hours (cheaper spot instances)
batch_jobs = ["historical_backfill", "data_quality_checks"]
schedule_time = "02:00" # 3x cheaper spot availabilitySavings: $2K/month
Cost Breakdown
| Optimization | Monthly Savings | Implementation Effort |
|---|---|---|
| Right-sizing | $12,000 | Medium |
| Partitioning | $5,000 | Low |
| Caching | $3,000 | Low |
| Partition Pruning | $4,000 | Low |
| Spot Instances | $8,000 | Medium |
| Auto-termination | $6,000 | Low |
| Job Scheduling | $2,000 | Low |
| Total | $40K → $16K | 60% Reduction |
Monitoring Dashboard
Track these metrics:
-- DBU consumption by job
SELECT
job_name,
SUM(dbu_units) as total_dbus,
SUM(dbu_units * cost_per_dbu) as cost_usd
FROM cluster_usage
WHERE date >= current_date() - 30
GROUP BY job_name
ORDER BY cost_usd DESC# Alert on cost anomalies
def check_cost_anomaly(current_cost, historical_avg):
threshold = historical_avg * 1.3 # 30% increase
if current_cost > threshold:
send_alert(f"Cost spike detected: ${current_cost:.2f}")Best Practices
DO:
✅ Right-size clusters per workload
✅ Use spot instances (80/20 split)
✅ Enable auto-termination (15 min)
✅ Optimize partition sizes (128MB-1GB)
✅ Cache only multi-use data
✅ Schedule off-peak for batch jobs
✅ Monitor costs daily
DON'T:
❌ Use one cluster for everything
❌ Keep clusters running 24/7
❌ Cache everything "just in case"
❌ Ignore partition counts
❌ Run all jobs during business hours
❌ Skip cost monitoring
Implementation Timeline
Week 1: Right-size clusters, enable auto-termination
Week 2: Optimize partitioning, enable spot instances
Week 3: Implement caching strategy, partition pruning
Week 4: Job scheduling optimization, monitoring
Results
Before:
- $40K/month
- 4+ hour runtimes
- 70% resource waste
- Manual cluster management
After:
- $16K/month (60% reduction)
- 2.5 hour runtimes (38% faster)
- 15% resource waste
- Automated cost controls
Key Takeaways
- Workload-specific clusters save more than generic sizing
- Spot instances are the biggest cost lever (60% discount)
- Auto-termination eliminates idle waste
- Partition optimization improves both cost and performance
- Monitoring is essential for sustained savings
Cost optimization is continuous, not one-time. Set up alerts, review monthly, and iterate.
Related: Scaling Spark to 100TB | Delta Lake Optimization