Back to Blog
delta-lakesparkdatabricksoptimizationlakehouseperformance

Delta Lake Optimization: From Slow to Fast

2024-11-079 min read

Delta Lake Optimization: From Slow to Fast

After inheriting a Delta Lake that had grown for two years without maintenance, I've learned that slow tables almost always share the same three problems: too many small files, no data clustering, and queries scanning 10x more data than they need. Every pattern below targets one of those three root causes.

The Problem

Our analytics platform had a user_events table with 620,000 Parquet files, a transactions table where a simple 7-day filter still scanned 18 months of data, and Gold-layer dashboards timing out after 8 minutes. After systematic optimization, dashboard queries dropped to 28 seconds and the nightly pipeline cost fell by 67%.

10 Optimization Patterns

1. Diagnose Before You Optimize

Never guess. Run a health check on every table before touching it — the bottleneck is almost never where you expect.

Before:

# Guessing the problem — running OPTIMIZE blindly on everything
spark.sql("OPTIMIZE transactions")
spark.sql("OPTIMIZE user_events")
# Still slow. No idea why.

After:

from delta.tables import DeltaTable
 
def table_health(spark, path: str):
    dt     = DeltaTable.forPath(spark, path)
    detail = dt.detail().collect()[0]
 
    num_files  = detail["numFiles"]
    size_bytes = detail["sizeInBytes"]
    avg_mb     = (size_bytes / num_files) / 1e6 if num_files else 0
 
    print(f"Files:        {num_files:,}")
    print(f"Size:         {size_bytes / 1e9:.1f} GB")
    print(f"Avg file:     {avg_mb:.1f} MB")
    print(f"Partitions:   {detail['numPartitions']}")
 
    if avg_mb < 32:
        print("⚠️  Small file problem — run OPTIMIZE")
    if avg_mb > 2048:
        print("⚠️  Files too large — revisit partition strategy")
    if num_files > 100_000:
        print("⚠️  File count critical — OPTIMIZE urgently")
 
table_health(spark, "s3://lake/silver/user_events/")
# Files:      620,412
# Size:       1.8 TB
# Avg file:   2.9 MB        ← critical small file problem
# Partitions: 730
# ⚠️  Small file problem — run OPTIMIZE
# ⚠️  File count critical — OPTIMIZE urgently

Impact: Health checks revealed our transactions table had 94% of its files under 5MB — a small-file problem, not a query logic problem. Fixed in one OPTIMIZE run, not two weeks of Spark tuning.


2. OPTIMIZE: Fix the Small File Problem

Small files are Delta Lake's most common performance killer. One streaming job writing every 30 seconds for a year creates millions of tiny files that crush read performance.

Before:

# Streaming write — creates one file per micro-batch per partition
df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/events/") \
    .start("s3://lake/bronze/events/")
# 12 months later: 800,000 files, each 1–3MB
# Every query opens 800K files — metadata overhead alone takes 40 seconds

After:

# OPTIMIZE compacts small files into 1GB target files
spark.sql("OPTIMIZE delta.`s3://lake/bronze/events/`")
 
# Target only recent partitions — much faster than full table
spark.sql("""
    OPTIMIZE delta.`s3://lake/silver/user_events/`
    WHERE event_date >= '2024-10-01'
""")
 
# Python API — integrate into maintenance pipelines
dt = DeltaTable.forPath(spark, "s3://lake/silver/user_events/")
dt.optimize() \
  .where("event_date >= '2024-10-01'") \
  .executeCompaction()

Impact: Compacting user_events from 620,000 files to 1,847 files cut metadata scan time from 40 seconds to 0.3 seconds. Query cold-start time dropped by 94%.


3. ZORDER BY: Stop Scanning Data You Don't Need

OPTIMIZE alone compacts files. ZORDER BY clusters related rows into the same files so Delta's data-skipping index can skip entire files during filtered reads.

Before:

# No clustering — a filter on user_id scans every file
spark.sql("""
    SELECT SUM(amount)
    FROM silver.transactions
    WHERE user_id = 'u_8821'
    AND   event_date >= '2024-10-01'
""")
# Files scanned: 1,847 of 1,847 (0% skipped)
# Runtime: 4 min 12 sec

After:

# ZORDER clusters by user_id and event_date — Delta skips irrelevant files
spark.sql("""
    OPTIMIZE delta.`s3://lake/silver/transactions/`
    ZORDER BY (user_id, event_date)
""")
 
# Same query — now skips 89% of files
spark.sql("""
    SELECT SUM(amount)
    FROM silver.transactions
    WHERE user_id = 'u_8821'
    AND   event_date >= '2024-10-01'
""")
# Files scanned: 204 of 1,847 (89% skipped)
# Runtime: 28 sec

Choosing ZORDER columns — pick correctly or waste the run:

# GOOD: high cardinality, appears in WHERE clauses and JOINs
OPTIMIZE silver.transactions ZORDER BY (user_id, event_date)
 
# BAD: low cardinality adds no skipping benefit
OPTIMIZE silver.transactions ZORDER BY (status)        # 3 distinct values — useless
OPTIMIZE silver.transactions ZORDER BY (is_deleted)    # Boolean — skip entirely
OPTIMIZE silver.transactions ZORDER BY (a, b, c, d, e) # > 4 columns — effectiveness collapses

Impact: Z-Ordering transactions by user_id and event_date pushed file-skipping ratio from 0% to 89%. The query that took 4 minutes now takes 28 seconds on the same cluster.


4. Partition Strategy: Prune Before You Scan

Partitioning and Z-Ordering work at different levels. Partitioning eliminates entire directory prefixes before Delta even opens a file. Z-Ordering skips files within a partition. Get partitioning wrong and Z-Ordering cannot save you.

Before:

# Over-partitioned — 50,000 partitions, each with 5–10 tiny files
df.write \
    .format("delta") \
    .partitionBy("user_id") \         # High cardinality = partition explosion
    .save("s3://lake/silver/events/")
 
# Under-partitioned — 7-day filter scans 18 months of data
df.write \
    .format("delta") \
    .save("s3://lake/silver/events/") # No partition → full table scan always

After:

# Right-sized: partition by date hierarchy, Z-Order within partition
df.withColumn("year",  year(col("event_ts"))) \
  .withColumn("month", month(col("event_ts"))) \
  .write \
  .format("delta") \
  .partitionBy("year", "month", "event_date") \  # Prune by year/month first
  .mode("overwrite") \
  .save("s3://lake/silver/events/")
 
# Then Z-Order within each date partition for column-level skipping
spark.sql("""
    OPTIMIZE delta.`s3://lake/silver/events/`
    WHERE event_date >= '2024-10-01'
    ZORDER BY (user_id, event_type)
""")

Rule: Partition on the column you filter by in WHERE clauses 90% of the time. For most event tables that is event_date. Never partition by a column with more than 10,000 distinct values.

Impact: Adding year/month/event_date hierarchy to a flat table reduced data scanned from 18 months to 7 days for our most common dashboard query — a 98% reduction in bytes read.


5. VACUUM: Reclaim Storage, Speed Up Listings

VACUUM deletes obsolete files that Delta no longer references. Without it, S3 directory listings grow indefinitely — slowing every query's planning phase.

Before:

# No VACUUM in 14 months — 4.2TB of unreferenced files accumulating
# S3 LIST calls during query planning: 18 seconds
# Storage bill includes 4.2TB of data nobody can read

After:

# Always dry-run first — see what will be deleted
spark.sql("VACUUM delta.`s3://lake/silver/events/` RETAIN 168 HOURS DRY RUN")
 
# Execute with 7-day retention (safe default)
spark.sql("VACUUM delta.`s3://lake/silver/events/` RETAIN 168 HOURS")
 
# Python API
dt = DeltaTable.forPath(spark, "s3://lake/silver/events/")
dt.vacuum(retentionHours=168)
 
# Schedule as a weekly maintenance job — not a one-time fix
def weekly_vacuum(spark, tables: list, retention_hours: int = 168):
    for path in tables:
        print(f"Vacuuming {path}...")
        DeltaTable.forPath(spark, path).vacuum(retentionHours=retention_hours)
        print(f"Done: {path}")

Impact: First VACUUM after 14 months of no maintenance reclaimed 4.2TB of storage ($96/month) and cut S3 LIST time during query planning from 18 seconds to 1.1 seconds.


6. Auto-Optimize on Write

Prevent the small file problem at the source instead of fixing it after the fact with scheduled OPTIMIZE runs.

Before:

# Default write — every append creates new files regardless of size
df.write.format("delta").mode("append").save("s3://lake/silver/events/")
# Small files accumulate; OPTIMIZE runs chase the problem indefinitely

After:

# Enable auto-optimization on the table — small files compacted on write
spark.sql("""
    ALTER TABLE silver.events
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',  -- Bin-pack on every write
        'delta.autoOptimize.autoCompact'   = 'true'   -- Background compaction
    )
""")
 
# Or set globally for all new tables in the session
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled",   "true")

Caution: autoCompact adds write latency. Benchmark on high-throughput streaming tables before enabling — for micro-batch writes under 5 seconds, disable it and rely on scheduled OPTIMIZE instead.

Impact: Enabling optimizeWrite on our hourly batch tables reduced average file count growth from 8,400 new files/day to 210 — a 97% reduction that made weekly OPTIMIZE runs unnecessary.


7. Data Skipping with Column Statistics

Delta Lake maintains min/max statistics for up to 32 columns per table. Queries that filter on those columns skip entire files without reading them. Wasted statistics slots are silent performance losses.

Before:

# Default — Delta collects stats on first 32 columns in schema order
# If your most-filtered columns are beyond column 32, you get zero skipping
CREATE TABLE silver.events (
    _ingestion_id    STRING,   -- column 1  (internal, never filtered)
    _source_file     STRING,   -- column 2  (internal, never filtered)
    _batch_id        STRING,   -- column 3  (internal, never filtered)
    ...                        -- columns 431: more internal metadata
    user_id          STRING,   -- column 32 (barely makes the cut)
    event_date       DATE,     -- column 33NO STATISTICS COLLECTED
    amount           DECIMAL,  -- column 34NO STATISTICS COLLECTED
)

After:

# Reorder schema — put high-value filter columns first
# Or explicitly configure which columns get statistics
spark.sql("""
    ALTER TABLE silver.events
    SET TBLPROPERTIES (
        'delta.dataSkippingNumIndexedCols' = '32',
        'delta.columnMapping.mode'         = 'name'
    )
""")
 
# Explicitly list the columns that matter for skipping
spark.sql("""
    ALTER TABLE silver.events
    SET TBLPROPERTIES (
        'delta.dataSkippingStatsColumns' = 'user_id,event_date,amount,event_type,region'
    )
""")

Impact: Moving user_id and event_date into the statistics columns lifted file-skipping from 11% to 76% on our widest table — without changing a single query or running OPTIMIZE.


8. Bloom Filter Indexes for High-Cardinality Lookups

Z-Ordering helps range queries. Bloom filter indexes help exact-match lookups on high-cardinality string columns — order IDs, transaction IDs, session IDs.

Before:

# Exact-match lookup on transaction_id — full file scan, no skipping
spark.sql("""
    SELECT * FROM silver.transactions
    WHERE transaction_id = 'txn_a8f3c921'
""")
# Files scanned: 1,847 of 1,847
# Runtime: 3 min 44 sec

After:

# Create bloom filter index on transaction_id
spark.sql("""
    CREATE BLOOMFILTER INDEX ON TABLE silver.transactions
    FOR COLUMNS (transaction_id OPTIONS (fpp=0.1, numItems=50000000))
""")
 
# Same lookup — bloom filter eliminates 96% of files
spark.sql("""
    SELECT * FROM silver.transactions
    WHERE transaction_id = 'txn_a8f3c921'
""")
# Files scanned: 74 of 1,847 (96% skipped)
# Runtime: 9 sec

Impact: Bloom filter index on transaction_id turned a 3-minute support lookup into a 9-second one. Customer support team stopped filing tickets about "slow transaction searches" the same week.


9. Cache Hot Tables at the Right Storage Level

Caching is not free — wrong storage level wastes executor memory and evicts data you actually need.

Before:

# Default cache — MEMORY_AND_DISK without serialization
# Wide tables consume 5–8x more memory than necessary
dim_products = spark.read.format("delta") \
    .load("s3://lake/gold/dim_products/").cache()
# 220-column table: 18GB cached in memory as uncompressed Java objects
# Evicts other DataFrames; causes GC pressure

After:

from pyspark import StorageLevel
 
# Small dimension tables (< 2GB) — MEMORY_ONLY_SER, fast access, low footprint
dim_products = spark.read.format("delta") \
    .load("s3://lake/gold/dim_products/") \
    .persist(StorageLevel.MEMORY_ONLY_SER)
 
# Large fact tables reused across stages — MEMORY_AND_DISK_SER
fact_events = spark.read.format("delta") \
    .load("s3://lake/silver/user_events/") \
    .filter(col("event_date") >= "2024-10-01") \
    .persist(StorageLevel.MEMORY_AND_DISK_SER)
 
# Force materialization before branching — avoid recomputation
fact_events.count()
 
result_a = fact_events.groupBy("user_id").agg(sum("amount"))
result_b = fact_events.groupBy("region").agg(count("*"))
 
# Always unpersist when done — don't wait for job end
fact_events.unpersist()
dim_products.unpersist()

Impact: Switching the 220-column dim_products table from MEMORY_AND_DISK to MEMORY_ONLY_SER freed 14GB of executor memory per node and eliminated spill-to-disk on the downstream join stage.


10. Scheduled Maintenance Pipeline

All the patterns above degrade over time without maintenance. Wire them into a single nightly job — not a runbook that gets forgotten.

Before:

# Maintenance done manually "when someone notices it's slow"
# OPTIMIZE: last run 4 months ago
# VACUUM: never run
# Health checks: never run

After:

import datetime
from delta.tables import DeltaTable
 
TABLES = [
    {
        "path":    "s3://lake/silver/user_events/",
        "zorder":  ["user_id", "event_date"],
        "retain":  168,
    },
    {
        "path":    "s3://lake/silver/transactions/",
        "zorder":  ["user_id", "transaction_date"],
        "retain":  168,
    },
    {
        "path":    "s3://lake/gold/daily_totals/",
        "zorder":  ["region", "product_category"],
        "retain":  336,   # 14 days for gold layer
    },
]
 
OPTIMIZE_WINDOW_DAYS = 3
cutoff = (
    datetime.date.today() - datetime.timedelta(days=OPTIMIZE_WINDOW_DAYS)
).isoformat()
 
for table in TABLES:
    dt = DeltaTable.forPath(spark, table["path"])
 
    # Health check
    detail  = dt.detail().collect()[0]
    avg_mb  = (detail["sizeInBytes"] / detail["numFiles"]) / 1e6
    print(f"{table['path']}{detail['numFiles']:,} files, avg {avg_mb:.1f}MB")
 
    # OPTIMIZE + ZORDER on recent partitions only
    dt.optimize() \
      .where(f"event_date >= '{cutoff}'") \
      .executeZOrderBy(*table["zorder"])
 
    # VACUUM — reclaim unreferenced files
    dt.vacuum(retentionHours=table["retain"])
 
    print(f"✓ Done: {table['path']}")

Impact: Nightly maintenance pipeline reduced average file count across 12 core tables from 180,000 to 2,100. Query P95 held steady at 28 seconds for 6 months without manual intervention.


Real-World Impact

After applying all 10 patterns across a 1.8TB Delta Lake on Databricks:

  • Query P95 latency: 8 min 14 sec → 28 seconds (91% reduction)
  • File count (user_events): 620,000 → 1,847 files
  • File-skipping ratio: 3% → 89%
  • Storage reclaimed (VACUUM): 4.2TB ($96/month saved)
  • Nightly pipeline cost: $1,840 → $610/run (67% reduction)
  • Metadata scan time: 40 seconds → 0.3 seconds

Key Takeaways

  1. Diagnose first — run the health check before touching anything; the bottleneck is almost never where you think
  2. OPTIMIZE on recent partitions only — full-table compaction is expensive and usually unnecessary
  3. ZORDER on columns you actually filter by — low-cardinality columns waste the entire run
  4. Partition by date, Z-Order by identity — partitioning prunes directories, Z-Ordering skips files within them
  5. VACUUM weekly, dry-run always — unreferenced files slow directory listings and inflate storage bills
  6. Put high-value columns first in the schema — Delta only collects statistics on the first 32 columns by default
  7. Bloom filters for exact-match lookups — Z-Ordering helps range scans; bloom filters help point lookups
  8. Automate maintenance — performance degrades within weeks without a nightly OPTIMIZE and VACUUM job

Slow Delta Lakes are almost always a file problem, not a compute problem. Fix the files first.


Next Steps: