Delta Lake Optimization: OPTIMIZE, ZORDER, and VACUUM in Production

Running Delta Lake at scale means dealing with a graveyard of small files, bloated transaction logs, and query plans that scan far more data than they should. After managing production Delta tables with billions of rows, here are the exact commands and patterns that made the biggest difference.

The Problem

Our Delta Lake had grown organically over 18 months. Tables had 500,000+ small Parquet files, queries that should take 30 seconds were timing out, and storage costs were ballooning with stale data versions nobody needed. Three commands fixed 90% of our problems: OPTIMIZE, ZORDER BY, and VACUUM.

OPTIMIZE: Compacting Small Files

Delta Lake's streaming ingestion and frequent small appends create thousands of tiny Parquet files — the classic "small files problem." OPTIMIZE compacts them into larger, more efficient files (default target: 1GB).

Basic OPTIMIZE

-- Compact all files in the table
OPTIMIZE delta.`/mnt/lake/events`
 
-- Or by table name (if registered in metastore)
OPTIMIZE events_table

Partition-Targeted OPTIMIZE

Running OPTIMIZE on the entire table is expensive. Target only the partitions that changed:

-- Only compact today's partition
OPTIMIZE events_table
WHERE event_date = '2024-06-10'

from delta.tables import DeltaTable
 
dt = DeltaTable.forPath(spark, "/mnt/lake/events")
 
# Optimise only recent partitions
dt.optimize() \
  .where("event_date >= '2024-06-01'") \
  .executeCompaction()

Impact: Query times dropped from 4 minutes to 18 seconds on a 10TB table after going from 480K files to 1,200 files.

Automate OPTIMIZE with Databricks

On Databricks, enable Auto Optimize to handle compaction automatically on write:

ALTER TABLE events_table
SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);

Caution: Auto Optimize adds write latency. Benchmark before enabling on high-throughput streaming tables.

ZORDER BY: Co-locating Related Data

ZORDER BY is a multi-dimensional clustering technique that physically co-locates rows sharing similar column values within the same files. Queries filtering on Z-Ordered columns skip far more files via Delta's data-skipping index.

Basic ZORDER

-- Compact AND cluster by user_id and event_type
OPTIMIZE events_table
ZORDER BY (user_id, event_type)

ZORDER in PySpark

dt.optimize() \
  .where("event_date >= '2024-06-01'") \
  .executeZOrderBy("user_id", "event_type")

Choosing ZORDER Columns

Pick columns that appear most frequently in WHERE clauses and JOIN conditions:

# Good candidates — high-cardinality, commonly filtered
OPTIMIZE transactions
ZORDER BY (account_id, transaction_date)
 
# Bad candidates — low cardinality adds no benefit
OPTIMIZE transactions
ZORDER BY (status)          -- Only 3 distinct values, useless
ZORDER BY (is_deleted)      -- Boolean, skip this entirely

Rule of thumb: Z-Order on at most 3–4 columns. Beyond that, clustering effectiveness degrades and compaction takes much longer.

Verify Data Skipping is Working

-- Check how many files are skipped
DESCRIBE DETAIL events_table;
 
-- In Databricks: query the operation metrics
DESCRIBE HISTORY events_table LIMIT 5;

Look for numFilesAdded, numFilesRemoved, and numRemovedBytes in the history output.

Impact: File-skipping ratio went from 12% to 87% on our user activity table after Z-Ordering by user_id and session_date.

VACUUM: Reclaiming Storage

Every UPDATE, DELETE, and MERGE in Delta Lake creates new Parquet files while marking old ones as deleted — but not removing them. VACUUM physically deletes files no longer referenced by the table's active transaction log.

Basic VACUUM

-- Default retention: 7 days (168 hours)
VACUUM events_table

Custom Retention

-- Keep 3 days of history (72 hours)
VACUUM events_table RETAIN 72 HOURS

dt.vacuum(retentionHours=72)

Dry Run First

Always dry-run before executing — especially in production:

-- Lists files that WOULD be deleted, deletes nothing
VACUUM events_table RETAIN 72 HOURS DRY RUN

# Dry run in Python
dt.vacuum(retentionHours=72)  # Add dry_run=True on Databricks Runtime 12+

Overriding the Safety Threshold (Handle with Care)

Delta Lake prevents VACUUM with retention < 7 days by default to protect concurrent readers. Override only if you're certain no active readers exist:

spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
 
# Now you can vacuum with shorter retention
dt.vacuum(retentionHours=24)
 
# Re-enable the guard immediately after
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")

Warning: Disabling the check and vacuuming aggressively will break time travel queries. Only do this on tables where you've explicitly scoped your time travel window.

Production Scheduling Pattern

Run these three operations together in a maintenance job. Here's the pattern we use:

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
import datetime
 
spark = SparkSession.builder.getOrCreate()
 
TABLES = [
    {"path": "/mnt/lake/events",       "zorder": ["user_id", "event_date"]},
    {"path": "/mnt/lake/transactions", "zorder": ["account_id", "txn_date"]},
    {"path": "/mnt/lake/sessions",     "zorder": ["session_id"]},
]
 
RETENTION_HOURS = 168  # 7 days
OPTIMIZE_WINDOW_DAYS = 3
 
cutoff = (datetime.date.today() - datetime.timedelta(days=OPTIMIZE_WINDOW_DAYS)).isoformat()
 
for table in TABLES:
    dt = DeltaTable.forPath(spark, table["path"])
    
    print(f"Optimizing {table['path']}...")
    dt.optimize() \
      .where(f"event_date >= '{cutoff}'") \
      .executeZOrderBy(*table["zorder"])
    
    print(f"Vacuuming {table['path']}...")
    dt.vacuum(retentionHours=RETENTION_HOURS)
    
    print(f"Done: {table['path']}")

Schedule this as a nightly Databricks job during off-peak hours.

Monitoring Table Health

Build the habit of inspecting table health before and after maintenance:

def table_health(path: str):
    dt = DeltaTable.forPath(spark, path)
    detail = dt.detail().collect()[0]
    
    print(f"Table:          {path}")
    print(f"Num files:      {detail['numFiles']:,}")
    print(f"Size (GB):      {detail['sizeInBytes'] / 1e9:.2f}")
    print(f"Avg file (MB):  {detail['sizeInBytes'] / detail['numFiles'] / 1e6:.1f}")
    print(f"Partitions:     {detail['numPartitions']}")
 
table_health("/mnt/lake/events")

Healthy average file size is 128MB–1GB. If you're averaging under 10MB, run OPTIMIZE. If you're seeing files well above 2GB, your partition strategy needs revisiting.

Real-World Impact

After implementing scheduled OPTIMIZE + ZORDER + VACUUM across our 12 core Delta tables:

Query P95 latency: 4.2 min → 22 seconds (91% reduction)
File count: 480,000 → 1,400 files
Storage: 38TB → 21TB (45% reclaimed)
Data-skipping ratio: 14% → 89%
Monthly storage cost: $3,800 → $2,100

Key Takeaways

Run OPTIMIZE on recently written partitions, not the whole table every time
Z-Order by the columns you actually filter on — high cardinality wins
Never skip the DRY RUN before VACUUM in production
Don't Z-Order on more than 3–4 columns; diminishing returns kick in fast
Monitor average file size — it's the single best health metric
Automate maintenance as a scheduled job, not a manual task

Small files kill Delta Lake performance. Compact early, compact often.

Next Steps:

Read about Spark optimization patterns that cut costs by 60%
Explore streaming ingestion into Delta Lake
Check out the enterprise lakehouse architecture project