Module-3 – Spark & Delta Performance

🎯 Objectives

  • Master Spark DataFrame API & Catalyst optimizer.
  • Optimize jobs: partitioning, caching, broadcast joins, AQE.
  • Leverage Delta Lake: ACID, schema evolution, time travel.

πŸ—“οΈ Weekly Plan

  • Week 9 – Spark architecture: DAG, stages, tasks, executors.
  • Week 10 – Performance tuning: shuffles, partitioning, bucketing, broadcast hints.
  • Week 11 – AQE & advanced caching strategies (see the config sketch after this plan).
  • Week 12 – Delta Lake deep dive: MERGE, Z-ordering, compaction, time travel.
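
The AQE and caching topics in Week 11 boil down to a handful of session settings. A minimal configuration sketch, assuming Spark 3.x (AQE is on by default since 3.2, but the flags are shown explicitly):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

# Adaptive Query Execution: re-plans the query at runtime from shuffle statistics.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split oversized join partitions
    .getOrCreate()
)

# Caching: persist an expensive intermediate result that several actions reuse.
df = spark.range(1_000_000).selectExpr("id", "id % 100 AS bucket")
df.persist(StorageLevel.MEMORY_AND_DISK)  # spills to disk if it outgrows memory
df.count()                                # first action materializes the cache
df.groupBy("bucket").count().show()       # served from the cached data
df.unpersist()                            # release executor memory when done
```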

🔑 Key Concepts

1. Spark APIs

  • RDD vs DataFrame vs Dataset.
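
A minimal sketch of the contrast, assuming PySpark (the typed Dataset API exists only in Scala/Java, so in Python the practical choice is RDD vs DataFrame):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-comparison").getOrCreate()
rows = [("a", 1), ("b", 2), ("a", 3)]

# RDD: low-level functional API; the lambda is opaque, so Catalyst cannot optimize it.
rdd_sums = spark.sparkContext.parallelize(rows).reduceByKey(lambda x, y: x + y)
print(rdd_sums.collect())

# DataFrame: declarative column expressions that Catalyst can analyze and rewrite.
df = spark.createDataFrame(rows, ["key", "value"])
df.groupBy("key").sum("value").show()
```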

2. Catalyst Optimizer

  • Logical plan → optimized logical plan → physical plan.
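
explain() lets you watch these stages for any query. A sketch assuming Spark 3.x, where mode="extended" prints every plan:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = spark.range(1_000).filter("id > 10").select("id")
# Prints the parsed and analyzed logical plans, the optimized logical plan
# (with, e.g., filters pushed down), and the chosen physical plan.
query.explain(mode="extended")
```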

3. Joins & Shuffles

  • Broadcast joins vs shuffle joins, skew mitigation.
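
A sketch of forcing the broadcast strategy with the broadcast() hint (the table and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.range(1_000_000).selectExpr("id % 100 AS key", "id AS amount")
dims = spark.createDataFrame([(i, f"dim_{i}") for i in range(100)], ["key", "label"])

# The hint replicates the small `dims` table to every executor,
# so the large `facts` side is never shuffled across the network.
joined = facts.join(broadcast(dims), "key")
joined.explain()  # expect BroadcastHashJoin rather than SortMergeJoin
```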

4. File Formats

  • Parquet internals, Delta Lake commit log.
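
The commit log is plain JSON, so you can inspect it directly. A sketch assuming the delta-spark pip package (configure_spark_with_delta_pip wires in the required extensions); the path is illustrative:

```python
import json
import os

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("delta-log-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/log_demo"
spark.range(10).write.format("delta").mode("overwrite").save(path)

# Each commit is a numbered JSON file under _delta_log; every line is one
# action (commitInfo, protocol, metaData, add, remove, ...).
log_dir = os.path.join(path, "_delta_log")
for name in sorted(f for f in os.listdir(log_dir) if f.endswith(".json")):
    with open(os.path.join(log_dir, name)) as fh:
        for line in fh:
            print(name, list(json.loads(line).keys()))
```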

5. Delta Lake Features

  • ACID transactions, time travel, schema evolution.
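
A sketch exercising schema evolution and time travel on a local table (assumes a Delta-enabled session like the one above; path and column names are illustrative):

```python
from pyspark.sql.functions import lit

path = "/tmp/features_demo"

# Version 0: initial write.
spark.range(5).write.format("delta").mode("overwrite").save(path)

# Version 1: schema evolution, appending rows that carry an extra column.
(spark.range(5, 10)
    .withColumn("flag", lit(True))
    .write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # allow the new `flag` column
    .save(path))

# Time travel: read the table exactly as it stood at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```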

6. Partitioning & Z-Ordering

  • Static vs dynamic partitioning, Z-order clustering.
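
A sketch combining the two, assuming open-source Delta Lake 2.0+ (which added OPTIMIZE and ZORDER BY); the column names are illustrative:

```python
from pyspark.sql.functions import expr

events = spark.range(10_000).select(
    expr("CAST(date_add('2024-01-01', CAST(id % 30 AS INT)) AS STRING) AS event_date"),
    expr("CAST(id % 500 AS INT) AS user_id"),
)

# Static partitioning: one directory per event_date value.
events.write.format("delta").partitionBy("event_date").mode("overwrite").save("/tmp/events")

# Z-ordering: cluster rows with nearby user_id values inside each partition
# so data skipping can prune files for user_id predicates.
spark.sql("OPTIMIZE delta.`/tmp/events` ZORDER BY (user_id)")
```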

🔨 Mini-Projects

  • Taxi Data Pipeline: Ingest 1 GB CSV → Delta Bronze/Silver/Gold on ADLS Gen2.
  • Skew Handling: Benchmark & fix a skewed join via broadcast/salting (see the salting sketch after this list).
  • Delta Demo: Implement schema evolution & time-travel queries.
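
For the skew-handling project, a minimal salting sketch on synthetic data (the salt fan-out N is an assumed tuning knob, not a prescribed value):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
N = 8  # salt fan-out: how many sub-keys each hot key is spread across (assumed knob)

# Synthetic skew: roughly 90% of fact rows share key 0.
facts = spark.range(1_000_000).select(
    F.when(F.rand() < 0.9, F.lit(0)).otherwise((F.rand() * 100).cast("int")).alias("key"),
    F.col("id").alias("amount"),
)
dims = spark.createDataFrame([(i, f"dim_{i}") for i in range(100)], ["key", "label"])

# Fact side: a random salt splits each hot key across N shuffle partitions.
facts_salted = facts.withColumn("salt", (F.rand() * N).cast("int"))

# Dim side: replicate every row once per salt value so each salted fact row matches.
salts = spark.range(N).select(F.col("id").cast("int").alias("salt"))
dims_salted = dims.crossJoin(salts)

joined = facts_salted.join(dims_salted, ["key", "salt"]).drop("salt")
joined.groupBy("label").sum("amount").show()
```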

📚 Resources

  • Databricks Spark documentation
  • Delta Lake official guide
  • Spark: The Definitive Guide by Bill Chambers & Matei Zaharia