Module-3 – Spark & Delta Performance

🎯 Objectives

  • Master Spark DataFrame API & Catalyst optimizer.
  • Optimize jobs: partitioning, caching, broadcast joins, AQE.
  • Leverage Delta Lake: ACID, schema evolution, time travel.

πŸ—“οΈ Weekly Plan

  • Week 9 – Spark architecture: DAG, stages, tasks, executors.
  • Week 10 – Performance tuning: shuffles, partitioning, bucketing, broadcast hints.
  • Week 11 – AQE & advanced caching strategies (see the config sketch after this plan).
  • Week 12 – Delta Lake deep dive: MERGE, Z-ordering, compaction, time travel.
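
The AQE and caching topics in Week 11 boil down to a handful of session settings. A minimal configuration sketch, assuming Spark 3.x (AQE is on by default since 3.2, but the flags are shown explicitly):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

# Adaptive Query Execution: re-plans the query at runtime from shuffle statistics.
spark = (
    SparkSession.builder
    .appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge tiny shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split oversized join partitions
    .getOrCreate()
)

# Caching: persist an expensive intermediate result that several actions reuse.
df = spark.range(1_000_000).selectExpr("id", "id % 100 AS bucket")
df.persist(StorageLevel.MEMORY_AND_DISK)  # spills to disk if it outgrows memory
df.count()                                # first action materializes the cache
df.groupBy("bucket").count().show()       # served from the cached data
df.unpersist()                            # release executor memory when done
```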

🔑 Key Concepts

1. Spark APIs

  • RDD vs DataFrame vs Dataset.
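
A minimal sketch of the contrast, assuming PySpark (the typed Dataset API exists only in Scala/Java, so in Python the practical choice is RDD vs DataFrame):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("api-comparison").getOrCreate()
rows = [("a", 1), ("b", 2), ("a", 3)]

# RDD: low-level functional API; the lambda is opaque, so Catalyst cannot optimize it.
rdd_sums = spark.sparkContext.parallelize(rows).reduceByKey(lambda x, y: x + y)
print(rdd_sums.collect())

# DataFrame: declarative column expressions that Catalyst can analyze and rewrite.
df = spark.createDataFrame(rows, ["key", "value"])
df.groupBy("key").sum("value").show()
```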

2. Catalyst Optimizer

  • Logical plan → optimized logical plan → physical plan.
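
explain() lets you watch these stages for any query. A sketch assuming Spark 3.x, where mode="extended" prints every plan:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = spark.range(1_000).filter("id > 10").select("id")
# Prints the parsed and analyzed logical plans, the optimized logical plan
# (with, e.g., filters pushed down), and the chosen physical plan.
query.explain(mode="extended")
```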

3. Joins & Shuffles

  • Broadcast joins vs shuffle joins, skew mitigation.
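
A sketch of forcing the broadcast strategy with the broadcast() hint (the table and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.range(1_000_000).selectExpr("id % 100 AS key", "id AS amount")
dims = spark.createDataFrame([(i, f"dim_{i}") for i in range(100)], ["key", "label"])

# The hint replicates the small `dims` table to every executor,
# so the large `facts` side is never shuffled across the network.
joined = facts.join(broadcast(dims), "key")
joined.explain()  # expect BroadcastHashJoin rather than SortMergeJoin
```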

4. File Formats

  • Parquet internals, Delta Lake commit log.
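
The commit log is plain JSON, so you can inspect it directly. A sketch assuming the delta-spark pip package (configure_spark_with_delta_pip wires in the required extensions); the path is illustrative:

```python
import json
import os

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .appName("delta-log-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/log_demo"
spark.range(10).write.format("delta").mode("overwrite").save(path)

# Each commit is a numbered JSON file under _delta_log; every line is one
# action (commitInfo, protocol, metaData, add, remove, ...).
log_dir = os.path.join(path, "_delta_log")
for name in sorted(f for f in os.listdir(log_dir) if f.endswith(".json")):
    with open(os.path.join(log_dir, name)) as fh:
        for line in fh:
            print(name, list(json.loads(line).keys()))
```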

5. Delta Lake Features

  • ACID transactions, time travel, schema evolution.
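
A sketch exercising schema evolution and time travel on a local table (assumes a Delta-enabled session like the one above; path and column names are illustrative):

```python
from pyspark.sql.functions import lit

path = "/tmp/features_demo"

# Version 0: initial write.
spark.range(5).write.format("delta").mode("overwrite").save(path)

# Version 1: schema evolution, appending rows that carry an extra column.
(spark.range(5, 10)
    .withColumn("flag", lit(True))
    .write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # allow the new `flag` column
    .save(path))

# Time travel: read the table exactly as it stood at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```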

6. Partitioning & Z-Ordering

  • Static vs dynamic partitioning, Z-order clustering.
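
A sketch combining the two, assuming open-source Delta Lake 2.0+ (which added OPTIMIZE and ZORDER BY); the column names are illustrative:

```python
from pyspark.sql.functions import expr

events = spark.range(10_000).select(
    expr("CAST(date_add('2024-01-01', CAST(id % 30 AS INT)) AS STRING) AS event_date"),
    expr("CAST(id % 500 AS INT) AS user_id"),
)

# Static partitioning: one directory per event_date value.
events.write.format("delta").partitionBy("event_date").mode("overwrite").save("/tmp/events")

# Z-ordering: cluster rows with nearby user_id values inside each partition
# so data skipping can prune files for user_id predicates.
spark.sql("OPTIMIZE delta.`/tmp/events` ZORDER BY (user_id)")
```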

🔨 Mini-Projects

  • Taxi Data Pipeline: Ingest 1 GB CSV → Delta Bronze/Silver/Gold on ADLS Gen2.
  • Skew Handling: Benchmark & fix a skewed join via broadcast/salting (see the salting sketch after this list).
  • Delta Demo: Implement schema evolution & time-travel queries.
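
For the skew-handling project, a minimal salting sketch on synthetic data (the salt fan-out N is an assumed tuning knob, not a prescribed value):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
N = 8  # salt fan-out: how many sub-keys each hot key is spread across (assumed knob)

# Synthetic skew: roughly 90% of fact rows share key 0.
facts = spark.range(1_000_000).select(
    F.when(F.rand() < 0.9, F.lit(0)).otherwise((F.rand() * 100).cast("int")).alias("key"),
    F.col("id").alias("amount"),
)
dims = spark.createDataFrame([(i, f"dim_{i}") for i in range(100)], ["key", "label"])

# Fact side: a random salt splits each hot key across N shuffle partitions.
facts_salted = facts.withColumn("salt", (F.rand() * N).cast("int"))

# Dim side: replicate every row once per salt value so each salted fact row matches.
salts = spark.range(N).select(F.col("id").cast("int").alias("salt"))
dims_salted = dims.crossJoin(salts)

joined = facts_salted.join(dims_salted, ["key", "salt"]).drop("salt")
joined.groupBy("label").sum("amount").show()
```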

📚 Resources

  • Databricks Spark documentation
  • Delta Lake official guide
  • Spark: The Definitive Guide by Bill Chambers & Matei Zaharia