Module 3 – Spark & Delta Performance
🎯 Objectives
- Master Spark DataFrame API & Catalyst optimizer.
- Optimize jobs: partitioning, caching, broadcast joins, AQE.
- Leverage Delta Lake: ACID, schema evolution, time travel.
🗓️ Weekly Plan
- Week 9 – Spark architecture: DAG, stages, tasks, executors.
- Week 10 – Performance tuning: shuffles, partitioning, bucketing, broadcast hints.
- Week 11 – AQE & advanced caching strategies.
- Week 12 – Delta Lake deep dive: MERGE, Z-ordering, compaction, time travel.
🔑 Key Concepts
1. Spark APIs
- RDD vs DataFrame vs Dataset.
2. Catalyst Optimizer
- Logical plan → optimized logical plan → physical plan.
3. Joins & Shuffles
- Broadcast joins vs shuffle joins, skew mitigation.
4. File Formats
- Parquet internals, Delta Lake commit log.
5. Delta Lake Features
- ACID transactions, time travel, schema evolution.
6. Partitioning & Z-Ordering
- Static vs dynamic partitioning, Z-order clustering.
🎨 Mini-Projects
- Taxi Data Pipeline: Ingest 1 GB CSV → Delta Bronze/Silver/Gold on ADLS Gen2.
- Skew Handling: Benchmark & fix a skewed join via broadcast/salting.
- Delta Demo: Implement schema evolution & time-travel queries.
📚 Resources
- Databricks Spark documentation
- Delta Lake official guide
- Spark: The Definitive Guide by Bill Chambers & Matei Zaharia