30-Week Azure Data-Engineering Study Plan
Each week ~15–20 hrs.
Week 1 – Data-Warehouse Overview; Inmon vs Kimball; Fact/Grain Choices (20 h)
- Read IBM’s “What is a Data Warehouse?” (1 h)
- Read Inmon’s Building the Data Warehouse, Intro & Chapter 1 (2 h)
- Read Kimball Toolkit Chapters 1–2 (star schemas & conformed dims) (2 h)
- Watch “Inmon vs Kimball” on YouTube (45 min)
- Sketch 3 fact-table grains for a sample domain (1 h)
- Hands-on: design an ERD for transactional & snapshot facts (2 h)
- Build a one-page Inmon vs Kimball cheat-sheet (2 h)
- Create Anki flashcards for key terms (1 h)
- Answer 5 mini “design a DW” prompts (2 h)
- Review & record yourself explaining both approaches (2.5 h)
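
For the transactional vs snapshot fact exercise in Week 1, the following is a minimal sketch in plain Python of how the two grains relate. The bank-account domain, the field names (`account`, `day`, `amount`, `balance`) and the sample rows are all hypothetical, chosen only for illustration. It derives a periodic-snapshot fact (one row per account per day) from a transaction-grain fact (one row per movement); days with no transactions at all are skipped, which a real snapshot table would usually fill in.

```python
from collections import defaultdict
from datetime import date

# Transaction-grain fact: one row per account movement (hypothetical sample data).
transactions = [
    {"account": "A", "day": date(2024, 1, 1), "amount": 100},
    {"account": "A", "day": date(2024, 1, 1), "amount": -30},
    {"account": "B", "day": date(2024, 1, 1), "amount": 50},
    {"account": "A", "day": date(2024, 1, 2), "amount": 20},
]


def daily_snapshot(rows):
    """Build a periodic-snapshot fact: one closing balance per account per day."""
    by_day = defaultdict(list)
    for row in rows:
        by_day[row["day"]].append(row)

    balance = defaultdict(int)
    snapshot = []
    for day in sorted(by_day):
        # Apply every movement of the day before taking the snapshot.
        for row in by_day[day]:
            balance[row["account"]] += row["amount"]
        # Emit a row for every account seen so far, even if it was idle today.
        for account in sorted(balance):
            snapshot.append({"account": account, "day": day, "balance": balance[account]})
    return snapshot


snapshot = daily_snapshot(transactions)
for row in snapshot:
    print(row)
```

The transaction table has four rows and so does the snapshot, but they answer different questions: the first supports "what happened?", the second "what was the state at end of day?". Account B gets a snapshot row on the second day even though it had no movement.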
Week 2 – Slowly Changing Dimensions (Types 1–6) (18 h)
- Read Kimball Toolkit Ch 3 on SCD patterns (2 h)
- Read blog posts on Types 4 & 6 from Kimball Group (1 h)
- Hands-on: implement SCD 1/2/3 in SQL on sample data (3 h)
- Build a Python/Polars pipeline for SCD 2 with history table (3 h)
- Write an SQL stored proc for Type 2 merge logic (2 h)
- Flashcards: pros/cons of each SCD type (1 h)
- Scenario Q&A: 10 design prompts (2 h)
- Review & refine your implementations (4 h)
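
As a starting point for the SCD 2 items in Week 2, here is a minimal sketch of Type 2 merge logic in plain Python (no Polars or SQL dependency, so it runs anywhere). The dimension layout (`customer_id`, `city`, `valid_from`, `valid_to`, `is_current`) and the `9999-12-31` open-ended date are assumptions for illustration, not a prescribed schema. Only `city` is treated as a tracked attribute.

```python
from datetime import date

# Sentinel "still valid" end date; a common convention, assumed here.
OPEN_END = date(9999, 12, 31)


def apply_scd2(dimension, incoming, load_date):
    """Apply Type 2 changes in place: expire changed rows and append new versions."""
    current = {row["customer_id"]: row for row in dimension if row["is_current"]}
    for record in incoming:
        existing = current.get(record["customer_id"])
        if existing is not None and existing["city"] == record["city"]:
            continue  # tracked attribute unchanged, nothing to do
        if existing is not None:
            # Close the old version instead of overwriting it (that would be Type 1).
            existing["valid_to"] = load_date
            existing["is_current"] = False
        new_row = {
            "customer_id": record["customer_id"],
            "city": record["city"],
            "valid_from": load_date,
            "valid_to": OPEN_END,
            "is_current": True,
        }
        dimension.append(new_row)
        current[record["customer_id"]] = new_row
    return dimension


dim = [
    {"customer_id": 1, "city": "Oslo", "valid_from": date(2023, 1, 1),
     "valid_to": OPEN_END, "is_current": True},
]
changes = [{"customer_id": 1, "city": "Bergen"}, {"customer_id": 2, "city": "Turku"}]
apply_scd2(dim, changes, date(2024, 6, 1))
for row in dim:
    print(row)
```

After the call, customer 1 has two rows (the expired Oslo version and the current Bergen version) and customer 2 has one. Applying the same `changes` again adds nothing, which is a useful property to check when porting the logic to a SQL `MERGE` or a Polars pipeline.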
Week 3 – Advanced SQL Performance Tuning (18 h)
- Read SQL Performance Explained Ch 2–4 (3 h)
- Hands-on: run EXPLAIN on 10 analytical queries (2 h)
- Add & test clustered/non-clustered/columnstore indexes (3 h)
- Implement range & hash partitioning; test pruning (2 h)
- Deep dive into columnstore indexes blog (1 h)
- Mini-Project: optimize a dashboard query on 1 GB dataset (4 h)
- Flashcards: index & partition concepts (1 h)
- Self-quiz on tuning strategies (2 h)
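
The EXPLAIN and indexing items in Week 3 can be rehearsed without a database server. The sketch below uses Python's built-in `sqlite3` module as a stand-in engine; SQLite has no clustered or columnstore indexes, so this only illustrates the general habit of comparing a plan before and after adding an index. The `sales` table and `idx_sales_region` index are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north" if i % 2 else "south", float(i)) for i in range(1000)],
)

QUERY = "SELECT SUM(amount) FROM sales WHERE region = 'north'"


def plan(connection, sql):
    """Return the human-readable detail column of SQLite's EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(conn, QUERY)  # expected to be a full table scan
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_after = plan(conn, QUERY)   # expected to search via the new index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, so it is safer to look for the index name in the output than to compare whole strings. The same before/after discipline carries over to `EXPLAIN` in Postgres or execution plans in SQL Server.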
Week 4 – OLAP vs Relational; Materialized Views & ETL Mapping (16 h)
- Read articles on OLAP cube architectures (1 h)
- Create & refresh materialized views in Postgres (2 h)
- Draft source-to-target mapping doc for OLTP→DW (2 h)
- Read Kimball ETL mapping templates (1 h)
- Mini-Project: build SSAS cube vs relational report, compare perf (5 h)
- Flashcards: OLAP vs OLTP trade-offs (1 h)
- Write summary of best practices (2 h)
- Review & self-test (2 h)
Week 5 – pandas vs Polars: Performance & Memory (16 h)
- Read pandas & Polars docs on IO & lazy APIs (2 h)
- Benchmark CSV→Parquet with pandas vs Polars (3 h)
- Deep dive into Polars lazy mode (1 h)
- Hands-on: build a sample ETL in both libs; measure memory (3 h)
- Mini-Project: Polars pipeline to clean & write Parquet (4 h)
- Flashcards: key API differences (1 h)
- Write a short comparison blog snippet (2 h)
Week 6 – ETL Design Patterns & Idempotency (16 h)
- Read articles on config-driven pipelines (1 h)
- Build YAML/JSON-driven ETL framework (3 h)
- Implement watermarking & safe retry logic (2 h)
- Mini-Project: generic CSV→DB loader with idempotency (5 h)
- Flashcards: design pattern names & use-cases (1 h)
- Self-review & refine code (4 h)
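
For the watermarking and idempotent-loader items in Week 6, the following is a minimal sketch using the standard-library `sqlite3` module. The table names (`target`, `watermark`), the column names and the ISO-8601 string timestamps are assumptions for illustration. Two ideas are combined: a high-watermark that skips rows already loaded, and a key-based upsert so that replaying a row does not create duplicates.

```python
import sqlite3


def load_incremental(conn, source_rows):
    """Load rows newer than the stored watermark; safe to re-run with the same input."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target "
        "(id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS watermark (name TEXT PRIMARY KEY, value TEXT)")

    row = conn.execute("SELECT value FROM watermark WHERE name = 'target'").fetchone()
    last_seen = row[0] if row else ""
    # ISO-8601 timestamps in one format compare correctly as plain strings.
    new_rows = [r for r in source_rows if r["updated_at"] > last_seen]

    with conn:  # one transaction: data and watermark commit or roll back together
        for r in new_rows:
            conn.execute(
                "INSERT OR REPLACE INTO target (id, payload, updated_at) VALUES (?, ?, ?)",
                (r["id"], r["payload"], r["updated_at"]),
            )
        if new_rows:
            conn.execute(
                "INSERT OR REPLACE INTO watermark (name, value) VALUES ('target', ?)",
                (max(r["updated_at"] for r in new_rows),),
            )
    return len(new_rows)


conn = sqlite3.connect(":memory:")
batch = [
    {"id": 1, "payload": "a", "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "payload": "b", "updated_at": "2024-01-02T00:00:00"},
]
first_run = load_incremental(conn, batch)
second_run = load_incremental(conn, batch)  # replaying the same batch loads nothing
print(first_run, second_run)
```

One simplification worth noting: the strict `>` comparison would miss a row that arrives later with exactly the same timestamp as the watermark. Real loaders often use `>=` together with the upsert, or track a sequence number instead.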
Week 7 – Testing & Logging in Python (16 h)
- Read pytest docs on fixtures & parametrization (1 h)
- Write unit tests for ETL transforms (3 h)
- Read Python logging cookbook (1 h)
- Implement structured JSON logging & retries (2 h)
- Mini-Project: add tests & logs to your Week 6 pipeline (6 h)
- Review test coverage & log outputs (2 h)
- Quiz yourself on pytest & logging concepts (1 h)
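
For the structured JSON logging and retries item in Week 7, here is a minimal sketch built only on the standard-library `logging`, `json` and `time` modules. `JsonFormatter`, `retry`, the `context` key passed through `extra`, and `flaky_extract` are hypothetical names introduced for this example; dedicated libraries exist for both concerns and may be a better fit in a real project.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} become attributes on the record.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def retry(func, attempts=3, delay_seconds=0.0, logger=None):
    """Call func until it succeeds or attempts run out, logging each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if logger:
                logger.warning(
                    "attempt failed",
                    extra={"context": {"attempt": attempt, "error": str(exc)}},
                )
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(delay_seconds)


logger = logging.getLogger("etl")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

calls = {"count": 0}


def flaky_extract():
    """Simulated source that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source unavailable")
    return "ok"


result = retry(flaky_extract, attempts=3, logger=logger)
```

Each failed attempt produces one JSON line with `attempt` and `error` fields, which is easy to filter in a log aggregator. A fixed delay is used here for brevity; exponential backoff is a common refinement.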
Week 8 – CI Basics with GitHub Actions (14 h)
- Read GH Actions Python CI guide (1 h)
- Create .github/workflows/ci.yml to run pytest & flake8 (3 h)
- Read Poetry packaging docs (1 h)
- Configure Poetry & lock file for your project (2 h)
- Mini-Project: integrate CI into your Week 7 repo (5 h)
- Review CI logs & fix failures (2 h)
Weeks 9–12 – Spark & Delta Performance Module (~18 h/wk)
- Week 9 – Spark internals (DAG, stages, executors): read Spark: The Definitive Guide Ch 1–2, watch internals video, hands-on DAG inspection, mini-project Spark job (18 h)
- Week 10 – Joins & shuffles: read docs on broadcast vs sort-merge, fix skewed joins, project on sample dataset (18 h)
- Week 11 – AQE & caching: read official blog, enable AQE, benchmark with/without cache, mini-project (18 h)
- Week 12 – Delta Lake deep dive: read Delta Lake guide, implement MERGE/Z-ordering, time-travel queries, project (18 h)
Weeks 13–16 – Streaming & Data Quality Module (~17 h/wk)
- Week 13 – Lambda vs Kappa & Event Hubs basics: articles & hands-on ingestion (17 h)
- Week 14 – Structured Streaming APIs: triggers, output modes, checkpoints, code labs (17 h)
- Week 15 – Stateful processing: window ops, watermark cleanup, demos (17 h)
- Week 16 – Data quality frameworks: Great Expectations suites & dbt tests, quality dashboard (17 h)
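
Before touching Spark for the Week 15 window and watermark topics, it can help to model the idea in plain Python. The sketch below is a simplified toy model, not Spark's exact semantics: events are `(event_time, key)` tuples with integer timestamps, the watermark is the highest event time seen so far minus an allowed lateness, and an event is dropped once its tumbling window has ended at or before the watermark. The function name and parameters are hypothetical.

```python
from collections import defaultdict


def tumbling_counts(events, window_seconds, allowed_lateness):
    """Count events per event-time window, dropping events whose window is closed."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, key in events:  # events are visited in arrival order
        watermark = max_event_time - allowed_lateness
        window_start = event_time - event_time % window_seconds
        if window_start + window_seconds <= watermark:
            # The window ended before the watermark, so its state is considered final.
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
        max_event_time = max(max_event_time, event_time)
    return dict(counts), dropped


# Arrival order differs from event-time order: 3 and 8 arrive late.
events = [(1, "a"), (4, "a"), (12, "a"), (3, "a"), (25, "a"), (8, "a")]
counts, dropped = tumbling_counts(events, window_seconds=10, allowed_lateness=5)
print(counts)
print(dropped)
```

In this run the late event at time 3 is still counted, because the watermark is only 7 when it arrives and its window `[0, 10)` is still open. The event at time 8 arrives after time 25 has pushed the watermark to 20, so it is dropped. This is the trade-off the watermark controls: bounded state versus tolerance for late data.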
Weeks 17–20 – Azure Lakehouse Module (~16 h/wk)
- Week 17 – ADLS Gen2 setup & security (RBAC, ACLs, firewall) (16 h)
- Week 18 – Databricks workspace, clusters & notebooks (16 h)
- Week 19 – Unity Catalog governance & lineage (16 h)
- Week 20 – Medallion pattern: implement Bronze/Silver/Gold pipeline (16 h)
Weeks 21–24 – ADF & Synapse Module (~16 h/wk)
- Week 21 – ADF pipelines: linked services, datasets, triggers (16 h)
- Week 22 – Mapping Data Flows: transformations & expressions (16 h)
- Week 23 – Synapse SQL pools: serverless vs dedicated tuning (16 h)
- Week 24 – CI/CD: ARM templates & Git integration for ADF/Synapse (16 h)
Weeks 25–28 – Fabric Lakehouse & Real-Time Module (~16 h/wk)
- Week 25 – Fabric architecture: OneLake & shortcuts (16 h)
- Week 26 – Fabric Data Factory pipelines & notebooks (16 h)
- Week 27 – DirectLake in Power BI: live query patterns (16 h)
- Week 28 – Governance & lifecycle: roles, promotion pipelines (16 h)
Weeks 29–30 – Interview & System Design Module (~15 h/wk)
- Week 29 – Résumé & LinkedIn optimization; project storytelling (15 h)
- Week 30 – Mock interviews: STAR, technical Q&A & system-design drills (15 h)