Data Engineering Wiki

🚀 Module 1 – DW & SQL Foundations

Module-1 – DW & SQL Foundations

Module-1 – DW & SQL Foundations

🎯 Objectives

Grasp dimensional-modeling patterns (Kimball vs Inmon).
Master advanced Slowly Changing Dimensions (Types 1–6).
Optimize SQL queries for analytics.
Design a small data-warehouse schema end-to-end.

🗓️ Weekly Plan

Week 1 – Data-warehouse overview; Inmon vs Kimball; fact/grain choices.
Week 2 – Deep dive into SCD Types 1, 2, 3, 4, 6 (hybrids).
Week 3 – SQL performance tuning: explain plans, indexing, partitioning.
Week 4 – OLAP vs relational, materialized views, ETL mapping.

🔑Key Concepts

1. Schema Patterns

Star schema: single fact table + conformed dimensions
Snowflake schema: normalized dimension tables
Galaxy schema: multiple fact tables sharing dims

2. Dimension Table Types

Conformed dimensions: reused across facts
Junk dimensions: grouping low-cardinality flags
Degenerate dimensions: dimension stored in fact (no separate table)

3. Slowly Changing Dimensions (SCD)

Type 1: overwrite old value
Type 2: add new row + versioning columns (effective_date, end_date)
Type 3: add new attribute column (e.g. prev_address)
Type 4: history table separate from current table
Type 6 (Hybrid): combine Types 1/2/3 for specific use cases

4. Fact Table Granularity

Transaction facts: one row per event
Periodic snapshot facts: one row per period summary
Accumulating snapshot facts: track process lifecycle

5. Keys & Indexing

Surrogate vs natural keys: hidden identity vs business key
Clustered vs non-clustered indexes
Columnstore indexes: for high-performance analytics

6. Partitioning Strategies

Range partitioning: by date or numeric range
List/hHash partitioning: by category or hash function
Benefits: improved prune, maintenance, parallelism

7. Materialized Views

Pre-computed aggregations
Refresh modes: on-demand vs incremental
Use-cases: speeding up complex joins/aggregates

🔨 Mini-Projects

Retail DW Prototype: Model & implement a 3-star schema in Postgres (DDL + ERD).
SCD Pipeline: Load CSV to Type-2 dimension with history tracking.
SQL Tuning Lab: Optimize 10 complex analytical queries on a 1 GB dataset.

📚 Resources

Ralph Kimball’s Dimensional Modeling Techniques
SQL Performance Explained by Markus Winand
PostgreSQL docs: partitioning & indexing