Module-4 – Streaming & Data Quality

🎯 Objectives

  • Build robust streaming pipelines using Spark Structured Streaming.
  • Ingest from Azure Event Hubs or Kafka.
  • Implement data-quality checks with Great Expectations & dbt.

🗓️ Weekly Plan

  • Week 13 – Lambda vs Kappa; Event Hubs fundamentals.
  • Week 14 – Structured Streaming: triggers, checkpoints, watermarking.
  • Week 15 – Stateful ops: windowed aggregates, late data handling.
  • Week 16 – Data quality: GE suites; dbt tests & lineage.

🔑 Key Concepts

Ingestion Systems

  • Event Hubs vs Kafka: partitions, consumer groups.
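A minimal sketch of consuming a topic with PySpark, assuming a broker at localhost:9092 and a topic named events (both placeholders) and the spark-sql-kafka connector on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-source-demo").getOrCreate()

# Subscribe to a topic; Spark tracks offsets per partition, and each
# streaming query behaves like its own consumer group.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers key/value as binary; cast to strings for downstream parsing.
decoded = events.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("partition"),
    col("offset"),
)
```

Because Azure Event Hubs exposes a Kafka-compatible endpoint, the same source code can target an Event Hubs namespace by swapping the bootstrap server and adding the appropriate SASL options.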

Structured Streaming API

  • readStream/writeStream, output modes, triggers.
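A hedged sketch of the write side, using the built-in rate source for demo input; the sink, output mode, and trigger interval are illustrative choices:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writestream-demo").getOrCreate()

# The rate source generates (timestamp, value) rows -- handy for demos.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (
    stream.writeStream
    .format("console")                      # sink
    .outputMode("append")                   # append | complete | update
    .trigger(processingTime="10 seconds")   # micro-batch every 10 seconds
    .start()
)
query.awaitTermination()
```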

Fault Tolerance

  • Checkpoints & write-ahead logs; end-to-end exactly-once also requires a replayable source and an idempotent sink.
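A sketch of where the checkpoint fits (paths are placeholders). On restart with the same checkpointLocation, Spark replays from the last committed offsets; paired with a replayable source and an idempotent sink such as the file sink, this gives exactly-once results:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
stream = spark.readStream.format("rate").load()

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "/tmp/out/events")                # placeholder sink path
    .option("checkpointLocation", "/tmp/chk/events")  # offsets, state, and WAL live here
    .outputMode("append")
    .start()
)
```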

Stateful Processing

  • Window functions, watermark & state cleanup.
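A sketch of a 5-minute tumbling window with a 10-minute watermark; the eventTime column is illustrative (the rate source's timestamp stands in for real event time). Events arriving later than the watermark are dropped, and the finalized window state is reclaimed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

events = (
    spark.readStream.format("rate").load()
    .withColumnRenamed("timestamp", "eventTime")  # pretend this is event time
)

counts = (
    events
    .withWatermark("eventTime", "10 minutes")        # tolerate 10 min of lateness
    .groupBy(window(col("eventTime"), "5 minutes"))  # tumbling 5-minute windows
    .count()
)

# In append mode, a window is emitted once, after the watermark passes its end.
query = counts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```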

Data Quality Frameworks

  • Great Expectations expectations & checkpoints (see the sketch after this list).
  • dbt tests: unique, not_null, relationships.
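A minimal Great Expectations sketch using the classic pandas entry point (ge.from_pandas, available in pre-1.0 releases; newer versions restructure the API around contexts and validators). Column names are illustrative:

```python
import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3],
    "status": ["new", "paid", "shipped"],
}))

# Each expectation returns a validation result with a 'success' flag.
r1 = df.expect_column_values_to_not_be_null("order_id")
r2 = df.expect_column_values_to_be_unique("order_id")
r3 = df.expect_column_values_to_be_in_set("status", ["new", "paid", "shipped"])

assert all(r["success"] for r in (r1, r2, r3))
```

The dbt analogues are declared in a model's schema YAML: unique, not_null, accepted_values, and relationships.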

🔨 Mini-Projects

  • JSON Stream → Lakehouse: consume events and land them in Bronze/Silver Delta tables (see the sketch after this list).
  • Late Data Handling: use watermarks to bound and process late-arriving events.
  • Quality Suite: build GE validations + dbt test report.
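One way the first mini-project could start, under assumed paths (/landing/events, /lake/bronze/events, /lake/silver/events are placeholders) and with the delta-spark package on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("json-to-lakehouse").getOrCreate()

schema = (StructType()
          .add("event_id", StringType())
          .add("event_time", TimestampType())
          .add("payload", StringType()))

# Bronze: land raw JSON as-is, plus an ingestion timestamp.
raw = spark.readStream.schema(schema).json("/landing/events")
(raw.withColumn("ingested_at", current_timestamp())
    .writeStream.format("delta")
    .option("checkpointLocation", "/chk/bronze_events")
    .outputMode("append")
    .start("/lake/bronze/events"))

# Silver: read Bronze as a stream, drop malformed rows, and deduplicate.
# Including the watermark column in dropDuplicates lets Spark expire old state.
bronze = spark.readStream.format("delta").load("/lake/bronze/events")
(bronze.filter(col("event_id").isNotNull())
    .withWatermark("event_time", "1 hour")
    .dropDuplicates(["event_id", "event_time"])
    .writeStream.format("delta")
    .option("checkpointLocation", "/chk/silver_events")
    .outputMode("append")
    .start("/lake/silver/events"))
```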

📚 Resources

  • Spark Structured Streaming guide
  • Great Expectations docs
  • dbt official documentation