Module-4 – Streaming & Data Quality

🎯 Objectives

  • Build robust streaming pipelines using Spark Structured Streaming.
  • Ingest from Azure Event Hubs or Kafka.
  • Implement data-quality checks with Great Expectations & dbt.

🗓️ Weekly Plan

  • Week 13 – Lambda vs Kappa; Event Hubs fundamentals.
  • Week 14 – Structured Streaming: triggers, checkpoints, watermarking.
  • Week 15 – Stateful ops: windowed aggregates, late data handling.
  • Week 16 – Data quality: GE suites; dbt tests & lineage.

🔑 Key Concepts

Ingestion Systems

  • Event Hubs vs Kafka: partitions, consumer groups.
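A minimal sketch of consuming a topic with PySpark, assuming a broker at localhost:9092 and a topic named events (both placeholders) and the spark-sql-kafka connector on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-source-demo").getOrCreate()

# Subscribe to a topic; Spark tracks offsets per partition, and each
# streaming query behaves like its own consumer group.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
    .option("subscribe", "events")                        # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers key/value as binary; cast to strings for downstream parsing.
decoded = events.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("partition"),
    col("offset"),
)
```

Because Azure Event Hubs exposes a Kafka-compatible endpoint, the same source code can target an Event Hubs namespace by swapping the bootstrap server and adding the appropriate SASL options.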

Structured Streaming API

  • readStream/writeStream, output modes, triggers.
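A hedged sketch of the write side, using the built-in rate source for demo input; the sink, output mode, and trigger interval are illustrative choices:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("writestream-demo").getOrCreate()

# The rate source generates (timestamp, value) rows -- handy for demos.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (
    stream.writeStream
    .format("console")                      # sink
    .outputMode("append")                   # append | complete | update
    .trigger(processingTime="10 seconds")   # micro-batch every 10 seconds
    .start()
)
query.awaitTermination()
```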

Fault Tolerance

  • Checkpoints & write-ahead logs; end-to-end exactly-once also requires a replayable source and an idempotent sink.
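A sketch of where the checkpoint fits (paths are placeholders). On restart with the same checkpointLocation, Spark replays from the last committed offsets; paired with a replayable source and an idempotent sink such as the file sink, this gives exactly-once results:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
stream = spark.readStream.format("rate").load()

query = (
    stream.writeStream
    .format("parquet")
    .option("path", "/tmp/out/events")                # placeholder sink path
    .option("checkpointLocation", "/tmp/chk/events")  # offsets, state, and WAL live here
    .outputMode("append")
    .start()
)
```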

Stateful Processing

  • Window functions, watermark & state cleanup.
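A sketch of a 5-minute tumbling window with a 10-minute watermark; the eventTime column is illustrative (the rate source's timestamp stands in for real event time). Events arriving later than the watermark are dropped, and the finalized window state is reclaimed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

events = (
    spark.readStream.format("rate").load()
    .withColumnRenamed("timestamp", "eventTime")  # pretend this is event time
)

counts = (
    events
    .withWatermark("eventTime", "10 minutes")        # tolerate 10 min of lateness
    .groupBy(window(col("eventTime"), "5 minutes"))  # tumbling 5-minute windows
    .count()
)

# In append mode, a window is emitted once, after the watermark passes its end.
query = counts.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```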

Data Quality Frameworks

  • Great Expectations expectations & checkpoints (see the sketch after this list).
  • dbt tests: unique, not_null, relationships.
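A minimal Great Expectations sketch using the classic pandas entry point (ge.from_pandas, available in pre-1.0 releases; newer versions restructure the API around contexts and validators). Column names are illustrative:

```python
import pandas as pd
import great_expectations as ge

df = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3],
    "status": ["new", "paid", "shipped"],
}))

# Each expectation returns a validation result with a 'success' flag.
r1 = df.expect_column_values_to_not_be_null("order_id")
r2 = df.expect_column_values_to_be_unique("order_id")
r3 = df.expect_column_values_to_be_in_set("status", ["new", "paid", "shipped"])

assert all(r["success"] for r in (r1, r2, r3))
```

The dbt analogues are declared in a model's schema YAML: unique, not_null, accepted_values, and relationships.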

🔨 Mini-Projects

  • JSON Stream → Lakehouse: consume events and land them in Bronze/Silver Delta tables (see the sketch after this list).
  • Late Data Handling: use watermarks to bound and process late-arriving events.
  • Quality Suite: build GE validations + dbt test report.
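One way the first mini-project could start, under assumed paths (/landing/events, /lake/bronze/events, /lake/silver/events are placeholders) and with the delta-spark package on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("json-to-lakehouse").getOrCreate()

schema = (StructType()
          .add("event_id", StringType())
          .add("event_time", TimestampType())
          .add("payload", StringType()))

# Bronze: land raw JSON as-is, plus an ingestion timestamp.
raw = spark.readStream.schema(schema).json("/landing/events")
(raw.withColumn("ingested_at", current_timestamp())
    .writeStream.format("delta")
    .option("checkpointLocation", "/chk/bronze_events")
    .outputMode("append")
    .start("/lake/bronze/events"))

# Silver: read Bronze as a stream, drop malformed rows, and deduplicate.
# Including the watermark column in dropDuplicates lets Spark expire old state.
bronze = spark.readStream.format("delta").load("/lake/bronze/events")
(bronze.filter(col("event_id").isNotNull())
    .withWatermark("event_time", "1 hour")
    .dropDuplicates(["event_id", "event_time"])
    .writeStream.format("delta")
    .option("checkpointLocation", "/chk/silver_events")
    .outputMode("append")
    .start("/lake/silver/events"))
```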

📚 Resources

  • Spark Structured Streaming guide
  • Great Expectations docs
  • dbt official documentation