# Module-4 – Streaming & Data Quality
## Objectives
- Build robust streaming pipelines using Spark Structured Streaming.
- Ingest from Azure Event Hubs or Kafka.
- Implement data-quality checks with Great Expectations & dbt.
## Weekly Plan
- Week 13 – Lambda vs Kappa; Event Hubs fundamentals.
- Week 14 – Structured Streaming: triggers, checkpoints, watermarking.
- Week 15 – Stateful ops: windowed aggregates, late-data handling.
- Week 16 – Data quality: Great Expectations suites; dbt tests & lineage.
## Key Concepts
### Ingestion Systems
- Event Hubs vs Kafka: partitions, consumer groups.
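The partition/consumer-group mechanics behind that bullet can be sketched in plain Python. This is a toy illustration, not client code: `assign_partition` and `assign_consumers` are invented names, and real clients use murmur2 hashing (Kafka) or service-side hashing (Event Hubs).

```python
import zlib

def assign_partition(key, num_partitions=4):
    """Toy key->partition mapping: same key always lands on the same
    partition, preserving per-key ordering. (Illustrative only; Kafka's
    default partitioner uses murmur2, not CRC32.)"""
    return zlib.crc32(key.encode()) % num_partitions

def assign_consumers(partitions, consumers):
    """Toy round-robin group assignment: each partition is owned by
    exactly one consumer in the group, so the group as a whole reads
    every event exactly once."""
    return {p: consumers[i % len(consumers)] for i, p in enumerate(partitions)}
```

For example, with partitions `[0, 1, 2, 3]` and a group of two consumers, each consumer owns two partitions; adding a consumer triggers a rebalance that redistributes ownership.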
### Structured Streaming API
- `readStream`/`writeStream`, output modes, triggers.
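A minimal sketch of that read-to-write loop, wrapped as a function so it can be wired into any PySpark session. The schema, source directory, checkpoint path, and table location below are illustrative assumptions, not part of the module.

```python
def start_bronze_ingest(spark):
    """Sketch: stream JSON events into a Delta 'Bronze' table.

    `spark` is an existing SparkSession; all paths and the schema
    are hypothetical placeholders.
    """
    events = (
        spark.readStream
        .format("json")
        .schema("device STRING, reading DOUBLE, event_time TIMESTAMP")
        .load("/landing/events")            # source directory (assumed)
    )
    return (
        events.writeStream
        .format("delta")
        .outputMode("append")               # emit only newly arrived rows
        .option("checkpointLocation", "/checkpoints/bronze_events")
        .trigger(processingTime="30 seconds")  # micro-batch every 30s
        .start("/tables/bronze_events")     # Delta table path (assumed)
    )
```

The `checkpointLocation` option is what ties this to the fault-tolerance story: on restart, Spark replays from the recorded offsets instead of reprocessing or skipping data.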
### Fault Tolerance
- Checkpoints & write-ahead logs for exactly-once processing guarantees.
### Stateful Processing
- Windowed aggregations, watermarks & state cleanup.
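To build intuition for watermark semantics before touching Spark, here is a plain-Python simulation (not Spark's API): the watermark trails the maximum event time seen so far, and events older than it are dropped because their window state has already been cleaned up.

```python
from collections import defaultdict

def windowed_counts(events, window_sec=60, delay_sec=30):
    """Toy simulation of tumbling-window counts with a watermark.

    events: iterable of (event_time_sec, key) pairs in arrival order.
    The watermark = max event time seen - delay_sec; anything older
    is dropped, mirroring Spark's state cleanup for late data.
    """
    counts = defaultdict(int)
    dropped = []
    max_seen = 0
    for ts, key in events:
        max_seen = max(max_seen, ts)
        watermark = max_seen - delay_sec
        if ts < watermark:
            dropped.append((ts, key))       # too late: window state is gone
            continue
        window_start = (ts // window_sec) * window_sec
        counts[(window_start, key)] += 1    # tumbling-window aggregate
    return dict(counts), dropped

# The event at t=20 arrives after t=70 has pushed the watermark to 40,
# so it is dropped rather than counted into the [0, 60) window.
counts, dropped = windowed_counts([(10, "a"), (70, "a"), (20, "a")])
```

In Spark the same trade-off is expressed with `withWatermark("event_time", "30 seconds")`: a longer delay tolerates later data but keeps state (and memory) around longer.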
### Data Quality Frameworks
- Great Expectations: expectation suites & checkpoints.
- dbt tests: unique, not_null, relationships.
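Those three dbt generic tests are declared in a model's YAML schema file. A minimal sketch, with hypothetical model and column names (`silver_events`, `dim_devices`):

```yaml
# models/schema.yml — model and column names are illustrative
version: 2

models:
  - name: silver_events
    columns:
      - name: event_id
        tests:
          - unique
          - not_null
      - name: device_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_devices')
              field: device_id
```

Running `dbt test` compiles each declaration into a SQL query that must return zero failing rows, which makes the checks versionable and visible in dbt's lineage graph.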
## Mini-Projects
- JSON Stream → Lakehouse: consume events into Bronze/Silver Delta tables.
- Late Data Handling: manage late-arriving events with watermarks.
- Quality Suite: build Great Expectations validations plus a dbt test report.
## Resources
- Spark Structured Streaming guide
- Great Expectations docs
- dbt official documentation