Module-2 – Python ETL & CI

🎯 Objectives

  • Build modular ETL pipelines in Python (Polars/pandas).
  • Apply best practices: testing, logging, error handling.
  • Set up CI/CD workflows for data code.

🗓️ Weekly Plan

  • Week 5 – pandas vs Polars; memory & performance trade-offs.
  • Week 6 – ETL patterns: idempotency, config-driven pipelines.
  • Week 7 – Code quality: pytest fixtures, structured logging, retry logic.
  • Week 8 – CI basics: GitHub Actions for tests & linting; packaging with Poetry.

🔑 Key Concepts

1. Data Libraries

  • pandas: DataFrame basics, I/O APIs, groupby operations.
  • Polars: lazy vs eager execution, memory efficiency.
  • PyArrow: zero-copy IPC, Parquet integration.

2. ETL Design Patterns

  • Incremental loads: watermark columns, CDC.
  • Idempotent pipelines: safe retries without duplicates.
  • Configuration: YAML/JSON configs, environment variables.
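A minimal sketch of how a watermark column and an idempotent write fit together, using only the standard-library `sqlite3` module so it runs anywhere. The `events` table, its columns, and the `load_incremental` function are hypothetical names chosen for illustration; a Postgres target would typically use `INSERT ... ON CONFLICT` instead of SQLite's `INSERT OR REPLACE`.

```python
import sqlite3


def load_incremental(conn, source_rows):
    """Load rows newer than the stored watermark; return how many were written.

    source_rows: iterable of (id, payload, updated_at) tuples, where
    updated_at is an ISO-8601 string (these compare correctly as text).
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
    )
    # Watermark: newest updated_at already in the target ('' if the table is empty).
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM events"
    ).fetchone()

    # Incremental step: keep only rows changed after the watermark.
    new_rows = [row for row in source_rows if row[2] > watermark]

    # Idempotent step: writing by primary key replaces an existing row
    # instead of appending a duplicate, so a retry is safe.
    conn.executemany(
        "INSERT OR REPLACE INTO events (id, payload, updated_at) VALUES (?, ?, ?)",
        new_rows,
    )
    conn.commit()
    return len(new_rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    rows = [
        (1, "a", "2024-01-01T00:00:00"),
        (2, "b", "2024-01-02T00:00:00"),
    ]
    print(load_incremental(conn, rows))  # first run writes both rows
    print(load_incremental(conn, rows))  # re-run writes nothing new
```

One caveat of this sketch: the strict `>` comparison skips a late-arriving row whose timestamp equals the current watermark. Using `>=` together with the keyed write is one way to trade a little rework for completeness.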

3. Testing & Error Handling

  • pytest fixtures: reusable setup/teardown.
  • Mocking I/O: `monkeypatch`, `requests-mock`.
  • Logging: Python `logging` module, retry/backoff strategies.
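Retry with exponential backoff and logging can be sketched in a few lines of standard-library Python. The `retry` helper, its parameters, and the `etl.retry` logger name below are illustrative assumptions, not part of any particular library. The `sleep` argument is injectable so that a test can replace it, the same idea as patching `time.sleep` with pytest's `monkeypatch`.

```python
import logging
import time

logger = logging.getLogger("etl.retry")


def retry(func, attempts=3, base_delay=0.5, exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception wait base_delay * 2**n, then try again.

    Re-raises the last exception once `attempts` calls have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == attempts:
                logger.error("giving up after %d attempts: %s", attempt, exc)
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay
            )
            sleep(delay)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    state = {"calls": 0}

    def flaky():
        # Hypothetical stand-in for an API call that fails twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("temporary failure")
        return "ok"

    print(retry(flaky, attempts=3, base_delay=0.1))
```

Narrowing `exceptions` to transient errors (for example `ConnectionError`) keeps the helper from retrying genuine bugs such as a `KeyError` in parsing code.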

4. Packaging & CI/CD

  • Poetry: `pyproject.toml`, lock files.
  • GitHub Actions: workflows, matrix builds, artifacts.

🔨 Mini-Projects

  • CSV→Postgres ETL Library: create a Python package with unit tests, logging, error handling.
  • API Loader: fetch data from a REST API with retry & backoff.
  • CI Pipeline: configure GitHub Actions to run tests & lint on every push.

📚 Resources

  • Polars & pandas documentation
  • pytest official guide
  • GitHub Actions for Python projects