Data Engineering Wiki

🐍 Module 2 – Python ETL & CI

🎯 Objectives

Build modular ETL pipelines in Python (Polars/pandas).
Apply best practices: testing, logging, error handling.
Set up CI/CD workflows for data code.

🗓️ Weekly Plan

Week 5 – pandas vs Polars; memory & performance trade-offs.
Week 6 – ETL patterns: idempotency, config-driven pipelines.
Week 7 – Code quality: pytest fixtures, structured logging, retry logic.
Week 8 – CI basics: GitHub Actions for tests & linting; packaging with Poetry.

🔑 Key Concepts

1. Data Libraries

pandas: DataFrame basics, IO APIs, group-by aggregations.
Polars: lazy vs eager execution, memory efficiency.
PyArrow: zero-copy IPC, Parquet integration.
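
As a rough illustration of the pandas IO and group-by APIs listed above, the sketch below reads a CSV in chunks and combines per-chunk sums, so peak memory is bounded by the chunk size rather than the file size. The sample data, the column names, and the `total_by_region` helper are invented for this example.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large CSV file.
CSV_TEXT = """region,amount
north,10
south,5
north,7
south,3
east,4
"""


def total_by_region(csv_source, chunksize=2):
    """Sum `amount` per `region`, reading the CSV in fixed-size chunks."""
    partials = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        # Each chunk is a small DataFrame; aggregate it on its own.
        partials.append(chunk.groupby("region")["amount"].sum())
    # Combine the per-chunk partial sums into one Series indexed by region.
    return pd.concat(partials).groupby(level=0).sum()


totals = total_by_region(io.StringIO(CSV_TEXT))
print(totals.to_dict())
```

A Polars lazy query targets the same goal differently: it builds a query plan first and only materializes the result when collected.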

2. ETL Design Patterns

Incremental loads: watermark columns, change data capture (CDC).
Idempotent pipelines: re-running a load yields the same result, so retries are safe and create no duplicates.
Configuration: YAML/JSON configs, environment variables.
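
The watermark and idempotency patterns above can be sketched with the standard-library `sqlite3` module standing in for a real warehouse. The `orders` table, its columns, and the `load_incremental` helper are hypothetical; the upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3


def load_incremental(conn, rows):
    """Upsert rows newer than the stored watermark; return how many were applied."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    # Watermark: the latest updated_at already loaded ('' if the table is empty).
    (watermark,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM orders"
    ).fetchone()
    # ISO-8601 timestamps compare correctly as plain strings.
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    # Upsert keyed on id, so re-applying a row never creates a duplicate.
    conn.executemany(
        "INSERT INTO orders (id, amount, updated_at) "
        "VALUES (:id, :amount, :updated_at) "
        "ON CONFLICT(id) DO UPDATE SET "
        "amount = excluded.amount, updated_at = excluded.updated_at",
        new_rows,
    )
    conn.commit()
    return len(new_rows)


conn = sqlite3.connect(":memory:")
batch = [
    {"id": 1, "amount": 9.5, "updated_at": "2024-01-01T00:00:00"},
    {"id": 2, "amount": 3.0, "updated_at": "2024-01-02T00:00:00"},
]
first = load_incremental(conn, batch)
second = load_incremental(conn, batch)  # simulated retry of the same batch
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(first, second, row_count)
```

The strict `>` comparison assumes no late-arriving row shares the watermark timestamp. Where that can happen, comparing with `>=` and relying on the upsert is the safer choice.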

3. Testing & Error Handling

pytest fixtures: reusable setup/teardown.
Mocking I/O: `monkeypatch`, `requests-mock`.
Logging: Python `logging` module, retry/backoff strategies.
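
One possible shape for the retry/backoff and logging ideas above, using only the standard library. The `retry` helper, its parameters, and the `flaky_fetch` function are illustrative names, not part of any library. The `sleep` argument is injectable so that tests can record delays instead of waiting.

```python
import logging
import time

logger = logging.getLogger("etl.retry")


def retry(func, attempts=3, base_delay=0.5, exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception wait base_delay * 2**n, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == attempts:
                logger.error("giving up after %d attempts: %s", attempt, exc)
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            sleep(delay)


# Hypothetical flaky call that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return {"status": "ok"}


delays = []  # recorded in place of real sleeping
result = retry(
    flaky_fetch,
    attempts=4,
    base_delay=0.1,
    exceptions=(ConnectionError,),
    sleep=delays.append,
)
print(result, calls["n"], delays)
```

Restricting `exceptions` to transient errors keeps the helper from retrying genuine bugs, and adding random jitter to `delay` is a common refinement when many workers retry at once.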

4. Packaging & CI/CD

Poetry: `pyproject.toml`, lock files.
GitHub Actions: workflows, matrix builds, artifacts.

🔨 Mini-Projects

CSV→Postgres ETL Library: create a Python package with unit tests, logging, error handling.
API Loader: fetch data from a REST API with retry & backoff.
CI Pipeline: configure GitHub Actions to run tests & lint on every push.

📚 Resources

Polars & pandas documentation
pytest official guide
GitHub Actions for Python projects