Module-2 – Python ETL & CI
🎯 Objectives
- Build modular ETL pipelines in Python (Polars/pandas).
- Apply best practices: testing, logging, error handling.
- Set up CI/CD workflows for data code.
🗓️ Weekly Plan
- Week 5 – pandas vs Polars; memory & performance trade-offs.
- Week 6 – ETL patterns: idempotency, config-driven pipelines.
- Week 7 – Code quality: pytest fixtures, structured logging, retry logic.
- Week 8 – CI basics: GitHub Actions for tests & linting; packaging with Poetry.
🔑 Key Concepts
1. Data Libraries
- pandas: DataFrame basics, I/O APIs, `groupby` aggregations.
- Polars: lazy vs eager execution, memory efficiency.
- PyArrow: zero-copy IPC, Parquet integration.
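
As a small illustration of the pandas items above (I/O plus a `groupby` aggregation), here is a minimal sketch. The sample data, the column names `region` and `amount`, and the helper `total_by_region` are all hypothetical and chosen only for this example; a real pipeline would read from a file or database instead of an in-memory string.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
RAW_CSV = """region,amount
north,10
south,5
north,7
"""


def total_by_region(csv_text: str) -> pd.DataFrame:
    """Read CSV text and return the summed amount per region."""
    df = pd.read_csv(io.StringIO(csv_text))
    return (
        df.groupby("region", as_index=False)["amount"]
        .sum()
        .sort_values("region")
        .reset_index(drop=True)
    )


totals = total_by_region(RAW_CSV)
print(totals)
```

Polars covers the same ground with its own API and adds a lazy execution mode; comparing the two on memory and speed is the Week 5 topic.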
2. ETL Design Patterns
- Incremental loads: watermark columns, CDC.
- Idempotent pipelines: safe retries without duplicates.
- Configuration: YAML/JSON configs, environment variables.
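
The watermark and idempotency ideas above can be sketched together. This is a minimal example under stated assumptions: SQLite (from the standard library) stands in for the real target database, and the table `target`, the column `updated_at`, and the function `load_increment` are hypothetical names. Keying the write on the primary key means that re-running the same batch overwrites rows instead of duplicating them.

```python
import sqlite3


def load_increment(conn, rows, watermark):
    """Upsert rows newer than `watermark` and return the new watermark.

    `rows` is a list of dicts with keys id, value, updated_at (ISO dates,
    which compare correctly as strings).
    """
    fresh = [r for r in rows if r["updated_at"] > watermark]
    # INSERT OR REPLACE keys on the primary key, so a retried batch
    # overwrites existing rows rather than inserting duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO target (id, value, updated_at) "
        "VALUES (:id, :value, :updated_at)",
        fresh,
    )
    conn.commit()
    return max((r["updated_at"] for r in fresh), default=watermark)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT, updated_at TEXT)"
)

source = [
    {"id": 1, "value": "a", "updated_at": "2024-01-01"},
    {"id": 2, "value": "b", "updated_at": "2024-01-02"},
]

watermark = load_increment(conn, source, watermark="")
# Simulated retry with a stale watermark: same rows, no duplicates.
load_increment(conn, source, watermark="")
row_count = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
# A later run with the stored watermark finds nothing new to load.
unchanged = load_increment(conn, source, watermark=watermark)
```

One caveat with a strict `>` comparison: rows that arrive later with a timestamp equal to the stored watermark are skipped, so real pipelines often overlap the window slightly and rely on the idempotent write to absorb the repeats.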
3. Testing & Error Handling
- pytest fixtures: reusable setup/teardown.
- Mocking I/O: `monkeypatch`, `requests-mock`.
- Logging: Python `logging` module, retry/backoff strategies.
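
To make the logging and retry/backoff item concrete, here is one possible sketch using only the standard library. The helper `retry`, its parameters, and the `flaky` function are hypothetical; exponential backoff (doubling the delay after each failure) is one common strategy, not the only one. The `sleep` argument is injectable so a test can record delays instead of actually waiting.

```python
import logging
import time

logger = logging.getLogger("etl.retry")


def retry(func, *, attempts=3, base_delay=0.5,
          exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception, wait and try again.

    The delay doubles after each failed attempt. The last failure is
    re-raised so the caller still sees the original error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == attempts:
                logger.error("giving up after %d attempts: %s", attempt, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt, exc, delay)
            sleep(delay)


# Usage: a function that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []  # record requested delays instead of sleeping
result = retry(flaky, attempts=5, base_delay=0.1,
               exceptions=(ConnectionError,), sleep=delays.append)
```

In a pytest suite, the same effect could be achieved by replacing `time.sleep` with `monkeypatch` rather than passing `sleep` explicitly; which approach reads better is a design choice.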
4. Packaging & CI/CD
- Poetry: `pyproject.toml`, lock files.
- GitHub Actions: workflows, matrix builds, artifacts.
🔨 Mini-Projects
- CSV→Postgres ETL Library: create a Python package with unit tests, logging, error handling.
- API Loader: fetch data from a REST API with retry & backoff.
- CI Pipeline: configure GitHub Actions to run tests & lint on every push.
📚 Resources
- Polars & pandas documentation
- pytest official guide
- GitHub Actions for Python projects