
Apache Airflow - Complete Guide with Practical Examples

What is Apache Airflow?

Apache Airflow is an open-source workflow orchestration platform that allows you to programmatically author, schedule, and monitor workflows. It was created at Airbnb in 2014 and became an Apache Top-Level Project in 2019.

Core Philosophy: "Workflows as Code"

Unlike GUI-based tools, Airflow treats workflows as Python code, enabling:

  • Version control for pipelines
  • Code reviews for workflow changes
  • Testing of workflow logic
  • Reusability through templates
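For example, the testing benefit above can be exercised with a simple DAG import test. A minimal sketch using pytest and Airflow's DagBag (the file name and dag_folder path are assumptions about this project's layout):

# tests/test_dag_integrity.py - hedged sketch of a DAG import test
from airflow.models import DagBag


def test_dags_load_without_errors():
    # Parse every DAG file under dags/ and fail if any raises an import error
    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
    assert not dag_bag.import_errors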

Core Concepts

1. DAG (Directed Acyclic Graph)

A DAG is a collection of tasks with defined dependencies. "Acyclic" means no circular dependencies.

Task A → Task B → Task C
           ↘
             Task D
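A minimal sketch of the same graph expressed as Airflow code (task ids mirror the diagram; EmptyOperator is used as a stand-in for real work, and the DAG id is hypothetical):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="example_dependencies",   # hypothetical DAG id for illustration
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    task_a = EmptyOperator(task_id="task_a")
    task_b = EmptyOperator(task_id="task_b")
    task_c = EmptyOperator(task_id="task_c")
    task_d = EmptyOperator(task_id="task_d")

    # Task A runs first, then Task B, which fans out to Task C and Task D
    task_a >> task_b >> [task_c, task_d]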

2. Tasks

Individual units of work. Each task is an instance of an Operator.

3. Operators

Templates for predefined tasks:

  • BashOperator: Execute bash commands
  • PythonOperator: Execute Python functions
  • EmailOperator: Send emails
  • SQLExecuteQueryOperator: Execute SQL queries
  • HttpOperator: Make HTTP requests
  • Sensors: Wait for conditions to be met
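A short sketch showing a few of these in one DAG (the task ids, file path, and command are illustrative, not part of this project):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor


def _transform():
    # placeholder for real transformation logic
    print("transforming data")


with DAG(
    dag_id="example_operators",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Sensor: wait for an input file to appear before doing any work
    wait_for_file = FileSensor(task_id="wait_for_file", filepath="/tmp/input.csv", poke_interval=60)

    # BashOperator: run a shell command
    extract = BashOperator(task_id="extract", bash_command="echo extracting data")

    # PythonOperator: run an arbitrary Python callable
    transform = PythonOperator(task_id="transform", python_callable=_transform)

    wait_for_file >> extract >> transform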

4. XComs

Short for "cross-communication": a mechanism for tasks to share small pieces of data, stored in the metadata database.
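A rough sketch of pushing and pulling an XCom between two tasks (the key and value are made up; in practice keep XCom payloads small):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _push(ti):
    # Explicitly push a small value; a value returned from the callable
    # would also be pushed automatically under the key "return_value"
    ti.xcom_push(key="row_count", value=42)


def _pull(ti):
    row_count = ti.xcom_pull(task_ids="push_rows", key="row_count")
    print(f"upstream reported {row_count} rows")


with DAG(dag_id="example_xcom", start_date=datetime(2024, 1, 1), schedule=None, catchup=False) as dag:
    push_rows = PythonOperator(task_id="push_rows", python_callable=_push)
    read_rows = PythonOperator(task_id="read_rows", python_callable=_pull)
    push_rows >> read_rows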

5. Hooks

Interfaces to external systems (databases, APIs, cloud services).
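For instance, a hedged sketch using PostgresHook, assuming the apache-airflow-providers-postgres package is installed and a Connection with conn_id "my_postgres" has been configured (both are assumptions, not part of this setup):

from airflow.providers.postgres.hooks.postgres import PostgresHook


def count_orders():
    # The hook reads host, port, and credentials from the "my_postgres" Connection
    hook = PostgresHook(postgres_conn_id="my_postgres")
    rows = hook.get_records("SELECT count(*) FROM orders")   # hypothetical table
    return rows[0][0]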

6. Connections

Stored credentials and connection info for external systems.

7. Variables

Global key-value store for configuration.
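Both Connections and Variables can be read from task code, as in this small sketch (the names "report_bucket" and "warehouse" are invented for illustration). In DAG files, prefer doing these lookups inside tasks or templates so parsing stays fast:

from airflow.hooks.base import BaseHook
from airflow.models import Variable

# Read a configuration value, with a fallback if the Variable is not set
bucket = Variable.get("report_bucket", default_var="my-default-bucket")

# Look up a stored Connection by its conn_id
conn = BaseHook.get_connection("warehouse")
print(conn.host, conn.login)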


Important Use Cases

🔄 1. ETL/ELT Pipelines

Extract data from sources, transform it, and load to destinations.

📊 2. Data Warehouse Population

Schedule regular data loads from operational databases to analytics warehouses.

🤖 3. ML Pipeline Orchestration

Coordinate training, validation, and deployment of ML models.

📧 4. Report Generation & Distribution

Generate and email periodic business reports.

🔍 5. Data Quality Monitoring

Run validation checks and alert on data anomalies.

☁️ 6. Cloud Infrastructure Automation

Trigger cloud jobs (EMR, Dataflow, BigQuery).

🔗 7. API Data Ingestion

Pull data from external APIs on schedules.

🗃️ 8. Database Maintenance

Run periodic cleanup, archival, and optimization tasks.


Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                        AIRFLOW ARCHITECTURE                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   ┌──────────────┐     ┌──────────────┐     ┌──────────────┐   │
│   │   Web Server │     │   Scheduler  │     │   Workers    │   │
│   │   (Flask)    │     │              │     │  (Celery/K8s)│   │
│   └──────┬───────┘     └──────┬───────┘     └──────┬───────┘   │
│          │                    │                     │           │
│          │                    │                     │           │
│          └────────────────────┼─────────────────────┘           │
│                               │                                  │
│                    ┌──────────┴──────────┐                      │
│                    │   Metadata Database │                      │
│                    │   (PostgreSQL)      │                      │
│                    └─────────────────────┘                      │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Quick Start

Using Docker (Recommended)

cd apache-airflow

# Start local Airflow with Docker
docker-compose up -d

# Access UI at http://localhost:8080
# Login: admin / admin

This starts:

  • Webserver: UI at port 8080
  • Scheduler: Runs DAGs on schedule
  • Triggerer: Handles deferrable operators
  • PostgreSQL: Metadata database

Useful Commands

# View logs
docker-compose logs -f airflow-scheduler

# Stop Airflow
docker-compose down

# Stop and remove volumes (fresh start)
docker-compose down -v

# Run Airflow CLI commands
docker-compose exec airflow-webserver airflow dags list
docker-compose exec airflow-webserver airflow tasks list daily_sales_etl_pipeline

Manual Installation (Alternative)

# Create virtual environment
python -m venv airflow_venv
source airflow_venv/bin/activate

# Install Airflow (check https://airflow.apache.org for latest constraint file)
AIRFLOW_VERSION=2.8.1
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"

# Initialize database
airflow db init

# Create admin user
airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com \
    --password admin

# Start scheduler (in one terminal)
airflow scheduler

# Start webserver (in another terminal)
airflow webserver --port 8080

Project Structure

apache-airflow/
├── README.md                    # This file
├── dags/                        # DAG definitions
│   ├── example_etl_pipeline.py
│   ├── example_ml_pipeline.py
│   ├── example_data_quality.py
│   ├── example_api_ingestion.py
│   └── example_report_generation.py
├── plugins/                     # Custom operators, hooks, sensors
├── tests/                       # DAG tests
└── docker-compose.yml           # Local development setup

Best Practices

  1. Idempotency: Tasks should produce the same result if run multiple times
  2. Atomicity: Tasks should be all-or-nothing
  3. No Side Effects in DAG Definition: Keep top-level DAG code light; files are parsed frequently, so parsing must stay fast
  4. Use Variables/Connections: Don't hardcode credentials
  5. Small Tasks: Break complex logic into smaller, testable tasks
  6. Proper Retries: Configure retry policies appropriately (see the sketch after this list)
  7. SLAs: Set SLAs for critical pipelines
  8. Testing: Test DAGs and tasks before deployment
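A minimal sketch of how retries and SLAs might be configured through default_args (the specific values are illustrative, not recommendations):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-team",                  # hypothetical owner
    "retries": 3,                          # retry failed tasks up to 3 times
    "retry_delay": timedelta(minutes=5),   # wait between retry attempts
    "sla": timedelta(hours=1),             # flag tasks that run past this duration
}

with DAG(
    dag_id="example_defaults",             # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    load = BashOperator(task_id="load", bash_command="echo loading")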

See Example DAGs

Check the dags/ folder for practical examples covering:

  • ETL pipelines
  • ML workflows
  • Data quality checks
  • API data ingestion
  • Report generation