Apache Airflow is an open-source workflow orchestration platform that allows you to programmatically author, schedule, and monitor workflows. It was created at Airbnb in 2014 and became an Apache Top-Level Project in 2019.
Unlike GUI-based tools, Airflow treats workflows as Python code, enabling:
- Version control for pipelines
- Code reviews for workflow changes
- Testing of workflow logic
- Reusability through templates
A DAG (Directed Acyclic Graph) is a collection of tasks with explicitly defined dependencies. "Acyclic" means no circular dependencies: a task can never depend, directly or indirectly, on itself.
```
Task A → Task B → Task C
    ↘
      Task D
```
- Tasks: Individual units of work. Each task is an instance of an Operator.
- Operators: Templates for predefined kinds of work, for example:
  - BashOperator: Execute bash commands
  - PythonOperator: Execute Python functions
  - EmailOperator: Send emails
  - SQLExecuteQueryOperator: Execute SQL queries
  - SimpleHttpOperator: Make HTTP requests
  - Sensors: Wait for conditions to be met
- XComs: Cross-communication mechanism for tasks to share small pieces of data.
- Hooks: Interfaces to external systems (databases, APIs, cloud services).
- Connections: Stored credentials and connection info for external systems.
- Variables: Global key-value store for configuration.
- ETL pipelines: Extract data from sources, transform it, and load it to destinations.
- Data warehousing: Schedule regular data loads from operational databases to analytics warehouses.
- ML workflows: Coordinate training, validation, and deployment of ML models.
- Reporting: Generate and email periodic business reports.
- Data quality: Run validation checks and alert on data anomalies.
- Cloud job orchestration: Trigger cloud jobs (EMR, Dataflow, BigQuery).
- API ingestion: Pull data from external APIs on schedules.
- Maintenance: Run periodic cleanup, archival, and optimization tasks.
```
┌─────────────────────────────────────────────────────────┐
│                  AIRFLOW ARCHITECTURE                   │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐       │
│  │ Web Server │   │ Scheduler  │   │  Workers   │       │
│  │  (Flask)   │   │            │   │(Celery/K8s)│       │
│  └─────┬──────┘   └─────┬──────┘   └─────┬──────┘       │
│        │                │                │              │
│        └────────────────┼────────────────┘              │
│                         │                               │
│              ┌──────────┴──────────┐                    │
│              │  Metadata Database  │                    │
│              │    (PostgreSQL)     │                    │
│              └─────────────────────┘                    │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
```bash
cd apache-airflow

# Start local Airflow with Docker
docker-compose up -d

# Access UI at http://localhost:8080
# Login: admin / admin
```

This starts:
- Webserver: UI at port 8080
- Scheduler: Runs DAGs on schedule
- Triggerer: Handles deferrable operators
- PostgreSQL: Metadata database
```bash
# View logs
docker-compose logs -f airflow-scheduler

# Stop Airflow
docker-compose down

# Stop and remove volumes (fresh start)
docker-compose down -v

# Run Airflow CLI commands
docker-compose exec airflow-webserver airflow dags list
docker-compose exec airflow-webserver airflow tasks list daily_sales_etl_pipeline
```

Alternatively, to run Airflow locally without Docker:

```bash
# Create virtual environment
python -m venv airflow_venv
source airflow_venv/bin/activate

# Install Airflow (check https://airflow.apache.org for the latest constraint file)
AIRFLOW_VERSION=2.8.1
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
```
```bash
# Initialize database
airflow db init

# Create admin user
airflow users create \
    --username admin \
    --firstname Admin \
    --lastname User \
    --role Admin \
    --email admin@example.com \
    --password admin

# Start scheduler (in one terminal)
airflow scheduler

# Start webserver (in another terminal)
airflow webserver --port 8080
```

Repository layout:

```
apache-airflow/
├── README.md                 # This file
├── dags/                     # DAG definitions
│   ├── example_etl_pipeline.py
│   ├── example_ml_pipeline.py
│   ├── example_data_quality.py
│   ├── example_api_ingestion.py
│   └── example_report_generation.py
├── plugins/                  # Custom operators, hooks, sensors
├── tests/                    # DAG tests
└── docker-compose.yml        # Local development setup
```
- Idempotency: Tasks should produce the same result if run multiple times
- Atomicity: Tasks should be all-or-nothing
- No Side Effects in DAG Definition: Top-level DAG code runs on every scheduler parse, so keep it fast and free of I/O or other side effects
- Use Variables/Connections: Don't hardcode credentials
- Small Tasks: Break complex logic into smaller, testable tasks
- Proper Retries: Configure retry policies appropriately
- SLAs: Set SLAs for critical pipelines
- Testing: Test DAGs and tasks before deployment
Check the dags/ folder for practical examples covering:
- ETL pipelines
- ML workflows
- Data quality checks
- API data ingestion
- Report generation