Apache Airflow is a platform to design, schedule, and manage workflows.
A workflow is a sequence of scheduled and triggered tasks, used to manage data pipelines.
Traditional Approach :
Copy Database -> Cron Scripts -> Target Database / File System / HDFS
Challenges in Traditional Approach :
Scheduling, retries, failure notifications, and dependencies between scripts all have to be managed by hand.
Airflow advantages :
Define tasks and dependencies in Python
DAG - Directed Acyclic Graph - Pipeline Tasks
A workflow of multiple tasks that can run independently: Start -> Tasks -> End
Python Script - Organize tasks and set context
Steps
Default Arguments - a Python dictionary of settings applied to every task in the DAG
depends_on_past : True/False - the current task instance runs only if its previous scheduled run succeeded
email : notification recipients
email_on_failure : send a notification when a task fails
email_on_retry : send a notification when a task is retried
retries : number of retries before a task is marked as failed
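The default arguments above are passed to the DAG as a plain Python dictionary; a minimal sketch (the owner, e-mail address, and retry values here are placeholders, not values from these notes):

from datetime import timedelta

default_args = {
    "owner": "airflow",
    "depends_on_past": False,         # do not wait on the previous run's status
    "email": ["alerts@example.com"],  # placeholder address
    "email_on_failure": True,
    "email_on_retry": True,
    "retries": 2,                     # retry a failed task twice
    "retry_delay": timedelta(minutes=5),  # optional: wait between retries
}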
DAG schedule Intervals
Monthly : @monthly (cron: 0 0 1 * *)
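The interval is given to the DAG either as a preset like @monthly or as the equivalent cron expression; a small sketch with a hypothetical DAG name:

from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="example_monthly_dag",     # hypothetical name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@monthly",     # same as the cron expression 0 0 1 * *
)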
Tasks
Set up dependencies - the order in which tasks run
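Dependencies are usually set with the bitshift operators (or set_upstream / set_downstream); a sketch with hypothetical task names:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG("example_ordering_dag", start_date=datetime(2019, 1, 1), schedule_interval="@daily")

start = DummyOperator(task_id="start", dag=dag)
extract = DummyOperator(task_id="extract", dag=dag)
load = DummyOperator(task_id="load", dag=dag)

start >> extract >> load    # run start, then extract, then load
# equivalent explicit form: extract.set_upstream(start); load.set_upstream(extract)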
Install Airflow as Docker Container
Initialize Airflow Database
Restart Airflow Scheduler
Install the ps, git, and vim commands
Kill all airflow schedulers
Set load_examples = False in airflow.cfg
Generate Fernet Key in bash
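The Fernet key is usually produced with the cryptography package (the same snippet can be run from bash as a python -c one-liner) and pasted into the fernet_key setting in airflow.cfg; a minimal sketch:

from cryptography.fernet import Fernet

fernet_key = Fernet.generate_key()   # random key, base64-encoded
print(fernet_key.decode())           # copy this value into fernet_key in airflow.cfg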
Components :
Metadata Database - MySQL
Webserver - Flask
Scheduler - Python (Celery for distributed task execution)
Building a Pipeline
Create a DAG - name, start date, schedule, max active runs, concurrency
Tasks
Task - task ID, python_callable (Python code / SQL / file), DAG name, upstream tasks, retries, pool, variables
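A sketch tying these task attributes together, assuming Airflow 1.x (consistent with the airflow list_dags CLI used below); the DAG name, callable, and pool are assumptions, not values from these notes:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="example_load_pipeline",   # hypothetical name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,                # at most one concurrent DAG run
    concurrency=4,                    # at most four task instances of this DAG at once
)

def load_data(**context):
    # placeholder callable; a real pipeline would read and write data here
    print("loading data for", context["ds"])

load_task = PythonOperator(
    task_id="load_data",
    python_callable=load_data,
    provide_context=True,             # Airflow 1.x: pass the execution context to the callable
    retries=3,
    pool="default_pool",              # assumes this pool exists
    dag=dag,
)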
Executor Types, Debugging, and Testing Pipelines
Airflow Admin UI can be accessed at http://<webserver-host>:8080/admin/
Tree View - Historical View
Print the list of active DAGs
airflow list_dags
Sync DAGs with GIT Repository
apt-get update && apt-get install git -y && apt-get install cron -y && apt-get install vim -y
git reset --hard && git pull
Airflow Scheduler Examples
Every 5 Mins : */5 * * * *
Testing DAGs
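One common way to test DAGs is a DagBag import check (for example run under pytest), which fails if any file in the DAGs folder cannot be parsed; a sketch assuming the default DAGs folder:

from airflow.models import DagBag

def test_no_import_errors():
    dag_bag = DagBag(include_examples=False)   # parse everything in the configured dags folder
    assert len(dag_bag.import_errors) == 0, dag_bag.import_errors

Individual tasks can also be exercised in isolation with the Airflow 1.x CLI command airflow test <dag_id> <task_id> <execution_date>.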
Generalizing data load processes with Airflow
https://towardsdatascience.com/generalizing-data-load-processes-with-airflow-a4931788a61f