
ETL Pipelines with Airflow: the Good, the Bad and the Ugly

Ari Bajo Rouvinen
October 8, 2021
8 minutes

🤩 The Alternative

ELT Pipelines with Airflow, Airbyte and dbt

Airflow, Airbyte and dbt are three open-source projects with different focuses but many overlapping features. At their core, Airflow is a workflow management tool, Airbyte a data integration (EL steps) tool and dbt a transformation (T step) tool.

As we have seen, you can also use Airflow to build ETL and ELT pipelines. The feature overlap doesn't stop there; it also works the other way around: Airbyte Open-Source and Airbyte Cloud provide a scheduler and integrate with dbt for the T step, and dbt Cloud also comes with a scheduler.

Airbyte Connection UI to replicate data from Postgres to BigQuery and apply custom dbt transformations.

Where to Schedule ELT Pipelines?

If you are a data company and have different kinds of pipelines, you can use Airflow to schedule all your ELT pipelines. You can then integrate Airflow with Airbyte for the EL steps with the AirbyteTriggerSyncOperator. This way, you can trigger incremental syncs with the Airflow scheduler and run full refresh syncs from the Airbyte UI without the latency you would have when doing an Airflow backfill job.
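To make this concrete, here is a minimal sketch of an Airflow DAG that triggers an Airbyte sync with the AirbyteTriggerSyncOperator. The `airbyte_conn_id` and `connection_id` values are placeholders you would replace with your own Airflow connection and the UUID of your Airbyte connection.

```python
# Minimal sketch: scheduling an Airbyte incremental sync from Airflow.
# Requires the apache-airflow-providers-airbyte package.
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="airbyte_incremental_sync",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    trigger_sync = AirbyteTriggerSyncOperator(
        task_id="trigger_postgres_to_bigquery",
        airbyte_conn_id="airbyte_default",  # Airflow connection pointing at the Airbyte API
        connection_id="<your-airbyte-connection-id>",  # placeholder: UUID from the Airbyte UI
        asynchronous=False,  # block until the sync finishes
        timeout=3600,
        wait_seconds=3,
    )
```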

You can also integrate Airbyte with dbt to run transformations for every connection that loads data into a destination. Keep in mind that you should limit Airbyte's use of dbt to simple normalization of each loaded table. If you need more complex transformation logic that combines data from different dbt models, you can integrate Airflow with dbt instead.

Unfortunately, the community has not yet agreed on the best way to integrate Airflow and dbt. Some use the BashOperator, others the airflow-dbt integration package contributed by the GoCardless team, and a few others use the dbt-cloud-plugin. The engineering team at Updater has shared their work recreating the dbt graph in Airflow by parsing the dbt manifest.json file.
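For reference, the BashOperator approach is the simplest of the three: each dbt command becomes an Airflow task. A minimal sketch follows; the project path is a placeholder for your own dbt project.

```python
# Minimal sketch of the BashOperator approach: one Airflow task per dbt command.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/dbt/my_project"  # placeholder: path to your dbt project

with DAG(
    dag_id="dbt_transformations",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )
    dbt_run >> dbt_test  # run models first, then test them
```

The downside of this approach is that the whole dbt project runs as a single task, so a failure forces you to re-run everything, which is what the manifest-parsing approach below addresses.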

Airflow DAG parsed from the dbt manifest.json file. Credits to the Updater and Astronomer.io teams.
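The idea behind that approach can be sketched roughly as follows: read the manifest.json that dbt generates, create one Airflow task per model, and wire up the same dependencies dbt would resolve. The paths and DAG settings here are illustrative assumptions, not the Updater team's actual implementation.

```python
# Rough sketch, in the spirit of the Updater/Astronomer approach: turn
# dbt's manifest.json into one Airflow task per model, preserving the
# dependency graph between models.
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Placeholder path: manifest.json is generated in the dbt target/ directory.
MANIFEST_PATH = "/opt/dbt/my_project/target/manifest.json"

with open(MANIFEST_PATH) as f:
    manifest = json.load(f)

with DAG(
    dag_id="dbt_model_graph",
    start_date=datetime(2021, 10, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    tasks = {}
    # Create one task per dbt model node in the manifest.
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] == "model":
            tasks[node_id] = BashOperator(
                task_id=node["name"],
                bash_command=(
                    f"dbt run --select {node['name']} "
                    "--project-dir /opt/dbt/my_project"
                ),
            )
    # Recreate the dbt dependency graph between model tasks; upstream
    # nodes that are not models (e.g. sources) are skipped.
    for node_id, task in tasks.items():
        for upstream_id in manifest["nodes"][node_id]["depends_on"]["nodes"]:
            if upstream_id in tasks:
                tasks[upstream_id] >> task
```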

Conclusion

Airflow shines as a workflow orchestrator. Because Airflow is widely adopted, many data teams also use Airflow transfer and transformation operators to schedule and author their ETL pipelines. Several of those data teams have migrated their ETL pipelines to follow the ELT paradigm. We have seen some of the challenges of building full data replication and incremental load DAGs with Airflow. More troublesome is that sources and destinations are tightly coupled in Airflow transfer operators. Because of this, it will be hard for Airflow to cover the long tail of integrations for your business applications.

One alternative is to keep using Airflow as a scheduler and integrate it with two other open-source projects that are better suited for ELT pipelines: Airbyte for the EL parts and dbt for the T part. Airbyte sources are decoupled from destinations, so you can already sync data from 100+ sources (databases, APIs, ...) to 10+ destinations (databases, data warehouses, data lakes, ...) without the boilerplate code Airflow requires. With dbt, you can transform data with SQL in your data warehouse and avoid handling dependencies between tables in your Airflow DAGs.

If you would like to get involved, we hope you’ll join Airbyte’s Slack community, the most active community around data integration!
