In this article we aim to go over the reasoning behind why someone might want to use dbt. If you are interested in learning dbt, check out this article.

Some common questions from Data Engineers about dbt are:

- It is not very clear to me why I would use dbt instead of running SQL queries on Airflow.
- Why would I switch from SQL scripts to dbt scripts, considering the learning curve?

dbt is designed to solve for the T part of ETL by working on raw data already present in a data warehouse. It provides less functionality than other OSS ETL orchestration tools such as Airflow or Luigi, but this comes with the advantage of dbt being much simpler to understand and run, especially for a non-engineer.

In recent years, data warehouses have become extremely flexible (UDFs, etc.) and powerful, with features like separation of storage and processing, elastic scaling, and machine learning capabilities (BigQuery ML). This has led many companies to use the data warehouse to perform the data transformation and load parts of the ETL process (otherwise known as ELT). This is where dbt shines, as it provides an easy, version-controlled way of writing transformations using just SQL. Additionally, it also provides data quality checks natively.

The key points on why someone would want to use dbt are:

- Easy to use for non-engineers (shared data knowledge between engineering and non-engineering teams)
- Extremely flexible data model (recreate data easily, backfills are easy)
- If most of your transformations happen at the data warehouse level, this tool makes them extremely easy to do
- Online, searchable data catalog and lineage
- Production runs using dbt Cloud or through an Airflow trigger (see the sketch below)

If you are building a data pipeline where multiple engineers and non-engineers are stakeholders in how the data is transformed, and you have a powerful data warehouse to support such requirements, dbt is a very competitive choice: it frees you from having to manage dependencies yourself, has native test support, and has a very low learning curve, enabling both engineers and non-engineers to contribute to the transformation logic.
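As a minimal illustration of the Airflow trigger option from the list above, the sketch below shows a DAG that shells out to the dbt CLI with BashOperator, running dbt run followed by dbt test for dbt's native data quality checks. The project directory, profiles directory, and schedule are assumptions for the example, not values from any particular setup.

```python
# A minimal, hypothetical Airflow DAG that triggers a dbt project via the dbt CLI.
# The paths and schedule below are placeholders for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/opt/airflow/dbt"  # hypothetical location of the dbt project and profiles.yml

with DAG(
    dag_id="dbt_daily_run",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Build the models defined in the dbt project.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )

    # Run the tests declared alongside the models (dbt's built-in data quality checks).
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )

    dbt_run >> dbt_test
```

Scheduling dbt this way keeps orchestration concerns such as retries and alerting in Airflow, while the transformation logic itself stays in version-controlled SQL models that analysts can own.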
If you are interested in learning how to set up and run dbt, check out this dbt tutorial article. Let me know if you have any questions or comments in the comments section below.

DataOps Platform with Apache Airflow and dbt on AWS

This repository contains code to deploy the architecture described in the blog post "Build DataOps platform to break silos between engineers and analysts". The architecture includes the following AWS services:

- Amazon Elastic Container Service, to run Apache Airflow and dbt.
- Amazon Elastic Container Registry, to store Docker images for Airflow and dbt.
- Amazon Relational Database Service, as the metadata store for Airflow.
- Amazon ElastiCache for Redis, as a Celery backend for Airflow.
- Amazon Simple Storage Service, to store Airflow and dbt DAGs.
- AWS CodeBuild (optional), to automate deployments.

In this repository there are two main project folders: dataops-infra and analytics. This setup is meant to demonstrate how DataOps can foster effective collaboration between data engineers and data analysts by separating the platform infrastructure code from the business logic. These two folders should be considered as two separate repositories, each following its own release cycle. The dataops-infra folder contains code and instructions to deploy the platform infrastructure described in the Architecture overview section. This project is created from the perspective of a data engineering team that is responsible for creating and maintaining data infrastructure such as a data lake, a data warehouse, orchestration, and CI/CD pipelines for analytics.
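To make the container-based setup above more concrete, here is a hedged sketch of how an Airflow DAG could launch dbt as a Fargate task on Amazon ECS, using the Amazon provider's EcsRunTaskOperator (named ECSOperator in older provider versions). The cluster name, task definition, container name, and networking values are all hypothetical placeholders, not resources defined by this repository.

```python
# A hypothetical Airflow DAG that runs dbt as an ECS Fargate task.
# Requires apache-airflow-providers-amazon; every AWS resource name below is a
# placeholder, not one created by this repository's infrastructure code.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="dbt_on_ecs",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_dbt = EcsRunTaskOperator(
        task_id="dbt_run",
        cluster="dataops-cluster",          # placeholder ECS cluster name
        task_definition="dbt-task",         # placeholder task definition
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {
                    "name": "dbt",              # container name in the task definition
                    "command": ["dbt", "run"],  # command override for this run
                }
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-12345678"],  # placeholder private subnet
                "assignPublicIp": "DISABLED",
            }
        },
    )
```

Running dbt in its own container keeps the dbt image and its dependencies separate from the Airflow image, which fits the split between the dataops-infra and analytics folders described above.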