What does dbt give you?
Data build tool (dbt) is an open-source templating engine for your SQL transformations; with native support for Snowflake, Amazon Redshift, Google BigQuery, and PostgreSQL, dbt takes care of materializing your data. In addition to the four database platforms Fishtown Analytics supports, there is community support for connectivity to Azure Synapse, Microsoft SQL Server, Presto, Spark, Exasol, Oracle, and Dremio. dbt embraces the concept of DRY (Don't Repeat Yourself) by making SQL code reusable (and modular) through YAML configuration files and Jinja language elements, and it follows the ELT (Extract, Load, Transform) paradigm as opposed to ETL (Extract, Transform, Load).
What has changed?
ETL thinking had dominated how automation tools were used: pull data out of live databases, transform it in place, and load it into a target Kimball or Inmon data model. Changes to these data pipelines accumulated cost, because every change had to be checked to ensure it did not break the existing logic or adversely affect downstream analytics. The more massaging of the business logic, the bigger the code spaghetti! This is symptomatic of legacy data warehouses, and building and maintaining such a code beast inevitably led to questions around data warehouse viability (does the cost of maintaining the software outweigh its benefits?). Indeed, the data warehouse seemed destined for the closet of history!
“Separation of concerns is a design principle for separating a computer program into distinct sections such that each section addresses a separate concern.” — Wikipedia
ELT turned this thinking on its head: “give us the RAW data and we’ll do the transformation in the data warehouse.” Tools used to get data out of front-end applications do just that; once the data has landed, it is up to the data designers to create repeatable patterns off that raw data. The concerns are now separated:
- concern 1: extract and load the raw data,
- concern 2: transform the data.
Fishtown Analytics introduced a job role called the “Analytics Engineer”; the role sits between a Data Engineer and a Data Analyst (ref). It takes the output of EL (extract and load), the raw data, and applies a data engineering approach to produce information marts through automation and repeatability (the T in ELT).
How does it work?
At the time of writing, dbt has not even reached version one, and yet it is picking up serious momentum. The makers of dbt have taken a data engineering approach and made it available to data analysts. Yes, there is a learning curve, but it is built around what every analyst should already know: SQL. So what have they added…?
1. GitHub
dbt comes in two interoperable flavors, dbt CLI and dbt Cloud. The CLI is for those who like to edit code in a tool like nano (or another text editor) and are very familiar with the standard Git workflows. dbt Cloud does everything the CLI does but with a GUI and an automated Git workflow (you just need to click a button). Both integrate with your GitHub personal or enterprise account; in fact, you still get to manage your repos within GitHub the way you want, but dbt Cloud takes away having to think about what to do next in your Git workflow.
You still have master and branches, which also means that all your code and configuration is version-controlled. Develop in your test environment and submit a pull request when the code is ready for deployment into production. The dbt init command in either flavor instantiates the dbt folder structure that you customize for your project.
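At the heart of that scaffolding is the dbt_project.yml file. A minimal sketch, assuming a recent pre-1.0 release (the project and profile names below are made up), might look like:

```yaml
# dbt_project.yml (sketch; "my_warehouse" and "my_profile" are hypothetical names)
name: my_warehouse
version: '1.0.0'
profile: my_profile          # points at connection details kept in profiles.yml

models:
  my_warehouse:
    materialized: view       # project-wide default, overridable per folder or per model
    marts:
      materialized: table    # e.g. build everything under models/marts as tables
```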
2. Repeatable code
dbt encourages the reuse of repeatable code; it is one of the strengths of the tool. As when developing transformations in an ETL tool, you may find that some of the code in a transformation on the job canvas can be used elsewhere. The solution used to be to build intermediate tables and reuse that logic downstream in other data marts, or to create parameterized sub-jobs if the ETL tool provided them. Not so with dbt! dbt uses macros, snippets of SQL code that can be reused simply by referencing them in a macro call. This means your SQL logic becomes a utility box of business rules that can be used anywhere within your project. The macros themselves can be designed with switches and options so that, when called, they execute only the portions of code needed for the switch chosen. The switch may be model- or environment-based. Not every piece of SQL logic should be a macro, but those that keep getting repeated should be.
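As a sketch of an environment-based switch (the macro name and default row count are made up for illustration), a macro can emit a LIMIT clause only when you are running against a development target:

```sql
-- macros/limit_in_dev.sql (hypothetical macro)
{% macro limit_in_dev(row_count=500) %}
    {# only emit a LIMIT clause when the active target is named "dev" #}
    {% if target.name == 'dev' %}
        limit {{ row_count }}
    {% endif %}
{% endmacro %}
```

Any model can then simply end with {{ limit_in_dev() }}; the limit disappears when the same code is promoted and run against production.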
Macros, code switches, and configuration are achieved using Jinja templates, with configuration driven by YAML files. dbt is open source, and its makers encourage this conceptually in its design. You have the option of releasing a portion, or the whole, of your project as a package to the open-source community, and you can clone and include other open-source packages into your own project, saving design and build time by reusing business rules (in SQL and Jinja) that someone else has already developed. These can be other data models or even SQL + Jinja macros.
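Pulling in a community package is a matter of listing it in packages.yml and running dbt deps; as a sketch (the version pin here is illustrative only):

```yaml
# packages.yml
packages:
  - package: fishtown-analytics/dbt_utils   # widely used community utility macros
    version: 0.6.4                          # illustrative version pin
```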
In dbt your focus is on the SQL + YAML + Jinja data transformation, but also on what the outcome should look like. dbt supports several materializations:
a) views — the default; trace your log and you will see a CREATE OR REPLACE VIEW statement wrapping your SELECT.
b) tables — executing views means you repeat the code logic every time the view is used; if the outcome is used by multiple downstream models, then the option to create the dbt model as a table can be used instead. This is a CTAS (CREATE TABLE AS) statement and replaces the model outcome at every execution.
c) ephemeral — to the casual observer they are views that are not stored anywhere; to the SQL literate, they are CTEs (common table expressions). The difference is that, unlike standard SQL, these CTEs are defined once and can be reused anywhere. A standard SQL CTE must be used by the immediately following SELECT statement; a dbt ephemeral materialization turns a CTE into a reusable SQL snippet!
d) incremental — when rebuilding a target table at every run is not cost-effective, your next bet is an incremental model, i.e. insert new records at every run. You can define a unique key and, programmatically (using Jinja and SQL), ensure only new records are processed; dbt takes care of generating the INSERT INTO or MERGE INTO on your behalf.
e) snapshots — these are your slowly changing dimension (SCD) type 2 models. dbt takes care of the start and end dates on your behalf, and you configure how changed records are detected, either by a timestamp column or by comparing record values against a key.
Because dbt can connect to multiple platforms, your SQL statements only need to transform the data; through configuration, dbt does the materialization! dbt handles each supported platform’s nuances, making dbt platform agnostic.
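For instance, an incremental model is just a SELECT plus a config block, and dbt generates the platform-appropriate INSERT INTO or MERGE INTO for you. A minimal sketch (the model, column, and upstream names are hypothetical):

```sql
-- models/fct_events.sql (hypothetical model)
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    event_type,
    loaded_at
from {{ ref('stg_events') }}          -- hypothetical upstream model

{% if is_incremental() %}
  -- on incremental runs, only pick up records newer than what is already in the target
  where loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}
```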
3. Automated testing
There are standard schema tests that you can configure in YAML along with your dbt project:
- not null — no missing values occur in a column,
- unique — the column is unique,
- relationships — referential integrity between a primary and foreign key (not database enforced; rather, dbt generates code to check the RI),
- accepted values — whether a column only contains values within a specific domain; these can be supplied from a seed, another dbt concept for data enrichment that can also be referenced like a data model.
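Configured in YAML, the four tests look roughly like this (the model and column names are made up):

```yaml
# models/schema.yml (sketch; model and column names are hypothetical)
version: 2
models:
  - name: dim_customer
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: country_code
        tests:
          - accepted_values:
              values: ['AU', 'NZ', 'US']
  - name: fct_orders
    columns:
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customer')
              field: customer_id
```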
And there are custom data tests you can develop and reuse, like SQL macros, that test the integrity of a model after it has run or of the source data before a model runs.
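A custom data test is simply a SELECT that returns the rows violating your assertion; dbt fails the test if any rows come back. A sketch (the filename, model, and columns are hypothetical):

```sql
-- tests/assert_no_negative_order_amounts.sql (hypothetical data test)
-- the test fails if this query returns any rows
select
    order_id,
    order_amount
from {{ ref('fct_orders') }}
where order_amount < 0
```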
In addition to the automated tests, you can include pre- and post-hooks that execute SQL before or after a model runs!
4. DAGs
Directed Acyclic Graphs — the orchestration of models is achieved by using Jinja within your SQL models. Through the Jinja references in your code, dbt knows the order in which models must execute. This allows models to be reused simply by referencing them (using Jinja code) without physically having to move, copy, or paste code snippets. Any changes to a snippet are felt downstream (as always), and by mapping this data lineage the makers of dbt have also delivered automated documentation. That is, once your models are developed (and you’ve added descriptions of the data in YAML), a browsable frontend can be automatically produced that shows model lineage all the way back to the source tables that contain the original values.
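As a sketch of how the DAG falls out of the code (both model names are made up), a downstream model simply selects from {{ ref('...') }} and dbt infers that the upstream model must be built first:

```sql
-- models/stg_orders.sql (hypothetical staging model)
select * from {{ source('shop', 'orders') }}   -- assumes a source named "shop.orders" is declared in YAML
```

```sql
-- models/fct_daily_orders.sql (hypothetical downstream model)
select
    order_date,
    count(*) as order_count
from {{ ref('stg_orders') }}    -- this ref() tells dbt to build stg_orders first
group by order_date
```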
The model and the documentation live side by side! It is starting to look a lot like DevOps pipelines! The DAG is also dbt’s execution flow: starting from the models based on the source tables, dbt scans the references in the developed code and executes the model updates based on the dependencies. Not only that, but you have the option of including pre- and post-hooks along with your models to do things like granting privileges on database objects as you see fit.
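A post-hook is configured alongside the model; as a sketch (the model, upstream name, and database role are assumptions, and the GRANT syntax shown is Snowflake-style), granting read access after every build could look like:

```sql
-- models/dim_customer.sql (hypothetical model; "reporter" is a hypothetical role)
{{ config(
    materialized='table',
    post_hook="grant select on {{ this }} to role reporter"
) }}

select * from {{ ref('stg_customers') }}   -- hypothetical upstream model
```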
dbt Cloud has the facility to switch between environments and schedule jobs with your own dbt commands to run, test, and generate documentation.
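A scheduled dbt Cloud job (or a cron-driven CLI run) is typically just a short list of standard dbt commands, for example:

```
dbt run             # build the models
dbt test            # run the schema and data tests
dbt docs generate   # build the documentation site
```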
5. Model agnostic
The outcome and data model are as flexible as the blend of data models you choose to use. Whether that be data vault, dimensional models, Inmon, or flat wide tables, the flexibility offered by dbt means you can stick to one or mix the models you need in your platform to suit your data needs.
At the time of writing, schema evolution is not supported by dbt (if your materialization is either incremental or snapshot); however, that only means that if you are building a data vault you may lean towards one based on a persistent staging layer. Also not directly supported is allowing multiple source updates to the same target data model table. For a hub table in a data vault that is an absolute requirement; however, there are ways to get around it. You could:
a) use model aliases — although .sql filenames are the model names in dbt, you can change the deployed table name by including an alias config snippet in your SQL code (see the sketch after this list). However, this does not prevent concurrent updates to the same target table. A hub table is a unique list of business keys, so if the business keys will never clash (by design) then this is a non-issue. If they could clash, then you have several options:
b) use another orchestration tool — like Airflow and its pools concept; this ensures only one thread at a time gets to update the target model, thereby ensuring idempotence.
c) create your own dbt update macro — one that, depending on the target platform, ensures a level of locking of the target table before loading. Snowflake, for instance, locks the target table (and all of its micro-partitions) if you use a MERGE statement or if you use transactions (BEGIN .. COMMIT). There are methods for checking table locks on target tables, and you can include that in your dbt script to poll the target table until a lock is available before executing the update to the target data model table.
d) model multiple source tables to a target hub by using UNION ALL (sketched below) — this is a workaround, because a data vault should be able to be updated at any time of day and at any point in time, but using a UNION ALL statement means you are querying data sources that have no new data even though other data sources pointing to the same target table do. Each hub table update should function autonomously; this workaround means that every source file loading to a common hub must be included in a single SQL statement.
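To illustrate options a) and d) together, here is a hedged sketch of a hub model (all model, column, and alias names are made up) that renames its deployed table via an alias and unions two staged sources into the one statement:

```sql
-- models/raw_vault/hub_customer.sql (hypothetical model)
{{ config(
    materialized='incremental',
    alias='hub_customer',          -- option a): deployed name can differ from the filename
    unique_key='customer_hk'
) }}

-- option d): every source feeding this hub is unioned into the one statement
with staged as (

    select customer_hk, customer_id, record_source, load_datetime
    from {{ ref('stg_crm_customer') }}        -- hypothetical staged source 1

    union all

    select customer_hk, customer_id, record_source, load_datetime
    from {{ ref('stg_web_customer') }}        -- hypothetical staged source 2
)

-- a real hub load would also de-duplicate keys across the unioned sources
select customer_hk, customer_id, record_source, load_datetime
from staged
{% if is_incremental() %}
where customer_hk not in (select customer_hk from {{ this }})
{% endif %}
```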
As for facts and dimensions: if the content becomes large enough that daily rebuilds become a problem, then a fact table is essentially an incremental materialization without a unique key configured, and a slowly changing dimension type 2 is a dbt snapshot. However, when building a star schema the paradigm changes a bit. Out of the box there is no utility that fetches dimension surrogate sequence keys to load into a fact table; you would have to design that portion yourself. dbt was not built to do that, but it offers the tools for you to build it. Instead, think of snapshots as captured dimensional changes about a business entity aligned to the source (which is why you will see snapshots as the first step after the source table), typically called the source-aligned layer (SAL). Rapidly changing data like transactions can be loaded to incremental materializations without the unique key; the same materialization can be used for data vault 2.0 satellite tables. These are to suit the data models we know; in dbt you do not necessarily have to follow either methodology. There is a great article on how you could horizontally and vertically align your data warehouse here.
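As a sketch of such a source-aligned snapshot (the source, schema, and column names are assumptions), dbt's snapshot block handles the SCD type 2 bookkeeping:

```sql
-- snapshots/customers_snapshot.sql (hypothetical snapshot)
{% snapshot customers_snapshot %}

{{ config(
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',          -- or 'check' with check_cols to compare column values
    updated_at='updated_at'
) }}

select * from {{ source('crm', 'customers') }}   -- hypothetical source declared in YAML

{% endsnapshot %}
```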
Why does it work?
DevOps, CI/CD, data pipelines, infrastructure as code… these are more than just buzzwords; they are a cultural shift in how to treat analytics at scale. DataOps applies to dbt: essentially you are treating your data analytics as code, like you have everything else. Tests and deployment are automated, and the flexibility is ingrained to suit your enterprise. By being able to share macros, models, and packages you can stand on the shoulders of existing data models and methods even if they were not built for your own data. Open source becomes a culture: instead of building analytics from scratch you can ask “has someone done this before?” and leverage an existing data model or macro to do just that! And it encourages you to share common models and macros yourself!
Data warehousing is revived: with dbt and the slew of modern cloud data warehouse platforms, possibilities that were previously expensive (or unheard of) in legacy data warehouses are now practical, affordable, and scalable. Entrepreneurial and large enterprises alike can leverage SaaS at scale without worrying about the long lead times we used to have to plan for when building a data warehouse. Modern data governance and regulation still apply, but a little configuration and design upfront can spin up an entire enterprise data warehouse that scales as you scale and molds the platform as your data governance sees fit.