Data Vault on Snowflake: Dynamic Information Marts
Snowflake continues to set the standard for data in the cloud by removing the need to perform maintenance tasks on your data platform and giving you the freedom to choose the data model methodology that suits you. You expect the same relational capabilities for your data model as on any other platform, and Snowflake certainly delivers.
Announced at Snowflake Summit 2022, Dynamic Tables are the next-level table type, optimised for real-time streaming transformation. Dynamic Tables behave like Materialised Views but are really the declarative version of Streams and Tasks (introduced back in 2019), where Snowflake manages the data pipelines on your behalf. That said, Dynamic Tables can be used for batch workloads, for combined batch and near-real-time streaming, or solely for real-time streaming transformations. What's more, Dynamic Tables can be stacked into a directed acyclic graph (DAG), and you do not need to manage the underlying streams and tasks deployed on your behalf.
Do they replace Streams and Tasks? No; where you desire more control over how the offsets work for your data pipelines, you would likely choose Streams and Tasks over Dynamic Tables. We covered some of this in the Streams on Views article.
It is an exciting bit of Snowflake technology, and many have already asked, "where can Dynamic Tables be used in Data Vault?" We will answer this in due course.
Series
- Join Key Data Type
- Snapshot PIT tables
- Incremental PIT tables
- Information Marts
- Simulating Fact tables with Bridges
- Dynamic Information Marts
A reminder of the standard data vault table types,
And the query assistance tables needed to boost join performance…
Episode 18: Dynamic Tables
You may be thinking, why build a series of articles about tuning your data vault for performance on Snowflake? As much as we would like to solve all our data challenges by simply writing an SQL query, the truth is Dynamic Tables still inherit the same performance considerations as traditional Snowflake tables. That means we need to consider:
- Clustering and not clustering data vault constructs
- Choosing an appropriate join key, whether that is a surrogate hash key, temporal sequence key or simply the natural keys
- The structure of our queries in terms of taking advantage of Snowflake’s native OLAP query support, think star-optimised join queries in this regard
There are many techniques to consider to get us to this point: where and when to use Dynamic Tables, and whether they are the optimal table structure for our data architecture. Dynamic Tables are declared as a regular CTAS (CREATE TABLE AS) SQL statement, with the addition of assigning a virtual warehouse (currently) and a lag setting that instructs Snowflake when to check for new updates from the source data structures supporting the Dynamic Table; a minimal declaration sketch follows the list below. Dynamic Table updates come in two flavours (and this is crucial for our discussion):
- FULL REFRESH — when the SQL declared in a dynamic table cannot be executed as an incremental update. Snowflake aims to reduce the likelihood of this.
- INCREMENTAL REFRESH — the optimal use case for dynamic tables; as new records are ingested into the data structures supporting a dynamic table, they flow through and are transformed by what you have coded as a CTAS.
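To make that concrete, here is a minimal sketch of the declaration syntax (the table, source, and warehouse names are hypothetical); a full data vault example follows later in this article.

create or replace dynamic table customer_order_summary
  lag = '5 minutes'
  warehouse = transform_wh
as
select customer_id
     , count(*)    as order_count
     , sum(amount) as order_amount
from raw_orders
group by customer_id
;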
What is also important to understand with regards to Dynamic Tables is the isolation level: Dynamic Tables use account-level snapshot isolation, whereas traditional Snowflake tables use read-committed isolation. This makes Dynamic Tables extremely powerful, because an update to a Dynamic Table takes the required data in its account-level snapshotted state and processes it through to the dynamic table without interfering with any updates concurrently happening to the supporting data structures, including other dynamic tables. The data will flow.
The other important technical detail you must understand about Dynamic Tables is their join behaviour. A Dynamic Table you define could include SQL joins between data structures such as views, tables, and other dynamic tables, and for a data vault you will definitely be executing SQL joins between a lot of tables! To understand the join behaviour we can look back at Snowflake's Streams on Views; these structures give the first hint of the inherent Dynamic Table join behaviour. Each Snowflake Stream is an offset on a change-tracked underlying table; when a stream is deployed on a view, the offset is not deployed on the view but on the underlying tables supporting that view. Each offset on each underlying table is independent and processes new change-tracked records when the SQL query utilising that stream is used to push and transform data into a target table. You might therefore expect that when a new record is introduced to only one of the supporting tables in the join, the SQL join would not find the corresponding record in the other table, because those records have already been processed. Not so!
Snowflake will find that record in the other table to support your SQL join and process the new record. Dynamic Tables support the same SQL join behaviour; let's demonstrate this with some sample code.
1. Create the demo environment
create table if not exists dim_table (
id int
, name varchar(10)
);
create table if not exists fact_table (
id int
, amt int
);
create or replace view dim_fact_equi as
select f.id as id
, f.amt
, d.name
from fact_table f
natural join dim_table d
;
create or replace stream stream_dim_fact_equi on view dim_fact_equi;
2. Insert a matching fact and dimension record and notice how the view and the stream on the view contain the same values (a sketch of these steps follows this list)
3. If we insert a fact without a matching dimension record, the new fact does not appear in the view because there is no matching dimension. The stream on the view shows the same
4. When we eventually insert the matching dimension record, the fact does appear in the view and in the stream on the view
5. When we consume the stream (we take the transformed records and push them to a downstream table), the stream on the view appears empty when you query it
6. When we add a new fact record it will appear in the stream on the view. How does Snowflake do this, when we know that defining a stream on a view persists an offset on each of the underlying tables supporting the view and the matching dimension record was already processed?
7. Another inserted fact record also appears in the stream on the view! Snowflake will look at a previous version (Time-Travel) of the table and find the match!
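For reference, here is a sketch of the inserts and stream consumption described in steps 2 to 7 (the sample values and the downstream table name are hypothetical).

-- step 2: a matching fact and dimension record
insert into dim_table values (1, 'card');
insert into fact_table values (1, 100);

-- step 3: a fact without a matching dimension record; not yet visible in the view or the stream
insert into fact_table values (2, 50);

-- step 4: the matching dimension record arrives and the fact now appears
insert into dim_table values (2, 'loan');

-- step 5: consume the stream by pushing its records to a downstream table
create table if not exists dim_fact_target (id int, amt int, name varchar(10));
insert into dim_fact_target select id, amt, name from stream_dim_fact_equi;

-- steps 6 and 7: new facts still find their already-consumed dimension records
insert into fact_table values (1, 25);
insert into fact_table values (2, 75);
select * from stream_dim_fact_equi;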
Yes, very powerful!
Dynamic Tables in a Data Vault
Hopefully you have a clearer understanding of how Dynamic Tables work at this point. Now, as to where they fit in a data vault let’s start with where they do not belong.
- They should not be used as hub, link, and satellite tables (today). As we discussed in the previous section, a Dynamic Table can fully refresh. This means that if you use your Raw and Business Vault as the auditable source of your enterprise's data, you risk losing that auditability simply because a Dynamic Table can refresh! In addition, the current version of Dynamic Tables will automatically delete records as a result of physical deletions from the underlying source data of that Dynamic Table.
- You could instead have deployed your Data Vault over what's colloquially known as a persistent staging area (PSA) and modelled hub, link, and satellite tables as SQL views. At this point you should ask yourself: why have you deployed a malleable data vault over a landed area you cannot change because of auditability? And why have you chosen a data vault model to be constructed over a PSA if you know that the analytical tools you will be using are not data vault model friendly? Seriously, if you have deployed a PSA then you should skip data vault altogether and deploy a virtual dimensional model over your PSA! Persisting your enterprise data into Raw Vault hub, link, and satellite tables offers better management and performance of your enterprise's data around what your organisation prizes: your business entities (modelled as Data Vault hub tables).
- Dynamic Tables also need time to materialise, meaning if you’re building pipelines based on the latest state of a satellite table’s descriptive attributes, those jobs/tasks need to wait for that satellite to complete its update if it were deployed as a dynamic table. Of course, you could chain Dynamic Tables with Dynamic Tables but see the first bullet point above!
Where do we see Dynamic Tables being used in a Data Vault architecture?
Information Marts
Let's summarise and elaborate on their use in the information marts layer:
Note: As support for more Dynamic Table functionality grows, we expect to see more of these queries supported by incremental refresh mode.
It seems PIT tables can only use a full refresh. Can we do better?
To answer that question, we'll look back at the two PIT variations we discussed in articles 2 and 3 of this series: the snapshot and incremental PIT tables respectively.
Current PIT table
Recall how we built this PIT table variation
merge into CURRENT_SAT_CARD_MASTERFILE_PIT p
using (select distinct
dv_hashkey_hub_account
, dv_loaddate
, account_id
, dv_hashdiff_sat_card_masterfile
from staged.card_masterfile stg) stg
on p.account_id = stg.account_id
when matched then
update set p.sat_card_masterfile_dv_hashkey_hub_account = stg.dv_hashkey_hub_account
, p.sat_card_masterfile_dv_loaddate = stg.dv_loaddate
, p.sat_card_masterfile_dv_hashdiff = stg.dv_hashdiff_sat_card_masterfile
when not matched then
insert (account_id
, sat_card_masterfile_dv_hashkey_hub_account
, sat_card_masterfile_dv_loaddate
, sat_card_masterfile_dv_hashdiff)
values (stg.account_id
, stg.dv_hashkey_hub_account
, stg.dv_loaddate
, stg.dv_hashdiff_sat_card_masterfile);
We had to schedule this PIT table load and run it in parallel to the satellite table load, whilst using the same staged view. Now with Dynamic Tables all we do is write this SQL in its place.
create or replace dynamic table CURRENT_SAT_CARD_MASTERFILE_PIT
lag = '1 min'
warehouse = vault
as
select dv_hashkey_hub_account as sat_card_masterfile_dv_hashkey_hub_account
, dv_loaddate as sat_card_masterfile_dv_loaddate
, account_id
from sat_card_masterfile
qualify row_number()
over
(partition by dv_hashkey_hub_account order by dv_applieddate desc, dv_loaddate desc) = 1
;
- No orchestration needed, Snowflake manages it
- No need to create the table with column data types, it is a CTAS
- No holding of the staged content, CTAS is based on the data vault satellite table
- The CTAS is based on the current record of the satellite table, the same logic that is otherwise hidden from the user behind a 'current' view (a sketch of a mart view over this PIT follows this list)!
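As a hedged sketch (the mart name is hypothetical), an information mart view over this current PIT simply equi-joins back to the satellite to expose its latest descriptive attributes:

create or replace view infomart_card_masterfile_current as
select pit.account_id
     , sat.*  -- the satellite's descriptive attributes
from CURRENT_SAT_CARD_MASTERFILE_PIT pit
join sat_card_masterfile sat
  on  sat.dv_hashkey_hub_account = pit.sat_card_masterfile_dv_hashkey_hub_account
  and sat.dv_loaddate            = pit.sat_card_masterfile_dv_loaddate
;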
And now, the snapshot PIT table!
LAG-PIT Snapshot variation
In this series of articles, we have covered two variations of the snapshot PIT code:
- Standard PIT code, provided as part of the standards. This query build is not suited for incremental dynamic table refresh today because the query contains a sub-query expression
- Improved PIT code, moving the cross-join from the satellite table join to the hub/link table build. This query build is not suited for incremental dynamic table refresh today because the query contains non-equality predicates
Can we build a snapshot PIT table another way? Yes, by incrementally (pun unintended) introducing another innovation to the snapshot PIT table build, window functions ignoring nulls! Here’s how we do it!
STEP 1: Expand Hub with Snapshot Dates, align Satellite entries
Like the improved snapshot PIT code, we expand the hub load time entries with the snapshot period requested. So far the code is the same; now for the new code portion.
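Here is a minimal sketch of that expansion, under the assumption of a daily snapshot-date spine (as_of_date), a hub_account table, and at most one satellite change per key per snapshot date (these names and assumptions are illustrative); rows with no satellite change on a given snapshot date carry a NULL load date that step 2 fills in.

select st.dv_hashkey_hub_account
     , st.snapshotdate
     , s1.dv_loaddate  -- NULL when no satellite change landed on this snapshot date
from (
      select h.dv_hashkey_hub_account
           , c.snapshotdate
      from hub_account h
      cross join as_of_date c
     ) st
left join sat_card_masterfile s1
  on  s1.dv_hashkey_hub_account = st.dv_hashkey_hub_account
  and to_date(s1.dv_loaddate)   = st.snapshotdate
;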
STEP 2: Coalesce with previous non-NULL entries
Using Snowflake's LAG window function, we retrieve the previous non-NULL entry to fill in the gaps between satellite dates. We coalesce in the following order:
- Satellite load date, matches the satellite change timestamp
- Previous load date we derived using the window LAG function
- Ghost record, in case the snapshot date did not have a satellite timestamp yet
And the code to make this happen…
, coalesce(s1.dv_loaddate
, lag(s1.dv_loaddate) IGNORE NULLS over
(partition by st.dv_hashkey_hub_account
order by st.snapshotdate)
, to_timestamp('1900-01-01 00:00:00')) as sat_card_masterfile_dv_loaddate
Essentially, we have gotten rid of the PIT refresh and turned the snapshot PIT into an incremental PIT with far simpler SQL text! This effectively means we can daisy-chain dynamic tables from PITs (query assistance) to Information Marts!
Can we also replicate the logarithmic PIT builds? Yes, we add a very simple reference to a flag in the PIT table build.
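As a hedged illustration only (the flag column names are assumptions), the snapshot-date spine can carry flags that mark which snapshot grain each date belongs to, and the PIT build simply filters on them:

select snapshotdate
from as_of_date
where is_daily      = true   -- the spine keeps daily flags only for recent history
   or is_endofweek  = true   -- weekly flags further back
   or is_endofmonth = true   -- monthly flags for the oldest history
;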
It hasn’t escaped the author that the word LAG is a homonym in the context of PIT tables deployed as Dynamic Tables!
- LAG used in the Window function we just described,
- LAG used to instruct Dynamic Tables when to update!
We might also add light-weight transformations into our PIT and not need an additional information mart at all. However, the beauty of having query assistance structures is that we have a single QA structure that may support multiple Information Marts on top of it. Consider that when you’re designing these structures!
Command & Query Responsibility Segregation (CQRS)
Microservices and domain-driven design have established design patterns for processing event (real-time) data and for efficiently querying business entity state. CQRS segregates these two operations into the distinct models it calls:
- Command — inserts new event data; the write model.
- Query — retrieves data from a separate read model.
Event data are statements of fact; they will never change. But because events can arrive as frequently as the grain of the event you choose to model, they can be buffered into millions of records a day. A database query attempting to scour across all of those records for the business entities and events of interest can become cumbersome and inefficient, and this is where the query portion of the design pattern comes into play. Much like data vault modelling is designed to do, CQRS designates a data model that is continuously updated with domain events, aggregates the events to the desired aggregation level, and presents the data in a queryable data model.
On Snowflake, you can now do that in a single database: in a data vault we model non-historised link or satellite tables and supply the aggregated data as an information mart.
Non-historised tables mean your loading patterns do not check whether the new event data is new or changed; every record is new. If you are utilising a pattern like this, consider that adding a surrogate hash key and record hash (HashDiff) will add unnecessary latency to your loading pattern (as we saw in the first article of this series). Your information mart will then resolve the natural key to surrogate hash key join across your real-time and batch data. Declare it once as a Dynamic Table and your aggregate information mart will take care of the rest! Learn more about Snowflake's Snowpipe Streaming API here.
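As a hedged sketch of that idea (all object names are hypothetical), a non-historised satellite of card transaction events could feed an aggregate information mart declared once as a Dynamic Table, joining on the natural key so the streaming side carries no surrogate hash key at all:

create or replace dynamic table infomart_card_txn_daily
  lag = '1 min'
  warehouse = vault
as
select hub.account_id
     , to_date(evt.transaction_timestamp) as txn_date
     , count(*)        as txn_count
     , sum(evt.amount) as txn_amount
from nh_sat_card_transactions evt
join hub_account hub
  on hub.account_id = evt.account_id  -- natural key join resolved in the mart
group by hub.account_id
       , to_date(evt.transaction_timestamp)
;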
Note: use this article as a guide, test against your own workloads to make an informed decision.
Dynamic Table in Snowsight
Once a Dynamic Table is defined you can navigate through Snowsight to view and manage it. Interactively you can:
- View the DAG supporting a Dynamic Table, based on the SQL query you defined.
- View the refresh history, including the number of records inserted and deleted and whether the refresh was incremental or not.
- View the runtimes for the updates and click through to the query plan for the query. Note that this query is not viewable in the query history for the rest of your account; it is segregated into its own UI within Snowsight
- Click here to discover more Dynamic Table-specific Snowsight features
You can also monitor Dynamic Tables using the SQL DYNAMIC_TABLE_REFRESH_HISTORY table function.
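For example, to inspect refreshes for the PIT we declared earlier (treat this as a sketch; the available arguments and columns may vary by Snowflake version):

select *
from table(information_schema.dynamic_table_refresh_history(
       name => 'CURRENT_SAT_CARD_MASTERFILE_PIT'));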
Summary
As you may have noticed, the Dynamic Table creation syntax includes a clause to run on a virtual warehouse; this means you pay to keep the table up to date with the information mart query you designed. Is that necessary for your use case? As we have seen throughout this series and the previous one, Snowflake already has the necessary components to run your queries efficiently as information mart views. Whether that is utilising the build-and-probe algorithm or Snowflake's unique approach to the result cache, you're only paying for execution of that view, and sometimes not at all! Granted, a Dynamic Table will be faster in certain cases, but does your analytics use case warrant it? You need to investigate and decide.
Why you might not consider dynamic tables for point-in-time constructs is hinted at in the name itself: PIT tables imply a static reference to the underlying data vault tables from raw and business vault. Think of having the data ready for a report on a Monday morning while your data vault continues to update throughout the day; the PIT table references the data needed for that report efficiently. If the performance you already get from using SQL views over point-in-time tables is satisfactory, then you may prefer to keep things simple and not deploy information marts as dynamic tables! While you are retrieving keys and timestamps for your PIT constructs you may also perform lightweight aggregations, and here dynamic tables may be quite handy if those aggregations start to become complex.
In the spirit of the previous articles in this series, let's highlight the following when considering dynamic tables:
- Cost to load — this would be relatively the same as your normal load, but consider whether you want to deploy surrogate hash keys at all. As we noted, the load performance of a natural key data vault is vastly superior to a surrogate hash key data vault, but the trade-off is the increased complexity of your table joins. Also note that, depending on your query, some dynamic table loads must currently be executed as a full refresh; we are still at the beginning of the dynamic table journey, and Snowflake aims to support more and more incremental updates in the future.
- Cost to query — Dynamic Tables will be superior to query just by definition because they are tables. For argument’s sake let’s compare the runtimes for information marts deployed as views and deployed as dynamic tables (these are binary hash key runtimes)
  - Skew-optimised SQL view over a PIT table:
    - Daily: 1 minute 36 seconds
    - Weekly: 47 seconds
    - Monthly: 31 seconds
  - Dynamic Information Mart based off the same data:
    - Daily: 46 seconds
    - Weekly: 13 seconds
    - Monthly: 4.2 seconds
Note however that a view will be real-time, a Dynamic Table has a lag.
- Cost of storage — Dynamic Tables have a storage footprint, as do traditional Snowflake tables. In addition, Dynamic Tables keep a configurable level of Time-Travel, and inactive micro-partitions are purged when their Time-Travel retention expires
- Cost of complexity — vastly simplified as we have seen! In previous articles we have shown how materialising certain versions of PIT tables required multiple steps including a rather complex SQL MERGE that needed careful orchestration, Dynamic Tables are deployed as CTAS statements and Snowflake manages the rest!
Don't forget, though, that you can suspend and resume Dynamic Tables through a simple SQL command. Once resumed, a Dynamic Table catches up to the desired active state.
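For example, using the PIT declared earlier:

-- pause automatic refreshes
alter dynamic table CURRENT_SAT_CARD_MASTERFILE_PIT suspend;
-- resume them; the table then catches up
alter dynamic table CURRENT_SAT_CARD_MASTERFILE_PIT resume;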
This concludes this six-part series covering performance tuning considerations for your data vault, along with recommendations around dynamic tables and where to use them. Dynamic Tables do not replace Streams and Tasks; they are instead a declarative alternative to the imperative style of managing data pipelines, which, in your case, you might be totally fine with!
Until next time!
Further reading
- Data Vault Guru: a Pragmatic Guide to Building a Data Vault https://www.amazon.com/dp/B08KSSKFMZ
- Keep your Data Vault honest with this test suite, https://github.com/PatrickCuba/the_data_must_flow
- Domain Driven Design, https://martinfowler.com/bliki/DomainDrivenDesign.html
- Data Vault and Domain Driven Design, https://medium.com/snowflake/1-data-vault-and-domain-driven-design-d1f8c5a4ed2
- Dynamic Tables https://docs.snowflake.com/en/user-guide/dynamic-tables-about
- Command Query Responsibility Segregation (CQRS) pattern, https://www.ibm.com/cloud/architecture/architectures/event-driven-cqrs-pattern/
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.