Data Vault on Snowflake: Performance Tuning with Keys
Snowflake continues to set the standard for Data in the Cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You expect the same relational capabilities from your data model as on any other platform, and Snowflake certainly delivers.
As your Data Vault grows, a common concern is whether the platform can continue to deliver the performance you achieved when you first started your Data Vault journey on Snowflake. As we have seen in previous articles, the larger a table is, the more micro-partitions represent it; we have also looked at how constructing a PIT table can improve join performance around a hub or link table, and how a PIT table can be used to fetch the current record for a business entity using Snowflake's JoinFilter. The common theme across those three articles is pruning: being able to select fewer micro-partitions to satisfy a query naturally improves join performance on Snowflake.
In this series of articles, we will explore some of the key considerations (pun unintended) when building your data vault on Snowflake in the first place, as well as the pros and cons of each decision. Armed with these facts you will be able to make an informed decision on the architecture patterns for your deployment on Snowflake. Note that some of these decisions are not mutually exclusive, but we will point out where they are.
Series
- Join Key Data Type
- Snapshot PIT tables
- Incremental & Merge PIT tables
- Information Marts & Ghost Skew
- Simulating Fact tables with Bridges
- Dynamic Information Marts
A reminder of the standard data vault table types,
And the query assistance tables needed to boost join performance…
Episode 13: Join Key Data Type
A fundamental decision when building your data vault architecture is how you will join your data vault tables after your data has been loaded into their respective hub, link, and satellite tables. The factors that may help you decide what those join keys and their data types should be are:
- Cost to load (e.g. hashing, more on this soon) — compute cost.
- Cost to query — the speed of querying your loaded data.
- Cost of storage — for your data vault model and query assistance structures.
- Cost of complexity — the number of keys in the join.
With this in mind, let’s introduce the join key types we commonly see in a data vault:
- Surrogate hash keys
— as a binary data type
— as a varchar data type
- Surrogate sequence ids
— as a durable key
— as a temporal key
- Natural keys (no surrogate at all)
Your “mutually exclusive” decision is between surrogate hash keys (binary or text) and durable surrogate sequence keys. You will need to persist with one of these join key data types, but either can be mixed with the remaining join key types (temporal sequence keys and natural keys). Alternatively, you could build a natural key data vault with no surrogate keys at all.
For this article and the remainder of the series, this is the data model we will expand on:
FYI, Kent Graziano suggested exploring a natural-key-only data vault in this article; we will articulate what that means for your data vault!
Cost of complexity
Landing data in an accessible location for ingestion into Snowflake is a requirement for file-based batch loads; here, the cost of complexity relates to the join key data type. However, the following four steps apply to every join key data type, so this part of the complexity is shared evenly.
1. First, we must apply the standard business key treatments to standardise the business keys themselves,
- Apply a standard case with the UPPER function; this supports Passive Integration.
- TRIM leading and trailing blanks (dates are never part of, or treated as, a business key).
- CAST to text/varchar; all business keys are of a text data type (keys are not metrics and therefore are never of a numeric data type, see example here).
2. Second, apply the zero key treatment to deal with null keys and relationship optionality of business keys in a relationship.
3. Third, we need to decide on an appropriate business key collision code. Remember, the hub table is the integration point between source systems, so you do run the risk of loading business keys with the same or similar key value that in fact represent totally different business objects. To ensure we don't conflate them, we add a tie-breaker value to differentiate between the two; otherwise, you would end up joining data around a hub table that are not related!
Caution: assign a business key collision code only if it is necessary!
4. The tenant id is also included in surrogate key assignment and can be further utilised to subset a single data vault model across your tenants.
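Putting steps 1 to 4 together, a staging expression might look like the sketch below. The table name, column name, zero key value and the 'default' collision code and tenant id are illustrative assumptions, not prescriptions.

```sql
-- Illustrative staging view applying the four business key treatments.
-- Table, column and zero-key values are hypothetical placeholders.
CREATE OR REPLACE VIEW stg_card_masterfile AS
SELECT
    -- 1 & 2: standardise the key (cast, trim, upper-case) and apply the zero key for nulls/blanks
    COALESCE(NULLIF(UPPER(TRIM(CAST(account_id AS VARCHAR))), ''), '-1') AS account_id_bk,
    -- 3: business key collision code (only assign a real value when it is truly needed)
    'default' AS bkcc,
    -- 4: tenant id, also usable to subset a single data vault model across tenants
    'default' AS tenant_id,
    src.*
FROM raw_card_masterfile AS src;
```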
Now let’s see what this means for each of the join key data types.
Surrogate hash key
Why would we want to create hash keys at all? The answer is twofold,
- They are deterministic: the same input will always produce the same hash, and the digest is always the same length.
- For some MPP-based platforms they also distribute very well across storage nodes and therefore inherently have very good locality for data describing a particular business entity. For an explanation, see here.
The second bullet point above is not relevant to Snowflake, whose MPP architecture gets its elastic scalability from the separation of storage and compute, with as many compute clusters (virtual warehouses) as you need working off the same data. To ensure the surrogate hash value produced from these keys and columns does not clash with another surrogate hash key value, we must select a hashing algorithm strong enough to make hash key collisions as unlikely as possible. This is where we need to decide between the following algorithms (ordered from weakest to strongest):
- MD5 or MD5_BINARY — has had extensive recorded vulnerabilities, see here.
- SHA1 or SHA1_BINARY — has had one recorded collision thus far, see here.
- SHA2 or SHA2_BINARY — has no recorded vulnerabilities currently.
Note that the stronger the algorithm, the more CPU intensive it is.
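To see the trade-off concretely, the same treated key string can be passed to each of Snowflake's built-in hash functions; the binary variants return digests of 16, 20 and 32 bytes respectively (the concatenated input literal below is just an example).

```sql
-- Digest sizes of the three algorithm families (binary variants shown).
SELECT
    MD5_BINARY('ACC001||default||default')  AS md5_key,   -- 16 bytes
    SHA1_BINARY('ACC001||default||default') AS sha1_key,  -- 20 bytes
    SHA2_BINARY('ACC001||default||default') AS sha2_key;  -- 32 bytes (SHA-256 by default)
```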
What makes using surrogate hash keys attractive is their deterministic nature, i.e., by using surrogate hash keys we can:
- Simplify the join between data vault tables from at least three columns to a single column, i.e., business key(s) + business key collision code + tenant id are hashed into one surrogate hash key column.
- We can load all mapped target data vault artefacts from a single staged file-based batch file/table in parallel.
Surrogate sequence id as a durable key
A surrogate sequence key used as a durable key means a business key will always carry the same surrogate sequence id; it is durable. The choice of a durable surrogate sequence key has a profound impact on your data load orchestration. Because the sequence key must be assigned to a business key in the hub table first, the loads to related satellite and link tables mapped from that single staged file-based batch file/table must wait for the hub load to complete before they can fetch the surrogate sequence key assigned to each loaded business key. For this reason, durable surrogate sequence ids are not recommended for loading your data vault.
Surrogate sequence id as a temporal key
A surrogate sequence key used as a temporal key means a new id is assigned to each record added to your satellite table. This key has no relation to other data vault tables, but as we have demonstrated in a separate article it can be used to construct a point-in-time (PIT) table populated with these sequence keys only. As such, this key can be used alongside the other suggested join key types. The cost to load is nil, as the key is defined as an auto-increment (identity) column when the satellite table is created, as sketched below.
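A minimal DDL sketch of such a satellite; the table and column layout is hypothetical, and only the AUTOINCREMENT pattern is the point.

```sql
-- Hypothetical satellite carrying a temporal surrogate sequence key.
CREATE TABLE sat_card_masterfile (
    dv_hashkey_hub_account BINARY(32)    NOT NULL,      -- parent hub surrogate hash key
    dv_loaddate            TIMESTAMP_NTZ NOT NULL,
    dv_recsource           VARCHAR       NOT NULL,
    dv_hashdiff            BINARY(32)    NOT NULL,
    dv_temporal_id         NUMBER        AUTOINCREMENT, -- assigned per inserted record at no load cost
    card_limit             NUMBER,
    card_status            VARCHAR
);
```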
Natural keys
An option for your data vault is to not create hash keys at all. By using natural keys you do not need to think about the possibility of hash algorithm collisions, but you still need to consider the business key collision code and tenant id. The difference here is that they will be separate columns that must be included in all data vault tables. So,
- Natural key data vaults do not simplify the number of business key(s) and columns to join on; the join of course includes business key(s) + a business key collision code + tenant id (see the sketch after this list).
- We can still load all mapped target data vault artefacts from a single staged file-based batch file/table in parallel.
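To illustrate that first point, here is roughly what a hub-to-satellite join predicate looks like under each option; hub_account, sat_card_masterfile and the column names are illustrative placeholders, not the article's actual model.

```sql
-- Surrogate hash key data vault: a single-column equi-join.
SELECT s.*
FROM hub_account AS h
JOIN sat_card_masterfile AS s
  ON s.dv_hashkey_hub_account = h.dv_hashkey_hub_account;

-- Natural key data vault: business key(s) + collision code + tenant id.
SELECT s.*
FROM hub_account AS h
JOIN sat_card_masterfile AS s
  ON  s.account_id_bk = h.account_id_bk
  AND s.bkcc          = h.bkcc
  AND s.tenant_id     = h.tenant_id;
```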
Uniquely, you should also include the untreated natural key in your satellite table loads; in a subsequent article we will show why that may be useful to you!
Cost analysis: Surrogate hash keys simplify joining data vault tables and encourage parallel loading
Cost to load
Each of the above join key types implies a different CPU cost to load; to illustrate their respective costs we will create a table of 100 million records for each join key type and record the time to build.
The following has been run on a MEDIUM virtual warehouse.
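For illustration only (this is not the article's original benchmark script), a build of this shape can be reproduced with Snowflake's GENERATOR table function; every name here is a hypothetical placeholder.

```sql
-- Fabricate 100 million keys and time the build of the surrogate hash key variant.
CREATE OR REPLACE TABLE bench_hash_key AS
SELECT
    SHA2_BINARY(
        UPPER(TRIM(CAST(SEQ8() AS VARCHAR))) || '||default||default'
    ) AS dv_hashkey_hub_account
FROM TABLE(GENERATOR(ROWCOUNT => 100000000));

-- Repeat with SHA1_BINARY, MD5_BINARY and the untreated natural key,
-- then compare elapsed times in the query history.
```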
It should be noted here that to minimise the cost of running a hashing algorithm, the algorithm should be executed as few times as possible. What we mean by this is that the staged surrogate hash key will be sent to a hub and its related link and satellite tables, or to a link and its related link-satellite tables. Stage the hashing once, load to many target tables.
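One way to honour "stage the hashing once, load to many target tables" is sketched below: the hash is computed in a staged object and every mapped target load reads from it. Table and column names are assumptions, and the usual data vault housekeeping details are trimmed for brevity.

```sql
-- Hash the staged business key once...
CREATE OR REPLACE TABLE stg_card_masterfile_keyed AS
SELECT
    SHA2_BINARY(account_id_bk || '||' || bkcc || '||' || tenant_id) AS dv_hashkey_hub_account,
    s.*
FROM stg_card_masterfile AS s;

-- ...then every mapped target reads the staged key, so the hub, link and
-- satellite loads can run in parallel. Hub load shown as one example:
INSERT INTO hub_account (dv_hashkey_hub_account, account_id_bk, bkcc, tenant_id, dv_loaddate, dv_recsource)
SELECT DISTINCT stg.dv_hashkey_hub_account, stg.account_id_bk, stg.bkcc, stg.tenant_id,
       CURRENT_TIMESTAMP(), 'card_masterfile'
FROM stg_card_masterfile_keyed AS stg
WHERE NOT EXISTS (
    SELECT 1
    FROM hub_account AS h
    WHERE h.dv_hashkey_hub_account = stg.dv_hashkey_hub_account
);
```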
Also note, data vault surrogate keys should never be used beyond the construction of information marts; within that construction they exist to perform the optimal joins between the data vault tables in-database. The resulting information mart will not carry data vault surrogate keys, only the needed business keys.
Did you notice we did not profile the temporal surrogate keys? This column is defined as an auto-increment / identity column and simply increments for every record loaded. Expect the load time to be the same as loading by natural keys.
What about the record-digest (aka HashDiff)? These take the satellite table attributes, hash them, and persist the digest value as a column in the satellite table. If you want to avoid this hashing too, then you will need to compare every satellite table attribute to the staged content to determine what is a true change. This might not be a significant saving, considering that you may be modelling tens or even hundreds of columns into your satellite table!
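A sketch of the two comparison styles just described, with hypothetical column names: the first persists a HashDiff over the descriptive attributes, the second compares every attribute directly and grows with the column count.

```sql
-- Option 1: persist a HashDiff over the descriptive attributes (one comparison per row).
SELECT
    SHA2_BINARY(
        COALESCE(TO_VARCHAR(card_limit), '') || '||' ||
        COALESCE(card_status, '')
    ) AS dv_hashdiff
FROM stg_card_masterfile_keyed;

-- Option 2: no HashDiff, so every attribute is compared against the current
-- satellite record, one predicate per column:
--   WHERE sat.card_limit  IS DISTINCT FROM stg.card_limit
--      OR sat.card_status IS DISTINCT FROM stg.card_status
--      OR ... (grows with the width of the satellite)
```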
Would a non-HashDiff and a HashDiff load have comparable performance? Let's answer that with this quadrant.
When we look at the difference in terms of table width and record count, the gap can be significant depending on the number of records compared and inserted into the target wide satellite table.
Cost analysis: Loading with Natural Keys can have a significant cost benefit, but using hashing for load comparison remains the optimal and predictable way to load satellite tables.
Cost of storage
Choosing the join key data type has an impact on the size of the data vault tables on disk and the number of micro-partitions needed for each table.
We have not considered durable sequence keys as this is an anti-pattern.
We have shown how, as a table grows, it naturally clusters by load date timestamp (or even the applied date timestamp). What if we were to explicitly define a clustering key per data vault table; what impact would that have on storage?
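An explicit clustering key on the load date might be declared along the following lines; the column name and the date-truncation expression are illustrative choices, not the only ones.

```sql
-- Explicitly cluster the satellite by (the date portion of) its load date.
ALTER TABLE sat_card_masterfile CLUSTER BY (TO_DATE(dv_loaddate));
```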
And the outcome…
As you can see from the non-clustered and clustered satellite tables, we see size savings ranging from 1% to 45% (from very wide tables to very thin tables). This is because micro-partitions are immutable: each day's delta load could produce micro-partitions of varying sizes (compressed and encrypted), so if the table you're loading is explicitly clustered, Snowflake's re-clustering operation re-writes the data into new micro-partitions of a similar size and achieves better compression for every column in every micro-partition. This operation is seamless, with no downtime, and the table ends up with fewer micro-partitions. Because we have chosen to cluster by load date in this example, only the micro-partitions sharing the load date of the newly loaded records are re-clustered, and only if Snowflake determines it will improve static pruning.
Serverless operations carry a cost to cluster; see here for a full explanation of how Snowflake credits work. Choose a column of low cardinality to cluster on, as we have done here by applying the cluster-by expression.
Natural key-based data vault sizes may vary because a natural key can be of any size and could in fact be larger than its surrogate hash key counterpart; a surrogate hash key will always be the same size no matter what column values you feed it.
As an example, let's demonstrate what clustering does by executing the clustering information function before and after defining a cluster key for the satellite SAT_CARD_MASTERFILE table.
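The function in question is Snowflake's SYSTEM$CLUSTERING_INFORMATION; run before and after the cluster key is defined, it reports average depth and overlap statistics for the chosen expression (the expression below assumes the illustrative dv_loaddate column).

```sql
-- Inspect clustering quality, before and after defining the cluster key.
SELECT SYSTEM$CLUSTERING_INFORMATION('SAT_CARD_MASTERFILE', '(TO_DATE(dv_loaddate))');
```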
Because the satellite table is essentially loaded in load date order, the clustering before explicitly defining a cluster key is not that bad, but it can be improved. Does this influence querying? Let's see! Scroll down to the table below!
Cost analysis: Natural key data vault has the lowest storage footprint; by clustering, the table sizes are reduced even further!
Cost to query
Finally, the cost to load records into the data vault is incurred once, but the cost to query those records is incurred for every query you run on that data. We won't discuss query assistance tables in this article; instead we will combine the above data model tables using a standard SQL join query and compare runtimes per join key data type, excluding durable sequence ids.
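By way of illustration, such a join query has roughly this shape under the surrogate hash key option (the hub, link and satellite names below stand in for the model pictured earlier); a natural key variant would repeat business key, collision code and tenant id in every ON clause.

```sql
-- Combine hub, satellite, link and link-satellite on surrogate hash keys only.
SELECT h.account_id_bk,
       s.card_status,
       ls.transaction_amount
FROM hub_account AS h
JOIN sat_card_masterfile AS s
  ON s.dv_hashkey_hub_account = h.dv_hashkey_hub_account
JOIN link_account_transaction AS l
  ON l.dv_hashkey_hub_account = h.dv_hashkey_hub_account
JOIN sat_account_transaction AS ls
  ON ls.dv_hashkey_link_account_transaction = l.dv_hashkey_link_account_transaction;
```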
The following runtimes were executed on an X4LARGE virtual warehouse.
Remember, Snowflake querying is optimised through Snowflake's unique approach to caching; learn more about that here. The first query produces a “result set” and that result is kept in cache for 24 hours, so subsequent queries seeking the exact same result will be served from the cached result instead!
As for the query plan, it is somewhat of a Bushy Tree!
Cost analysis: Natural key data vault has the best query time when combining all the relevant data vault tables
Summary
The question we data vault practitioners see in the field is: why have hub tables at all if we can have natural keys in the satellite tables themselves? Passive Integration. We seek to integrate, across data sources, the single representation of a business object by its immutable business key (or the nearest durable key we can use to represent that entity). With a central hub table per business object type we can trace everything we need to know about that entity across all source systems; after all, these data sources are nothing more than the outcomes of business processes automated by source systems, which we historize into a data vault.
Join key data type (like on any other platform) plays an important role in architecture performance but should never dictate what your data vault model will look like.
In the next article in the series, we apply what we learnt here to one of data vault's query helper tables, the point-in-time (PIT) table (and Snowflake's Right-Deep Join Tree).
Note: use this article as a guide, test against your own workloads to make an informed decision.
Further reading
- Data Vault Guru: a Pragmatic Guide to Building a Data Vault, https://www.amazon.com/dp/B08KSSKFMZ
- Keep your Data Vault honest with this test suite, https://github.com/PatrickCuba/the_data_must_flow
- Combine your Data Vault with a Data Mesh with Domain Driven Design, https://medium.com/snowflake/1-data-vault-and-domain-driven-design-d1f8c5a4ed2
- Merging historical data sources, https://roelantvos.com/blog/wp-content/uploads/2018/08/Data-Vault-Implementation-and-Automation-Merging-Time-Variant-data-sets.pdf
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.