Data Vault on Snowflake: Information Marts & Ghost Skew
Snowflake continues to set the standard for data in the cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You can expect the same relational capabilities for your data model as on any other platform, and Snowflake certainly delivers.
The data vault contains many tables to join, and we use PIT tables to simplify those joins before presenting the data in a meaningful form to the business user as information marts. Just as snapshot PITs and incrementally built PITs have special considerations on Snowflake, so does the way you query those PIT variations for those information marts.
Note that building information marts based on PITs is not the only option to delivering information to your users!
Series
- Join Key Data Type
- Snapshot PIT tables
- Incremental & Merge PIT tables
- Information Marts & Ghost Skew
- Simulating Fact tables with Bridges
- Dynamic Information Marts
A reminder of the standard data vault table types…
And the query assistance tables needed to boost join performance…
Episode 16: Information Marts
In a data vault, information marts are typically delivered as views over query assistance structures such as PITs and Bridges; this means information is up to date as soon as the underlying query assistance structures are updated. Of course, it also means that every query against those information marts executes the view's query logic. Snowflake's caching may assist with that querying, but beyond the techniques we have discussed in this series and the previous series, you may consider physicalising those information marts.
Before you physicalise, there is another technique you may consider to optimise your view query performance: dealing with Ghost Skew.
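As a minimal sketch, an information mart view over a PIT might look like the following; all table and column names are illustrative (they echo the card example used later in this article) rather than a prescribed layout:
create or replace view card_account_im as
select
    hub.account_id,            -- illustrative business key column
    pit.snapshotdate,          -- illustrative PIT snapshot column
    mf.card_type,              -- illustrative satellite attributes
    bal.balance_category
from pit_card_account pit      -- assumed PIT table name
join hub_account hub
  on hub.dv_hashkey_hub_account = pit.dv_hashkey_hub_account
-- equi-joins to the satellites are possible because ghost records guarantee a matching row
join sat_card_masterfile mf
  on mf.dv_hashkey_hub_account = pit.sat_card_masterfile_dv_hashkey_hub_account
 and mf.dv_loaddate            = pit.sat_card_masterfile_dv_loaddate
join sat_card_balancecategories bal
  on bal.dv_hashkey_hub_account = pit.sat_card_balancecategories_dv_hashkey_hub_account
 and bal.dv_loaddate            = pit.sat_card_balancecategories_dv_loaddate;
Because the view carries no data of its own, each query against it re-executes these joins, which is exactly where the skew discussed next starts to matter.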
Let’s dig in!
Ghost Skew
As we saw in the previous article, depending on how far back your snapshot PIT is created, you may end up with keys in your PIT table that reference nothing but the ghost records from ALL of the selected satellite tables! This implies that, at that snapshot date, there is no data yet in the surrounding satellite tables for that business object.
Let’s break this skew down:
- ALL join keys for a business key reference a ghost record
- ALL join keys up to a point in time reference ghost records but then start to reference change details about a business key
- SOME join keys point to ghost records in some of the referencing satellite tables
- SOME or ALL join keys that are not referencing ghost records don’t change for a long period, skew in the transactional data itself!
- No skew in the form of continuous change details happening at almost every snapshot period at regular intervals. We can call this a normal distribution of join keys.
Just how severe could the Ghost Skew be, and how do we measure it? Because ghost records are artificial records inserted during PIT table construction, we measure their occurrence by evaluating how much Ghost Skew exists versus non-ghost records.
Let's start with a simple count.
Step 1: Data Preparation
We are interested in the proliferation of ghost records that are present in all or some of the satellite keys. To prepare the data, we do a little one-hot encoding:
case when sat_card_masterfile_dv_hashkey_hub_account = to_binary('00000000000000000000')
      and sat_card_transaction_header_dv_hashkey_hub_account = to_binary('00000000000000000000')
      and sat_card_balancecategories_dv_hashkey_hub_account = to_binary('00000000000000000000')
      and sat_bv_account_card_summary_dv_hashkey_hub_account = to_binary('00000000000000000000')
     then 1 -- ALL satellite keys reference the ghost record
     when sat_card_masterfile_dv_hashkey_hub_account = to_binary('00000000000000000000')
       or sat_card_transaction_header_dv_hashkey_hub_account = to_binary('00000000000000000000')
       or sat_card_balancecategories_dv_hashkey_hub_account = to_binary('00000000000000000000')
       or sat_bv_account_card_summary_dv_hashkey_hub_account = to_binary('00000000000000000000')
     then 2 -- SOME (but not all) satellite keys reference the ghost record
     else 0 -- no satellite key references the ghost record
end as ghost_skew
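As a minimal sketch, that classification can be wrapped in the simple count; pit_card_account is an assumed PIT table name, while the satellite key columns are the ones used in the case expression above:
with classified as (
    select
        case when sat_card_masterfile_dv_hashkey_hub_account = to_binary('00000000000000000000')
              and sat_card_transaction_header_dv_hashkey_hub_account = to_binary('00000000000000000000')
              and sat_card_balancecategories_dv_hashkey_hub_account = to_binary('00000000000000000000')
              and sat_bv_account_card_summary_dv_hashkey_hub_account = to_binary('00000000000000000000')
             then 1
             when sat_card_masterfile_dv_hashkey_hub_account = to_binary('00000000000000000000')
               or sat_card_transaction_header_dv_hashkey_hub_account = to_binary('00000000000000000000')
               or sat_card_balancecategories_dv_hashkey_hub_account = to_binary('00000000000000000000')
               or sat_bv_account_card_summary_dv_hashkey_hub_account = to_binary('00000000000000000000')
             then 2
             else 0
        end as ghost_skew
    from pit_card_account          -- assumed PIT table name
)
select ghost_skew, count(*) as key_count
from classified
group by ghost_skew
order by ghost_skew;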
Step 2: Calculate Skew
Although we may not have direct control over the distribution of non-ghost record keys, we do have control over the ghost records. We will use Pearson's Coefficient of Skewness to calculate Ghost Skew.
- The formula for the first coefficient of skewness (mode skewness): (mean - mode) / standard deviation
- The formula for the second coefficient of skewness (median skewness): 3 * (mean - median) / standard deviation
The rule of thumb with skewness is this:
- If skewness is less than -1 or greater than 1, the distribution is highly skewed
- If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed
- If skewness is between -0.5 and 0.5, the distribution is approximately symmetric
A negative skew indicates that the mode and median are greater than the mean, and the inverse indicates a positive skew. Because our underlying dataset has a dramatic injection of ghost records, we will rely on the median coefficient.
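As a sketch, Pearson's second (median) coefficient can be computed directly in Snowflake over the ghost_skew column prepared in Step 1; classified_pit_keys is an assumed name for that Step 1 output materialised as a table, view, or CTE:
select
    avg(ghost_skew)                                  as mean_skew,
    median(ghost_skew)                               as median_skew,
    stddev(ghost_skew)                               as stddev_skew,
    3 * (avg(ghost_skew) - median(ghost_skew))
        / nullif(stddev(ghost_skew), 0)              as pearson_median_skewness
from classified_pit_keys;  -- assumed: the Step 1 classification output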
By contrast, let's compare the statistics we get from the incremental PIT tables.
Snapshot PIT tables have vastly greater data skew than incremental PIT tables; next we will look at how to code for this skew in Snowflake. Keep in mind that in this example we were only looking to deal with Ghost Skew; for non-ghost record skew you may have to look at the frequency of updates to your business entities.
Cost to query
Based on what we’ve learnt about ghost skew, how does this affect information mart builds?
The following statistics were gathered using an X4LARGE virtual warehouse.
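For reference, a warehouse of that size can be created with a statement like the following; the warehouse name and suspend settings are illustrative:
create warehouse if not exists pit_bench_wh
  warehouse_size = 'X4LARGE'    -- 4X-Large, as used for the statistics in this article
  auto_suspend = 60             -- suspend after 60 seconds of inactivity
  initially_suspended = true;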
Standard Information Mart query
The queries we have run thus far on both snapshot and incremental PIT tables have been of the standard information mart calibre; recall that these stats were introduced in articles 2 and 3 of this series.
Standard PIT, un-clustered:
Incremental PIT, un-clustered:
Standard PIT, clustered:
Incremental PIT, clustered:
Skew optimised Information Mart query
PIT tables may come with an amount of skew; to deal with it we run a skew-aware version of the information mart query that produces the same output. To illustrate the effect this has on query times, we run this new information mart structure over the same PIT tables as before.
Standard PIT, un-clustered:
The standard information mart query runs in the tens of minutes; we have reduced that runtime to less than 2 minutes, and some queries return within seconds.
Standard PIT, clustered:
We also see similar improvement on clustered PIT tables; the most dramatic improvement is for SNOPIT tables.
Incremental PIT, un-clustered:
We don’t see such a dramatic improvement for incremental PIT tables, but that is because we never had a large ghost skew!
Incremental PIT, clustered:
The same applies to clustered incremental PIT tables.
How did we improve information mart performance over the standard PIT table structures? We’ll chart clustered standard information mart query times versus the clustered skew optimised information marts based on the same snapshot PIT tables.
For simplicity, we’ll show only the modifications we have applied to the PIT table:
- Ghosts-Only — isolating where ghost records are referenced for all satellite table keys
- All-Other — for the remaining keys (moderate skew).
The code then stitches the two outcomes together in a UNION ALL step, as sketched below.
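A hedged sketch of that shape, reusing the illustrative names from the earlier view; your own column list, ghost record values, and join conditions will differ:
-- Ghosts-Only: keys where every satellite reference points at the ghost record
select
    pit.dv_hashkey_hub_account,
    pit.snapshotdate,
    null::varchar as card_type,         -- ghost rows carry no satellite detail;
    null::varchar as balance_category   -- align these with your ghost record content
from pit_card_account pit
where pit.sat_card_masterfile_dv_hashkey_hub_account = to_binary('00000000000000000000')
  and pit.sat_card_transaction_header_dv_hashkey_hub_account = to_binary('00000000000000000000')
  and pit.sat_card_balancecategories_dv_hashkey_hub_account = to_binary('00000000000000000000')
  and pit.sat_bv_account_card_summary_dv_hashkey_hub_account = to_binary('00000000000000000000')
union all
-- All-Other: the remaining keys join to the satellites as usual
select
    pit.dv_hashkey_hub_account,
    pit.snapshotdate,
    mf.card_type,
    bal.balance_category
from pit_card_account pit
join sat_card_masterfile mf
  on mf.dv_hashkey_hub_account = pit.sat_card_masterfile_dv_hashkey_hub_account
 and mf.dv_loaddate            = pit.sat_card_masterfile_dv_loaddate
join sat_card_balancecategories bal
  on bal.dv_hashkey_hub_account = pit.sat_card_balancecategories_dv_hashkey_hub_account
 and bal.dv_loaddate            = pit.sat_card_balancecategories_dv_loaddate
where not (pit.sat_card_masterfile_dv_hashkey_hub_account = to_binary('00000000000000000000')
       and pit.sat_card_transaction_header_dv_hashkey_hub_account = to_binary('00000000000000000000')
       and pit.sat_card_balancecategories_dv_hashkey_hub_account = to_binary('00000000000000000000')
       and pit.sat_bv_account_card_summary_dv_hashkey_hub_account = to_binary('00000000000000000000'));
Splitting the query this way lets the all-ghost keys skip the satellite joins entirely, while the remaining keys keep the standard join pattern.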
Cost analysis: massive improvement in query time when dealing with Ghost Skew in snapshot PIT tables; no improvement noted for incremental PIT querying.
Summary
Understanding the shape of your data is vital to getting the performance you desire on Snowflake. Remember, Snowflake traditional tables are made up of micro-partitions, and pruning and clustering remain the optimal way to improve performance on Snowflake. By profiling the keys needed in the join conditions, we can code for skewness in join keys based on different distribution profiles and help Snowflake's query planner process the data even faster.
What is clear is that the size of the PIT table, the data type, and the distribution of keys may all influence the type of PIT table you choose. Remember, in the first article in the series we said that some key types are not mutually exclusive. In other words, you could deploy a data vault based on surrogate hash keys or just natural keys and build PIT tables using surrogate hash keys, natural keys, or even temporal sequence ids from each satellite table.
In the next article we’ll take a detour from building PIT tables and showcase a variation of building a bridge table to simulate a dimensional fact table.
Note: use this article as a guide, test against your own workloads to make an informed decision.
Further reading
- Data Vault Guru: a Pragmatic Guide to Building a Data Vault https://www.amazon.com/dp/B08KSSKFMZ
- Keep your Data Vault honest with this test suite, https://github.com/PatrickCuba/the_data_must_flow
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.