Data Vault on Snowflake: Expanding to Dimensional Models
Snowflake continues to set the standard for Data in the Cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You can expect the same relational capabilities from your data model as on any other platform, and Snowflake certainly delivers.
With the informed perspective we now have on the types of keys we use to join data vault tables, this article revisits the temporal surrogate sequence key we discussed when building a SNOPIT. Yes, with the temporal sequence key in place we can properly model a Kimball fact and dimension using data vault tables instead. This technique may in fact influence how you model your data vault, and probably should! Without further ado, let's dig in!
Series
- Join Key Data Type
- Snapshot PIT tables
- Incremental PIT tables
- Information Marts
- Simulating Fact tables with Bridges
- Dynamic Information Marts
A reminder of the standard data vault table types,
And the query assistance tables needed to boost join performance…
Episode 17: Simulating Fact tables with Bridge tables
Data Vault was never meant to replace dimensional modelling; instead, the data vault design is meant to be the flexible enterprise data model, supporting eventual consistency while remaining non-destructive to change. Data vault captures the historical changes of the source-system domains built to automate business processes: it tracks the business objects/entities, their interactions/transactions/business events, and the change data about them. The data vault is therefore modelled to represent your enterprise. How do we then present that data for analytical value (the consumer domain)?
We build Kimball-styled dimensional models on top of the data vault in the form of facts and dimensions, and to query the data vault efficiently we have query assistance tables in the form of point-in-time (PIT) and bridge tables.
- Dimensional models are not easy to change but data vault is,
- Data vault models are not easy to join but dimensional models are.
A Brief History of Dimensional Models
For clarity, let’s define these Kimball style dimensional models briefly,
- Slowly changing dimensions (SCD) — are the descriptive detail tables that are slowly changing about a business entity, usually driven by a natural key. There are several types of these dimensions we often see (not limited to):
— Type 1 — overwrite attributes in place; we only keep current values in this dimension. Learn more about Type 1 dimensions here.
— Type 2 — we track history by end-dating a record and inserting a new record representing the new active values by natural key; the active record will have a high date (ex. '9999-12-31') and optionally a current record indicator. We will be using this dimension type (a small query sketch of the pattern follows this list of definitions). Learn more about this dimension type here.
— Type 3 — we track history in a single record with a "previous value" column. This is more efficient for a single query that does not need to traverse multiple records per natural key to track historical changes, but it is limited to just one historical column. Learn more about this dimension type here. We can in fact combine these types to form an SCD Type 6 (1+2+3) dimension!
— Type 4 — we track current and history of natural key state changes in separate tables, one for current active records and another for history only.
- Fact tables — these are rapidly changing details and tend to be metrics (things you can apply aggregations to) and there are several types:
— Fully additive metrics — basic transaction values that can be aggregated across records using functions such as sum(), avg(), count(), etc. We will be modelling this fact type.
— Semi-additive metrics — metrics that can be aggregated across some dimensions but not others (typically not across time), like a rolling balance or pre-aggregated values already included in the fact table.
— Non-additive metrics — usually indicators or other rapidly changing values that cannot be aggregated at all, except for counts. Learn more about these fact table types here.
Fact tables can be further categorised as transaction, periodic snapshot and accumulating snapshot fact tables, learn more of this further fact categorisation here.
- Date/Calendar dimension — a role-playing dimension we use to slice and dice a star schema by a chosen date or timestamp. We usually have important business date flags and business-related checkpoints like quarter-ends, seasonal dates, cycles, and reporting dates. Learn more about this dimension table here.
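As a tiny illustration of the Type 2 pattern reused later in this article, selecting only the active records from a Type 2 dimension relies on the high-date convention (the table and column names here are purely illustrative):
select customer_id        -- natural key
     , customer_name      -- slowly changing attribute
     , effective_from
     , effective_to       -- the high date '9999-12-31' marks the active record
from dim_customer
where effective_to = '9999-12-31';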
Dimensional models are what we see most often in data analytics and are the shape most BI tools are built to consume.
Transactions in a Data Vault model
Often when data modellers want to model transactions into a data vault, the instinctive direction is to build transaction-links or (under the new terminology) non-historized link tables. This in fact (pun unintended) is not the optimal deployment for batched transactions in a data vault. By batch we mean the data is not ingested in (near) real time and is instead available to be ingested like any other non-transactional dataset, as data-at-rest (i.e., a file).
To illustrate this point let’s present our data ready to be data vault modelled, a single batch file called “card_transactionfile”:
- the file supplied by our cards source contains interactions between customer and account each supplied with their respective business keys. For simplicity we’ll keep this example at just two business entities
- each transaction contains a transaction id column and transaction amounts as debits or credits to the account. One batch file could contain many transactions for a given account-customer relationship, unique by the transaction id and perhaps the transaction timestamp (an illustrative layout is sketched after this list).
- the file also contains personally identifiable data about the customer (included for illustrative purposes on how we model that in data vault).
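To make the example concrete, an illustrative layout of such a file could look like the following (the column names are assumptions for this article, not the actual source file):
create table stg_card_transactionfile (
    account_id               varchar        -- business key, integrates to hub_account
  , customer_id              varchar        -- business key for hub_customer
  , transaction_id           varchar        -- unique per transaction within the batch
  , transaction_timestamp    timestamp_ntz
  , card_transaction         number(20,2)   -- transaction amount
  , card_credit_debit        varchar        -- debit / credit indicator
  , customer_name            varchar        -- personally identifiable detail
  , customer_date_of_birth   date           -- personally identifiable detail
);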
Based on this single file we will model:
- hub_account — modelled to an existing account hub table; we will passively integrate this transaction file's content to the hub account table. The account id located in card_transactionfile will be loaded here.
- hub_customer — modelled to a new hub table; the transaction file contains the customer id and the personally identifiable detail we model below.
- lnk_customer_account — modelled to a new link table, this link table captures the unique relationship between account and customer as depicted by the transaction file. There are only two business objects depicted in this link table. The link table is a child table to the two parent hub tables described above. Link tables are the recorded interactions between business entities and nothing more.
- sat_card_transactionfile — a new satellite table that will hold the transaction amounts, descriptions, and other batched transaction details. Because there can be many transactions in a single batch, the dependent-child key here is the transaction id. This link-satellite table is a child table to the parent link table.
- sat_card_transactionfile_pii — a victim of our satellite-splitting exercise; we push customer-only details to this new customer-hub satellite table. As the PII details are identifying, you should only expect to see a single record in this satellite per customer. This makes regulatory requirements such as GDPR's Article 17 "right to be forgotten" / "right to erasure" far easier to manage if the identifying details are modelled here!
Always check with your own experts with regards to your legal/regulatory requirements for data management, governance, and retention on how to manage personally identifiable data!
A few technical details we need to mention about the satellite tables before we start putting the data model together,
- Each satellite table will have an auto-increment / identity column, the same as we defined for our SNOPIT; this is our temporal surrogate sequence key (a sketch follows this list).
- Each satellite table will by default have a view inferring the record end-date; we described this in detail here and, in our example, we define the view name as: vh_{$satellite-table-name} (VH = View History)
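A minimal sketch of such a satellite table, assuming the binary hash keys discussed earlier in this series (column names and data types are illustrative, not the full satellite definition):
create table sat_card_transactionfile (
    dv_sid                          integer autoincrement start 1 increment 1 -- temporal surrogate sequence key
  , dv_hashkey_lnk_account_customer binary(20)     -- parent link hash key
  , dv_loaddate                     timestamp_ntz
  , dv_recsource                    varchar
  , dv_hashdiff                     binary(20)
  , card_transaction_id             varchar        -- dependent-child key
  , card_transaction                number(20,2)
  , card_credit_debit               varchar
);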
Putting it all together
Rather than deploy an entirely separate dimensional model based on the data vault, we will take our existing data vault components and simulate a dimensional model, reusing data vault tables as dimensions!
- as_of table — the table we used in the construction of a PIT table is reused as a date dimension for slicing and dicing our dimensional model. Add the following line to the as_of table build to simulate a date key (ex. 20220622 as an integer represents '2022-06-22'):
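A one-line sketch, assuming the as_of table exposes a date column called as_of_date (the same arithmetic reappears in the bridge build below):
, year(as_of_date) * 10000 + month(as_of_date) * 100 + day(as_of_date) as date_sid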
- SCD Type 2 dimensions — each satellite table has a temporal sequence key as the autoincrement column we described above:
— vh_sat_card_masterfile view
— vh_sat_card_transactionfile_pii view
— vh_sat_card_transactionheader view
— vh_sat_card_balancecategories view
Recall that each dimension view has a virtual end date deployed with the following code:
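A minimal sketch of such a view for one of the hub-satellites (the full pattern was covered earlier in the series; for the link-satellite the partition key would be the link hash key plus the dependent-child key):
create or replace view vh_sat_card_masterfile as
select s.*
     , lead(s.dv_loaddate, 1, to_timestamp_ntz('9999-12-31')) over
         (partition by s.dv_hashkey_hub_account order by s.dv_loaddate) as dv_loaddate_end
from sat_card_masterfile s;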
- Transaction fact table — brdg_cardaccount_daily — this is our data vault bridge table based on link-satellite table sat_card_transactionfile. We use the bridge table to simulate a transaction fact table and include pre-calculated columns based on the other metrics from the link-satellite table.
Below is an example of a derived (pre-calculated) metric we can add to our bridge/fact table for rolling card balance; click this link for more Snowflake window functions.
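(A minimal sketch; it is the same card_transaction_summ window expression used in the bridge table build further down.)
, sum(card_transaction) over
    (partition by dv_hashkey_lnk_account_customer
     order by dv_loaddate
     rows between unbounded preceding and current row) as card_transaction_summ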
Depending on your requirement, Snowflake SQL also provides other analytical group functions you may consider for your metrics (a small sketch follows this list), such as:
- Group by Rollup — if you have a need to include sub-total rows in your fact table
- Group by Cube — builds on Group by Rollup by producing sub-total rows for every combination of the grouped columns
- Group by Grouping Sets — computing multiple Group By clauses in a single statement
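For instance, a minimal Group by Rollup sketch over the bridge table built below would add sub-total rows per date and a grand total:
select date_sid
     , account_id
     , sum(card_transaction) as card_transaction_total
from brdg_cardaccount_daily
group by rollup (date_sid, account_id);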
The default deployment of data vault presented here already has all the components we need to simulate a star schema, now for the crux of the model, the bridge table build — our fact table.
insert into brdg_cardaccount_daily
with cte_card_transactionfile as (
select dv_hashkey_lnk_account_customer
, dv_loaddate
, dv_loaddate_end
, dv_sid
, card_transaction
, card_credit_debit
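-- rolling card balance per account-customer relationship, ordered by load date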
, sum(card_transaction) over
(partition by dv_hashkey_lnk_account_customer order by dv_loaddate
rows between unbounded preceding and current row) as card_transaction_summ
from vh_sat_card_transactionfile s
)
select hub_a.account_id
, lnk.dv_hashkey_hub_customer
, hub_c.customer_id
-- dimension join keys retrieved from each satellite table
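-- a missing satellite record defaults to dv_sid 0 (assumed to be the ghost record / zero-key member)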
, coalesce(sat_hub.dv_sid, 0) as sat_card_masterfile_dv_sid
, coalesce(sat_link.dv_sid, 0) as sat_card_transactionfile_dv_sid
, coalesce(sat_hub2.dv_sid, 0) as sat_card_transactionfile_pii_dv_sid
-- simulate date dimension key
, year(sat_link.dv_loaddate) * 10000 + month(sat_link.dv_loaddate) * 100 + day(sat_link.dv_loaddate) as date_sid
-- metrics
, sat_link.card_transaction
, sat_link.card_credit_debit
, sat_link.card_transaction_summ
-- < Other SELECTED Satellites/Dimensions keys and metrics >
from datavault.lnk_account_customer lnk
inner join datavault.hub_account hub_a
on lnk.dv_hashkey_hub_account = hub_a.dv_hashkey_hub_account
inner join datavault.hub_customer hub_c
on lnk.dv_hashkey_hub_customer = hub_c.dv_hashkey_hub_customer
-- chief table
inner join cte_card_transactionfile sat_link
on (lnk.dv_hashkey_lnk_account_customer = sat_link.dv_hashkey_lnk_account_customer)
left join vh_sat_card_transactionfile_pii sat_hub2
on lnk.dv_hashkey_hub_customer = sat_hub2.dv_hashkey_hub_customer
and sat_hub2.dv_loaddate <= sat_link.dv_loaddate
and sat_hub2.dv_loaddate_end >= sat_link.dv_loaddate_end
left join vh_sat_card_masterfile sat_hub
on lnk.dv_hashkey_hub_account = sat_hub.dv_hashkey_hub_account
and sat_hub.dv_loaddate <= sat_link.dv_loaddate
and sat_hub.dv_loaddate_end >= sat_link.dv_loaddate_end
-- < Other SELECTED Satellites/Dimensions >
;
Although the above fact table build is transactional, you could easily modify it into a snapshot fact table, as we have done for logarithmic PIT tables. Business vault satellite content could be included too; this would segregate soft business rules into a defined architectural plane for derived content.
Although the example above has selected all available columns from each satellite table, your final dimensions may be selective about the columns you expose in the information mart, and for that you need to consider condensing the records in that dimension in the final output. Hint: SELECT DISTINCT is not the correct method to resolve this.
The final information mart that uses the above star schema uses an EQUI-JOIN on the surfaced temporal sequence keys we brought in from the data vault satellite tables (the auto-increment column). This, of course, takes advantage of the right-deep join tree algorithm we previously discussed when building a SNOPIT.
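A minimal sketch of such an information mart view (the view name is an assumption, the bridge table column names are assumed to match the aliases in the INSERT above, and the equi-join relies on ghost records occupying dv_sid 0 in each satellite so every fact row finds a dimension row; otherwise use left joins):
create or replace view im_card_account_daily as
select f.date_sid
     , f.account_id
     , f.customer_id
     , dim_mf.*                       -- masterfile (dimension) attributes; select the columns you need
     , f.card_transaction
     , f.card_credit_debit
     , f.card_transaction_summ
from brdg_cardaccount_daily f
inner join vh_sat_card_masterfile dim_mf
  on f.sat_card_masterfile_dv_sid = dim_mf.dv_sid   -- equi-join on the temporal sequence key
-- other dimensions join the same way on their respective dv_sid columns
;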
Summary
Using data vault hub, link and satellite tables as a dimensional data model brings dimensional querying right on top of the data vault! The data vault remains the auditable, agile, and automated data architecture that is flexible to change. As we have shown, dimensional modelling is not in competition with data vault; it shapes the data for easier consumption as information marts. The two are complementary.
This technique reduces the need to engineer and copy data out of the data vault to reshape it into facts and dimensions. And since the data has not moved at all, the dimensional mart (sans the fact table) is ready for consumption as soon as the data vault is updated! Should we need to reshape the fact table, or no longer need it at all, the original content is still backed by the link-satellite table; your solution remains auditable! We have also shown functional business rules, in the sense that we have added pre-calculations to the bridge table. Because these calculations can be re-created easily from the raw and business vault, these functional rules are classified as information mart rules, and not business vault rules.
In this series we have looked at:
- How your choice of join key data type influences your data vault build (not the model!)
- How the standard PIT table code can be improved
- How we can build and use incremental and merge PIT tables
- How based on Ghost Skew we can optimise Information Mart build
- How we can reuse Data Vault artefacts as Facts and Dimensions
In the final part of this series we will turn our attention to how we can use Snowflake's new Dynamic Tables in a Data Vault. Stay tuned!
Note: use this article as a guide, test against your own workloads to make an informed decision.
Further reading
- Data Vault Guru: a Pragmatic Guide to Building a Data Vault https://www.amazon.com/dp/B08KSSKFMZ
- Keep your Data Vault honest with this test suite, https://github.com/PatrickCuba/the_data_must_flow
- The Lost Art of Building Bridges, https://medium.com/snowflake/the-lost-art-of-building-bridges-3dbe89cbe253
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.