Data Vault on Snowflake: Hub locking on Snowflake
Snowflake continues to set the standard for Data in the Cloud by removing the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You should expect the same relational capabilities for your data model as on any other platform, and Snowflake certainly delivers. There are, however, scenarios where you may need to consider transaction isolation, and one such scenario does exist for Data Vault: the common hub table.
Episode catalogue,
- Immutable Store, Virtual End-Dates
- Snowsight dashboards for Data Vault
- Point-in-Time constructs & Join Trees
- Querying really BIG satellite tables
- Streams & Tasks on Views
- Conditional Multi-Table INSERT, and where to use it
- Row Access Policies + Multi-Tenancy
- Hub locking on Snowflake
- Virtual Warehouses & Charge Back
- Handling Semi-Structured Data
- Feature Engineering and Business Vault
A reminder of the data vault table types,
Episode 8: Hub locking on Snowflake
We explored the concept of Passive Integration in the previous episode. From a technology perspective, this is the ideal situation for a Data Vault model if we are looking to integrate by, and concurrently load to, common hub tables per business object definition. Given that Snowflake uses the READ COMMITTED transaction isolation level, how then do we guarantee that independent hub table loaders leave the target hub table with the same integrity after the load? Let’s explore how this is done in Snowflake and what this could mean for your data vault model.
As depicted above, any landed source data is staged and loaded to modelled hub, link, and satellite tables.
- Satellite tables are single source and are unique by the defined parent key + load date, carrying the descriptive state content
- Link tables are rarely multi-source and are unique by the defined unit of work
- Hub tables are always multi-source and are unique by business object
Notice how each source loads to a common hub table in the above example; each hub-loader must leave the hub table with the same integrity as it found it: a unique list of business objects defined by business key + business key collision code + multi-tenant id.
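As a concrete illustration, a hub table for an account business object might look like the sketch below; the table and column names are assumptions for this article, not a prescribed standard.

```sql
-- Hypothetical hub table for an Account business object.
-- Uniqueness: business key + business key collision code (BKCC)
-- + multi-tenant id; the hash key is derived from all three.
CREATE TABLE IF NOT EXISTS hub_account (
    hub_account_hk  BINARY(20)    NOT NULL,  -- surrogate hash key
    account_id      VARCHAR       NOT NULL,  -- business key
    bkcc            VARCHAR       NOT NULL,  -- business key collision code
    tenant_id       VARCHAR       NOT NULL,  -- multi-tenant id
    load_date       TIMESTAMP_NTZ NOT NULL,  -- when the key was first loaded
    record_source   VARCHAR       NOT NULL   -- where the key was first seen
);
```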
The Problem
The problem is the same as described in episode 6 when discussing conditional multi-table inserts: if each thread attempting to load the same target hub table executes at the same time, then due to the nature of READ COMMITTED transaction isolation, each thread views the target table without the UNCOMMITTED transactions coming from the other threads. Observe…
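To make the race concrete, here is a sketch of a typical INSERT-based hub loader against the hypothetical hub_account table above, with staged_source standing in for your staged view. If two of these statements run concurrently under READ COMMITTED, both anti-joins can find the same key "missing" and both will insert it.

```sql
-- Hub loader as a plain INSERT with an anti-join.
-- Under READ COMMITTED, two concurrent executions can both find the
-- same new key "missing" from the hub and both insert it: a duplicate.
INSERT INTO hub_account
SELECT stg.hub_account_hk,
       stg.account_id,
       stg.bkcc,
       stg.tenant_id,
       stg.load_date,
       stg.record_source
FROM (
      -- condense staged content to one row per hub key
      SELECT *
      FROM   staged_source
      QUALIFY ROW_NUMBER() OVER (
                PARTITION BY hub_account_hk
                ORDER BY load_date) = 1
     ) stg
LEFT JOIN hub_account hub
       ON hub.hub_account_hk = stg.hub_account_hk
WHERE  hub.hub_account_hk IS NULL;  -- anti-join: keys not already in the hub
```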
Duplicates break the integrity of the data vault model, and the information marts based on these structures will start to suffer too.
By the way, one of the tiles in our dashboard (episode 2) was checking for this issue!
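That tile boils down to a reconciliation query along these lines (again using the hypothetical hub_account columns):

```sql
-- Integrity check: a hub must be a unique list of business objects.
-- Any row returned indicates duplicate hub entries.
SELECT account_id,
       bkcc,
       tenant_id,
       COUNT(*) AS occurrences
FROM   hub_account
GROUP BY account_id, bkcc, tenant_id
HAVING COUNT(*) > 1;
```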
The Solution
Snowflake does in fact allow for table locking, and the change is simple: turn the SQL INSERT statement into an SQL MERGE statement, and the target hub table defined in the MERGE statement will be locked when it is its turn to update the target table. Of course, this implies that the hub-loader template is changed, and all configured hub-loaders will then use the SQL MERGE INTO statement instead! Observe…
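A sketch of the same hypothetical hub loader expressed as a MERGE; because the MERGE locks the target table, concurrent hub loaders queue rather than race:

```sql
-- Hub loader as a MERGE: Snowflake locks the target table for the
-- duration of the DML, so concurrent hub loaders serialize instead
-- of inserting duplicates.
MERGE INTO hub_account hub
USING (
      -- condense staged content to one row per hub key
      SELECT *
      FROM   staged_source
      QUALIFY ROW_NUMBER() OVER (
                PARTITION BY hub_account_hk
                ORDER BY load_date) = 1
     ) stg
ON hub.hub_account_hk = stg.hub_account_hk
WHEN NOT MATCHED THEN
    INSERT (hub_account_hk, account_id, bkcc, tenant_id,
            load_date, record_source)
    VALUES (stg.hub_account_hk, stg.account_id, stg.bkcc, stg.tenant_id,
            stg.load_date, stg.record_source);
```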
Three SQL Data Manipulation Language (DML) statements lock Snowflake tables for updates:
- SQL MERGE — Inserts, updates, and deletes values in a table based on values in a second table or a subquery, see bit.ly/3sFE4iE
- SQL UPDATE — Updates specified rows in the target table with new values, see bit.ly/37qzxbL
- SQL DELETE — Removes rows from a table, see bit.ly/3N04CSy
In the grand scheme of our automation, the first animation can be updated to the following.
Does this approach add latency? Probably, but it should be minor because hub tables typically only process a few condensed records at a time using anti-semi joins. The beauty of this approach is that we did not need to explicitly lock the target table using externally defined semaphores; Snowflake instead decides which thread will be processed first, and for hub tables, that’s just fine!
While you’re there…
SQL MERGE statements allow for SQL INSERTs and UPDATEs, which is an opportunity to explore a Data Vault artefact that was previously deprecated: the “Last Seen Date” column in the hub and link table. Because hub and link tables are wafer thin and tend to be short, the number of micro-partitions that make up these tables is very small. Why not, then, allow SQL UPDATEs to the hub and link table? To revisit how Snowflake stores tables as micro-partitions, please visit episode 1, bit.ly/3LWR8qR
Yes, SQL UPDATEs are expensive operations, but that is true for satellite tables, where we might see anything from a single descriptive column to hundreds of them; SQL UPDATEs are still not recommended there! Let’s update our previous example to show where Last-Seen-Date can be useful. Observe…
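A sketch of the hub loader extended with a hypothetical LAST_SEEN_DATE column: new keys are inserted as before, while keys already in the hub simply have their last-seen timestamp refreshed.

```sql
-- Hub loader with a hypothetical LAST_SEEN_DATE column added to hub_account.
MERGE INTO hub_account hub
USING (
      SELECT *
      FROM   staged_source
      QUALIFY ROW_NUMBER() OVER (
                PARTITION BY hub_account_hk
                ORDER BY load_date) = 1
     ) stg
ON hub.hub_account_hk = stg.hub_account_hk
WHEN MATCHED THEN
    -- key already exists: just record that we saw it again
    UPDATE SET last_seen_date = stg.load_date
WHEN NOT MATCHED THEN
    INSERT (hub_account_hk, account_id, bkcc, tenant_id,
            load_date, last_seen_date, record_source)
    VALUES (stg.hub_account_hk, stg.account_id, stg.bkcc, stg.tenant_id,
            stg.load_date, stg.load_date, stg.record_source);
```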
Other Data Vault artefacts
Certain other Data Vault artefacts were developed to track this very information because they are INSERT-ONLY table structures; these are:
- Record Tracking Satellites (RTS) — designed to record every time we see a business object or unit of work
- Status Tracking Satellites (STS) — designed to track when a business object or unit of work appears, is updated, or is deleted in the source, if the source is provided as a snapshot
- Effectivity Satellites (EFS) — combined with a driving key, designed to track the driving key(s) against the non-driving key(s) of a relationship
Because “Last Seen Date” only records the latest occurrence of the business object or unit of work, it is not a reliable source for checking:
- Source frequency of occurrence as tracked in RTS
- Source deletions as tracked in STS, and
- Changing relationships, where movement beyond the current relationship needs to be tracked, as tracked in EFS.
Driving Key and Effectivity
If the data source does not contain a business date tracking the effectivity of the driving entity of the relationship, or you wish to track the effectivity of a different driving entity than the one tracked in the data source, then the need for an Effectivity Satellite arises, see bit.ly/3oS4k70. That said, a LAST_SEEN_DATE column in the link table will give you the current relationship without needing one of the most complex data vault patterns; let’s explore by way of an example.
No matter what the driving key is, utilising the LAST_SEEN_DATE will give you the current active relationship for that driving key/relationship. It also does not require that you deploy multiple effectivity satellites for each driving key you want to track on a single link table. However, you will not be able to trace the historical movement of that driving-key-to-non-driving-key relationship; that is the exclusive realm of the Effectivity Satellite, unless (again) the source system provides this movement.
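Under those assumptions, the current relationship per driving key can be read straight off the link table with a window function; the link table and column names below are hypothetical.

```sql
-- Current active relationship per driving key, read from a hypothetical
-- account-to-manager link table carrying a LAST_SEEN_DATE column.
SELECT hub_account_hk,   -- driving key
       hub_manager_hk,   -- non-driving key
       last_seen_date
FROM   link_account_manager
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY hub_account_hk      -- one row per driving key
          ORDER BY last_seen_date DESC) = 1;
```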
To lock or not to lock
Always test these scenarios for yourself! The idea behind this technique is to take advantage of what Snowflake technology provides while keeping Data Vault automation single purpose.
As quoted from the Zen of Python: “There should be one — and preferably only one — obvious way to do it.”
- A hub loader should load one hub table from a single source file
- A link loader should load one link table from a single source file
- A satellite loader should load one satellite table from a single source file
- There should be one way to do staging of landed content
Leverage Snowflake technology to ensure you meet your automation goals.
Reference:
- Zen of Python — https://peps.python.org/pep-0020/
- Transactions — https://docs.snowflake.com/en/sql-reference/transactions.html#read-committed-isolation
- Data Vault Agility on Snowflake — https://medium.com/snowflake/data-vault-agility-on-snowflake-747a04485499
- Published here too: https://www.snowflake.com/blog/data-vault-techinques-hub-locking/
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.