Data Vault on Snowflake: Hub locking on Snowflake
Snowflake continues to set the standard for Data in the Cloud by removing the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You should expect the same relational capabilities for your data model as on any other platform, and Snowflake certainly delivers. There are, however, scenarios where you may need to consider transaction isolation, and one such scenario does exist for Data Vault: the common hub table.
Episode catalogue,
- Immutable Store, Virtual End-Dates
- Snowsight dashboards for Data Vault
- Point-in-Time constructs & Join Trees
- Querying really BIG satellite tables
- Streams & Tasks on Views
- Conditional Multi-Table INSERT, and where to use it
- Row Access Policies + Multi-Tenancy
- Hub locking on Snowflake
- Virtual Warehouses & Charge Back
- Handling Semi-Structured Data
- Feature Engineering and Business Vault
A reminder of the data vault table types,
Episode 8: Hub locking on Snowflake
We explored the concept of Passive Integration in the previous episode. From a technology perspective, this is the ideal situation for a Data Vault model if we are looking to integrate by, and concurrently load to, common hub tables per business object definition. Given that Snowflake uses the READ COMMITTED transaction isolation level, how then do we guarantee that independent hub table loaders leave the target hub table with the same integrity after the load? Let’s explore how this is done in Snowflake and what this could mean for your data vault model.
As depicted above, any landed source data is staged and loaded to modelled hub, link, and satellite tables.
- Satellite tables are single source and are unique by the defined parent key + load date, carrying the descriptive state content
- Link tables are rarely multi-source and are unique by the defined unit of work
- Hub tables are always multi-source and are unique by business object
Notice how each source loads to a common hub table in the above example; each hub-loader must leave the hub table with the same integrity as it found it: a unique list of business objects defined by business key + business key collision code + multi-tenant id.
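As a concrete illustration, a hub table for an account business object might look like the sketch below; the table and column names are assumptions for this article, not a prescribed standard.

```sql
-- Hypothetical hub table for an Account business object.
-- Uniqueness: business key + business key collision code (BKCC)
-- + multi-tenant id; the hash key is derived from all three.
CREATE TABLE IF NOT EXISTS hub_account (
    hub_account_hk  BINARY(20)    NOT NULL,  -- surrogate hash key
    account_id      VARCHAR       NOT NULL,  -- business key
    bkcc            VARCHAR       NOT NULL,  -- business key collision code
    tenant_id       VARCHAR       NOT NULL,  -- multi-tenant id
    load_date       TIMESTAMP_NTZ NOT NULL,  -- when the key was first loaded
    record_source   VARCHAR       NOT NULL   -- where the key was first seen
);
```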
The Problem
The problem is the same as described in episode 6 when discussing conditional multi-table inserts: if each thread attempting to load the same target hub table executes at the same time, then due to the nature of READ COMMITTED transaction isolation, each thread views the target table without the UNCOMMITTED transactions coming from the other threads. Observe…
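To make the race concrete, here is a sketch of a typical INSERT-based hub loader against the hypothetical hub_account table above, with staged_source standing in for your staged view. If two of these statements run concurrently under READ COMMITTED, both anti-joins can find the same key "missing" and both will insert it.

```sql
-- Hub loader as a plain INSERT with an anti-join.
-- Under READ COMMITTED, two concurrent executions can both find the
-- same new key "missing" from the hub and both insert it: a duplicate.
INSERT INTO hub_account
SELECT stg.hub_account_hk,
       stg.account_id,
       stg.bkcc,
       stg.tenant_id,
       stg.load_date,
       stg.record_source
FROM (
      -- condense staged content to one row per hub key
      SELECT *
      FROM   staged_source
      QUALIFY ROW_NUMBER() OVER (
                PARTITION BY hub_account_hk
                ORDER BY load_date) = 1
     ) stg
LEFT JOIN hub_account hub
       ON hub.hub_account_hk = stg.hub_account_hk
WHERE  hub.hub_account_hk IS NULL;  -- anti-join: keys not already in the hub
```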
Duplicates break the integrity of the data vault model, and the information marts based on these structures will start to suffer too.
By the way, one of the tiles in our dashboard (episode 2) was checking for this issue!
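That tile boils down to a reconciliation query along these lines (again using the hypothetical hub_account columns):

```sql
-- Integrity check: a hub must be a unique list of business objects.
-- Any row returned indicates duplicate hub entries.
SELECT account_id,
       bkcc,
       tenant_id,
       COUNT(*) AS occurrences
FROM   hub_account
GROUP BY account_id, bkcc, tenant_id
HAVING COUNT(*) > 1;
```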
The Solution
Snowflake does in fact allow for table locking, and the change is simple: turn the SQL INSERT statement into an SQL MERGE statement, and the target hub table defined in the MERGE statement will be locked when it is its turn to update the target table. Of course, this implies that the hub-loader template is changed, and all configured hub-loaders will then use the SQL MERGE INTO statement instead! Observe…
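A sketch of the same hypothetical hub loader expressed as a MERGE; because the MERGE locks the target table, concurrent hub loaders queue rather than race:

```sql
-- Hub loader as a MERGE: Snowflake locks the target table for the
-- duration of the DML, so concurrent hub loaders serialize instead
-- of inserting duplicates.
MERGE INTO hub_account hub
USING (
      -- condense staged content to one row per hub key
      SELECT *
      FROM   staged_source
      QUALIFY ROW_NUMBER() OVER (
                PARTITION BY hub_account_hk
                ORDER BY load_date) = 1
     ) stg
ON hub.hub_account_hk = stg.hub_account_hk
WHEN NOT MATCHED THEN
    INSERT (hub_account_hk, account_id, bkcc, tenant_id,
            load_date, record_source)
    VALUES (stg.hub_account_hk, stg.account_id, stg.bkcc, stg.tenant_id,
            stg.load_date, stg.record_source);
```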
Three SQL Data Manipulation Language (DML) statements lock Snowflake tables for updates:
- SQL MERGE — Inserts, updates, and deletes values in a table based on values in a second table or a subquery, see bit.ly/3sFE4iE
- SQL UPDATE — Updates specified rows in the target table with new values, see bit.ly/37qzxbL
- SQL DELETE — Removes rows from a table, see bit.ly/3N04CSy
In the grand scheme of our automation, the first animation can be updated to the following.
Does this approach add latency? Probably, but it should be minor because hub tables typically only process a few condensed records at a time using anti-semi joins. The beauty of this approach is that we did not need to explicitly lock the target table using externally defined semaphores; Snowflake instead decides which thread will be processed first, and for hub tables, that’s just fine!
While you’re there…
SQL MERGE statements allow for SQL INSERTs and UPDATEs, which is an opportunity to explore a Data Vault artefact that was previously deprecated: the “Last Seen Date” column in the hub and link table. Because hub and link tables are wafer thin and tend to be short, the number of micro-partitions that make up these tables is very small. Why not, then, allow SQL UPDATEs to the hub and link table? To revisit how Snowflake stores tables as micro-partitions, please visit episode 1, bit.ly/3LWR8qR
Yes, SQL UPDATEs are expensive operations, but that is true for satellite tables, where we might see anything from a single descriptive column to hundreds of them; SQL UPDATEs are still not recommended there! Let’s update our previous example to show where Last-Seen-Date can be useful. Observe…
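A sketch of the hub loader extended with a hypothetical LAST_SEEN_DATE column: new keys are inserted as before, while keys already in the hub simply have their last-seen timestamp refreshed.

```sql
-- Hub loader with a hypothetical LAST_SEEN_DATE column added to hub_account.
MERGE INTO hub_account hub
USING (
      SELECT *
      FROM   staged_source
      QUALIFY ROW_NUMBER() OVER (
                PARTITION BY hub_account_hk
                ORDER BY load_date) = 1
     ) stg
ON hub.hub_account_hk = stg.hub_account_hk
WHEN MATCHED THEN
    -- key already exists: just record that we saw it again
    UPDATE SET last_seen_date = stg.load_date
WHEN NOT MATCHED THEN
    INSERT (hub_account_hk, account_id, bkcc, tenant_id,
            load_date, last_seen_date, record_source)
    VALUES (stg.hub_account_hk, stg.account_id, stg.bkcc, stg.tenant_id,
            stg.load_date, stg.load_date, stg.record_source);
```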
Other Data Vault artefacts
Certain other Data Vault artefacts were developed to track this very information because they are INSERT-ONLY table structures; these are:
- Record Tracking Satellites (RTS) — designed to record every time we see a business object or unit of work
- Status Tracking Satellites (STS) — designed to track when a business object or unit of work appears, is updated, or is deleted in the source, if the source is provided as a snapshot
- Effectivity Satellites (EFS) — combined with a driving key, designed to track the driving key(s) against the non-driving key(s) of a relationship
Because “Last Seen Date” only records the latest occurrence of the business object or unit of work, it is not a reliable source for checking:
- Source frequency of occurrence as tracked in RTS
- Source deletions as tracked in STS, and
- Changing relationships, where movement beyond the current relationship needs to be tracked, as tracked in EFS.
Driving Key and Effectivity
If the data source does not contain a business date tracking the effectivity of the driving entity of the relationship, or you wish to track the effectivity of a different driving entity than the one tracked in the data source, then the need for an Effectivity Satellite arises, see bit.ly/3oS4k70. That said, a LAST_SEEN_DATE column in the link table will give you the current relationship without needing one of the most complex data vault patterns; let’s explore by way of an example.
No matter what the driving key is, utilising the LAST_SEEN_DATE will give you the current active relationship for that driving key/relationship. It also does not require that you deploy multiple effectivity satellites for each driving key you want to track on a single link table. However, you will not be able to trace the historical movement of that driving-key-to-non-driving-key relationship; that is the exclusive realm of the Effectivity Satellite, unless (again) the source system provides this movement.
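Under those assumptions, the current relationship per driving key can be read straight off the link table with a window function; the link table and column names below are hypothetical.

```sql
-- Current active relationship per driving key, read from a hypothetical
-- account-to-manager link table carrying a LAST_SEEN_DATE column.
SELECT hub_account_hk,   -- driving key
       hub_manager_hk,   -- non-driving key
       last_seen_date
FROM   link_account_manager
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY hub_account_hk      -- one row per driving key
          ORDER BY last_seen_date DESC) = 1;
```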
To lock or not to lock
Always test these scenarios for yourself! The idea behind this technique is to take advantage of what Snowflake technology provides while keeping Data Vault automation single purpose.
As quoted from the Zen of Python: “There should be one — and preferably only one — obvious way to do it.”
- A hub loader should load one hub table from a single source file
- A link loader should load one link table from a single source file
- A satellite loader should load one satellite table from a single source file
- There should be one way to do staging of landed content
Leverage Snowflake technology to ensure you meet your automation goals.
Reference:
- Zen of Python — https://peps.python.org/pep-0020/
- Transactions — https://docs.snowflake.com/en/sql-reference/transactions.html#read-committed-isolation
- Data Vault Agility on Snowflake — https://medium.com/snowflake/data-vault-agility-on-snowflake-747a04485499
- Published here too: https://www.snowflake.com/blog/data-vault-techinques-hub-locking/
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.