Data Vault on Snowflake: Conditional Multi-Table INSERT, and where to use it

Patrick Cuba
10 min readOct 3, 2022

--

Snowflake continues to set the standard for Data in the Cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You will expect the same relational capabilities for your data model as any other platform and Snowflake certainly delivers. In today’s episode we will explore the appropriate place to use a Conditional Multi-Table INSERT for your Data Vault, and where not to use it!

Episode catalogue,

  1. Immutable Store, Virtual End-Dates
  2. Snowsight dashboards for Data Vault
  3. Point-in-Time constructs & Join Trees
  4. Querying really BIG satellite tables
  5. Streams & Tasks on Views
  6. Conditional Multi-Table INSERT, and where to use it
  7. Row Access Policies + Multi-Tenancy
  8. Hub locking on Snowflake

10. Out-of-sequence data

9. Virtual Warehouses & Charge Back

11. Handling Semi-Structured Data

12. Feature Engineering and Business Vault

A reminder of the data vault table types,

We will again be using the PIT query assistance table in our Data Vault implementation.

Episode 6: Conditional Multi-Table INSERT, and where to use it

Snowflake’s SQL multi-table insert allows for you to insert data into multiple target tables in a single SQL statement in parallel from a single data source. The SQL conditional multi-table insert extends that capability by allowing for WHEN conditions to be configured for each of the target tables based on the content of the source data. Think of it like a CASE… WHEN condition you’d see in an SQL SELECT statement that executes IF…THEN…ELSE logic, except in this case 😊 the conditional output is defining the target table to send data content to!

Let’s see what that looks like:

from Snowflake docs, see bit.ly/3h97Ypz

In a Data Vault context there is a smart way to use the conditional multi-table insert and make it completely business driven. I’ll show you what I mean shortly, but first…

Where it should not be used

As data is landed before the data platform in a landing zone the content is staged by applying hard rules like trimming columns and ensuring the timestamps align. Data Vault staging will also add data vault tag columns, such as

· Surrogate hash keys — destined to be loaded to hub tables as hub-hash-keys, and if a relationship in the data exists, will also be destined to be loaded to link tables as link-hash-keys

  • Record source — denoting where the record came from
  • Load date timestamp — for when the data is loaded to data vault
  • Applied date — the date of the business event as it occurred in the source, essentially the extract date
  • Record hash, or HashDiffs — record digest, from a single source there could be one or many hashdiffs depending on the satellite split
  • and more…

Here it might be tempting to use this staged content as a base for a multi-table insert

a simple model, a business object, and its descriptive attributes

Table names and content have been simplified for clarity

SELECT DISTINCT will eliminate duplicates in the staged content and the data vault model integrity remains valid. In reality, data does not arrive in such perfect harmony, source-system data is profiled and will be mapped to hub, link, and satellite tables but what if we have modelled the following scenario,

  • a “same-as” link table that essentially references the same hub in the model; and
  • defined satellite splitting to ensure the right attributes are allocated to the right entity from a single load.
a more realistic model, what does it mean for a multi-table insert?
multi-table insert will lead to model integrity issues

Integrity Lost! Why did it happen?

There are essentially two problems highlighted with using a multi-table insert here:

1. Record condensing

Data is profiled and modelled to specific target table grain. Our example identified a relation between two key columns that represent a linking of cards, since both are essentially card numbers they will be loaded to the same hub, hub_account (Hub1). A 3rd portion of the relationship exists and that is with the customer, those keys are loaded to hub_customer (Hub2). We also applied some satellite splitting because not all the content is about the related cards. Since this is coming from a single staged file the content used as a base for a multi-table insert statement must use a single SELECT statement over this data. A single SELECT DISTINCT does not distinguish which portions of the same file to apply this rule to, it applies to all and hence why the hub-satellite content is not deduped.

2. Multi-table insert threads are executed in parallel

This means if we were loading to the same hub table at the same time each thread executing the same load condition is not aware of the other thread’s execution condition even if it’s coming from the same SQL statement.

Remember, Snowflake uses a READ COMITTED transaction isolation level and thus an INSERT statement does not lock the target table for inserts and each executing thread will see the table without the other thread’s uncommitted statements. In other words, the parallel execution of the multiple threads in a multi-table insert sees the same target table state with each thread unaware that another thread is attempting to load to the same target table with the same load conditions with potentially overlapping content!

independent, configured hub, link and sat loaders

READ COMITTED transaction isolation is excellent for keeping the platform scalable and allowing for the extreme levels of concurrency possible on Snowflake by not risking object locking and possible race conditions! In a future episode we will discuss how to do table locking on Snowflake and where we will need it.

You could imagine that the above scenarios could be coded for using multi-table inserts, and you could design and build a conditional ingestion pattern that if you encounter this modelling scenario then use Pattern A else Pattern B and so on. This code suite will become monolithic and tedious to maintain! Just how many scenarios would you have to cater for?

The essence of an architecture approach that uses repeatable patterns does not have such switch conditions to load the same table type. Rather there should be only one way to load a hub, a link, or a satellite table and the automation of these patterns reuses those simple building blocks through parameterisation to deploy this as an idempotent technique, like service-orient architecture. No matter the data vault model, you simply configure the hub, link, and satellite loaders as such.

Where can we use multi-table inserts? Well…

Where you could use it,

Let’s discuss the components of the above architecture

  • as_of switchboard, this is the date dimension table used as a base to set PIT windows (start & end dates). In addition to the regular columns, you’ll find in a date dimension this table will also include columns used as flags to denote whether a date or timestamp used in the window falls on a specific reporting timestamp, daily, weekly, monthly etc.

· the example data model in the above animation contains a hub and two satellite tables and data are constantly being loaded to those tables using independent and parameterised hub and sat-loaders respectively — the loaders are executed in parallel. PIT flow manifold in the other hand can be executed in one of two ways:

1. as PIT table rebuilds — a window is selected as a time period for reporting, if we were loading for a period window from 1 Jan 22 to 30 Jun 22 and when parsed through the manifold the as_of table’s flags control the flow of surrogate hash keys and load dates directed towards the logarithmic PIT table constructs. For our illustration above, let’s make a simple example without showing table growth. Each satellite gets 1000 records a day and only deltas are loaded, so if we attempted to load 1000 records a day for the month of January, we will likely have less than 31,000 records in the satellite but 31,000 recorded keys and load dates in the daily-PIT table. Here’s a summary for each PIT table.

Remember, satellite tables only load changes to the previous current state, so if for a consecutive day the record state hasn’t changed then that record is not loaded.

2. as a part of the data pipeline — yes, you could be running the flow through the PIT flow manifold daily and incrementally publish keys and load dates to the logarithmic PIT tables. The PIT window in this case is the single date this publishing of keys and load dates is run on. If the number of hub business keys remains consistent as they do in the above example, then you will expect the same record count as the above table. If (as a prospering business does) the hub tables are growing, then the record counts will differ between this pipeline PITs and rebuilt PITs.

In both implementation types the code is identical, the variance is only in the selection of a PIT-window.

How do we make this business driven? Deploying the switchboard as reference table controlled by reporting needs of the business. The code for the PIT flow manifold is deployed once and it sends keys and load dates to the target PIT tables controlled by the switchboard! The animation above was purely for illustration but essentially you can design the as_of switchboard table to include switches for

  • US financial year end, UK financial year end
  • Business daily instead of just Daily
  • First Business Day of the Week
  • Last Business Day of Week

Observe…

my_loaddate is the PIT Window, each PIT output is logarithmic

For PIT table rebuilds change the PIT window to a start and end date

Query assistance tables (PITs & Bridges) are disposable and only used to store keys and very light derived content — the type that does not need to be stored permanently because the metrics used for this calculation are stored in the raw and business vault. Because these constructs use inherent data platform techniques to boost query performance and simplify table join the information marts are intended to be deployed as views. Multi-table inserts is just another technique we can use in Snowflake to simplify our Data Vault deployment even further.

Other performance tips:

  • hash key and hashdiff column generation should be done in one place, and that is in staging. Do not perform hashing in the data vault table loaders (hub, link and satellite). By performing the hashing in staging, the hash keys and hash diffs are simply used and carried in the loaders and query assistance table builders.
  • To generate the hash key and hashdiff column values do not use functions intended to concatenate semi-structured table data (example array_construct or object_construct). These functions will perform slower than the concat function that is intended to be used with structured table columns, performance results: bit.ly/3OlitmM.
  • store hash columns as binary data type, they will use half the storage footprint if you were to store them as varchar columns and perform better in SQL joins.
  • include untreated business keys in satellite tables, that avoids the need to perform an SQL join to a hub table to fetch the treated business key(s). Learn more about business key treatments here bit.ly/2YnlSh0.
  • do not bring hash-keys into the information marts, they serve no purpose to the business user.
  • Consider satellite splitting when designing your raw vault satellites, this will avoid the need to denormalise the content further downstream and save you credits at query time.
  • Business keys are always of data type varchar, this avoids any need to refactor should another source provide a business key to the same hub table that is of a different data type, for examples and justification see bit.ly/3iEiHZB.
  • Raw vault reflects raw data, do not conform the column and table names in raw vault or this will lead to rework and refactoring when source schema drifts.

Reference:

The views expressed in this article are that of my own, you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.

--

--

Patrick Cuba
Patrick Cuba

Written by Patrick Cuba

A Data Vault 2.0 Expert, Snowflake Solution Architect

No responses yet