Data Vault Loader Traps
You might have already had to deal with getting satellite table loads right. The data vault test suite is designed to keep your data vault loading patterns honest. Building on that idea, this post provides a case study of what can go wrong when you don’t follow these loading rules.
1: Ensure you are comparing against current records for satellite loads.
Day 1 staged content:
Day 1 satellite after staged content has been loaded:
Day 2 staged content:
Day 2 satellite after staged content has been loaded: for Account ID 1, the change is loaded.
{ account_id: 1, attribute: ‘Change’, loaddate: 2-Jan }
differs from
{ account_id: 1, attribute: ‘Original’, loaddate: 1-Jan }
Day 3 staged content: the previous state of Account ID: 1 has returned, ‘Original’.
If we compare against the whole satellite history for the account ID, we get the following:
{ account_id: 1, attribute: ‘Original’, loaddate: 3-Jan }
differs from
{ account_id: 1, attribute: ‘Change’, loaddate: 2-Jan }
but matches
{ account_id: 1, attribute: ‘Original’, loaddate: 1-Jan }
This is obviously incorrect: Account ID: 1 has reverted to ‘Original’, so the 3-Jan record should be loaded to the satellite as a new change.
And no, including the business key in the calculation of the HashDiff does not fix this! Instead, the satellite comparison must be between the staged content and the current record for each account ID. Note that a link storing a relationship will also track relationship changes, and a relationship can revert to its original state; we deal with that pattern using driving keys and effectivity satellites when the source does not supply a business date tracking that very event.
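To make the comparison concrete, here is a minimal sketch of a satellite loader that compares staged rows against the current record only. The table and column names (staged_account, sat_account, attribute) are illustrative assumptions, not taken from the test suite:

insert into sat_account (dv_hashkey_hub_account, dv_loaddate, dv_recordsource, dv_hashdiff, attribute)
select  stg.dv_hashkey_hub_account
      , stg.dv_loaddate
      , stg.dv_recordsource
      , stg.dv_hashdiff
      , stg.attribute
from    staged_account stg
left join (
        -- current record per hash key: keep only the latest loaddate
        select  dv_hashkey_hub_account
              , dv_hashdiff
        from    sat_account
        qualify row_number() over (partition by dv_hashkey_hub_account
                                   order by dv_loaddate desc) = 1
        ) cur
  on    cur.dv_hashkey_hub_account = stg.dv_hashkey_hub_account
where   cur.dv_hashkey_hub_account is null      -- brand-new business key
   or   cur.dv_hashdiff <> stg.dv_hashdiff;     -- differs from the CURRENT record only

Under this comparison, the 3-Jan ‘Original’ row differs from the current 2-Jan ‘Change’ row and is correctly loaded as a new change.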
2: Using a multi-table insert to load to a common hub table will load duplicate records.
A multi-table insert does as advertised: it loads target tables in parallel. When modelling a same-as link or a hierarchy link, the target tables will be:
- A link table
- A hub table for the business object’s first business key and the same hub table for the business object’s second business key
The loader for a hub table must ensure that only new business keys are loaded to the target hub table, so that after loading a unique list of business keys the hub retains the same integrity it had before. In a multi-table insert there is a thread for each insert, and there is no way to ensure that the threads consolidate their checks to avoid loading duplicates. Each thread independently executes ‘insert records where not exists’ code (an anti-semi join). Let’s demonstrate what that looks like, starting with a sketch of the statement and then the day-by-day results.
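The sketch below is illustrative only, with assumed staging and hub names (staged_account_samelink, hub_account), and uses Snowflake’s conditional multi-table insert. Both branches target the same hub, and each branch checks against the hub as it looked before the statement ran:

insert all
  -- branch 1: load the first business key if it is not already in the hub
  when hub_key_1 is null then
    into hub_account (dv_hashkey_hub_account, account_id, dv_loaddate, dv_recordsource)
    values (dv_hashkey_hub_account, account_id, dv_loaddate, dv_recordsource)
  -- branch 2: load the second business key if it is not already in the hub
  when hub_key_2 is null then
    into hub_account (dv_hashkey_hub_account, account_id, dv_loaddate, dv_recordsource)
    values (dv_hashkey_hub_other_account, other_account_id, dv_loaddate, dv_recordsource)
select distinct
        stg.dv_hashkey_hub_account
      , stg.dv_hashkey_hub_other_account
      , stg.account_id
      , stg.other_account_id
      , stg.dv_loaddate
      , stg.dv_recordsource
      , h1.dv_hashkey_hub_account as hub_key_1   -- anti-join for the first key
      , h2.dv_hashkey_hub_account as hub_key_2   -- anti-join for the second key
from    staged_account_samelink stg
left join hub_account h1 on h1.dv_hashkey_hub_account = stg.dv_hashkey_hub_account
left join hub_account h2 on h2.dv_hashkey_hub_account = stg.dv_hashkey_hub_other_account;

Neither branch can see what the other inserts within the same statement, so when account_id and other_account_id carry the same new value, both branches pass their own check and the key lands in the hub twice.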
Day 1 staged content:
Day 1 hub after staged content has been loaded:
Day 2 staged content: account ID and other account ID are the same.
Day 2 hub after staged content has been loaded:
We have duplicates because each loader acts independently! To illustrate this further, let’s now try to load a previously seen business key.
Day 3 staged content: account ID and other account ID are the same but both exist in the hub table.
Day 3 hub after staged content has loaded: no change.
On day 4, to further illustrate this concept, we will have staged duplicates in the account_id column.
So, now what happens? Recognize that the code checking against the target hub table must execute a SELECT DISTINCT over the staged business keys. However, since the combination of columns in staging is already unique, the distinct does not collapse the repeated key, and the code will load BOTH account ID 4s.
Day 4 hub after staged content has loaded: more duplicates.
Therefore, multi-table inserts should not be used for loading data vault hubs, links, and satellites. They do have their place in loading point-in-time (PIT) tables though.
3: Using semi-structured functions on structured columns for concatenation.
It might seem trivial, but if data load volumes are large, picking the wrong function to perform concatenation before hashing becomes expensive!
Using semi-structured functions to perform column concatenation for creating a record hash will be slower than using the standard concatenation function available to structured data. The semi-structured approach might be pleasing to the eye, but it is slower in execution. Observe:
You can see a difference of nearly 60 seconds. Do you find that significant for your workload? Well, with per-second billing and a minimum of 60 seconds billed, you might. Use the right function for the right data type.
Structured concatenation function:
, sha1_binary(UPPER(concat(dv_tenantid, '||', dv_bkeycolcode_hub_account, '||', coalesce(to_char(trim(account_id)), '-1')))) as dv_hashkey_hub_account
Semi-structured concatenation function:
, sha1_binary(UPPER(ARRAY_TO_STRING(ARRAY_CONSTRUCT(dv_tenantid, dv_bkeycolcode_hub_account, coalesce(to_char(TRIM(account_id)), '-1')), '||'))) AS dv_hashkey_hub_account
4: Performing hashing operations in the loaders.
The automated process for getting data into a data vault is a combination of autonomous functions; each function has a single purpose driven by configuration. Once a file/table is landed, it is staged. The act of staging adds DV-Tags:
· DV_LOADDATE
· DV_APPLIEDDATE
· DV_RECORDSOURCE
· DV_HASHKEY_<HUB-TABLENAME | LINK-TABLENAME>
· DV_HASHDIFF
The staged content (which can be delivered as a view) is then used as a source for the loaders. There are three loader types:
· Hub loaders — ensure that only new business keys are loaded, and loaded uniquely
· Link loaders — ensure that only new relationships are loaded, and loaded uniquely
· Satellite loaders — ensure that only new changes are loaded, and loaded uniquely
What is the big deal?
Hash keys loaded to a hub-satellite have a parent hub table. If you are calculating the hash in the loaders for both the hub-satellite table AND the hub table, you’re calculating the same hash value twice! What happens when you load a link table that has two or more parent hub tables? The redundant work multiplies again. Calculate the hashes once, in staging, and let every loader reuse them.
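As an illustration, here is a minimal sketch of a staging view that adds the DV-Tags, including the hash key and hashdiff, exactly once; the object and column names (landed_account, stg_account, account_name, account_status) are assumptions for the example:

create or replace view stg_account as
select  src.account_id
      , src.account_name
      , src.account_status
      , current_timestamp()               as dv_loaddate
      , src.extract_date                  as dv_applieddate
      , 'crm.account'                     as dv_recordsource
        -- hash key computed once here; hub, link and satellite loaders all reuse it
      , sha1_binary(upper(coalesce(to_char(trim(src.account_id)), '-1')))
                                          as dv_hashkey_hub_account
        -- hashdiff over the descriptive attributes only
      , sha1_binary(upper(concat(coalesce(trim(src.account_name), ''), '||',
                                 coalesce(trim(src.account_status), ''))))
                                          as dv_hashdiff
from    landed_account src;

With the hashes carried in the staged content, the hub, link, and satellite loaders simply select the pre-computed columns instead of recalculating the same values.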
Endnotes:
- Code for these tests
- Test was executed by switching off result cache and flushing the XSMALL virtual warehouse after each execution.
- Make sure you test performance before committing to an implementation.