You might be doing #datavault Wrong!

Patrick Cuba
18 min read · Aug 30, 2021


Let’s begin….

Why is this important?

Architecture. Data Vault is deployed using surrogate hash keys to simplify joins between hubs, links and satellites, and the most optimal data type for them is binary. A hash digest has no collation or other alphabet concerns, so the binary / raw data type is all you need. If a Base64 function is being used to convert binary keys to varchar (string), it usually indicates that the architecture lets individual tools (like Talend or Power BI) perform the data vault joins on their own servers; those tools do not support the binary data type. Folks! That is the wrong place to be doing data vault joins!

  1. Data Vault comes with the need to perform many joins; exposing this to the business is not only confusing for the average report writer but also non-performant.
  2. Applying the joins on an external server (not on the data platform that supports binary data types and contains the data vault) means the data is transferred to that server before the joins are applied! This is foolhardy (network carnage)! An anti-pattern! It indicates a lack of understanding of what happens between the tool and the platform. All the predicate work must be done in-database (sometimes called “predicate pushdown”).
  3. The idea behind Data Vault 2.0 is that the information marts are exposed to the business or BI tool. This ensures no ambiguity is present for the business user and the Information Mart is in fact structured for the BI tool of choice. It also minimises the data transfer, if any is needed at all! Besides, hash keys should mean nothing to the business, so why expose these platform-optimising data types to the business?

See more: bit.ly/3dn83n8
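As a rough sketch of what this looks like in practice (Snowflake-style SQL; table and column names such as hub_customer and sat_customer_details are illustrative, not from this article), the binary hash keys never leave the platform: the data vault join is resolved in an information mart view, and the BI tool only ever sees business-friendly columns.

```sql
-- Hypothetical hub / satellite with BINARY(16) surrogate hash keys (MD5)
CREATE TABLE IF NOT EXISTS hub_customer (
    dv_hashkey_customer  BINARY(16)    NOT NULL,  -- hash of the business key
    customer_id          VARCHAR       NOT NULL,  -- immutable business key
    dv_loaddate          TIMESTAMP_NTZ NOT NULL,
    dv_recsource         VARCHAR       NOT NULL
);

CREATE TABLE IF NOT EXISTS sat_customer_details (
    dv_hashkey_customer  BINARY(16)    NOT NULL,
    dv_loaddate          TIMESTAMP_NTZ NOT NULL,
    dv_hashdiff          BINARY(16)    NOT NULL,
    customer_name        VARCHAR,
    customer_segment     VARCHAR,
    dv_recsource         VARCHAR       NOT NULL
);

-- The join happens in-database; the BI tool queries the view, not the vault
CREATE OR REPLACE VIEW im_customer_current AS
SELECT h.customer_id,
       s.customer_name,
       s.customer_segment
FROM   hub_customer h
JOIN   sat_customer_details s
       ON s.dv_hashkey_customer = h.dv_hashkey_customer
QUALIFY ROW_NUMBER() OVER (PARTITION BY s.dv_hashkey_customer
                           ORDER BY s.dv_loaddate DESC) = 1;
```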

Why is this important?

Architecture. Yes, surrogate sequence keys are the fastest to join on, but adopting them imposes a data-pipeline and architecture restriction: dependencies between data vault tables. Data Vault 2.0 quite rightly avoids this by allowing all data vault objects to be loaded autonomously and in parallel. This opens a door: removing loading restrictions ultimately reduces the cost of implementation. Surrogate hash keys are deterministic, and besides, satellite tables do not have end-dates; imposing such a restriction means the loading tool must perform updates on tables, which are expensive operations!

Phase 1: Load hubs (only) = Anti-Pattern
Phase 2: Load hub-satellites + look up hub surrogate sequence keys; load links + look up hub surrogate sequence keys = Anti-Pattern
Phase 3: Load link-satellites + look up link surrogate sequence keys = Anti-Pattern

How should we load everything?

In any order
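A rough sketch of why order no longer matters (Snowflake-style SQL; stg_customer and the key treatment shown are illustrative assumptions): the hub load and the satellite load each derive the same surrogate hash key independently from the staged business key, so neither load needs to look anything up in the other.

```sql
-- Hub load: derive the surrogate hash key directly from the staged business key
-- (intra-batch de-duplication omitted for brevity)
INSERT INTO hub_customer (dv_hashkey_customer, customer_id, dv_loaddate, dv_recsource)
SELECT DISTINCT
       MD5_BINARY(UPPER(TRIM(stg.customer_id))),
       stg.customer_id,
       stg.dv_loaddate,
       stg.dv_recsource
FROM   stg_customer stg
WHERE  NOT EXISTS (
         SELECT 1 FROM hub_customer h
         WHERE  h.dv_hashkey_customer = MD5_BINARY(UPPER(TRIM(stg.customer_id))));

-- Satellite load: the same deterministic expression, no hub lookup required
-- (change detection against the current record is omitted here; see the
-- record-reversion sketch further down)
INSERT INTO sat_customer_details
      (dv_hashkey_customer, dv_loaddate, dv_hashdiff, customer_name, customer_segment, dv_recsource)
SELECT MD5_BINARY(UPPER(TRIM(stg.customer_id))),
       stg.dv_loaddate,
       stg.dv_hashdiff,
       stg.customer_name,
       stg.customer_segment,
       stg.dv_recsource
FROM   stg_customer stg;
```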

Why is this important?

Architecture. A discipline such as Enterprise Architecture exists because it is well understood that the business must be modelled before the data, integration and application architectures are modelled, and through industry-proven patterns. The data must therefore map to what the business considers to be its business objects. Pick up “Business Architecture Management” (book link: https://bit.ly/3xpx7BV) to understand this more clearly. It also means that the relationships between business objects are represented by business process automation engines, i.e. your data sources. After all, you do not purchase a tool like Salesforce for its logo! You purchase Salesforce because it mostly fulfils your business processes; the data output of those processes is captured into Raw Vault, and the gaps are modelled into Business Vault!

see: https://bit.ly/3wZbGql

The activities that occur between hubs are the business processes; these are captured as data vault link tables representing the unit of work.

Why then load anything but the immutable key into the hub table?

Why is this important?

In principle, and related to Business Architecture, and by definition, a hub is:

— a center of activity

— a center around which other things revolve or from which they radiate; a focus of activity, authority, commerce, transportation, etc.

Ask yourself, what is central to the business? Business objects: defined as passive elements that have relevance from a business perspective.

How do we define those business objects? By immutable business keys! Hubs must contain only the immutable business keys that are aligned to the business architecture. Anything else in a hub does not reflect the business architecture! Period!

Why is this important?

Automation. Data Vault is delivered by patterns: they are simple, they are repeatable, they increase the project cadence! Therefore populating a data vault using handcrafted code is an anti-pattern. Why?

— Code is repeated

— Changes are staggered

— Cadence is slow

— Testing is not automated

— Standards are not adhered to

— Exceptions are introduced

— Value is lost

— Complex ETL introduces technical debt because it is not pattern / template based!

Pro-tip: have you tried hashing a null value?

See: https://bit.ly/3vjTXdg
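To see why the pro-tip matters, here is a minimal sketch (Snowflake-style SQL; the '-1' zero-key treatment and column names are illustrative): hashing a null business key yields a null hash key, which is why template-based loaders apply a zero-key substitution before hashing.

```sql
-- Hashing NULL returns NULL: this row would land with no usable surrogate key
SELECT MD5_BINARY(CAST(NULL AS VARCHAR));        -- returns NULL

-- Template-based loaders apply a zero-key treatment before hashing instead
SELECT MD5_BINARY(UPPER(TRIM(COALESCE(customer_id, '-1')))) AS dv_hashkey_customer
FROM   stg_customer;
```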

Why is this important?

Automation. Wikipedia defines idempotence as “the property of certain operations in mathematics and computer science whereby they can be applied multiple times without changing the result beyond the initial application”.

The outcome of business process automation tools (a collection of business rules) must be idempotent, and that outcome is ingested into Raw Vault. The same applies to derived business rules within your data analytics hub: if run multiple times without any of the input changing, they must produce the same output, and that output is recorded in Business Vault.

See: Data Vault Mysteries… Business Vault, bit.ly/3BUt81s

and Apache Spark GraphX and the Seven Bridges of Königsberg, bit.ly/3ezZ6Wh

Why is this important?

Business Architecture. A function of a data warehouse is the integration of source systems. These source systems contain the business keys (immutable values) that represent the business objects; in an ideal world these are shared across source systems, and the data vault provides this passive integration in hub tables. No, the data vault is not a master data management system; that is an entirely different system altogether! In fact, MDM can be used as a data source to the Data Vault!

See: A Rose by any other name… Wait… is it still the same Rose? bit.ly/3xlFK0s

and Passive Integration explained… bit.ly/3pTWCXP

Why is this important?

Architecture. To properly understand point-in-time (PIT) tables you need to understand their purpose; when you do, you’ll understand why PIT tables are not effective without Ghost Records!

Why EQUIJOINS Matter! bit.ly/3dBxOQK

Data Vault Mysteries… Zero Keys & Ghost Records, bit.ly/3vjTXdg

The common theme with the above articles is this: joining all data vault tables should be possible with EQUIJOINS. If you are using SELECT DISTINCT to get the data out, then likely one of two things has happened:

  • the query is wrong or
  • the data vault model is defective

Above all… know your data!
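A rough sketch of what the ghost record buys you (Snowflake-style SQL; the PIT table and satellite names are illustrative assumptions): because every satellite carries a ghost record, the PIT table can always point at some satellite row, so the final query is a set of simple equijoins with no outer joins and no DISTINCT.

```sql
-- PIT table: one row per hub key per snapshot date, pointing at the applicable
-- satellite rows (the ghost-record key/date is used when no real row exists yet)
SELECT p.snapshot_date,
       h.customer_id,
       sd.customer_name,
       sa.street_address
FROM   pit_customer         p
JOIN   hub_customer         h  ON h.dv_hashkey_customer  = p.dv_hashkey_customer
JOIN   sat_customer_details sd ON sd.dv_hashkey_customer = p.dv_hashkey_customer
                              AND sd.dv_loaddate         = p.sat_customer_details_ldts
JOIN   sat_customer_address sa ON sa.dv_hashkey_customer = p.dv_hashkey_customer
                              AND sa.dv_loaddate         = p.sat_customer_address_ldts;
```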

Why is this important?

Automation & Agility. This fits the data vault pattern, while the hub table records the unique list of respective business objects, the link table records the unique list of relationships between those business objects!

Unit of Work: the business objects participating in an activity form a relationship.

See: https://bit.ly/36SA5mV

Why is this important?

Architecture. The job of the satellite table is to record the change details of its parent entity, whether that is a hub (business object) or a link (unit of work, relationship). The satellite content should be unique. If a packet of data is supplied with multiple business key columns, does the descriptive content supplied with those keys describe the relationship, one of the participating business keys, or a mixture of both? Don’t let the querying tool decide; do the data profiling upfront and split the content into the appropriate satellite tables.

Why is this important?

Automation. Trust that the source data has been correctly ingested into the data analytics platform is paramount to the platform’s success. Documentation and robust testing must happen through source-to-target reconciliation, ensuring that there are no points of failure. If failure can occur, then what are the remedial approaches to ensure the customer or business can keep faith in the quality of the data presented to them?

Tests can be broken down into:

  • source-to-target reconciliation
  • source system integration testing
  • data quality testing that in fact improves and standardizes data, and
  • business data quality tests, testing that the business process automation itself is doing the job it needs to. For example, if I have a home loan account then there must be a residential address the loan refers to!
  • referential integrity through primary and foreign key constraints

There is a popular systems engineering phrase: “fail fast, learn faster”.

See: Data Vault Test Automation, bit.ly/3dUHPIS
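A minimal sketch of a source-to-target reconciliation test (Snowflake-style SQL; table names and the key treatment are illustrative assumptions): every business key staged in the batch must exist in the hub once the load completes.

```sql
-- Every staged business key must have landed in the hub (source-to-target check)
SELECT COUNT(*) AS missing_keys      -- expect 0
FROM   stg_customer stg
LEFT JOIN hub_customer h
       ON h.dv_hashkey_customer =
          MD5_BINARY(UPPER(TRIM(COALESCE(stg.customer_id, '-1'))))
WHERE  h.dv_hashkey_customer IS NULL;
```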

Why is this important?

Architecture, Agility and Automation. Data Vault is a different approach to the traditional Inmon and Kimball approaches. The latter two have been in the industry far longer than Data Vault; they have their disciplines, and so too does Data Vault 2.0. Sticking to the standards helps ensure success, and what better way than to get the appropriate training and extend that training with onsite coaching. After all, if done right:

  • Data Vault does not replace Kimball modelling; it recognizes that Kimball’s place is in Information Mart delivery whereas the Data Vault is focussed on Data Warehouse delivery.
  • Bill Inmon has recommended Data Vault 2.0 as the methodology of choice in a modern data warehouse architecture; he did so at WWDVC.

Grab the book!

“The Data Vault Guru”, 2020, amzn.to/3d7LsJV, amzn.to/3nsqTfR, amzn.to/30IxOYF

Interview: Meet Patrick Cuba author of “The Data Vault Guru”, bit.ly/2PYM9hp

Clone the Visio/Lucidchart stencils!

GitHub: thedatamustflow, bit.ly/31bx6UR

Let’s Vault!

Why is this important?

Patterns, patterns, patterns! When you hear that a business key is a number, do not store that business key as a numeric data type! After all, it is not a measure but an immutable value representing the business object! Your Data Vault should be built for change by applying the very basic repeatable patterns to deliver the three key data vault model types: hubs, links and satellites…. https://bit.ly/3yA3HBf

Why is this important?

Architecture, Agility and Automation. Don’t boil the ocean, pick a business case, set the KPIs, build or buy a data vault automation tool and deploy repeatable patterns! Develop a Steel Thread!

These patterns are in architecture, agility of the data model and agile best practices. Set the standards early and see the repeatable patterns deploy your models at scale.

Why is this important?

Standards, standards, standards! Shortcuts are taken because the standards are not understood. Or the path is not clear as to why the standards are there! Keep in mind that standards are not just modelling standards, they are agile best practices and recommended architectures as well! Get the training, join the Data Vault 2.0 community, get practicing data vaulting!

Data Vault is relatively new when compared to Kimball and Inmon data modelling. Data Vault does not replace Kimball; instead it recognizes that Kimball modelling is intended for business-facing analytics, while Data Vault is intended for the data warehouse, which is why it serves auditability, automation and agility, and is platform agnostic.

See: Advantage Data Vault 2.0, bit.ly/3htcwpw

and Building Data Vault modelling capability through the Mob, bit.ly/2NiceTN

Why is this important?

Automation. Again! Nothing but business objects are represented in a hub table; the relationship / unit of work lives in the link table. Each has a purpose, like a graph’s vertices and edges. Stick to the standards, build a data model that scales!

Pro-tip: if a customer has written a query to join only hubs and links and they have used the DISTINCT statement in their SQL query, be worried!

There are only two reasons why users would use the DISTINCT keyword when joining hubs and links…

  1. Defective Data Vault model
  2. Poorly written join query

Pro-Tip 2: customers can confuse the join SQL syntax between a link joining two hubs and a link joining back to the same hub; the query structure is identical!

Why is this important?

Architecture, Agility and Automation. This is a simple way to automate a data vault build, but not an efficient way. Building a deployment this way is how data vault implementations would have been done back in the 90s! Process, automation and technology improvements mean that data vault is now deployed using modern architecture and methods, defining DAGs and using processing power constantly and without delay. The waves approach is not an agile approach and is even career limiting!

Example Anti-Pattern

So much processing time is lost! If one load fails, the entire load is on hold! Think of distributed computing when populating a data vault: load at any time of day and in any order!

Why is this important?

Architecture, Agility and Automation. Conforming columns to a user-friendly set of column names is an anti-pattern in raw and business vault. Only the business key is conformed into a hub table, because it will likely have multiple sources. If you conform satellite column names, you are introducing technical debt: you have to manage it on an ongoing basis, and if the source starts supplying a column with the very name you have already used in the raw satellite table, what then? Snowballed technical debt!

By extension, business vault is not used to conform raw vault column names; it is used for derived content based on raw vault. Column-name conformance must happen at the information mart level, and nowhere else! This ensures raw and business vault remain decoupled, and different business units can have the data conformed to what they want in the information mart layer!
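A minimal sketch of where the renaming belongs (Snowflake-style SQL; the source column names cust_nm and cust_seg_cd are illustrative assumptions): the raw satellite keeps the column names exactly as delivered, and each business unit’s information mart view does its own renaming.

```sql
-- Raw satellite keeps the source column names as delivered;
-- conformance happens only in the business-unit mart view
CREATE OR REPLACE VIEW im_finance_customer AS
SELECT h.customer_id,
       s.cust_nm     AS customer_name,          -- finance's preferred name
       s.cust_seg_cd AS customer_segment_code
FROM   hub_customer h
JOIN   sat_customer_details_raw s
       ON s.dv_hashkey_customer = h.dv_hashkey_customer;
```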

Why is this important?

Agility and Automation. Pulling in keys that are not in scope, only to be used later, requires re-engineering when they are eventually needed… and why not use same-as link tables for that content when it does come into scope? What if it never comes into scope?

Why is this important?

Automation. Record reversion: the ability for a business object to go back to the state it was in before. If you’re writing satellite comparisons to detect whether a staged record is new, keep in mind that the business object may go back to what it was before! Always compare against the latest/current state of the business object!

This is also why Effectivity Satellites are in place for link tables.

Data Vault Mysteries… Effectivity Satellite and Driver Key, bit.ly/36SA5mV
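A minimal sketch of that comparison (Snowflake-style SQL; table and column names are illustrative assumptions): the staged hashdiff is compared against the current (latest) satellite record per key rather than the full history, so a reverted record is correctly detected as a change.

```sql
-- Insert only records whose hashdiff differs from the CURRENT satellite record
INSERT INTO sat_customer_details
      (dv_hashkey_customer, dv_loaddate, dv_hashdiff, customer_name, customer_segment, dv_recsource)
SELECT stg.dv_hashkey_customer,
       stg.dv_loaddate,
       stg.dv_hashdiff,
       stg.customer_name,
       stg.customer_segment,
       stg.dv_recsource
FROM   stg_customer stg
LEFT JOIN (
       -- latest record per hash key
       SELECT dv_hashkey_customer, dv_hashdiff
       FROM   sat_customer_details
       QUALIFY ROW_NUMBER() OVER (PARTITION BY dv_hashkey_customer
                                  ORDER BY dv_loaddate DESC) = 1
     ) cur
       ON cur.dv_hashkey_customer = stg.dv_hashkey_customer
WHERE  cur.dv_hashkey_customer IS NULL          -- brand new business object
   OR  cur.dv_hashdiff <> stg.dv_hashdiff;      -- changed (including reverted) record
```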

Why is this important?

Automation. sysdate is a system date; if it is included in the record hash it makes every record new every time. This is an incorrect implementation of the record hash digest!

Why is this important?

Automation. To clarify: including load dates in the record hash means every record is new by default! Load dates tell the architecture when the record was loaded into the data warehouse, so loading new data implies a new load date. If the data is the same as the current record in the target satellite then you should not insert that record; nothing has changed! Never include your own data vault metadata columns in the record hash, and it is unnecessary to include the business key in the record hash as well: the load code already compares by surrogate hash key, so adding the business key to the record hash has close to zero effect on its uniqueness!
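A minimal sketch of a record hash digest (hashdiff) that follows these rules (Snowflake-style SQL; the descriptive columns shown are illustrative assumptions): only the descriptive payload goes into the digest, while the load date, record source and business key stay out of it.

```sql
-- Hashdiff over descriptive columns only: no load date, no record source,
-- no business key; nulls made explicit and a delimiter to avoid collisions
SELECT MD5_BINARY(
           UPPER(CONCAT(
               COALESCE(TRIM(customer_name),      ''), '||',
               COALESCE(TRIM(customer_segment),   ''), '||',
               COALESCE(TO_VARCHAR(credit_limit), '')
           ))
       ) AS dv_hashdiff
FROM   stg_customer;
```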

Why is this important?

Architecture. Binary data types are ideal for storing hash digests. What does a hash function produce? Output that resembles a hexadecimal string, meaning you are not interested in collation, special characters, or dealing with characters from diverse languages. Hash digests stored as raw binary use half the byte length needed to store the same characters as a string, so they are efficient join keys for performance!
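A quick illustration in Snowflake SQL (the business key value is illustrative): the same MD5 digest is 16 bytes as binary but 32 characters as a hex string.

```sql
-- MD5 digest: 16 bytes as BINARY versus a 32-character hex VARCHAR
SELECT MD5_BINARY('1000427')  AS key_as_binary,   -- 16 bytes
       MD5('1000427')         AS key_as_hex,      -- hex string
       LENGTH(MD5('1000427')) AS hex_length;      -- 32
```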

Why is this important?

Audit, audit, audit! You must be able to recreate the source tables at any point in time (within data retention policies, of course!). Therefore all columns should be captured. Columns may be split by critical data elements (CDEs), by what the content is describing, by rates of change, and by personally identifiable information. Keeping everything as it arrived (with hard rules applied) ensures the data warehouse can be trusted.

Why is this important?

Automation. Use the correct function for the job. For example, do not create a record hash digest by using a semi-structured-data function to concatenate structured columns instead of the structured-data function. In other words, do not use array_construct for concatenating structured columns before hashing. Why? The array function will be slower than simply using concat, and on a cloud-based utilization cost model a slower operation inevitably costs you more to run!
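A minimal sketch of the two approaches (Snowflake SQL; column names are illustrative): both produce a digest, but the second routes structured columns through a semi-structured array first, which is the slower path the article warns against.

```sql
-- Preferred: structured-data concatenation straight into the hash function
SELECT MD5_BINARY(CONCAT(COALESCE(TRIM(customer_name),    ''), '||',
                         COALESCE(TRIM(customer_segment), ''))) AS dv_hashdiff
FROM   stg_customer;

-- Anti-pattern warned against: building a semi-structured ARRAY just to hash
-- structured columns (extra construction work on every row)
SELECT MD5_BINARY(ARRAY_TO_STRING(
           ARRAY_CONSTRUCT(customer_name, customer_segment), '||')) AS dv_hashdiff
FROM   stg_customer;
```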

Why is this important?

Architecture. An incorrect interpretation of Raw Vault versus Business Vault is to treat RV like an Operational Data Store (ODS) and BV as the business-facing portion of the data vault. RV must be modelled to the business architecture of the business; BV in turn completes the business process gaps in RV and is additionally used for centralizing derived business rules that need the same automation and auditability you get with RV. Besides, we know that neither RV nor BV is business-user facing anyway!

See: Learning Data Vault is Like Learning How to Make Beer! bit.ly/2ZYGpJP

Now consider what the Data Vault 2.0 loading pattern should look like

Constant, file/event based inserts into Data Vault 2.0 model with XTS

And if you include timeline correction… Data Vault has a new Hero, bit.ly/3y4mUdV

and Data Modelling Meetup Munich (DM3), bit.ly/3rA7rQg

Why is this important?

Architecture, Agility and Automation. A cheap way to build a data vault is to not recognize passive integration. This leads to a source-system data vault, which simply splits source primary keys into hubs that add no value and multiplies the number of tables you need to join when getting the data out of the data vault!

Folks! A Data Vault will become a legacy data warehouse unless it is aligned to what the business expects to see and represents the business architecture decisions. Enterprise architecture 101!

Why is this important?

Architecture and Automation. Big Data platforms generally don’t allow table updates; not even Snowflake updates its micro-partitions! Instead, the micro-partition is deleted, a new micro-partition is inserted in its place, and the deleted micro-partition becomes part of Time Travel. Updates are expensive and impractical, so modern relational database platforms use SQL window functions to infer the END-DATE columns of satellite tables, and a Record Tracking Satellite is used in place of the classic last_seen_date that existed on hub and link tables. Insert-only satellite tables also make XTS (timeline correction) possible.
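A minimal sketch of inferring the end date instead of updating it (Snowflake-style SQL; view and column names are illustrative assumptions): the satellite stays insert-only and a view derives each record’s end date from the load date of the record that follows it.

```sql
-- End-dating is inferred at query time; the satellite itself is never updated
CREATE OR REPLACE VIEW sat_customer_details_with_enddate AS
SELECT dv_hashkey_customer,
       dv_loaddate,
       dv_hashdiff,
       customer_name,
       customer_segment,
       LEAD(dv_loaddate, 1, '9999-12-31'::TIMESTAMP_NTZ)
           OVER (PARTITION BY dv_hashkey_customer
                 ORDER BY dv_loaddate) AS dv_enddate   -- high date for the current record
FROM   sat_customer_details;
```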

Why is this important?

Audit. How can you recreate the source if you have destroyed the Unit of Work from the business process represented by the raw source? Attempting to recreate the source would produce ‘false positives’, i.e. relationships that never existed. In other words, if more than two business objects participated in a relationship, do not by default break that relationship up into two-hub links!

Why is this important?

Automation and Architecture. Introducing ‘weak’ hub tables is not only against the very definition of what a hub is, it introduces yet more tables to join in the data vault to get the data out! This is seen in anti-patterns that do not recognize link satellites either!

Why is this important?

Automation, Architecture and Audit. In principle, business keys are not measures, so they need not be stored with numeric data types. You cannot add to or subtract from a business key without it becoming a different business key. To support passive integration and to avoid refactoring, store business keys as strings. See more examples here: bit.ly/3tPI66B

Why is this important?

Automation, Architecture and Audit. Smashing business keys together into a hub table forces every query of that hub to un-concatenate those business keys. Avoid this technical debt by not introducing it yourself!

Why is this important?

Architecture and Audit. The promise of Data Vault is not having to refactor the data model when a change is needed. True, this pertains to the overall Data Vault model itself (the model is non-destructive to change), but it also applies to the satellite tables themselves. In other words, neither schema drift nor schema evolution should cause a reload of a satellite table; in a large data warehouse this is impractical and expensive, like applying updates to the tables themselves! Instead, if new columns are detected, or columns are deprecated, stop if the change is unexpected and decide what to do with the new content:

  • is it PII?
  • is it a Business Key?

For new column(s), execute an ALTER TABLE statement to add the newly profiled column(s) and ensure that the HashDiff includes the new columns. Including the new columns will not create duplicates; think about it, we did not know of these facts yesterday! Refactoring the whole satellite table for the sake of the new columns in fact breaks audit! The very act of adding a new column represents an auditable fact! Add the new columns to the right of the satellite to maintain column order.

For deprecated columns, do not change anything: the column is simply no longer populated, and the hashdiff calculation is not updated.

You might have satellite tables so large that it would be costly to “refactor” them.
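A minimal sketch of absorbing schema drift without a reload (Snowflake-style SQL; the loyalty_tier column and table names are illustrative assumptions): the new column is appended to the right of the satellite and the hashdiff template is extended from that load onwards, while existing rows are left untouched.

```sql
-- New column detected in the source feed: append it, never reload the satellite
ALTER TABLE sat_customer_details ADD COLUMN loyalty_tier VARCHAR;

-- Extend the hashdiff template from this load onwards; historical rows keep
-- their original hashdiff, which is itself an auditable fact
SELECT MD5_BINARY(
           UPPER(CONCAT(
               COALESCE(TRIM(customer_name),    ''), '||',
               COALESCE(TRIM(customer_segment), ''), '||',
               COALESCE(TRIM(loyalty_tier),     '')   -- new column added at the end
           ))
       ) AS dv_hashdiff
FROM   stg_customer;
```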

See: Data Vault Elevator Pitch, bit.ly/2RyoRjv

and Data Vault or: how I learnt to stop worrying and love Data Governance, bit.ly/3vhEDxJ

and Bring out your Dead… Data, bit.ly/3o8To3m

and Business Key Treatments, bit.ly/3bQjAKN

and Applying Top-Down, Big-Picture Models to a Data Vault, bit.ly/3vRXtf6

Last but not least

Why is this important?

Architecture. Data Vault 2.0 has the techniques (see: bit.ly/3xlFK0s) and patterns to integrate the various data sources (automated business rule and business process engines) in the raw data vault area and align them to the business objects of the business architecture, so why would you need more layers? More layers = more to manage and a larger latency between business event and analytic value. Business vault completes the raw vault business processes as the business sees them. Often 3rd-party tools serve more than just your business (they have their own customers), so how they map industry business processes might not fully align with how you see the business processes of your industry; that is where business vault comes in. It fills that gap. Of course you can use business vault to derive more content with the objective of extending raw vault and other business vault artefacts (see: bit.ly/3BUt81s), providing the same auditability and agility expected of raw vault. Ideally, if the source systems solved all the tech debt and filled the gaps themselves, there would hardly be a need for a business vault. Business vault is sparsely populated.

RV + BV = DV.

Aim for this.

This post is an extension of Seven Deadly Sins of Fake Vault, bit.ly/3n9zIxB

The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.


Written by Patrick Cuba

A Data Vault 2.0 Expert, Snowflake Solution Architect
