The Data Vault Must Flow

Patrick Cuba

19 min read · Dec 5, 2022

Pragmatic guide to Building a Data Vault

Blog Catalogue

Advantage Data Vault 2.0

Highlighting what DV2.0 gives more than any other interpretation of DV

Keywords: Data Vault 2.0

Learning Data Vault is Like Learning How to Make Beer!

All it takes is three or four “things” to start any learning journey, in keeping with your cognitive load. At the time of writing the article I was learning how to make beer!

Keywords: Beer, Hubs, Links, Satellites, Modelling, Business Process

Data Vault or: how I learnt to stop worrying and love Data Governance

Following an Atomic Space Age theme, a glimpse into Data Vault with DataOps.

Keywords: DataOps, GDPR, CCPA, Data Governance, Satellite Splitting, Auditability

Time to upgrade your thinking on Data Vault

Data Vault is more than just a data modelling methodology; it is designed to change and flex as the business evolves and matures around core business capabilities.

Keywords: Hoshin Kanri, Business Architecture, Business Agility, Data Platform, Information Mapping, OKR, Balanced Scorecard, KPI, Measure, Metrics

Data Vault Recipes

A holistic look at what it means to adopt Data Vault 2.0 methodology, inspired by baking of course!

Keywords: Enterprise Architecture, Solution Architecture, Business Architecture, Business Process, Capability Map, Raw Vault, Business Objects, Business View, Technical Debt, Business Vault, Information Mart, Data Delivery, Data Quality

a DATA VAULT ANALOGY

Data Vault in the industry has two standards: one follows the Hans Hultgren method (Ensemble Modelling) and the other follows Dan Linstedt (Data Vault 2.0). The terms of the two are sometimes confused, and to the untrained eye it is difficult to tell who is following which method, which of course adds to the confusion of learning about Data Vault. Ensemble tends to lean towards replacing Kimball; Data Vault 2.0 does not, and instead keeps the patterns simple and repeatable.

The art was inspired by the proverb “in the land of the blind the one-eyed man is king”.

Keywords: Ensemble Modelling, Data Vault 2.0, Hans Hultgren, Dan Linstedt

Data Vault Elevator Pitch

One’s point of view is usually biased toward their own interests, and it is the same when you pitch a Data Vault to different professions within a business.

Keywords: Agility, Agnostic, Automation, Auditability

Data Vault Dream Team

Ideas on how to get a Data Vault project started and build momentum

Keywords: Team, Community of Practice, DVBoK, Consulting

Building Data Vault modelling capability through the Mob

How to go about modelling your Data Vault through collaboration and having the right people in the right place. Inspired by work done at a major customer and extreme programming principles

Keywords: Modelling

Bring out your Dead… Data

The first DV article to consider what to do with defunct data! Inspired by Pet Sematary, Poltergeist and The Sixth Sense!

Keywords: deprecated data, GDPR, freshness, observability

Data Vault Industry Verticals

An outcome of a Data Vault model review, this article explains some of the pitfalls of attempting to conform a data vault to an industry model. Art inspired by Sim City.

Keywords: Dependent-Child Key, Standards, Business Key, Industry Model

Data Vault Loader Traps

Articulating some of the pitfalls of not doing a Data Vault properly!

Keywords: Satellites, Multi-table insert, Semi-structure function, Hashing

Decided to build your own Data Vault automation tool?

Based on experience building a home-grown Data Vault automation tool, this post covers most of the patterns you will encounter in a Data Vault 2.0 model, with examples!

Keywords: Mapping, modelling

Data Vault 2.0 on Snowflake…To hash or not to hash… that is the question

To hash or not to hash on Snowflake…? An article justifying why you should and how Snowflake’s MPP interpretation can still be used to deliver a Data Vault. Any guess as to who that is on the title page?

Keywords: PIT, Point in Time, OLAP, Snowflake, MPP, Massively Parallel Platform, hashing, zone maps

Why EQUIJOINS Matter!

Evidence on how PIT tables (when designed right) take advantage of inherent OLAP capabilities for querying facts and dimensions. Inspired by 12 Angry Men and Juror #8.

Keywords: Raw Vault, Query Assistance, Hashing, PIT, Point in Time, Sequence Number PIT, equijoin, right-deep-join-tree, sequence key, OLAP, Information Mart, Business Vault, Clustering

Data Vault Test Automation

Reconciliation between staged and target and between target tables is a must. This test framework is designed to keep the data vault implementation honest, and it is insert-only as well.

Keywords: Reconciliation, Test Automation, Hubs, links and satellites, staging, business keys, unit of work

Data Vault Dashboard Monitoring

How to set up and track Data Vault dashboard reporting based out of Snowsight and the same INSERT-ONLY paradigm of DV2.0

Keywords: Snowsight, Test Framework, Automated Testing, Auditability, Dashboards

Data Vault PIT Flow Manifold.

A little bit of Snowflake engineering in Conditional Multi-Table INSERTS and Point in Time (PIT) tables. Images inspired by the dozens of online manuals I read when trying to fix my lawnmower or motorcycle!

Keywords: Point in time, PIT, as_of, Conditional Multi-Table Insert, Snowflake

The Lost Art of Building Bridges

Where should you use Bridge Tables, and what problems do they solve?

Keywords: Bridge table, query assistance

Data Vault’s XTS pattern on Snowflake

Solving Time Crime in Data Vault, using Snowflake. How does the timeline correction pattern perform on Snowflake?

Keywords: Snowflake, Extended Record Tracking Satellite, XTS

Data Vault Agility on Snowflake

Partly inspired by Tron! Some practical consideration for deploying a Data Vault on Snowflake and taking advantage of some little-known nuances of the platform.

Keywords: Testing, Data Quality, Business Architecture, Source-System Data Vault, Passive Integration, Multi-tenancy

Kappa Vault

Ease of use of Snowflake for Data Vault streaming pipelines, and how the loading patterns have changed.

Keywords: Data Pipelines, Streaming, Streams & Tasks

You might be doing #datavault Wrong!

A long list of considerations when building your Data Vault, what to do, and not to do! Inspired by… people doing it wrong!

Keywords: business architecture, enterprise architecture, business objects, raw vault, business processes, automation, passive integration

Seven Deadly Sins of Fake Vault

Born out of observing Data Vault implementations in the wild that do not follow the standards. DV2.0 practitioners have seen various unguided interpretations; these are the main sins we see in the industry.

Theme and images inspired by Seven and Milton

Keywords: Business Vault, Unit of Work, Auditability, Dependent-Child Keys, Weak Hubs, Source System Data Vault, Business Key Collision Code, Staggered Load, Sequence Key, Link Satellite, Satellite Splitting, Refactor, Schema Evolution

Data Vault Mysteries… Business Vault

Just what is a Business Vault, and why is its creation a mystery? It really shouldn’t be if you follow the standards!

Theme based on 1950s culture and storytelling

Keywords: Business Vault, Raw Vault, Business Key, Unit of Work, Business Process, Change Record, Derived Content, PITs and Bridges, Point in Time, PIT, Bridge, Information Mart

Is it Business Vault or is it not?

An often-foggy area of Data Vault is how to define a Business Vault; here is some guidance

Keywords: Business Vault, Raw Vault, Business Process, Business Rules, PITs and Bridges, Information Marts, Auditability

Apache Spark GraphX and the Seven Bridges of Königsberg

An example of building a Business Vault Link but using Big Data (Spark + Parquet) to get there. Theme inspired by the story of Euler and the origins of Graph theory.

Keywords: Business Vault, Spark, Big Data, Graph, Link, Graphx, Pregel, Euler

Business and Source-System Unit of Work

Why complexity should be hidden from the business user and solved in the data vault

Keywords: Business Vault Link

Data Vault Mysteries… Effectivity Satellite and Driver Key

Just what does the Effectivity Satellite solve? And why do you need to define a driving key for it?

Effectivity Satellites are designed to deal with a gap in Data Vault modelling that there is no other way to solve.

Keywords: Effectivity Satellite, Driving Key, relationship, link, unit of work

The Link between Effectivity Satellites and Driver Keys.

Revisiting the explanation of Driver Keys and Effectivity Satellites

Keywords: Effectivity Satellite, Driving Key, relationship, link, unit of work

The Different Grains of Multi-Active Records.

Drawing the line between dependent child key satellites and multi-active satellites.

Keywords: Dependent-child key, multi-active

Simple PIT table Constructs

Setting the record straight on why PIT tables are useful

Keywords: Point in time, PIT, Logarithmic, Managed Windows, Tumbling Windows

Data Vault Mysteries… Zero Keys & Ghost Records

DV2.0 has a few esoteric concepts; this article describes the difference between default keys, ghost records and zero keys

Keywords: ghost records, point in time, pit, zero keys, nulls, default keys

Say NO to Refactoring Data Models!

Every data platform faces the same challenge: making changes without regression testing and escalating costs. Sticking to the Data Vault 2.0 patterns rises to that challenge by promoting data agility.

Keywords: Refactor, Business Key, Kimball, Inmon, Patterns, Schema Evolution, Schema Drift, Extended Record Tracking Satellite, XTS, Time Crime

Data Vault Naming Standards

Theory behind what naming standards should look like

Keywords: Naming standards

A Rose by any other name… Wait.. is it still the same Rose?

Initially released on Valentine’s Day, this article delves into Passive Integration and Business Key Collision Codes by way of an example.

Keywords: Business Key Collision Code, BKCC, Passive

The Data Vault Guru: a pragmatic guide on building a data vault

A summary of what is in the book.

Keywords: Book

Data Vault has a new Hero

Originally titled “Solving Time Crime in Data Vault 2.0”, this article delves into how to deal with batch data that arrives out of sequence. It covers an authorised extension of the DV2.0 standards called the eXtended Record Tracking Satellite (XTS), a data-driven approach that dynamically enables the DV model to self-heal.

Keywords: XTS, Extended Record Tracking Satellite, Timeline Correction, Self-Heal, Applied Date

Data Vault solves Time-Crime

Keywords: XTS, Extended Record Tracking Satellite, Timeline Correction, Self-Heal, Applied Date

How I can get away without paying the Pied Piper… in Data Vault 2.0

What you learn in DV2.0 training is that a Data Vault model is not easy to query. To make it easier and to support your Information Models you build Point-in-Time and/or Bridge tables, but the expense of querying the data vault is then pushed to the creation of the PIT tables themselves. But what if you don’t have to?

Keywords: PIT, Point in Time, Ghost Record

Business Key Treatments

What do you do when a source provides business keys that don’t quite follow the standard business key assignment best practices? An approach to ensure passive integration without sacrificing automation.

Keywords: Business Key, Business Key Treatment, Passive Integration, Hub, Hashing

What does dbt give you?

A gloss over dbt and its power of transformation

Keywords: dbt, integration

Passive integration explained…

Another take on explaining passive integration

bit.ly/3pTWCXP

Keywords: Passive Integration

Ep1: Immutable Store, Virtual End-Dates

Why Snowflake is well suited for Data Vault

Keywords: Snowflake, virtual end-dates, micro-partitions, devops

Ep2: Snowsight dashboards for Data Vault

Using Snowsight for Data Observability over a Data Vault

Keywords: Snowsight, Test Automation

Ep3: Point-in-Time constructs & Join Trees

How to build PIT tables to solve getting data out of a Data Vault

Keywords: Point-in-Time, PIT, Join Tree, Right Deep Join Tree

Ep4: Querying really BIG satellite tables

A look at how to use Dynamic Pruning to solve querying of really big satellite tables

Keywords: BIG Satellite, Dynamic Pruning, Static Pruning

Ep5: Streams & Tasks on Views

Animated version of Kappa Vault

Keywords: Streams on Views, Streams & Tasks, Set & Forget

Ep6: Conditional Multi-Table INSERT, and where to use it

Another look at building PIT Flow Manifold

Keywords: PIT, Manifold, Query Assistance

Ep7: Row Access Policies + Multi-Tenancy

How to combine multi-tenancy in Data Vault with Row Access Policies

Keywords: RAP, Row Access Policy, Multi-tenancy

Ep8: Hub locking on Snowflake

An interactive look at hub table locking in Snowflake and transaction isolation levels

Keywords: Transaction isolation, READ committed, hub locking

Ep10: Virtual Warehouses & Charge Back

An approach to deploying your data architecture to suit Data Vault and a Charge-back model

Keywords: Chargeback, virtual warehouse, resource monitor, data architecture

Ep9: Out-of-sequence data

How do you handle data that arrives out of sequence dynamically and without needing to replay your loads?

Keywords: Out of sequence, extended record tracking satellite

[BONUS]: Handling Semi-Structured Data

An easy framework for handling semi-structured data in data vault on Snowflake

Keywords: Semi-structured, streaming, business vault

[BONUS] Episode 12: Feature Engineering & Business Vault

How Data Vault can support Data Science

Keywords: Feature Store, Machine Learning, Business Vault

Episode 13: Join Key Data Types

Performance focus, join key types as hash keys, natural keys and temporal sequence ids

Keywords: Hash key, Natural Key, Sequence Key

Episode 14: Snapshot PIT Tables

Performance focus, join key types as hash keys, natural keys and temporal sequence ids in snapshot PIT tables

Keywords: PIT Snapshot, Right Deep Join Tree

Episode 15: Incremental PIT Tables

Performance focus, join key types as hash keys, natural keys and temporal sequence ids in incremental PIT tables, and merge PIT tables

Keywords: Incremental PIT, Merge PIT

Episode 16: Information Marts

Performance focus, join key types as hash keys, natural keys and temporal sequence ids in information marts

Keywords: Information Mart, Ghost Skew, Pearsons

Episode 17: Expanding to Dimensional Models

Detour — dimensional modelling with data vault by simulating facts and dimensions using Data Vault native tables.

Keywords: Dimensional Modelling, Facts, Bridge Table

Episode 18: Dynamic Tables

Where do Dynamic Tables fit into the Data Vault architecture?

Keywords: Dynamic Tables, Streaming, CQRS, Command & Query Responsibility Segregation

Episode 19: Hybrid Satellite Tables = Operational Data Vault

What do you get when you mix Data Vault and HTAP? Hybrid Satellites!

Keywords: Hybrid Tables

Episode 20: Archival, Deletion, Retention, Policies, Storage Tiering

Storage Tiering and Data Vault

Keywords: Storage Tiering, Archival, Retention

Episode 21: Data Vault on Snowflake and Apache Iceberg

What it means to use Apache Iceberg for your Data Vault

Keywords: Apache Iceberg, Data Lakehouse, Data Lake

Episode 22: a Classification & Tagging Framework

How does data classification impact your data architecture and data model

Keywords: Classification, Tagging

Episode 23: Amping up on Data Vault knowledge

Snowflake LLMs read my book!

Keywords: Large Language Models, Streamlit, Notebooks, Generative AI, GenAI, Cortex

Episode 24: How to Twine on Snowflake

Using SQL merge techniques where appropriate for streaming data

Keywords: Streaming, Twine, ASOF

Episode 25: …and the other 80% of the world’s data

How to process unstructured data using DocumentAI

Keywords: DocumentAI

Snowflake, the Data Cloud

What’s in the box? Simplifying the concepts to fit your cognitive load makes learning Snowflake so much easier, and that’s the aim of this article.

Keywords: Snowflake architecture

Snowflake, the Cloud Data Platform (2023)

What’s in the box now in 2023?

Keywords: Snowflake architecture

Snowflake, the Data Cloud (2024)

What’s in the box now in 2024?

Keywords: Snowflake architecture

Data Vault and Domain Driven Design

Delve into DDD and DV, a cornerstone of Data Mesh.

Keywords: Data Mesh, Domain Driven Design, DDD

Data Vault as a Product

Expanding on DDD with Data Products through a DV, another Data Mesh concept.

Keywords: Data Mesh, Domain Driven Design, DDD

Data Vault and Domain Oriented Architecture

Architecture patterns for Data Mesh and Data Vault.

Keywords: Data Mesh, Domain Driven Design, DDD

Data Vault semantics & ontologies

Final blog in this series, linking Data Vault to the Semantic Layer and Domain Ontologies.

Keywords: Data Vault, Domain Ontologies

Data Vault and Analytics Maturity

Bonus blog discussing Data Vault and other methods for framing and modelling data.

Keywords: Kimball, Inmon, Framework, DCAM, DAMA, DMBoK

Data Vault is not a Monolith

Describing why an enterprise data vault model is not a monolith

Keywords: Kimball, Inmon, Data Mesh

The OBT Fallacy & Popcorn Analytics

Why the rigour of data modelling is still required

Keywords: Data Contracts, Data Vault

Business Vault & Activity Schema

A repeatable pattern for Event data as Activities

Keywords: Streaming, Event Data, Activity Schema

Data Vault + Supernova on Snowflake

How to denormalise Data Vault in a repeatable pattern

Keywords: Denormalizing, Supernova

Rules for an almost unbreakable Data Vault

How to not build a legacy data vault

Keywords: DevOps, Testing, Product Owner

More Rules for an (almost) unbreakable Data Vault

Four more rules for your data platform built to change and therefore built to last

Keywords: DevOps, Testing, Product Owner

the Modern Data Vault Stack

A repeatable, robust, reliable data architecture built with intent

Keywords: Modern Data Stack

What is the Shape of your Data?

Expanding the test framework with running stats cheaply on Snowflake

Keywords: Test framework, Snowsight Dashboards

Middle out your Data Strategy on Data Vault and Apache Iceberg

Data Lakehouse, yea a Data Vault lives there.

Keywords: Apache Iceberg, Apache Polaris, Lakehouse, Data Cloud

Does Data Vault have a Rorschach Problem?

Why is there consternation, and why are there failed implementations, when it comes to Data Vault?

Keywords: Ensemble, Data Vault 2.0

Snowflake Data Clean Room

Keywords: Data Clean Room

Snowflake Data Clean Room (Native App)

Keywords: Data Clean Room

Book 1: the data vault guru

a pragmatic guide on building a data vault

The data vault methodology presents a unique opportunity to model the enterprise data warehouse using the same automation principles applicable in today’s software delivery (continuous integration, continuous delivery and continuous deployment) while still maintaining the standards expected for governing a corporation’s most valuable asset: data. This book first provides the landscape of a modern architecture and then serves as a thorough guide on how to deliver a data model that flexes as the enterprise flexes: the data vault. Whether the data is structured, semi-structured or even unstructured, one thing is clear: there is always a model, either applied early (schema-on-write) or applied late (schema-on-read). Today’s focus on data governance requires that we know what we retain about our customers. The data vault provides that focus by delivering a methodology centred on all aspects of the customer, and it provides some of the best practices for modern-day data compliance.
The book will delve into every data vault modelling artefact, its automation with sample code, raw vault, business vault, a testing framework, a build framework, sample data vault models, and how to build automation patterns on top of a data vault. It even offers an extension of data vault that provides automated timeline correction, not to mention variations of data vault designed to provide audit trails, metadata control and integration with agile delivery tools.

Other

Merging Data Vault and Medallion Architecture Patterns, bit.ly/40B6Z6V
Data Mesh & Data Vault on Snowflake 2024, bit.ly/3ZLOpbS, bit.ly/49Iz2W0
DataEngBytes 2024 — bit.ly/4huXqNR
Keynote Speaker at Data Vault UK 2024, bit.ly/3UoisnS
Going DAGless with Business Vault and Activity Schema, bit.ly/3VnYpFT, bit.ly/46d8Tgo
Snowflake Summit 2024 presentation, “Data Mesh and Data Vault on Snowflake with Xero Effort”, bit.ly/3VIOhcl
Snowflake, the Data Cloud, bit.ly/3NJqfGO, bit.ly/36Ho2we and bit.ly/3SPdUFE
Demystifying Data Vault with dbt — Coalesce 2023, bit.ly/3umAi0E
DataEngBytes 2023 — Data Vault Engineering, bit.ly/3ssVbXu
Github — https://github.com/PatrickCuba/the_data_must_flow
Data Vault UK Interview — bit.ly/3baadp9
Data Vault UK Presentation — youtube.com/watch?v=7lUn3eBiuyU
Data Vault Munich Presentation — youtube.com/watch?v=tRPgijauH2w
Keynote Speaker at Data Vault UK 2023, bit.ly/3TCJtEC
Meet the Expert: Data Vault — bit.ly/3t1hBe1
Snowflake Data Vault User Group: DataOps — bit.ly/3qbnm7P
DataVault Interview — data-vault.co.uk/patrick-cuba-interview/
Integrating SAS and Data Vault, bit.ly/2YUw1xT
Data Mapping, bit.ly/3s0kcEj, bit.ly/32FnFQI
3 Ways to load data into SQL Server MDS, bit.ly/3mirsbs
My Hash of Hashes, bit.ly/2MGKE5L
SAS indexing tricks, bit.ly/2L9gsiW
SAS Parallelism, bit.ly/3oiQubn
SAS SQL vs Data Step part 3, bit.ly/3s0Hbie
SAS SQL Join vs Data Step Merge part 2, bit.ly/3nnfNaF
SAS Hash Tables, bit.ly/3hPSwxg
SAS Data Step Merge vs SQL Joins, bit.ly/3be5jIf