1. Data Vault and Domain Driven Design

Patrick Cuba
14 min read · Sep 5, 2022


Data Vault on Data Mesh 1/4

“It is not the domain experts’ knowledge that goes to production, it is the assumption of the developers.” — Alberto Brandolini

Series catalogue:

  1. Data Vault and Domain Driven Design
  2. Data Vault as a Product
  3. Data Vault and Domain Oriented Architecture
  4. Data Vault, Semantics and Ontologies
  5. Data Vault and Analytics Maturity

Recognizing that a data vault (DV) build is a software build brings to mind an array of software engineering best practices, relevant to both software and data engineering, that have a data vault flavour. In this blog we will look at how domain-driven design (DDD) thinking can be used with data vault design thinking. I say design instead of modelling because, in my opinion, data vault is not just a data modelling technique but a way of mapping the business architecture to the information domain through agility, automation, and auditability.

After all, organizations are in the business of delivering value (and delighting their customers); software is in the business of automating how that value is delivered (i.e. service-oriented architecture). It is vital that those responsible for automation understand what the business wants; domain-driven design is a method for software engineering teams to bring software design thinking closer to business design thinking, from both perspectives.

Domains (and subdomains) shape the enterprise data vault model, and each domain may overlap with business capabilities and manage business objects. Agility comes in the form of software engineering best practices: repeatable patterns, and organizing teams into pods/squads/tribes that focus on different aspects of data products whilst leveraging domain experts and product thinking within the domain. Each squad, based on a domain, speaks a ubiquitous business language and through DDD develops a symbiotic relationship with the business. Finally, auditability comes with the responsibility for securing data products and ensuring data privacy standards, observability and quality whilst complying with various regulatory and compliance requirements.

Definition #1: What is a domain? “A sphere of knowledge, influence, or activity. The subject area to which the user applies a program is the domain of the software.” — Eric Evans

Definition #2: Who is the domain expert? “A subject matter expert (SME) is a person who has accumulated great knowledge in a particular field or topic, a level of knowledge demonstrated by the person’s degree or licensure.” — Wikipedia

Let’s start with DDD’s strategic design, the central concept being the bounded context. A bounded context is a way of defining the range of applicability of a data model and how it may relate to other model contexts. To start, let’s borrow the model presented in Martin Fowler’s example here, bit.ly/3y7Hxrr

Bounded contexts in the sales domain: the sales and support contexts. Notice how each has a customer and product context; are they the same contexts?

From a business architecture perspective in the above diagram, the sales context is customer facing and a tier two core business capability; the support context is a tier three supporting capability. This can differ for your business, but under this example we’ll stick to these stratification tiers. Similar language is used in DDD to describe these as subdomains to carve out software capability into core, supporting and generic subdomains.

Learn more about business architecture and data vault here bit.ly/3o4koB6

Each hexagon in the above diagram is also a bounded context that demands the rigour described under DDD.

Bounded Context (names enter) Ubiquitous Language

“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.” — Melvin Conway

For an application to have business context it must carry or map to the names and labels the business is familiar with; the vocabulary is defined by the business itself in the business ontology. Each industry has vocabulary relevant to that industry that likely will not translate well into another industry. We call this phenomenon “polysemy”: the capacity for a sign, word, or phrase to have multiple related meanings. Words and phrases must be based on your business domain and avoid what Eric Evans (see bit.ly/3appcyn) calls “false cognates” — this is when two or more individuals use the same term, thinking they are talking about the same thing, when they really are not! For example, in the model above the word “ticket” can carry multiple meanings depending on the context.

Ticket polysemy examples:

  • a piece of paper or card that gives the holder a certain right. noun
  • (in information technology) a request logged on a work tracking system detailing an issue that needs to be addressed or task that must be performed. noun
  • a method of getting into or out of (a specified state or situation). noun
  • a certificate or warrant. noun
  • a list of candidates put forward by a party in an election. noun
  • the desirable or correct thing. noun
  • issue (someone) with an official notice of a traffic offence. verb
  • (of a passenger) be issued with a travel ticket. verb
  • (of a retail product) be marked with a label giving its price, size, and other details. verb

You can imagine the number of business contexts “ticket” could be used in, such as IT support, travel industry, flights, airlines, criminal prosecution, politics… Oh my!

But we know from the context of the “support ticket” that it refers to the second bulleted item above, the IT work-tracking request.

Business terms are driven by the business and its industry, never by the software development team. To close the gap between business and software teams, we can bring system and business domain experts, architects, and engineers together by running workshops that share relevant industry expertise with the data modellers and engineers through:

  • Mob modelling — an accelerator for rapidly building data vault models through collaboration, bit.ly/3uhWqX3
  • Domain storytelling — a collaborative modelling technique that highlights how people work together to transform domain knowledge into business software, bit.ly/3ND3KDl
  • Event storming — a workshop-based method to quickly find out what is happening in the domain of a software program, bit.ly/3ab72QI

Going forward, and because this post discusses data vault, it will help for the modelling team to be knowledgeable of the aims and names of data vault modelling artefacts relevant to capturing business domains.

To learn more of the basic building blocks of a data vault, visit bit.ly/2ZYGpJP

Data vault table types are significant because they effectively map to the three important details a business needs to know about any business object as information concepts:

  • Its definition and how to uniquely identify it (hub tables).
  • Its state — limited or open, with historical context and audit history (satellite tables).
  • Its relationships — transactions and interactions with other business objects (link tables).

These definitions are articulated in an information map, one of the four core domains of business architecture. The information map is synonymous with a business glossary.
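A minimal sketch of these three table types in Python; the class shapes, the MD5 hashing choice, and the “||” delimiter are illustrative assumptions, not a prescribed implementation:

```python
from dataclasses import dataclass, field
import hashlib


def hash_key(*parts: str) -> str:
    # Surrogate hash key over trimmed, upper-cased parts joined by '||'
    # (MD5 and the delimiter are illustrative choices).
    joined = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


@dataclass
class Hub:  # definition: how to uniquely identify a business object
    business_key: str
    hub_hash_key: str


@dataclass
class Satellite:  # state: historical context and audit history
    hub_hash_key: str
    load_date: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Link:  # relationships: interactions between business objects
    link_hash_key: str
    hub_hash_keys: tuple
```

Note how the satellite and link never repeat the business key itself; they reference the hub through its hash key, which is what makes the hub the single point of integration.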

Business Architecture Core and Extended Domains

Now back to DDD…

Bounded Context (keep model unified by) Continuous Integration

Parallel teams, centralised standards

“A model should not be built to last, it should be built to change — only then can it truly last.” — Lars Rönnbäck

An enterprise data vault model is possible; the benefit of building one data vault for your enterprise is that it is built around the same definitions and terms, prescribed as data vault hub tables. Hub tables form the integration points between source-aligned domain data (source systems) and therefore act as the primary mapping artefact between source systems and the business domain. With each team working in parallel and building towards the same data vault model, a few best practices must be adhered to in order to continue your integration success.

  1. Establish the naming standards early, that is, table naming standards and data vault metadata column naming standards. Here we can set source-badge definitions and automation patterns (incremental, snapshot and streaming ingestion patterns), and start to set business vault column naming standards as well.
  2. Establish the agile development patterns early, build a data vault body of knowledge (DVBoK) for established patterns and decisions with guiding flowcharts and principles for parallel teams to leverage and grow. Include decision registers and a method for managing technical debt.
  3. Along with the other recommendations described under mob modelling, integration to a hub table must be profiled: should the content be loaded to an existing hub table, or must a new hub table be modelled to house a new business object type?
  4. To ensure success, also follow these guidelines around refactoring (in a data vault we never need to refactor, ever!): bit.ly/3tPI66B and more: bit.ly/3yKXwxj
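As a trivial illustration of the first point, naming standards can be enforced in code from day one. The prefixes below (hub_, lnk_, sat_) and the snake_case rule are hypothetical and stand in for whatever your DVBoK prescribes:

```python
import re

# Hypothetical naming standard: <type prefix>_<name>, all snake_case.
# Replace these patterns with your own DVBoK conventions.
NAMING_RULES = {
    "hub": re.compile(r"hub_[a-z0-9_]+"),
    "lnk": re.compile(r"lnk_[a-z0-9_]+"),
    "sat": re.compile(r"sat_[a-z0-9_]+"),
}


def check_table_name(table_type: str, name: str) -> bool:
    """Return True if the table name conforms to the (assumed) standard."""
    rule = NAMING_RULES.get(table_type)
    return bool(rule and rule.fullmatch(name))
```

A check like this can run in CI so that parallel teams merging into the one enterprise model cannot drift from the agreed standard.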

Borrowing from extreme programming (XP, see bit.ly/3R7iUne), each team will mob model, build, deploy, test, and merge to the enterprise data vault model.

Borrowed from XP, add monitoring and retiring to the list!

The tests described in XP are not the automated tests for data vault observability described here: bit.ly/3OCxRMt. Instead, these are the tests run during the build to ensure integration does not cause any issues or corruption in the existing data vault model; for this, a production clone of the hub table should be made available to perform integration tests against (an innovation Snowflake pioneered; see cloning considerations).

Bounded Context (assess relationships with) Context Map

Mapping bounded contexts and context maps to data vault

Context Maps describe the contract between bounded contexts with a collection of context mapping patterns. From a data vault perspective, we must relate this integration by way of passive integration in the shared hub table.

Blue cylinder is the hub table, yellow pipe is the satellite table and red pipe is the link table, green rounded squares are the overlapping contexts.

Passive integration scenarios:

  1. Natural Passive Integration. Business keys are shared between contexts and represent the same business objects. This is the best case; passive integration is achieved straight out of the box. No business key collision code (BKCC) is needed, or the business keys share the same collision code. BKCCs are used in hash-key generation, which is part of why hash keys are a great method for keeping join queries simple.
  2. Mapping between Contexts. No overlap exists between contexts and there is no way to integrate the same business objects. No issue will exist in the integration to a common hub table; however, you will not be able to relate business objects between contexts (if they do exist) unless a mapping is supplied by either context (source domain) or the solution is supported by master data management. Here you will likely see the use of a “same-as link” table for the mapping between contexts, to solve equi-joins that overlap contexts — in effect, a context map.
  3. Collisions Managed. No overlap exists between contexts, but the same business key profile exists between contexts for different business objects. Like the previous scenario; however, if the same business key could appear in either context but represents different business objects within the same hub table, then we must add a persistent salt to the latter context to ensure any context data queried from the data vault is not mistakenly aligned to the wrong business object. This is where we focus on business key collision codes, as elaborated in this post: bit.ly/3yBC6m8
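The role of the BKCC in these scenarios can be sketched in a few lines of Python; the MD5 choice, the delimiter and the BKCC values are illustrative assumptions. The point is that the BKCC participates in hash-key generation, so identical key text from different contexts lands as distinct hub records:

```python
import hashlib


def hub_hash_key(business_key: str, bkcc: str = "default") -> str:
    """Hash key over the business key plus a business key collision code (BKCC).
    Same key text with a different BKCC yields a different hash key,
    so different business objects never collide in the shared hub table."""
    payload = f"{bkcc.strip().upper()}||{business_key.strip().upper()}"
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Scenario 3 (Collisions Managed): the same ticket number arrives from two
# contexts but represents different business objects.
support = hub_hash_key("TCK-1001", bkcc="support")
travel = hub_hash_key("TCK-1001", bkcc="travel")
assert support != travel

# Scenario 1 (Natural Passive Integration): contexts share the same BKCC,
# so the same business object resolves to the same hub record.
assert hub_hash_key("CUST-1", "default") == hub_hash_key(" cust-1 ", "default")
```

In scenario 2 the hash keys from each context would simply never match, and the “same-as link” supplies the mapping between them.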

Now, the basis for ensuring the context between a hub table and any related link and satellite tables is the use of SQL equi-joins. To ensure only the applicable context is joined, we use the techniques described above to achieve passive integration, because the business key collision code is used in the generation of the hash keys themselves. Furthermore, equi-joins will outperform any outer join in bringing the applicable data together.

  • For evidence of equi-join performance on Snowflake, visit bit.ly/3wohniH
  • To understand equi-join in the context of using Ghost Records and Zero Keys, visit bit.ly/3R5aOLZ

Bounded contexts in a data vault model: by equi-join and passive integration, adding a new context is not a destructive change to the enterprise data vault model. The enterprise data vault model is not a monolithic data model.
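A minimal sketch of the zero-key mechanics that make these equi-joins work, with sqlite3 standing in for Snowflake and all table and column names illustrative: a link row with no known customer carries the zero key, which equi-joins to a placeholder (ghost) record in the hub, so a plain inner join returns every row without needing an outer join.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE hub_customer (hub_hash_key TEXT PRIMARY KEY, customer_id TEXT);
    CREATE TABLE lnk_ticket_customer (ticket_id TEXT, customer_hash_key TEXT);

    -- placeholder (ghost) record addressed by the zero key
    INSERT INTO hub_customer VALUES ('0000', '(unknown)');
    INSERT INTO hub_customer VALUES ('a1',   'CUST-1');

    INSERT INTO lnk_ticket_customer VALUES ('TCK-1', 'a1');
    INSERT INTO lnk_ticket_customer VALUES ('TCK-2', '0000');  -- no customer known
""")

# Plain equi-join; no outer join needed, yet no ticket is dropped.
rows = con.execute("""
    SELECT l.ticket_id, h.customer_id
    FROM lnk_ticket_customer l
    JOIN hub_customer h ON h.hub_hash_key = l.customer_hash_key
    ORDER BY l.ticket_id
""").fetchall()
assert rows == [('TCK-1', 'CUST-1'), ('TCK-2', '(unknown)')]
```

The same idea applies to satellite tables carrying ghost records, which is what keeps information mart queries on inner joins only.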

How to apply passive integration can also be investigated in this article, bit.ly/3NEcAkv

From the perspective of context map types, a data vault absorbs whatever changes are applicable from the source systems as hub, link and satellite tables, and supports schema drift in satellite tables only. With passive integration entrenched in our minds, let’s apply DDD’s context mapping patterns, from the data producer and data modelling perspectives, to the data vault model.

Context mapping cheat sheet: bit.ly/3nEJ6Il

  • Context Map (overlap allied contexts through) Shared Kernel (SK)

Co-dependence pattern

Hub tables are the integration tables for your business’s value chain and ultimately your customer journey. For example, partnerships and marketing can turn a lead into a customer with an account, a contract, and rewards and retention efforts to increase the customer’s lifetime value. Each one of these value stages carries with it a business object’s unique immutable id (business key) whose value objects (attributes) may mutate. For that reason, the data vault hub tables are your shared kernel and the only table type where conformance is applied in the raw vault, and where duplicate concepts are flushed out through data profiling with the help of domain experts.

  • Context Map (relate allied contexts as) Customer / Supplier Teams (CUS → SUP)

Upstream/Downstream pattern

Source systems’ landed content for ingestion into a curated raw vault makes the data producers the suppliers and the data vault modellers the customers. When new requirements are needed to support an evolving business rule (or a new business rule), the customer first looks to the supplier to support them. Should the new requirement be unlikely to be fulfilled by a source domain, we then look at how to categorize the business rule change and design and build business vault links and/or satellites to support the derived business rule.

  • Context Map (overlap unilaterally as) Conformist (CF)

Upstream/Downstream pattern

Probably not the ideal scenario, but several reasons could make this an acceptable pattern. For one, the data producer may indeed have a product that is as close to the ideal industry solution as you need, and they actively follow local or global standards to keep their product up to date, which you simply inherit. Frequent checks and balances are needed to ensure the base model still supports the needs of the business. Collaboration is probably limited to feature announcements and subscriptions, and new requirements will likely need to be exclusively managed as business vault artefacts.

Another scenario is the management of a complex domain where another domain relies on only the outcomes.

  • Context Map (support multiple clients through) Open Host Service (OHS)

Upstream/Downstream pattern

To ensure that source context changes don’t affect our data needs, a set of protocols and services could be set up so that you pull only the data needed, as part of a service level agreement (SLA). Versioning of the OHS can occur; this is best managed by storing the semi-structured form of the data in a satellite table and materializing only the key columns for query performance gains. Downstream users of this satellite table then pick the non-key attributes they need, and the satellite table’s semi-structured content is free to evolve without the need to evolve the satellite table schema itself.
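A sketch of that pattern, using sqlite3 and a JSON text column standing in for Snowflake’s VARIANT (table and column names are illustrative): only the key columns are materialized, and the payload can evolve without any schema change to the satellite.

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
# Only the key columns are materialized; the rest rides along as a
# semi-structured document (JSON text here, VARIANT on Snowflake).
con.execute("""
    CREATE TABLE sat_ticket_ohs (
        hub_hash_key TEXT,
        load_date TEXT,
        payload TEXT
    )
""")

v1 = {"status": "open"}
v2 = {"status": "open", "priority": "high"}  # upstream added a field: no DDL needed
con.execute("INSERT INTO sat_ticket_ohs VALUES (?, ?, ?)",
            ("a1", "2022-09-01", json.dumps(v1)))
con.execute("INSERT INTO sat_ticket_ohs VALUES (?, ?, ?)",
            ("a1", "2022-09-05", json.dumps(v2)))

# Downstream picks only the attributes it needs from the latest payload.
latest = con.execute(
    "SELECT payload FROM sat_ticket_ohs "
    "WHERE hub_hash_key = 'a1' ORDER BY load_date DESC LIMIT 1"
).fetchone()[0]
assert json.loads(latest)["priority"] == "high"
```

The equi-join back to the hub still runs on the materialized hash key, so the performance story from the previous section is unchanged.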

  • Context Map (free teams to go) Separate Ways (SW)

Independence pattern

Separate ways can be described as an anti-pattern and a sign of what business architecture describes as building similar business capabilities with little or no collaboration between business units and/or software teams. Business architecture mapping can be used to plan and ensure business strategy is correctly allocated to the initiatives funding it: cross-mapping value streams to capabilities, business units to those capabilities, and value streams to SOA, application portfolio and IT transformation initiatives.

A valid use of separate ways is building private business vaults: instead of deploying a shared business vault artefact that all domains can use and benefit from, the private business vault may contain artefacts that should not be shared, because their specific subdomain may not be compatible with the ubiquitous language. This means the same or similar business vault data product may appear in two or more business units, each with its own ownership, management and privacy. This is the opposite of a shared business vault data product: here, reusability is sacrificed for flexibility and agility.

  • Context Map (translate and insulate unilaterally with) Anti-corruption Layer (ACL)

Upstream/Downstream pattern

An ACL, from a data vault perspective, is the information mart and/or semantic layer: it makes querying and understanding the modelled data vault easier while not exposing business users who are not data vault proficient to the complexity of joining data vault tables together, letting them consume the information in a familiar format (e.g. cubes, pivot tables or Kimball dimensional marts, to name a few). Changes in the underlying data vault model must be profiled and collaborated upon to ensure the ACL is still sufficient to draw value from the data vault. Scenarios you could encounter include schema evolution, source system migration, deprecated source systems and more.

  • Context Map (coordinate interdependent contexts through) Partnership

Co-dependence pattern

Although all data products have a domain owner, some products can be developed in partnership between domains or business units. Clearly defined responsibilities must be established especially where context may overlap which will likely be in the shared business vault.

  • Context Map (draw a boundary around the whole mess as) Big Ball of Mud (BBoM)

Anti-pattern

Data vault is not the repository of technical debt. Complexity and “switches” built into the data vault automation add maintenance to the curated zone, including dealing with source system curiosities that may corrupt the data loaded into the data vault. As the adage states, “garbage in, garbage out”; a ball of mud accumulates tech debt the longer it is left alone, so the sooner it is dealt with, the better.

DDD + DV…

Data vault, Domains and Products

We have shown how the thinking behind domain-driven design can be applied to designing a data vault. We have also shown how data vault maps to business architecture and forms the basis of information mapping. A common question when building a data vault is: what do we do in the absence of a business ontology from which we can define the naming standards of the data vault model? Domain-driven design serves as an input to that process, playing a similar role in mapping business capabilities as business architecture does from a business perspective. Your software should reflect the ubiquitous language of your business, and your data vault definitely has to, or you could be in danger of building another legacy platform.

Next we will demonstrate how a data vault implementation is well suited to deliver data products…

“A model is a distillation of knowledge” — Eric Evans

#datavault #datamesh #businessarchitecture #ddd #domaindrivendesign #snowflake

in summary: data vault, an aggregate of aggregates

Data vault tables store business process outcomes; the automation of these processes can be written in any language, but loading the data vault should be done in one automated and standardized method.

The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.

Written by Patrick Cuba

A Data Vault 2.0 Expert, Snowflake Solution Architect
