4. Data Vault, Semantics and Ontologies

Patrick Cuba
Sep 8, 2022


Data Vault on Data Mesh 4/4

“There are only two hard things in Computer Science: cache invalidation and naming things” — Phil Karlton

Series catalogue:

  1. Data Vault and Domain Driven Design
  2. Data Vault as a Product
  3. Data Vault and Domain Oriented Architecture
  4. Data Vault, Semantics and Ontologies
  5. Data Vault and Analytics Maturity

In the first blog in this series we looked at domain-driven design; in the second, data products and the expectations around product delivery; in the third, data architecture and product maturity. Now, in this final blog, we'll touch on semantics and ontologies, i.e. naming things.

Definition 1:

“A semantic layer is a business representation of corporate data that helps end users access data autonomously using common business terms. A semantic layer maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization.” — Wikipedia

Definition 2:

“An ontology encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains of discourse. A domain ontology represents concepts which belong to a realm of the world, such as biology or politics. Each domain ontology typically models domain-specific definitions of terms.” — Wikipedia

We will dive into the semantic layer in the next section and then follow that up with a discussion on business ontologies, both relating to data vault. But first, some business architecture mapping.

Business capabilities mapping to information concepts

Information is a strategic asset; by mapping business capabilities to information concepts we establish a common vocabulary and describe the business ontology and taxonomy. This mapping also establishes the audit history of the enterprise; in turn, through information technology, business capabilities are automated, improved and enabled.

Semantic Layer

analytics stack including a semantic layer

This is not a new vision. The semantic layer is designed to integrate disparate data stores into a single virtual plane, standardizing the definitions of measures and dimensions while abstracting away how that data is provided. The semantic layer is not yet another data platform to move your data into; instead it is a virtual layer that offers optimal connectivity to those data stores while unifying the data for the end user. A key feature of this layer is predicate pushdown, i.e. generating the code needed to run natively on a platform and returning only the results to the virtual layer. The virtual layer is no-code headless BI, also known as the metrics layer; it is a declarative layer for business analytics users, who are free to build their own reports, wrangle data models and even support traditional OLAP cubes!
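As an illustration, here is a minimal sketch of a declarative metric definition and its compilation into platform-native SQL. The metric name, dimension names and the compile_to_sql helper are all hypothetical, not any particular semantic layer product's API:

```python
# A hypothetical declarative metric definition, in the spirit of a
# headless-BI / metrics layer. Names and structure are illustrative only.
revenue_metric = {
    "name": "total_revenue",
    "measure": "SUM(amount)",          # aggregate measure
    "source": "fact_card_transaction", # physical table on the platform
    "dimensions": ["customer_region", "transaction_month"],
}

def compile_to_sql(metric: dict, filters: str = "1=1") -> str:
    """Generate platform-native SQL so only results return to the
    virtual layer (the pushdown idea described above)."""
    dims = ", ".join(metric["dimensions"])
    return (
        f"SELECT {dims}, {metric['measure']} AS {metric['name']}\n"
        f"FROM {metric['source']}\n"
        f"WHERE {filters}\n"
        f"GROUP BY {dims}"
    )

print(compile_to_sql(revenue_metric, "transaction_month >= '2022-01-01'"))
```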

Dimensions, metrics and query assistance.

Why did we use a data vault to decompose the data only to recompose and unify the content in the semantic layer? Because data vault is built to absorb business rule outcomes into the three key things we want to know about any business object:

  • business object/entity definitions (with immutable business keys),
  • their limited or open states (attributes and values — descriptive content), and
  • their relationships and transactions with other business objects (unit of work).

Recall the three data vault table types:

Hubs, Links and Satellites

These table types are not designed to be easily queryable; they are designed to be flexible to business evolution and change through agility, automation and auditability. Data still needs to be formed into structures that are easier to interpret in the business context, whether single or cross domain. This is why the concept of satellite splitting is so important in data vault: you do the work of grouping data by what it describes, rate of change, criticality and privacy before it is used downstream. Just as placing data governance upfront saves on the overall cost of governance by reducing duplicated work, satellite splitting reduces the cost of denormalizing at query time downstream.
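A minimal sketch of the satellite-splitting idea: source columns grouped by what they describe, their rate of change and their privacy classification before loading. The column and satellite names here are hypothetical:

```python
# Hypothetical source columns grouped into separate satellites by what they
# describe, how fast they change, and their privacy classification.
satellite_split = {
    "sat_card_account_pii": {          # restricted access, masked downstream
        "columns": ["holder_name", "holder_dob", "holder_address"],
        "classification": "PII",
    },
    "sat_card_account_status": {       # fast-changing descriptive state
        "columns": ["account_status", "status_reason"],
        "classification": "public",
    },
    "sat_card_account_terms": {        # slow-changing contractual detail
        "columns": ["credit_limit", "interest_rate", "card_agreement_id"],
        "classification": "confidential",
    },
}

def satellite_for(column: str) -> str:
    """Route a source column to the satellite that should carry it."""
    for sat_name, spec in satellite_split.items():
        if column in spec["columns"]:
            return sat_name
    raise KeyError(f"unmapped column: {column}")

assert satellite_for("credit_limit") == "sat_card_account_terms"
```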

Mapping and satellite splitting data vault to dimensional terms

Metrics/measures live in fact tables; we know these from Kimball modelling, and they reappear in the semantic layer as:

  • transactional metrics: aggregate measures or non-aggregate facts.
  • periodic snapshots that fall into a defined business reporting period and likely involve a form of aggregation to that reporting period, e.g. aggregate sums and semi-aggregate averages.
  • accumulating snapshots: metrics based on previous metrics for the same business entity.

For each fact there are the time-based dimensions describing those measures in relation to the entities included in the unit of work being analyzed. The most common form of dimension table is the classic slowly changing dimension type 2 (a dimension with start and end dates, where a high end date indicates the active/current record for an entity).
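Because a data vault satellite is insert-only (the load timestamp is the start date; there is no physical end date), a type 2 dimension can be derived at query time by pairing each record with its successor. A minimal Python sketch under assumed column names:

```python
from datetime import datetime

HIGH_DATE = datetime(9999, 12, 31)  # high date marks the active/current record

# Hypothetical satellite rows: insert-only, one row per change per business key.
sat_rows = [
    {"hub_key": "C-1001", "load_ts": datetime(2022, 1, 5), "status": "opened"},
    {"hub_key": "C-1001", "load_ts": datetime(2022, 3, 9), "status": "active"},
    {"hub_key": "C-1001", "load_ts": datetime(2022, 8, 1), "status": "suspended"},
]

def to_scd2(rows):
    """Derive start/end dates by ordering each key's rows and taking the
    next row's load timestamp as the end date (equivalent to SQL LEAD())."""
    rows = sorted(rows, key=lambda r: (r["hub_key"], r["load_ts"]))
    dim = []
    for i, row in enumerate(rows):
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        end = nxt["load_ts"] if nxt and nxt["hub_key"] == row["hub_key"] else HIGH_DATE
        dim.append({**row, "start_ts": row["load_ts"], "end_ts": end})
    return dim

for record in to_scd2(sat_rows):
    print(record["status"], record["start_ts"], "->", record["end_ts"])
```

In SQL this is typically a LEAD() window function over the satellite's load timestamp, partitioned by the hub key.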

Can the semantic layer reason with a data vault model? It does not need to…

Mapping now includes data products for pre-aggregate loads and performance enhancements using a data vault.

Information marts as data product aggregates are produced to pull in everything you need to know about a business object or transaction. As for the metrics, the data product can include some pre-aggregation upfront, such as:

  • pre-calculated metrics using scalar functions;
  • window-function based calculations, e.g. first(), last(), lead() and lag();
  • measures requiring joins across data from different domains and sources.

Advanced table structures called “query-assistance” tables can be utilised to query a data vault, and they have a dual purpose:

  • mask the data vault join complexity away from the users of data vault, and
  • take advantage of the platform’s OLAP optimizations (hash-joins) to perform star queries (equi-joins).

These are the fabled point-in-time (PIT) and bridge tables, and they may include some of the above pre-calculations persisted as physicalised columns.
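To make the dual purpose concrete, here is a minimal sketch of PIT construction: for each business key and snapshot date, record the newest applicable load timestamp of each satellite. Table and column names are illustrative:

```python
from datetime import date

# Hypothetical satellite load history per business key (load dates only).
sat_status_loads = {"C-1001": [date(2022, 1, 5), date(2022, 3, 9)]}
sat_terms_loads  = {"C-1001": [date(2022, 2, 1)]}

def pit_rows(keys, snapshots, satellites):
    """One PIT row per key per snapshot date, pointing at the newest
    satellite record on or before the snapshot (None if none applies)."""
    for key in keys:
        for snap in snapshots:
            row = {"hub_key": key, "snapshot_date": snap}
            for sat_name, loads in satellites.items():
                applicable = [d for d in loads.get(key, []) if d <= snap]
                row[f"{sat_name}_ldts"] = max(applicable) if applicable else None
            yield row

snapshots = [date(2022, 1, 31), date(2022, 3, 31)]
sats = {"sat_status": sat_status_loads, "sat_terms": sat_terms_loads}
for row in pit_rows(["C-1001"], snapshots, sats):
    print(row)
```

A downstream query then equi-joins each satellite on the hub key and its recorded load timestamp, which is the star-query shape the platform's hash-join optimizations favour.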

PITs & Bridges

To learn more about how these query optimization tables function, and for strategies on how to build them, visit:

Bitemporal, INSERT-ONLY data model

Data vault is an INSERT-ONLY modelling methodology (see bit.ly/38fAy6H), but it is also bi-temporal: two timestamps are included on every record.

  • Event timestamp, also called the applied timestamp. The business event, its descriptive attributes and measures, and the applicable business timestamps at a point in time are collectively grouped under an applied timestamp. These grouped timestamps may include discrete events (happened now), recurring timestamps (happen periodically) and evolving timestamps (happening now) that are “packaged” by that applied timestamp.

Package of time for a business object, the applicable data grouped into an applied timestamp
  • Processed timestamp, also called the load timestamp: the time the record enters the data vault. Because we have both applied and load timestamps, the load timestamp acts as the version timestamp of the applied timestamp record (package). Processing corrections will have the same applied timestamp but a newer load timestamp; your information marts should be based on the newest load timestamp record per applied timestamp per business object. Data vault thus fulfils one of the basic tenets of DataOps (carried over from its comparison to DevOps): versioning data. Data vault tables are the recorded history of business rule outcomes; if a business rule has evolved for the same applied-date package then it can also generate a new outcome.
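A minimal sketch of the version-selection rule just described: keep only the newest load timestamp per applied timestamp per business object. Column names are assumptions:

```python
# Hypothetical bi-temporal satellite rows: a correction shares the applied
# timestamp of the original record but carries a newer load timestamp.
rows = [
    {"hub_key": "A-7", "applied_ts": "2022-06-01", "load_ts": "2022-06-02", "balance": 100},
    {"hub_key": "A-7", "applied_ts": "2022-06-01", "load_ts": "2022-06-15", "balance": 110},  # correction
    {"hub_key": "A-7", "applied_ts": "2022-07-01", "load_ts": "2022-07-02", "balance": 95},
]

def current_versions(rows):
    """Newest load_ts per (hub_key, applied_ts): the version an
    information mart should expose."""
    best = {}
    for r in rows:
        k = (r["hub_key"], r["applied_ts"])
        if k not in best or r["load_ts"] > best[k]["load_ts"]:
            best[k] = r
    return sorted(best.values(), key=lambda r: r["applied_ts"])

for r in current_versions(rows):
    print(r["applied_ts"], r["balance"])   # 110 replaces 100 for June
```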
Processing a correction

Because data vault is bi-temporal and insert-only, correcting previously loaded content is far easier than ever before. In the first instance, if a correction to a record is needed, the new record can be inserted as long as its record hash digest differs from the adjacent hash digest for that business object in the chronological order of business events (if the hash digest is the same then no correction was needed!). In the second instance, it makes processing out-of-sequence data possible: there is no need to update or manage end-date columns when an older record (by applied timestamp) arrives to backfill history, because data vault has no end dates and the timeline for that business entity remains correct. This pattern makes the data vault self-healing; it remains the auditable source of the business facts even while it processes corrections to the timeline dynamically. Query assistance tables and information marts are in turn disposable: because the audit history is always in the data vault and no refactoring is ever needed, they can be destroyed and rebuilt when a correction has occurred!
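A minimal sketch of the hash-digest check that decides whether an incoming record is a genuine change. The full out-of-sequence pattern compares against both chronological neighbours by applied timestamp; for brevity only the prior record is shown here, and the column choices are illustrative:

```python
import hashlib

def hashdiff(record: dict, descriptive_cols: list) -> str:
    """Hash the descriptive content so change detection is a single compare."""
    payload = "||".join(str(record.get(c, "")).strip().upper() for c in descriptive_cols)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Hypothetical timeline for one business key, ordered by applied timestamp.
timeline = [
    {"applied_ts": "2022-01-01", "status": "opened"},
    {"applied_ts": "2022-06-01", "status": "active"},
]

incoming = {"applied_ts": "2022-03-01", "status": "opened"}  # out-of-sequence

# Find the record immediately before the incoming one on the applied timeline.
prior = max((r for r in timeline if r["applied_ts"] < incoming["applied_ts"]),
            key=lambda r: r["applied_ts"], default=None)

cols = ["status"]
if prior is None or hashdiff(incoming, cols) != hashdiff(prior, cols):
    print("insert: genuine change")   # new record extends/backfills the timeline
else:
    print("skip: no change relative to the adjacent record")
```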

To explore this data vault pattern visit bit.ly/3y4mUdV

Out-of-sequence batch load, and solution

Write back, and Business Vault

New insights derived from the semantic layer can find their way back into the platform in a number of ways:

  • as input into the source platform/application; should the need arise, that data can be ingested and modelled into the data vault. Because it now comes from the source it should be modelled into the raw vault, even though the data originated in the semantic layer! This will result in single-domain business vault artefacts.
  • as landed content, modelled and ingested into the business vault (see bit.ly/3cm0104), because it extends raw vault hub or link tables and does not pass through a source application to get there. These business vault artefacts can be cross-domain.

Let’s turn our attention to business semantics…

Business Ontology

Triples, OWL, RDF

The Semantic Web consists of three technical standards:

  • Resource Description Framework (RDF) — a data modelling language (semantic schema standard); triple statements are represented as a directed edge-labelled graph, and RDF* additionally allows descriptions to be added to edges.
  • Web Ontology Language (OWL) — a schema/knowledge representation language built from composable elements; these describe data schemas, taxonomies and vocabularies.
  • SPARQL — the query language for RDF graphs.

and…

An ontology is a formal specification that provides sharable and reusable knowledge representation as a common understanding of information within a domain. The ontology includes descriptions of:

  • concepts and properties of a domain;
  • relationships, relationship types and constraints between concepts;
  • individuals or classes with respect to a class hierarchy, grouped into categories (taxonomy) as instances; and
  • free-text descriptions.

A business ontology is the business representation of business object states and relationships, depicted as triples: subject-predicate-object, or entity-attribute-value; it declares business axioms using OWL syntax.
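A minimal sketch of such triples using Python's rdflib; the namespace and terms are illustrative, not drawn from any published ontology:

```python
from rdflib import Graph, Namespace, RDF, RDFS

# Hypothetical business namespace; FIBO or another published ontology
# would normally supply these terms.
BIZ = Namespace("https://example.com/business#")

g = Graph()
g.bind("biz", BIZ)

# Class hierarchy (taxonomy): a bank account is a kind of account.
g.add((BIZ.BankAccount, RDFS.subClassOf, BIZ.Account))

# Triple statements: subject - predicate - object.
g.add((BIZ.Bob, RDF.type, BIZ.LegalPerson))
g.add((BIZ.Bob, BIZ.holds, BIZ.BobsAccount))
g.add((BIZ.BobsAccount, RDF.type, BIZ.BankAccount))

print(g.serialize(format="turtle"))
```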

The business ontology operates at the enterprise level, just like the data vault does. A hub table's name and definition form the focal point of any business concept, which is uniquely represented by an immutable business key.

To elaborate on how a data vault's hub table names can be established using a business ontology, we will use FIBO, a community-driven, EDM Council sponsored business ontology for the financial industry that includes the expected taxonomies depicted as class-diagram hierarchies.

Right at the top of the hierarchy is a Thing; below it are 15 sub-classes of a Thing:

  • Autonomous Agents (AA) — an agent is an individual that can adapt to and interact with its environment. This is a person, legal person, organization or automated system.
  • Thing in Role (TIR) — a thing-in-role is a relative concept that ties a thing to a role it plays in a given context. This is an agent in a role (client, contract party, obligator or manager), asset, functional entity or facility.
  • Reference (REF) — a concept that refers to (or stands in for) another concept. Ref is used to classify identifiers, registry and codes.
  • Arrangement (ARR) — an organizing structure for something such as classifications, groupings, pools, baskets, portfolios etc.
  • Location (LOC) — a named geographic place: a geopolitical entity (country, municipality), physical locations, notional places (abstracts) and virtual locations such as URLs and email.
  • Document (DOC) — tangible item that records something, like a publication or legal document.
  • Service (SVC) — an economic activity that is intangible, is not stored and does not result in ownership; a service is consumed at the point of sale — registration, regulatory or financial.
  • Product (PRD) — commercially distributed goods that are tangible, and pass through a distribution channel before being consumed or used.
  • Agreement (AGR) — a negotiated and enforceable understanding between two or more legally competent parties; such as a mutual agreement or contract.
  • Commitment (COM) — a legal construct which represents the undertaking on the part of some party to act or refrain from acting in some manner. Examples of this are payment obligations, debt, guarantee and contract terms.
  • Contract Element (CE) — general and special arrangements, provisions, requirements, rules, specifications, and standards that form an integral part of an agreement or contract. These are commitment priority levels, conditions, contract definitions and terms.
  • Legal construct (LC) — something which is conferred by way of law or contract, such as a right. Think of duty, regulation, claim and legal capacity here.
  • Account (AC) — a container for records associated with a business arrangement for regular dealings or services. Account is a supertype of financial service accounts like loan, debt, deposit, investment and bank accounts.
  • Time instant or interval (TI) — either a discrete event or a duration of time that may be recurring.
  • Occurrence (OCC) — the event or transaction itself.
15 Business concepts, built from fib-dm

Hub tables are your business objects as represented by your business ontology. If the business concept/object has an immutable business key (REF), we can track its movement through an institution's value chain and/or value streams, as represented by the thing in role, account, contract/agreement (the binding object for a value stream), commitment, contract element and location.

We need to define the hub table at the correct grain; for instance, the definition of an account is:

“container for records associated with a business arrangement for regular transactions and services”.

Account concept with possible hub table names circled in blue

A name like hub_account may be too generic in this context because, as we see from the above relationship diagram, account is a super-type of a few finer-grain business concepts that differ in semantic meaning. The definitions for each show they are semantically different and should be loaded to separate hub tables, for example:

  • card account — “account whose terms and conditions are defined in a card agreement that is represented by a payment card” — should be defined as hub_card_account
  • loan account — “account held by the borrower associated with a specific loan” — should be defined as hub_loan_account

Each hub table will have identifiers associated with it, such as a card account number, loan account number and so on. Could we load debit account numbers into the same hub table, hub_card_account? Semantically they are the same, with slightly differing definitions; we look at the FIBO definitions and attributes related to the card concept and (for example) decide that they are semantically the same.
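Once the grain is settled, the hub itself is structurally simple. A minimal sketch of a data vault 2.0 style hub record (a hashed business key plus standard load metadata); the key treatment and names here are illustrative assumptions:

```python
import hashlib
from datetime import datetime, timezone

def hub_hash_key(business_key: str) -> str:
    """Standardise then hash the immutable business key (a common DV2
    treatment: trim and upper-case before hashing)."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

def hub_card_account_record(card_account_number: str, record_source: str) -> dict:
    """One hub row: hash key, the business key itself, plus load metadata."""
    return {
        "hub_card_account_hk": hub_hash_key(card_account_number),
        "card_account_number": card_account_number.strip().upper(),
        "load_ts": datetime.now(timezone.utc),
        "record_source": record_source,
    }

print(hub_card_account_record("4539-0001-2222-3333", "cards.core-banking"))
```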

As a guide we can rely on published ontologies, with a warning: do not overly rely on them for this exercise; they are a useful guide for the data modelling process, but always talk to your business to get these definitions correct! As an add-on, if you find that there isn't a useful ontology available for the domain you're attempting to model, this is an opportune time to build one through mob modelling, domain storytelling or event storming (to name a few)!

When establishing names and definitions for hub tables in the absence of an established business ontology, remember that the ontology must be business driven; to define those names it is useful to stick to a few guiding principles (see bit.ly/3JWMicF):

  1. Consistency — ubiquitous terms, each concept should be represented by a single, unique name.
  2. Understandability — domain-specific, a name should describe the concept it represents.
  3. Specificity — a name shouldn’t be overly vague or overly specific.
  4. Brevity — a name should be neither overly short nor overly long.
  5. Searchability — a name should be easily found across code, documentation, and other resources. Avoid names that are too generic; a name like “user” may be the most appropriate for a given concept, while being generic enough that there are other valid uses of that name in adjacent domains.
  6. Pronounceability — a name should be easy to use in common speech.
  7. Austerity — a name should not be clever or rely on temporary concepts.

A business ontology will suggest the expected relationships between business objects. Link table design is driven by the business event, relationship or transaction between two or more business objects, and these should form the basis for your link tables. Breaking this unit of work makes the link un-auditable: you will struggle to rebuild the history at a point in time if you ever need to. Although we strive to keep the unit of work intact, we can still produce the semantic relationship representation as the business understands it.

N-ary relationship depiction, not limited to ternary and can be a combination of the above

A software system/application is acquired because it matches, or closely matches, your business's need to automate business processes. The application outcomes are pulled/pushed to your analytics platform and modelled as data vault artefacts. The unit of work you get from the source system/application should closely match the unit of work you need; load the unit of work provided by the source as a raw vault link table. Should you need to depict or store the unit of work differently to how the source has supplied it, this is where you may consider modelling a business vault link table (for an example, refer to bit.ly/3EQ9wO3). You've maintained source-system auditability and you have represented the expected business unit of work: win-win!
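A minimal sketch of the unit-of-work idea for a link: the link key is a hash over all participating hub keys, so dropping a participant later would break the key and the audit trail. Names and keys are illustrative:

```python
import hashlib

def link_hash_key(*business_keys: str) -> str:
    """Hash the full unit of work: every participating business key,
    standardised and concatenated in a fixed order."""
    payload = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# A ternary unit of work: customer, card account and merchant in one event.
row = {
    "link_card_transaction_hk": link_hash_key("C-1001", "4539-0001", "M-77"),
    "hub_customer_hk": link_hash_key("C-1001"),       # same hashing rule per hub
    "hub_card_account_hk": link_hash_key("4539-0001"),
    "hub_merchant_hk": link_hash_key("M-77"),
}
print(row)
```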

From this perspective it is easy to see how the data vault could represent the knowledge graph of your business in a relational database; using a business ontology as a guide ensures that the ubiquitous language used in your data model represents the industry your business is in. If the semantics are deployed on a graph you can enable further semantic capabilities:

  • deductive reasoning — draws specific conclusions/inferences from a premise based on facts/truths/certainty depicted in the ontology, e.g. from “Bob has a Bank Account” we can infer “Bob is a Legal Person”. Deployed on an RDF triplestore (a sketch follows below).
  • inductive reasoning — extracts a probable premise from specific and limited observations/patterns (GraphML). Deployed on a labelled property graph.
Inductive reasoning techniques
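A minimal sketch of the deductive example above over rdflib. A full OWL/RDFS reasoner (e.g. the owlrl package) would derive this from a domain axiom; here the rule is spelled out explicitly as a SPARQL update, and the terms are illustrative:

```python
from rdflib import Graph, Namespace, RDF

BIZ = Namespace("https://example.com/business#")
g = Graph()
g.add((BIZ.Bob, BIZ.holds, BIZ.BobsAccount))
g.add((BIZ.BobsAccount, RDF.type, BIZ.BankAccount))

# Deductive rule: whoever holds a bank account is a legal person.
g.update("""
PREFIX biz: <https://example.com/business#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
INSERT { ?who rdf:type biz:LegalPerson }
WHERE  { ?who biz:holds ?acct . ?acct rdf:type biz:BankAccount }
""")

print((BIZ.Bob, RDF.type, BIZ.LegalPerson) in g)  # True: the inferred fact
```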

To learn more refer to the free Knowledge Graphs book located here: https://kgbook.org/

What's more, with text and analytics harvested from the data we can ultimately provide enterprise semantic search behaviour, letting your users find the data they need.

Conclusions

Does a data vault impose implementation styles that are counter to modern day analytics?

Well no; the data vault is about the business, and it always has been (see bit.ly/3IqgPzb). Through passive integration using repeatable patterns on a data model that is non-destructive to change, data vault as a methodology will scale as your business scales and does not impose restrictions on how you deliver and manage your data platform.

Data vault does, however, impose one requirement: discipline. With discipline, the data vault's graph-like patterns can themselves be used to derive even more analytical value.

This post and others expand on the data vault content explored in the book “The Data Vault Guru” (amzn.to/3d7LsJV).

In the book you will find:

  • data vault architecture patterns, governance, and audit
  • the various types of keys and how to think about bi-temporal data
  • raw vault, business vault, how to grade data vault automation tools and a guide on how to build your own data vault automation tool
  • all data vault table types and expected metadata, especially the code to build and query effectivity satellites the right way (bit.ly/3oS4k70)
  • automation patterns for data vault, including a pattern for correcting timelines when data arrives out of sequence (bit.ly/3R4Azw2)
  • a test framework, and how to get data out of data vault, including using query assistance tables to do so (bit.ly/3CSP3aV)
  • building various data vault-based variations like a metrics vault and a Jira-vault

See bit.ly/3nycjoz for more details

“In theory there’s no difference between theory and practice. In practice there is.” — Yogi Berra.

#datavault #datamesh #businessarchitecture #selfservice #semantic #knowledgegraph #dataops

The views expressed in this article are my own; you should test implementation performance before committing to any implementation. The author provides no guarantees in this regard.

Written by Patrick Cuba, a Data Vault 2.0 Expert and Snowflake Solution Architect.