5-Data Vault and Analytics Maturity
Series catalogue:
- Data Vault and Domain Driven Design
- Data Vault as a Product
- Data Vault and Domain Oriented Architecture
- Data Vault, Semantics and Ontologies
- Data Vault and Analytics Maturity
“A model is a way of representing reality. A framework is a way of looking at reality” — Dave Snowden, Cynefin framework
In the analytics and data domain we often discuss the levels of maturity that take an organisation from rudimentary analytical capabilities, such as operational reporting, to fully automated prescriptive analytics capabilities based on artificial intelligence.
Information Architecture is the bedrock for automating analytics, and it relies on clean and mastered data. When the question of the data is known, optimised data structures are built and business intelligence is served. When the question is not known, analytics relies on raw data to explore what the question should or could be: the unknowns. Through data lab experiments, machine learning models are trained and matured, and their results can be embedded into the analytics database or used to augment the operational database itself. The former is persisted into a business vault; the latter is re-ingested into a raw vault.
Below are examples of the analytics the above is designed to resolve, ranging from the simplistic to the complex (higher investment, higher-value outcomes).
Operational reporting (what has happened): straightforward reporting requirements asking of the data:
- What are our sales figures by channel?
- What are our arrears by bucket?
- What are our assets under management?
These are hard facts produced by the business process automation engines, such as measures of business process health. They are likely consumed from raw vault, but only for the current state of the business object or unit of work.
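As a minimal sketch of what serving that current state can look like (hypothetical hub and satellite names, and a Snowflake-style QUALIFY clause; a plain subquery with ROW_NUMBER works in other dialects), an operational report reads only the latest satellite record per business key:

```sql
-- Hypothetical example: current arrears by bucket, read from raw vault
WITH current_state AS (
    SELECT s.*
    FROM sat_account_arrears s                      -- illustrative satellite name
    QUALIFY ROW_NUMBER() OVER (
              PARTITION BY s.hub_account_hashkey
              ORDER BY s.load_datetime DESC) = 1    -- latest record per account only
)
SELECT arrears_bucket,
       COUNT(*)            AS accounts,
       SUM(arrears_amount) AS total_arrears
FROM current_state
GROUP BY arrears_bucket;
```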
Business intelligence (why is it happening): insight into the characteristics of the business process, inductive reasoning:
- Why are some sales channels outperforming others?
- Why is arrears bucket 2 increasing?
- What’s driving self-service adoption?
What happened, why it happened, or why it continues to happen is converted into actionable insight by specialised analysts using software that requires data wrangling and analysis skills. These insights are likely consumed from raw vault and some business vault, as both provide historical context for the business object or unit of work.
Advanced analytics (what could happen): advanced analytical techniques and statistics used in deductive reasoning:
- What’s the probability of a customer going into arrears?
- What are key segments of our customer base driving product adoption?
- What’s the probability of the customer refinancing?
Scenarios and historical trends are tested, and some of this intelligence is sought to improve business process automation: what could we do better? This is consumed from raw vault, with the auditable rules persisted into business vault as derived business rule outcomes; the business rules here will evolve and be versioned as business processes are improved.
Prescriptive analytics (what should happen): automation, bots and artificial intelligence:
- When the probability of a default becomes X% follow up with a call…
- When customers belong to segment ‘X’, execute ‘Y’ marketing strategy
- Campaign to segment ‘B’ customers via email when life stage is ‘2’
Artificial intelligence relies on information architecture; information architecture is the design and deployment of trustworthy data products in a repeatable, (re-)usable, feasible and reliable fashion within the constraints of the business. To support an evolving business you need a data model that evolves with you and represents all the information criteria highlighted above. Real-time and batch/file-oriented support can be hosted in raw and business vault.
“There is no AI without IA” — Seth Earley
Data Modelling, Mapping and Frameworks
Are data mapping and data modelling opposing approaches to turning data into information? It depends on the context. Let’s take traditional 3rd normal form data warehouse modelling, otherwise known as the Inmon data model. The Inmon model does an excellent job of representing industry models by constraint; by that we mean that by understanding the 3rd normal form data model in its context you can begin to infer the business rules being represented by that data model. These rules are enforced through referential integrity and primary key constraints physically applied to the relational tables. Attempt to load any new data into this data model that does not fit these constraints and your load attempt will fail.
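A hedged illustration of modelling “by constraint” (hypothetical tables, ANSI-style DDL; note that some cloud platforms declare but do not physically enforce such constraints): the business rules live in the schema, and a load that violates them is rejected.

```sql
-- Hypothetical 3NF fragment: business rules enforced physically in the model
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,           -- each customer must be unique
    customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE account (
    account_id    INTEGER PRIMARY KEY,
    customer_id   INTEGER NOT NULL
        REFERENCES customer (customer_id),       -- an account must belong to a known customer
    open_date     DATE NOT NULL
);

-- This load fails if customer 42 has not been loaded yet:
-- the referential-integrity constraint rejects data that does not fit the model.
INSERT INTO account (account_id, customer_id, open_date)
VALUES (1001, 42, DATE '2022-01-01');
```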
Even when a 3rd normal form data model is applied to data warehousing, it must remain adaptive to change when a new data source is introduced. The reasons for, and implications of, adding new data sources could be:
- New source system that complements an existing source system
- New source system that will eventually supersede an existing source system
- New source system that replaces an existing source system
- Migration effort, context mapping, regression testing changes etc.
New data must be rationalised, cleansed and conformed into the Inmon data model. The problem with the Inmon data model when it comes to change is this: why do we need to enforce these constraints in the target data warehouse model when the source systems already enforce them in their automated business processes? By extension, a 3rd normal form industry data model may not be as adaptive to different jurisdictions as one would hope.
For the last three to four decades the approach to modelling data for analytics has been a debate over whether to use Kimball or Inmon data models. But what makes Kimball data modelling popular, and how does it differ from Inmon data modelling?
Kimball dimensional modelling focuses on the analytical depth of analysing dimensions of data (represented by business objects) around common metrics or facts. By common I mean that these metrics are applicable to the participating dimensions of the unit of work or transaction at a point in time. Where Kimball differs greatly from Inmon is that these facts and dimensions are built around the known questions the business is asking, rather than around the 3rd normal form representation of the business process. The question is known, and the answer can be rolled up, sliced and diced across hierarchies to present a dynamic view of the business question.
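A hedged sketch of “the question is known” (hypothetical star schema names): the fact table is rolled up across dimension hierarchies to answer a predefined business question such as sales by channel and month.

```sql
-- Hypothetical star schema: roll the sales fact up by channel and calendar month
SELECT d.channel_name,
       c.calendar_month,
       SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_channel  d ON d.channel_key  = f.channel_key
JOIN dim_calendar c ON c.calendar_key = f.calendar_key
GROUP BY d.channel_name, c.calendar_month
ORDER BY c.calendar_month, d.channel_name;
```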
There are documented Inmon and Kimball industry data models (of course), but if you interpret these industry models for what they really are, you could arguably state that the Kimball industry models are based on the questions everyone in the industry is asking, while the Inmon data models are built as universal industry data models for finance, health, transport, government and so on. How do you then differentiate your dimensional modelling from the standard industry questions? Why would you want to be asking the same questions everyone else is asking? How do you add innovation to this process?
Kimball-style (dimensional) data modelling suffers the same difficulties when faced with the prospect of adding or augmenting source systems and attempting to add dimensions or facts to existing dimensional models. You are not able to destroy the data and start over; you must incorporate the proposed changes while still providing the same analytical value as before, not to mention the same performance existing users enjoyed. And what about downtime to make these changes? Migration effort? Regression testing? What if you transformed the data on the way into the dimensional model? How do you retrieve the original state of the raw data that supported that data model in the first place? … Oh my! Kimball data modelling has not adopted business process constraints, but it still suffers the downside of having to adopt changes… cheaply. Inmon and Kimball data models can live in the same data architecture, with the former forming the historised business process data foundation and the latter providing the dimensional context for that data; should a change be required, you would then have to double the effort to migrate from the old state of the business processes to the new.
Data Vault is data mapping: yes, the data is modelled, but it is essentially mapped to the three business object components any business wishes to track as information concepts (see the sketch after this list):
- the definition of a business object and its immutable business key, mapped to hub tables
- its relationships, units of work and transactions with other business objects, mapped to link tables
- the information concept states of the above two concepts, mapped to satellite tables
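A minimal sketch of those three repeatable table types (hypothetical table and column names in Snowflake-style DDL; exact column conventions vary by implementation):

```sql
-- Hub: the business object and its immutable business key
CREATE TABLE hub_customer (
    hub_customer_hashkey  CHAR(32)     NOT NULL,   -- hash of the business key
    customer_id           VARCHAR(50)  NOT NULL,   -- the business key itself
    load_datetime         TIMESTAMP    NOT NULL,
    record_source         VARCHAR(100) NOT NULL
);

-- Link: unit of work / relationship between business objects
CREATE TABLE link_customer_account (
    link_customer_account_hashkey CHAR(32)     NOT NULL,
    hub_customer_hashkey          CHAR(32)     NOT NULL,
    hub_account_hashkey           CHAR(32)     NOT NULL,
    load_datetime                 TIMESTAMP    NOT NULL,
    record_source                 VARCHAR(100) NOT NULL
);

-- Satellite: descriptive state of a hub (or link) over time
CREATE TABLE sat_customer_details (
    hub_customer_hashkey  CHAR(32)     NOT NULL,
    load_datetime         TIMESTAMP    NOT NULL,
    hashdiff              CHAR(32)     NOT NULL,   -- change detection between loads
    customer_name         VARCHAR(100),
    customer_segment      VARCHAR(20),
    record_source         VARCHAR(100) NOT NULL
);
```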
Because the data is mapped to three repeatable table types, data vault is better suited to agility and automation than Inmon and Kimball data models, but it does not offer the dimensionality that Kimball data models do. Instead, the objective of data vault is to map business processes into a flexible and scalable data representation (a map) and therefore easily allow non-destructive change to the data model, without downtime and without regression testing because of those changes. Getting data out of the data vault is supported by building disposable information marts, and these can use traditional Kimball data models, except that now any change needed in dimensionality or metrics can be dropped and rebuilt without affecting the corporate audit history. Data vault calls this layer the information mart layer because it focuses on information rather than data, and it can be deployed in any model that meets the insight delivery needs. Information marts can act as the anti-corruption layer (to reuse DDD parlance) and can be deployed within a single domain or across domains.
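A hedged example of such a disposable information mart (hypothetical names, Snowflake-style QUALIFY): a current-state dimension expressed as a SQL view directly over the hub and its satellite, which can be dropped and recreated at will without touching the underlying vault.

```sql
-- Disposable information mart object: current-state customer dimension as a view
CREATE OR REPLACE VIEW im_dim_customer AS
SELECT h.customer_id,
       s.customer_name,
       s.customer_segment,
       s.load_datetime AS effective_from
FROM hub_customer h
JOIN sat_customer_details s
  ON s.hub_customer_hashkey = h.hub_customer_hashkey
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY s.hub_customer_hashkey
          ORDER BY s.load_datetime DESC) = 1;      -- latest state per customer
```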
What is the benefit of this approach?
- a data vault never needs to be refactored, it grows as the organisation grows.
- data vault provides the corporate history, dimensional models can be dropped, recreated and deployed without risking any loss of data.
- business vault extends the raw vault structure with derived content, keeping the derived content separate from the raw content; it fills in business process gaps in the source, or houses derived content (written in any language), using the same auditable loading patterns as raw vault.
- while business vault and the information mart layers look to answer the known knowns, raw vault can be used to discover the unknown unknowns of the data, and therefore encourages even more extensions to the corporate data vault model.
Let’s elaborate on this last point.
By providing the auditable history of what happened to a business object, the mapped data provides a platform for building new insights based on historical context. And because the satellites are raw, they can support feature engineering for both offline and online feature stores if the technology supports it. Should a feature needed for data science be superseded, it can be versioned off in the same raw data vault without any loss of the auditable data source. By applying the same loading pattern as raw vault, business vault can support the same ability to version its labels based on changed algorithms and/or an update to the feature value itself. Raw and business vault satellites are child tables of a parent hub (business object) or link table (unit of work). This is how data vault can support data science, see bit.ly/3BhWE3c
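A hedged sketch of that repeatable loading pattern (hypothetical staging and satellite names): the load is insert-only and driven by a hashdiff, so a changed feature or label value simply becomes a new version rather than overwriting history.

```sql
-- Insert-only delta load: only stage records whose state differs from the
-- current satellite record are added as new versions; nothing is ever updated
INSERT INTO sat_customer_features (
    hub_customer_hashkey, load_datetime, hashdiff,
    churn_score, lifetime_value, record_source)
SELECT stg.hub_customer_hashkey,
       CURRENT_TIMESTAMP,
       stg.hashdiff,
       stg.churn_score,
       stg.lifetime_value,
       stg.record_source
FROM stg_customer_features stg
LEFT JOIN (
    SELECT hub_customer_hashkey, hashdiff
    FROM sat_customer_features
    QUALIFY ROW_NUMBER() OVER (
              PARTITION BY hub_customer_hashkey
              ORDER BY load_datetime DESC) = 1     -- current record per customer
) cur
  ON cur.hub_customer_hashkey = stg.hub_customer_hashkey
WHERE cur.hub_customer_hashkey IS NULL             -- brand new business key
   OR cur.hashdiff <> stg.hashdiff;                -- state has changed
```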
“All models are wrong but some are useful.” — George Box
Built around an aggregate of aggregates is a layered data architecture framework; data vault (as we have shown in a previous blog) can be worked on and developed in parallel across squads without introducing contention in data product development in any of the domains; observe…
This reference architecture covers how we have been doing data analytics for decades; despite the technology changes, it is still based on the same core phases: land the data, transform it in some meaningful way, and produce meaningful analytics for the business to consume as information. The above proposes that a data vault sits in the middle. Why? Its design is auditable, agile and easy to automate while still representing the business architecture.
Let’s explore each component…
Producer Domain
From slowest to fastest
Landed sources
- Operational source system push/pull files in a batch window (traditional approach)
- Operational source system log scraping or data replication, pushed as landed files
Real time sources
- Application publishing to message queues/logs (dumb pipes) and subscribers latch onto those published messages
- Webhooks and/or APIs, (smart endpoints) requests and pulls
- Data sharing (pioneered by Snowflake), where data is available immediately
The data is made available either as structured (column definitions defined upfront, schema-on-write), semi-structured (column definitions applied when you need the data, schema-on-read) or unstructured (blob files such as PDFs, video and images that are used as data sources for machine learning algorithms).
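A hedged example of the schema-on-read side (assuming a Snowflake-style VARIANT column and hypothetical names): the payload lands as-is and the column definitions are applied only at query time.

```sql
-- Semi-structured payload landed as-is; structure is applied when queried
SELECT payload:customer.id::VARCHAR        AS customer_id,
       payload:event_type::VARCHAR         AS event_type,
       payload:event_timestamp::TIMESTAMP  AS event_timestamp
FROM landed_customer_events
WHERE payload:event_type::VARCHAR = 'ACCOUNT_OPENED';
```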
Aggregate Domain
Aggregate of aggregates is an apt description for data vault as the integration point of the data platform; hub tables are the “shared kernel” between:
- producer and consumer domains and their respective squads;
- raw and business vault; there are no business vault hub tables unless you have created the business key in the data platform and it did not come from a source system;
- source-aligned producer domains; after all, source systems are nothing more than business process automation tools, and if those tools could reuse the same business keys to represent the same business entities across their domains then we would have no need for business key collision codes (BKCC) to map those sources into the data vault.
This last point we described in the first blog of this series.
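A hedged sketch of the hub as the shared kernel (hypothetical names, and assuming the hub carries a BKCC column): two producer domains load the same hub table, only new business keys are inserted, and the BKCC differs only where sources could use the same key value for different entities.

```sql
-- Both producer domains run the same pattern against the same hub table;
-- only business keys not already present are inserted
INSERT INTO hub_customer (
    hub_customer_hashkey, customer_id, bkcc, load_datetime, record_source)
SELECT stg.hub_customer_hashkey,
       stg.customer_id,
       stg.bkcc,                        -- e.g. 'CRM' vs 'BILLING' only when keys can collide
       CURRENT_TIMESTAMP,
       stg.record_source
FROM stg_crm_customers stg              -- the other domain runs the same statement
WHERE NOT EXISTS (                      --   against its own staging table
    SELECT 1
    FROM hub_customer h
    WHERE h.hub_customer_hashkey = stg.hub_customer_hashkey
);
```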
The raw data vault transcends the data lake and the data warehouse because it models the data in its raw format, with no changes except mapping the business objects and the units of work between those business objects. Data in this format is used as a base for derived content and for data science at the same time; should data science models and business rules prove useful, why not automate them in a business vault to be shared by multiple consumer domains where appropriate? Reusable derived business rules are shared between consumer domains, but some business rules should remain private and should then be deployed as private business vaults in each consumer domain’s respective data workspace.
Now, data products can be either single domain or cross domain; the way to bring the needed artefacts together is by picking the aggregate data products and expressing them as SQL views (by default), otherwise called information marts. This is where Kimball-style information marts are utilised in data vault, but in this framework they are disposable. Because point-in-time (PIT) and bridge tables do not offer the same auditability, agility and automation as the hub, link and satellite tables do, they are considered information mart query-assistance tables and not business vault artefacts at all; PITs and bridges are disposable too.
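A hedged sketch of a PIT as a query-assistance table (hypothetical names): for each business key and snapshot date it pre-resolves which satellite records were current, so information mart views can use simple equi-joins; like the marts themselves, it can be dropped and rebuilt at any time.

```sql
-- Disposable PIT table: per customer and snapshot date, the applicable satellite load dates
CREATE TABLE pit_customer AS
SELECT h.hub_customer_hashkey,
       d.snapshot_date,
       MAX(sd.load_datetime) AS sat_customer_details_ldts,
       MAX(sf.load_datetime) AS sat_customer_features_ldts
FROM hub_customer h
CROSS JOIN dim_snapshot_dates d                     -- illustrative calendar of snapshot dates
LEFT JOIN sat_customer_details sd
  ON sd.hub_customer_hashkey = h.hub_customer_hashkey
 AND sd.load_datetime <= d.snapshot_date
LEFT JOIN sat_customer_features sf
  ON sf.hub_customer_hashkey = h.hub_customer_hashkey
 AND sf.load_datetime <= d.snapshot_date
GROUP BY h.hub_customer_hashkey, d.snapshot_date;
```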
Consumer Domain
Consumers can come in many forms (from scripted output to no-code headless BI) and, as we described at the beginning of this blog, they range from operational reporting all the way to prescriptive AI bots. Should the outcome of these interactions lead to more analytics of value, it should be reintegrated back into the analytics platform as:
- raw source to be ELT’ed into landing (or reverse-ETL), or
- augmented into a business vault
Consumers will also expect guarantees about that data, and certainly about how the information was derived. Along with maturing data models, frameworks have also matured to provide guidance on how to offer those guarantees; it is up to the data governance tools to support them.
Data Management Frameworks
Now that we have divided the domains vertically into zones we can horizontally divide domains by domain ownership which architecturally starts to resemble a data mesh; this was the focus of the third blog in this series.
Data warehouse, data lake, data lakehouse… all need auditable trust… frameworks suggest the patterns needed for building this trust. This is applicable regardless of the data modelling paradigm used.
DMBoK
Data Management Body of Knowledge (DMBoK): a broad theoretical and practical framework published by DAMA for managing all aspects of the data with a particular focus on data governance.
DCAM
Data Management Capability Assessment Model (DCAM): Developed by the Enterprise Data Management (EDM) Council, DCAM is a structured resource that defines and describes data management capabilities with practical guidance.
Information technology is the automation of business processes and rules. It provides the technical agility behind the business agility needed to scale your business, which is why enterprise architecture frameworks often treat business architecture as the first architecture domain, forming the foundation for the data, application and technical architectures. If a business initiative is not tied to a business objective, then how do we justify the funding for that initiative? The most well-known enterprise architecture frameworks are (in no particular order):
- The Open Group Architecture Framework (TOGAF), now in its 10th edition, with Business Architecture leveraged from the Business Architecture Guild.
- Zachman Framework, a multi-view framework for enterprise architecture perspectives
- Department of Defense Architecture Framework (DoDAF).
To round off this blog in the series, a little bit of a different take on the famous Clive Humby quote…
Data is not the new oil, time is
With regards to the axiom: “Data is the new oil” — Clive Humby
I’ll leave you with a few musings based on a speech given by Harlan Cleveland in 2001, bit.ly/3AU4roA (I have substituted the word “data” for “information”), as a counter-argument to how we think about data:
- Data expands as it’s used — unlike physical resources. The only way even physical resources can be made expandable is by using data — like using fertilisers to increase crop yields.
- Data is less hungry for other resources than physical resources used to be. The more advanced the technology, the less energy and raw materials seem to be required.
- Data can, and increasingly does, replace land, labour, and capital.
- Data is easily transportable — at almost the speed of light, a physicist will tell you — and by telepathy and prayer, much faster than that.
- Data is transparent — it has an inherent tendency to leak. The more it leaks, the more we have, and the more of us have it.
- Data is shared, not “exchanged.” The sale or gift of data is not an exchange transaction, because both parties still have it after they have shared it. (This means data cannot be owned; “intellectual property” is an oxymoron.) Data ownership through NFTs is another topic; a digital image bought and paid for can easily be copied and downloaded without owning that original image.
- Data has an infinite and expanding number of use cases and an infinite number of forms.
- Data requires consent to use it. If data is out in the open it is unrecoverable.
…and about time, bit.ly/3RdbSg6.
- Time is scarce: the older you are, the less of it you have and the more precious it becomes.
- Time as a resource is you; time requires more resources in order to expand, whether through automation (using data) or through hiring.
- Time does not create more time, once time is spent it is unrecoverable.
- Time only has one form, linear (until proven otherwise).
- Time cannot be exchanged unless you’re willing to pay for it, which places a dollar value on your time and itself requires time to earn.
What do you think? It is true that the world’s focus in this industrial revolution is data but energy powers everything, including data. Is data really the new oil?
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.