2. Data Vault as a Product
“Not all data is created equal” — Unknown
Series catalogue:
- Data Vault and Domain Driven Design
- Data Vault as a Product
- Data Vault and Domain Oriented Architecture
- Data Vault, Semantics and Ontologies
- Data Vault and Analytics Maturity
A data product is defined as:
“…an autonomous, read-optimized, standardized data unit containing at least one dataset (domain dataset), created for satisfying user needs” — Jacek Majchrzak, Sven Balnojan, and Marian Siwiak
A data vault has a number of specialised data product patterns designed to support (as we discussed in the previous post) the three core dimensions an organization values about its business objects (a minimal sketch follows the list):
- definition of an immutable business key, stored in hub tables (HUB),
- relationships and transactions, stored in link tables (LNK), and
- descriptive states of business objects or relationships, stored in satellite tables (SAT), all while supporting the enterprise audit history.
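To make these structures concrete, here is a minimal sketch in Python of the three core table shapes; the column names and the MD5-based key derivation are illustrative conventions, not a mandated standard:

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic surrogate hash key from one or more business keys."""
    normalised = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class HubRecord:            # HUB: one row per immutable business key
    hub_hash_key: str
    business_key: str
    load_date: datetime
    record_source: str

@dataclass(frozen=True)
class LinkRecord:           # LNK: one row per unique relationship between hubs
    link_hash_key: str
    hub_hash_keys: tuple[str, ...]
    load_date: datetime
    record_source: str

@dataclass(frozen=True)
class SatelliteRecord:      # SAT: descriptive state, change-tracked over time
    parent_hash_key: str    # points to a hub or a link
    load_date: datetime
    record_source: str
    hash_diff: str          # hash of the descriptive columns, for change detection
    payload: dict

# Example: registering a customer business object in the hub
customer = HubRecord(
    hub_hash_key=hash_key("CUST-001"),
    business_key="CUST-001",
    load_date=datetime.now(timezone.utc),
    record_source="crm",
)
```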
Each domain of a business is mapped to a portion of the enterprise data vault model that forms the value chain of the organization, and each portion will have one or more data products. This mapping theme is also tied to the core business architecture domains:
- capabilities — what the business does based on business objects
- value streams — how the business delivers value to and delights their customers
- organization — the business units that own the value stream stage and capability, and
- information/business glossary — the business vocabulary representing the above as information concepts
Business architecture provides extended domain mappings that include product mapping, and defines a product as:
“A good, idea, method, information, object, or service that is the end result of a process and serves as a need or want satisfier. It is usually a bundle of tangible and intangible attributes (benefits, features, functions, uses) that a seller offers to a buyer for purchase”
When we refer to a data product (a definition articulated below), we can substitute the words “seller” and “buyer” with “producer” and “consumer” in the definition above.
Product mapping fits into the extended domain mappings of business architecture as follows:
- A product can be similar to another product.
- A product belongs to a product line and/or a product family within a product inventory.
- A product will include one or more features.
- An initiative (execution of strategy) creates, modifies or sunsets a product, product line or family.
- A product is owned by a business unit and relies on business capabilities.
- A product is used in a value stream stage, which contributes to the value proposition.
- A product may include various product entitlements enabled by business capabilities.
As business architecture maps an organization’s strategic vision to its product’s value proposition, is there a proven repeatable pattern to product innovation success? And can we apply that thinking to data products?
Note that your data products may be internal or external (customer facing); regardless, the same product thinking can be applied to either data product type.
Data as a Product (data quantum)
Data products are feasible, valuable and usable, and support external customers, internal customers, or both. A data product has a definition, a configuration, and defined inputs, outputs and use cases; it represents a problem-solution fit that can be shared with other domains or discovered with the appropriate data governance applied.
Data mesh adopts product thinking in its definition of data as a product as:
“Data as a product principle is designed to address the data quality and age-old data silos problem; or as Gartner calls it dark data — the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes. Data as a product is about applying product thinking to how data is modeled and shared”. — Zhamak Dehghani, Data Mesh
We adopt product thinking for data as a product by defining the problem-solution fit using cognitive tools like:
- 5W1H (who, what, when, where, why and how)
- The Mom Test to avoid bias, momtestbook.com
- Jobs-to-be-done (JTBD) framework, bit.ly/3QDvA4a
And we define the desired outcomes using cognitive tools like:
- Developing a concise hypothesis statement — value proposition (value stream outcome)
- Getting creative by transforming, minimizing, maximizing, modifying/rearranging or expanding on an existing solution
- Focusing on the outcome/impact rather than the output
“Making products for your customers is far more efficient than finding customers for your products” — Seth Godin
And who are our customers? We define our data product user personas under product management (see bit.ly/3PmQfbJ) as:
- consumers — business decision maker, management/executive using dashboards and reports; explores via no-code
- explorers — insights generator, assesses data as raw or curated, business analyst; explores via code, search
- producers — automating data production processes, engineer, integrator of data sources
- modellers — building data models and algorithms; categorised as a data vault modeller, or a data scientist if building AI models
- advisors — compliance and security
We can then systematically apply the above to how we configure the data product by leveraging a reusable template: a data product design card (a sketch of such a card follows the metrics profile below).
Start configuring the card by defining data product classification as:
- Raw data — raw data modelled into a raw vault, where only hard rules have been applied. This pre-modelled data is made available for data scientists to explore and discover new viable data products in a data lab. Successful experiments can have their outcomes persisted as business vault artefacts.
- Derived data — a mix of raw and business vault that represents auditable, idempotent business rules developed within the data platform.
- Algorithms — based on data vault (audit history) the algorithm (machine learnt) outcomes can be persisted back into a shared business vault or persisted as private business vault entities.
- Decision support — such as expert systems to augment or support a decision process.
- Automated decision making — prescriptive “hands-free” data product.
Define how to access the data product by defining input/output ports:
- API (smart endpoints), required to be documented, discoverable, addressable, understandable, trustworthy and intuitive for an authorised consumer to work with and use data in their own domain. See bit.ly/3oKf1aU
- Visualizations and dashboards (insights), purpose-focussed insights designed to augment the consumer’s own workflow/routine. See bit.ly/3vtfVwd
- Web elements (enhancing): augmenting applications and other consumer interfaces where data products serve a particular function like recommendation engines on a search solution.
- Events (dumb pipes): streaming-first replayable topics, curated and ingested; other business event packages are supported as batch and micro-batch.
Profile the data product metrics; they should:
- be actionable and goal-oriented, defining what success is
- have a common interpretation, using ubiquitous language
- be accessible, based on credible data with data provenance
- be a transparent, simple calculation, i.e. incorporate concision
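Pulling the classification, port and metric profiles together, a design card could be captured as a typed record like the sketch below; the field names and enumeration values are assumptions for illustration, not a fixed schema:

```python
from dataclasses import dataclass
from enum import Enum

class Classification(Enum):
    RAW_DATA = "raw data"
    DERIVED_DATA = "derived data"
    ALGORITHM = "algorithm"
    DECISION_SUPPORT = "decision support"
    AUTOMATED_DECISION_MAKING = "automated decision making"

class OutputPort(Enum):
    API = "api (smart endpoint)"
    VISUALIZATION = "visualization/dashboard"
    WEB_ELEMENT = "web element"
    EVENT_STREAM = "event stream (dumb pipe)"

@dataclass
class DataProductDesignCard:
    name: str
    owner: str                      # owning business unit / domain
    classification: Classification
    input_ports: list[str]          # source-aligned domains feeding the product
    output_ports: list[OutputPort]
    metrics: dict[str, str]         # metric name -> plain-language definition
    personas: list[str]             # consumers, explorers, producers, modellers, advisors

card = DataProductDesignCard(
    name="customer_360",
    owner="customer domain",
    classification=Classification.DERIVED_DATA,
    input_ports=["crm", "billing"],
    output_ports=[OutputPort.API, OutputPort.VISUALIZATION],
    metrics={"freshness": "hours since last successful load"},
    personas=["consumers", "explorers"],
)
```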
Profile data product architecture and infrastructure needs:
- How often is the data used, how many users of the data product will there be?
Concurrency and responsiveness, measured as the number of queries per second (QPS) and the time taken to return query results.
- How complete and accurate does data need to be?
Results may need to be 100% accurate and correct, or approximate within a margin of error that still supports trend analysis. A more real-time, approximate data product can later be superseded by batch-oriented data products.
- How often is the data updated, how fresh does the data need to be?
The data product may be bounded, unbounded or a mix of the two; an example technique is lambda views, which augment historical data with real-time data. What is the data retention requirement?
- What is the expected volume of data?
Batch, micro-batch or streaming has implications for data modelling and architecture. Do we need to consider data architecture and modelling techniques to augment the data product?
- Is there confidential or personally identifiable information in the data?
The defining characteristic of personally identifiable data is that it hardly (if ever) changes. Within a data vault modelling context we isolate these identifying attributes into a separate satellite table, because there will likely only ever be one identifying record loaded for a business object, with no changes applied to those identifying attributes going forward. This is a possible, effective strategy for managing the moment Article 17 of the GDPR (the right to erasure) is triggered; a sketch of the split follows.
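A minimal sketch of that satellite split, assuming a simple routing step during staging (the attribute names are hypothetical): identifying attributes land in a PII satellite that can be purged or masked on an Article 17 request, while the remaining state keeps its full change history.

```python
# Illustrative split of a staged record into a PII satellite and a regular satellite.
PII_ATTRIBUTES = {"full_name", "date_of_birth", "national_id"}  # assumed attribute set

def split_satellite_payload(staged_record: dict) -> tuple[dict, dict]:
    """Route identifying attributes to the PII satellite, the rest to the standard satellite."""
    pii_payload = {k: v for k, v in staged_record.items() if k in PII_ATTRIBUTES}
    state_payload = {k: v for k, v in staged_record.items() if k not in PII_ATTRIBUTES}
    return pii_payload, state_payload

pii, state = split_satellite_payload(
    {"full_name": "Jane Doe", "date_of_birth": "1980-01-01", "loyalty_tier": "gold"}
)
# pii   -> loaded once into sat_customer_pii (rarely, if ever, changes)
# state -> loaded into sat_customer_details with normal delta detection
```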
Next define the steel thread for the data vault…
Data Vault Products (Steel Thread)
A repeatable pattern to build, test and deploy more and more data products through the data vault. As we saw under DDD, a data vault is the opportunity to passively integrate your data sources by hub tables, the table structure used to house the immutable business keys of business objects. Measure, measure, measure; to learn more visit bit.ly/3Anc1bK
Details on the steel thread and its data product sidecar:
- Source-aligned domain data is modelled into raw vault as hub, link and satellite tables (data vault aggregates). Any gaps in source data should be solved at the source, otherwise modelled in business vault as an extension of raw vault; see bit.ly/3BUt81s
- Cross-domain or single-domain data products that have been modelled in a data vault are then available to form aggregates as information marts.
- Discoverable: everything from bounded contexts and hub registration through to the data products themselves is registered in a data catalogue, with the appropriate business glossary updated. Is there sensitive data being exposed? Who has access to it? Who created it? What is the retention period? Do we need to utilise row access policies, or column access policies like dynamic masking or tokenization? Have we captured the data provenance of the data product? This is also known as descriptive, structural and administrative metadata.
- Observability: metrics are baked into every step of the steel thread, including data provenance, freshness and logs.
- Shape of data: statistics covering everything from volume and rate of change to the expected distribution and range, expressed as fitness functions (see the sketch after this list).
- Code, data and infrastructure dependencies, as well as usage rules and policies, defined at the domain and product levels.
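As a hedged example of such a fitness function, a gate like the following could validate the shape of a delivered increment before it is published; the tolerances and expected values here are assumptions a real pipeline would derive from observed history:

```python
def volume_fitness(row_count: int, expected: int, tolerance: float = 0.25) -> bool:
    """Pass when the delivered row count is within a tolerance band of the expected volume."""
    lower, upper = expected * (1 - tolerance), expected * (1 + tolerance)
    return lower <= row_count <= upper

def range_fitness(values: list[float], minimum: float, maximum: float) -> bool:
    """Pass when every observed value falls inside the expected domain range."""
    return all(minimum <= v <= maximum for v in values)

# Example gate before publishing the data product increment
checks = {
    "volume": volume_fitness(row_count=9_800, expected=10_000),
    "amount_range": range_fitness([19.99, 240.00], minimum=0.0, maximum=100_000.0),
}
assert all(checks.values()), f"fitness failures: {[k for k, ok in checks.items() if not ok]}"
```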
We know the three standard data vault table types we use as data product patterns, but not all data will fit these main table types; for that reason, data vault provides guidance on variations of them (a load-pattern sketch follows the list below):
- non-historised satellite table (NSAT) — used when the data producer sends data continuously at row level; checking whether a record is a true change is redundant, because each new record is a true change by definition, and the data product’s value rapidly decays unless acted on in real time. An alternative supported structure is the non-historised link. This load pattern is not intended for file-based (batch) workloads.
- multi-active satellite (MSAT) — used when the data must be managed as a set (declaratively) rather than by record changes (imperatively). Any change to any record in the set, or to the number of records in the set, triggers a true change for that set.
- satellite with a dependent-child key — used when finer-grained change tracking is needed under the parent key (hub or link table); changes are tracked by the parent key plus the dependent-child key(s). This pattern is also useful for loading intraday changes, by defining the business event date as a dependent-child key. Alternatively, the dependent-child key can be placed in the link table itself.
All three of the above extended satellite patterns can be used in both raw and business vaults.
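The difference between the standard and non-historised load patterns can be sketched as below (an in-memory simplification, assuming hash_diff is precomputed in staging): the standard satellite load compares against the latest stored record, whereas the NSAT load inserts unconditionally.

```python
def load_satellite(target: list[dict], new_record: dict) -> None:
    """Standard SAT load: insert only when the record is a true change."""
    latest = max(
        (r for r in target if r["parent_hash_key"] == new_record["parent_hash_key"]),
        key=lambda r: r["load_date"],
        default=None,
    )
    if latest is None or latest["hash_diff"] != new_record["hash_diff"]:
        target.append(new_record)

def load_non_historised_satellite(target: list[dict], new_record: dict) -> None:
    """NSAT load: every inbound record is a true change by definition; no lookup needed."""
    target.append(new_record)
```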
The same patterns we have available for loading satellite tables as data product constituents can be reused to provide additional intelligence about the content we receive from the source-aligned domains. We call these satellite table patterns “peripheral” because they do not describe the business process directly; rather, they provide peripheral intelligence about those business processes and their participating business objects.
- record tracking satellite table (RTS) — records each occurrence of the parent entity, either a business object (if the record tracking satellite is based on a hub table) or a relationship/transaction (if based on a link table), and is used as an alternative to a last-seen date for that entity.
- status tracking satellite table (STS) — used to infer the insert, update and deletion of records from full snapshots, based on either the hub or link table (see the sketch after this list). Adopting this pattern requires a second level of staging, because a record must be generated when the parent entity disappears from the snapshot (i.e. status=’D’).
- effectivity satellite table (EFS) — based on a driver key, this structure tracks the changes between participating business objects in a relationship when such a relationship change is not provided by the data provider. It is the only satellite table that includes a physicalised end-date column, without actually performing an update statement on the table itself. Adopting this pattern requires a second level of staging, because records must be generated to close the previous relationship for that driver key. Without this pattern, and without a business date from the data provider, there is no other way in the data vault to track a flip-flopping relationship by driver key.
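A sketch of the status tracking derivation mentioned above: comparing the keys in today’s full snapshot against the keys already tracked yields the inserts and, via their disappearance, the deletions (updates come from an ordinary hash-diff comparison and are omitted here):

```python
def derive_statuses(snapshot_keys: set[str], tracked_keys: set[str]) -> dict[str, str]:
    """Infer status records from a full key snapshot: 'I' for new keys, 'D' for vanished keys."""
    statuses = {key: "I" for key in snapshot_keys - tracked_keys}
    statuses.update({key: "D" for key in tracked_keys - snapshot_keys})
    return statuses

# Yesterday we tracked customers A and B; today's snapshot only contains B and C.
print(derive_statuses(snapshot_keys={"B", "C"}, tracked_keys={"A", "B"}))
# {'C': 'I', 'A': 'D'}  -- the generated 'D' record is why a second staging level is needed
```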
Data vault decomposes your data into agile, automated and auditable table structures; the following data product patterns unify your domain-specific or cross-domain data vault artefacts into single-domain or cross-domain data products.
These are disposable data product patterns; no audit history is required (a PIT sketch follows the list).
- point-in-time (PIT) table — used to optimize join performance for information marts over raw and business vault artefacts. Deployable in managed PIT windows and/or logarithmic PIT structures
- bridge table — simplifying and optimizing join performance over a large data vault model
- information mart — provides the harmonised view of information over a data vault and the localised semantic grain needed for the consuming domain(s)
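As a simplified illustration of the PIT logic (real implementations are set-based SQL over the warehouse; this in-memory sketch only shows the idea): for each hub key and snapshot date, the PIT row pins the load date of the latest satellite record at or before that snapshot, so the information mart can equi-join instead of scanning history.

```python
from datetime import date

def build_pit(hub_keys: list[str], snapshot_dates: list[date],
              satellite_rows: list[dict]) -> list[dict]:
    """For each (hub key, snapshot date), pin the latest satellite load_date at or before it."""
    pit = []
    for key in hub_keys:
        loads = sorted(r["load_date"] for r in satellite_rows
                       if r["parent_hash_key"] == key)
        for snap in snapshot_dates:
            effective = max((d for d in loads if d <= snap), default=None)
            pit.append({"hub_hash_key": key, "snapshot_date": snap,
                        "sat_load_date": effective})  # equi-join target for the mart
    return pit
```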
If enabled, the extended record tracking satellite tracks every record loaded by parent hub or link table, and is used to provide timeline corrections when data is loaded in the incorrect sequence. This means that a file-based package arriving out of sequence does not stall loading that data into a data product pattern; late-arriving data packages can be loaded seamlessly when they arrive.
DDD + DP(DV)
Here we have shown that designing a data vault as data products follows a design pattern familiar from the Unix philosophy; to paraphrase for data products, we:
- create data product patterns that do one thing and do it well (avoid overloaded data products);
- create data product patterns that can be assembled and work together;
- create data products to be tried early (retire redundant data products)
Finally, putting the data vault products into context: the domain-oriented architecture platform maps to the domains of business architecture to provide an overview of how the data platform delivers and shares value. Data from source-aligned domains is the modelled outcome of automated business processes. Data vault provides the auditable data model that forms the agile foundation for the business’s data products. Each stage of a business object’s journey through and across your business processes passes through applications, shedding business object state and metrics for the data vault to collect.
“People don’t want to buy a quarter-inch drill, they want a quarter-inch hole” — Theodore Levitt
Next we will investigate data vault and domain-oriented architecture options…
#datavault #datamesh #businessarchitecture #ddd #dataproduct #dataasaproduct
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.