3. Data Vault and Domain Oriented Architecture
“Any organisation that designs a system (defined broadly) will produce a design whose structure is a copy of the organisation’s communication structure.” — Melvin Conway
Series catalogue:
- Data Vault and Domain Driven Design
- Data Vault as a Product
- Data Vault and Domain Oriented Architecture
- Data Vault, Semantics and Ontologies
- Data Vault and Analytics Maturity
Let’s begin from an organisation’s perspective by defining the following terms:
- An organisation is “a social unit of people, systematically structured and managed to meet a need or to pursue collective goals on a continuing basis” — BizBoK Guide
- A business unit is “a logical element or segment of a company representing a specific business function, and a definite place on the organisational chart, under the domain of a manager. Also called department, division, or a functional area” — BizBoK Guide
An organisation’s bounds are not necessarily the bounds of the business: business units perform the capabilities of the business, and non-core capabilities can be outsourced to 3rd parties and partners. Outsourcing is an effective means of offloading non-core capabilities, but an outsourced capability is still part of the business, just not the organisation. A Harvard Business Review article on organigraphs (see bit.ly/3PoLaQq) identifies four types of organisation structures; the first two are traditional and the latter two are more reflective of how organisations actually work.
- Set — every organisation is a set of items such as machines or people that barely connect with each other. This is the classic organisation silo (antipattern).
- Chain — linear and often sequential communication and relationship between business units (and business partners) that work together to deliver on a value proposition. This is often referred to as a value chain.
- Hub — a coordinating centre such as a building, a manager or a core competency, which may describe an extension of a chain. In the previous blog post, we defined “jobs-to-be-done” as a neat framework for defining the indirect jobs needed to install, manage, maintain and retire a product. Activities peripheral to the product/service itself can occur external to the organisation, as the product/service is the organisation’s value proposition.
- Web — lateral communication between nodes (people, teams, technology etc.) that connect in all manner of ways to get the job done. This may accurately describe how the business functions.
Your business organisation is a complex network of participating stakeholders representing interconnected business domains. Now imagine that you had only one data platform to manage all of your data. A common framework for viewing your business is through the lens of enterprise architecture, as depicted below.
Mapping enterprise architecture to business process automation and service-oriented architecture (bit.ly/3o4koB6). In the above diagram we describe:
- Enterprise architecture — supporting multiple views of the enterprise, encapsulating business architecture mapped to the information technology that automates it (e.g. frameworks such as DoDAF, Zachman, TOGAF).
- Business architecture — the business view from a capabilities and value streams perspective, defined by business strategy and transformed through business initiatives that enable the organisation.
- Data architecture — the data view and information map, representing capabilities as information concepts in a taxonomy, the ubiquitous language.
- Application architecture — the business automation and services view, automating business capabilities as software.
- Technical architecture — the platform and enabling technology to integrate and support the enterprise through software and data service platforms.
- Solution architecture — an initiative- or portfolio-based view to design automation into the enterprise architecture using a combination of the above architectures.
These architecture views provide the accuracy needed when upgrading or improving business capabilities, or when embarking on major organisational changes such as digital transformations, divestitures, mergers and acquisitions, joint ventures, shifting to a customer-centric business model, introducing new product lines, enabling globalisation or meeting regulatory compliance.
Data mesh favours a decentralised architecture. What this means for the business is that, as long as the interfaces between data domains are standardised, each domain can be maintained and upgraded in isolation while still providing the expected service delivery and governed, autonomous access to its respective data products.
“There is no AI without IA” — Seth Earley
Data Vault on Data Mesh
There is no single way to configure your data mesh, which is why data mesh comprises principles and not directives; below we will explore the pros and cons of some possible data vault on data mesh configurations. To start, we place the raw and business vault at the centre of an architecture.
Raw vault is source-aligned: the raw vault satellite and link tables are single file/stream based, while the hub table is the integration table between multiple files/streams. Source systems automate business processes and rules; whilst the operational platform may support some historical context, the data vault is designed to support all historical context.
The raw vault hub, link and satellite tables are the evidentiary source-aligned content, and we use those (and other business vault tables) as inputs to soft business rule code/algorithms whose outcomes we store as business vault link and satellite tables. By using the same loading patterns as raw vault, the business vault also becomes the evidentiary record of business rule outcomes, providing the same automation and agility as raw vault.
As depicted in the diagram above, the business vault is sparsely modelled. The hub table is the shared kernel between bounded contexts in a context map (see the first blog) and it is also the shared kernel between raw and business vault. Hub tables store the business key and, unless the business key was created in the analytics database (highly unlikely), they are always raw vault tables. Business vault stores the derived business rule outcomes about those business objects (business key, hub table). The segregated and decoupled nature of raw and business vault allows for limitless scalability within the data vault.
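To make these table roles concrete, here is a minimal sketch of the three table shapes using hypothetical entity and column names (a card account, its descriptive context and its transactions); your actual structures will depend on your modelling standards.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class HubCardAccount:            # hub: the business key, the shared kernel
    hub_card_account_key: str    # hash of the business key
    card_number: str             # the business key itself
    load_ts: datetime
    record_source: str

@dataclass(frozen=True)
class SatCardAccountDetails:     # satellite: historised context from one source/file
    hub_card_account_key: str    # parent key (the hub)
    load_ts: datetime
    record_source: str
    credit_limit: float          # descriptive attributes
    status: str

@dataclass(frozen=True)
class LinkCardTransaction:       # link: a unit of work relating hubs
    link_card_transaction_key: str
    hub_card_account_key: str
    hub_merchant_key: str
    load_ts: datetime
    record_source: str
```

A business vault satellite would carry the same metadata columns (parent key, load timestamp, record source), but its descriptive attributes would be the outcomes of soft business rules rather than source-supplied values.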
Single platform, single data vault
In the above diagram, from the bottom moving upwards we describe:
- Streaming backbone or change data capture — a dumb pipe with publishers and subscribers pulling data up into the SAL; alternatively, data ingestion can be supported using source system snapshots of what is needed for raw vault.
- Source-aligned layer (SAL) — your integration of source-domain data into modelled hub, link and satellite tables. The data model is extended with sparsely modelled business vault link and satellite tables.
- Business-access layer (BAL) — occupied by multiple business units (consumer domains and subdomains), which in turn can extend the data vault in the SAL with additional business vault link and satellite tables of their own. Having a private business vault is optional, but it does provide the same auditability, agility and automation we spoke of earlier. Each business unit also has an independent lab to run experiments, and these data products are owned by their respective domains. Source and consumer domains can work in tandem to deliver operational and analytic data value; a consumer domain may be the owner of its modelled SAL data and therefore the owner of any changes to that raw and business vault context.
Each layer has a “util” area (yellow box) for storing control content relevant to that layer or zone, such as data governance policies, stored procedures (business rules stored in the platform), user-defined functions and more.
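As a rough illustration only, this layered layout could be captured as platform configuration; the layer names follow the diagram, while every object name below is a hypothetical example.

```python
# Hypothetical layout of the single-platform, single data vault configuration.
platform_layers = {
    "ingestion": ["cdc_streams", "file_snapshots"],   # dumb pipes into the SAL
    "SAL": {                                          # source-aligned layer
        "raw_vault": ["hub_*", "lnk_*", "sat_*"],
        "business_vault": ["lnk_bv_*", "sat_bv_*"],   # sparsely modelled
        "util": ["governance_policies", "stored_procs", "udfs"],
    },
    "BAL": {                                          # business-access layer
        "finance":   {"marts": ["vw_*"], "lab": ["experiments"], "util": ["policies"]},
        "marketing": {"marts": ["vw_*"], "lab": ["experiments"], "util": ["policies"]},
    },
}
```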
The benefit of this approach is that it is easy to apply data governance standards to a central enterprise data vault, and there is a single place to identify and manage sensitive data. Management of the data platform is centralised, but each business unit using the platform still functions autonomously. Each business unit can model its domain into the data vault in the SAL too; because the raw vault is source-domain aligned, bounded contexts are present by design. Of course, a domain could choose to hire analytics engineers as a facilitating squad to build these data products on its behalf and provide analytics-as-a-service as opposed to data-as-a-service; this enables data modellers and engineers who are not data vault literate to be supported by the platform.
Multi-platform, single data vault config
Expanding on the previous diagram, the above diagram now shows:
- Data is modelled into a SAL that houses the raw and business vault and nothing else.
- Each business unit/domain has its own infrastructure platform that pulls what it needs from an enterprise data vault via a standardised interface (it doesn’t have to be an API; see the sketch below).
- A functional information mart layer may be included to present the data for consumption.
The benefit of this approach is that you can keep the infrastructure completely separate between domains, and doing so may in fact be a compliance requirement. Data provenance, however, will have to be managed across data platforms with an external tool, whereas the previous configuration contains data provenance entirely within a single platform.
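As a minimal sketch of what such a standardised, transport-agnostic interface contract could look like, the snippet below defines a simple protocol each domain platform might implement when pulling from the enterprise data vault; every name here is hypothetical.

```python
from datetime import datetime
from typing import Iterator, Protocol

class DataVaultInterface(Protocol):
    """Standardised contract between the enterprise data vault and a domain
    platform; it could be backed by database shares, file exports or an API."""

    def list_tables(self, domain: str) -> list[str]:
        """Hub, link and satellite tables published to this domain."""
        ...

    def pull_increment(self, table: str, since_load_ts: datetime) -> Iterator[dict]:
        """All records loaded after the domain's last sync point."""
        ...
```

The point is that the contract (table names, keys, load timestamps) is what is standardised, not the transport; the same protocol could be satisfied by a share, a file drop or a REST endpoint.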
Multi-platform, multiple data vaults config
Each domain has its own modelled data vault; some considerations:
- Passive integration will not be easy
- Applying a common data vault standard may not be easy; however, naming standards can be applied from a central code repository
- Each independent platform does not need to apply a data vault!
- Each squad can apply a data vault how they see fit, even using separate data vault automation tools
Multi-platform, multiple data vaults, multiple orgs
An organisation’s non-core capabilities could be outsourced or serviced by a 3rd party or partner. Because you have followed the data mesh principles for setting standardised interfaces for your data products, that domain may interact with a domain owned and managed by the 3rd party/partner. However, data management and privacy imperatives likely mean that you will only be sharing non-sensitive or obfuscated content between yourselves and your partners in a data mesh. One approach to managing this is a data clean room — a virtual space established between domains within an organisation, between organisations in partnership, or even with customers (see bit.ly/3IeDzSu).
Data governance on Data Vault
From a governance and privacy management perspective, the earlier in the layered architecture we implement data governance (in any of the above configurations), the less likely we will need to replicate the same policies and controls in the different layers of the architecture.
We don’t want to remove control over their data from the customers and business units using your platform, but there will certainly be some (or most) data governance that must be controlled up front. The lower down the layers we apply these controls, the more manageable and cheaper the implementation will be.
In addition to the data governance applied to the domain data captured in the data vault, each business unit can apply its own data product governance to the data products it manages exclusively, with all the additional metadata each business unit needs, by its own definitions.
The data product sidecars (util) are designed to house controls, policies and configurations that are largely based on role-based access control (RBAC); advanced features in masking and row-level access control mean you can also stack controls over what is supplied in the SAL as views. For example, if we build information marts as views over the physical tables, role access can be controlled using either current-role or invoker-role context functions, depending on the measure of access control you require. In addition, and to provide even more scalability, object tags can be associated with each data asset and data governance is then applied based on that object tag framework.
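As a hedged illustration of that tag-based approach, the snippet below composes the kind of Snowflake-style DDL this pattern implies; all object, role and column names are hypothetical, and the exact syntax should be verified against your platform’s documentation.

```python
# Hypothetical Snowflake-style DDL for tag-based masking; names are
# illustrative and syntax should be checked against your platform docs.
tag_based_governance = """
CREATE TAG IF NOT EXISTS util.pii_level;

CREATE MASKING POLICY IF NOT EXISTS util.mask_pii AS (val STRING)
RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('GOVERNANCE_ADMIN') THEN val
       ELSE '*** masked ***' END;

-- Governance is attached once, at the tag level, not per column:
ALTER TAG util.pii_level SET MASKING POLICY util.mask_pii;

-- Tagging a satellite column in the SAL then inherits the policy:
ALTER TABLE sal.sat_card_account_pii
  MODIFY COLUMN social_security_no SET TAG util.pii_level = 'high';
"""
print(tag_based_governance)
```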
Ultimately, data vault inherits the data governance applied to your data, with some best practices around managing sensitive data, such as identifying data (e.g. a social security number) or quasi-identifiers (value objects that, when combined, can be used to re-identify the business object).
Systems must be designed for failure, and operational mistakes will happen; data vault provides the capability to version those events by processing corrections to them while keeping the complete audit history of the data and the operational processes behind that data. Remember, in a data vault we never have to refactor anything! See bit.ly/3iEiHZB.
Replay-ability, audit history, insights
“Accountants don’t use an eraser. Why should programmers?” — Pat Helland.
The above animation illustrates the applied timestamp concept: for a batch/file-based business rule outcome the applied timestamp is the extract timestamp, while for streaming or change data capture ingestion the applied timestamp is the event timestamp. In the above scenario we see:
- Event source domain events are projected to the analytics database (data vault) with an applied lookup to a pricing table. The price lookup is applied on load and stored in a raw vault satellite table (semi-structured content). The satellite table stores data at any velocity and, since this is an event source, the source provides the complete audit history for the domain events. Application state can be rebuilt by accumulating the events.
- Data vault supports data versioning with the inclusion of the applied timestamp; because the data vault is bi-temporal, two timestamps are included with every event: the applied timestamp and the load (into data vault) timestamp.
- When a correction is applied, the event source can be replayed; however, we do not need to reload the data vault satellite. We load the corrected event again and, because the record content differs while carrying the same applied timestamp, it supersedes the previous record for that applied timestamp.
- The applied/event timestamp is the same but the load timestamp is different; the newer load timestamp indicates that it is the newer version of the data.
- Both versions are kept in the data vault forever; querying the correct timeline is just a matter of selecting the maximum load timestamp for each applied timestamp per parent key (see the sketch after this list).
- Should you be required to query the previous state, the data is always available in the data vault.
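A minimal sketch of that timeline query, assuming a satellite with hypothetical parent key, applied timestamp, load timestamp and payload columns:

```python
from datetime import datetime

# Hypothetical satellite rows: (parent_key, applied_ts, load_ts, payload).
satellite = [
    ("card-1", datetime(2022, 6, 1), datetime(2022, 6, 1), {"price": 9.99}),
    # Correction replayed later: same applied_ts, newer load_ts, new content.
    ("card-1", datetime(2022, 6, 1), datetime(2022, 6, 7), {"price": 10.99}),
    ("card-1", datetime(2022, 6, 2), datetime(2022, 6, 2), {"price": 11.50}),
]

def current_timeline(rows):
    """For each (parent_key, applied_ts), keep the record with the maximum
    load_ts, i.e. the latest known version of that point in applied time."""
    latest = {}
    for parent_key, applied_ts, load_ts, payload in rows:
        key = (parent_key, applied_ts)
        if key not in latest or load_ts > latest[key][0]:
            latest[key] = (load_ts, payload)
    return latest

for (pk, applied), (load, payload) in sorted(current_timeline(satellite).items()):
    print(pk, applied.date(), payload, f"(loaded {load.date()})")
# card-1 2022-06-01 {'price': 10.99} (loaded 2022-06-07)
# card-1 2022-06-02 {'price': 11.5} (loaded 2022-06-02)
```

To query the previous state instead, filter the rows to load timestamps at or before a chosen point in time and apply the same maximum-load-timestamp rule.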
Because information marts are deployed as views, any correction is reflected immediately; but if there are query assistance tables, they may need to be replayed depending on your circumstances.
Event sourcing supplied to a data vault provides these highlighted benefits:
- Record every event as the source supplied it; we can analyse the behaviour of the business object to dig into why the “item remove” event occurred.
- Record every correction to analyse why the correction was needed in the first place.
- Tie the domain event to the overall business ontology because the satellite table is a child table of either a hub table or a link table.
- Data vault is an INSERT-ONLY data model, just like event sourcing (data from change data capture can be configured the same way), and therefore events flow into the data vault with no need to process updates to an end-date field.
- This particular satellite table can serve a dual purpose if applied in a command query responsibility segregation (CQRS) pattern using a hybrid transactional/analytical processing (HTAP) table — a command and query table in one (see bit.ly/3AnBku5).
- A data vault satellite deployed as a hybrid table can then support high query throughput (20,000+ queries per second) as well as analytics and machine learning initiatives such as an online and offline feature store.
The concept of raw and business vault as the storage of business rule outcomes is very important and something you must consider as part of the overall data vault architecture. Raw vault is the storage of business rule outcomes from source-aligned domains; business vault is the storage of business rule outcomes derived from raw and/or other business vault tables. Just as raw vault’s source systems can be built in any programming language (or system), a business vault can be supported by any programming language or business rule automation tool. Although the data vault loaders (raw and business) must be implemented in the same automation tool (loading hub, link and satellite tables), the business rules themselves need not be. This separation of concerns around data vault ensures that your data vault can scale limitlessly, supporting the business architecture as the business and its service-oriented architecture (SOA) evolve.
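To illustrate that separation of concerns, here is a minimal sketch in which the business rule is just a function (it could equally be an external tool in another language), while a single satellite loader pattern is reused for both raw and business vault; all names are hypothetical.

```python
from datetime import datetime, timezone

def load_satellite(table: str, rows: list[dict], applied_ts: datetime) -> None:
    """One loader pattern for raw AND business vault satellites:
    insert-only, stamped with applied and load timestamps."""
    load_ts = datetime.now(timezone.utc)
    for row in rows:
        record = {**row, "applied_ts": applied_ts, "load_ts": load_ts}
        print(f"INSERT INTO {table}: {record}")  # stand-in for the real insert

# Raw vault: source-supplied outcomes, loaded as-is.
source_rows = [{"hub_key": "card-1", "credit_limit": 5000}]
load_satellite("sat_card_raw", source_rows, applied_ts=datetime(2022, 6, 1))

# Business vault: a soft business rule (any language/tool) produces the
# outcome, and the SAME loader pattern persists it.
def risk_rule(rows: list[dict]) -> list[dict]:
    return [{"hub_key": r["hub_key"], "high_risk": r["credit_limit"] > 4000}
            for r in rows]

load_satellite("sat_card_bv_risk", risk_rule(source_rows),
               applied_ts=datetime(2022, 6, 1))
```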
Data mesh is self-described as a sociotechnical system and is therefore best served by domain-oriented squads; this stream-aligned concept keeps each team’s cognitive load manageable.
Domain-oriented squads
“A map is not the territory it represents, but, if correct, it has a similar structure to the territory, which accounts for its usefulness.” — Alfred Korzybski, Science and Sanity
Domain-oriented, self-organising, stream-aligned squads, with facilitation provided by agile coaches and one or more data vault coaches. The key to data vault adoption is execution; data vault is not complicated, but it does come with a learning curve and an investment in training and coaching. As the standards are followed and the inevitable lessons learnt are incorporated into domain history, the iterations of data vault delivery will speed up.
Referring to the previous blog, mapping data products to teams (squads and tribes) and their investment needs a map to drive strategy; a Wardley map can be used to identify differentiators, investment opportunities and the landscape versus the competition. In more detail, you can group products as managed by squads and tribes, along with their dependencies on other squads and the domains they operate in. The Wardley map can describe your data vault adoption too: you will pioneer it, settlers will mimic, learn and adopt it, and before long the data vault becomes a reusable commodity in your enterprise.
#datavault #datamesh #businessarchitecture #domainorientedarchitecture #agile #datacleanroom #snowflake
The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.