Data Vault on Snowflake: Row Access Policies + Multi-Tenancy
Snowflake continues to set the standard for Data in the Cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You will expect the same relational capabilities for your data model as any other platform and Snowflake certainly delivers. Data Vault supports multi-tenancy and combined with Snowflake’s Row Access Policy (RAP) you can simplify the authorisation access of the Data Vault tables itself and make it data driven.
Episode catalogue,
- Immutable Store, Virtual End-Dates
- Snowsight dashboards for Data Vault
- Point-in-Time constructs & Join Trees
- Querying really BIG satellite tables
- Streams & Tasks on Views
- Conditional Multi-Table INSERT, and where to use it
- Row Access Policies + Multi-Tenancy
- Hub locking on Snowflake
9. Virtual Warehouses & Charge Back
11. Handling Semi-Structured Data
12. Feature Engineering and Business Vault
A reminder of the data vault table types,
The tenancy applied to the data vault will be used in to build our query assistance tables.
Episode 7: Row Access Policies + Multi-Tenancy
Multi-tenancy is the concept of having a single instance of an application or platform serve multiple customers, the tenants. Each tenant shares the underlying application, but each tenant’s collateral is isolated and remains invisible to the other tenants. Your favourite Cloud Service Providers (CSP- AWS, Azure and GCP) provide Infrastructure-as-a-Service (IaaS) and Snowflake is one of its tenants. Snowflake itself is a Software-as-a-Service (SaaS) platform deployed in a Virtual Private Cloud (VPC) instance and serves multiple tenants (Snowflake accounts) per deployment across the CSPs.
The concept of multi-tenancy is applicable to a database as well, Salesforce’s SaaS includes a relational database management system (RDBMS) that hosts their customers on a single database but ensures each customer’s data is isolated and remains invisible to other customers.
Within a Snowflake’s SaaS environment, not only are customers enjoying the near-infinite Data Cloud scalability across multiple CSPs using Snowflake Data Sharing; Snowflake’s ultra-secure cutting-edge security narrative ensures that your Snowflake accounts are isolated and remains invisible to other accounts.
Multi-tenancy is a concept in Data Vault too; Data Vault can be configured to host multiple tenants of a shared enterprise data model, whilst keeping the table content invisible to other tenants as well. In this episode we will discuss just how to do that and combine Data Vault multi-tenancy with Snowflake technology.
What Data Vault means to a business
Data Vault is often compared to other data modelling methodologies such as 3rd normal form or Kimball dimensional marts. Each method has a focus, whilst dimensional modelling focuses on information delivery it becomes cumbersome maintain when changes inevitably need to be applied. Data Vault focuses on how to maintain auditability, remain agile while tracking the two most important information mapping concepts of a business:
- Object state, and
- Object relationships
We won’t need to investigate 3rd normal form because Bill Inmon himself has endorsed Data Vault and updated his definition of a data warehouse accordingly,
No matter what business process, (lean) value stream or domain you are mapping, what a business does is always based on a business object. We want to know its state (and perhaps what was its state) and how do we derive value from that object by combining it with other business objects to form a unit of work, aka transaction or relationship. As we were reminded of the table types of data vault these are the essential entities the data vault is designed to capture. All software applications and systems are purchased to automate business processes and it is the scalable and flexible nature of data vault that we can capture the business process outcomes into hub, link, and satellite tables. Every variation of these table types should be qualified through business process understanding, data profiling and technical debt recognised and budgeted to be dealt with in a product backlog. For every tactical solution there must be a card created to track, assign, and deal with technical debt (if any) when it is appropriate for that organisation to do so.
Passive Integration
Data Vault being the top-down data modelling approach, how do we then passively integrate the source systems to our enterprise model? Recognise that COTS (commercial off the shelf) software will have multi-tenants of their own and likely have an industry-specific but otherwise generic data model to serve as many of their customers as possible. What the vendor labels as an account may not be how your enterprise defines an account. The task for you is to integrate these business rule engines into data vault using recognisable hub table naming standards based on the defined/defining business architecture.
Defined at your enterprise level is your Information Map that anyone in your business will understand as the true interpretation and meaning of common business objects. With that in mind, let’s briefly define some important concepts of what the above means for your data vault.
1. Recognising that multiple source systems having their own definitions will map their business keys of a business object to a common hub table.
As we integrate multiple source systems to how we as business architects have helped the business define that business object, we could be loading similar business keys to the same hub table but represent different business objects. For those source systems where we could have a clash, we introduce a persistent salt for that business key and call it the business key collision code (BKCC). This method of separating business objects is used sparingly, the ideal world view is that the same business key representing the same business object is used across source systems, however we do recognise that in operational reality that this could be a challenge.
2. Recognising that you may have multiple tenants to the same data model but not necessarily being able to share its content whilst still using the same definitions of what is a business object according to your enterprise.
To reuse the same data vault model is to reuse the same hub tables without exposing the business object state and relationships to unauthorised business users.
3. What other metadata columns could we add to a record to give us the robust data provenance we seek. Here are some suggestions:
- Job or Task id — tie this record to what inserted it
- Jira id — what initiative led to this record being inserted into the enterprise data vault model. The Jira id ties to a document that in essence is a mandatory requirement in DevOps culture,
- Record source, where the record came from, but also a space to include the business rule name and version. Any change to business rule version could trigger a change record in a satellite table.
- Applied date timestamp is the snapshot date of the state of the business object at the time the business rule outcome was captured.
- Load date timestamp being the timestamp of when the record is loaded to Data Vault and in essence a version timestamp.
To summarise the above…
- Multi-tenant-id || BKCC || business key(s) = surrogate hash key
- For every data vault table, we include a column to identify the record tenant
- Because the tenant id and BKCC are a part of generating the hash key; using SQL equi-joins only ever join related content by the hash key digest.
If you are not sure what to set the tenant id to, or if you are unsure if you need one (and this is applicable to BKCC as well), set these ids to “default”.
Multi-tenant design
We have defined that multi-tenancy is built into the surrogate hash key and that these are used to join hub, link, and satellite tables via SQL equi-join conditions. We also need to filter these table types by the tenant itself. Because the hub table contains the business key and the appropriate tenant-id; it is the only place where the plain-text business key is required. Therefore, every other data vault table is only contextually applicable to the business key if we can join to it by the generated surrogate hash key.
- A satellite table has no business key and requires an equi-join to the hub table to find what that business key is
- A link table has no business key and requires an equi-join to the hub table(s) to find what the business key is
Now let’s design the components needed for multi-tenancy of Data Vault on Snowflake…
Step 1: Who are the tenants?
Like identifying the source systems for your Data Vault, you need to standardise the tenants of your enterprise data vault model via codes. This list will be what we will define as our policy entitlements table….
Remember, to qualify a multi-tenant is to use the same hub table structure, but to ensure that the tenant’s material content is easily discernible.
Step 2: Design RBAC
Snowflake Role-Based Access Controls (RBAC) will be used in combination of the entitlements table, this ensures a robust and centralised comprehension of whom the tenants are. Although access to the data objects could be shared, using the correct Snowflake context ensures that only the authorised role can see the data they are entitled to. For this design we will use Snowflake’s current_role() function. For the full list of context functions click through to here: bit.ly/3L4p2Zg
Functional role automation can be managed externally via your Single Sign-On (SSO) provider and groups managed via SCIM. These roles will map 1:1 to Snowflake and we can define that access by extending the entitlements table with the role names.
Keep these uppercase!
The roles above are linked to access roles that grant privileges to access the data vault assets themselves. Because the content is being used to provide analytics, the business unit analyst role over this content should only have SELECT access privilege on Information Marts only. The builder role will need SELECT access privilege over data vault and be able to CREATE TABLEs in our query assistance schema and CREATE VIEWs in the information mart schema. How you choose to manage and segregate your zones may differ of course, this example is designed to illustrate how to use RBAC and RAP for your data vault!
Step 3: Define RAP.
The entitlements table serves as the base for the data driven RAPs needed to control the access to your data. Combined with the authorised role, the RAP will only allow row-level selection of the content that RBAC is entitled to. For the common hub the authorised role an only see the business keys it is entitled to. The same is applicable to link and satellite tables but these are generally single source tables!
When querying / using a data object affected by RAP, Snowflake will evaluate access at query time and use a dynamic secure view over that affected table. This means every time you use the table the internal view itself is executed.
The above is a pretty simple design, essentially this can be expanded to more advanced RAP topics such as layering of access based on the view over the data vault. Still based on functional roles, this expands the row level access to sub-categories of roles and perhaps to aggregated views of the underlying content to obfuscate where necessary. Do keep in mind that the more layers you add the more context Snowflake needs to execute at query time, so keep these as flat as possible in your design!
Let’s see what happens when we use the same builder role to build a PIT over a hub that does not have treasury data. In other words, TRS BUILDER is not entitled to that content
With RAP applied to the hub_account table and its adjacent satellite tables, the PIT table that gets created retrieves no records from those tables. Essentially, we have applied a pseudo-filter without explicitly defining it in our PIT table construction. Neat!
TRS ANALYST only has SELECT privileges to the information mart view; running a SELECT statement will produce data that assigned role is authorised based on the view and not need explicit SELECT privileges on the underlying data vault tables (hubs, links, and satellites). Another not authorised to see the content will not return any records, like a finance role in the diagram below, FIN ANALYST. This role does not appear in our entitlements table.
Access beyond a Single Snowflake Account
Additional RAPs could be defined if you need to share centralized content beyond your Snowflake account. That is, to share your data to other Snowflake accounts you may consider utilizing Snowflake Collaboration in either of the following configurations:
- Privacy-preserving collaboration — account-to-account or account-to-multiple-accounts private, secure sharing of data, data services and applications (applications are currently in private preview) powered by Snowgrid, Snowflake’s cross-cloud technology layer that interconnects your business’ ecosystems across regions and clouds so you can operate at global scale. When sharing the data itself is not an option, specified accounts can be granted the ability to analyze the data without actually exposing it.
- Share with companies not yet on Snowflake — if the company you are looking to share data with is not yet on Snowflake, you can provision managed accounts that are either read-only or full read-write accounts.
- Snowflake Marketplace — built on Snowflake secure data sharing allowing consumers to discover, evaluate and purchase third-party data and data services, as well as providers to market their own products across the Data Cloud. You can access the most current data sets available and receive real time updates automatically directly in your Snowflake account to query without transformation and join it with your own data.
All three options can be used to automate and monetise your data between exchange partners (internal or external to your organisation). With the latter two Snowflake has functionality to enable you to advertise your data with a homepage, include usage scenarios and sample SQL scripts or templates for self-service of your data over Snowflake’s Global Mesh.
Keep in mind though, any table or view in Snowflake can support up to one row access policy at a time. That means if you need to share data that has an existing RAP, you will need to make that content available in secure view to be included in an outbound share or clone the content and remove the existing RAP to apply the account-level RAP.
Snowflake gives you the tools to simplify your design and secure your data. Here we did not need to deploy views for every business user or use case. Instead by defining a policy once and applying it to the underlying tables, Snowflake’s RBAC takes care of filtering out that data for you!
Until next time!
Reference:
- Row Access Policies — https://docs.snowflake.com/en/user-guide/security-row.html
- Data Sharing — https://docs.snowflake.com/en/user-guide/data-sharing-intro.html
- Product Offerings for Secure Data Sharing — https://docs.snowflake.com/en/user-guide/data-sharing-product-offerings.html
- Published here too: https://www.snowflake.com/blog/data-vault-row-access-policies-multi-tenancy/
The views expressed in this article are that of my own, you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.