Snowflake Data Clean Room (Native App)

Patrick Cuba
15 min read · Oct 16, 2024


Circle back to December 2023: Snowflake acquires Samooha, a company whose sole product is data clean rooms built on Snowflake. Both companies see an enormous opportunity. Samooha sees Snowflake’s ground-breaking data sharing technology as ideal for deploying zero-movement data clean rooms; Snowflake sees a company that took the Snowflake Professional Services-led framework to the next level and, as a Native App, the opportunity to automate a data cloud native privacy solution!

A data clean room is a secure and controlled environment that allows multiple parties (organizations, or divisions within an organization) to cross-analyse and enrich their own 1st party data with 3rd party data without data movement and without disclosing any identifying characteristics of that data. A data clean room is not a physical room, nor is it what we colloquially call a secure vault for dumping your data into. Again, there is no data movement within the provider of a clean room, nor is there data movement between participants of the data clean room.

In the data clean room, personally identifiable information (PII) is anonymised, processed and stored in a compliant way. Data clean rooms don’t allow data points that could be tied back to a specific user to leave the environment, giving the organization the ability to adhere to the data privacy laws in its jurisdiction, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), to name just a few.

Snowflake data clean rooms combine a host of what is known as Privacy Enhancing Technologies (PETs) to protect and manage private and personal data.

  • Encryption, double-blind joins and confidential computing techniques to protect data during storage and use.
  • Differential privacy and noise injection designed to prevent individual identification in data queries.
  • K-anonymity to ensure data anonymisation and prevent re-identification.
  • Principle of least privilege (PoLP): minimal data is selected into a data clean room, and access is granted and revoked using Snowflake’s role-based access control (RBAC).
  • Query and access limits to restrict the number, type and duration of queries.
  • Dataset reuse restrictions to prohibit reuse with other data clean room participants.

Snowflake data clean rooms are built on:

  • Snowflake Horizon Catalog, which contains the necessary components for securing your data in the cloud and meeting compliance and privacy requirements.
  • Snowflake’s Native App Framework, which includes a Streamlit frontend that deliberately makes the clean room friendly for business users to administer through ClickOps, alongside developer-friendly APIs.
  • Data Sharing, a native feature allowing you to share data securely between Snowflake accounts.

Let’s kick off by describing the esoteric terms I listed above before delving into the use cases and techniques. If you want to skip ahead to the implementation, scroll down to “The Data Clean Room”.

Secure by design

By design Snowflake is built on the highest levels of end-to-end encryption, everything within the Snowflake perimeter is designed with security top of mind. There’s never been a breach of Snowflake and Snowflake provides the flexibility for you to apply even more security based on your security posture. To name but a few, Snowflake natively applies:

  • Hierarchical encryption from the file level, through the table level (which enables data sharing), up to the account and root keys (the latter locked away in the cloud service provider’s hardware security module, or HSM)
  • Monthly key rotation by versioning Snowflake-managed encryption keys, ensuring all new files are encrypted with the new key
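
On top of the automatic monthly rotation, you can opt in to periodic rekeying, which re-encrypts data protected by keys older than one year. A minimal sketch (this assumes Enterprise Edition or higher and the ACCOUNTADMIN role):

    -- Opt in to annual rekeying of data encrypted with year-old keys
    ALTER ACCOUNT SET PERIODIC_DATA_REKEYING = TRUE;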

In addition, Snowflake supports:

Snowflake Horizon Catalog

Snowflake Horizon is simply an encapsulating term for all of Snowflake’s security and privacy features.

To protect the data in your own account, Snowflake provides a robust set of features that we touched on earlier. You can explore them in the Snowflake documentation.

Trust Centre

Special attention is devoted to the Snowflake Trust Centre. Not only does Snowflake periodically conduct and participate in independent security and privacy reviews of its software and architecture, you can also execute a suite of built-in utilities to verify that the way you have configured your Snowflake account meets Snowflake’s own security best practices and recommendations, as well as the CIS benchmark. Two scanner packages are currently provided and available for you to schedule:

  • Security Essentials (free)
  • CIS Benchmarks

Snowflake Data Clean Room

The Classic Snowflake Data Clean Room: Privacy-Preserving Collaboration

How does a Snowflake Data Clean Room differ from a Snowflake Data Share? Where does the Snowflake Native App framework fit in?

Data Sharing

Data sharing is a game-changing innovation from Snowflake: the ability to share data between Snowflake accounts within the same organization, or between you and your customers and partners in other organizations, without any data movement. This is as close to real-time data as you can get. You can optionally provide secure sharing access to your data either as:

  • A direct share between you and a target Snowflake account, or
  • By providing your data to an external or internal Snowflake Marketplace, which can also be managed using Snowflake APIs

We will use data sharing as a basis for data being shared by a provider and the consumer being able to query it.

Basics of Data Sharing between Snowflake Accounts, e.g. a Direct Share

You select the data you want to share, either as secure views or actual datasets. You create a share granting access to that data and then designate the target Snowflake account to share it with. If you wanted to share the same data with multiple target accounts, you could also apply a row access policy on that data and, in that policy, define which accounts get to see which rows by using the context function current_account(). Snowflake has also introduced database roles, which you can make available to the consumer account so they can grant those shared database roles to account-based roles of their own choosing.
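
A minimal sketch of a direct share, with hypothetical database, view and account names:

    -- Expose only the columns you intend to share through a secure view
    CREATE OR REPLACE SECURE VIEW sales_db.shared.customer_v AS
      SELECT customer_id, region, lifetime_value
      FROM sales_db.core.customers;

    -- Create the share and grant it access to the view
    CREATE OR REPLACE SHARE customer_share;
    GRANT USAGE ON DATABASE sales_db TO SHARE customer_share;
    GRANT USAGE ON SCHEMA sales_db.shared TO SHARE customer_share;
    GRANT SELECT ON VIEW sales_db.shared.customer_v TO SHARE customer_share;

    -- Designate the target (consumer) Snowflake account
    ALTER SHARE customer_share ADD ACCOUNTS = myorg.consumer_account;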

A data clean room takes this further by introducing a framework of authorised query patterns for your shared data, which we will discuss shortly.

Native App Framework

The Native App Framework is a combination of business logic (acquired from the Snowflake Marketplace, or developed and maintained by you and optionally published to the Marketplace), Streamlit in Snowflake (SiS), application event logging and Marketplace licensing. Like data sharing, there is a provider and a consumer (and private and Marketplace listings); unlike data sharing, your application must provide:

  • An application package (data + business logic + metadata).
  • Manifest file (configuration + setup properties).
  • Setup script (SQL statements to install or upgrade an application).
  • An application role used for setup and an application role your customer can use to grant access to their own account-level roles.

Basic components of a Native App
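
To make those components concrete, below is a minimal sketch of a setup script of the kind the manifest file points to through its setup_script property; the schema, role and procedure names are illustrative:

    -- setup.sql: executed when the consumer installs or upgrades the app
    CREATE APPLICATION ROLE IF NOT EXISTS app_user;

    CREATE OR ALTER VERSIONED SCHEMA core;
    GRANT USAGE ON SCHEMA core TO APPLICATION ROLE app_user;

    CREATE OR REPLACE PROCEDURE core.hello()
      RETURNS STRING
      LANGUAGE SQL
      AS
      $$
      BEGIN
        RETURN 'Hello from the native app';
      END;
      $$;

    GRANT USAGE ON PROCEDURE core.hello() TO APPLICATION ROLE app_user;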

The Snowflake data clean room is a Native App; therefore, it must be installed into your Snowflake account along with the setup databases, scripts and access needed for the app to function within your account. The setup is deployed automatically when you accept the terms for using the app and allow it to be installed in your Snowflake account.

Now that we’re acquainted with the Snowflake security perimeter (horizon), the native app framework and secure data sharing, let’s discuss the Snowflake features leveraged by the data clean room.

The Data Clean Room

The data clean room provides a virtual firewall by controlling what consumer queries can be run on the data you have selected for your data clean room, at what depth, and how often. We extend the above architecture diagrams with the (high-level) components of the Snowflake data clean room.

Snowflake Multi-Party Data Clean Room as a Native App

At a high level, the steps to set up a Snowflake data clean room are as follows (a developer-API sketch follows the list):

  1. As a Snowflake customer, sign up and install the data clean room native app.
  2. As a provider, link the data (internal or external) you intend to share, and define the necessary privacy protections you have decided on before the data is added to your data clean room. These are discussed later but include projection policies, aggregation policies and differential privacy functions with a privacy budget.
  3. Select which templates are allowed to run on your data.
  4. Add your collaborator (consumer) through the Native App, either an existing Snowflake customer or through a Snowflake account you spin up on their behalf.
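
For those preferring the developer-friendly APIs over ClickOps, the provider-side flow looks roughly like the sketch below. The stored procedures live in the clean room’s companion database (samooha_by_snowflake_local_db), while the clean room, table and template names are illustrative; consult the clean room developer documentation for the authoritative signatures:

    -- 1. Create and initialise the clean room
    CALL samooha_by_snowflake_local_db.provider.cleanroom_init('my_cleanroom', 'INTERNAL');
    CALL samooha_by_snowflake_local_db.provider.set_default_release_directive('my_cleanroom', 'V1_0', '0');

    -- 2. Link the data you intend to share
    CALL samooha_by_snowflake_local_db.provider.link_datasets('my_cleanroom', ['SALES_DB.CORE.CUSTOMERS']);

    -- 3. Select which templates may run on your data
    CALL samooha_by_snowflake_local_db.provider.add_templates('my_cleanroom', ['prod_overlap_analysis']);

    -- 4. Add your collaborator and publish
    CALL samooha_by_snowflake_local_db.provider.add_consumers('my_cleanroom', '<CONSUMER_ACCOUNT_LOCATOR>');
    CALL samooha_by_snowflake_local_db.provider.create_or_update_cleanroom_listing('my_cleanroom');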

Now that you have been sold on the concept of a data clean room, you have a few pertinent architecture decisions to resolve:

  • How will I join my 1st party data with my partner’s 3rd party data?
  • How will I ensure the privacy of my 1st party data as a provider?
  • As a provider, how do I ensure specific consumers have access to specific data?
  • If my consumer is not a Snowflake customer can I still deploy a data clean room?
  • Does my 1st party data need to be loaded as Snowflake FDNs?

Let’s dig in!

Secure double-blind joins & identity providers

What is the risk of using an identifier to join 1st and 3rd party data? Well, depending on the context (as always), the identifier itself isn’t secure and reveals a lot more customer information than it should. You should never use plain-text government-issued identifiers for your provider-consumer joins! What are our options?

  • If you do not consider email addresses to be personally identifiable, then these could be a candidate join key.
  • Tokenise an identifier, for example by hashing or encrypting it with an agreed-upon passphrase, and join on the token (see the sketch after this list). With Snowflake you can also make use of an external function to leverage tokenization supported by 3rd party tools. Also consider that you might load data into Snowflake already tokenized; you decide!
  • Leverage a supported identity provider in the data clean room. We work with several partners who provide their own durable surrogate join IDs through their respective APIs:

    ◦ LiveRamp’s RampID and household RampID.
    ◦ TransUnion’s TruAudience.
    ◦ Acxiom’s Real ID.
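
As a sketch of the tokenisation option, both parties can derive the same deterministic match key by salting and hashing the identifier (a salted SHA2 hash is one common approach; the names and the salt below are hypothetical, and the salt must be agreed and kept secret by both parties):

    -- Each party derives the same deterministic match key from its own data
    CREATE OR REPLACE TABLE crm.clean_room.customers_tokenised AS
      SELECT SHA2(LOWER(TRIM(email)) || 'agreed-secret-passphrase', 256) AS match_key,
             segment,
             lifetime_value
      FROM crm.core.customers;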

By the way, we can avoid re-identification by preventing consumers from rerunning authorised queries multiple times to single out explicit users; more on this later under privacy budgets! (Refer to Yao’s millionaires problem.)

Policy Constrained Access

In addition to the column and row protections Snowflake already provides (row access policies, dynamic and conditional masking, as well as tag-based masking and custom and Snowflake classification), the following are privacy-specific policies you must consider for protecting your data in a data clean room.

Projection Policies — Column Based

Controls whether a query can project a column, that is, include the column in a SELECT statement. This affects which columns a consumer can see in the data shared by the provider, although the column can still be used in a clean room analysis.

Reference: https://docs.snowflake.com/en/user-guide/projection-policies
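
A minimal sketch, assuming a hypothetical provider account identifier and table:

    -- Only the provider account may project (SELECT) the column;
    -- everyone else can still join or filter on it inside the clean room
    CREATE OR REPLACE PROJECTION POLICY pp_hide_email
      AS () RETURNS PROJECTION_CONSTRAINT ->
      CASE WHEN CURRENT_ACCOUNT() = 'PROVIDER_ACCOUNT'
           THEN PROJECTION_CONSTRAINT(ALLOW => true)
           ELSE PROJECTION_CONSTRAINT(ALLOW => false)
      END;

    ALTER TABLE crm.core.customers
      MODIFY COLUMN email SET PROJECTION POLICY pp_hide_email;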

Aggregation Policies — Table/View Based

An aggregation policy protects the privacy of entities (in a table or view) by requiring that each aggregation group contain a minimum number of entities or entity records. Because the data is only ever shared at an aggregate level, the theory applied here is that the aggregate does not expose the finer details it is based on. If an aggregation group meets the minimum group size threshold, an aggregate value (for example: count, sum, average) is returned; records in groups that do not meet the threshold are lumped together into a remainder group whose grouping value is returned as NULL.

Aggregation policies can be applied in one of two ways:

  • Non-entity based: as described above (row-level privacy).
  • Entity-based: in addition to the row-count protection we discussed above, entity-based aggregation gives you the option to identify which columns in the table or view constitute an entity key, i.e. uniquely identify an entity (entity-level privacy).

Reference: https://docs.snowflake.com/en/user-guide/aggregation-policies
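
A minimal sketch, assuming a hypothetical table, entity key and a minimum group size of 25:

    CREATE OR REPLACE AGGREGATION POLICY ap_min_group
      AS () RETURNS AGGREGATION_CONSTRAINT ->
      AGGREGATION_CONSTRAINT(MIN_GROUP_SIZE => 25);

    -- Entity-based: every aggregation group must cover at least 25 distinct customers;
    -- omit ENTITY KEY for the non-entity (row-count) variant
    ALTER TABLE crm.core.customers
      SET AGGREGATION POLICY ap_min_group ENTITY KEY (customer_id);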

Differential Privacy — Table/View and Column Based

Finally, Snowflake offers the following techniques under the differential privacy umbrella to further protect your data:

  • Injecting noise into query results: this implies that the results of an analysis are indicative rather than exact. We do this by specifying privacy domains for categorical and numeric data.
  • Allocating a privacy budget to bound privacy loss: this implies setting a budget for the number of times a consumer can run queries over your differential-privacy-protected data. The privacy budget is not a monetary value; rather, it is a custom quota you define for your consumer’s queries, limiting privacy loss by restricting how often they can query your data.

Reference: https://docs.snowflake.com/en/user-guide/diff-privacy/differential-privacy-overview
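
A rough sketch based on the reference above; the table, columns and budget name are hypothetical, so check the documentation for the exact syntax available to your account:

    -- Attach a privacy policy carrying the consumer's privacy budget
    CREATE OR REPLACE PRIVACY POLICY dp_policy
      AS () RETURNS PRIVACY_BUDGET ->
      PRIVACY_BUDGET(BUDGET_NAME => 'consumer_budget');

    ALTER TABLE crm.core.customers SET PRIVACY POLICY dp_policy;

    -- Privacy domains bound the noise calculation per column
    ALTER TABLE crm.core.customers
      ALTER COLUMN age SET PRIVACY DOMAIN BETWEEN (0, 120);
    ALTER TABLE crm.core.customers
      ALTER COLUMN segment SET PRIVACY DOMAIN IN ('RETAIL', 'SMB', 'ENTERPRISE');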

Approved Templates

On top of the privacy policies you use to protect the grain and limit the use of your data, you can also control which types of queries a consumer can execute on your data. That’s right: you not only control how your data can be queried but also what is executed on it. This is the last protection provided in your SQL virtual firewall.

To that end, let’s present the out-of-the-box (OOTB) available query templates you can right now support in a Snowflake data clean room. We follow that up with how to build your own templates too.

  • Customer overlap between advertiser and publisher.
  • Enriching your own data from a data clean room.
  • Utilizing partner data in the data clean room for your machine learning models.
  • Building your own custom templates using Jinja SQL.

Find these examples here: https://docs.snowflake.com/en/user-guide/cleanrooms/demo-flows/python-based-templates
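
As a flavour of the custom-template option, the sketch below registers a Jinja SQL overlap query through the clean room developer API; the procedure name and the {{ source_table }} / {{ my_table }} variables follow the clean room documentation, while the clean room, template and column names are illustrative:

    CALL samooha_by_snowflake_local_db.provider.add_custom_sql_template(
      'my_cleanroom',
      'prod_custom_overlap',
    $$
    SELECT COUNT(DISTINCT p.match_key) AS overlap_count
    FROM identifier({{ source_table[0] }}) AS p   -- provider-linked table
    JOIN identifier({{ my_table[0] }}) AS c       -- consumer-linked table
      ON p.match_key = c.match_key
    $$);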

Use Cases

The original data clean room use case (as far as I am aware) was overlap analysis between advertisers and publishers in media and advertising. Given the regulatory requirements around data privacy, other business use cases have emerged where the benefits of zero data transfer are desired, and data privacy is non-negotiable.

Speaking of zero data movement, Snowflake data clean rooms also support bringing in external table formats, such as Apache Iceberg tables.

Where to put it

Where your Snowflake data clean room lives

You manage your data within Snowflake using databases and schemas; when participating in a data clean room you can make existing data available to the clean room through Snowflake role-based access control (RBAC). However, if you still feel vulnerable despite the rigorous data privacy and governance controls Snowflake provides, you could consider the following options:

  • Deploy an isolated Snowflake account: as an ORGADMIN you can spin up a new Snowflake account and provide your data to it via direct data sharing, or use a Snowflake managed (reader) account. To track the actual queries run in the data clean room in this setup, either interrogate query_history in the deployed Snowflake account’s account_usage views or, for a reader account, in your own Snowflake account’s reader_account_usage views.
  • External tables: you can include data that is not loaded as Snowflake FDNs but exists in your infrastructure and is available to the data clean room. A good example here is being able to leverage Apache Iceberg tables in the data clean room.

Do consider, however, that the greatest advantage of using Snowflake data clean rooms is the zero movement of data (data sharing).

The process to remove your data from a clean room is very easy; revoke access to that data with any of the following (a sketch follows the list):

  • a simple revoke select access to a table or view;
  • revoke usage of the schema and/or database where the data resides;
  • drop the secured view of the underlying data;
  • drop the data share to a target account;
  • if the data you provided was ephemeral, then drop or truncate that data.
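
A few of those paths in SQL, with hypothetical object names:

    REVOKE SELECT ON VIEW sales_db.shared.customer_v FROM SHARE customer_share;
    REVOKE USAGE ON SCHEMA sales_db.shared FROM SHARE customer_share;
    DROP VIEW sales_db.shared.customer_v;
    ALTER SHARE customer_share REMOVE ACCOUNTS = myorg.consumer_account;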

What if a single dataset you have linked to your data clean room should serve multiple consumers of your data? A solution you can explore is the use of row access policies (RAP) to designate specific records as available to target accounts with the current_account() context function.
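
A minimal sketch of that pattern, assuming each row carries the account locator allowed to see it:

    CREATE OR REPLACE ROW ACCESS POLICY rap_per_consumer
      AS (allowed_account STRING) RETURNS BOOLEAN ->
      CURRENT_ACCOUNT() = allowed_account;

    ALTER TABLE sales_db.core.customers
      ADD ROW ACCESS POLICY rap_per_consumer ON (allowed_account);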

Consumer is not a Snowflake customer

What happens when your collaborator is not yet a Snowflake customer? Well, you can use Snowflake’s reader account architecture, or you can spin up a full Snowflake account for that collaborator. Let’s break down why you would choose one or the other, and what the implications are.
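
Spinning up a Snowflake-managed reader account is a single statement for the provider (credentials hypothetical); the trade-off is that you, as the provider, own the account and pay for the compute the reader consumes:

    CREATE MANAGED ACCOUNT partner_reader
      ADMIN_NAME = 'partner_admin',
      ADMIN_PASSWORD = 'Choose-a-strong-password-1',
      TYPE = READER;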

Collaborators are not in the same region or CSP

This goes to the fundamentals of cloud architecture. Snowflake is available on all three major CSPs, and you will have a near-equivalent experience on all three. Data sharing works out of the box when provider and consumer are in the same CSP and the same region. If a consumer is in a different region from the provider, whether for plain data sharing or a data clean room, the data must be replicated to the consumer’s region and cloud (the data must exist in that CSP’s data centre) and shared from there. This means the data is no longer available in real time; instead, its availability depends on the time it takes to replicate the provider’s data into the region(s) where the consumer has their account.

Snowflake makes the process seamless to its customers, but what is happening underneath is something called “auto-fulfilment”. These are important concepts because they affect both latency and cost: the CSP charges Snowflake an egress cost, and that cost is passed on to the Snowflake customer providing data in the data clean room. This is the same technology used to make your public Marketplace data available globally across the Snowgrid.

Take home

Snowflake is a service utilising a Cloud Service Provider’s (CSP) infrastructure to deliver a secure AI & Data Cloud as a service (you do not need the expertise or time to set this up yourself). Just like those CSPs, Snowflake has a shared responsibility model, but its components are data focussed, secure by design and executable through SQL.

One of Snowflake’s design principles is to provide maximum scalability with its features while reducing the knobs you need to configure for your data architecture. From the start, Snowflake separated its compute engine from how your data is stored, managing both through a separate metadata layer. This means there is less need to move data, and you can scale the compute nodes (virtual warehouses) running on the same data to your heart’s desire.

With Snowflake Horizon we have the necessary components for you to design your data architecture securely and to ensure data privacy at the same time. Those same data governance semantics (along with Snowflake data sharing and the Native App Framework) mean that you can securely and privately collaborate with 3rd party providers without exposing sensitive or confidential data. This Snowflake advantage translates to:

  • Reduced time to value
  • Repeatable and secure by design
  • Ease of use and safety in the knowledge that your data cannot be re-identified
  • Mutually beneficial partnerships with different custodians of common data, even if you never know what was common, only how much was common.
  • Participating in an ecosystem of valuable data curated by secure 3rd parties but never revealing sensitive or confidential data.
  • Ensuring that as a provider, you have taken all the mathematically secure precautions to secure your own customer data.
  • As a consumer, you can easily augment your own data with 3rd party data you do not need to move, manage or secure to enhance your own business case.
  • With other cloud services solutions not based on Snowflake, your engineers need to master far more than what this article has described to secure your data. Snowflake provides SQL semantics database administrators should already be familiar with to meet these business use cases. Securing your data is an automatic requirement for modern data platforms on the cloud; Snowflake and its many industry-leading features just happens to be the outright leader in this quadrant.

What does this have to do with Data Vault, you ask? Well…


The views expressed in this article are that of my own, you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.


Written by Patrick Cuba

A Data Vault 2.0 Expert, Snowflake Solution Architect
