Data Vault on Snowflake: Feature Engineering & Business Vault

Patrick Cuba
16 min read · Mar 30, 2023


“The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering” — Luca Massaron

Snowflake continues to set the standard for Data in the Cloud by taking away the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data model methodology for the cloud. You can expect the same relational capabilities for your data model as on any other platform, and Snowflake certainly delivers.

A 2016 data science report found that data scientists spend around 80% of their time on data preparation (collecting, cleaning, and organising data) before they can even begin to build machine learning (ML) models to deliver business value. Collecting, cleansing, and organising data into a coherent form for business users to consume are standard data modelling and data engineering tasks for loading a data warehouse. Just as a dimensional data model transforms data for human consumption, ML models need raw data transformed for ML model consumption through a process called “feature engineering”. It is a vital step in building successful ML models: it ensures the data is in good shape for an ML model to succeed and perform efficiently.

Data vault explicitly defines a set of standards called “hard rules” to automate data extraction, cleansing and modelling into raw vault tables. Data vault’s “soft rules” are those auditable transformations that use raw (and other business vault) data artefacts whose outcomes are loaded into business vault based on the same business entity and unit of work as raw vault. Data vault as a practice does not stipulate how you transform your data, only that you follow the standards to populate business vault link and satellite tables as you would have done to populate raw vault link and satellite tables.

In this episode we will use what we have learnt in this series to support the data preparation requirements for ML on Snowflake using data vault patterns for modelling and automation.

Episode catalogue

1. Immutable Store, Virtual End-Dates

2. Snowsight dashboards for Data Vault

3. Point-in-Time constructs & Join Trees

4. Querying really BIG satellite tables

5. Streams & Tasks on Views

6. Conditional Multi-Table INSERT, and where to use it

7. Row Access Policies + Multi-Tenancy

8. Hub locking on Snowflake

9. Virtual Warehouses & Charge Back

10. Out-of-sequence data

11. Handling Semi-Structured Data

12. Feature Engineering and Business Vault

A reminder of the data vault table types,

[BONUS] Episode 12: Feature Engineering and Business Vault

To visualise how data vault can support ML, let’s evaluate where data management fits into the workflow of a typical ML project.

ML Workflow, ubr.to/3EJHjvm

Once a business need is defined and a minimal viable product (MVP) is scoped, the data management perspective begins with:

A) Data ingestion — data is acquired, cleansed, and curated before it is transformed.

B) Feature engineering is executed to support ML model training.

C) Trained models are deployed into production and predictions are monitored for ML model performance.

As your data science team grows, more machine learning pipelines are needed to support the above cycle; the risk of deploying redundant, non-standard ML pipelines grows and the overall efficiency of the data science practice suffers as well.

For this reason, a new data management framework for machine learning has emerged to help manage this complexity: the “feature store”.

Feature Store

As described in Tecton’s blog, a feature store is a data management system for managing machine learning (ML) feature pipelines including the management of feature engineering code and data. The feature store enables teams to share, discover and use highly curated sets of features to support ML experiments and deployment to production.

Based on Tecton blog

Similar to data engineering pipelines into a Data Lake/Warehouse? Yes, feature stores are part of the MLOps discipline; here’s how Data Vault on Snowflake supports the data model components of a feature store.

A) Ingest RAW data into a modelled Raw Vault

Snowflake supports batch, micro-batch, and streaming ingestion

Snowflake supports the ingestion of structured and semi-structured (JSON / XML) data through the following options:

  • File-based batch COPY into Snowflake — this can be automated using a partner ELT/ETL tool that executes a Snowflake COPY command to load a supported file format into a traditional Snowflake table (a minimal sketch follows this list)
  • Snowpipe micro-batch into Snowflake — either triggered through a cloud service provider’s messaging service (such as AWS SQS, Azure Event Grid, or Google Pub/Sub) or by calling Snowpipe’s REST API endpoints. Use Snowflake’s native Kafka Connector to land Kafka topics in Snowflake tables.
  • External table registration and consumption as

o Parquet — a popular column-oriented data file format for analytics

o AVRO — a serialisation format for data under streaming pipelines

o ORC — an optimised row-columnar file format designed for Hive

o Iceberg — an open table format specifically designed for analytical queries; underneath, the table data is stored as Parquet, but the format’s metadata and manifests allow for advanced analytical functions and simplified querying. It even has a time-travel feature like Snowflake tables do, and it will soon be supported within Snowflake’s internal named stage.

  • Snowflake can also ingest external tables from on-premises data sources via S3-compliant data storage APIs.
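As a minimal sketch of the first two options (batch COPY and Snowpipe), the stage, table, and pipe names below are hypothetical; your ELT/ETL tool or pipeline would issue something equivalent:

```sql
-- Hypothetical landing table and stage names; adjust to your environment.
-- Batch: load staged Parquet files into a landing table ready for raw vault staging.
COPY INTO landing.card_transactions
  FROM @landing.ext_stage/card_transactions/
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Micro-batch: a pipe that auto-ingests new files as cloud-storage event notifications arrive.
CREATE OR REPLACE PIPE landing.card_transactions_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO landing.card_transactions
    FROM @landing.ext_stage/card_transactions/
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```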

Batch / file-based data is modelled into the raw vault table structures depicted above as hub, link, and satellite tables. Depending on the grain of the data being ingested and the modelling requirements, it can also be modelled into further table structures, such as:

When the state of a business entity or relationship is multi-row

Snowflake has also introduced:

  • Snowpipe Streaming API — real-time, row-set inserts of records into Snowflake tables without staging the data as files first.

Should the need arise, raw vault also supports real-time ingestion into non-historised link and satellite table structures; these differ from the previously articulated data vault table types because every record we receive here is already a change, so there is no need to test whether the incoming data is a true change; it always is (a sketch follows below).

Supporting streaming ingestion
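A minimal sketch of a non-historised satellite load, with hypothetical table and column names: because every incoming record is already a change, the loader simply appends rows without the true-change (hashdiff) test a standard satellite load would perform.

```sql
-- Non-historised satellite: insert-only, no true-change (hashdiff) test required.
INSERT INTO raw_vault.nh_sat_card_transaction
SELECT
    dv_hashkey_card        -- hash of the business key
  , applied_ts             -- business-process event timestamp
  , current_timestamp()    -- load date timestamp: when the record enters the data vault
  , record_source
  , transaction_amount
  , transaction_type
FROM staged.card_transactions_stream;  -- e.g. a view over a stream of newly arrived events
```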

Now that we know how we can get data into Snowflake, let’s turn our attention to feature engineering options within Snowflake…

B) Transformations — Feature Engineering into Business Vault

Transformations can be written in SQL, Python, Java, Scala… pick your poison!

The purpose behind a business vault is to cheaply augment existing business processes with additional intelligence to:

a) complete the business process as the business sees it, as opposed to how the source-system business model (raw data) represents it.

b) derive auditable facts and metrics while reusing the same automation, auditability, and agility offered by raw vault table and loading patterns.

The transformations we apply under feature engineering prepare the data for ML model training. The reusable and shareable (by more than one ML model) feature values stored in business vault will:

  • aggregate raw data into meaningful/useful features (the shape of the data); these can be computed over varying time windows as aggregations such as counts, minimum and maximum values, averages/means, medians, standard deviations, etc.
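For illustration, a minimal sketch of such a shareable, time-windowed aggregation; the raw vault satellite, business vault satellite, and column names are hypothetical, and in practice the output would pass through the standard satellite loader with its true-change check:

```sql
-- Derive 30-day spend features per card and load them into a hypothetical
-- business vault feature satellite.
INSERT INTO business_vault.sat_card_spend_features
SELECT
    dv_hashkey_card
  , max(applied_ts)               AS applied_ts   -- carried over from the raw data the rule ran on
  , current_timestamp()           AS dv_loadts    -- load date timestamp of this run
  , 'bv.card_spend_features.v1'   AS dv_recsource -- soft-rule name and version
  , count(*)                      AS txn_count_30d
  , sum(txn_amount)               AS txn_sum_30d
  , avg(txn_amount)               AS txn_avg_30d
  , max(txn_amount)               AS txn_max_30d
FROM raw_vault.sat_card_transaction
WHERE applied_ts >= dateadd(day, -30, current_timestamp())
GROUP BY dv_hashkey_card;
```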

The non-shareable and non-reusable feature values will not be stored in business vault because they carry point-in-time, model-specific state, such as:

  • either numerical or categorical imputation (handling missing values),
  • managing data outliers,
  • one-hot encoding,
  • binning… oh my!

Features and feature engineering (like business vault satellites) can depend on:

  • a single raw vault data source — a single raw vault satellite table (i.e., a single source-system feed) based on either a business object or a unit of work (hub or link table, respectively),
  • multiple raw data sources (joining raw vault satellite tables), or
  • a combination of the above.
Single-source, multi-source or using other labels

By using raw and business vault satellite table structures as separate entities we guarantee:

  • audit, because we retain a history of both raw and engineered feature data
  • agility and scalability, because the satellite tables are separate: a change to one table does not affect the other unless explicitly desired
  • automation, because the same loader patterns are used for both and the same metadata tags are expected from both, meaning the applied date timestamp in business vault will match the applied date timestamp of the raw data it came from.

Let’s explore that last point a little more closely…

Data Versioning

Raw vault stores the business process output in its applied state as raw hub, link, and satellite tables. The timestamp of that business object’s state is what data vault refers to as the applied date timestamp; as data is landed to be ingested into raw vault, a load date timestamp is also recorded per record to denote when that record entered the data vault. This means data vault is bitemporal: every record in every data vault table carries the applied timestamp (the business process outcome state timestamp) and the data platform load timestamp. It also means that if a correction must be applied from the source system, you can load the newly corrected state into data vault and the load timestamp will be newer; this is referred to as a newer version of that same record. There is no need to unload or delete content in data vault: simply load the new version, and the version number is synonymous with the load date timestamp.

The same concept applies to business vault: you can version the feature code in git to reload and supersede the data loaded into business vault. New deltas are detected in business vault in one of, or a combination of, the following situations:

  • the data the feature code is based on has changed (raw or other feature data), or
  • the feature code (soft rule) has changed

A requirement of (feature) soft rule code is that it must be idempotent: if neither of the above conditions has occurred and we re-run the transformation and load the same data, then the outcome must be the same every time, i.e., we do not load any new records by applied timestamp because we have already loaded them before!

Further to the above, in terms of orchestration the business vault satellite must carry the same applied date timestamp as the raw vault data it is based on, and a change in the business rule is recorded in the satellite table’s record source column as a business rule name and version.

Of course, this is optional: you can (if you so desire) rely on Snowflake’s Time Travel and your chosen version control software to roll back and replay feature logic and reload the soft rule outcomes, but these will be limited to Time Travel’s configurable retention window of up to 90 days. By using data vault’s bitemporal approach, data loads and data selection are seamless and no refactoring is required. This approach guarantees a full audit of what happened, including the fact that a reload occurred, and the feature version can be recorded in the satellite table’s record source column value.
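A minimal sketch of how a reload lands as a newer version, again with hypothetical table and column names: the applied timestamp is preserved, the new load timestamp identifies the version, the record source carries the rule name and version, and the anti-join keeps the load idempotent.

```sql
-- Reload the soft-rule output; only rows not already present for that key,
-- applied timestamp, and hashdiff are inserted, so re-running unchanged code
-- and unchanged data loads nothing new.
INSERT INTO business_vault.sat_card_risk_features
SELECT
    s.dv_hashkey_card
  , s.applied_ts                              -- unchanged: the business-process timestamp
  , current_timestamp()         AS dv_loadts  -- new: identifies this (re)load as a newer version
  , 'bv.card_risk_features.v2'  AS dv_recsource
  , s.dv_hashdiff
  , s.risk_score
FROM staged.card_risk_features s
WHERE NOT EXISTS (
    SELECT 1
    FROM business_vault.sat_card_risk_features t
    WHERE t.dv_hashkey_card = s.dv_hashkey_card
      AND t.applied_ts      = s.applied_ts
      AND t.dv_hashdiff     = s.dv_hashdiff
);
```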

Data and versioning supported in a single satellite table

Feature Set structure

Business vault artefacts are nothing more than the same table structures as raw vault, except that the tables store the outcomes of derived (soft) rules based on raw and other business vault content, which in our case are features and feature sets (aka feature groupings). Business vault can therefore be made up of:

  • satellite tables — single row of applicable features at a point in time
  • satellite tables with one or more dependent child keys — this is when a sub-category is needed per business entity
  • multi-active satellite tables — when a SET of records are applicable at a point in time
  • link tables — when constructing a relationship between participating entities as the unit of work as the business sees it, or (in our case) what is needed to support machine learning experiments. This differs from how the source system has defined the unit of work and may in fact be deterministic relationships between business entities based on soft business rules.
  • an effectivity satellite table — by extension, if the business vault link table needs a way to track the current relationship record by a driver key, then a business vault effectivity satellite may be required. At the same time, the effectivity satellite can be expected to support tracking multiple relationships being active per driver key (or keys). Learn more about this data vault artefact here.

Depending on your needs, a feature set (satellite) can be supported by:

  • a wide table structure — typically found under stable feature sets or feature sets dedicated to specific ML models. Each column is strongly typed as a VARCHAR, NUMERIC, or other structured column data type.
  • a tall table structure — using Snowflake’s VARIANT data type, features are stored as key-value pairs which makes the management of features flexible and impervious to schema drift/evolution.
  • a combination of wide and tall

Schema drift on a wide table structure needs an ALTER TABLE statement, whereas the tall table structure does not. As described in the semi-structured satellite table article, a level of flattening of the semi-structured data is required to keep query performance optimal if you need to retrieve the feature vector (latest values) efficiently.
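A minimal sketch of a tall (key-value) feature satellite and a latest-value retrieval, using hypothetical table, column, and data-type choices:

```sql
-- Tall feature satellite: features stored as a VARIANT document, so new features
-- do not require an ALTER TABLE.
CREATE TABLE IF NOT EXISTS business_vault.sat_card_features_kv (
    dv_hashkey_card  BINARY(20)
  , applied_ts       TIMESTAMP_NTZ
  , dv_loadts        TIMESTAMP_NTZ
  , dv_recsource     VARCHAR
  , features         VARIANT        -- e.g. { "txn_count_30d": 42, "txn_avg_30d": 13.7 }
);

-- Retrieve the latest feature vector per card.
SELECT dv_hashkey_card
     , features:txn_count_30d::number AS txn_count_30d
     , features:txn_avg_30d::float    AS txn_avg_30d
FROM business_vault.sat_card_features_kv
QUALIFY row_number() OVER (
    PARTITION BY dv_hashkey_card
    ORDER BY applied_ts DESC, dv_loadts DESC) = 1;
```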

Tall vs Wide feature styles

This leads to another important concept in feature engineering that data vault’s standard loading patterns support…

Backfilling

Because raw vault feeds business vault, we can also use raw vault to backfill business vault features where needed. Feature backfilling is a common requirement for building up history to support quality ML model training from a point in time. When such an operation is desired, data vault only asks that backfilled data carry an indication of where those backfilled records came from by recording it as such in the record source column of each record. Also keep in mind that the applied date timestamp should reflect the raw and/or business vault applied date timestamps (where the feature came from), while all backfilled records share the same load date timestamp to show that the backfill occurred as a one-time load.

Note that, depending on how the feature is structured, an element of schema drift may be required.
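A minimal, hypothetical sketch of a backfill: historical applied timestamps are preserved, a single load timestamp marks the one-time load, and the record source flags the rows as backfilled.

```sql
-- Backfill pre-2023 history into a hypothetical feature satellite in a single load.
INSERT INTO business_vault.sat_card_txn_features
SELECT
    dv_hashkey_card
  , applied_ts                                   -- original raw vault applied timestamp
  , current_timestamp()               AS dv_loadts    -- one load timestamp for the whole backfill
  , 'bv.card_txn_features.v1;backfill' AS dv_recsource -- record source marks these rows as backfilled
  , iff(txn_amount < 0, 'REFUND', 'PURCHASE') AS txn_direction  -- example derived feature
  , abs(txn_amount)                   AS txn_abs_amount
FROM raw_vault.sat_card_transaction
WHERE applied_ts < '2023-01-01';
```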

Now that we have the width and depth of feature data available, each ML model will select what it needs from the feature table(s) as training, validation, and test datasets…

Feature Training using Information Marts

In a data vault, information marts are deployed as views; selecting the features you need into a feature view can be supported using the same techniques we are already familiar with in data vault for optimising the speed of information mart delivery. The same logarithmic and managed-window concepts utilised in PIT and Bridge tables, which we have discussed in earlier articles, can be used to support this.

As a refresher these table and view structures are:

The disposable artefacts of data vault

Using snapshot-based table structures like PITs and Bridges splits the feature views historically into a dataset used for ML model training and a dataset to test that ML model where predictions have not yet been made. Meanwhile, data can be continually loaded into the underlying data vault constructs of hub, link, and satellite tables, but because the PITs and Bridges used for feature selection have not had their keys and load dates updated, the feature views will still be based on the same snapshotted data.

If we were to rely on Snowflake clones of Time Travel periods of the business vault satellite, we would end up with all data up to that Time Travel timestamp, whereas a PIT is instead constructed over a PIT window (a start and end applied timestamp snapshot).
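A minimal sketch of a feature training view keyed off a PIT table; the PIT, satellite, and column names are hypothetical, and the equi-join pattern is the same one covered in the earlier PIT article:

```sql
-- Training mart: an equi-join from a PIT snapshot to the feature satellites.
-- The view only ever "sees" the keys and load dates captured in the PIT,
-- so new satellite loads do not leak into the training set until the PIT is rebuilt.
CREATE OR REPLACE VIEW info_mart.card_feature_training AS
SELECT
    p.dv_hashkey_card
  , p.snapshot_ts
  , f.txn_count_30d
  , f.txn_sum_30d
  , l.label_churned_90d
FROM pit.pit_card p
JOIN business_vault.sat_card_spend_features f
  ON  f.dv_hashkey_card = p.dv_hashkey_card
  AND f.dv_loadts       = p.sat_card_spend_features_loadts
JOIN business_vault.sat_card_labels l
  ON  l.dv_hashkey_card = p.dv_hashkey_card
  AND l.dv_loadts       = p.sat_card_labels_loadts;
```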

Feature engineering within Snowflake

Raw and business vault artefacts are the historized business process outcomes at that point in time. Raw vault does not dictate how those business process outcomes were calculated at the source system and neither does business vault dictate how the soft rules were calculated based on raw data.

What this means is that there is a separation of concern between feature transformation (code) and the table structure (data) storing those generated features, since we have the latter in place it does not matter how the former is executed.

Enter Snowpark

Snowpark for data engineering and machine learning

Snowpark is a feature-rich and secure framework for running transformations (feature engineering) and machine learning algorithms that are otherwise not possible using SQL alone in Snowflake. By adding the ability to run your Java, Scala, and Python within the platform, you no longer need to rely on external programming interfaces to run your transformations/algorithms. The only requirement in terms of data vault is that the output of your transformations be stored in a business vault link or satellite table as described above.

With transformations and model training performed within Snowpark, you not only reduce the latency needed to produce features and inferences, you also simplify your overall architecture. The friction of data movement is reduced and, by using the “Set and Forget” framework described in this article, the entire orchestration can be managed by Snowflake as well.
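As a hypothetical sketch, a Snowpark (Python) stored procedure registered through SQL; the table and procedure names are assumptions, and the handler derives features with the DataFrame API before handing the output to the business vault loading pattern:

```sql
-- Register a Snowpark Python handler as a stored procedure that Snowflake can orchestrate.
CREATE OR REPLACE PROCEDURE business_vault.sp_build_card_features()
  RETURNS STRING
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.10'
  PACKAGES = ('snowflake-snowpark-python')
  HANDLER = 'run'
AS
$$
from snowflake.snowpark.functions import avg, count, col

def run(session):
    # Hypothetical raw vault satellite; aggregate transactions per card hash key.
    txns = session.table("raw_vault.sat_card_transaction")
    features = (txns.group_by("dv_hashkey_card")
                    .agg(count(col("dv_hashkey_card")).alias("txn_count"),
                         avg(col("txn_amount")).alias("txn_avg")))
    # In practice this output would be staged and loaded through the standard
    # business vault satellite loader to preserve applied/load timestamps.
    features.write.mode("append").save_as_table("staged.card_feature_output")
    return "ok"
$$;

CALL business_vault.sp_build_card_features();
```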

Snowflake’s support for unstructured data also means you can annotate and process images, emails, PDFs and more into semi-structured or structured data usable by your ML model running within Snowflake.

C) Model Prediction Monitoring

Monitoring for Model Performance, ground truth vs predictions

ML models require constant monitoring, as their performance can degrade due to several factors that cause the model outcomes to skew.

These phenomena are known as:

  • Concept drift: the statistical properties of the target variable (the prediction) change in unforeseen ways
  • Data drift: the statistical properties of the predictor variables (the model inputs) change in unforeseen ways

Read more to understand drift here.

A business vault satellite table will be used to store ML model predictions to support this analysis.
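As a minimal, hypothetical sketch, a monitoring query over such a predictions satellite, comparing weekly average predictions against the ground truth once it arrives:

```sql
-- Weekly ground truth vs prediction comparison from a hypothetical predictions satellite.
SELECT
    date_trunc('week', applied_ts)         AS score_week
  , avg(predicted_churn_probability)       AS avg_prediction
  , avg(iff(actual_churned, 1, 0))         AS actual_churn_rate
  , avg(predicted_churn_probability)
      - avg(iff(actual_churned, 1, 0))     AS bias
FROM business_vault.sat_card_churn_prediction
GROUP BY 1
ORDER BY 1;
```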

To complete our story, we will touch on a very important suite of features natively supported on Snowflake.

Data governance

As a data management framework, feature stores must consider data privacy and data governance; Snowflake natively supports:

  • Dynamic data masking — based on a policy condition (such as the querying role), you decide who gets to see redacted or non-redacted column data; this also includes the ability to base masking on an adjacent column within the same table/view. These policies can be applied to structured and semi-structured data too (a sketch follows below).
  • Tokenisation — using the same policy-based approach as data masking, but with the option of obfuscating the content using partner software external to Snowflake.
  • Row access policies — a popular method of allowing access to specific data rows based on functional roles.
  • Auto-classification — using Snowflake’s trained semantic model, this system procedure classifies identifying and quasi-identifying columns with tags denoting probability and confidence, which you can then use to apply masking or tokenisation.
  • Anonymisation — applying k-anonymity or data hierarchies to anonymise sensitive data while still making it usable for ML model training.

Once the above are defined they are reusable, and from a data transformation and selection perspective they are transparently applied. A Snowflake role without the required access simply will not see sensitive data, because nothing within Snowflake can be executed or accessed without a role, a virtual warehouse to execute the work, and access to the database and schema objects themselves.
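A minimal sketch of the first of these, dynamic data masking; the policy, role, table, and column names are hypothetical:

```sql
-- Mask the card holder email for any role other than the "full access" data science role.
CREATE OR REPLACE MASKING POLICY governance.email_mask AS (val STRING)
  RETURNS STRING ->
  CASE
    WHEN current_role() IN ('DATA_SCIENTIST_FULL') THEN val
    ELSE '*** REDACTED ***'
  END;

ALTER TABLE raw_vault.sat_card_holder_pii
  MODIFY COLUMN email SET MASKING POLICY governance.email_mask;
```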

Know your data, Snowflake’s robust features you can customise to your needs

In addition, Snowflake also provides impact analysis and data provenance through:

  • Object dependencies — used primarily for impact analysis, this recursive view is useful for understanding what downstream or upstream objects in a lineage will be affected by any changes you introduce to your data objects.
  • Access History — this recursive view tracks every data object accessed by any role; it also tracks what data objects were created by that access, down to the column level, and what policy was in place when the content was accessed (a sketch follows below).
  • Object Tagging — a dynamic method for grouping Snowflake objects into cost-centres and for applying masking policies too.

Complementary to the above, Snowflake provides a robust set of observability views to track all the information you need. These are accessible within the ACCOUNT_USAGE and ORGANIZATION_USAGE schemas, and within the INFORMATION_SCHEMA of every database you create.
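For example, a minimal sketch of querying Access History to see which tables were read over the last week; the filter and projection are assumptions, while the view and its columns are Snowflake-provided:

```sql
-- Tables read in the last 7 days, flattened from the ACCESS_HISTORY base objects array.
SELECT
    ah.query_start_time
  , ah.user_name
  , obj.value:"objectName"::string AS object_name
FROM snowflake.account_usage.access_history ah
   , LATERAL FLATTEN(input => ah.base_objects_accessed) obj
WHERE ah.query_start_time >= dateadd(day, -7, current_timestamp())
ORDER BY ah.query_start_time DESC;
```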

Conclusions

“There is no AI without IA” — Seth Earley

Feature stores are a framework for managing and operationalising feature data requirements for machine learning models, and they have overlapping capabilities with a data warehouse. This means you can leverage the same agility, automation, and audit guarantees used in the data vault approach on Snowflake, along with your data engineering best practices.

Snowflake’s capabilities continue to lead the industry and strive for as little data movement as possible. Your data science workload can even be augmented or sourced from data providers in the Snowflake Marketplace or from direct data shares with your data partners. Not to mention the ever-expanding business cases being supported by managed, connected, or native application frameworks using Snowflake. Combined with Snowflake’s industry-leading standards for data governance and platform security, you can rest assured that your data science model deployment scales at Data Cloud scale.

Data Vault for Data Science

Note that this blog does not cover the full suite of capabilities of a feature store or the requirements of a full MLOps implementation, but it does include the data management components of the data vault methodology that you can reuse for your feature store on Snowflake.


The views expressed in this article are my own; you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.


Written by Patrick Cuba

A Data Vault 2.0 Expert, Snowflake Solution Architect
