You might be doing #kimballmodelling wrong!

12 min read · Apr 21, 2025

Why is this important?

Architecture. Surrogate hash keys are durable surrogate keys, meaning the value generated will always have a one-to-one cardinality with the natural key used to generate it.

Why not just use the natural key instead?

A great question. Surrogate keys are used to simplify join conditions between tables, and in the case of a data vault, related hubs, links and satellites can be loaded in parallel too. However, the surrogate join key between facts and dimensions is meant to be temporal: a dimension table record id that, when joined to a fact, always selects the correct dimension record at a point in time (no 'between' clause needed at query time).

Can we instead hash the natural key and the date together to make the generated surrogate key temporal?

Think about what you would be suggesting: at the point you attempt to parallel-load a fact and its dimension, you have not yet determined whether the dimension record is a true change.

When loading facts against related dimensions, you need to look up (retrieve) the temporal key from the type 2 dimension, picking the nearest key by fact date. Hashing itself is not free, and even stored as a binary/raw data type a hash uses more bytes than an integer. Why not look up an autoincrement/identity column defined when the dimension table was created? It is cheaper and faster to join on.
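For the fact load itself, a minimal sketch of that key lookup against a type 2 dimension might look like this (hypothetical table and column names):

```sql
-- Hypothetical names: stg_payment (staged facts), dim_account (type 2 dimension).
-- Retrieve the integer surrogate key valid at the fact's date; default to -1
-- (late-arriving dimension) when no match is found so an inner join still works later.
INSERT INTO fact_payment (account_sk, payment_date, payment_amount)
SELECT COALESCE(d.account_sk, -1) AS account_sk,
       s.payment_date,
       s.payment_amount
FROM   stg_payment s
LEFT JOIN dim_account d
       ON  d.account_id   = s.account_id        -- natural/business key
       AND s.payment_date >= d.start_date
       AND s.payment_date <  d.end_date;        -- validity window of the type 2 record
```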

Why is this important?

Architecture. Kimball prescribed many techniques for many scenarios in fact and dimension table design; dealing with non-additive facts (like exchange rates) is one of them. Picture a matrix of fact table types (transaction, periodic snapshot, accumulating snapshot) against fact types (additive, semi-additive, non-additive) in a Kimball model.

Of course we could also consider a matrix of dimension table types.

Reference: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/

Why is this important?

Model. The beauty of temporal surrogate keys is that all you need to attack the central fact table in a star schema is a date+time dimension; through that foreign key relationship your query will always return the correct point-in-time records from all the other dimensions included in your query.

Where does the date and time fit in?

The date and time dimensions are crucial role-playing dimensions, ensuring you can aggregate by date and/or time and slice and dice across time. Add business-date flags to a date or time dimension and you can filter on them in a where clause and aggregate accordingly. If the date columns used across dimensions carry the same meaning you can also pick facts based on different business dates and times (for example, a transaction may have different entered and posted timestamps).
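As a sketch (hypothetical names), the same date dimension role-plays once per business date on the fact, and a business-day flag does the filtering:

```sql
-- dim_date role-plays as both the entered and posted date of a payment.
SELECT   posted_dt.calendar_month,
         SUM(f.payment_amount) AS total_payment_amount
FROM     fact_payment f
JOIN     dim_date entered_dt ON entered_dt.date_sk = f.entered_date_sk
JOIN     dim_date posted_dt  ON posted_dt.date_sk  = f.posted_date_sk
WHERE    posted_dt.business_day_ind = 'Y'        -- business-date flag in the dimension
GROUP BY posted_dt.calendar_month;
```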

Why is this important?

Model & Cost. Everything in its right place: metrics tend to change rapidly and are typically aggregated to show some collective statistic over time. If a non-aggregated dimension attribute changes rapidly, consider moving it into a non-additive fact table instead.

As for cost, updating a record (end-dating an SCD Type 2 end-date column) must modify the record in place. On file-based storage this is a copy-on-write disk operation, an expensive operation that churns many files and can also mean the performance of your table's indexes deteriorates rapidly.
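To make the cost concrete, here is a minimal sketch (hypothetical names, Snowflake/PostgreSQL-style UPDATE ... FROM) of the in-place end-dating a type 2 load performs; on copy-on-write storage every touched file or micro-partition is rewritten:

```sql
-- End-date the superseded current record only when the attributes truly changed.
UPDATE dim_account d
SET    end_date       = s.extract_date,
       current_record = 'N'
FROM   stg_account s
WHERE  d.account_id       = s.account_id
AND    d.current_record   = 'Y'
AND    d.account_hashdiff <> s.account_hashdiff;   -- true change detection
```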

Why is this important?

Architecture. Business keys (even numeric ones) are not metrics. There is no value in storing them as a numeric data type. Should a dimension later need to track multiple formats of a business key (alphanumeric as well as numeric), not having used a string data type will force you to refactor that dimension.

Worse, some business keys carry significant leading zeros, and those are lost if the business key is cast to a numeric data type.
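A trivial illustration of the problem:

```sql
-- Cast a business key with leading zeros to a number and the zeros are gone.
SELECT CAST('00042317' AS INTEGER)     AS numeric_key,   -- 42317
       CAST('00042317' AS VARCHAR(20)) AS string_key;    -- '00042317'
```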

Why is this important?

Model. You need to ensure that the same grain and representative date are used between these tables. Secondly, you should use a twine (union all) or ASOF joins to bring the nearest facts together as a set operation rather than a row-by-row join.
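A minimal sketch (hypothetical names, Snowflake-style ASOF JOIN; other engines such as DuckDB have their own syntax) that picks the nearest exchange-rate fact at or before each payment:

```sql
SELECT f.payment_id,
       f.payment_amount,
       r.exchange_rate,
       f.payment_amount * r.exchange_rate AS payment_amount_usd
FROM   fact_payment f
  ASOF JOIN fact_exchange_rate r
    MATCH_CONDITION (f.payment_ts >= r.rate_ts)   -- nearest rate at or before the payment
    ON f.currency_code = r.currency_code;
```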

Why is this important?

Model. A key benefit of modelling facts and dimensions is the execution of a right-deep join tree in your query plan. Fact tables are expected to be far more voluminous than the surrounding dimension tables, so the OLAP engine loads the smaller dimension join keys into in-memory hash tables during the build phase, then streams the fact table rows through those hash tables, probing for matches, during the probe phase (hence 'build and probe').

You will see the right-deep join tree in your query plan, and it is the optimal method for joining facts and dimensions.
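If you want to check, inspect the plan of an ordinary star join (hypothetical names); the small dimensions sit on the build side of the hash joins and the large fact is probed:

```sql
EXPLAIN
SELECT   c.customer_name,
         dt.calendar_year,
         SUM(f.order_amount) AS total_order_amount
FROM     fact_order f
JOIN     dim_customer c ON c.customer_sk = f.customer_sk
JOIN     dim_date dt    ON dt.date_sk    = f.order_date_sk
GROUP BY c.customer_name,
         dt.calendar_year;
```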

Why is this important?

Model. There’s nothing worse than querying a data model with mixed standards for column names, data types, join keys or inconsistent domain values.

Let's explain:

  • Naming standards

To consistently call a thing a thing, first we need to understand what the business terms are. Do we deal with customers or clients? Where are their definitions? Choose the term the users actually use, and from then on your data assets refer to 'customer' instead of 'client' whenever you work with customers. Do not let source system naming conventions dictate your business-facing data model.

Next, decide how to consistently identify what a table is, e.g. dimension tables are prefixed with 'dim_', fact tables with 'fact_', reference tables with 'ref_', etc.

With columns it's a little more nuanced, as we need to convey what the column describes and how to use it. For dimensional columns you may consider the abbreviations 'curr_' for current and 'prev_' for previous (don't use 'last_' for previous values; it literally means the latest value, not the previous one!).

Because a dimension table could contain multiple dimension types, we encode the type in the column name itself. For instance, use the modifier '_prev_' for any type 3 attributes and perhaps '_last_' for any type 1 attributes. All other attributes are type 2 dimensions, for which the modifier '_curr_' may be used.

As for the non-Kimball portion of the column name, it should be singular (table names are singular too) and follow the form:

subject_modifier_classword

Subject: customer, product, order, invoice, account, vendor, contract, shipment, pmt (payment), policy, asset

Modifier: effective, avg (average), prev (previous), curr (current), temp (temporary), max, min, total, annual, daily, monthly

Class word: date, description, type, status, category, name, count, id, code, ind (indicator), sk (surrogate key)

Shortening names saves keystrokes and makes selecting a column in a query more intuitive, but exercise caution: don't make names so short that no one gets them right away and has to be constantly reminded of what the column is! Avoid acronyms unless they are business terms everyone in the business knows, and if you are inclined to use acronyms, take care not to use one that might have a double meaning!
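Putting those naming standards together, a sketch of a hypothetical dimension table might look like this (illustrative names only):

```sql
CREATE TABLE dim_customer (
    customer_sk               INTEGER      NOT NULL,  -- surrogate key
    customer_id               VARCHAR(20)  NOT NULL,  -- business key, kept as a string
    customer_curr_name        VARCHAR(100),           -- type 2 attribute
    customer_curr_status_code VARCHAR(20),            -- type 2 attribute
    customer_prev_status_code VARCHAR(20),            -- type 3 attribute
    customer_effective_date   DATE                    -- subject_modifier_classword
);
```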

  • Data type standards

These are vital across the data type spectrum: do you use varchar or char, numeric(38,0) or numeric(36,2)? Pick one and don't mix. Datetime formats are especially crucial because, depending on where these columns were recorded, the format used may be incompatible with other systems; this must be understood before they are standardised into your data model. Time zones and locales represent dates and times differently, however when querying over a dimensional model they should be represented consistently and never stored as a string data type.

Here is how Snowflake's timestamp variations treat time zones differently:

TIMESTAMP_LTZ (local time zone) — stored internally as UTC; operations and display use the session's local time zone.

TIMESTAMP_NTZ (no time zone) — stored as 'wallclock' time (the time on the wall), with no time zone.

TIMESTAMP_TZ (time zone) — stores the UTC time together with its time zone offset.

Reference: https://docs.snowflake.com/en/sql-reference/data-types-datetime#label-datatypes-timestamp-variations
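A minimal sketch of the difference (Snowflake syntax, per the reference above):

```sql
ALTER SESSION SET TIMEZONE = 'Australia/Sydney';

SELECT CURRENT_TIMESTAMP()::TIMESTAMP_LTZ AS ts_ltz,  -- stored as UTC, shown in the session time zone
       CURRENT_TIMESTAMP()::TIMESTAMP_NTZ AS ts_ntz,  -- wallclock only, no time zone retained
       CURRENT_TIMESTAMP()::TIMESTAMP_TZ  AS ts_tz;   -- UTC plus the time zone offset
```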

  • Join keys, or metadata column naming and data type standards

The columns that carry no business value but are used to integrate, ingest and query the data should have a set naming and data type standard. Here we are talking about:

o Primary keys should be named after the table and suffixed with a keyword such as '_sk', '_pk' or '_sid'. Pick one. Primary key constraint/index names should include the letters 'pk' and the table name, like 'pk_dim_account'.

o Foreign key column names should match the primary key they reference where possible; the foreign key constraint/index name should be constructed like 'fk_fact_payments_dim_account', capturing the fact table and its relation to the dimension table.

o Start and end dates should be consistently labelled as 'start_date' and 'end_date' and use the same date data type (if you are dealing with time too, adjust for your scenario). You also need a universally agreed high-date value for the end-dates of current records; for example, '31-Dec-9999 23:59:59' should suffice.

o Current or active flags should be labelled accordingly, like 'current_record' or 'active_record'. As with any standardised indicator or flag, the values should be set consistently (1/0, 'Y'/'N', True/False) to denote whether the record is current.

o Deleted flags should be labelled accordingly, like 'deleted_record', with a consistent value.

o Default keys are needed to ensure an inner join is all you need between facts and dimensions. The standard requires pre-loading dimension tables with keys below zero, like '-1' for a late-arriving dimension record, '-2' for an invalid record, etc.
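Pulling those metadata standards together, a minimal sketch (hypothetical names, descriptive attributes omitted) of a dimension with its default rows:

```sql
CREATE TABLE dim_account (
    account_sk     INTEGER     NOT NULL,   -- 'pk_dim_account'
    account_id     VARCHAR(20) NOT NULL,   -- business key
    start_date     TIMESTAMP   NOT NULL,
    end_date       TIMESTAMP   NOT NULL,   -- '9999-12-31 23:59:59' for current records
    current_record CHAR(1)     NOT NULL,   -- 'Y' / 'N'
    deleted_record CHAR(1)     NOT NULL    -- 'Y' / 'N'
);

-- Default keys below zero so facts can always inner-join to the dimension.
INSERT INTO dim_account VALUES
    (-1, 'LATE ARRIVING',  '1900-01-01', '9999-12-31 23:59:59', 'Y', 'N'),
    (-2, 'INVALID RECORD', '1900-01-01', '9999-12-31 23:59:59', 'Y', 'N');
```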

  • Domain value standards

Finally, transformations may be applied to raw data by joining data together, through clauses and conditions, or by all of the above. What these transformations are doing is producing nominal and ratio values with standardised domains.

Nominal values represent a finite set of states and can therefore be used to categorise dimensions, such as a standardised gender code (like 'M' for male and 'F' for female), and to build hierarchies and taxonomies you can drill through to slice and dice your data. When the data is cubed (Excel pivot table, ROLAP, MOLAP or HOLAP), the related measures should aggregate according to the selected dimensions in your drill path.
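A minimal sketch (hypothetical source values) of standardising one nominal domain during transformation:

```sql
-- Map every source system's spelling of the same state onto one domain value.
SELECT CASE UPPER(TRIM(src_gender))
            WHEN 'M'      THEN 'M'
            WHEN 'MALE'   THEN 'M'
            WHEN 'F'      THEN 'F'
            WHEN 'FEMALE' THEN 'F'
            ELSE 'U'                      -- unknown / not provided
       END AS customer_gender_code
FROM   stg_customer;
```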

Why is this important?

Model. Being able to match on something like an extract date is vital for tracking business changes to a record, even if it is not a business date. Let’s clarify.

  • A business date is the result of some business rule; it can be a future date (like when a policy will become effective) or a date based on a past event;
  • a business date is susceptible to data quality failures;
  • and a business date is often not consistently applied across business processes.

For these three reasons an extract date is more suitable: it is the consistent date that reflects the state of the information at that point in time.

Why not use a load date?

A load date comes from a context function like current_date() or from a load-date parameter for a batched load or extract. The gap between extract and load is susceptible to run failures, which can cause some of a batch to load a day later than the rest. Furthermore, the source system may have had multiple updates between those dates or datetimes. The safest choice is the extract date, or what we call in data vault the applied date. This ensures you can refactor when you need to and that the data reflects operational reality as closely as possible.

Why is this important?

Practice. Considering that dimensions should essentially describe the common entities your organisation analyses, the bus matrix records the activities or events (business processes) against the dimensions they involve and is a key artefact for designing star schemas. An event matrix (by Lawrence Corr) takes the analysis a step further by denoting which events create dimensions and which events use them.

Pair these techniques with Steve Hoberman's data model scorecard and you will have a solid grip on your Kimball-based design.

Why is this important?

Architecture. As awesome as dbt is, it's not Kimball modelling. Over and above the reasons discussed above, the issue with dbt lies in its scalability, giving engineers the ability to build as many dimensions as there are business questions. That is good, but it also means that some less disciplined analytics environments end up running and managing upwards of 10,000 dbt models. To a consumption-based data platform that's music to its ears; to the customer it's muda (waste). We discussed this topic in detail here.

Topping this much waste with a semantic layer is not a solution; it's akin to a Potemkin village.

Why is this important?

By now you might have seen something you were missing in Kimball data modelling, or perhaps you already knew all the above. Not modelling dimensionally because 'storage is cheap' misses the point of data modelling altogether. A Kimball model is not only storage efficient but also makes use of the OLAP platform's capabilities. The difference between data modelling today and two decades ago can be attributed to volume, variety and veracity. Technology has introduced new ways to bring that to the fore; building a Kimball-style model eliminates waste and shifts model complexity left.

Reference: https://www.reddit.com/r/dataengineering/comments/yzjqv2/is_dimensional_modelling_dead/

Why is this important?

Why is this surrogate key different? The date dimension is an ephemeral table we might want to rebuild as often as we need, and if we used the standard temporal surrogate key column (identity/autoincrement), recreating the table risks generating surrogate keys that no longer match the existing date/time foreign keys in a fact table. Yes, for this role-playing dimension we need a durable surrogate key.

Reference: https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/dimension-surrogate-key/
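A minimal sketch (Snowflake syntax, hypothetical names) of such a durable key: one common choice, and the exception Kimball allows for the date dimension, is a meaningful YYYYMMDD integer, which any rebuild regenerates identically:

```sql
-- Regenerating the date dimension always produces the same keys,
-- so existing fact foreign keys are never invalidated.
SELECT TO_NUMBER(TO_CHAR(d.calendar_date, 'YYYYMMDD')) AS date_sk,
       d.calendar_date,
       DAYNAME(d.calendar_date)                        AS day_name
FROM  (SELECT DATEADD(day, ROW_NUMBER() OVER (ORDER BY SEQ4()) - 1,
                      '2000-01-01'::DATE) AS calendar_date
       FROM   TABLE(GENERATOR(ROWCOUNT => 20000))) d;
```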

Why is this important?

Rudimentary but come on, can you see the unnecessary duplication you have to solve at query time?

Why is this important?

Architecture. Querying physical tables lets the engine use the optimal algorithm for your SQL joins; deployed as views, the SQL planner likely will not, and worse still, you must execute the same query complexity on every query, chewing CPU cycles.

Yes, storage is cheap, but compute is not!

Why is this important?

Model & Architecture. Snowflake schemas may not be optimal for your queries, although when those outrigger dimensions are reference tables they could be preloaded into memory and simply referenced in your dimensional queries.

The views expressed in this article are that of my own, you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.

Written by Patrick Cuba

A Data Vault 2.0 Expert, Snowflake Solution Architect
