Data Mapping
An oldie but goodie…
How great is Excel? You can add comments below this article to express your love for it! I am a fan too, but what often happens is that Excel gets abused: used way beyond its limitations, way beyond what it was designed to handle (think VBA + Excel + 2m+ records!), and Excel starts getting used as a database. I do like Excel, and I use it for data mapping because chances are the business understands Excel, so we can easily talk about the elements in it without getting lost in translation.
The intention of this article is to set a few guidelines and principles around data mapping as it pertains to the use of Excel. Data Analysts can assist rapid development by providing data mapping in a form that is easily digestible by a coder and readable by the business.
Without further delay, let's get into the ten standards (plus an optional extra) I can think of for data mapping.
1) Build a mapping sheet that covers only one subject area
This accomplishes a few things:
- You can have multiple subjects being worked on at the same time by different people.
- Isolating mapping to functional areas, allowing teams and departments to have their own "copy" of the data structure they're responsible for.
- Easily add references in products like Confluence, JIRA etc., so if a page talks about credit cards, the link pulls in only the credit-card mapping.
2) Business artifacts must have version control
SharePoint and JIRA have check-in and check-out facilities, ensuring that only one person can edit the mapping at a time. But it is also a good idea to include versioning inside the document itself.
a. Document version tab — the changelog
As a rule the document suffix should also contain the version number.
b. Worksheet version = mapping version
Furthermore (and developers will love you for this) the version number should be included in the mapping tab too!
Add a filter
Select the version you need
Voila! No need to trawl through potentially hundreds of columns to get to the latest changes to the model!
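The version filter above can be sketched outside Excel too. A minimal illustration, where the row layout and column names are made up for the example:

```python
# Filter mapping rows to a single version, just as the Excel filter would.
# The mapping rows below are an illustrative assumption, not a real template.

mapping = [
    {"version": "0.1", "target_column": "ACCOUNT_K"},
    {"version": "0.1", "target_column": "ACCOUNT_NAME"},
    {"version": "0.2", "target_column": "RISK_RATING"},
]

def rows_for_version(rows, version):
    """Return only the rows belonging to the requested mapping version."""
    return [r for r in rows if r["version"] == version]

print(rows_for_version(mapping, "0.2"))  # [{'version': '0.2', 'target_column': 'RISK_RATING'}]
```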
3) Use multiple tabs
The tab approach lets you group related items together; it also lets you add some much-needed safety nets to your spreadsheet. Here's how.
a. Drill through from tables to columns
Rather than repeat the database, schema and table transformations in the column mapping, it is better to record that kind of information once.
b. Lookup tables for validation
Keeping a validation list also reduces the possibility of spelling errors when specifying column types and indexes. The syntax would be tied to the database you are populating; here we are using SQL Server.
A quick link on how to do this: https://support.microsoft.com/en-au/help/211485/description-and-examples-of-data-validation-in-excel
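The same safety net can be sketched outside Excel: validate the data-type column of a mapping against an allowed list, just as Excel's data-validation dropdown would. The column names and the allowed types below are illustrative assumptions:

```python
# Flag mapping rows whose data type is not in the lookup list.
# ALLOWED_TYPES is a made-up sample; your lookup tab defines the real list.

ALLOWED_TYPES = {"INT", "BIGINT", "DATE", "DATETIME2", "DECIMAL(18,2)", "VARCHAR(100)"}

def invalid_type_rows(mapping_rows):
    """Return (excel_row, bad_type) pairs where the data type fails validation."""
    return [
        (i, row["data_type"])
        for i, row in enumerate(mapping_rows, start=2)  # row 1 is the header in Excel
        if row["data_type"].upper() not in ALLOWED_TYPES
    ]

rows = [
    {"column": "ACCOUNT_K", "data_type": "BIGINT"},
    {"column": "OPEN_DT",   "data_type": "DATETIME"},  # typo: not in the lookup list
]
print(invalid_type_rows(rows))  # [(3, 'DATETIME')]
```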
4) Use filters and freeze panes
This helps readability and gives easy access to the items you want to map. It also minimizes mistakes, because you view and edit only the columns or tables you are interested in.
5) Map source to target from left to right — ALWAYS!
Like reading a book your eyes follow a certain direction because the story is going somewhere… the data too! Left to Right, Source to Target.
6) Use some column intelligence
Another feature that helps readers navigate your template is the mapping type. Using filters you can see which columns were system generated, conformed, derived or mapped "as is" from source. Here's what I mean:
1. System Generated
Think of these columns as "tags": columns needed by the system to make the model work or to link the tables together Kimball-style. In modelling terms it is up to you to decide the naming standards; here I have used the suffix "_K" for keys. TO_DTM is "Effective To Date" and is managed by the ETL tool you use. By looking at the tables that have these columns you can deduce that they are Type 2 dimensions and that we are using date and time (are we micro-batching?), but where are the "Effective From Date" columns? As a helpful tip, you also need to decide the plurality of your columns and tables, i.e. do we say D_ACCOUNT or D_ACCOUNTS? Decide up front, before building data models for your enterprise. You guessed it: "D_" is for dimensions and "F_" is for facts.
2. Conformed
And speaking of plurality: a conformed column has had no transformation applied to it, but the column name has been modified to conform to corporate standards. This may include data types as well, e.g. DATE to DATETIME2.
3. Derived
A derived column is something that we did not get from the source tables but deduced from what the source gave us. This could be changing codes to enterprise standards. We have an example here for Gender, which is derived from the literals '1' and '2' and changed to 'M' and 'F'. This type of decision should also be made up front.
4. 1–1
Finally no change was applied; we take the fact as is from the source table.
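The Gender derivation described under item 3 can be sketched as follows. The fallback 'U' for unknown source codes is my assumption; like the mapping itself, decide it up front:

```python
# Derive the enterprise-standard gender code from the source literals.
# GENDER_MAP mirrors the '1' -> 'M', '2' -> 'F' example in the article;
# the 'U' (unknown) fallback is an illustrative assumption.

GENDER_MAP = {"1": "M", "2": "F"}

def derive_gender(source_code):
    return GENDER_MAP.get(source_code, "U")

print([derive_gender(c) for c in ["1", "2", "9"]])  # ['M', 'F', 'U']
```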
7) Use some table intelligence
If our model does explode into a snowflake, then we would like to track easily (with tagging) which tables are direct copies of source and which are not.
1. Generated
Date dimensions are typically generated, or they could be pulled in from a spreadsheet supplied by the business. The date dimension may include business or jurisdiction type columns for financial periods, public holidays and a whole lot more! If there are no changes needed for the date dimension then mapping it here may not be necessary. Don't map what you don't need to! Especially if the table already exists elsewhere. Also note that the date dimension is a role-playing dimension, meaning that depending on what you join it with, you retrieve the date you want. Example: the fact table for sales may have multiple dates, and if we are relying on the date dimension for dates then we would likely join it multiple times: Order Date & Sale Date.
http://www.sqlservercentral.com/scripts/Data+Warehousing/156805/
2. Join
More than one table is used to create the target table; in this particular example I needed to pull in the target table to create a Type 3 (limited history) column called Previous_Credit_Limit. I am SQL savvy, so I would write what I think is the T-SQL code in the transformation column; this may be edited by the developer to stipulate what the join actually is.
3. Load
Finally if we are loading the table “as is” we would like to see that too.
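The Type 3 logic behind item 2 above can be sketched like this: when the incoming credit limit differs from the one already in the target, the old value moves into Previous_Credit_Limit. The table shapes and account IDs are illustrative; the real transformation is the T-SQL join the developer finalises.

```python
# Type 3 (limited history) sketch: keep only the immediately previous value.
# 'current' and 'incoming' are dicts of account_id -> credit_limit and stand
# in for the target and source tables; names are made up for illustration.

def apply_type3(current, incoming):
    result = {}
    for account_id, new_limit in incoming.items():
        old_limit = current.get(account_id)
        result[account_id] = {
            "Credit_Limit": new_limit,
            # record the prior value only when the limit actually changed
            "Previous_Credit_Limit": old_limit if old_limit != new_limit else None,
        }
    return result

print(apply_type3({"A1": 5000}, {"A1": 7500, "A2": 1000}))
```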
8) Use light colouring to easily distinguish the related columns
Obviously what you have seen so far has had colour applied to it. That is intentional, and the light tone is intentional too: it is easy to read and not harsh on the eyes. Secondly, the colours are grouped by their place in the mapping.
Importantly, the colours line up Source Table to Source Columns with Transformations, Versions and Target Tables to Target Columns. You might have guessed it: the tables in the column mapping are locked by validation to the tables in the table mapping tab.
9) Include column ordering, column types and indexes
Nice to have but not vital; this alludes to being SQL savvy and knowing something about the underlying database. As a data analyst you need to understand how new records are arriving, the underlying database data types and the column order of the target. Imagine you want to add a new column: an ALTER TABLE means the new column will be added to the far right of the table. And what about the data already in the target table? Do we set a default value for the historical rows? Index changes are tough to decide too; think about what a clustered index is and how you would load it. What about using Columnstore or partition keys? The Target Length is important when considering the target table restrictions: SQL Server has some table size performance considerations (8 KB row width comes to mind). Again, ask the expert.
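The two behaviours above are easy to demonstrate with the standard library's sqlite3: ALTER TABLE ADD COLUMN appends the new column on the far right, and a DEFAULT decides what the historical rows show. (Note that in SQL Server you would need NOT NULL or WITH VALUES for the default to backfill existing rows; the table and column names here are made up.)

```python
# Demonstrate column ordering and default backfill with an in-memory SQLite DB.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE D_ACCOUNT (ACCOUNT_K INTEGER, ACCOUNT_NAME TEXT)")
con.execute("INSERT INTO D_ACCOUNT VALUES (1, 'Savings')")

# The new column lands after ACCOUNT_NAME; the historical row gets the default.
con.execute("ALTER TABLE D_ACCOUNT ADD COLUMN RISK_RATING TEXT DEFAULT 'UNRATED'")

cols = [row[1] for row in con.execute("PRAGMA table_info(D_ACCOUNT)")]
print(cols)  # ['ACCOUNT_K', 'ACCOUNT_NAME', 'RISK_RATING']
print(con.execute("SELECT * FROM D_ACCOUNT").fetchall())  # [(1, 'Savings', 'UNRATED')]
```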
10) Include notes with dates and decisions with reasons for change. Strikeout changes
Ah documentation… the bane of programmers, but vitally necessary to deal with staff churn and just plain remembering why the hell you did something! Avoid the "what was I thinking?" argument you'll end up having with yourself by ensuring the documentation is sound!
Leave a date, an initial and a reason.
Strike out changes, don’t delete them. Leave a reason.
11) Derive DDL from mapping in Excel (OPTIONAL but helpful)
Finally, if you can auto-generate the DDL then do so; this reduces typing, avoids fat-finger mistakes and speeds up development. Generate the code with some simple formulas and then just copy and paste it into the development environment.
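One way to do the auto-generation, sketched in Python instead of Excel formulas: walk the mapping rows in order and emit a CREATE TABLE statement. The mapping rows and target names below are made up for illustration.

```python
# Build SQL Server-style DDL from mapping rows: (name, sql_type, nullable),
# in mapping order. A sketch only; a real template adds keys, indexes etc.

def create_table_ddl(table, columns):
    lines = [
        f"    {name} {sql_type} {'NULL' if nullable else 'NOT NULL'}"
        for name, sql_type, nullable in columns
    ]
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"

ddl = create_table_ddl(
    "D_ACCOUNT",
    [("ACCOUNT_K", "BIGINT", False),
     ("ACCOUNT_NAME", "VARCHAR(100)", True),
     ("TO_DTM", "DATETIME2", False)],
)
print(ddl)
```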
Excel for requirements gathering
Another example of using Excel templates for business is in the space of Model Storming using BEAM (modelstorming.com). Take a look at the book by Lawrence Corr, as it contains a wealth of terminology, methodologies and Excel templates for requirements gathering.
Technical numbering
A note about this: during the development phase, stick to incrementing the right side of the period. Once committing to production, increment the left side of the period; like so…
Development: 0.1, 0.2, 0.3, 0.4
Production: 1.0
Development: 1.1, 1.2, 1.3
Production: 2.0
and so on!
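The numbering scheme above fits in a tiny helper: development bumps the right side of the period, while a production release bumps the left side and resets the right to zero. A sketch only; the function name is my own.

```python
# Bump a "major.minor" technical version number per the scheme above.

def bump(version, release=False):
    major, minor = (int(part) for part in version.split("."))
    if release:
        return f"{major + 1}.0"   # committing to production
    return f"{major}.{minor + 1}" # still in development

print(bump("0.3"))                # 0.4
print(bump("1.3", release=True))  # 2.0
```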
Template Sample for Free
Finally, here is a link for a downloadable version of my Excel template. If you do use it, just let me know!