Designing the Data Warehouse structure - Dimensional Modelling
After analysing business requirements of the Data Warehouse the next stage in building the Data Warehouse is to design the logical model. In order to go about designing this model we must first understand the different requirements between Transactional Data systems and the Reporting systems of the Data Warehouse.
Transactional systems such as sales databases have a rapid turnover of data with a large amount of INSERTS / UPDATES / DELETES. The model used for transactional systems is a relational model where data is normalised with the aim of not having duplicated or redundant data in the database.
The relational design of transactional systems isn't so suited to reporting needs as it often means creating complicated joins from the related tables. Reporting systems tend to analyse data by dimensions. An example of these dimensions can be illustrated by seeing some of the reporting needs of a company that manufactures and sells a range of widgets. The dimensions are highlighted in red:
- identify sales by region
- identify sales by product category
- identify units made by factory.
- identify units sold by customer
- identify volume of sales by time
Dimensional modelling uses the concept of fact tables and dimension tables. Both these types of tables are regular database tables with rows and columns. The fact table joins to the dimension tables using primary and foreign key relationships in the same way joins and entity relationships are defined in Relational data modelling.
A fact table records measurable facts such as a sales transaction and contains numerous foreign keys that are used to join the fact table to the dimension tables. This enables the fact data to be sliced by the required dimensions.
Fact tables tend to have a large number of records (up to billions) whilst dimension tables tend to have a small number of records (a few to a few thousand). Dimension tables tend to have a small number of columns, they contain the different levels of detail that you want to view your reports at. For instance the time dimension might include levels for year, quarter, month, weekOfYear and dayOfYear. A Region dimension might contain the following levels: Continent, Country, Region and City.
The diagram below shows the fact table in the centre and the related dimension tables radiating out from the centre. This is what is known as a star schema and is the most common form of dimensional modelling.
#### IMAGE ###
You may have a number of fact tables in your data warehouse but each fact table will have a number of dimension tables that relate to it.
Whats so different about Dimensional modelling?
You might be thinking that if there are still related tables using primary and foreign keys then what is the difference in dimensional modelling? The difference lies in the fact that the dimension tables hold redundant data. They are not fully normalised. Let us take the example of the company that sell their widget products.
In a relational database the products part of the database might look something like the following:
### IMAGE ###
PRODUCT ProductID, ProductCategoryID, ManufacturerID, SupplierID, Title, Description
PRODUCTCATEGORY CategoryID, CategoryTitle, CategoryDescription
MANUFACTURERS ManufacturerID, ManufacturerName, ManufacturerDescription
SUPPLIERS SupplierID, SupplierName, SupplierTelephone
In a dimensional database the product dimension would contain the required information from the above tables in one dimension table:
#### IMAGE ###
ProductID, ProductTitle, CategoryTitle, ManufacturerName, SupplierName
This means that in a dimensional design, a new table isn't created for each dimension level, instead more redundant data is stored in the dimension table.
Occasionally the dimension table may be normalised into more than one table and this gives a Snowflake schema dimensional model type. A snowflake schema is an extension of the star schema and is the 2nd most common type of Dimension modelling schema.
Continuing from our example above the Product dimension could be split into the following structure to give a snowflake schema:
PRODUCT ProductID, ProductTitle, ProductCategoryID, ManufacturerName, SupplierName
PRODUCTCATEGORY ProductCategoryID, CategoryTitle
There would then be a relationship from the Sales fact table to the Product dimension table and a relationship from the Product dimension table to the ProductCategory dimension table.
Most Data Warehouse models start as a star schema and if a dimension table becomes unwieldy then it undergoes a degree of normalisation and the model evolves into a snowflake schema.
Slowly changing Dimensions
The values in dimension tables can change over time and this can prove to be the cause of one of the biggest headaches that a DW designer must overcome. For our company above when using the star schema, what happens if a CategoryTitle changes its name? Options could include:
- Updating the existing rows that belong to that Product Category in the product dimension table so that they just have their CategoryTitle updated. This is the easiest option but does not give us any historical tracking of the previous category name.
- Because data warehouses are often used to track data trends over time an 'ObsoleteDate' column is often added to dimension tables so that when a value in a dimension table changes it can be tracked over time. We could create a new row for each of the products in this category so that we can keep track of the previous category name? If doing this then the rows with the previous category name must have their obsolete date 'activated' so that they won't appear in the same reporting 'time zone' as the new rows. When using this method, the product ID could no longer be used as the unique identifier for the Product tables primary key as each product would be entered more than once. This means that an autogenerated SequenceID column would need to be added as a unique identifier.
- Creating a new entirely unrelated
Tracking the history of changes in dimension table values is often referred to as slowly changing dimensions.
This article has examined the stage of the Data Warehouse lifecycle that occurs after the business requirements have been identified. It has shown us that in order to slice and analyse data by different dimensions that contain hierarchal levels of analysis we must use a dimensional data modelling process rather than a relational modeling process.
The dimensional model is characterised by one or more fact tables that are related to a number of dimension tables that contain redundant data. The most common result of the relations between the fact and dimension tables is a star schema and this may be extended into a snowflake schema by partly normalising one or more dimension tables.
Tracking the changes to values in dimension tables is a challenge that the DW designer must anticipate and overcome. In the next article in this series on 'An introduction to Data Warehouses and Data Warehousing' we will look at how to populate the fact and dimension tables with data that comes from the heterogenous data systems using a process known as Extract, Transform and Load (ETL).