ETL - Extract Transform Load
Once you have identified your business requirements, analysed your source systems and developed a data model for your Data Warehouse you can then start to look at the Extract Transform Load (ETL) processes that are critical in the success of the Data Warehousing project.
What is ETL?
ETL processes can be summarised as?
- the extraction of data from many heterogenous systems
- the transformation of this extracted data into structures and types that follow the business rules of the data warehouse
- the loading of this transformed (cleansed) data into the data warehouse structures in preparation for data analysis
The ETL design process is perhaps the most time consuming stage of the Data Warehouse project. It is often the case that over 50% of the time dedicated to the Data Warehousing project is spent on designing and developing the ETL processes.
Your ETL processes will determine the quality of data that ends up in your Data Warehouse and so it is vital that you get it right because if you put rubbish into it you will get rubbish out, i.e. reports that are of little use to your companies decision makers due to the incomplete / non consistent nature of the data that they are based upon.
ETL processes will need maintaining and changing with time due to changes in the data sources or the Data Warehouse business requirements. Badly designed ETL processes can lead to lengthy, unneccessary time and expense spent on maintaining and updating them.
It is important to keep in mind when we are designing our ETL processes that they should improve the data quality / integrity and although a transformation may alter the format / data type etc of the data, it should not change the meaning of the data.
Operational Meta Data
The Data Warehouse schema often contains Meta Data tags which help describe the data. This operational meta data enables easier adaptation of the extract, transformation and load processes and enables any amendments that are made to the data source systems / business rules to be more quickly integrated into the ETL design.
ETL Staging - Staged ETL design
ETL design is normally split into a sequence of logical stages. These separate stages give the design a more flexible, extensible base on which the ETL processes can be modified to suit the changing data sources / business requirements of the data warehouse.
As part of these stages the data is normally moved onto a separate physical staging area between the data sources and the data warehouse so that the data can be transformed without slowing the source systems or affecting the integrity of the live data warehouse.
Once these stages have been designed they form a process that can be used for repopulating the data warehouse with successive 'loads'.
A common sequence of ETL processes are:
- Data Extraction
- Data Verification
- Data Cleansing
This is the process that captures the data from the source system(s) and moves it to the staging database. The extraction can be a Full refresh where all the specified data from the source system is copied or an incremental upate which only copies the specified data that has changed since the last time the ETL process was run.
Various methods may be used for extraction such as copying data objects, using SQL to capture all or a subset of data, FTP to upload data files, replication etc.
This stage compares the data compares the data extracted in the previous step with the data quality rules specified in the business specification of the DW. It audit files are recorded from the source system the extract can also be compared against these to check we have extracted the right data, i.e. row counts and byte counts. The data quality rules that we are checking can be the responsibility of the owner of the source system OR the responsibility of the DW administrator and these responsibilities should be declared in the business specification.
If data does not meet the data quality rules then there are a number of options:
- Reject the data so it doesn't enter the DW
- Keep the data in a holding area so that it can be fixed before insertion into data warehouse
- Added to the data warehouse with its errors – this can lead to major problems down the line
When data is rejected or placed in a holding area the administrator of the source system must be informed and be given details of/access to the rejected data so that they can try and fix the issues that are causing the problems.
Data cleansing or data cleaning is the process where valid data is made more precise. Common aspects of data cleaning transformations include:
- Find and Replace – for instance to synchronise a building name where there were instances of the same building referred to under different abbreviations – i.e. 'London Health Centre', 'London HC', 'London Hth Cen' etc
- Convert Case – for example on the 'title' column of a 'customers' table, converting all instances of 'MRS', 'mrs' and 'MRs' to 'Mrs'
- Merging data from different data sources
- NULL value handling – conversion to a default value
- Data Type conversion – to syncronise data from different systems, i.e. the customerID in one system may be an integer but another system may store the integer as a character data type.
- Splitting of source files into multiple staging files that can be used to drive multiple target table loads.
- Inclusion of meta data to help describe the data in the Data Warehouse such as source system identifiers, end date etc.
Data Integration is the incorporation and merging of data from the different source systems into a unified data interface that can be used for front end analysis. Data Integration normally results in the creation of new entities and column structures and involves Join operations that tie together data from the different sources. These new entities and columns should be an easier to understand structure than the disparate source systems.
Once we have tables that are ready for loading into the DW we can perform summary calculations (aggregations) and store this summary data to enable quicker running of queries. When creating our dimensional data model it is essential that good paths of aggregation form part of the design of the dimension tables.
This is the stage that 'uploads' the transformed data to the live data warehouse. The speed of loading can be influenced by factors such as size of tables and the proportion of updates / inserts. Most Database Management Systems (DBMS) come with a bulk upload utility such as SQL Server's BULK COPY program. It can often prove quicker to drop entire tables and insert all rows using such bulk load utilities rather than trying to update all rows that have changed since the last update.
In this part of this series on an Introduction to Data Warehousing we have seen that the Extract, Transform, Load process is vital in achieving a reliable, usable data source for the front end analytical programs to use. ETL is the most time consuming and one of the most challenging aspects of data warehousing.