SQL script for data cleaning peoples names to be the correct case
When writing Extract, Transform and Load (ETL) routines for a data warehouse there may be times when you are importing data about customers, employees, subscribers etc which includes their name. People may have entered the names into the data sources in different formats, e.g. all upper case, all lower case, capitalised first letter, mixed case etc.
This article demonstrates a data cleaning SQL script that 'repairs' these different name formats into a uniform format of first letter of each part of the name capitalised as well as capitalised letters following a dash (for hyphenated names), spaces (for complete names or middle names) and apostrophes (for O'Neil, O'Murphy etc).
The script uses a number of SQL String Functions to achieve this.
As this script will be used in our data transformation routines we will wrap it in a User Defined Function (UDF) so that we can easily apply it from a SELECT statement. Placing the code in a UDF has the additional benefits of the code being pre-compiled and in a centralised location enabling easy maintenance / debugging of the code.
The complete function definition can be found below. The UDF is called CleanNameCase and takes one input parameter (the name to be cleaned) and returns the cleaned name as a varchar(30).
CREATE FUNCTION [dbo].[CleanNameCase] ( @Name varchar(30) ) RETURNS varchar(30) AS BEGIN -- first trim the name SET @Name = LTRIM(RTRIM(@Name)) -- declare variable to hold the reset name DECLARE @Reset varchar(30) SET @Reset = '' If @Name <> '' BEGIN -- declare and assign variables that will be used to -- loop through each character in the name DECLARE @CharCount int, @LoopCount int SET @LoopCount = 1 SET @CharCount = Len(@Name) -- should the next character we append be upper case? -- first character is always upper case. DECLARE @MakeUpper bit SET @MakeUpper = 1 WHILE @LoopCount <= @CharCount BEGIN DECLARE @Character char SET @Character = Substring(@Name, @LoopCount, 1) -- append this character to the value we will return IF @MakeUpper = 1 SET @Reset = @Reset + UPPER(@Character) ELSE SET @Reset = @Reset + LOWER(@Character) -- work out if the next character should be upper case SELECT @MakeUpper = CASE WHEN @Character = '-' THEN 1 WHEN @Character = ' ' THEN 1 WHEN @Character = '''' THEN 1 ELSE 0 END -- increment the loop counter SET @LoopCount = @LoopCount + 1 END END ELSE BEGIN SET @Reset = '' END RETURN @Reset END
A key point of the above script is that a zero length string must be assigned to the @Reset variable before entering the loop. This is because otherwise the line SET @Reset = @Reset + UPPER(@Character) would leave @Reset with a value of NULL as NULL + a character = NULL.
The CASE statement is used to determine whethe the next letter in the name should be upper case or not, you could add more characters to this list if you can think of any others that might require a following upper case letter.
To call this user function in your ETL routine you could use something similar to below. Note you must use the 2 part name when calling a UDF i.e. [owner].[functionname]
INSERT INTO Stage_CleanedCustomers (Title, FirstName, LastName, CustomerID, DataSource) SELECT [Title], dbo.CleanNameCase(FirstName), dbo.CleanNameCase(LastName), CustomerID, 'WebSite' FROM WebSiteCustomers
In this article we have demonstrated a script that can be applied to name data from has been extracted from different sources that may / may not use standardised data entry techniques. We hope this SQL script proves useful in your ETL routines and gives you ideas for other such 'data cleaning' scripts that may be required due to inconsistencies in approaches to data entry.