Cleaning Strings: Removing & replacing extra characters
Part of managing data is cleaning text. Tabs, spaces and line breaks can sneak in, and confuse output. When working with whitespace characters and line breaks, we can’t see them. This can cause a lot of confusion for users (and developers) when unexpected behavior occurs.
Keep in mind we have to be careful, often we do want to store and recreate text exactly. Not all strings/varchars need cleaning.
Why does it happen?
Trailing space characters can come in from the sauce data when the original data type is ‘fixed width’ (padded with spaces).
Most often, extra characters come from a user pasting something into a front end. If the input isn’t sanitised or checked, then the extra characters will get pulled in with everything else.
Methods for cleaning
Trimming strings
Usually the first step to cleaning strings is to remove spaces from the start/end. There are the two functions for this, LTRIM() and RTRIM(). which take care of spaces before (to the left) and after (to the right) the non text characters.
Note, oracle (and some other databases have a plain TRIM() which does the same as applying both LTRIM() and RTRIM() to a string. This functionality isn’t in Microsoft SQL Server.
SELECT string_value               AS string_value,
       LTRIM(string_value)        AS string_value_LTRIM,
       RTRIM(string_value)        AS string_value_RTRIM,
       LTRIM(RTRIM(string_value)) AS string_value_TRIM_BOTH
  FROM [test_strings]
;
See some examples in SQL Fiddle:
Note, in the examples I have used the REPLACE() functions explored below to show where spaces are.
Replacing characters
Line breaks and tabs are different to trailing spaces, they aren’t picked up by any TRIM command. Fortunately there is the REPLACE() command, which does exactly what it says on the tin.
SELECT REPLACE([column_name], '<text to replace>', '<new text>') AS [altered_text], FROM [data_table] ;
For special characters, this is best combined with getting the CHAR() values of text, rather than trying to copy/paste a line break or tab into your search string. The key ones are:
- CHAR(9)– Tab
- CHAR(10)– Line Feed
- CHAR(13)– Carriage return
SELECT REPLACE([column_name], CHAR(9),  '') AS [column_name_NO_TAB],
       REPLACE([column_name], CHAR(10), '') AS [column_name_NO_LINE_FEED],
       REPLACE([column_name], CHAR(13), '') AS [column_name_NO_CARRIAGE_RETURN]
  FROM [data_table]
;
Documentation on these functions:
All of these methods can be combined to clean a string, cleaning is normally achieved by a combination of trimming and replacing.
One Reply to “Cleaning Strings: Removing & replacing extra characters”