In SQL Server, not all spaces are equal

In SQL Server, not all spaces are equal

I recently had a challenge mapping attributes using a lookup table. I had the same customer name in two tables “bobs widgets”, but when I tried to join on the names, there was no match. To cut a long story short, there is more than one way to encode a space.

This picks up from the recent article on Cleaning Strings: Removing & replacing extra characters in that it looks at the same methods for character encoding. However in this case it’s a character with two different encodings but the same visual representation. Note, this issue also comes up with line breaks and some other characters, especially when working with unicode.

The Encoding Problem

The Problem

There are two character encoding references for a space. The look the same on the screen, however their encodings are different and they will fail an equality test. This can be extra confusing with spaces as we are used to ignoring trailing spaces when checking string equality.

PRINT 'Simple Select - They look the same';

SELECT 'Bobs' + CHAR(32)  + 'Widgets' AS s_1,
       'Bobs' + CHAR(160) + 'Widgets' AS s_2
;

GO

PRINT 'Select encoded values: Where they are different';

DECLARE @NAME_1 VARCHAR(100) = 'Bobs' + CHAR(32)  + 'Widgets';
DECLARE @NAME_2 VARCHAR(100) = 'Bobs' + CHAR(160) + 'Widgets';

SELECT @NAME_1 AS [name_1_coded], @NAME_2 AS [name_2_coded]
 WHERE @NAME_1 <> @NAME_2
;

GO

PRINT 'Select encoded values: Where they are equal';

DECLARE @NAME_1 VARCHAR(100) = 'Bobs' + CHAR(32)  + 'Widgets';
DECLARE @NAME_2 VARCHAR(100) = 'Bobs' + CHAR(160) + 'Widgets';

-- Note: No results
SELECT @NAME_1 AS [name_1_coded], @NAME_2 AS [name_2_coded]
 WHERE @NAME_1 = @NAME_2
;

This can be seen in action over at rextester.com

The update & Replace (to make uniform)

Similar to the extra characters article, the solution when finding rows in a table is to do a quick update:

UPDATE t1
   SET customer_name  = REPLACE(customer_name, CHAR(160), CHAR(32))
 WHERE customer_name <> REPLACE(customer_name, CHAR(160), CHAR(32))
;

This can be seen in action at rextester.com

Replacing the values fixes the immediate problem. Note that it can fail if there is a primary key/unique index on a column which has both variations in it.

Prevention Is Better Than The Cure

Some programmatic methods to solve this issue are.

Perform the conversion on insert/update

If a stored procedure performs the insert or update, then add a conversion to the procedure to clean all new values going into a table.

INSERT
  INTO t1 (customer_name)
VALUES(REPLACE('Bobs' + CHAR(160) + 'Widgets',
               CHAR(160),
               CHAR(32)
              )
      )
;

A Constraint

Use of a constraint would prevent the unwanted character from going into the column (this causes an error message/failure).

A Trigger

A trigger can be a handy way to clean data, configured to update values on insert/update so that the DBMS attempts to “fix” things. It applies the same idea as modifying the insert/update statement, but performs it programmatically for each record. Note, it could have performance issues with large volumes of data.

This is helpful when you have less technical users providing/inserting updates. It should work silently.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.