Querying SQL Server with something LIKE a regular expression

Querying SQL Server with something LIKE a regular expression

Most of us are familiar with the LIKE command in SQL Server (and other DBMS environments), but few venture beyond simple text matching and the ‘%’ wildcard. It’s ability to use some regular expression (a.k.a. regex or regexp) syntax is a surprise to many.

For this article, I’m assuming that readers have already used the LIKE statement a few times and the first example is almost patronising.

Basic Text Substring Matches

Let’s start with what most people use, a simple text substring match. Using the ‘%’ to say “and anything else”…

WITH sample_data AS (
SELECT 1   AS [id], 'abc' AS [val] UNION ALL
SELECT 2   AS [id], 'ABC' AS [val] UNION ALL
SELECT 3   AS [id], 'xyz' AS [val] UNION ALL
SELECT 4   AS [id], 'XYZ' AS [val] UNION ALL
SELECT 5   AS [id], '123' AS [val] UNION ALL
SELECT 5   AS [id], '1T3' AS [val]
)
SELECT *
  FROM sample_data
 WHERE [val] LIKE 'ab%'
;

Which gives us the below output:

| id  | val |
| 1   | abc |
| 2   | ABC |

While powerful (and the most used feature of LIKE), this just the tip of the iceberg. It’s also a perfect example of where the syntax of the LIKE command is different from the equivalent regular expression.

Regular Expressions

Some people will read “regular expression” and react with a knowing nod (feel free to jump ahead), everyone else is about to learn something you never realised you needed to know.

What is a Regular Expression?

A regular Expression is a string which represents a pattern for matching. There are many cases where there is a pattern which we want to test text against. A common example is if a date is formatted as a process expects it (YYYY-MM-DD or DD/MM/YY etc…).

A more advanced case is validating an email address, very quickly you can build up rules in your head (it has to have the ‘@’, but only once, and there has to be something either side of it…. and so on)

What do they do in SQL Server?

The common one we have all seen is shown in the first bit of code at the top, ‘ab%’ (the letter ‘a’ followed immediately by the letter ‘b’ then “anything else”). The syntax of a regular expression allows us to build more complicated rules, such as ‘a’ or ‘b’, followed by a number (less than 5), and then any letter.

For illustrative purposes, here is a sample of the two expressions mentioned above (don’t worry about understanding them just yet).

--the letter 'a' followed immediately by the letter 'b' then "anything else"
   AND [val] LIKE 'ab%'

--'a' or 'b', followed by a number (less than 5), and then any letter.
   AND [val] LIKE '[ab][0-4][a-z]'

I’ve put some links to documentation on regular expressions at the bottom. I believe they are something that every developer should have a basic understanding of. Just keep in mind the syntax in SQL Server is a little different (and cut down) than most other languages.

Using Regular Expressions in SQL Server

Now we know what a regular expression is, it’s time to look at the syntax. There are only 4 sets of wild card characters used in the LIKE statement, but combined they are very powerful.

  • % – Anything, any number of times.
  • _ (underscore) – Anything, but once.
  • [] – Any one of the characters within the square brackets (note, a range can be used such as 0-9 for numbers).
  • [^] – Any character not within the square brackets (note, a range can be used such as 0-9 for numbers).

We are already familiar with the %, as it’s well used.

The use of the underscore is just the same, except it matches just one character. So 'a_c' matches ‘abc’ and ‘a1c’ but not ‘abbc’ (which would be matched by 'a%c').

It’s easy to think of the square brackets as an extension of the underscore. Rather than saying one of any character, we are saying one character from a list. For example 'a[bt]c' matches ‘abc’ and ‘atc’, but now it won’t match ‘a1c’ or ‘abbc’. Common ranges used in this are numbers such as [0-9] and letters [a-z].

Lastly, using the [^] syntax allows for a negation. Changing our previous example to 'a[^bt]c' no longer matches ‘abc’ and ‘atc’, but does match ‘a1c’ (1 isn’t b or t). Keep in mind, ‘abbc’ or ‘a11c’ still won’t match (it has too many letters).

Examples

Below is some code which will buts the above examples into practise (just comment/uncomment the lines to see each filter, or play with it over at rextester.com: SQL Server – regex filtering using LIKE

WITH sample_data AS (
SELECT 0   AS [id], '123'  AS [val] UNION ALL
SELECT 1   AS [id], 'abc'  AS [val] UNION ALL
SELECT 2   AS [id], 'atc'  AS [val] UNION ALL
SELECT 3   AS [id], 'a1c'  AS [val] UNION ALL
SELECT 4   AS [id], 'abbc' AS [val] UNION ALL
SELECT 5   AS [id], 'a11c' AS [val]
)
SELECT *
  FROM sample_data
 WHERE 1=1
 --AND [val] LIKE 'ab%'
 --AND [val] LIKE 'a%c'
 --AND [val] LIKE 'a_c'
 --AND [val] LIKE 'a[bt]c'
 --AND [val] LIKE 'a[^bt]c'
;

More Ways of Using Regular Expressions

All these can be seen and tried over at rextester.com: SQL Server – regex examples

Checking for a date

This comes in handy when working with data which needs cleaning, or if a text field is a free form field which happens (in some cases) to contain a relevant date… Say what you will about good/bad practise, I’ve seen this sort of issue in more than one place.

WITH sample_data AS (
SELECT 1  AS [id], '2019-01-01' AS [val] UNION ALL -- Normal Valid
SELECT 2  AS [id], '2019-01-02' AS [val] UNION ALL -- Normal Valid
SELECT 3  AS [id], '1998-01-03' AS [val] UNION ALL -- Normal Valid
SELECT 4  AS [id], 'BROKEN'     AS [val] UNION ALL -- Text
SELECT 5  AS [id], '1800-01-01' AS [val] UNION ALL -- Outside our date range
SELECT 6  AS [id], '01-06-2019' AS [val] UNION ALL -- Wrong Format
SELECT 7  AS [id], '2019-31-01' AS [val]           -- Month/Day clearly in wrong order
)
SELECT *
  FROM sample_data
WHERE 1=1
   AND [val] LIKE '[1-2][09][0-9][0-9]-[01][0-9]-[0-3][0-9]'
;

Looking for a literal wildcard character

To find a literal character which is one of the wild cards (for example ‘%’), just place it between the square brackets and it becomes a literal. Such as '[_]', with the exception of ']' which can be treated as any other character.

WITH sample_data AS (
SELECT 1  AS [id], 'a[cxyz' AS [val] UNION ALL
SELECT 2  AS [id], 'ABCX]Z' AS [val] UNION ALL
SELECT 3  AS [id], 'xyzabc' AS [val] UNION ALL
SELECT 4  AS [id], 'XY%ABC' AS [val] UNION ALL
SELECT 5  AS [id], 'ABC123' AS [val] UNION ALL
SELECT 6  AS [id], '123XYZ' AS [val] UNION ALL
SELECT 7  AS [id], '123456' AS [val] UNION ALL
SELECT 8  AS [id], '123%Y6' AS [val] UNION ALL
SELECT 9  AS [id], '123_67' AS [val]
)
SELECT *
  FROM sample_data
 WHERE 1=1
   AND(   [val] LIKE '%[%]%'
       OR [val] LIKE '%[_]%'
       OR [val] LIKE '%[[]%'
       OR [val] LIKE '%]%'
      )
;

Further Reading

The documentation from Microsoft covers the LIKE statement in depth and is a good place to look into this further.

If regular expressions are new to you, or you haven’t seen much more than the above, now is a time to learn more! Many developers will go on and encounter regular expressions in many other environments. From using shell commands in a unix system (sed for example), in other programming languages, and in big data interfaces (hadoop, built on java manages to leverage the full power of regular expressions).

So, make a bit of time, pour a nice cup of coffee and read (thanks to the web archive): The absolute bare minimum every programmer should know about regular expressions

Online Tools

I found this interactive site, very useful for testing regular expressions (although it goes beyond the SQL Server syntax): https://regexr.com/

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.