RLIKE in hive: Filtering with regular expressions

RLIKE in hive: Filtering with regular expressions

As I previously did a blog post on Querying SQL Server with something LIKE a regular expression (Using simple regular expressions in a LIKE statement), I thought I would use that as a segue into Apache Hive and HiveQL. Because Hive and therefore HiveQL is built using Java, it has the full power of Java regular expressions in it’s RLIKE statement.

Regular Expressions

Some people will read “regular expression” and react with a knowing nod (feel free to jump ahead), everyone else is about to learn something you never realised you needed to know.

What is a Regular Expression?

A regular expression is a Computer Science tool for implementing what mathematicians know as a finite state machine (sometimes called a finite state automaton), usually on text (alphanumeric characters).

If you have used grep on the unix command line, you have probably seen this in action.

The actual regular Expression is a string which represents a pattern for matching. There are many cases where there is a pattern which we want to test text against. A common example is to test if a date is correctly formatted how a process expects it (YYYY-MM-DD or DD/MM/YY etc…).

A more advanced case is validating an email address, very quickly you can build up rules in your head (it has to have the ‘@’, but only once, and there has to be something either side of it…. and so on).

Further Reading

If regular expressions are new to you, then make some time to read (thanks to the web archive): The absolute bare minimum every programmer should know about regular expressions

Online Tools

I found this interactive site, very useful for testing regular expressions (although it goes beyond the SQL Server syntax): https://regexr.com/

RLIKE in Hive

Hive has both LIKE (which functions the same as in SQL Server and other environments) and RLIKE, which uses regular expressions. These are mentioned briefly in the LanguageManual UDF documentation.

The best way to understand RLIKE is to see it in action. So, I’ve created some sample data and some examples of regular expressions.

Sample Data

DROP TABLE IF EXISTS sample_data;

CREATE TABLE sample_data
(
    sample_column VARCHAR(100)
)
;

INSERT
  INTO sample_data
      (sample_column)
VALUES('abc'),
      ('ABC'),
      ('XYZ'),
      ('123'),
      ('000'),
      ('789'),
      ('1.3')
;

SELECT *
  FROM sample_data
;
sample_column
abc
ABC
XYZ
123
000
789
1.3

This sample data gives us some text and numeric characters. If you look a the UDF, most functionality is around text and numbers. But special characters can also be manipulated.

Contains Numbers

SELECT sample_column
  FROM sample_data
 WHERE sample_column RLIKE '[0-9]+'
;
sample_column
123
000
789
1.3

Because we are matching on if the string has one or more numeric character, we see the 1.3 value. We would also see a string which contained ‘abc123XYZ’.

All (and only) Numbers

SELECT sample_column
  FROM sample_data
 WHERE sample_column RLIKE '^[0-9]+$'
;
sample_column
123
000
789

Note that this hasn’t included the decimal point. See the example further down for use of a decimal point in a number.

Text (alpha) Only

SELECT sample_column
  FROM sample_data
 WHERE sample_column RLIKE '([a-z]|[A-Z])+'
;
sample_column
abc
ABC
XYZ

The pattern will match one or more text characters. Noe that it will allow for other characters to be returned as long as there is one or more text characters.

Number with Decimal Point

SELECT sample_column
  FROM sample_data
 WHERE sample_column RLIKE '[0-9]+[.][0-9]+'
;

Note that the dot is encased in square brackets, this makes it literal rather than a wildcard (default behaviour) for any character.

Useful Meta Characters

On top of using the square brackets [] to identify character sets, and regular brackets () to identify groupings, there are a few other common characters which control the behaviour.

  • * – Any number of occurrences (includes 0)
  • + – One or more occurrences
  • ? – 0 or 1 occurrences
  • ^ – Start of line
  • $ – End of line

The ^ (start of line) and $ (end of line) are very useful to match the whole value of a column, for example to avoid identifying ‘123B45’ as a number.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.