Getting LEFT & RIGHT in hive

Getting LEFT & RIGHT in hive

Hive doesn’t have the functions LEFT() and RIGHT() as used in other RDBMS/SQL platforms. Here I demonstrate using hive functions for parsing text values which can emulate LEFT and RIGHT.

After my longer Getting Started article, this is a short one covering a common challenge.

There are two ways to extract part of a string:

  • Getting a substring, using: SUBSTR
  • Extracting part of a string using a Regular Expression: REGEXP_EXTRACT

Note: The examples in this article work, but the use of a CTE for sample values performs much slower than using an existing dataset.

Using Substring

The SUBSTR command has a simple syntax:

SUBSTR(<string_value>, <start_location> INT, <length> INT)

Using SUBSTR for LEFT

This is our easiest scenario as the start position is always known (1) and the length is the number of characters we would like to return. In the below example I’m pulling back the first 4 characters:

WITH sample_data AS (
SELECT CAST('abc123xyz987' AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('AAABBB'       AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('ZZZ'          AS VARCHAR(50)) AS sample_string
)
SELECT sample_string,
       SUBSTR(sample_string, 1, 4) AS sample_string_left_4
  FROM sample_data
;

Note: if the original string is fewer than the desired number of characters, this will return the short string.

Using SUBSTR for RIGHT

For the RIGHT we have two options, the first is to use a negative position (which is applied backwards). This is very similar to the LEFT above.

WITH sample_data AS (
SELECT CAST('abc123xyz987' AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('AAABBB'       AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('ZZZ'          AS VARCHAR(50)) AS sample_string
)
SELECT sample_string,
       SUBSTR(sample_string, -4, 4)                AS substr_right_4
  FROM sample_data
;

Note: If the original string length is fewer than the desired number of characters in the output, the function will return NULL.

Using Regular Expressions

the REGEXP_EXTRACT command allows us to extract values based on a regular expression. This can be handy for other extraction methods (for example filtering for numbers etc).

The syntax is:

REGEXP_EXTRACT(<string_value>, <regular_expression>, <group> [optional])
  • string_value: The source string
  • regular_expression: The regex used to identify the value to return
  • group: optional argument, to identify which result to return (default 0)
SELECT REGEXP_EXTRACT('abc123XYZ789', '^.{4}', 0) AS left_4;

SELECT REGEXP_EXTRACT('abc123XYZ789', '.{4}$', 0) AS right_4;

In the above examples, the number in curly brackets is the number of characters to return, I use a dot ‘.’ to mark any character (I’m not going into the details of regular expressions today). In the first example, I use the symbol ‘^’ to identify “the start of the string” and in the second I use “$” to identify the end.

Note: the above examples will return NULL if the string value is fewer than 4 characters in length. To return shorter strings, the values in the curly brackets needs to be:

{<minimum_length>,<maximum_length>}

To get shorter strings returned, the expression can be run as:

SELECT REGEXP_EXTRACT('A', '^.{1,4}', 0) AS left_4_or_fewer;

SELECT REGEXP_EXTRACT('A', '.{1,4}$', 0) AS right_4_or_fewer;

Review

The regular expressions allow for more control with regards to shorter strings. But probably have higher computational overhead (I’ve not been able to test reliably).

WITH sample_data AS (
SELECT CAST('abc123xyz987' AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('AAABBB'       AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('ZZZ'          AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('N'            AS VARCHAR(50)) AS sample_string
)
SELECT sample_string,
       SUBSTR(sample_string, 1, 4)                 AS substr_left_4,    -- Allows for short strings
       SUBSTR(sample_string, -4, 4)                AS substr_right_4,   -- Doesn't allow for short strings
       REGEXP_EXTRACT(sample_string, '^.{4}'  , 0) AS regex_left_4_A,   -- Doesn't allow for short strings
       REGEXP_EXTRACT(sample_string, '^.{1,4}', 0) AS regex_left_4_B,   -- Allows for short strings
       REGEXP_EXTRACT(sample_string, '.{4}$'  , 0) AS regex_right_4_A,  -- Doesn't allow for short strings
       REGEXP_EXTRACT(sample_string, '.{1,4}$', 0) AS regex_right_4_B   -- Allows for short strings
  FROM sample_data
;

I have transposed the results below so they display better on the page for demonstrative purposes:

sample_string abc123xyz987 AAABBB ZZZ N
substr_left_4 abc1 AAAB ZZZN
substr_right_4 z987 ABBB
regex_left_4_a abc1 AAAB
regex_left_4_b abc1 AAAB ZZZN
regex_right_4_a z987 ABBB
regex_right_4_b z987 ABBBZZZN

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.