Getting LEFT & RIGHT in hive

2019-10-13

Hive doesn’t have the functions LEFT() and RIGHT() as used in other RDBMS/SQL platforms. Here I demonstrate using hive functions for parsing text values which can emulate LEFT and RIGHT.

After my longer Getting Started article, this is a short one covering a common challenge.

There are two ways to extract part of a string:

Getting a substring, using: SUBSTR
Extracting part of a string using a Regular Expression: REGEXP_EXTRACT

Note: The examples in this article work, but the use of a CTE for sample values performs much slower than using an existing dataset.

Using Substring

The SUBSTR command has a simple syntax:

SUBSTR(<string_value>, <start_location> INT, <length> INT)

Using SUBSTR for LEFT

This is our easiest scenario as the start position is always known (1) and the length is the number of characters we would like to return. In the below example I’m pulling back the first 4 characters:

WITH sample_data AS (
SELECT CAST('abc123xyz987' AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('AAABBB'       AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('ZZZ'          AS VARCHAR(50)) AS sample_string
)
SELECT sample_string,
       SUBSTR(sample_string, 1, 4) AS sample_string_left_4
  FROM sample_data
;

Note: if the original string is fewer than the desired number of characters, this will return the short string.

Using SUBSTR for RIGHT

For the RIGHT we have two options, the first is to use a negative position (which is applied backwards). This is very similar to the LEFT above.

WITH sample_data AS (
SELECT CAST('abc123xyz987' AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('AAABBB'       AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('ZZZ'          AS VARCHAR(50)) AS sample_string
)
SELECT sample_string,
       SUBSTR(sample_string, -4, 4)                AS substr_right_4
  FROM sample_data
;

Note: If the original string length is fewer than the desired number of characters in the output, the function will return NULL.

Using Regular Expressions

the REGEXP_EXTRACT command allows us to extract values based on a regular expression. This can be handy for other extraction methods (for example filtering for numbers etc).

The syntax is:

REGEXP_EXTRACT(<string_value>, <regular_expression>, <group> [optional])

string_value: The source string
regular_expression: The regex used to identify the value to return
group: optional argument, to identify which result to return (default 0)

SELECT REGEXP_EXTRACT('abc123XYZ789', '^.{4}', 0) AS left_4;

SELECT REGEXP_EXTRACT('abc123XYZ789', '.{4}$', 0) AS right_4;

In the above examples, the number in curly brackets is the number of characters to return, I use a dot ‘.’ to mark any character (I’m not going into the details of regular expressions today). In the first example, I use the symbol ‘^’ to identify “the start of the string” and in the second I use “$” to identify the end.

Note: the above examples will return NULL if the string value is fewer than 4 characters in length. To return shorter strings, the values in the curly brackets needs to be:

{<minimum_length>,<maximum_length>}

To get shorter strings returned, the expression can be run as:

SELECT REGEXP_EXTRACT('A', '^.{1,4}', 0) AS left_4_or_fewer;

SELECT REGEXP_EXTRACT('A', '.{1,4}$', 0) AS right_4_or_fewer;

Review

The regular expressions allow for more control with regards to shorter strings. But probably have higher computational overhead (I’ve not been able to test reliably).

WITH sample_data AS (
SELECT CAST('abc123xyz987' AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('AAABBB'       AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('ZZZ'          AS VARCHAR(50)) AS sample_string UNION ALL
SELECT CAST('N'            AS VARCHAR(50)) AS sample_string
)
SELECT sample_string,
       SUBSTR(sample_string, 1, 4)                 AS substr_left_4,    -- Allows for short strings
       SUBSTR(sample_string, -4, 4)                AS substr_right_4,   -- Doesn't allow for short strings
       REGEXP_EXTRACT(sample_string, '^.{4}'  , 0) AS regex_left_4_A,   -- Doesn't allow for short strings
       REGEXP_EXTRACT(sample_string, '^.{1,4}', 0) AS regex_left_4_B,   -- Allows for short strings
       REGEXP_EXTRACT(sample_string, '.{4}$'  , 0) AS regex_right_4_A,  -- Doesn't allow for short strings
       REGEXP_EXTRACT(sample_string, '.{1,4}$', 0) AS regex_right_4_B   -- Allows for short strings
  FROM sample_data
;

I have transposed the results below so they display better on the page for demonstrative purposes:

sample_string	abc123xyz987	AAABBB	ZZZ	N
substr_left_4	abc1	AAAB	ZZZ	N
substr_right_4	z987	ABBB
regex_left_4_a	abc1	AAAB
regex_left_4_b	abc1	AAAB	ZZZ	N
regex_right_4_a	z987	ABBB
regex_right_4_b	z987	ABBB	ZZZ	N

Using Substring

Using SUBSTR for LEFT

Using SUBSTR for RIGHT

Using Regular Expressions

Review

Leave a Reply Cancel reply