Getting LEFT & RIGHT in hive
Hive doesn’t have the functions LEFT()
and RIGHT()
as used in other RDBMS/SQL platforms. Here I demonstrate using hive functions for parsing text values which can emulate LEFT and RIGHT.
After my longer Getting Started article, this is a short one covering a common challenge.
There are two ways to extract part of a string:
- Getting a substring, using:
SUBSTR
- Extracting part of a string using a Regular Expression:
REGEXP_EXTRACT
Note: The examples in this article work, but the use of a CTE for sample values performs much slower than using an existing dataset.
Using Substring
The SUBSTR command has a simple syntax:
SUBSTR(<string_value>, <start_location> INT, <length> INT)
Using SUBSTR for LEFT
This is our easiest scenario as the start position is always known (1) and the length is the number of characters we would like to return. In the below example I’m pulling back the first 4 characters:
WITH sample_data AS ( SELECT CAST('abc123xyz987' AS VARCHAR(50)) AS sample_string UNION ALL SELECT CAST('AAABBB' AS VARCHAR(50)) AS sample_string UNION ALL SELECT CAST('ZZZ' AS VARCHAR(50)) AS sample_string ) SELECT sample_string, SUBSTR(sample_string, 1, 4) AS sample_string_left_4 FROM sample_data ;
Note: if the original string is fewer than the desired number of characters, this will return the short string.
Using SUBSTR for RIGHT
For the RIGHT we have two options, the first is to use a negative position (which is applied backwards). This is very similar to the LEFT above.
WITH sample_data AS ( SELECT CAST('abc123xyz987' AS VARCHAR(50)) AS sample_string UNION ALL SELECT CAST('AAABBB' AS VARCHAR(50)) AS sample_string UNION ALL SELECT CAST('ZZZ' AS VARCHAR(50)) AS sample_string ) SELECT sample_string, SUBSTR(sample_string, -4, 4) AS substr_right_4 FROM sample_data ;
Note: If the original string length is fewer than the desired number of characters in the output, the function will return NULL.
Using Regular Expressions
the REGEXP_EXTRACT
command allows us to extract values based on a regular expression. This can be handy for other extraction methods (for example filtering for numbers etc).
The syntax is:
REGEXP_EXTRACT(<string_value>, <regular_expression>, <group> [optional])
- string_value: The source string
- regular_expression: The regex used to identify the value to return
- group: optional argument, to identify which result to return (default 0)
SELECT REGEXP_EXTRACT('abc123XYZ789', '^.{4}', 0) AS left_4; SELECT REGEXP_EXTRACT('abc123XYZ789', '.{4}$', 0) AS right_4;
In the above examples, the number in curly brackets is the number of characters to return, I use a dot ‘.’ to mark any character (I’m not going into the details of regular expressions today). In the first example, I use the symbol ‘^’ to identify “the start of the string” and in the second I use “$” to identify the end.
Note: the above examples will return NULL if the string value is fewer than 4 characters in length. To return shorter strings, the values in the curly brackets needs to be:
{<minimum_length>,<maximum_length>}
To get shorter strings returned, the expression can be run as:
SELECT REGEXP_EXTRACT('A', '^.{1,4}', 0) AS left_4_or_fewer; SELECT REGEXP_EXTRACT('A', '.{1,4}$', 0) AS right_4_or_fewer;
Review
The regular expressions allow for more control with regards to shorter strings. But probably have higher computational overhead (I’ve not been able to test reliably).
WITH sample_data AS ( SELECT CAST('abc123xyz987' AS VARCHAR(50)) AS sample_string UNION ALL SELECT CAST('AAABBB' AS VARCHAR(50)) AS sample_string UNION ALL SELECT CAST('ZZZ' AS VARCHAR(50)) AS sample_string UNION ALL SELECT CAST('N' AS VARCHAR(50)) AS sample_string ) SELECT sample_string, SUBSTR(sample_string, 1, 4) AS substr_left_4, -- Allows for short strings SUBSTR(sample_string, -4, 4) AS substr_right_4, -- Doesn't allow for short strings REGEXP_EXTRACT(sample_string, '^.{4}' , 0) AS regex_left_4_A, -- Doesn't allow for short strings REGEXP_EXTRACT(sample_string, '^.{1,4}', 0) AS regex_left_4_B, -- Allows for short strings REGEXP_EXTRACT(sample_string, '.{4}$' , 0) AS regex_right_4_A, -- Doesn't allow for short strings REGEXP_EXTRACT(sample_string, '.{1,4}$', 0) AS regex_right_4_B -- Allows for short strings FROM sample_data ;
I have transposed the results below so they display better on the page for demonstrative purposes:
sample_string | abc123xyz987 | AAABBB | ZZZ | N |
substr_left_4 | abc1 | AAAB | ZZZ | N |
substr_right_4 | z987 | ABBB | ||
regex_left_4_a | abc1 | AAAB | ||
regex_left_4_b | abc1 | AAAB | ZZZ | N |
regex_right_4_a | z987 | ABBB | ||
regex_right_4_b | z987 | ABBB | ZZZ | N |