Marat BN — Marat Borisovich Nepomnyashy's home page

Utilizing Regular Expression Lookaheads for Validating String Length

Created on 2011-04-12

  • Background on Regex Lookaheads:

    Lookaheads are an advanced, rarely used feature of regular expression syntax. They let the regex engine validate and match a portion of the text internally while excluding that portion from the returned match. If the internal lookahead match fails, however, the overall match fails at that position, even if the rest of the expression would otherwise match.

    Lookaheads get their name because the regex engine “looks ahead” to check whether they match, but then passes over them without including them in the returned match.

    The syntax for a regex lookahead looks like this:

    (?=internally-matched-lookahead-regex)

    Example regular expression with a lookahead matching a dash-separated phone number with a 408 area code:

    /^(?=408)\d{3}-\d{3}-\d{4}$/g

    In the expression above, the 408 area code is matched twice: first internally by the lookahead, as the digits 4, 0, and 8 in that order, and then externally as part of a generic phone number (a set of 3 digits, followed by a dash, followed by another 3 digits, followed by another dash, and then by 4 more digits).

    At the conclusion of the internal lookahead match, the regex engine algorithm returns to the point in the string where it started validating the lookahead (the beginning of the string in this case), and goes on from there to validate the conventional second match. The second match is just a generic phone number, but it would be cancelled if the number had an area code other than 408, because the lookahead would fail. Had the lookahead been removed from this expression, then the new expression would match any generic dash-separated 10-digit phone number regardless of its area code.
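    The behavior described above can be tried out with a minimal JavaScript sketch. The phone numbers are made-up sample values, and the g flag is left off here because `RegExp.prototype.test` keeps state between calls when that flag is set.

```javascript
// The lookahead checks for "408" without consuming it, so the digits are
// matched again by \d{3} and still appear in the returned match.
const areaCode408 = /^(?=408)\d{3}-\d{3}-\d{4}$/;

const hit = '408-555-1234'.match(areaCode408);
console.log(hit[0]);                            // 408-555-1234
console.log(areaCode408.test('415-555-1234'));  // false: the lookahead fails

// With the lookahead removed, any area code is accepted.
const anyAreaCode = /^\d{3}-\d{3}-\d{4}$/;
console.log(anyAreaCode.test('415-555-1234'));  // true
```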

  • Regex Lookaheads Rarely Used:

    Regex lookaheads are rarely used because most validations can be accomplished without them. For example, the expression above can be simplified to just:

    /^408-\d{3}-\d{4}$/g
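    As a quick sanity check, the two forms can be compared on a few made-up sample inputs. This is only a spot check on hypothetical values, not a proof of equivalence.

```javascript
const withLookahead = /^(?=408)\d{3}-\d{3}-\d{4}$/;
const simplified = /^408-\d{3}-\d{4}$/;

// Hypothetical samples: one valid number and three that should be rejected.
const samples = ['408-555-1234', '415-555-1234', '4085551234', '408-555-12345'];

// Both expressions should give the same verdict for every sample.
const agree = samples.every(s => withLookahead.test(s) === simplified.test(s));
console.log(agree); // true
```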

  • String Lengths Not Always Trivial to Validate Without Regex Lookaheads:

    However, certain situations related to enforcing string lengths can be very challenging to resolve without utilizing regex lookaheads.

    For a trivial hypothetical example, imagine that the objective is to validate a variable-length token consisting of two groups of digits separated by a single dash, where the total length of the token must not exceed 10 characters. The following tokens would be considered valid:

    • 1-3
    • 1234567-90
    • 12345-7890
    • 12-4567

    While the upper limit on the number of characters is 10, the lower limit would be 3, as that’s the shortest possible length to accommodate 2 single digits with a dash in-between.

    The following tokens would be considered invalid, because they contain characters other than digits, lack a dash or the digits on either side of it, have multiple dashes, and/or exceed 10 characters in length:

    • not-digits
    • 12
    • 12-
    • -23
    • 123
    • 1234567-901234567
    • 1234567890-1234567890
    • 12-45678-90
    • 1234-67890-234-67890

    Enforcing just the format, that the token be composed only of digits separated by a single dash, is fairly easy. The corresponding regular expression would be:

    /^\d+-\d+$/g

    And enforcing just the 10 character length limit on such a token would also be easy, with this regular expression:

    /^[\d-]{3,10}$/g

    However, the last expression above does not enforce the requirement that one and only one dash separate the 2 groups of digits. Enforcing both the token format and the token length with a single regular expression is not trivial. Consider, for example, a regex like this:

    /^\d{1,8}-\d{1,8}$/g

    The above regex would enforce a maximum length of 17 rather than 10, which is not what we want. Here’s another pathetic attempt:

    /^(\d+-\d+){3,10}$/g

    The above would not work either, as it would validate multiple concatenated tokens, enforcing the limit on the number of these concatenated tokens rather than on the number of characters in the token. The curly bracket syntax specifies how many times the preceding group may repeat, not how many characters that group may contain.
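    The shortcomings of the four partial attempts above can be checked with a short JavaScript sketch. The constant names are made up for illustration, and the g flag is left off because `test` is stateful when it is set.

```javascript
const formatOnly = /^\d+-\d+$/;             // one dash between two digit groups, any length
const lengthOnly = /^[\d-]{3,10}$/;         // 3 to 10 digits or dashes, in any arrangement
const perGroupCap = /^\d{1,8}-\d{1,8}$/;    // caps each group, not the whole token
const repeatedGroup = /^(\d+-\d+){3,10}$/;  // repeats the whole token 3 to 10 times

console.log(formatOnly.test('1234567-901234567'));   // true, although 17 characters long
console.log(lengthOnly.test('123'));                 // true, although there is no dash
console.log(perGroupCap.test('12345678-12345678'));  // true, although 17 characters long
console.log(repeatedGroup.test('1-3'));              // false, although the token is valid
console.log(repeatedGroup.test('1-23-45-6'));        // true: three tokens run together
```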

    As is typically the case with perplexing problems of all sorts, there usually exists a simple, but less than ideal, brute force method that will break the impasse, and this case is no exception. Here it would involve the alternation (OR) operator, denoted by the pipe character ‘|’. It gives us this unwieldy monster expression that does work, but which can’t be considered practical for problems of this sort, as the number of alternatives grows with the length limit and balloons further with each additional separator:

    /^((\d{1,5}-\d{1,4})|(\d{6}-\d{1,3})|(\d{7}-\d{1,2})|(\d{8}-\d)|(\d{1,4}-\d{5})|(\d{1,3}-\d{6})|(\d{1,2}-\d{7})|(\d-\d{8}))$/g
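    The brute-force expression can be run against the valid and invalid sample tokens listed earlier. The sketch below is JavaScript, with the g flag dropped because `test` keeps state between calls when it is set.

```javascript
const bruteForce = /^((\d{1,5}-\d{1,4})|(\d{6}-\d{1,3})|(\d{7}-\d{1,2})|(\d{8}-\d)|(\d{1,4}-\d{5})|(\d{1,3}-\d{6})|(\d{1,2}-\d{7})|(\d-\d{8}))$/;

// Sample tokens taken from the lists above.
const valid = ['1-3', '1234567-90', '12345-7890', '12-4567'];
const invalid = [
  'not-digits', '12', '12-', '-23', '123',
  '1234567-901234567', '1234567890-1234567890',
  '12-45678-90', '1234-67890-234-67890',
];

console.log(valid.every(t => bruteForce.test(t)));   // true: every valid token passes
console.log(invalid.some(t => bruteForce.test(t)));  // false: no invalid token passes
```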

  • Applying Regex Lookaheads for String Length Validation:

    However, by using regex lookaheads, it is possible to combine multiple regular expressions into one. In essence, a regular expression lookahead can be used like a conditional AND operator, which is sadly missing from the regex syntax in explicit form. (Perhaps regex lookaheads would make an explicit AND operator superfluous.)

    Combining the two parts of this validation problem into a single expression with a regex lookahead would look like this:

    /^(?=[\d-]{3,10}$)\d+-\d+$/g
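    One way to gain confidence that the lookahead expression accepts exactly what the brute-force expression accepts is to compare the two on every short string built from a digit and a dash. The sketch below is JavaScript; the helper `allStrings` is a made-up name, the digit '1' stands in for any digit, and the cutoff of 12 characters is an arbitrary choice that goes just past the limit.

```javascript
const lookahead = /^(?=[\d-]{3,10}$)\d+-\d+$/;
const bruteForce = /^((\d{1,5}-\d{1,4})|(\d{6}-\d{1,3})|(\d{7}-\d{1,2})|(\d{8}-\d)|(\d{1,4}-\d{5})|(\d{1,3}-\d{6})|(\d{1,2}-\d{7})|(\d-\d{8}))$/;

// Build every string of length 0..maxLen over the given alphabet.
function allStrings(alphabet, maxLen) {
  let level = [''];
  const out = [''];
  for (let len = 1; len <= maxLen; len++) {
    level = level.flatMap(s => alphabet.map(ch => s + ch));
    out.push(...level);
  }
  return out;
}

const candidates = allStrings(['1', '-'], 12);
const disagreements = candidates.filter(
  s => lookahead.test(s) !== bruteForce.test(s)
);
console.log(candidates.length);     // 8191
console.log(disagreements.length);  // 0
```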

  • Break-Down of the Expression Utilizing a Lookahead:

    1. In the expression above, the regex engine algorithm first checks to see if the token contains only digits and dashes (in any order), and that the token is between 3 and 10 characters long. If the length of the token is either less than 3 or greater than 10, then the lookahead fails, failing the rest of the match along with it. And so this takes care of the string length enforcement component of the problem.

    2. The second component of the problem is to enforce that there be one and only one dash separating the 2 groups of digits, which is handled by the rest of the expression trailing the lookahead.
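    The two steps above amount to running two separate checks and requiring both to pass. A rough JavaScript sketch of that reading follows; `isValidToken` is a hypothetical helper name, and the tokens are samples from the lists earlier in the post.

```javascript
const lengthCheck = /^[\d-]{3,10}$/;  // step 1: what the lookahead verifies
const formatCheck = /^\d+-\d+$/;      // step 2: what the rest of the expression verifies
const combined = /^(?=[\d-]{3,10}$)\d+-\d+$/;

// Explicit AND of the two separate checks.
function isValidToken(token) {
  return lengthCheck.test(token) && formatCheck.test(token);
}

const tokens = ['1-3', '12345-7890', '123', '12-45678-90', '1234567-901234567'];
for (const t of tokens) {
  // The two columns printed after each token should always be equal.
  console.log(t, isValidToken(t), combined.test(t));
}
```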

  • Subtle Detail Critical to String Length Validation:

    A critical part of the solution above is the dollar sign $ at the conclusion of the lookahead (in addition to the regular dollar sign at the conclusion of the whole expression). This lookahead dollar sign tells the regex engine that the string must end within the 3 to 10 characters the lookahead is allowed to match, that is, no later than right after the 10th character. This detail is easy to miss, and without it the lookahead would be satisfied by just the first few characters of a longer string, so the 10-character limit would not be enforced.

    If the purpose of any lookahead is to enforce a whole string length limit, then it must terminate with a dollar sign to tell the regex engine that the string must end there.
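    The effect of leaving out the lookahead's dollar sign can be seen in a small JavaScript sketch. The names `anchored` and `unanchored` are made up for this illustration.

```javascript
const anchored = /^(?=[\d-]{3,10}$)\d+-\d+$/;
const unanchored = /^(?=[\d-]{3,10})\d+-\d+$/;  // the lookahead lacks its own $

const tooLong = '1234567-901234567';  // 17 characters

console.log(anchored.test(tooLong));    // false: the string does not end within 10 characters
console.log(unanchored.test(tooLong));  // true: a 10-character prefix satisfies the lookahead
```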

  • Conclusion and Beyond:

    In this blog post, I demonstrated how the regular expression lookahead feature can be used as an implicit conditional AND operator, and utilized to validate whole string length with simplicity and elegance.

    A single regular expression can also contain multiple lookaheads, which can be combined with the OR operator ‘|’. Such a scheme would allow the enforced length to vary depending on various conditions within the string.
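    As a purely hypothetical illustration, suppose tokens that start with the digit 0 were allowed up to 12 characters, while all other tokens kept the 10-character limit. That rule is invented for this sketch and is not part of the problem above. In JavaScript, it could be expressed with pairs of lookaheads joined by an OR:

```javascript
// Each alternative pairs a condition lookahead with a length lookahead.
// Hypothetical rule: a leading 0 raises the limit from 10 to 12 characters.
const variableLimit =
  /^(?:(?=0)(?=[\d-]{3,12}$)|(?=[1-9])(?=[\d-]{3,10}$))\d+-\d+$/;

console.log(variableLimit.test('0123456-8901'));    // true: starts with 0, 12 characters
console.log(variableLimit.test('1234567-8901'));    // false: starts with 1, 12 characters
console.log(variableLimit.test('1234567-90'));      // true: 10 characters
console.log(variableLimit.test('0123456-890123'));  // false: 14 characters
```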

    Beyond that, the usefulness of regex lookaheads is not limited to validating whole string lengths. With a minor tweak, lookaheads can also validate the lengths of sub-tokens within larger strings. This would involve inserting one or more lookaheads at various positions within the expression, depending on the number and placement of the sub-tokens being validated, and replacing the lookahead-terminating dollar sign used in the examples in this post with an expression matching some other boundary within the string.
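    A minimal JavaScript sketch of the sub-token idea follows, assuming tokens are delimited by whitespace inside a longer string. The delimiter choice and the sample text are assumptions made for this example. The lookahead's dollar sign is replaced by `(?:\s|$)`, so the length limit applies to each token rather than to the whole string.

```javascript
// The g flag is wanted here, since matchAll needs it to find every token.
// Group 1 captures the token; the leading whitespace is matched but not captured.
const tokenInText = /(?:^|\s)(?=[\d-]{3,10}(?:\s|$))(\d+-\d+)(?=\s|$)/g;

const text = 'ids: 12-4567 1234567-901234567 1-3';
const found = Array.from(text.matchAll(tokenInText), m => m[1]);

console.log(found);  // [ '12-4567', '1-3' ]: the 17-character token is skipped
```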

    Have fun with lookaheads!

Creative Commons License

Copyright (c) 2010-2018 Marat Nepomnyashy

Except where otherwise noted, this webpage is licensed under a Creative Commons Attribution 3.0 Unported License.

Background wallpaper by Patrick Hoesly, used under Creative Commons Attribution 2.0 Generic License.