The regular expressions we use in our daily lives are actually not that "regular." Most of the languages support some kind of extended regular expressions that are computationally more powerful than the "regular" regular expressions as defined by the formal language theory.
For instance, the so often used capture buffers add auxiliary storage to the regular expressions that allow them to match an arbitrary pattern repeatedly. Or look-ahead assertions that allow the regular expression engine to peek ahead before it making a decision. These extensions make regular expressions powerful enough to describe some context-free grammars.
The Perl programming language has an especially rich with regex engine. One of the engine's features is the lazy regular subexpressions. The lazy regular subexpressions are expressed as (??{ code })
, where the "code
" is arbitrary Perl code that gets executed when the moment this subexpression may match.
This allows us to construct something really interesting - we can define a regular expression that has itself in the "code
" part. The result is a recursive regular expression!
One of the classical problems that a regular expression can't match is the language 0n1n
, i.e., a string with a number of zeroes followed by an equal number of ones. Surprisingly, using the lazy regular subexpressions this problem becomes tractable!
Here is a Perl regular expression that matches 0n1n
:
$regex = qr/0(??{$regex})?1/;
This regular expression matches a 0
followed by itself zero or one time, followed by a one. If the itself part doesn't match, then the string this regular expression matches is 01
. If the itself part matches, the string this regular expression matches is 00($regex)?11
, which is 0011
if $regex
doesn't match or it's 000($regex)?111
if it matches, ..., etc.
Here is a Perl program that matches 050000150000
:
#!/usr/bin/perl $str = "0"x50000 . "1"x50000; $regex = qr/0(??{$regex})*1/; if ($str =~ /^$regex$/) { print "yes, it matches" } else { print "no, it doesn't match" }
Now let's look at the Yo Dawg regular expression in the picture above. Can you guess what it does? It matches a fully parenthesized expression such as (foo(bar())baz)
or balanced parentheses ((()()())())
.
$regex = qr/ \( # (1) match an open paren ( ( # followed by [^()]+ # (3) one or more non-paren character | # OR (??{$regex}) # (5) the regex itself )* # (6) repeated zero or more times \) # (7) followed by a close paren ) /x;
Here is how to think about this regular expression. For an expression to be fully parenthesized, it has to start with an open paren, so we match it (point (1)
in the regex). It also has to end with close paren, so we match a close paren at the end (point (7)
). Now we have to think what can be in-between the parens? Well, we can either have some text that is neither an open paren or closed paren (point (3)
) OR we can have another fully parenthesized expression! (point (5)
). And all this may be repeated either zero times (point (6)
) to match the smallest fully parenthesized expression ()
or more times to match a more complex expression.
Without the /x
flag (that allows multiline regexes), it can be written more compactly:
$regex = qr/\(([^()]+|(??{$regex}))*\)/;
But please don't use these regular expressions in production as they are too cryptic. Use Text::Balanced or Regexp::Common Perl modules.
And finally, in Perl 5.10 you can use recursive capture buffers instead of lazy code subexpressions to achieve the same result.
Here is a regular expression that matches 0n1n
and uses the recursive capture buffer syntax (?N)
:
my $rx = qr/(0(?1)*1)/
The (?1)*
says "match the first group zero or more times," where the first group is the whole regular expression.
You can try to rewrite the regular expression that matches balanced parens as an exercise.
Have fun!