This is the seventh part of a nine-part article on Perl one-liners. Perl is not Perl without regular expressions, therefore in this part I will come up with and explain various Perl regular expressions. Please see part one for the introduction of the series.
Perl one-liners is my attempt to create "perl1line.txt" that is similar to "awk1line.txt" and "sed1line.txt" that have been so popular among Awk and Sed programmers, and Unix sysadmins. I will release the perl1line.txt in the next part of the series.
The article on Perl one-liners consists of nine parts:
- Part I: File spacing.
- Part II: Line numbering.
- Part III: Calculations.
- Part IV: String creation and array creation.
- Part V: Text conversion and substitution.
- Part VI: Selective printing and deleting of certain lines.
- Part VII: Handy regular expressions (this part).
- Part VIII: Release of perl1line.txt.
- Part IX: Release of Perl One-Liners e-book.
After I am done with the next part of the article, I will release the whole article series as a pdf e-book! Please subscribe to my blog to be the first to get it. You can also follow me on Twitter.
Awesome news: I have written an e-book based on this article series.
And here are today's one-liners:
109. Match something that looks like an IP address.
/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/
This regex doesn't guarantee that the thing that got matched is in fact a valid IP. All it does is match something that looks like an IP: four groups of one to three digits separated by dots. For example, it matches a valid IP 81.198.240.140 and it also matches an invalid IP such as 923.844.1.999.
Here is how it works. The ^ at the beginning of the regex is an anchor that matches the beginning of the string. Next, \d{1,3} matches one, two or three consecutive digits. The \. matches a literal dot. The $ at the end is an anchor that matches the end of the string. It's important to use both the ^ and $ anchors, otherwise strings like foo213.3.1.2bar would also match.
This regex can be simplified by grouping the first three repeated \d{1,3}\. expressions:
/^(\d{1,3}\.){3}\d{1,3}$/
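To see that the long and the grouped form behave the same, here is a small sketch that runs both against the strings mentioned above. The variable names and the extra 1.2.3 sample are made up for this illustration.

```perl
use strict;
use warnings;

# The long and the grouped form of the pattern; they should agree on every input.
my $long  = qr/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/;
my $short = qr/^(\d{1,3}\.){3}\d{1,3}$/;

# Two samples from the text, one unanchored-looking string, one with too few parts.
my @samples = ('81.198.240.140', '923.844.1.999', 'foo213.3.1.2bar', '1.2.3');

for my $s (@samples) {
    my $l = $s =~ $long  ? 'match' : 'no match';
    my $g = $s =~ $short ? 'match' : 'no match';
    print "$s: long=$l, grouped=$g\n";
}
```

Note that $ also matches just before a trailing newline. If that matters for your input, \z is the stricter end-of-string anchor.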
110. Test if a number is in range 0-255.
/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/
Here is how it works. A number can be one, two or three digits long. If it's a one-digit number, we allow it to be anything in [0-9]. If it's a two-digit number, we also allow any combination of [0-9][0-9]. However, if it's a three-digit number, it has to be either one hundred-something or two hundred-something. If it's one hundred-something, then 1[0-9][0-9] matches it. If it's two hundred-something, then it's either something up to 249, which is matched by 2[0-4][0-9], or it's 250-255, which is matched by 25[0-5].
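A quick way to convince yourself that the alternation covers the whole range is to test every number from 0 to 255. This is only a sketch, and the variable names are made up.

```perl
use strict;
use warnings;

my $byte = qr/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/;

# Collect every number in 0..255 that the regex fails to match (should be none).
my @rejected = grep { $_ !~ $byte } 0 .. 255;
print "rejected from 0..255: ", scalar(@rejected), "\n";

# Values just outside the range, and one with a leading zero.
for my $s ('256', '300', '07') {
    print "$s: ", ($s =~ $byte ? 'match' : 'no match'), "\n";
}
```

The last sample shows a side effect of the two-digit alternative [0-9][0-9]: strings with a leading zero such as 07 are accepted too.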
111. Match an IP address.
my $ip_part = qr/([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])/; if ($ip =~ /^($ip_part\.){3}$ip_part$/) { print "valid ip\n"; }
This regexp combines the previous two. The qr/.../ operator compiles the regular expression and puts it in the $ip_part variable. Then $ip_part is used to match the four parts of the IP address.
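Wrapped in a function, the same idea might look like the sketch below. The name is_ipv4 is hypothetical, and I use non-capturing groups (?:...) here because nothing needs to be captured.

```perl
use strict;
use warnings;

# One octet in the range 0-255, compiled once with qr//.
my $ip_part = qr/(?:[0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])/;

# is_ipv4 is a hypothetical helper name used only in this sketch.
sub is_ipv4 {
    my ($ip) = @_;
    # Three "octet dot" groups followed by a final octet.
    return $ip =~ /^(?:$ip_part\.){3}$ip_part$/ ? 1 : 0;
}

for my $ip ('81.198.240.140', '923.844.1.999', '1.2.3') {
    print "$ip: ", (is_ipv4($ip) ? 'valid ip' : 'not an ip'), "\n";
}
```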
112. Check if the string looks like an email address.
/.+@.+\..+/
This regex makes sure that the string looks like an email address. Notice that I say "looks like". It doesn't guarantee it is an email address. Here is how it works: first it matches something up to the @ symbol, then it matches as much as possible until it finds a dot, and then it matches some more. If this succeeds, then it's something that at least looks like an email address, with the @ symbol and a dot in it.
For example, cats@catonmat.net matches but cats@catonmat doesn't, because the regex can't find the dot \. that it requires.
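The sketch below shows just how loose "looks like" is. The third sample string is made up to demonstrate that anything with an @ followed later by a dot passes.

```perl
use strict;
use warnings;

my $looks_like_email = qr/.+@.+\..+/;

for my $s ('cats@catonmat.net', 'cats@catonmat', 'this is @ not . an email') {
    print "$s: ", ($s =~ $looks_like_email ? 'looks like email' : 'no'), "\n";
}
```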
A much more robust way to check if a string is a valid email address is to use the Email::Valid module:
use Email::Valid; print (Email::Valid->address('john@example.com') ? 'valid email' : 'invalid email');
113. Check if the string is a decimal number.
Checking if the string is a number is really difficult. I based my regex and explanation on the one in Perl Cookbook.
Perl offers \d that matches digits 0-9. So we can start with:
/^\d+$/
This regex matches one or more digits \d starting at the beginning of the string ^ and ending at the end of the string $. However, this doesn't match numbers such as +3 and -3. Let's modify the regex to match them:
/^[+-]?\d+$/
Here the [+-]? means match an optional plus or minus before the digits. This now matches +3 and -3 but it doesn't match -0.3. Let's add that:
/^[+-]?\d+\.?\d*$/
Now we have expanded the previous regex by adding \.?\d*, which matches an optional dot followed by zero or more digits. Now we're in business and this regex also matches numbers like -0.3 and 0.3.
A much better way to match a decimal number is to use the Regexp::Common module, which offers various useful regexes. For example, to match an integer you can use $RE{num}{int} from Regexp::Common.
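To compare the three hand-written steps side by side, here is a small sketch that runs each regex against a few inputs. The labels and the sample list are made up for this illustration.

```perl
use strict;
use warnings;

# The three regexes in the order they were built up above.
my @steps = (
    [ 'digits only'      => qr/^\d+$/ ],
    [ 'optional sign'    => qr/^[+-]?\d+$/ ],
    [ 'optional decimal' => qr/^[+-]?\d+\.?\d*$/ ],
);

for my $input ('3', '+3', '-3', '-0.3', '0.3', '.5', 'abc') {
    # One yes/no column per regex, in the same order as @steps.
    my @hits = map { $input =~ $_->[1] ? 'yes' : 'no' } @steps;
    printf "%-6s %s\n", $input, join(' ', @hits);
}
```

One thing the table makes visible: even the last regex rejects .5, because \d+ requires at least one digit before the dot.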
How about positive hexadecimal numbers? Here is how:
/^0x[0-9a-f]+$/i
This matches the hex prefix 0x followed by the hex number itself. The /i flag at the end makes sure that the match is case insensitive. For example, 0x5af matches, 0X5Fa matches but 97 doesn't, because it's just a decimal number.
It's better to use $RE{num}{hex} because it supports negative numbers, decimal places and number grouping.
Now how about octal? Here is how:
/^0[0-7]+$/
Octal numbers are prefixed by 0, which is followed by octal digits 0-7. For example, 013 matches but 09 doesn't, because it's not a valid octal number.
It's better to use $RE{num}{oct} for the same reasons as above.
Finally binary:
/^[01]+$/
Binary base consists of just 0s and 1s. For example, 010101 matches but 210101 doesn't, because 2 is not a valid binary digit.
It's better to use $RE{num}{bin} for the same reasons as above.
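The hex, octal and binary regexes can be tried together with a sketch like this one. The kinds helper is a hypothetical name; it simply lists every pattern that a string matches.

```perl
use strict;
use warnings;

my %re = (
    hex => qr/^0x[0-9a-f]+$/i,
    oct => qr/^0[0-7]+$/,
    bin => qr/^[01]+$/,
);

# kinds() is a hypothetical helper: it returns the names of all matching
# patterns, sorted alphabetically and joined with commas.
sub kinds {
    my ($s) = @_;
    return join ',', grep { $s =~ $re{$_} } sort keys %re;
}

for my $s ('0x5af', '0X5Fa', '97', '013', '09', '010101', '210101') {
    my $k = kinds($s);
    print "$s: ", ($k eq '' ? 'no match' : $k), "\n";
}
```

Notice that 010101 matches both the binary and the octal regex. Since the binary pattern has no prefix, these regexes alone can't tell you which base was meant.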
114. Check if a word appears twice in the string.
/(word).*\1/
This regex matches word followed by something or nothing at all, followed by the same word. Here the (word) captures the word in group 1 and \1 refers to the contents of group 1, therefore it's almost the same as writing /(word).*word/.
For example, silly things are silly matches /(silly).*\1/, but silly things are boring doesn't, because silly is not repeated in the string.
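Here is the example as runnable code, followed by a variation that is not part of the one-liner: instead of a fixed word it captures any whole word with \w+ and \b word boundaries. The repeated_word helper is a hypothetical name.

```perl
use strict;
use warnings;

# The fixed-word form from the text.
for my $s ('silly things are silly', 'silly things are boring') {
    print "$s: ", ($s =~ /(silly).*\1/ ? 'repeated' : 'not repeated'), "\n";
}

# Variation (an assumption of this sketch): return the first whole word that
# occurs again later in the string, or undef if there is none.
sub repeated_word {
    my ($s) = @_;
    return $s =~ /\b(\w+)\b.*\b\1\b/ ? $1 : undef;
}

my $word = repeated_word('the cat and the hat');
print "repeated word: $word\n";
```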
115. Increase all numbers by one in the string.
$str =~ s/(\d+)/$1+1/ge
Here we use the substitution operator s///. It matches all integers (\d+), puts them in capture group 1, then it replaces them with their value incremented by one $1+1. The g flag makes sure it finds all the numbers in the string, and the e flag evaluates $1+1 as a Perl expression.
For example, this 1234 is awesome 444 gets turned into this 1235 is awesome 445.
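To see what the e flag contributes, this sketch runs the substitution with and without it on a copy of the same string. The variable names are made up.

```perl
use strict;
use warnings;

my $str = 'this 1234 is awesome 444';

# With /e the replacement is evaluated as Perl code, so $1+1 is arithmetic.
(my $incremented = $str) =~ s/(\d+)/$1+1/ge;

# Without /e the replacement is an interpolated string, so "+1" is literal text.
(my $literal = $str) =~ s/(\d+)/$1+1/g;

print "$incremented\n";   # this 1235 is awesome 445
print "$literal\n";       # this 1234+1 is awesome 444+1
```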
116. Extract HTTP User-Agent string from the HTTP headers.
/^User-Agent: (.+)$/
HTTP headers are formatted as Key: Value pairs. It's very easy to parse such strings: you just instruct the regex engine to save the Value part in the $1 group variable.
For example, if the HTTP headers contain:
Host: localhost:8000
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)
Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
Then the regular expression will extract the Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US) string.
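If the headers are held in one multi-line string rather than read line by line, the regex needs the /m flag so that ^ and $ match at line boundaries. The sketch below assumes the headers are already in a variable and that lines end in a plain "\n" (real HTTP uses "\r\n", which would leave a trailing "\r" in the capture).

```perl
use strict;
use warnings;

# A shortened copy of the headers above, joined with "\n" (an assumption).
my $headers = join "\n",
    'Host: localhost:8000',
    'Connection: keep-alive',
    'User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)',
    'Accept-Encoding: gzip,deflate,sdch';

# With /m, ^ and $ match at the start and end of every line.
my ($agent) = $headers =~ /^User-Agent: (.+)$/m;

# Without /m, ^ only matches at the very start of the string, so this fails.
my ($none) = $headers =~ /^User-Agent: (.+)$/;

print "agent: $agent\n";
print "without /m: ", (defined $none ? $none : 'no match'), "\n";
```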
117. Match printable ASCII characters.
/[ -~]/
This is really tricky and smart. To understand it, take a look at man ascii. You'll see that space has the value 0x20 and the ~ character is 0x7e. All the characters between a space and ~ are printable. This regular expression matches exactly that. The [ -~] defines a range of characters from space to ~. This is my favorite regexp of all time.
You can invert the match by placing ^ as the first character in the character class:
/[^ -~]/
This matches the opposite of [ -~], that is, any character that is not printable ASCII.
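A common use of the inverted class is stripping everything that isn't printable ASCII. Here is a small sketch; the sample string and the is_printable helper are made up for this illustration.

```perl
use strict;
use warnings;

# A string with a non-ASCII byte (\xe9), a tab and a bell character mixed in.
my $raw = "caf\xe9\tbar\x07!";

# Delete every character that is not printable ASCII.
(my $clean = $raw) =~ s/[^ -~]//g;

# is_printable is a hypothetical helper: true if the whole string consists
# of printable ASCII characters only.
sub is_printable {
    my ($s) = @_;
    return $s =~ /\A[ -~]*\z/ ? 1 : 0;
}

print "$clean\n";   # cafbar!
print "printable? ", is_printable('Hello, world!'), "\n";
```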
118. Match text between two HTML tags.
m|<strong>([^<]*)</strong>|
This regex matches everything between the <strong>...</strong> HTML tags. The trick here is the ([^<]*), which matches as much as possible until it finds a < character, which starts the next tag.
Alternatively you can write:
m|<strong>(.*?)</strong>|
But this is a little different. For example, if the HTML is <strong><em>hello</em></strong> then the first regex doesn't match anything, because a < immediately follows <strong> and ([^<]*) can't match across it, so the </strong> that the regex requires next is never found. The second regex matches <em>hello</em> because the (.*?)</strong> matches as little as possible until it finds </strong>, which happens to be <em>hello</em>.
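The difference between the two regexes is easy to check with a sketch like the following. The helper names between_negated and between_lazy are hypothetical.

```perl
use strict;
use warnings;

# Both helpers are hypothetical names; each returns the captured text,
# or undef when the regex doesn't match.
sub between_negated {
    my ($html) = @_;
    my ($inner) = $html =~ m|<strong>([^<]*)</strong>|;
    return $inner;
}

sub between_lazy {
    my ($html) = @_;
    my ($inner) = $html =~ m|<strong>(.*?)</strong>|;
    return $inner;
}

for my $html ('<strong>hello</strong>', '<strong><em>hello</em></strong>') {
    my $n = between_negated($html);
    my $l = between_lazy($html);
    print "$html\n";
    print "  negated class: ", (defined $n ? $n : 'no match'), "\n";
    print "  lazy match:    ", (defined $l ? $l : 'no match'), "\n";
}
```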
However, don't use regular expressions for matching and parsing HTML. Use modules like HTML::TreeBuilder to accomplish the task more cleanly.
119. Replace all <b> tags with <strong>.
$html =~ s|<(/)?b>|<$1strong>|g
Here I assume that the HTML is in the variable $html. Next, the <(/)?b> matches the opening and closing <b> tags, captures the optional closing tag slash in group $1 and then replaces the matched tag with either <strong> or </strong>, depending on whether it was an opening or closing tag.
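Here is the substitution applied to a made-up HTML snippet. The sketch uses a slight variation, <(/?)b>, with the ? inside the group. That way $1 is an empty string rather than undefined for opening tags, which should avoid an uninitialized-value warning when running under use warnings.

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <b>more</b>';

# <(/?)b> puts the optional slash inside the group, so $1 is always defined
# (an empty string for opening tags, "/" for closing tags).
(my $out = $html) =~ s|<(/?)b>|<$1strong>|g;

print "$out\n";   # <strong>bold</strong> and <strong>more</strong>
```

Keep in mind that this only handles bare lowercase <b> and </b> tags. Tags with attributes or in uppercase are left alone.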
120. Extract all matches from a regular expression.
my @matches = $text =~ /regex/g;
Here the regular expression gets evaluated in list context, which makes it return all the matches. The matches get put in the @matches variable.
For example, the following regex extracts all numbers from a string:
my $text = "10 hello 25 moo 31 foo"; my @nums = $text =~ /\d+/g;
@nums now contains (10, 25, 31).
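The same list-context trick also works with capture groups, in which case the match returns the captured pieces instead of the whole matches. The sketch below shows both forms; the key=value string is made up for this illustration.

```perl
use strict;
use warnings;

my $text = "10 hello 25 moo 31 foo";

# Without capture groups, /g in list context returns every whole match.
my @nums = $text =~ /\d+/g;

# With capture groups it returns the captured pieces instead, here key/value
# pairs that can be assigned straight to a hash.
my %pairs = "a=1 b=2 c=3" =~ /(\w+)=(\d+)/g;

print "@nums\n";   # 10 25 31
print join(', ', map { $_ . '=' . $pairs{$_} } sort keys %pairs), "\n";   # a=1, b=2, c=3
```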
Perl one-liners explained e-book
I've now written the "Perl One-Liners Explained" e-book based on this article series. I went through all the one-liners, improved explanations, fixed mistakes and typos, added a bunch of new one-liners, added an introduction to Perl one-liners and a new chapter on Perl's special variables.
Have Fun!
Thanks for reading the article! In the next part I am releasing the perl1line.txt that will contain all the one-liners in a single file.