Using Regular Expressions - Part 1 of 3 - Overview
Author: willem
In: bash scripting, coding, javascript, php, python, tools
Regular expressions (also known as ‘regex’ or ‘regexp’) are short text strings for describing search patterns. Regex can be used to search for simple or extremely complex patterns through blocks of text, and most programming languages as well as several Unix command-line tools (such as grep, expr, sed, awk, and vi) and Mac OS X desktop applications (such as BBEdit, SubEthaEdit, and TextMate) can interpret and use them. As of version 3, the Bash shell also acquired its own regex-match operator that can be used in Bash scripts.
A cheat sheet covering all the regular expression topics mentioned below can be downloaded here.
Introduction to Regular Expressions:
The most basic type of regular expression consists of a single literal character:
a
A search using this expression will match the first occurrence of the character ‘a‘ in a target string, so if the target string was:
This is a sentence
… the regular expression will match the ‘a‘ between ‘is ‘ and ‘ sentence‘.
This regular expression could match the second (and third, and fourth, etc.) occurrence of ‘a‘ too, but will only do so if you tell the regex engine to continue searching through the target string.
A more complex regular expression can contain optional and non-optional characters, repetition, etc., for example:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z.]{2,4}\b
This regular expression describes a series of uppercase letters, digits, dots, underscores, percentage signs, plusses and dashes, followed by an ‘@’ sign, followed by another series of uppercase letters, digits, dots, and dashes, followed by a single dot and between two and four uppercase letters or dots. In other words, this pattern describes an email address written in uppercase letters.
Using the above regular expression you could search through a file to find email addresses or verify if a given string looks like an email address.
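As a rough sketch of both uses (searching and verifying), here is the same pattern in Python's built-in re module. The choice of Python and the sample addresses are assumptions made for illustration; the post itself targets the Perl 5 flavor.

```python
import re

# The email pattern from the text; it only covers uppercase addresses.
EMAIL = r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z.]{2,4}\b"

# Hypothetical sample text, made up for this example.
text = "Contact JOHN.DOE@EXAMPLE.COM or SALES@EXAMPLE.ORG today."

# Search: collect every address found in the text.
found = re.findall(EMAIL, text)

# Verify: fullmatch() succeeds only if the whole string fits the pattern.
looks_valid = re.fullmatch(EMAIL, "JOHN.DOE@EXAMPLE.COM") is not None
looks_invalid = re.fullmatch(EMAIL, "JOHN.DOE@EXAMPLE") is not None

print(found)
print(looks_valid, looks_invalid)
```

The pattern only checks that a string looks like an address; it says nothing about whether the mailbox exists.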
Developers can use regular expressions to speed up searches. For example, this regular expression:
\bgeek(|[sy])\b
… will search through a target string for all instances of ‘geek‘, ‘geeks‘, and ‘geeky‘ in one search, whereas a plain-text search would have to search through the target string three times, once for each term.
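A small sketch of that single-pass search, assuming Python's re module and a made-up sample sentence:

```python
import re

# Hypothetical sample sentence containing all three variants.
text = "A geek met two geeks at a geeky meetup."

# One pass over the text finds 'geek', 'geeks' and 'geeky'.
matches = [m.group(0) for m in re.finditer(r"\bgeek(|[sy])\b", text)]
print(matches)
```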
Regular Expression Engines:
A regular expression engine (or regex engine) is a piece of software that can process regular expressions and match patterns to target strings. These engines are usually integrated into larger applications and will be invoked by the applications when you need them.
There are several different regular expression engines available, and they’re not fully compatible with each other. Examples include:
- Perl 5 (the most popular engine)
- PCRE (‘Perl-Compatible Regular Expressions’, open-source engine used by Apache, PHP, and other applications)
- ECMA (used by Javascript)
- POSIX BRE (‘Portable Operating System Interface for uniX’, ‘Basic Regular Expressions’, used by default by grep, oldest regex flavor in use today)
- POSIX ERE (‘Portable Operating System Interface for uniX’, ‘Extended Regular Expressions’, used by default by egrep, doesn’t have a lot of functionality)
A detailed comparison of regex engines can be found here.
Since the regular expression flavor used by Perl 5 is the most popular one available, this tutorial will focus on that engine except where mentioned otherwise.
Literal Characters
The most basic type of regular expression consists of a single literal character:
a
A search using this expression will match the first occurrence of the character ‘a‘ in a target string, so if the target string was:
This is a sentence
… the regular expression would match the ‘a‘ between ‘is ‘ and ‘ sentence‘.
Similarly, this regular expression:
geek
… will match the first occurrence of the character sequence ‘geek‘ in a target string.
These regular expressions could match the second (and third, and fourth, etc.) occurrence of their patterns too, but will only do so if you tell the regex engine to continue searching through the target string.
Regular expression engines are case sensitive by default, so if a file contained the following:
geek
Geek
GEEK
This is a sentence containing the word 'geek'.
This is a sentence containing the word 'Geek'.
This is a sentence containing the word 'GEEK'.
… the pattern ‘geek‘ would match only the first line in the file (or the first and the fourth lines if the regex engine was told to continue searching through the target string).
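The behaviour described above can be checked with a short sketch. It assumes Python's re module, with the six lines of the example file held in a list rather than read from disk.

```python
import re

# The six lines of the example file.
lines = [
    "geek",
    "Geek",
    "GEEK",
    "This is a sentence containing the word 'geek'.",
    "This is a sentence containing the word 'Geek'.",
    "This is a sentence containing the word 'GEEK'.",
]

# 1-based numbers of the lines on which the case-sensitive pattern matches.
sensitive = [n for n, line in enumerate(lines, 1) if re.search(r"geek", line)]

# The same search with case insensitivity switched on.
insensitive = [n for n, line in enumerate(lines, 1)
               if re.search(r"geek", line, re.IGNORECASE)]

# A single literal character: search() stops at the first occurrence.
first_a = re.search(r"a", "This is a sentence").start()

print(sensitive, insensitive, first_a)
```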
Special Characters
Regex engines reserve certain characters (known as ‘metacharacters’) for special use:
- ‘[‘ (opening square bracket): used to define character sets
- ']‘ (closing square bracket): used to define character sets
- ‘\‘ (backslash): used to escape characters
- ‘^‘ (caret): marks the start of a string, or indicates negative functionality
- ‘$‘ (dollar sign): marks the end of a string
- ‘.‘ (dot): indicates any character except a line break
- ‘|‘ (pipe): used to separate alternating options
- ‘?‘ (question mark): makes the preceding character optional
- ‘*‘ (asterisk): matches the preceding character zero or more times
- ‘+‘ (plus): matches the preceding character one or more times
- ‘(‘ (opening round bracket): used to define groups
- ‘)‘ (closing round bracket): used to define groups
- ‘{‘ (opening curly brace): used to define repetition
- ‘}‘ (closing curly brace): used to define repetition
If you want to use any of these characters as a literal in a regular expression you need to escape them with a backslash. For example, if you wanted to match ‘$9.99‘ literally, you would need to escape both the dollar sign and the dot (an unescaped dot would match any character, so it would also accept ‘$9x99‘):
\$9\.99
Since backslashes used in combination with certain literal characters can create regex tokens with special meanings (e.g. ‘\d’ indicates a single digit from 0 to 9), non-metacharacters should not be escaped with backslashes.
The Perl and PCRE regex engines also support the ‘\Q…\E‘ escape sequence, where all characters between the two tokens will be interpreted as literal characters. For example, this regular expression:
\Q*\d+*\E
… will match the literal text ‘*\d+*‘. If the above was the entire regular expression, the ‘\E‘ could also be omitted:
\Q*\d+*
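Python's re module does not support ‘\Q…\E‘, but its re.escape() function has a similar effect. The sketch below assumes Python and uses made-up test strings to show why the dot needs escaping and how a whole string can be treated literally.

```python
import re

# An unescaped dot matches any character, so this also accepts "$9x99".
loose = re.search(r"\$9.99", "$9x99") is not None

# Escaping the dot restricts the match to a literal dot.
strict = re.search(r"\$9\.99", "$9x99") is not None

# re.escape() backslash-escapes metacharacters so the text is matched literally.
literal_text = "*\\d+*"            # the characters  * \ d + *
literal_pattern = re.escape(literal_text)
whole = re.fullmatch(literal_pattern, literal_text) is not None

print(loose, strict, whole)
```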
Non-Printable Characters
Certain tokens can be used to add non-printable characters to your regular expressions:
- ‘\t‘: a tab (ASCII 0x09)
- ‘\a‘: a bell (ASCII 0x07)
- ‘\e‘: an escape (ASCII 0x1B)
- ‘\f‘: a form feed (ASCII 0x0C)
- ‘\v‘: a vertical tab (ASCII 0x0B)
- ‘\r‘: a carriage return (ASCII 0x0D)
- ‘\n‘: a line feed (ASCII 0x0A) (used by Unix for line breaks)
- ‘\r\n‘: a carriage return and line feed (used by Windows for line breaks)
Any character can be included in a regular expression if you know its hexadecimal ASCII or ANSI code. For example, the code for the tab character in the Latin-1 character set is ‘0x09‘. To include it in a regular expression, you can use:
\x09
The code for a small letter ‘e‘ with an acute (‘é’) in the Latin-1 character set is ‘0xE9‘. To include it in a regular expression, you can use:
\xE9
The Perl and PCRE regex engines also support the ‘\cA‘ through ‘\cZ‘ tokens to insert ASCII control characters. The last letter in the token should be an uppercase letter from A to Z, and will indicate Control+A through to Control+Z.
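A brief sketch of the hexadecimal tokens, assuming Python's re module (which accepts ‘\xhh‘ but, as far as I know, not the ‘\cA‘ style tokens). The sample strings are made up.

```python
import re

# \x09 is the tab character.
tab = re.search(r"\x09", "name\tvalue")

# \xE9 is a small letter e with an acute accent.
e_acute = re.search(r"\xE9", "caf\u00e9")

# Windows line breaks are a carriage return followed by a line feed.
crlf = re.findall(r"\r\n", "one\r\ntwo\r\n")

print(tab.start(), e_acute.start(), len(crlf))
```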
Character Sets:
Character sets match only one out of several characters. For example, to match an ‘a‘ or ‘e‘, you would use:
[ae]
When used in a larger expression, this expression will match a single character out of the set. For example, the following regular expression:
gr[ae]y
… will match both ‘gray‘ and ‘grey‘. Character sets will never match multiple characters, so the above expression will never match ‘graay‘, ‘graey‘, or ‘greay‘.
Hyphens can be used inside character sets to specify a range of letters or digits. For example, this regular expression:
[0-9]
… will match a single digit between ‘0‘ and ‘9‘, while this regular expression:
[A-Z]
… will match a single uppercase letter between ‘A‘ and ‘Z‘. Single characters and ranges can be combined within character sets, for example:
[A-Za-z123]
… will match any uppercase or lowercase letter or the number ‘1‘, ‘2‘, or ‘3‘. The order of the characters and ranges within a character set does not matter.
To match the opposite of a character set, you need to add a caret after the opening square bracket. For example, this regular expression:
[^A-Z123]
… will match any character that is not an uppercase letter or the number ‘1‘, ‘2‘, or ‘3‘.
Unlike the dot (see below), negated character sets will also match invisible characters (such as line breaks).
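The character set examples so far can be tried out with a short sketch, assuming Python's re module and made-up target strings:

```python
import re

# One character out of the set: 'graey' and 'greay' are not matched.
colours = re.findall(r"gr[ae]y", "gray grey graey greay")

# Ranges and single characters combined in one set.
mixed = re.findall(r"[A-Za-z123]", "a-Z 1 9")

# A negated set matches anything not listed, including a line break.
negated = re.findall(r"[^A-Z123]", "A1b\n")

print(colours, mixed, negated)
```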
Both normal and negated character sets don’t optionally match characters. For example, this regular expression:
c[^ha]
… means “find a ‘c’ followed by a character that is not a ‘h’ or an ‘a’“, not “find a ‘c’ not followed by a ‘h’ or an ‘a’“. In other words, when the above regular expression is applied to each of these target strings in turn:
ice
drastic
statically
… it will only match in ‘ice‘ (where it matches ‘ce‘, because the ‘c’ is followed by a character other than a ‘h’ or an ‘a’) and not in ‘drastic‘ (because the ‘c’ isn’t followed by a character) or ‘statically‘ (because the ‘c’ is followed by an ‘a’). If the three words sat on one line separated by spaces, the ‘c’ in ‘drastic‘ would match too, because the space after it is a character other than ‘h’ or ‘a’.
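The point that a negated set must still consume a character can be checked with a sketch like this one, assuming Python's re module. The helper name first_match is made up for the example.

```python
import re

pattern = re.compile(r"c[^ha]")

def first_match(text):
    """Return the first matched text, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

# Each word tested as its own target string.
results = {word: first_match(word) for word in ["ice", "drastic", "statically"]}

# On one line, the space after 'drastic' counts as "not h or a".
same_line = first_match("drastic statically")

print(results, repr(same_line))
```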
In character sets, the only metacharacters are the ‘]‘ (closing square bracket), ‘\‘ (backslash), ‘^‘ (caret) and ‘-‘ (dash). All other metacharacters function as literals within character sets and don’t need to be escaped with a backslash.
For example, to search for a ‘$‘ (dollar sign) or a ‘.‘ (dot), you would use the following regular expression:
[$.]
To include a ‘]‘, ‘\‘, ‘^‘, or ‘-‘ as a literal character within a character set you might need to escape it with a backslash, depending on where in the character set it’s used:
- ‘]‘: can be used without being escaped right after the opening square bracket or a caret indicating negation.
- ‘\‘: should always be escaped.
- ‘^‘: can be used without being escaped anywhere except right after the opening square bracket.
- ‘-‘: can be used without being escaped right after the opening square bracket or a caret indicating negation, or right before the closing square bracket.
The following examples are not properly escaped regular expression character sets:
[x]]
[^x]]
[\x]
[x\]
… while the following examples are properly escaped regular expression character sets:
[]x] - (matches either ']' or 'x')
[^]x] - (matches any character not ']' or 'x')
[x\]] - (matches either 'x' or ']')
[\\x] - (matches either '\' or 'x')
[\^x] - (matches either '^' or 'x')
[^^x] - (matches any character not '^' or 'x')
[-x] - (matches either '-' or 'x')
[^x-] - (matches any character not 'x' or '-')
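The properly escaped sets can be verified with a quick sketch. It assumes Python's re module, which to my knowledge follows the same placement rules for these characters; the target strings are made up.

```python
import re

# Each entry: (character set, target string, characters expected to match).
cases = [
    (r"[]x]",  "a]x",  ["]", "x"]),
    (r"[^]x]", "a]x",  ["a"]),
    (r"[x\]]", "a]x",  ["]", "x"]),
    (r"[\\x]", "a\\x", ["\\", "x"]),
    (r"[\^x]", "a^x",  ["^", "x"]),
    (r"[^^x]", "a^x",  ["a"]),
    (r"[-x]",  "a-x",  ["-", "x"]),
    (r"[^x-]", "a-x",  ["a"]),
]

# True for every set that behaves as described.
outcomes = [re.findall(charset, target) == expected
            for charset, target, expected in cases]
print(outcomes)
```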
Since character sets are used often, a series of shorthand tokens are available:
- ‘\d‘: any digit from ‘0′ to ‘9′.
- ‘\w‘: any digit from ‘0′ to ‘9′, any uppercase or lowercase letter from ‘a’ to ‘z’ or ‘A’ to ‘Z’, or an ‘_’ (underscore).
- ‘\s‘: any space, tab, or line break.
Shorthand character set tokens also have negated versions:
- ‘\D‘: any character not a digit from ‘0′ to ‘9′.
- ‘\W‘: any character not a digit from ‘0′ to ‘9′, an uppercase or lowercase letter from ‘a’ to ‘z’ or ‘A’ to ‘Z’, or an ‘_’ (underscore).
- ‘\S‘: any character not a space, tab, or line break.
Shorthand character set tokens can be used inside or outside of character sets:
- ‘\s\d‘: matches a space, tab, or line break followed by a digit from ‘0′ to ‘9′.
- ‘[\s\d]‘: matches either a space, tab, or linebreak, or a digit from ‘0′ to ‘9′.
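A short sketch of the difference between the two forms, assuming Python's re module and made-up input:

```python
import re

# A whitespace character followed by a digit (two characters per match).
sequence = re.findall(r"\s\d", "a 1\t2b3")

# Either a whitespace character or a digit (one character per match).
either = re.findall(r"[\s\d]", "a 1b")

print(sequence, either)
```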
If character sets are repeated by using the ‘?‘, ‘*‘, or ‘+‘ operators, the entire character set will be repeated, not just the character that it matched. In other words, this regular expression:
[0-9]+
… will match ‘123‘ as well as ‘111‘. To repeatedly match the matched character, you need to use backreferences. For example, this regular expression:
([0-9])\1+
… will match ‘111‘ but not ‘123‘.
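To see the difference between repeating the set and repeating the matched character, here is a minimal sketch assuming Python's re module:

```python
import re

samples = ["123", "111"]

# The whole set is repeated, so any run of digits is accepted.
any_digits = [s for s in samples if re.fullmatch(r"[0-9]+", s)]

# The backreference repeats the exact digit that the group captured.
same_digit = [s for s in samples if re.fullmatch(r"([0-9])\1+", s)]

print(any_digits, same_digit)
```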
The Dot:
The dot (‘.’) matches any single character except line breaks. For example, the following regular expression:
ge.k
… will match ‘geek‘, ‘gerk‘, ‘ge$k‘, etc.
Because the dot can match any character except a line break, it should be used sparingly. For example, if you wanted to match all tags in a block of HTML code using this regular expression (the asterisk means that the preceding character can repeat zero to infinite times):
<.*>
… on this target string:
<p>This is a paragraph.</p>
… the result will be “<p>This is a paragraph.</p>“, not “<p>” and “</p>“. In cases like this the dot should be replaced with a character set:
<[^<>\r\n]*>
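Both patterns can be compared side by side with a sketch like the following, assuming Python's re module:

```python
import re

html = "<p>This is a paragraph.</p>"

# The greedy dot runs to the last '>' and swallows the whole string.
greedy = re.findall(r"<.*>", html)

# The character set cannot cross a '<' or '>', so each tag is found separately.
tags = re.findall(r"<[^<>\r\n]*>", html)

print(greedy, tags)
```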
Anchors
In regular expressions, anchors match positions, not characters:
- ‘^‘ (caret): matches at the start of a string.
- ‘$‘ (dollar sign): matches at the end of a string.
- ‘\b‘: matches at a word boundary (a position between a character that can be matched by ‘\w’ and a character that cannot be matched by ‘\w’, also matches at start and / or end of a string if the first and / or last characters of the string are word characters).
- ‘\B‘: matches at any position where ‘\b’ will not match.
For example, the following regular expression:
^a
… applied to this target string:
abc
… will match ‘a‘ because ‘a‘ is at the start of the string. When applied to the same target string, the following regular expression:
^b
… will not match ‘b‘ because ‘b‘ is not at the start of the string.
Similarly, the regular expression ‘c$‘ will match ‘c‘ because ‘c‘ is at the end of the string, but the regular expression ‘b$‘ will not match ‘b‘ because ‘b‘ is not at the end of the string.
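The anchor examples translate directly into a small sketch, assuming Python's re module. The word boundary line uses a made-up sentence.

```python
import re

target = "abc"

start_a = re.search(r"^a", target) is not None   # 'a' is at the start
start_b = re.search(r"^b", target) is not None   # 'b' is not
end_c = re.search(r"c$", target) is not None     # 'c' is at the end
end_b = re.search(r"b$", target) is not None     # 'b' is not

# \b keeps 'is' from matching inside 'This'.
whole_words = re.findall(r"\bis\b", "This is it")

print(start_a, start_b, end_c, end_b, whole_words)
```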
Alternation
Alternation is the regular expression equivalent of ‘or‘. The following regular expression:
(geeky|blog)
… applied to this target string:
This is a geeky blog.
… will match both ‘geeky‘ and ‘blog‘.
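A minimal sketch of this alternation, assuming Python's re module. A single search returns the first option found; continuing the search returns both.

```python
import re

target = "This is a geeky blog."

# The first match only.
first = re.search(r"(geeky|blog)", target).group(0)

# Continuing the search through the rest of the string.
every = [m.group(0) for m in re.finditer(r"(geeky|blog)", target)]

print(first, every)
```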
Repetition:
Regular expressions can use one of three repetition tokens:
- ‘?‘ (question mark): makes the preceding character optional.
- ‘*‘ (asterisk): matches the preceding character zero or more times.
- ‘+‘ (plus): matches the preceding character one or more times.
The ‘?‘ (question mark) token can also be used to make several characters optional by grouping them with round brackets and placing the question mark after the closing bracket. For example, the following regular expression:
10(th)? Str(eet)?
… will match ‘10th Street‘, ‘10th Str‘, ‘10 Str‘, and ‘10 Street‘.
Additionally, curly braces can be used to specify a specific amount of repetition. For example, the following regular expression:
\b[0-9]{2}\b
… will match any number exactly two digits long, while the following regular expression:
\b[0-9]{2,4}\b
… will match any number between two and four digits long.
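The repetition examples can be exercised with a sketch like this, assuming Python's re module and made-up input strings:

```python
import re

# Optional groups: 'th' and 'eet' may each be present or absent.
street = r"10(th)? Str(eet)?"
candidates = ["10th Street", "10th Str", "10 Str", "10 Street", "10th St"]
accepted = [s for s in candidates if re.fullmatch(street, s)]

# Curly braces: exactly two digits, then two to four digits.
numbers = "7 42 123 2024 12345"
two = re.findall(r"\b[0-9]{2}\b", numbers)
two_to_four = re.findall(r"\b[0-9]{2,4}\b", numbers)

print(accepted, two, two_to_four)
```

The word boundaries matter here: without them, ‘[0-9]{2}‘ would also match the first two digits of a longer number.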
Backreferences:
Backreferences can be used in matches to rematch a value. For example, if you wanted to match a pair of opening and closing HTML tags and the text in between, you could use this regular expression:
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
… where “\1” indicates that the value matched by the first set of round brackets should be repeated.
Each set of round brackets gets its own numbered backreference, so a regex with three round-bracketed expressions could use “\1“, “\2“, and “\3“.
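The tag-matching pattern can be tried with a sketch like this, assuming Python's re module. The HTML snippets are made up and use uppercase tag names because the pattern only lists ‘A-Z‘.

```python
import re

TAG_PAIR = r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>"

html = "<B>bold</B> and <I>italic</I>"

# Whole matches, and the tag name captured by the first group.
pairs = [m.group(0) for m in re.finditer(TAG_PAIR, html)]
names = [m.group(1) for m in re.finditer(TAG_PAIR, html)]

# A closing tag that differs from the opening tag does not match.
mismatched = re.search(TAG_PAIR, "<B>bold</I>")

print(pairs, names, mismatched)
```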
Regex Matching Modes
Most regular expression engines support the following four matching modes:
- ‘/i‘: makes the regex match case insensitive.
- ‘/s‘: enables ’single-line’ mode, makes the ‘.’ (dot) match newlines.
- ‘/m‘: enables ‘multi-line’ mode, makes the ‘^’ (caret) match after each newline and the ‘$’ (dollar) match before each newline, in addition to the start and end of the string.
- ‘/x‘: enables ‘free-spacing’ mode, causes whitespace between regex tokens to be ignored, and allows an unescaped ‘#’ (hash) to start a comment.
Most programming languages (e.g. Perl) allow you to specify option flags at the end of regular expressions. For example, in this regular expression:
m/abc/i
… the ‘/i‘ turns on case insensitivity, causing the regex to match both ‘abc‘ and ‘ABC‘.
In other languages, modes can be enabled by adding a mode modifier to the start of a regular expression. For example, in this regular expression:
(?i)abc
… the ‘(?i)‘ turns on case insensitivity, causing the regex to match both ‘abc‘ and ‘ABC‘.
Modes can also be enabled for only parts of regular expressions by using a ‘-‘ (dash). For example, in this regular expression:
(?i)ab(?-i)c
… case insensitivity will only be turned on for the first two characters (‘a’ and ‘b’), causing only ‘abc‘, ‘Abc‘, ‘aBc‘, and ‘ABc‘ to be matched, not ‘abC‘, ‘AbC‘, etc.
Instead of using two modifiers to turn an option on and off, you can use a modifier span. For example, this regular expression:
(?i)ab(?-i)cd(?i)ef
… could be shortened to:
(?i)ab(?-i:cd)ef
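How modes are written differs between engines. As a sketch, Python's re module takes flags as an argument, accepts ‘(?i)‘ at the start of a pattern, and (in version 3.6 and later, which this example assumes) accepts modifier spans such as ‘(?i:…)‘ and ‘(?-i:…)‘. To my knowledge it does not accept a bare ‘(?-i)‘ in the middle of a pattern, so the span form is used instead.

```python
import re

# Case insensitivity as a flag and as a leading mode modifier.
by_flag = re.fullmatch(r"abc", "ABC", re.IGNORECASE) is not None
by_modifier = re.fullmatch(r"(?i)abc", "ABC") is not None

# Case insensitivity for 'ab' only, written as a modifier span.
partial = [s for s in ["abc", "ABc", "abC", "ABC"]
           if re.fullmatch(r"(?i:ab)c", s)]

# Case insensitivity everywhere except 'cd'.
span = [s for s in ["ABcdEF", "abcdef", "ABCDEF"]
        if re.fullmatch(r"(?i)ab(?-i:cd)ef", s)]

# Single-line mode lets the dot match a newline.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_all = re.search(r"a.b", "a\nb", re.DOTALL) is not None

# Multi-line mode lets ^ and $ match at every line.
per_line = re.findall(r"^\w+$", "one\ntwo", re.MULTILINE)

print(by_flag, by_modifier, partial, span, dot_default, dot_all, per_line)
```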
Capturing Groups vs. Atomic Groups
Capturing groups and atomic groups are two methods of processing groups of alternating values in regular expressions.
An example of a capturing group in a regular expression is:
1(23|2)3
This regular expression will match both ‘1233‘ and ‘123‘.
An example of an atomic group in a regular expression is:
1(?>23|2)3
This regular expression will match ‘1233‘ but not ‘123‘.
When applied to the target string ‘123‘ (where position #1 = ‘1′, position #2 = ‘2′, and position #3 = ‘3′), both of the above regular expressions will match ‘1‘ to ‘1‘ (position #1), ‘23‘ to ‘23‘ (position #2 and position #3), and then ‘3‘ will fail to match. At this point the difference between capturing and atomic groups comes into play:
The capturing group remembered a backtracking position for the alternation (where in the target string the alternation was matched), so the group will give up its match, try to match the next alternation option (‘2′) to ‘2‘ (position #2), then match ‘3‘ to ‘3‘ (position #3), and return a positive result.
The atomic group discarded its backtracking position for the alternation (where in the target string the alternation was matched) after it found a match from the alternation (in this case, ‘23′ matched against ‘23′ in position #2 and position #3). Because the backtracking position was discarded, the second alternation option (‘2′) won’t be tried.
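Not every engine accepts the ‘(?>…)‘ syntax; Python's re module, for instance, only gained it in version 3.11. The sketch below therefore swaps in a common emulation, a lookahead that captures followed by a backreference, which relies on the engine not backtracking into a lookahead once it has succeeded. The group name alt is made up for the example, and this is a stand-in for an atomic group, not the syntax shown above.

```python
import re

# Capturing group: backtracks from '23' to '2', so '123' matches.
capturing = re.search(r"1(23|2)3", "123")

# Emulated atomic group: the lookahead picks '23' and is not retried,
# and the backreference then consumes exactly what the lookahead captured.
EMULATED_ATOMIC = r"1(?=(?P<alt>23|2))(?P=alt)3"

atomic_123 = re.search(EMULATED_ATOMIC, "123")
atomic_1233 = re.search(EMULATED_ATOMIC, "1233")

print(capturing.group(0), capturing.group(1))
print(atomic_123, atomic_1233.group(0))
```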
Atomic grouping can be used to optimize regular expressions. For example, when this regular expression:
\b(here|her|he)\b
… is applied to this target string:
hereby
… the regex engine will match ‘\b‘ at the start of the target string, then match the first alternation option (‘here’) to the target string, but then fail to match ‘\b‘ after the first alternation option match. Using backtracking, the regex engine will then match the second alternation option (‘her’) in the target string, but fail to match ‘\b‘ after the second alternation option match. Finally, using backtracking the regex engine will then match the third alternation option (‘he’) in the target string, but fail to match ‘\b‘ after the third alternation option match and return a negative result for the entire expression.
All this processing can be optimized by using an atomic grouping to indicate that if the regex engine can’t match ‘\b‘ after matching the first alternation option (‘here’), it shouldn’t bother trying the rest of the options (to make this work properly, the alternation options were ordered from greatest to smallest length - hence, if the option with the greatest length couldn’t match, the rest won’t either). When this regular expression:
\b(?>here|her|he)\b
… is applied to this target string:
hereby
… the regex engine will match ‘\b‘ at the start of the target string, then match the first alternation option (‘here’) to the target string. Because an alternation option was matched and this is an atomic group, backtracking will be discarded. The regex engine will fail to match the ‘\b‘ after the first alternation option match, and because backtracking was discarded, a negative result will be returned for the entire expression.
Lookahead and Lookbehind
The following regular expression using a character set:
123[^4]
… will match instances of ‘123‘ followed by any character other than ‘4‘, and will include the additional character in the returned match.
The following regular expression using a character set:
123[4]
… will match instances of ‘123‘ followed by ‘4‘, and will include the additional character in the returned match.
To perform matches like the above but exclude the additional character, you need to use positive or negative lookahead.
The following regular expression using negative lookahead:
123(?!4)
… will match instances of ‘123‘ not followed by ‘4‘, but will only return ‘123‘ as the match.
The following regular expression using positive lookahead:
123(?=4)
… will match instances of ‘123‘ followed by ‘4‘, but will only return ‘123‘ as the match.
Lookbehind matches work the same as lookahead matches, with the exception that the required syntax is added before the match characters:
(?<!4)123
… or:
(?<=4)123
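The difference between consuming a character and only looking at it can be seen in a sketch like this one, assuming Python's re module and made-up target strings:

```python
import re

# A character set consumes the extra character...
consumed = re.search(r"123[^4]", "1235").group(0)

# ...while lookahead only inspects it.
negative_ahead = re.search(r"123(?!4)", "1235").group(0)
positive_ahead = re.search(r"123(?=4)", "1234").group(0)

# At the end of a string there is no character to consume,
# so only the lookahead version matches.
set_at_end = re.search(r"123[^4]", "123")
ahead_at_end = re.search(r"123(?!4)", "123")

# Lookbehind inspects the character before the match.
positive_behind = re.search(r"(?<=4)123", "4123")
negative_behind = re.search(r"(?<!4)123", "4123")

print(consumed, negative_ahead, positive_ahead)
print(set_at_end, ahead_at_end.group(0))
print(positive_behind.span(), negative_behind)
```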
Comments:
Since regular expressions can become complex very quickly, comments are extremely useful.
The syntax for a comment in a regular expression is:
(?#abc)
… where ‘abc‘ is your comment and can contain any characters except a ‘)‘ (closing round bracket).
The following is an example of a regular expression that contains a comment:
\b[0-9]{2,4}(?#any number from 00 to 9999)\b