Using Regular Expressions – Part 2 of 4 – Regex in PHP

General

PHP has three sets of functions that allow you to work with regular expressions:

Functions that start with “ereg”

POSIX Extended Regular Expressions, mainly supported for backward compatibility with PHP 3, officially deprecated as of PHP 5.3.0 and removed in PHP 7.0.0.

Functions that start with “mb_ereg”

Regular expressions that can work with multibyte characters, provided by the mbstring extension. The “mb_ereg” functions are available in PHP 4.2.0 and later. Unlike the “ereg” functions they have not been formally deprecated, but the “preg” functions remain the better choice for new code.

Functions that start with “preg”

PHP wrappers for the PCRE (Perl-Compatible Regular Expressions) Library. PCRE offers significant advantages and advances over POSIX Extended Regular Expressions, so the “preg” functions should be used in all PHP code that uses regular expressions (the “/u” modifier should be added where multibyte functionality is needed). PHP includes PCRE by default as of PHP 4.2.0.

All of PHP’s “preg” functions require that you specify regular expressions as strings in the Perl syntax: the pattern is enclosed in a pair of delimiters, so it starts and ends with the same character. For example:

$matchCount = preg_match('/geekology/', $haystack);

… or:

$matchCount = preg_match('%geekology%', $haystack);

Wherever the delimiter character appears inside the pattern itself, it must be escaped with a backslash:

$matchCount = preg_match('/http:\/\/www\.geekology\.co\.za\/blog\//', $haystack);
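
A quick sketch of why the choice of delimiter matters: picking a character that does not occur in the pattern avoids most of the escaping. The $haystack value here is made up for illustration.

```php
<?php
// Hypothetical input string, used only for this example.
$haystack = 'Visit http://www.geekology.co.za/blog/ for more.';

// "/" as the delimiter: every literal slash has to be escaped.
$withSlashes = preg_match('/http:\/\/www\.geekology\.co\.za\/blog\//', $haystack);

// "#" as the delimiter: slashes need no escaping (the dots still do).
$withHashes = preg_match('#http://www\.geekology\.co\.za/blog/#', $haystack);

var_dump($withSlashes, $withHashes); // int(1) int(1)
```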

The PHP “preg” Functions

preg_filter():

mixed preg_filter ( mixed $pattern , mixed $replacement , mixed $subject [, int $limit = -1 [, int &$count ]] )

Perform a regular expression search and replace.

Returns an array if the $subject parameter is an array, or a string otherwise. preg_filter() behaves like preg_replace(), except that it returns only the subjects in which a match was found. If no match is found or an error occurred, it returns an empty array when $subject is an array, or NULL otherwise.

  • pattern: The pattern to search for, as a string or an array with strings.
  • replacement: The string or an array with strings to replace. If this parameter is a string and the $pattern parameter is an array, all patterns will be replaced by that string. If both $pattern and $replacement parameters are arrays, each $pattern will be replaced by the $replacement counterpart. If there are fewer elements in the $replacement array than in the $pattern array, any extra $pattern will be replaced by an empty string.
  • subject: The string or an array with strings to search and replace.
  • limit: The maximum possible replacements for each pattern in each $subject string. Defaults to -1 (no limit).
  • count: If specified, this variable will be filled with the number of replacements done.
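
To make the difference from preg_replace() concrete, here is a minimal sketch with made-up sample data: preg_replace() hands back every subject, while preg_filter() keeps only the ones where the pattern matched (the original array keys are preserved).

```php
<?php
// Sample data invented for this example.
$subjects = array('apple 1', 'banana', 'cherry 22');

// preg_replace() returns every subject, changed or not.
$replaced = preg_replace('/\d+/', '#', $subjects);
// array('apple #', 'banana', 'cherry #')

// preg_filter() returns only the subjects in which the pattern matched.
$filtered = preg_filter('/\d+/', '#', $subjects);
// array(0 => 'apple #', 2 => 'cherry #')

print_r($replaced);
print_r($filtered);
```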

preg_grep():

array preg_grep ( string $pattern , array $input [, int $flags = 0 ] )

Returns the array consisting of the elements of the $input array that match the given $pattern.

Returns an array indexed using the keys from the $input array.

  • pattern: The pattern to search for, as a string.
  • input: The input array.
  • flags: If set to PREG_GREP_INVERT, this function returns the elements of the input array that do not match the given $pattern.
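
As a small illustration (the input array is invented for this sketch), preg_grep() can pick out the entries that look like whole numbers, and PREG_GREP_INVERT returns the rest. Note how the keys of the input array survive in both results.

```php
<?php
// Sample input, made up for this example.
$input = array('a' => '10', 'b' => 'ten', 'c' => '3.5', 'd' => '42');

// Elements that consist only of digits.
$integers = preg_grep('/^\d+$/', $input);
// array('a' => '10', 'd' => '42')

// Everything that does NOT match the pattern.
$others = preg_grep('/^\d+$/', $input, PREG_GREP_INVERT);
// array('b' => 'ten', 'c' => '3.5')

print_r($integers);
print_r($others);
```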

preg_last_error():

int preg_last_error ( void )

Returns the error code of the last PCRE regex execution.

Returns one of the following constants:

  • PREG_NO_ERROR
  • PREG_INTERNAL_ERROR
  • PREG_BACKTRACK_LIMIT_ERROR
  • PREG_RECURSION_LIMIT_ERROR
  • PREG_BAD_UTF8_ERROR
  • PREG_BAD_UTF8_OFFSET_ERROR
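
One way to use this in practice, sketched here with a deliberately broken subject string: when a “preg” function returns FALSE, compare preg_last_error() against the constants above to find out why.

```php
<?php
// The byte 0xFF can never appear in valid UTF-8, so matching it
// with the "u" modifier fails instead of returning 0 or 1.
$result = preg_match('/./u', "\xFF");
$error  = preg_last_error();

if ($result === false && $error === PREG_BAD_UTF8_ERROR) {
    echo "Subject is not valid UTF-8\n";
}
```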

preg_match_all():

int preg_match_all ( string $pattern , string $subject , array &$matches [, int $flags [, int $offset ]] )

Searches $subject for all matches to the regular expression given in $pattern and puts them in $matches in the order specified by $flags. After the first match is found, the subsequent searches are continued on from the end of the last match.

Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.

  • pattern: The pattern to search for, as a string.
  • subject: The input string.
  • matches: Array of all matches in multidimensional array ordered according to flags.
  • flags: Can be a combination of the flags defined below (note that it doesn’t make sense to use PREG_PATTERN_ORDER together with PREG_SET_ORDER).
  • offset: Normally, the search starts from the beginning of the subject string. The optional parameter offset can be used to specify the alternate place from which to start the search (in bytes).

PREG_PATTERN_ORDER: Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.

PREG_SET_ORDER: Orders results so that $matches[0] is an array of first set of matches, $matches[1] is an array of second set of matches, and so on.

PREG_OFFSET_CAPTURE: If this flag is passed, the string offset of every match is returned along with the match. Note that this turns every element of $matches into an array consisting of the matched string at index 0 and its string offset into $subject at index 1.
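
The three flags are easier to compare side by side. The following is only a sketch with an invented subject string; the comments show how $matches is laid out in each case.

```php
<?php
// Sample subject and pattern, made up for this example.
$subject = 'x=1, y=22';
$pattern = '/(\w)=(\d+)/';

// Grouped by subpattern: [0] full matches, [1] first group, [2] second group.
$count = preg_match_all($pattern, $subject, $byPattern, PREG_PATTERN_ORDER);
// $count === 2
// $byPattern === array(array('x=1', 'y=22'), array('x', 'y'), array('1', '22'))

// Grouped by match: one array per match, each holding full match + groups.
preg_match_all($pattern, $subject, $bySet, PREG_SET_ORDER);
// $bySet === array(array('x=1', 'x', '1'), array('y=22', 'y', '22'))

// Each match becomes array(matched string, byte offset in $subject).
preg_match_all('/\d+/', $subject, $withOffsets, PREG_OFFSET_CAPTURE);
// $withOffsets[0] === array(array('1', 2), array('22', 7))
```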

preg_match():

int preg_match ( string $pattern , string $subject [, array &$matches [, int $flags [, int $offset ]]] )

Searches $subject for a match to the regular expression given in $pattern.

Returns the number of times $pattern matches. That will be either 0 times (no match) or 1 time because preg_match() will stop searching after the first match. preg_match_all() on the contrary will continue until it reaches the end of $subject. preg_match() returns FALSE if an error occurred.

  • pattern: The pattern to search for, as a string.
  • subject: The input string.
  • matches: If provided, it is filled with the results of the search. $matches[0] contains the text that matched the full pattern, $matches[1] the text that matched the first parenthesized subpattern, and so on.
  • flags: Can be PREG_OFFSET_CAPTURE. If this flag is passed, the string offset of every match is returned along with the match. Note that this turns every element of $matches into an array consisting of the matched string at index 0 and its string offset into $subject at index 1.
  • offset: Normally, the search starts from the beginning of the subject string. The optional parameter offset can be used to specify the alternate place from which to start the search (in bytes).
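
A short sketch of typical usage, with a made-up subject string: the return value tells you whether there was a match, and $matches holds the full match followed by each captured group.

```php
<?php
// Sample subject, invented for this example.
$subject = 'Released on 2010-03-15.';

$found = preg_match('/(\d{4})-(\d{2})-(\d{2})/', $subject, $matches);
// $found === 1
// $matches === array('2010-03-15', '2010', '03', '15')

// With PREG_OFFSET_CAPTURE each entry is array(matched string, byte offset).
preg_match('/\d{4}/', $subject, $withOffset, PREG_OFFSET_CAPTURE);
// $withOffset[0] === array('2010', 12)
```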

preg_quote():

string preg_quote ( string $str [, string $delimiter = NULL ] )

preg_quote() takes $str and puts a backslash in front of every character that is part of the regular expression syntax. This is useful if you have a run-time string that you need to match in some text and the string may contain special regex characters. The special regular expression characters are: . \ + * ? [ ^ ] $ ( ) { } = ! < > | : -

Returns the quoted string.

  • str: The input string.
  • delimiter: If the optional $delimiter is specified, it will also be escaped. This is useful for escaping the delimiter that is required by the PCRE functions. The / is the most commonly used delimiter.
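
To see why this matters, consider a search string that happens to contain regex metacharacters. The values below are invented for this sketch; without preg_quote() the “$” is read as an end-of-string anchor and the match fails.

```php
<?php
// Hypothetical run-time string containing regex metacharacters.
$needle = 'Price: $5.00 (approx.)';
$text   = 'Price: $5.00 (approx.) per unit';

// Unquoted: "$", "." and "(" ")" keep their special meaning, so no match.
$unquoted = preg_match('/' . $needle . '/', $text); // 0

// Quoted: every special character is escaped and matched literally.
$quoted = preg_match('/' . preg_quote($needle, '/') . '/', $text); // 1

// Passing the delimiter as second argument escapes it as well.
$escaped = preg_quote('a/b', '/'); // 'a\/b'
```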

preg_replace_callback():

mixed preg_replace_callback ( mixed $pattern , callback $callback , mixed $subject [, int $limit = -1 [, int &$count ]] )

The behavior of this function is almost identical to preg_replace(), except for the fact that instead of a $replacement parameter, one should specify a $callback.

preg_replace_callback() returns an array if the $subject parameter is an array, or a string otherwise. On errors the return value is NULL. If matches are found, the new subject will be returned, otherwise $subject will be returned unchanged.

  • pattern: The pattern to search for, as a string or an array with strings.
  • callback: A callback that will be called and passed an array of matched elements in the $subject string. The callback should return the replacement string.
  • subject: The string or an array with strings to search and replace.
  • limit: The maximum possible replacements for each pattern in each $subject string. Defaults to -1 (no limit).
  • count: If specified, this variable will be filled with the number of replacements done.
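
As a rough example (the subject string is made up, and the anonymous function assumes PHP 5.3 or later), the callback receives the array of matched elements and returns the replacement text:

```php
<?php
// Sample subject, invented for this example.
$subject = 'Widget 10 EUR, gadget 25 EUR';

// Double every number found in the subject.
$doubled = preg_replace_callback(
    '/\d+/',
    function ($m) {
        // $m[0] is the full match, e.g. '10'.
        return (string) ($m[0] * 2);
    },
    $subject,
    -1,
    $count
);

echo $doubled, "\n"; // Widget 20 EUR, gadget 50 EUR
echo $count, "\n";   // 2
```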

preg_replace():

mixed preg_replace ( mixed $pattern , mixed $replacement , mixed $subject [, int $limit = -1 [, int &$count ]] )

Searches $subject for matches to $pattern and replaces them with $replacement.

preg_replace() returns an array if the $subject parameter is an array, or a string otherwise. If matches are found, the new $subject will be returned, otherwise $subject will be returned unchanged or NULL if an error occurred.

  • pattern: The pattern to search for, as a string or an array with strings. The “e” modifier makes preg_replace() treat the $replacement parameter as PHP code after the appropriate references substitution is done.
  • replacement: The string or an array with strings to replace. If this parameter is a string and the $pattern parameter is an array, all patterns will be replaced by that string. If both $pattern and $replacement parameters are arrays, each $pattern will be replaced by the $replacement counterpart. If there are fewer elements in the $replacement array than in the $pattern array, any extra $pattern will be replaced by an empty string.
  • subject: The string or an array with strings to search and replace. If $subject is an array, then the search and replace is performed on every entry of $subject, and the return value is an array as well.
  • limit: The maximum possible replacements for each pattern in each $subject string. Defaults to -1 (no limit).
  • count: If specified, this variable will be filled with the number of replacements done.
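
The parameter rules above are easiest to follow with a few small calls. This is only a sketch with invented input strings:

```php
<?php
// Backreferences: $1, $2, $3 refer to the parenthesized subpatterns.
$iso = preg_replace('/(\d{2})\/(\d{2})\/(\d{4})/', '$3-$2-$1', '15/03/2010');
// '2010-03-15'

// Arrays: each pattern is replaced by its counterpart in $replacement.
$fox = preg_replace(
    array('/quick/', '/brown/'),
    array('slow', 'black'),
    'The quick brown fox',
    -1,
    $count
);
// 'The slow black fox', $count === 2

// Fewer replacements than patterns: the extra pattern is replaced by ''.
$short = preg_replace(array('/a/', '/b/'), array('x'), 'aabb');
// 'xx'
```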

preg_split():

array preg_split ( string $pattern , string $subject [, int $limit = -1 [, int $flags = 0 ]] )

Split the given string by a regular expression.

Returns an array containing substrings of $subject split along boundaries matched by $pattern.

  • pattern: The pattern to search for, as a string.
  • subject: The input string.
  • limit: If specified, then only substrings up to $limit are returned with the rest of the string being placed in the last substring. A $limit of “-1”, “0”, or NULL means “no limit” and, as is standard across PHP, you can use NULL to skip to the $flags parameter.
  • flags: Can be a combination of the flags defined below (combined with the bitwise “|” operator).

PREG_SPLIT_NO_EMPTY: If this flag is set, only non-empty pieces will be returned by preg_split().

PREG_SPLIT_DELIM_CAPTURE: If this flag is set, parenthesized expression in the delimiter pattern will be captured and returned as well.

PREG_SPLIT_OFFSET_CAPTURE: If this flag is set, the string offset of every returned piece is included as well. Note that this turns every element of the returned array into an array consisting of the substring at index 0 and its string offset into $subject at index 1.
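
Here is a small sketch of how the limit and the flags change the result. The subject string is invented for this example.

```php
<?php
// Sample subject, made up for this example.
$subject = 'a, b,,c';

$plain = preg_split('/,\s*/', $subject);
// array('a', 'b', '', 'c')

$noEmpty = preg_split('/,\s*/', $subject, -1, PREG_SPLIT_NO_EMPTY);
// array('a', 'b', 'c')

// The parenthesized part of the delimiter is returned too.
$withDelims = preg_split('/(,)\s*/', $subject, -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
// array('a', ',', 'b', ',', ',', 'c')

// A limit of 2: the rest of the string ends up in the last element.
$limited = preg_split('/,\s*/', $subject, 2);
// array('a', 'b,,c')

$offsets = preg_split('/,\s*/', 'a, b', -1, PREG_SPLIT_OFFSET_CAPTURE);
// array(array('a', 0), array('b', 3))
```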

“preg” Modifiers

Regex modifiers are specified in the same way as they are in Perl. For example, “/geekology/i” applies case insensitivity, while “/geekology/u” enables multibyte (Unicode) matching instead of 8-bit matching.

i (PCRE_CASELESS):

If this modifier is set, letters in the pattern match both upper and lower case letters.
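
A two-line sketch with a made-up subject string:

```php
<?php
$sensitive   = preg_match('/geekology/', 'Welcome to Geekology');  // 0
$insensitive = preg_match('/geekology/i', 'Welcome to Geekology'); // 1
```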

m (PCRE_MULTILINE):

By default, PCRE treats the subject string as consisting of a single “line” of characters (even if it actually contains several newlines). The “start of line” metacharacter (^) matches only at the start of the string, while the “end of line” metacharacter ($) matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl. When this modifier is set, the “start of line” and “end of line” constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl’s /m modifier. If there are no “\n” characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.
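
For instance, in this sketch (the subject is invented), “^” matches once without the modifier and once per line with it:

```php
<?php
// Two lines separated by a newline.
$subject = "first line\nsecond line";

// Without "m": "^" matches only at the very start of the string.
$withoutM = preg_match_all('/^\w+/', $subject, $a);
// 1 match: 'first'

// With "m": "^" also matches right after each newline.
$withM = preg_match_all('/^\w+/m', $subject, $b);
// 2 matches: 'first', 'second'
```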

s (PCRE_DOTALL):

If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl’s “/s” modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
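
A minimal sketch, using a made-up subject that spans two lines:

```php
<?php
$subject = "<p>one\ntwo</p>";

// Without "s" the dot stops at the newline, so the pattern cannot match.
$withoutS = preg_match('/<p>(.*)<\/p>/', $subject);        // 0

// With "s" the dot matches the newline as well.
$withS = preg_match('/<p>(.*)<\/p>/s', $subject, $m);      // 1
// $m[1] === "one\ntwo"
```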

x (PCRE_EXTENDED):

If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped “#” outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl’s “/x” modifier, and makes it possible to include comments inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence “(?(” which introduces a conditional subpattern.
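
As an illustration, the date pattern below is spread over several lines with a comment on each part. The pattern itself is just an example and behaves the same as '/(\d{4})-(\d{2})-(\d{2})/'.

```php
<?php
// Whitespace and "#" comments inside the pattern are ignored under "x".
$pattern = '/
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
/x';

$found = preg_match($pattern, '2010-03-15', $m);
// $found === 1
// $m === array('2010-03-15', '2010', '03', '15')
```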

e (PREG_REPLACE_EVAL):

If this modifier is set, preg_replace() does normal substitution of backreferences in the replacement string, evaluates it as PHP code, and uses the result for replacing the search string. Single quotes, double quotes, backslashes (\) and NULL chars will be escaped by backslashes in substituted backreferences.

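Only preg_replace() uses this modifier; it is ignored by other PCRE functions. Note that this modifier was deprecated in PHP 5.5.0 and removed in PHP 7.0.0, so preg_replace_callback() should be used instead.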

A (PCRE_ANCHORED):

If this modifier is set, the pattern is forced to be “anchored”, that is, it is constrained to match only at the start of the string which is being searched (the “subject string”). This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.
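
A quick sketch with made-up subjects:

```php
<?php
$unanchored = preg_match('/\d+/', 'abc 123');   // 1: digits found later on
$anchored   = preg_match('/\d+/A', 'abc 123');  // 0: must match at the start
$atStart    = preg_match('/\d+/A', '123 abc');  // 1
```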

D (PCRE_DOLLAR_ENDONLY):

If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. There is no equivalent to this modifier in Perl.
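
This matters when validating input that may carry a trailing newline. A small sketch:

```php
<?php
// Without "D", "$" also matches just before a final newline.
$lenient = preg_match('/^\d+$/', "123\n");   // 1

// With "D", "$" matches only at the very end of the subject.
$strict = preg_match('/^\d+$/D', "123\n");   // 0
```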

S:

When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character. When the “S” modifier is set, PHP calls the pcre_study() function from the PCRE API before executing the regexp. The result from the function will be passed directly to pcre_exec().

U (PCRE_UNGREEDY):

This modifier inverts the “greediness” of the quantifiers so that they are not greedy by default, but become greedy if followed by “?”. It is not compatible with Perl. It can also be set by a “(?U)” modifier setting within the pattern, and a single quantifier can be made non-greedy by placing a question mark behind it (e.g. “.*?”).
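
The effect is easiest to see on a subject with two candidate end points. This is only a sketch with an invented string:

```php
<?php
$subject = '<b>one</b> <b>two</b>';

// Greedy (default): ".*" runs to the last "</b>".
preg_match('/<b>.*<\/b>/', $subject, $greedy);
// $greedy[0] === '<b>one</b> <b>two</b>'

// "U" modifier: quantifiers are lazy by default.
preg_match('/<b>.*<\/b>/U', $subject, $ungreedy);
// $ungreedy[0] === '<b>one</b>'

// Same result for one quantifier only, using "?" after it.
preg_match('/<b>.*?<\/b>/', $subject, $lazy);
// $lazy[0] === '<b>one</b>'
```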

X (PCRE_EXTRA):

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Any backslash in a pattern that is followed by a letter that has no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as a literal. There are at present no other features controlled by this modifier.

J (PCRE_INFO_JCHANGED):

The “(?J)” internal option setting changes the local PCRE_DUPNAMES option, which allows duplicate names for subpatterns.

u (PCRE_UTF8):

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
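
A short sketch of the difference, using the two-byte UTF-8 encoding of “é” written as escape sequences so the example does not depend on the file encoding:

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$subject = "\xC3\xA9";

// Without "u" the dot matches a single byte, so "^.$" fails on two bytes.
$bytes = preg_match('/^.$/', $subject);   // 0

// With "u" the dot matches one whole UTF-8 character.
$chars = preg_match('/^.$/u', $subject);  // 1
```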

Mike The Situation

Geek Master at Geekology
Master of programming and general geekness.