Regular Expressions Explained

All about Regular Expressions made easy

By , 18th October 2005 in Software Engineering

I love it when people say "a simple way to do xyz is to use regular expressions" and then offer what amounts to a string of indecipherable hieroglyphics to answer the question. However, once you know how to leverage the power of regular expressions, they can be very useful tools.

Firstly, let's talk about what exactly a regular expression is. In a nutshell, a regular expression is a string of characters that forms a pattern. This pattern can then be used to match a part, or parts, of another string. There are usually start and end characters to indicate where the pattern starts and stops, and a seemingly random bunch of characters in between. These random characters are in fact representations of different smaller patterns to match, for example, letters, numbers, punctuation or whitespace.

Regular expressions are a compact and usually efficient method for string manipulation, and they can save tens of lines of code for complex operations. For example, a basic email address check can be written in just 4 lines of code, and that's with the lines split up. You could do it in one line!
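As a rough illustration of the email claim above, here is a minimal sketch. The pattern is a deliberately simplified assumption of mine (it is not a full RFC-compliant check), and the sample addresses are made up.

```php
<?php
// Simplified, illustrative pattern (an assumption, not a complete email grammar):
// a local part, "@", a domain, a dot, and a top-level domain of 2+ letters.
$pattern = '/^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$/';

// Hypothetical inputs to try the pattern against.
$candidates = ['jane.doe@example.com', 'not-an-email', 'missing@tld'];

foreach ($candidates as $email) {
    // preg_match returns 1 when the pattern matches and 0 when it does not.
    $isValid = preg_match($pattern, $email) === 1;
    echo $email, ' => ', $isValid ? 'looks valid' : 'rejected', PHP_EOL;
}
```

For real projects it is worth knowing that PHP also ships filter_var() with FILTER_VALIDATE_EMAIL, which may be a better fit than a hand-written pattern.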

One thing that is very frustrating is that while regular expressions are fairly generic, each application "engine" has its own implementation, so it is rarely a simple case of copying and pasting a pattern and having it work. Examples of different engines and languages with their own regular expression flavours are Perl, PHP, .NET, Java, JavaScript, Python, Ruby and POSIX.

The term regular expression is often shortened to just regex. It's easier to say and type, so I'm going to use that from now on.

Let's start with a basic example and see how a regex is constructed. In these examples, I'm going to be using the PHP engine. I also have a tutorial on using Regular Expressions in C# / ASP.Net which you may find useful as well.

In the PHP built-in regular expression functions (the preg_ family), each regex is wrapped in delimiters, with the pattern in between. A forward slash at each end is the usual choice, although PHP accepts other non-alphanumeric characters such as # as well. In this example, I'm going to use "The quick brown fox jumps over the lazy dog" as a string to search in, and I want to find the word "the".

    $string = 'The quick brown fox jumps over the lazy dog';
    $pattern = "/the/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is:

array (size=1)
  0 => 
    array (size=1)
      0 => string 'the' (length=3)

Just the one instance of "the" is found in our input string. But hold on, I hear you cry, there are TWO instances of "the" in the string! And you are correct, there are, but one has a capital T and the other does not. If we change the pattern to "/The/" then we match the other one. How do we get both? Easy: we can either make the whole pattern case insensitive (so it matches upper and lower case equally) or we can explicitly match an upper or lower case T.

Making regex case insensitive
To make our initial pattern case insensitive, it is just a matter of placing an i after the closing forward slash like so:

    $string = 'The quick brown fox jumps over the lazy dog';
    $pattern = "/the/i"; // Note the position of i
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is now:

array (size=1)
  0 => 
    array (size=2)
      0 => string 'The' (length=3)
      1 => string 'the' (length=3)

Matching combinations of characters
Regex can also match combinations of characters. Let's start simple and match an upper case or lower case T. A pair of square brackets contains a set of characters, numbers or symbols, and the set as a whole matches exactly one character from it. Inside the square brackets, you insert the characters to match. In this example, we want to find The and the, so we need to change the pattern to look for T or t.

    $string = 'The quick brown fox jumps over the lazy dog';
    $pattern = "/[Tt]he/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

This will now match an uppercase T or a lower case t followed by 'he'.

array (size=1)
  0 => 
    array (size=2)
      0 => string 'The' (length=3)
      1 => string 'the' (length=3)

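Prefer the case-insensitive modifier when case does not matter anywhere in the pattern. Use the square bracket method only when that is not suitable, for example when just one letter should be allowed to vary in case.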
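The difference between the two methods shows up as soon as the input is not capitalised the way you expect. The following is a small sketch using a variation of the example sentence (the all-caps "THE" is my own addition for illustration):

```php
<?php
// Variation on the example sentence: the first word is fully upper case.
$string = 'THE quick brown fox jumps over the lazy dog';

// Case-insensitive modifier: every letter may be upper or lower case.
$insensitive = preg_match_all('/the/i', $string, $m1);

// Square brackets: only the first letter may vary, "he" must be lower case.
$firstLetterOnly = preg_match_all('/[Tt]he/', $string, $m2);

echo "/the/i matches: $insensitive", PHP_EOL;        // finds "THE" and "the"
echo "/[Tt]he/ matches: $firstLetterOnly", PHP_EOL;  // finds only "the"
```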

Matching strings starting with
You can also use a special character to indicate "at the start": the caret (^). Adding a caret to the pattern from the previous example means a match is found only if the pattern is at the start of the string.

    $string = 'The quick brown fox jumps over the lazy dog';
    $pattern = "/^The/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is:

array (size=1)
  0 => 
    array (size=1)
      0 => string 'The' (length=3)

Searching for a lower case t at the start of the string does not find any matches.

    $string = 'The quick brown fox jumps over the lazy dog';
    $pattern = "/^the/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is:

array (size=1)
  0 => 
    array (size=0)
      empty

Matching strings ending with
As with matching patterns at the start, there is another special character to match a pattern at the end. This is the dollar symbol ($) and it goes right before the closing forward slash.

    $string = 'The quick brown fox jumps over the lazy dog';
    $pattern = "/the$/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is:

array (size=1)
  0 => 
    array (size=0)
      empty

The word 'the' is not at the end of the string, so there are no matches. The word 'dog', however, is at the end of the string, so we can match that.

    $string = 'The quick brown fox jumps over the lazy dog';
    $pattern = "/dog$/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is:

array (size=1)
  0 => 
    array (size=1)
      0 => string 'dog' (length=3)
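When all you need is a yes or no answer to "does it start with" or "does it end with", preg_match (listed in the functions table at the end) can be used on its own. This is only a sketch of that idea; the variable names are mine.

```php
<?php
$string = 'The quick brown fox jumps over the lazy dog';

// preg_match stops at the first match and returns 1 (found) or 0 (not found).
$startsWithThe = preg_match('/^The/', $string) === 1;
$endsWithDog   = preg_match('/dog$/', $string) === 1;
$endsWithThe   = preg_match('/the$/', $string) === 1;

var_dump($startsWithThe); // bool(true)
var_dump($endsWithDog);   // bool(true)
var_dump($endsWithThe);   // bool(false)
```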

Matching strings starting and ending with (exact match)
We can combine both start and end to form an exact pattern match.

    $string = 'The quick brown fox jumps over the lazy dog';
    $pattern = "/^The quick brown fox jumps over the lazy dog$/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is:

array (size=1)
  0 => 
    array (size=1)
      0 => string 'The quick brown fox jumps over the lazy dog' (length=43)

This is not a particularly good example, since you may as well just do a string compare, but it does lead nicely onto...

Matching strings starting and ending with (wildcard match)

What if you wanted to find a match between two known words? In this example, let's find the fox. In the pattern, we can replace the word fox with a wildcard pattern that will match anything in its place.

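A wildcard matches anything and takes the form of '(.*)'. The dot matches any single character (except a newline), the asterisk repeats it zero or more times, and the parentheses capture whatever was matched so it appears in the results.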

    $string = 'The quick brown fox jumps over the lazy dog';
    $pattern = "/^The quick brown (.*) jumps over the lazy dog$/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is:

array (size=2)
  0 => 
    array (size=1)
      0 => string 'The quick brown fox jumps over the lazy dog' (length=43)
  1 => 
    array (size=1)
      0 => string 'fox' (length=3)
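One thing worth knowing about the '(.*)' wildcard is that it is greedy: it takes as much text as it can while still letting the rest of the pattern match. The sketch below uses a made-up sentence with "jumps" repeated to show this, alongside the lazy form '(.*?)' which takes as little as possible.

```php
<?php
// Hypothetical sentence with the word "jumps" repeated, to expose the difference.
$string = 'The quick brown fox jumps and jumps over the lazy dog';

// Greedy: (.*) grabs as much as it can, then gives back only what it must.
preg_match('/^The quick brown (.*) jumps/', $string, $greedy);

// Lazy: the extra ? makes (.*?) stop at the first place the rest can match.
preg_match('/^The quick brown (.*?) jumps/', $string, $lazy);

echo $greedy[1], PHP_EOL; // fox jumps and
echo $lazy[1], PHP_EOL;   // fox
```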

Multiple matches
We can also match multiple animals in this example, just by adding another wildcard.

    $string = 'The quick brown fox jumps over the lazy dog';
    $pattern = "/^The quick brown (.*) jumps over the lazy (.*)$/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is:

array (size=3)
  0 => 
    array (size=1)
      0 => string 'The quick brown fox jumps over the lazy dog' (length=43)
  1 => 
    array (size=1)
      0 => string 'fox' (length=3)
  2 => 
    array (size=1)
      0 => string 'dog' (length=3)
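If the pattern can only match once, as here, preg_match gives a flatter result that is easier to unpack than the nested arrays from preg_match_all. A minimal sketch, with $first and $second as names of my own choosing:

```php
<?php
$string  = 'The quick brown fox jumps over the lazy dog';
$pattern = '/^The quick brown (.*) jumps over the lazy (.*)$/';

if (preg_match($pattern, $string, $matches) === 1) {
    // $matches[0] is the whole match; [1] and [2] are the two wildcards.
    [, $first, $second] = $matches;
    echo "First animal: $first", PHP_EOL;   // fox
    echo "Second animal: $second", PHP_EOL; // dog
}
```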

Being specific
Wildcards are all well and good, but they will match anything and everything, so it is best to be as specific as possible especially when dealing with user input.

    $string = 'The quick brown DELETE ALL MY DATA jumps over the lazy dog';
    $pattern = "/^The quick brown (.*) jumps over the lazy (.*)$/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is:

array (size=3)
  0 => 
    array (size=1)
      0 => string 'The quick brown DELETE ALL MY DATA jumps over the lazy dog' (length=58)
  1 => 
    array (size=1)
      0 => string 'DELETE ALL MY DATA' (length=18)
  2 => 
    array (size=1)
      0 => string 'dog' (length=3)

By swapping the wildcard for a more accurate range of values or a limited set of characters, you can be sure of a better and safer match. If the value you are expecting is going to be a single word then you can match the alphabet range a-z, and if upper case characters could be involved you can match A-Z as well. As with matching individual upper case and lower case letters, ranges go in square brackets. The square brackets are followed by curly brackets, and the whole thing sits within parentheses "(" and ")". Now we're starting to get Egyptian hieroglyphs!

OK, the pattern for matching one or more letters, uppercase or lowercase, is this:

([A-Za-z]{1,})

Let's break it down. The parentheses form a capture group: whatever matches inside them is returned as a separate entry in the results. The square brackets match a single character from the set inside. A-Z is the alphabetical range in upper case; likewise, a-z is lowercase. Finally, the curly brackets say how many times the preceding item, which in this case is our square bracket, should repeat. The curly brackets take two numbers, the minimum number of repetitions and the maximum. If the maximum is omitted, as in our example, then there is no upper limit, so {1,} means one or more (the same as +). For example, to match a word between 3 and 5 characters long the rule would be {3,5} and to match a word 6 characters or longer it would be {6,}.
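To see the minimum and maximum in action, here is a short sketch that tests a few made-up words against the {3,5} and {6,} rules. The ^ and $ anchors are there so the whole word has to fit the rule, not just part of it.

```php
<?php
// Hypothetical test words of length 2, 3, 5 and 7.
$words = ['ox', 'fox', 'horse', 'giraffe'];
$results = [];

foreach ($words as $word) {
    $threeToFive = preg_match('/^[A-Za-z]{3,5}$/', $word) === 1;
    $sixOrMore   = preg_match('/^[A-Za-z]{6,}$/', $word) === 1;
    $results[$word] = [$threeToFive, $sixOrMore];

    echo $word, ': {3,5} ', $threeToFive ? 'yes' : 'no',
         ', {6,} ', $sixOrMore ? 'yes' : 'no', PHP_EOL;
}
```

Only "fox" and "horse" fit {3,5}, only "giraffe" fits {6,}, and "ox" fits neither.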

Now our pattern looks like this:

    $pattern = "/^The quick brown ([A-Za-z]{1,}) jumps over the lazy ([A-Za-z]{1,})$/";

And the results are much safer, and only contain expected values. If an input string contains invalid or non-matching text, then the result arrays come back empty.

    $string = 'The quick brown INSERT MALICIOUS CODE HERE jumps over the lazy dog';
    $pattern = "/^The quick brown ([A-Za-z]{1,}) jumps over the lazy ([A-Za-z]{1,})$/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is:

array (size=3)
  0 => 
    array (size=0)
      empty
  1 => 
    array (size=0)
      empty
  2 => 
    array (size=0)
      empty

But for valid input:

    $string = 'The quick brown fox jumps over the lazy dog';
    $pattern = "/^The quick brown ([A-Za-z]{1,}) jumps over the lazy ([A-Za-z]{1,})$/";
    preg_match_all($pattern, $string, $matches);
    var_dump($matches);

The result is:

array (size=3)
  0 => 
    array (size=1)
      0 => string 'The quick brown fox jumps over the lazy dog' (length=43)
  1 => 
    array (size=1)
      0 => string 'fox' (length=3)
  2 => 
    array (size=1)
      0 => string 'dog' (length=3)
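In practice you might wrap this check in a small function so the calling code gets either the expected values or nothing at all. The helper below is hypothetical (extractAnimals is a name I made up), and is just one way of sketching the idea.

```php
<?php
/**
 * Hypothetical helper: returns the two animals, or null when the input
 * does not fit the expected sentence.
 */
function extractAnimals(string $input): ?array
{
    $pattern = '/^The quick brown ([A-Za-z]{1,}) jumps over the lazy ([A-Za-z]{1,})$/';

    if (preg_match($pattern, $input, $matches) !== 1) {
        return null; // no match (or a regex error): treat the input as invalid
    }

    return ['first' => $matches[1], 'second' => $matches[2]];
}

var_dump(extractAnimals('The quick brown fox jumps over the lazy dog'));
var_dump(extractAnimals('The quick brown INSERT MALICIOUS CODE HERE jumps over the lazy dog'));
```

The first call returns an array containing 'fox' and 'dog'; the second returns null because the spaces in the injected text are not in the A-Za-z range.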

Regex Quick Reference

Here is a quick reference guide and example regexes for PHP.

Pattern     Meaning
[abc]       A single character: a, b or c
[^abc]      Any single character but a, b, or c
[a-z]       Any single character in the range a-z
[a-zA-Z]    Any single character in the range a-z or A-Z
^           Start of string (start of each line with the m modifier)
$           End of string (end of each line with the m modifier)
.           Any single character except a newline
\s          Any whitespace character
\S          Any non-whitespace character
\d          Any digit
\D          Any non-digit
\w          Any word character (letter, number, underscore)
\W          Any non-word character
\b          A word boundary (a position, not a character)
(...)       Capture everything enclosed
(a|b)       a or b
a?          Zero or one of a
a*          Zero or more of a
a+          One or more of a
a{3}        Exactly 3 of a
a{3,}       3 or more of a
a{3,6}      Between 3 and 6 of a
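The shorthand classes for whitespace, digits, word characters and word boundaries are written with a leading backslash (\s, \d, \w, \b and their upper case opposites). Here is a brief sketch of a few of them at work on a made-up string; in PHP it is easiest to put such patterns in single quotes so the backslashes are passed through untouched.

```php
<?php
// Hypothetical sample text containing digits, whitespace and an underscore.
$string = 'Order 66 shipped on 2005-10-18 to fox_den';

preg_match_all('/\d+/', $string, $digits);     // runs of digits
preg_match_all('/\S+/', $string, $chunks);     // runs of non-whitespace
preg_match_all('/\b\w+\b/', $string, $words);  // whole "word" tokens

echo implode(', ', $digits[0]), PHP_EOL;  // 66, 2005, 10, 18
echo implode(' | ', $chunks[0]), PHP_EOL; // the date stays in one piece
echo implode(' | ', $words[0]), PHP_EOL;  // the date is split at the hyphens
```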

Useful PHP Functions

Function          Usage
preg_match        Perform a regular expression match
preg_match_all    Perform a global regular expression match
preg_split        Split string by a regular expression
preg_replace      Perform a regular expression search and replace
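The examples in this post only use preg_match_all, so here is a quick sketch of preg_split and preg_replace on the same sentence. The replacement text and the animal swap are my own illustrations.

```php
<?php
$string = 'The quick brown fox jumps over the lazy dog';

// preg_split: break the sentence apart on runs of whitespace.
$words = preg_split('/\s+/', $string);
echo count($words), ' words', PHP_EOL; // 9 words

// preg_replace: replace every "The" or "the" (as a whole word) with "a".
$replaced = preg_replace('/\b[Tt]he\b/', 'a', $string);
echo $replaced, PHP_EOL; // a quick brown fox jumps over a lazy dog

// preg_replace with captures: $1, $2 and $3 refer to the three groups.
$swapped = preg_replace('/(fox)(.*)(dog)/', '$3$2$1', $string);
echo $swapped, PHP_EOL; // The quick brown dog jumps over the lazy fox
```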