Simple Regular Expressions by Example

Whenever I ask a group of developers if they are familiar with regular expressions, at least half the responses seem to be along the lines of “I’ve used them, but I don’t like them”. Call me a geek if you like, but I quite like regex. I think it often seems unfriendly because it’s used inappropriately, or just thrown into code with “here be dragons” type comments rather than documentation about what should match and what shouldn’t!

As with most things, it’s pretty easy when you know how, so here’s my one-step-at-a-time approach to regex (stolen from my ZCE preparation tutorial slides). Let’s begin at the very beginning: regular expressions have delimiters, usually a slash character, and these contain a pattern that describes a string.

pattern           notes
/b[aeiou]t/       Matches “bat”, “bet”, “bit”, “bot” and “but”
                  Also matches “cricket bat”, “bitter lemon”

Here, we’ve got an expression that describes something containing the letter “b”, followed by any one of the vowels (a, e, i, o and u), followed by the letter “t”. So long as those three items appear together, in that order, anywhere in a string, it will match this expression.
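If you want to try this yourself, here is a minimal sketch in Python (my choice purely for illustration; the sample strings are made up). Note that Python takes the pattern without the slash delimiters. It also shows that a successful match only captures the part of the string that fits the pattern, not the whole string.

```python
import re

# Python's re module takes the pattern without the surrounding slashes.
pattern = re.compile(r"b[aeiou]t")

# Illustrative sample strings, not taken from any real data.
samples = ["bat", "cricket bat", "bitter lemon", "boat", "BAT"]

# search() succeeds if the pattern appears anywhere in the string.
results = {s: bool(pattern.search(s)) for s in samples}

# The match object holds only the text that matched, not the whole string.
matched_text = pattern.search("cricket bat").group(0)

print(results)
print(matched_text)
```

“boat” fails because there are two vowels between the “b” and the “t”, and “BAT” fails because matching is case-sensitive by default.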

Use a hyphen to denote ranges of characters:

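pattern           notes
/[0-9a-f]*/       Matches a run of lowercase hex digits (* means zero or more, so an empty run counts too)
/[0-9a-zA-Z]/     Upper and lower case are distinct; this matches a single alphanumeric character
/./               The dot ‘.’ matches any single character (except a newline, by default)
/\./              If you actually want to match a dot, escape it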

These examples show matching a selection of characters, including character ranges. Look out for the wildcard, which is a dot: if you want to match a literal dot (for example in a domain name), you need to escape it with a backslash.
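Here is a small sketch of ranges and the dot in action, again in Python as an illustration. The strings “3F-7a”, “exampleXcom” and “www.example.com” are invented for the example.

```python
import re

# Ranges are case-sensitive: a-f does not include A-F.
lower_hex = re.findall(r"[0-9a-f]", "3F-7a")

# Adding A-Z picks up the upper case letter too; the hyphen in the
# sample string is still skipped because it is in none of the ranges.
any_alnum = re.findall(r"[0-9a-zA-Z]", "3F-7a")

# An unescaped dot is a wildcard, so it happily matches the "X" here.
wildcard = bool(re.search(r"example.com", "exampleXcom"))

# An escaped dot only matches a literal dot.
literal_miss = bool(re.search(r"example\.com", "exampleXcom"))
literal_hit = bool(re.search(r"example\.com", "www.example.com"))

print(lower_hex, any_alnum)
print(wildcard, literal_miss, literal_hit)
```

Forgetting to escape the dot is an easy mistake to miss, because the unescaped version still matches everything you wanted it to match, plus some things you didn’t.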

The quantifier goes after the character (or character class) to say how many of something there should be. You can give precise numbers of occurrences with {n}, ranges with {n,m}, or use the ? (0 or 1), + (1 or more) or * (0 or more) characters:

pattern           notes
/b[aeiou]+t/      Matches “bat” and “bit” etc, but also “boot” and “boat”
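To compare the quantifiers side by side, here is a rough Python sketch. The word list and the helper function `matching` are hypothetical, made up for this illustration; exact counts and ranges are written with curly braces, as in {2} or {2,5}.

```python
import re

# Made-up test words: zero, one, two and five vowels between "b" and "t".
words = ["bt", "bat", "boot", "booooot"]

def matching(pattern):
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

zero_or_one = matching(r"b[aeiou]?t")      # ? means 0 or 1 vowels
one_or_more = matching(r"b[aeiou]+t")      # + means 1 or more vowels
zero_or_more = matching(r"b[aeiou]*t")     # * means 0 or more vowels
exactly_two = matching(r"b[aeiou]{2}t")    # {2} means exactly 2 vowels
two_to_five = matching(r"b[aeiou]{2,5}t")  # {2,5} means between 2 and 5

print(zero_or_one, one_or_more, zero_or_more, exactly_two, two_to_five)
```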

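We can also anchor our patterns to the beginning and end of a string using ^ and $ respectively. In multiline mode, ^ and $ match at the start and end of each line instead, and you can use \A and \Z to anchor to the start and end of the whole string. This means that: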

pattern           notes
/^b[aeiou]t/      Will match “battering ram” but not “cricket bat”
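A quick sketch of anchoring, using Python for illustration. The two-line string in `text` is invented, and `re.MULTILINE` is Python’s spelling of multiline mode (other regex flavours switch it on differently, for example with an m modifier after the closing delimiter).

```python
import re

phrases = ["battering ram", "cricket bat"]

# ^ pins the pattern to the start, $ pins it to the end.
start_hits = [bool(re.search(r"^b[aeiou]t", p)) for p in phrases]
end_hits = [bool(re.search(r"b[aeiou]t$", p)) for p in phrases]

# With both anchors, the whole string has to fit the pattern.
whole_hits = [bool(re.search(r"^b[aeiou]t$", s)) for s in ["bat", "cricket bat"]]

# A made-up two-line string: "bat" starts the second line, not the string.
text = "cricket\nbat"
default_mode = bool(re.search(r"^b[aeiou]t", text))
multiline_mode = bool(re.search(r"^b[aeiou]t", text, re.MULTILINE))

# \A always means the start of the whole string, even in multiline mode.
string_start = bool(re.search(r"\Ab[aeiou]t", text, re.MULTILINE))

print(start_hits, end_hits, whole_hits)
print(default_mode, multiline_mode, string_start)
```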

To remember all the various characters, I use this excellent regex cheat sheet from addedbytes.com, which has been printed out and pinned over my desk pretty much everywhere I’ve worked in recent memory.

14 thoughts on “Simple Regular Expressions by Example”

  1. I’ve always used http://regex.powertoy.org/ as a cheat sheet. It’s great for writing and testing regex, and includes a cheat sheet of every regex command.

    I just checked out the regex cheat sheet you posted as well, and now I know why so many websites tell me my e-mail address isn’t valid. It doesn’t allow a hyphen in a domain name, even though that’s a valid character.

  2. Regular expressions are probably the single most useful thing a developer can know.

    Not just for coding things, but also for doing otherwise-impossible search and replace or reformatting operations in text editors. Want to build some SQL from data in a CSV or text file? No problem, run it through “sed” with some appropriate regex. Got a mixture of hard and soft tabs? No problem, “s/ {4}/\t/g” will fix them.

  3. Something worth noting for those PHP developers out there; remember that you need to escape backslashes in PHP.

    So if you see:

    preg_match('/^([a-z]+)\\.html/', $subject, $matches);

    The regular expression that this actually uses is:
    /^([a-z]+)\.html/

    As the first \ is a PHP escape character.

    Always gets me…

    • A less obvious (but sometimes useful) way to escape regexp metacharacters is using the character class construct. For the example above, you could instead do this: /^([a-z]+)[.]html/

  4. I didn’t use regular expressions much until I found out:
    1. how to use them inside Bash
    2. how to use sub-patterns (the bits with parentheses round them).
    The latter are invaluable for writing simple fuzzy parsers, renaming lots of files that don’t sort properly …

  5. Am I reading this right, or does /b[aeiou]+t/ just not work as intended until you remove the forward slashes?

  6. [code]/b[aeiou]t/[/code] As far as I can see this does not match the whole of “cricket bat”, only the “bat” part, unless more regex is added to extend it. I am only learning, but this looks incorrect to me. From the description I expected that, because the pattern appears in the text, the whole phrase would be returned as the match. It works fine as a search, of course, but it does not pull out the whole phrase.
