Simple Regular Expressions by Example
As with most things, it’s pretty easy when you know how, so here’s my one-step-at-a-time approach to regex (stolen from my ZCE preparation tutorial slides). Let’s begin at the very beginning: regular expressions have delimiters, usually a slash character, and these contain a pattern that describes a string.
pattern | notes |
---|---|
/b[aeiou]t/ | Matches “bat”, “bet”, “bit”, “bot” and “but”. Also matches “cricket bat”, “bitter lemon” |
Here, we’ve got an expression that describes something containing the letter “b”, followed by any one of the vowels (a,e,i,o and u), followed by the letter “t”; so long as those three items appear in a string, then it will match this expression.
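To see this behaviour for yourself, here is a minimal sketch in Python (Python is an assumption for illustration, not something this post uses; the sample strings are made up too). Python's `re` module takes the bare pattern, so the slash delimiters are dropped. Note that the match itself is only the three characters that fit the pattern, not the whole string they were found in.

```python
import re

# The pattern from the table, written without the /.../ delimiters
# because Python's re module takes the bare pattern.
pattern = re.compile(r"b[aeiou]t")

# Hypothetical sample strings, some of which should not match.
samples = ["bat", "but", "cricket bat", "bitter lemon", "bt", "beet"]

# re.search looks for the pattern anywhere in the string.
results = {s: bool(pattern.search(s)) for s in samples}

# The match object holds only the part of the string that matched.
matched_text = pattern.search("cricket bat").group(0)

for sample, found in results.items():
    print(f"{sample!r}: {found}")
print(f"matched text: {matched_text!r}")
```

“bt” fails because there is no vowel, and “beet” fails because the class `[aeiou]` stands for exactly one character.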
Use a hyphen to denote ranges of characters:
pattern | notes |
---|---|
/[0-9a-f]*/ | Will match lowercase hex digits (the * also allows none at all) |
/[0-9a-zA-Z]/ | Upper and lower case are distinct; this matches a single alphanumeric character |
/./ | The dot ‘.’ matches any character |
/\./ | If you actually want to match a dot, escape it |
These examples show matching a selection of characters, including character ranges. Look out for the wildcard, which is a dot – if you wanted to match a dot (for example in a domain name), you will need to escape it with the backslash.
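A short sketch of the same ideas in Python (again an assumption for illustration; the test strings such as `exampleXcom` are invented). It shows a range, the case sensitivity of ranges, and the difference between a bare dot and an escaped one.

```python
import re

# A range inside a character class. fullmatch requires the whole
# string to fit the pattern, which is what "is this hex?" needs.
hex_digits = re.compile(r"[0-9a-f]*")
is_hex = hex_digits.fullmatch("deadbeef") is not None
not_hex = hex_digits.fullmatch("xyz") is not None

# Because * allows zero occurrences, an unanchored search "succeeds"
# on any string by matching the empty string at the start.
empty_match = re.search(r"[0-9a-f]*", "xyz").group(0)

# Ranges are case sensitive: [a-z] does not cover "ABC".
lower_only = re.search(r"[a-z]", "ABC") is not None
both_cases = re.search(r"[0-9a-zA-Z]", "ABC") is not None

# An unescaped dot matches any character, so it is too permissive
# when a literal dot is meant.
loose = re.search(r"example.com", "exampleXcom") is not None
strict = re.search(r"example\.com", "exampleXcom") is not None
strict_ok = re.search(r"example\.com", "www.example.com") is not None

print(is_hex, not_hex, repr(empty_match))
print(lower_only, both_cases)
print(loose, strict, strict_ok)
```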
The quantifier goes after the character (or character range) to say how many of something there should be. You can give precise numbers of occurrences, ranges, or use the ? (0 or 1), + (1 or more) or * (0 or more) characters:
pattern | notes |
---|---|
/b[aeiou]+t/ | Matches “bat” and “bit” etc, but also “boot” and “boat” |
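The quantifiers can be compared side by side with a small sketch, again in Python as an illustrative assumption. The helper `matching` and the word list are hypothetical, and `{2}` and `{1,2}` are the standard brace syntax for an exact count and a range. `re.fullmatch` is used so that each word has to fit the pattern completely.

```python
import re

# Hypothetical word list: zero, one and two vowels between b and t.
words = ["bt", "bat", "boot", "boat"]

def matching(pattern, candidates):
    """Return the candidates that fit the pattern from start to end."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

optional = matching(r"b[aeiou]?t", words)        # 0 or 1 vowel
one_or_more = matching(r"b[aeiou]+t", words)     # 1 or more vowels
zero_or_more = matching(r"b[aeiou]*t", words)    # 0 or more vowels
exactly_two = matching(r"b[aeiou]{2}t", words)   # precisely 2 vowels
one_to_two = matching(r"b[aeiou]{1,2}t", words)  # between 1 and 2 vowels

for name, found in [
    ("?", optional),
    ("+", one_or_more),
    ("*", zero_or_more),
    ("{2}", exactly_two),
    ("{1,2}", one_to_two),
]:
    print(name, found)
```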
We can also anchor our patterns to the beginning and end of lines using ^ and $ respectively (or \A and \Z to anchor to the whole string when matching multiline strings). This means that:
pattern | notes |
---|---|
/^b[aeiou]t/ | Will match “battering ram” but not “cricket bat” |
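Anchors are easiest to understand by trying them, so here is a hedged sketch in Python (an assumption for illustration; other regex flavours differ in small details, particularly around \Z and trailing newlines). The two-line string `text` is made up. In Python, ^ only matches at the very start of the string unless the `re.MULTILINE` flag is given, while \A always means the start of the whole string.

```python
import re

# ^ anchors the pattern to the start.
starts_ok = re.search(r"^b[aeiou]t", "battering ram") is not None
starts_no = re.search(r"^b[aeiou]t", "cricket bat") is not None

# $ anchors the pattern to the end.
ends_ok = re.search(r"b[aeiou]t$", "cricket bat") is not None
ends_no = re.search(r"b[aeiou]t$", "battering ram") is not None

# A hypothetical two-line string: "bat" starts the second line.
text = "cricket\nbat"

# Without MULTILINE, ^ only matches at the start of the whole string.
default_mode = re.search(r"^bat", text) is not None

# With MULTILINE, ^ also matches just after each newline.
multiline_mode = re.search(r"^bat", text, re.MULTILINE) is not None

# \A ignores MULTILINE and still means "start of the whole string".
whole_string = re.search(r"\Abat", text, re.MULTILINE) is not None

print(starts_ok, starts_no, ends_ok, ends_no)
print(default_mode, multiline_mode, whole_string)
```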
To remember all the various characters, I use this excellent regex cheat sheet from addedbytes.com, which has been printed out and pinned over my desk pretty much everywhere I’ve worked in recent memory.
Holy Cow!
This is the first time I’ve read something about regex that makes any sense…
Thank you so much!
I’ve always used http://regex.powertoy.org/ as a cheat sheet. It’s great for writing and testing regex, and includes a cheat sheet of every regex command.
I just checked out the regex cheat sheet you posted as well, and now I know why so many websites tell me my e-mail address isn’t valid. It doesn’t allow a hyphen in a domain name, even though that’s a valid character.
http://regex.powertoy.org/ seems to have dropped off the web.
a. has it changed identity?
b. do I need my glasses checked?
c. Other
Regular expressions are probably the single most useful thing a developer can know.
Not just for coding things, but also for doing otherwise-impossible search and replace or reformatting operations in text editors. Want to build some SQL from data in a CSV or text file? No problem, run it through “sed” with some appropriate regex. Got a mixture of hard and soft tabs? No problem, “s/ {4}/\t/g” will fix them.
Love the tabs/spaces example, I certainly use that one a lot!
Awesome post Lorna :)
If anyone is interested in getting to grips with regular expressions the book ‘Sams Teach Yourself Regular Expressions in 10 Minutes’ is well worth a read. It fits into your back pocket and is nice and cheap too!
Here is the Amazon link:
http://www.amazon.co.uk/Teach-Yourself-Regular-Expressions-Minutes/dp/0672325667
Thanks for the kind words and the very helpful book recommendation, that’s excellent :)
Something worth noting for those PHP developers out there; remember that you need to escape backslashes in PHP.
So if you see:
preg_match('/^([a-z]+)\\.html/', $subject, $matches);
The regular expression that this actually uses is:
/^([a-z]+)\.html/
As the first \ is a PHP escape character.
Always gets me…
A less obvious (but sometimes useful) way to escape regexp metacharacters is using the character class construct. For the example above, you could instead do this: /^([a-z]+)[.]html/
I didn’t use regular expressions much until I found out:
1. how to use them inside Bash
2. how to use sub-patterns (the bits with parentheses around them).
The latter are invaluable for writing simple fuzzy parsers, renaming lots of files that don’t sort properly …
Am I reading this right, or does /b[aeiou]+t/ just not work as intended until you remove the forward slashes?
Looks like that great regex cheat sheet you mentioned has been moved from addedbytes.com to cheatography.com and can now be found here: https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
/b[aeiou]t/ As far as I can see, this does not match “cricket bat”, only “bat”, unless more regex is added to extend it. I am only learning, but this looks incorrect to me. From the description I expected that, because the pattern appears in the text, the whole phrase would be returned as the match. It works fine as a search, of course, but it does not pull out the whole phrase.