(a) Regex CRASH Course! (Pt. 1)

(originally from: http://forums.devnetwork.net/viewtopic.php?f=38&t=33147)

Firstly, can I just say that when researching regex (or Regular Expressions) you will notice a lot of references to Perl. Perl was “one of” the first languages to make heavy use of regex, after grep (a Unix tool), so you'll find the most complete documentation in the Perl docs. Secondly, regular-expressions.info is a great resource for beginners.

Hold on tight, this is going to be a fast-paced but to-the-point tutorial ;-)

Ready? Ok let's do it….

    /[\w\s]+\d{1,3}\t\W/

This is what makes developers cry. Look at that mess! What's all this \d, \s \w etc etc???

Let's start with metacharacters. Those \d, \s etc. are what we refer to as “metacharacters”. Metacharacters are characters which represent a particular group of real characters (with some exceptions, see below).

The metacharacters and what they stand for:

    Character         Matching

    . (dot)           ANY single character at all (except a newline, by default)
    \w                Any single "word" character: a letter (a-z, A-Z), a digit (0-9) or an underscore
    \d                Any single digit (0-9)
    \s                Any single whitespace character (space, tab, newline etc.)

    <<Uppercase negates the metacharacter>>

    \W                Any single non-word character (anything \w does not match)
    \S                Any single non-whitespace character
    \D                Any single non-digit

    <<Something else to note>>
    [x-y]             Any single character in the range x to y (e.g. [A-Z])
    [abc123]          Any single character from a, b, c, 1, 2 or 3
    [a-z125-9]        Any single character from a to z, 1, 2 or 5 to 9

    <<Negate these with caret "^" at the VERY start of the bracket>>
    [^abc0-9]         Any single character EXCEPT a, b, c and 0 to 9
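To see these classes in action, here is a minimal sketch that runs a handful of single characters through a few of the patterns above. The sample characters and the $results array are made up for illustration.

```php
<?php
// Hypothetical sample characters, chosen only for illustration.
$samples = ['a', 'Z', '7', '_', ' ', '-'];

$results = [];
foreach ($samples as $char) {
    $results[$char] = [
        'word'    => preg_match('/\w/', $char) === 1,         // letter, digit or underscore
        'digit'   => preg_match('/\d/', $char) === 1,         // 0-9
        'space'   => preg_match('/\s/', $char) === 1,         // whitespace
        'custom'  => preg_match('/[a-z125-9]/', $char) === 1, // a-z, 1, 2 or 5-9
        'negated' => preg_match('/[^abc0-9]/', $char) === 1,  // anything but a, b, c, 0-9
    ];
}

foreach ($results as $char => $flags) {
    // array_filter() keeps only the classes that matched (value true).
    echo "'$char' matches: ", implode(', ', array_keys(array_filter($flags))), "\n";
}
```

Note that 'Z' does not match [a-z125-9], because regex are case sensitive by default.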

Regex are case sensitive unless you specify otherwise. See further down for more info.

You'll see some other metacharacters which don't actually match anything that's really there. They match invisible boundaries so we call them “zero-width assertions”.

    Assertion        Matching

    ^ (caret)        The start of the string
    $                The end of the string (or just before a final newline)
    \b               A word boundary (the point between a word character and a non-word character, or the start/end of the string next to a word character)
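A quick sketch of the assertions at work. The sample text and variable names are made up for illustration; preg_match_all() is used here only because it returns the number of matches.

```php
<?php
// Hypothetical sample text, chosen only for illustration.
$text = 'cat concatenate cat';

$startsWithCat = preg_match('/^cat/', $text);        // "cat" right at the start of the string
$endsWithCat   = preg_match('/cat$/', $text);        // "cat" right at the end of the string
$wholeWords    = preg_match_all('/\bcat\b/', $text); // only "cat" as a standalone word
$anywhere      = preg_match_all('/cat/', $text);     // also counts the "cat" inside "concatenate"

echo "whole words: $wholeWords, anywhere: $anywhere\n";
```

The assertions themselves consume no characters. They only restrict where the rest of the pattern is allowed to match.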

There are others (lookahead and lookbehind, for example), but you rarely need them when starting out. Read the Perl documentation if you want to know more.

Next, we can specify how many times a character should occur. We could do this to match a string of four digits:

    /\d\d\d\d/

Or we could write this:

    /\d{4}/

Let's cover the “quantifiers”. A quantifier follows the character (or group) it applies to.

    Quantifier         Meaning

    +                  One or more times
    *                  Zero or more times
    ?                  Zero or one time
    {n,m}              Between n and m times
    {n,}               n or more times
    {n}                Exactly n times
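Here is a small sketch that tries each quantifier against a matching and a non-matching input. The helper quantifier_test() and the sample inputs are hypothetical, written just for this illustration. The patterns are anchored with ^ and $ so the quantifier has to account for the whole string.

```php
<?php
// Hypothetical helper: true when $pattern matches $subject.
function quantifier_test(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

$fourDigits  = quantifier_test('/^\d{4}$/', '2024');     // exactly four digits
$threeDigits = quantifier_test('/^\d{4}$/', '123');      // one digit short
$plusEmpty   = quantifier_test('/^a+$/', '');            // + needs at least one "a"
$starEmpty   = quantifier_test('/^a*$/', '');            // * is happy with none
$color       = quantifier_test('/^colou?r$/', 'color');  // the "u" is optional
$colour      = quantifier_test('/^colou?r$/', 'colour');
$tooMany     = quantifier_test('/^\d{1,3}$/', '1234');   // more than three digits
$atLeastTwo  = quantifier_test('/^\d{2,}$/', '1234');    // two or more digits

var_dump($fourDigits, $threeDigits, $plusEmpty, $starEmpty);
```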

One last thing before we build our first regex. A regex needs to be delimited if you're using Perl-style regular expressions (preg_match()), which I strongly advise you do (Note: ereg_…() is not Perl style, and the ereg functions were removed in PHP 7).

To delimit a regex we start and end with the EXACT same character. The two most common delimiters are shown below, but you can use most non-alphanumeric characters (not a backslash or whitespace):

    /pattern/
    #pattern#
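Why bother with a different delimiter? A rough sketch, using a made-up path: when the pattern itself contains slashes, choosing "#" saves you from escaping every one of them.

```php
<?php
// Hypothetical sample path, chosen only for illustration.
$path = '/usr/local/bin';

// With "/" as the delimiter, every literal slash must be escaped.
$withSlash = preg_match('/^\/usr\/local\//', $path);

// With "#" as the delimiter, the slashes need no escaping.
$withHash = preg_match('#^/usr/local/#', $path);

echo "slash delimiter: $withSlash, hash delimiter: $withHash\n";
```

Both calls test exactly the same thing. The second is just easier to read.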

Let's look at a regular expression before we move on further. We'll use preg_match() to execute the regex here (I'll explain after).

    $string = "Hello, I'm d11wtq and I'm 22 years old!";
    if (preg_match("/\w+\W I'm \w\d{2}wtq and I'm \d+ years old\W/", $string)) {
        echo "d11wtq is 22";
    } else {
        echo "d11wtq didn't tell me his age";
    }

I'll explain what it does. Each line shows one piece of the pattern, followed by the text matched so far:

    “\w+” matches an alphanumeric or underscore character one or more times: Hello
    “\W” matches any single non-alphanumeric character: Hello,
    “ I'm ” is just plain old string: Hello, I'm
    “\w” is any single alphanumeric character: Hello, I'm d
    “\d{2}” is two digits: Hello, I'm d11
    “wtq and I'm ” is just plain old string again: Hello, I'm d11wtq and I'm
    “\d+” is one or more digits: Hello, I'm d11wtq and I'm 22
    “ years old” is plain old string: Hello, I'm d11wtq and I'm 22 years old
    “\W” is any single non-alphanumeric character: Hello, I'm d11wtq and I'm 22 years old!

If you understand that then let's move on to some “modifiers”. If not, then read it again, and if you still don't get it, read it again…..

Note: When starting out in regex don't try and jump in with both feet. Match a tiny part of the string, then test it. Then add some more to your regex to match more of the string and test again. Repeat until the regex works.
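The build-a-bit, test-a-bit approach could look something like this sketch. It grows the pattern from the earlier example one piece at a time and prints what each version matches. The $steps and $matched arrays are my own names for illustration.

```php
<?php
$string = "Hello, I'm d11wtq and I'm 22 years old!";

// Each step extends the previous pattern by one small piece.
$steps = [
    "/\w+/",
    "/\w+\W I'm /",
    "/\w+\W I'm \w\d{2}wtq/",
    "/\w+\W I'm \w\d{2}wtq and I'm \d+/",
    "/\w+\W I'm \w\d{2}wtq and I'm \d+ years old\W/",
];

$matched = [];
foreach ($steps as $pattern) {
    preg_match($pattern, $string, $m);
    $matched[] = $m[0] ?? null; // the text matched so far, or null if the pattern failed
    echo $pattern, ' => ', $m[0] ?? '(no match)', "\n";
}
```

If a step suddenly prints "(no match)", you know the piece you just added is the culprit.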

Regex modifiers:

    /^pattern$/mis

“mis” here are all modifiers. They tell the regex how to behave.

    Modifier         Effect

    i                Case insensitive
    s                Dot-all mode (the dot "." also matches newline characters)
    x                Extended mode (whitespace in the pattern is ignored, so you can space the pattern out)
    g                Global search (not valid in PHP [use preg_match_all()] but handy if you're using JS or Perl). Tells the regex to keep looking after it's matched once
    m                Multi-line mode (^ and $ now match start and end of LINE not start and end of STRING)
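A short sketch of the i, m and s modifiers on a two-line string. The sample text and variable names are made up for illustration. In PHP's preg functions the s modifier lets the dot match newlines.

```php
<?php
// Hypothetical two-line sample, chosen only for illustration.
$text = "first line\nSecond line";

$plain     = preg_match('/^second/', $text);   // wrong case, and not at the string start
$onlyI     = preg_match('/^second/i', $text);  // case ignored, but ^ still means string start
$withIM    = preg_match('/^second/im', $text); // m lets ^ match at the start of line two

$dotPlain  = preg_match('/first.*Second/', $text);  // the dot stops at the newline
$dotAll    = preg_match('/first.*Second/s', $text); // s lets the dot cross the newline

echo "plain: $plain, i: $onlyI, im: $withIM, dot: $dotPlain, dot+s: $dotAll\n";
```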

Again, there are others but you don't really use them.

Modifiers go on the right hand side of the closing delimiter.

Quick example:

    $string = "Hello World!";
    if (preg_match('/^[a-z]/i', $string)) {
        echo "Starts with a letter";
    } else {
        echo "Doesn't start with a letter";
    }

“^” means match the very start of the string (not a character itself).

“[a-z]” means match a lowercase a to z. On its own, nothing would match, BUT the “i” modifier makes the regex case insensitive, SO “H” is matched. H is all that is matched, but that is enough for preg_match() to report a match.

There are some things you should remember when working with regular expressions.

1. Escape special characters with a backslash
2. Remember to use quantifiers to match multiple times
3. Remember to match a dot “.” you need to escape it “\.” because dot “.” is a metacharacter itself
4. Regex are case sensitive by default
5. “*” and “+” are what we call “greedy” (Read the follow up to this tutorial to learn more)
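Points 3 and 5 are easy to trip over, so here is a small sketch of both. The file names and the HTML snippet are made up for illustration. preg_quote() is a real PHP function that escapes regex special characters for you.

```php
<?php
// An unescaped dot matches ANY character, so "fileXtxt" slips through.
$unescaped = preg_match('/^file.txt$/', 'fileXtxt');

// Escaping the dot makes it match a literal "." only.
$escaped = preg_match('/^file\.txt$/', 'fileXtxt');
$literal = preg_match('/^file\.txt$/', 'file.txt');

// preg_quote() escapes regex special characters in arbitrary text.
$quoted = preg_quote('file.txt');

// Greedy: ".+" grabs as much as it can, so the match runs to the LAST ">".
preg_match('/<.+>/', '<b>bold</b>', $greedy);

echo $quoted, "\n", $greedy[0], "\n";
```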

Next… Parentheses have more than one use in regex. They:

a) Group characters together
b) Extract the characters they surround into memory

(To match a parenthesis itself you must escape it “\(”.)

Something useful:

    //Check string represents a URL
    $string = "http://www.foo.bar/";
    if (preg_match("#^\w+://(www\.)?\w+\.\w+#i", $string)) {
        echo "String is a URL";
    } else {
        echo "String isn't a URL";
    }

This matches the “http://www.foo.bar” part of the URL above so it returns true. I'll let you break it down yourself and see how it works (remember the parentheses “(….)” group the characters together ).

A vertical bar character “|” is used to mean OR.

    $string = "abcdefg123456";
    //abcdefgh23456   OR   abcdefg123456
    if (preg_match("/abcdefg(h|1)23456/", $string)) {
        //True
    } else {
        //False
    }

Ok we've nearly covered all the “basics” now. One last thing to cover in the scope of the crash course is extracting parts of the string into memory (then I'll finish up by briefly overviewing the PHP functions).

Sometimes you'll need to match part(s) of a string and extract them to use elsewhere. You do this using parentheses. Indexing starts at 1 and goes up by one for each opening parenthesis, counting from left to right. With nested parentheses the order follows this pattern:

    ( 1 ( 2 ) ( 3 ( 4 ) ) ( 5 ( 6 ( 7 ) ( 8 ) ) ) ) ( 9 )

Essentially, you go deeper into the nest before moving further to the right.
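To make the numbering concrete, here is a sketch with nested parentheses around a made-up date string. The third argument to preg_match() collects the groups, with index 0 holding the whole match.

```php
<?php
// Hypothetical date string, chosen only for illustration.
$date = '2024-05-17';

// Groups are numbered by the position of their opening parenthesis:
// ( 1 ( 2 ) - ( 3 ( 4 ) - ( 5 ) ) )
preg_match('/((\d{4})-((\d{2})-(\d{2})))/', $date, $matches);

foreach ($matches as $index => $text) {
    echo "$index: $text\n";
}
```

Group 3 opens before groups 4 and 5, so it gets the lower number even though it closes after them.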

The usual way to refer to an extracted part of a string is by the dollar sign “$” followed by the index of the part you extracted (e.g. “$4”). That said, PHP handles things slightly differently with the preg_match() function. Indexing starts at zero (the entire matched text) and then from 1 as expected for the extracted parts. preg_match() also needs a third parameter to do this, so that it can dump “$1”, “$2”, “$3” etc. into an array.

    $string = "There's a number in here 123456 somewhere but I don't know what it is!";
    preg_match("/[a-z\s]+(\d+)[a-z\s]+/i", $string, $matches); //$matches[0] is "s a number in here 123456 somewhere but I don"
    echo "The number in the string is " . $matches[1]; //The number in the string is 123456

PHP functions overview:

preg_match() - I guess I have that one covered. Tests if the pattern is matched in the string. Returns 1 if matched, 0 if not (and FALSE if an error occurred). If the optional third parameter is given, the function extracts the parenthesised parts of the pattern into the given array.

preg_match_all() - Same as preg_match() except that the regex doesn't stop when a match is found… it continues to find as many matches as exist in the string, and returns how many it found. The extracted array is a multi-dimensional array where all the full matches are placed in $array[0], all occurrences of $1 in $array[1], all occurrences of $2 in $array[2] etc…
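A minimal sketch of preg_match_all() and the array it fills, using a made-up settings string. The array_combine() call at the end is just one convenient way to pair up the two groups.

```php
<?php
// Hypothetical sample text, chosen only for illustration.
$text = 'width=640 height=480 depth=24';

// Group 1 captures the name, group 2 captures the number.
$count = preg_match_all('/(\w+)=(\d+)/', $text, $matches);

// $matches[0] holds the full matches, $matches[1] every $1, $matches[2] every $2.
$pairs = array_combine($matches[1], $matches[2]);

echo "Found $count settings\n";
print_r($pairs);
```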

preg_replace() - Like str_replace() except it takes regex patterns as arguments:

    $string = "This is foo and that is bar";
    $new_string = preg_replace('/f(\w+)/', 'g$1', $string); //This is goo and that is bar

preg_split() - Like explode() except it takes a regex pattern as the point at which to split the string:

 1.  
 2. $string = "lots of *@><& symbols &^% in this