(a) Regex CRASH Course! (Pt. 1)
(originally from: http://forums.devnetwork.net/viewtopic.php?f=38&t=33147)
Firstly, when researching regex (or Regular Expressions) you will notice a lot of references to Perl. Perl was one of the first languages to make heavy use of regex, after the Unix tool grep, so the Perl documentation is where you'll find the most complete reference. Secondly, regular-expressions.info is a great resource for beginners.
Hold on tight, this is going to be a fast-paced but to-the-point tutorial ;-)
Ready? OK, let's do it...
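To have something concrete to stare at, here is a small sketch of the kind of pattern this tutorial works towards. The pattern and the subject string are made up for illustration and are not taken from the original post.

```php
<?php
// Hypothetical pattern and subject, invented for this sketch only.
$pattern = '/^\w+\s\d{2}:\d{2}\s\w+$/';
$subject = 'Meeting 10:30 today';

// preg_match() returns 1 when the pattern matches the subject.
var_dump(preg_match($pattern, $subject)); // int(1)
```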
This is what makes developers cry. Look at that mess! What's all this \d, \s, \w and so on?
Let's start with metacharacters. Those \d, \s etc. are what we refer to as "metacharacters". A metacharacter stands for a particular group of real characters (with some exceptions, see below).
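The metacharacters and what they stand for:
    \d   any digit (0-9)
    \D   anything that is not a digit
    \s   any whitespace character (space, tab, newline)
    \S   anything that is not whitespace
    \w   any alphanumeric character or underscore
    \W   anything that is not alphanumeric or underscore
    .    any single character except a newline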
Regexes are case sensitive unless you specify otherwise. See the modifiers further down for more info.
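A quick way to get a feel for the metacharacters is to try them one at a time. This is only a sketch, and the subject string is an assumption picked for the demo.

```php
<?php
// Subject string is an assumption chosen for this sketch.
$subject = 'Room 42';

var_dump(preg_match('/\d\d/', $subject));   // int(1): "42" is two digits
var_dump(preg_match('/\w\s\d/', $subject)); // int(1): "m", a space, then "4"
var_dump(preg_match('/\d\s/', $subject));   // int(0): no digit is followed by whitespace
var_dump(preg_match('/room/', $subject));   // int(0): matching is case sensitive by default
```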
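You'll see some other metacharacters which don't actually match anything that's really there. They match invisible boundaries, so we call them "zero-width assertions".
    ^    the very start of the string
    $    the very end of the string (or just before a final newline)
    \b   a word boundary (where a \w character meets a \W character or the edge of the string)
    \B   anywhere that is not a word boundary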
There are others, but you rarely need them. Read the Perl documentation if you want to know more.
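Zero-width assertions are easier to see than to describe. A minimal sketch, assuming the standard PCRE anchors "^" and "$" and the word boundary "\b"; the subject strings are made up.

```php
<?php
// Subject strings are assumptions chosen for this sketch.
var_dump(preg_match('/\bcat\b/', 'the cat sat')); // int(1): "cat" stands alone as a word
var_dump(preg_match('/\bcat\b/', 'concatenate')); // int(0): "cat" is buried inside a word
var_dump(preg_match('/^the/', 'the cat sat'));    // int(1): the string starts with "the"
var_dump(preg_match('/sat$/', 'the cat sat'));    // int(1): the string ends with "sat"
var_dump(preg_match('/^cat/', 'the cat sat'));    // int(0): "cat" is not at the start
```

None of these assertions consume a character; they only check the position.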
Next, we can specify how many times a character should occur. We could do this to match a string of four digits:
    \d\d\d\d
Or we could write this:
    \d{4}
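Both spellings of "four digits in a row" should behave the same. A small sketch to check that, with assumed subject strings:

```php
<?php
// Subject strings are assumptions chosen for this sketch.
var_dump(preg_match('/\d\d\d\d/', 'Born in 1984')); // int(1)
var_dump(preg_match('/\d{4}/', 'Born in 1984'));    // int(1): same meaning, shorter to write
var_dump(preg_match('/\d{4}/', 'Born in 84'));      // int(0): only two digits in a row
```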
Let's cover the "quantifiers". The quantifier follows the character it applies to.
    {n}     Exactly n times
    {n,m}   At least n and at most m times
    {n,}    n or more times
    ?       None or one time
    +       One or more times
    *       None or more times
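Here is a short sketch of the quantifiers in action. The patterns and subjects are assumptions for the demo; "?" and "{2,3}" are standard PCRE quantifiers, and "^" and "$" pin the pattern to the whole string so the counts really matter.

```php
<?php
// Patterns and subjects are assumptions chosen for this sketch.
var_dump(preg_match('/^\d+$/', '2024'));      // int(1): one or more digits
var_dump(preg_match('/^\d+$/', ''));          // int(0): "+" needs at least one digit
var_dump(preg_match('/^\d*$/', ''));          // int(1): "*" also accepts none at all
var_dump(preg_match('/^colou?r$/', 'color')); // int(1): the "u" is optional
var_dump(preg_match('/^\d{2,3}$/', '1234'));  // int(0): four digits is one too many
```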
One last thing before we build our first regex. A regex needs to be delimited when you use the Perl-style functions (preg_match() and friends), which I strongly advise you to use. (Note: the ereg_...() functions are not Perl-style, and they were removed in PHP 7.)
To delimit a regex we start and end with the EXACT same character. The two standard choices are the forward slash and the hash sign, as in /pattern/ and #pattern#, but you can use most non-alphanumeric characters (not a backslash or whitespace).
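The choice of delimiter mostly matters when the delimiter character also appears inside the pattern. A sketch, assuming "/" and "#" as the two delimiters and a made-up URL as the subject:

```php
<?php
// The URL-like subject is an assumption chosen for this sketch.
$subject = 'http://example.com/path';

// With "/" as the delimiter, every literal slash must be escaped.
var_dump(preg_match('/^http:\/\/example\.com\//', $subject)); // int(1)

// With "#" as the delimiter, the slashes need no escaping.
var_dump(preg_match('#^http://example\.com/#', $subject));    // int(1)
```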
Let's look at a regular expression before we move on. We'll use preg_match() to execute the regex here (I'll explain it afterwards).
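// Pattern and subject as spelled out in the breakdown that follows.
if (preg_match("/\w+\W I'm \w\d{2}wtq and I'm \d+ years old\W/", "Hello, I'm d11wtq and I'm 22 years old!")) {
    echo "d11wtq is 22";
} else {
    echo "d11wtq didn't tell me his age";
}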
I'll explain what it does. Each step shows how much of the string has been matched so far:
"\w+" matches an alphanumeric or underscore character one or more times
    Hello
"\W" matches any single non-alphanumeric character
    Hello,
" I'm " is just plain old string
    Hello, I'm
"\w" is any single alphanumeric character
    Hello, I'm d
"\d{2}" is two digits
    Hello, I'm d11
"wtq and I'm " is just plain old string again
    Hello, I'm d11wtq and I'm
"\d+" is one or more digits
    Hello, I'm d11wtq and I'm 22
" years old" is plain old string
    Hello, I'm d11wtq and I'm 22 years old
"\W" is any single non-alphanumeric character
    Hello, I'm d11wtq and I'm 22 years old!
If you understand that then let's move on to some "modifiers". If not, then read it again, and if you still don't get it, read it again...
Note: When starting out in regex, don't try to jump in with both feet. Match a tiny part of the string, then test it. Then add some more to your regex to match more of the string and test again. Repeat until the regex works.
Regex modifiers:
    /pattern/mis
"mis" here are all modifiers. They tell the regex how to behave.
    i   Case insensitive matching
    m   Multi-line mode: "^" and "$" also match at the start and end of each line
    s   "." also matches newline characters
Again, there are others but you don't really use them.
Modifiers go on the right hand side of the closing delimiter.
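To see what each modifier changes, compare the same pattern with and without it. A minimal sketch, assuming the standard PCRE meanings of "i", "m" and "s"; the two-line subject is made up.

```php
<?php
// Subject string is an assumption chosen for this sketch.
$text = "first line\nsecond line";

var_dump(preg_match('/SECOND/', $text));       // int(0): case sensitive by default
var_dump(preg_match('/SECOND/i', $text));      // int(1): "i" ignores case
var_dump(preg_match('/^second/', $text));      // int(0): "^" only matches at the very start
var_dump(preg_match('/^second/m', $text));     // int(1): "m" lets "^" match after a newline
var_dump(preg_match('/line.second/', $text));  // int(0): "." does not match the newline
var_dump(preg_match('/line.second/s', $text)); // int(1): "s" lets "." match the newline
```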
Quick example:
if (preg_match("/^[a-z]/i", "Hello")) {
    echo "Starts with a letter";
} else {
    echo "Doesn't start with a letter";
}
"^" means match the very start of the string (it is not a character itself).
"[a-z]" means match one lowercase letter, a to z. On its own that matches nothing here, BUT the "i" modifier makes the regex case insensitive, SO "H" is matched. "H" is all that is matched, but that is enough for preg_match() to report a match.
There are some things you should remember when working with regular expressions.
Next... Parentheses have more than one use in regex. They:
a) Group characters together
b) Extract the characters they surround into memory
(To match a parenthesis itself you must escape it: "\(".)
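A small sketch of grouping and of escaping a literal parenthesis. The patterns and subjects are assumptions picked for the demo.

```php
<?php
// Patterns and subjects are assumptions chosen for this sketch.

// Grouping: the quantifier applies to the whole group "ab", not just "b".
var_dump(preg_match('/^(ab)+$/', 'ababab')); // int(1)
var_dump(preg_match('/^ab+$/', 'ababab'));   // int(0): this means "a" then one or more "b"

// Escaped parentheses match literal "(" and ")" characters.
var_dump(preg_match('/\(\d+\)/', 'call (42) now')); // int(1)
```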
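Something useful:
// Pattern and subject here are assumed stand-ins for this example.
$string = "http://www.foo.bar/some/page.php";
if (preg_match("#^https?://(www\.)?\w+(\.\w+)+#", $string)) {
    echo "String is a URL";
} else {
    echo "String isn't a URL";
}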
This matches the "http://www.foo.bar" part of the URL above, so it returns true. I'll let you break it down yourself and see how it works (remember, the parentheses "(....)" group the characters together).
A vertical bar character "|" is used to mean OR.
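// Pattern and subjects are assumed for illustration.
preg_match("/cat|dog/", "I have a dog");  //True
preg_match("/cat|dog/", "I have a fish"); //False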
OK, we've nearly covered all the "basics" now. One last thing to cover in the scope of this crash course is extracting parts of the string into memory (then I'll finish up with a brief overview of the PHP functions).
Sometimes you'll need to match part(s) of a string and extract them to use elsewhere. You do this using parentheses. Indexing starts at 1 and goes up by one for each opening parenthesis. When parentheses are nested, the numbering follows this pattern:
    ( 1 ( 2 ( 3 ) ) ( 4 ) ) ( 5 )
Essentially, you go deeper into the nest before moving further to the right.
The usual way to refer to an extracted part of a string is by the dollar sign "$" followed by the index of the part you extracted (e.g. "$4"). That said, PHP handles things slightly differently with the preg_match() function. Indexing starts at zero (the entire matched text) and then continues from 1 as expected for the extracted parts. preg_match() also requires a third parameter to do this, an array into which it can dump "$1", "$2", "$3" etc.
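The numbering and the third parameter can be sketched together like this. The pattern, the subject and the variable name $parts are assumptions made up for the demo.

```php
<?php
// Pattern, subject and the name $parts are assumptions chosen for this sketch.
// Groups are numbered by the position of their opening parenthesis.
preg_match('/((\d+)-(\d+)) (\w+)/', '2024-06 report', $parts);

print_r($parts);
// $parts[0] is the whole match:            "2024-06 report"
// $parts[1] is the outer group:            "2024-06"
// $parts[2] and $parts[3] are nested in it: "2024" and "06"
// $parts[4] is the group further right:    "report"
```

Notice that index 2 and 3 come before 4: the nested groups are numbered before the one further to the right.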
PHP functions overview:
preg_match() - I guess I have that one covered. Tests if the pattern is matched in the string. Returns 1 if matched and 0 if not (these behave like TRUE and FALSE in an if test), or FALSE if an error occurred. If the optional third parameter is given, the function extracts the parts of the match enclosed in parentheses into the given array.
preg_match_all() - Same as preg_match() except that the regex doesn't stop when a match is found... it continues to find as many matches as exist in the string and returns how many it found. The extracted array is a multi-dimensional array where all occurrences of $1 are placed in $array[1], all occurrences of $2 in $array[2], etc.
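A minimal sketch of that array layout, assuming the default ordering of preg_match_all(). The pattern, subject and variable names are made up for the demo.

```php
<?php
// Pattern, subject and variable names are assumptions chosen for this sketch.
$count = preg_match_all('/(\w+)=(\d+)/', 'a=1, b=22, c=333', $found);

var_dump($count);   // int(3): the number of full matches
print_r($found[0]); // all full matches: "a=1", "b=22", "c=333"
print_r($found[1]); // every $1: "a", "b", "c"
print_r($found[2]); // every $2: "1", "22", "333"
```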
preg_replace() - Like str_replace() except it takes regex patterns as arguments:
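As a rough sketch of preg_replace(), here the extracted parts of a date are put back in a different order. The pattern, replacement and subject are assumptions for the demo.

```php
<?php
// Pattern, replacement and subject are assumptions chosen for this sketch.
$date = '2024-06-15';

// "$1", "$2" and "$3" in the replacement refer to the extracted parts.
$swapped = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', '$3/$2/$1', $date);

var_dump($swapped); // string(10) "15/06/2024"
```

The replacement is in single quotes so that PHP does not try to treat "$3" as a variable.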
preg_split() - Like explode() except it takes a regex pattern as the point at which to split the string:
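A small sketch of preg_split(), splitting on a comma followed by any amount of whitespace, which a plain explode() could not do in one go. The pattern and subject are assumptions for the demo.

```php
<?php
// Pattern and subject are assumptions chosen for this sketch.
// Split on a comma plus whatever whitespace follows it.
$pieces = preg_split('/,\s*/', 'apple, banana,cherry,   date');

print_r($pieces); // "apple", "banana", "cherry", "date"
```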