(b) Regex Advanced tutorial - (CRASH Course Pt. 2)

(original : http://forums.devnetwork.net/viewtopic.php?f=38&t=40169)

About time I got around to doing this :)

If you didn't read the crash course before you jumped into this tutorial, it may be a good idea to do so, unless you already have a grasp of regex basics.

A few sites I need to point out (again):

http://www.regular-expressions.info/ (Tutorials for regex in a few programming languages)

http://www.perl.com/doc/manual/html/pod/perlre.html (Perl Documentation for TRUE perl style regex)

http://www.weitz.de/regex-coach/ (Fantastic application called Regex-Coach!)

From this point on, we might as well consider everything written to be perl-style regex since the rest don't get particularly advanced.

A quick re-cap:

In the super-speedy paced crash course we looked at the metacharacters, the quantifiers, the modifiers and some PHP functions.

That gave us enough information to start constructing some simple regex. However, even with that basic knowledge you may still hit a few hurdles trying to do some fancy things with regex.

Ready? We're off!

How does the regex engine work?

In a pure technical sense I really wouldn't have a clue. But in a conceptual sense it works like this...

The regex engine reads the regex as it reads the string it's checking against. If the regex engine is satisfied that everything in the pattern has been matched it does not look any further into the remaining string (without modifiers).
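To see that "stop once the pattern is satisfied" behaviour in action, here's a minimal sketch. It uses Python's re module purely as a stand-in (that's my assumption, the crash course itself used PHP functions), and the sample string is made up for illustration.

```python
import re

subject = "a1 b22 c333"  # hypothetical sample string

# search() stops as soon as the whole pattern has matched once
first = re.search(r"\d+", subject).group(0)

# findall() keeps scanning the remaining string for further matches
every = re.findall(r"\d+", subject)

print(first)  # 1
print(every)  # ['1', '22', '333']
```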

By default... the regex engine will try to match everything the pattern tells it to match.

Quantifiers can change the behaviour of the regex engine quite significantly and can cause hours of confusion and annoyance among developers. This is to do with something we call "greediness" in regex terms. Let's look at this more closely.

Pattern Greediness in Regular Expressions:

If you use a quantifier which allows matching of characters up to any number of times, the regex engine will try to fulfill that requirement as best it can.

%%(language-ref) String: Foo ###123 bar

/^[a-z]+.*(\d+)/i %%

The above regex, to anybody not looking closely, appears to extract the "123" from the string. In actual fact, it does not do this...

%%(language-ref) Array ( [0] => Foo ###123 [1] => 3 ) %%

So what happened?

"[a-z]+" .. OK, that's good. ".*" .. Any character, any number of times. This is where it collapses.

The dot-star combination is the evilest of the greedy patterns because it really will just match everything it can (except newline chars, without the "s" modifier).

Foo was picked up by the character class [a-z] since our pattern used the "i" modifier. The .* then consumed the rest of our string and handed back only as much as it had to: the \d that follows uses the "+" quantifier, which requires just one digit, so giving back a single "3" was enough to satisfy it.
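If you want to watch the greedy behaviour for yourself, here's a rough sketch using Python's re module (an assumption on my part; the same pattern syntax is used, but the function names differ from the PHP ones).

```python
import re

subject = "Foo ###123 bar"

# Greedy .* grabs as much as it can, leaving one digit for (\d+)
greedy = re.search(r"^[a-z]+.*(\d+)", subject, re.IGNORECASE)

print(greedy.group(0))  # Foo ###123
print(greedy.group(1))  # 3
```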

So how do you fix that issue? -- Answer: You combine the greedy quantifier with the "?" quantifier. This makes that part of the pattern "ungreedy".

%%(language-ref) /^[a-z]+.*?(\d+)/i %%

Produces

%%(language-ref) Array ( [0] => Foo ###123 [1] => 123 ) %%

Essentially, we've told the regex engine to match as few characters as possible with the .* part, always checking first whether the next part of the pattern can feasibly match the following character.
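The ungreedy version can be checked the same way. This is only a sketch in Python's re module (assumed here as a stand-in for the PHP functions), run against the same string.

```python
import re

subject = "Foo ###123 bar"

# Ungreedy .*? takes as little as possible, so (\d+) gets all the digits
lazy = re.search(r"^[a-z]+.*?(\d+)", subject, re.IGNORECASE)

print(lazy.group(0))  # Foo ###123
print(lazy.group(1))  # 123
```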

Note: There is a "U" pattern modifier which makes the entire pattern ungreedy by default... use with caution!

From this point on... the tutorial is covering some advanced concepts. You'll only really need to use this stuff when you are writing very long patterns etc but anyway...

Special commands:

Regex can do some really clever things using instructions in the middle of the pattern. The syntax for providing these instructions is

%%(language-ref) (?instruction) %%

We'll bring this into play from here on.

Mid-pattern modifiers:

You've seen that you can modify the behaviour of a pattern by adding some letters after the closing delimiter. Brilliant! Guess what, we can twist and bend the behaviour of our regex mid-pattern by doing something similar ;) You'll like this.

The basic syntax is like this

%%(language-ref) <<< Modify the part in parens to TURN ON the modifier >>>

(?i: ... ) (?m: ... ) (?s: ... ) (?U: ... )

<<< Modify the part in parens to TURN OFF the modifier >>>

(?-i: ... ) (?-m: ... ) (?-s: ... ) (?-U: ... ) %%

Those letters, "i", "m" etc are the same pattern modifiers we used in the crash course... but now we can use them inside our pattern. This is only really handy if you have some very non-uniform string to match or you are writing a very long pattern.

An example usage:

String 1: Where IS the UK?

String 2: where is the UK?

%%(language-ref) /[a-z\s]*?(?-i:UK)/i %%

Let's say we always want UK to be in uppercase but the rest of the string is likely to have uppercase and lowercase characters in different places depending on who typed it. We use the "i" modifier to account for the differences in the way people write... but what about UK being uppercase? Here we have disabled the "i" modifier for that specific part of the pattern so it matches uppercase UK specifically.
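Here's a small sketch of switching "i" off for just the UK part. It assumes Python's re module (3.6 or later), which accepts the same (?-i: ... ) group syntax; the third test string is a made-up counter-example.

```python
import re

# "i" applies to the whole pattern, except inside the (?-i: ... ) group
pattern = re.compile(r"[a-z\s]*?(?-i:UK)", re.IGNORECASE)

upper_1 = pattern.search("Where IS the UK?")
upper_2 = pattern.search("where is the UK?")
lower = pattern.search("where is the uk?")  # hypothetical counter-example

print(upper_1.group(0))  # Where IS the UK
print(upper_2.group(0))  # where is the UK
print(lower)             # None
```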

Lookaheads:

Lookaheads come in two flavours: positive and negative. What they do is check whether a particular string follows part of the pattern, without including that string in the match. You won't often need these since you can normally just put the string itself into the pattern.

Syntax: %%(language-ref) pattern(?= ... ) %%

In the above, "pattern" must be followed by whatever is after the "?=" in the parens.

String: Sunshine %%(language-ref) /[a-z]+(?=shine)/i %%

The above pattern matches the word "Sun"... it would also match the "Moon" in Moonshine.
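A quick sketch of the positive lookahead, assuming Python's re module as a stand-in. "Sunlight" is an extra made-up word to show a non-match.

```python
import re

# Letters are matched only if "shine" comes straight after them
pattern = re.compile(r"[a-z]+(?=shine)", re.IGNORECASE)

found = pattern.findall("Sunshine Moonshine Sunlight")

print(found)  # ['Sun', 'Moon']
```

Notice "shine" itself never ends up in the match; the lookahead only peeks at it.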

Negative lookaheads mean that part of the pattern must not be followed by the lookahead.

Syntax: %%(language-ref) pattern(?! ... ) %%

Example: %%(language-ref) /sun(?!shine)[a-z]*/i %%

The above pattern will match any word starting with "sun" but NOT starting with "sunshine".
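The negative lookahead can be tried out like this. It's a minimal sketch assuming Python's re module, with a made-up list of words.

```python
import re

# "sun" is matched only when "shine" does NOT come straight after it
pattern = re.compile(r"sun(?!shine)[a-z]*", re.IGNORECASE)

found = pattern.findall("sunlight sunshine Sunday sun")

print(found)  # ['sunlight', 'Sunday', 'sun']
```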

Fixed-width Lookbehinds:

These can prove very useful. They have one drawback, however: you need to know the size of whatever goes in the lookbehind, due to the way the regex engine works. What they do is exactly the same as lookaheads except that they look backwards. The pattern they apply to must, or must not, follow the lookbehind depending upon whether it is positive or negative.

Syntax for positive: %%(language-ref) (?<= ... )pattern %%

Notice that the lookbehind physically goes before the pattern?

Example positive lookbehind: %%(language-ref) /(?<=sun)[a-z]+/i %%

The above pattern will match the end of any word starting with "sun", such as "shine" or "light".
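As a rough sketch of the positive lookbehind (assuming Python's re module, which also insists on fixed-width lookbehinds), with "moonlight" thrown in as a made-up non-match:

```python
import re

# Letters are matched only if "sun" sits directly before them
pattern = re.compile(r"(?<=sun)[a-z]+", re.IGNORECASE)

found = pattern.findall("sunshine Sunlight moonlight")

print(found)  # ['shine', 'light']
```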

Negative lookbehind syntax: %%(language-ref) (?<! ... )pattern %%

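Example negative lookbehind: %%(language-ref) /\b[a-z](?<!a)[a-z]*/i %%

The above matches any word which does NOT start with a letter "a". The lookbehind sits after the first letter and checks that the letter just matched is not an "a". The \b assertion just makes sure we're at the start of a word.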
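Here's a minimal sketch of a negative lookbehind that skips words starting with "a". It assumes Python's re module, and the lookbehind is placed after the first letter so that it inspects that letter rather than whatever precedes the word. The sentence is made up.

```python
import re

# \b anchors at a word start, [a-z] takes the first letter,
# (?<!a) rejects the word if that letter was an "a" (or "A", due to "i")
pattern = re.compile(r"\b[a-z](?<!a)[a-z]*", re.IGNORECASE)

found = pattern.findall("An apple and a banana for Bob")

print(found)  # ['banana', 'for', 'Bob']
```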

Grouping with parens without extracting:

If you surround parts of your pattern in parens they will end up in backreferences as you've seen. These special commands with (? ... ) don't behave in that way however. There's a little command that simply tells the regex engine to group characters, but not extract them.

Syntax: %%(language-ref) (?: ... ) %%

Example: %%(language-ref) /Foo(?:bar)+/ %%

That matches Foobar, Foobarbar, Foobarbarbarbarbarbar ... etc etc. The advantages of using that little command are that you'll save a negligible amount of memory and speed up the matching slightly. In the real world you'll use these a fair bit and they prove to be very handy at preventing things from getting cluttered in larger patterns.
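To compare grouping with and without extraction, here's a small sketch assuming Python's re module. The second pattern, with plain parens, is only there for contrast.

```python
import re

subject = "Foobarbarbar"

grouped_only = re.search(r"Foo(?:bar)+", subject)  # groups, extracts nothing
captured = re.search(r"Foo(bar)+", subject)        # groups AND extracts

print(grouped_only.group(0))  # Foobarbarbar
print(grouped_only.groups())  # ()
print(captured.groups())      # ('bar',)
```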

Extracting named backreferences:

This is nice and all, but it may confuse anyone who only knows pretty standard regex. It basically allows you to name all your backreferences (extracted parts) so that you can make more readable code.

Syntax: %%(language-ref) (?P<Name> ... ) %%

That's an UPPERCASE "P" and those less-than/greater-than symbols really are supposed to be there!

Example: Let's use our first one %%(language-ref) String: Foo ###123 bar

/^[a-z]+.*?(?P<thenumber>\d+)/i %%

This produces the following %%(language-ref) Array ( [0] => Foo ###123 [thenumber] => 123 [1] => 123 ) %%

Notice that it doesn't replace the numeric backreference altogether, it simply adds a named one too.
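Named backreferences can be sketched like this, assuming Python's re module, which uses the same (?P<Name> ... ) syntax. The name "thenumber" is taken from the output shown above.

```python
import re

subject = "Foo ###123 bar"

match = re.search(r"^[a-z]+.*?(?P<thenumber>\d+)", subject, re.IGNORECASE)

print(match.group("thenumber"))  # 123
print(match.group(1))            # 123 (the numeric backreference still exists)
print(match.groupdict())         # {'thenumber': '123'}
```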

I feel like I've taken you far enough with this now and all you can do is to keep practising and using regex.

Don't forget you can nest these little commands too ;) ...

%%(language-ref) /[a-z]+\d(?-i:FOO(?=(?i:bar)))/i %%
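A nested pattern like that is easier to trust once you've poked at it. Here's a rough sketch assuming Python's re module (3.6 or later for the (?-i: ... ) groups); the three test strings are made up.

```python
import re

# Outer "i" is on; FOO must be uppercase; the "bar" that follows may be any case
pattern = re.compile(r"[a-z]+\d(?-i:FOO(?=(?i:bar)))", re.IGNORECASE)

mixed = pattern.search("abc1FOObar")
shouting = pattern.search("ABC1FOOBAR")
lower_foo = pattern.search("abc1foobar")

print(mixed.group(0))     # abc1FOO
print(shouting.group(0))  # ABC1FOO
print(lower_foo)          # None
```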

Enjoy playing with those advanced features!

If I've made any mistakes please have a whinge so that I can correct them :D

Have fun!