As one moves from beginner to intermediate in one's knowledge of regular expressions, the if conditional (?(condition) yes | no) really opens up what one can do with them. This is a short article on how to use the if conditional and why it is valuable.
In one forum post that I responded to, the user had parsed a sentence using string functions and then wanted to use a regular expression to parse some of the resulting tokens. My suggestion was to run a regex over the whole sentence, dividing what he wanted into named capture groups. To do this, the regex pattern has to work on two levels: first to tokenize the sentence, and second to process certain items.
That scenario cried out for the if conditional, because the user wanted to split the numbers out from the other words in the sentence and then split the numbers by size. The sentence the user had in mind was
This is a test 100 9999 22
His goal was to break out the numbers, saving those of three digits or fewer while ignoring longer digit runs and words. So in my mind the rules were like this
- Break the sentence up on word boundaries.
- If it is a word, place it into the words category.
- If it is a number, place it into the numbers category, but check whether it is a small number or one that is too big.
That lends itself to answering those basic questions in the regex. That is where the if conditional comes into play:
(? (Match Conditional) (If true then match this) | (else match this when no conditional match) )
Above, I have mixed the actual syntax (parentheses, question mark and pipe) with textual descriptions of each part. What it says is: if the match condition matches, do the next match. If it does not, skip to the match after the pipe and do that one instead. The pipe is the regex alternation operator, meaning or.
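To make that shape concrete, here is a minimal sketch in C#, written as top-level statements (which assumes C# 9 or later). The pattern and the test strings are made up for illustration and do not come from the forum post: if the next character is a digit, the whole input must be digits, otherwise it must be lowercase letters.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative pattern (an assumption, not the article's):
//   condition: \d       -> is the next character a digit?
//   if true:   \d+      -> match a run of digits
//   else:      [a-z]+   -> match a run of lowercase letters
var pattern = @"^(?(\d)\d+|[a-z]+)$";

bool digitsOk = Regex.IsMatch("123", pattern);   // condition true, digits branch matches
bool lettersOk = Regex.IsMatch("abc", pattern);  // condition false, letters branch matches
bool mixedOk = Regex.IsMatch("12a", pattern);    // condition true, digits branch cannot reach the end

Console.WriteLine(digitsOk + " " + lettersOk + " " + mixedOk); // prints "True True False"
```

The third case shows the routing: once the condition says "digit", only the digits branch is attempted.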
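So my regex, with IgnorePatternWhitespace turned on to allow the comments, looked like this. (The line numbers 1:, 2:, etc. are only for reference and are not part of the pattern. The backslashes are doubled, as they would be inside a regular C# string literal.)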
1: (?(\\b\\d+\\b) # If it is a number
2: () # Removed for now.
3: | # Or from the first if
4: (?<Word>[^\\s.!?]*) # It is a word
5: )
- Line 1 : I have declared the if conditional and have created a match to look for a number.
- Line 2 : I have left out what is done with the number for now, but when the conditional is true, this is the match target.
- Line 3 : This tells the regex that we have an else condition to do if the if condition match fails.
- Line 4 : When line 1's if conditional fails, processing goes here. At this point the regex parser will do a true match and place the word into the Word group, capturing words like This, is …
- Line 5 : Ending syntactic grouping for the if conditional.
So we have accomplished the routing of either a digit word or an alpha word with the above pattern.
Always remember that the if conditional's match is not kept! That is a good thing. It gives us the freedom to
- Repeat the if conditional's pattern in the match target if needed, so we can actually match it.
- Or, more importantly, create a new match pattern that is independent of the if conditional's match. That means we can do a basic match in the if, but a more robust one to do the actual capture. That is what we will do to fill in line 2 above.
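A small sketch of that point, again in C# top-level statements. The pattern, the group names and the input here are hypothetical and chosen only for illustration: the condition merely peeks at a single digit, while the match target captures an entire decimal number starting at the same position.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical pattern: the condition \d consumes nothing, so the
// match target is free to capture more than the condition looked at.
var pattern = @"(?(\d)(?<Number>\d+(?:\.\d+)?)|(?<Word>[a-z]+))";

MatchCollection matches = Regex.Matches("3.14 pie", pattern);

Match first = matches[0];
string number = first.Groups["Number"].Value; // "3.14", captured from index 0
int numberIndex = first.Index;                // 0, the condition did not advance the position

Match second = matches[1];
string word = second.Groups["Word"].Value;    // "pie"

Console.WriteLine(numberIndex + ": " + number);
Console.WriteLine(second.Index + ": " + word);
```

If the condition's match were kept, the capture could not start at the digit the condition had already tested.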
We will now work on the submatch for numbers. The rule for a number is that either it fits or it is too big. Does this sound like another if conditional? Yes it does. Here is the full match for line 2, which contains another if conditional to split out the numbers
1: (?(\\b\\d+\\b) # If it is a number
2: (?(\\b\\d{1,3}\\b) # Then If it is 1-3 digits
3: (?<Number>\\b\\d+\\b) # Capture to the Number named group
4: |
5: (?<TooBig>[^\\s]*) # Too big of a number.
6: )
7: | # Or from the first if
8: (?<Word>[^\\s.!?]*) # It is a word
9: )
10: (?:\\s?|$|[.?!]) # Match but don't capture the nonwords.
- Line 2 : Now that we know it's a number, let us check the size in a new if conditional. That condition is 1-3 digits in size.
- Line 3 : If line 2's conditional is true, we actively match and place the number into the Number named match group.
- Line 4 : This is the or that belongs to line 2's conditional, not line 1's.
- Line 5 : Line 2's conditional failed, so we have a number that is too big. Place it into the named capture group TooBig.
- Line 8 : This was explained above and has not changed nor does it associate with the new processing above.
- Line 10 : This is outside all of the conditionals and simply consumes the whitespace or punctuation after the word or number. I have added it now to complete the actual regex to use on the sentence.
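For reference, here is one way the complete pattern might be run in C#, sketched as top-level statements. The pattern is the one above, written as a verbatim (@-quoted) string so the backslashes are single and the line numbers are dropped. The list names and the filtering of empty matches are my own assumptions about how one would collect the results.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var pattern = @"
(?(\b\d+\b)                   # If it is a number
    (?(\b\d{1,3}\b)           # Then if it is 1-3 digits
        (?<Number>\b\d+\b)    # Capture to the Number named group
      |
        (?<TooBig>[^\s]*)     # Too big of a number
    )
  |                           # Or from the first if
    (?<Word>[^\s.!?]*)        # It is a word
)
(?:\s?|$|[.?!])               # Match but don't capture the nonwords
";

var numbers = new List<string>();
var tooBig = new List<string>();
var words = new List<string>();

foreach (Match m in Regex.Matches("This is a test 100 9999 22", pattern,
                                  RegexOptions.IgnorePatternWhitespace))
{
    if (m.Groups["Number"].Success) numbers.Add(m.Groups["Number"].Value);
    else if (m.Groups["TooBig"].Success) tooBig.Add(m.Groups["TooBig"].Value);
    // Every part of the pattern can match nothing, so an empty match
    // shows up at the end of the input; skip empty words.
    else if (m.Groups["Word"].Value.Length > 0) words.Add(m.Groups["Word"].Value);
}

Console.WriteLine("Numbers: " + string.Join(", ", numbers)); // Numbers: 100, 22
Console.WriteLine("TooBig:  " + string.Join(", ", tooBig));  // TooBig:  9999
Console.WriteLine("Words:   " + string.Join(", ", words));   // Words:   This, is, a, test
```

The length check matters because the Word group and the trailing separator are both allowed to match zero characters.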
So one can see how if conditionals in regexes can break down large problems. Here are some thoughts on if conditionals
- Always use the IgnorePatternWhitespace option so you can visually see the operations unfold.
- Remember that the if match condition does not have to be repeated in the match target, though I have had to in some cases and wished I could capture it. More often than not it is a good thing: one can then do other things in the match target, so you are not limited to the conditional's match.
- The else target is optional, which might come into play in certain circumstances.
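As a sketch of that last point, the following C# top-level statements use a conditional with no else target. The pattern and inputs are invented for illustration: if the input starts with a digit, a three-digit prefix and a dash are required, and when the condition is false the conditional simply matches nothing.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative pattern (an assumption): the conditional has only an "if true" target.
var pattern = @"^(?(\d)\d{3}-)[a-z]+$";

bool withPrefix = Regex.IsMatch("123-abc", pattern); // condition true, prefix matches
bool noPrefix = Regex.IsMatch("abc", pattern);       // condition false, nothing required
bool shortPrefix = Regex.IsMatch("12-abc", pattern); // condition true, prefix too short

Console.WriteLine(withPrefix + " " + noPrefix + " " + shortPrefix); // prints "True True False"
```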
#1 by james peckham on June 19, 2007 - 1:32 pm
having a hard time reading your page, can you get a little more contrast in the text?
#2 by omegaman on June 19, 2007 - 3:24 pm
Hi James, I am unable to tweak the format at this time; it's someone else's template and I would have a learning curve. One suggestion: click on the RSS feed in IE7. The background will be white with black letters. The titles will be in yellow, but that should give you the contrast you need. Or highlight the text in the article and it should invert the colors; sorry, I know that is a pain. – Thanks.
#3 by Alok Diwan on April 10, 2009 - 5:27 am
Good article Keep it up
#4 by nima dilmaghani on October 27, 2009 - 11:47 am
Nice article. Try @-quoting your strings. Then you will be able to single escape instead of doubling your escapes, and your regexes would be much easier on the eyes.
string s1 = @"c:\Docs\Source\a.txt"; // rather than
string s2 = "c:\\Docs\\Source\\a.txt";
#5 by omegaman on October 28, 2009 - 3:13 pm
Excellent suggestion! I actually do that while coding, by using the literal @. What happened when I wrote this article was that I was fighting with WordPress and the tool I used to display code. WordPress would strip out special characters and would take the backslashes away in my code snippet. Grrr. So I had to resort to other methods. I couldn't edit a post once published in the admin pages….
Long story short WordPress no longer strips characters willy-nilly and now I use Syntax Highlighter to display code in non-touched pre tag blocks.
Since this article is still very much read and resourced by the internet, I will update the code to reflect your suggestion.
Thanks!
#6 by Bharani Chowdary Chunduri on February 11, 2010 - 5:43 pm
Very valuable article. It helped me a lot in getting used to if then else conditionals with regex. Thanks heaps for this.