(Updated 3/26/2011. This is geared towards .Net Regex v1-4.0.)

As one moves from beginner to intermediate in regular expression parsing, the if conditional (? () () | () ) really opens up what one can do with regular expressions. This short article shows how to use the if conditional and why it is valuable.

Anecdotal Evidence…Kind of

On MSDN, where I am a moderator of the .Net Regular Expression Forum, one of the posts I responded to came from a user asking about string parsing. He had a sentence and was parsing it with basic string functions, but wanted to use a regular expression instead to break out tokens. My suggestion was to run a regex over the whole sentence and divide what was needed into named capture groups. To do this, the pattern has to work on two levels: one to tokenize the sentence and one to process certain items. That is where the if conditional comes in.

IF Usage Plan of Attack

That scenario cried out for the if conditional because the user wanted to separate numbers from the other words in the sentence and then sort the numbers by size. The sentence looked like this:

This is a test 100 9999 22

His goal was to break out the numbers of three digits or fewer and save them, but ignore the longer numbers and the words. In my mind, the regex has to do the following:

  1. Break the sentence up in word boundaries.
  2. If it is a word place it into the words category.
  3. If it is a number place it into the numbers category.
  4. For a number, check whether it is a small number or a number that is too big.

See the If’s above…that is an indicator that a regex If is needed as well.

Regex If Overview

Here is how the if conditional works:

(? (Match Conditional Check no capture) (If true then match/capture this) | (else match/capture this when conditional match is false) )

Above, the actual regex syntax (parentheses, question mark and pipe) is mixed with textual descriptions of each operation. What it says is:

If the first check matches, the regex goes on to the next group, the true match, and matches or captures it.

When the check does not match, the true match is skipped and the group after the pipe, the false match, is used instead.

The pipe is the usual regex or operator; here it separates the true match from the false match.
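To make the structure concrete, here is a minimal sketch of a conditional in C#. The pattern, the group names Num and Word, and the class ConditionalOverview are made up for illustration and are not the forum user's regex. The check (\d) only looks at the next character; the branch that follows does the actual matching and capturing.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConditionalOverview
{
    // Hypothetical pattern for illustration only.
    // (\d) is the check: it peeks at the next character and keeps nothing.
    // True branch:  (?<Num>\d+) captures the whole number.
    // False branch: (?<Word>[a-z]+) captures a lowercase word.
    private const string Pattern = @"(?(\d)(?<Num>\d+)|(?<Word>[a-z]+))";

    public static List<string> Classify(string input)
    {
        List<string> results = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            // Only the group of the branch that ran reports Success.
            if (m.Groups["Num"].Success)
                results.Add("Num:" + m.Groups["Num"].Value);
            else if (m.Groups["Word"].Success)
                results.Add("Word:" + m.Groups["Word"].Value);
        }
        return results;
    }

    public static void Main()
    {
        foreach (string item in Classify("abc 123"))
            Console.WriteLine(item); // prints Word:abc, then Num:123
    }
}
```

Note that the check sees only one digit, yet the Num group holds all three; the true branch is free to match more than the check did.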

So my regex, with the IgnorePatternWhitespace option turned on to allow the comments (# …), starts out as:

(?(\b\d+\b)            # If it is a number
  ()                   # Removed for now.
|                      # Or from the first if
  (?<Word>[^\s.!?]*)   # It is a word
)
  • Line 1 : I have declared the if conditional and have created a match to look for a number.
  • Line 2 : This is blank ( … ) , but when the conditional is true, this is the match target which will be captured. I show the actual match capture below.
  • Line 3 : This tells the regex that we have an else condition to do if the if condition match fails.
  • Line 4 : When line 1's if conditional fails, processing goes to here. At this point the regex parser will do a true match and place the word into the Word group, capturing words like This and is.
  • Line 5 : Ending syntactic grouping for the if conditional.

Back to The User’s Request

So with the above example we have created a framework that routes each token as either a digit word or an alpha word, using the example sentence above.

Just to clarify again, remember the if's conditional match (the check) is not kept! That is a good thing. It gives us the freedom to

  1. Recreate the if conditional if needed, so we can match it.
  2. Or, more importantly, create a new match condition which frees us from the if conditional's match.

What that means is that we can do a basic find in the if, but a more robust match for the actual capture. That is what we will do to fill in line 2 above: we will find/match in the if, but create a more robust match/capture below it for the true conditional.

We will now work on the submatch for numbers. The rule provided by the original user was that a number either fits or it is too big…again, does this sound like another if conditional?

Yes it does.

Here is the full match/capture for the missing line 2 above. It contains another if conditional to split out the numbers, since we don't have to adhere to the if match/find.

(?(\b\d+\b)            # If it is a number (The match / Find )
   (?(\b\d{1,3}\b)      # Then If it is 1-3 digits  ->A new if in our original Match / Capture for True
    (?<Number>\b\d+\b) # Capture to the Number named group
    |
     (?<TooBig>[^\s]*) # Too big of a number.
    )
  |                    # Or from the first if
   (?<Word>[^\s.!?]*)  # It is a word
 )
 (?:\s?|$|[.?!])       # Match but don't capture the nonwords, which will not be presented in the final output (this is not a part of the if)

  • Line 2 : Now that we know it's a number, let us check the size in a new if conditional. That condition is 1-3 digits in size.
  • Line 3 : If Line 2's conditional is true, we actively match and place the number into the Number named match group.
  • Line 4 : This is the or that goes with our sub if from line 2 above.
  • Line 5 : Line 2's conditional failed, so we have a number that is too big. Place it into the named capture group TooBig.
  • Line 8 : This was explained above and has not changed, nor is it affected by the new processing above.
  • Line 10 : This is outside all of the conditionals and simply picks up the next whitespace after the word or number. I have now added it to complete the actual regex to use on the sentence.
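For anyone who wants to try the finished pattern, the following is one way it might be run from C#. It is a sketch, not the code from the forum thread: the class SentenceTokenizer, the Tokenize method and the "Name:value" output format are assumptions made for illustration. One detail worth knowing is that every piece of the pattern can match nothing, so Regex.Matches also reports an empty match at the end of the input; the sketch skips it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SentenceTokenizer
{
    // The article's pattern; needs RegexOptions.IgnorePatternWhitespace for the comments.
    private const string Pattern = @"
(?(\b\d+\b)              # If it is a number (the find)
   (?(\b\d{1,3}\b)       # Then if it is 1-3 digits
     (?<Number>\b\d+\b)  # Capture to the Number named group
   |
     (?<TooBig>[^\s]*)   # Too big of a number
   )
 |                       # Or from the first if
   (?<Word>[^\s.!?]*)    # It is a word
)
(?:\s?|$|[.?!])          # Match, but do not capture, the separator";

    // Hypothetical helper: returns each token as Name:value, in sentence order.
    public static List<string> Tokenize(string sentence)
    {
        List<string> tokens = new List<string>();
        foreach (Match m in Regex.Matches(sentence, Pattern, RegexOptions.IgnorePatternWhitespace))
        {
            if (m.Length == 0) continue; // skip the empty match at the end of the input
            foreach (string name in new string[] { "Number", "TooBig", "Word" })
            {
                Group g = m.Groups[name];
                if (g.Success && g.Length > 0) tokens.Add(name + ":" + g.Value);
            }
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (string token in Tokenize("This is a test 100 9999 22"))
            Console.WriteLine(token);
        // prints Word:This, Word:is, Word:a, Word:test, Number:100, TooBig:9999, Number:22
    }
}
```

To save only the small numbers, as the user wanted, read just the Number group and ignore Word and TooBig.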

Summary

So one can see how the if conditional in a regex can break a large problem down into smaller ones. Here are some thoughts on if conditionals:

  1. Always use the IgnorePatternWhitespace option so you can visually see the operations unfold and comment your regular expression.
  2. Remember that the if's match condition is only a check and is not kept. The true and false matches do not have to repeat it, so you are not limited to the conditional's own match.
  3. Also one does not have to have the else target, which might come into play in certain circumstances.