Archive for the ‘Regular Expressions’ Category

Posted by OmegaMan at October 18, 2008

Category: Regular Expressions

This article covers how to use the Regex.Split function appropriately and the pitfalls one may run into when using it. It is based on .Net 3.5 using C#, but applies to any version of .Net and any .Net language.

Overview

Regex.Split is a great tool which allows one to do more than the simple string.Split, but it has some serious pitfalls for the uninitiated. Let us first review how it works in code. I call this the Kool-aid example because it looks very much like string.Split; it conveys that it is easy to use….

foreach (string str in Regex.Split("Linq-to-SQL", "-"))
    Console.WriteLine(str);

/* Writes

Linq
to
SQL

*/

Pretty obvious, and it appears to work just like string.Split: we are splitting on the dash and it works. But one might as well use string.Split for the easy cases, because in real life one doesn’t use Regex.Split on basic patterns.

Pitfalls

Since a regex pattern is used to match specific text, one believes that a pattern which picks up valid matches can be transferred into Regex.Split as is…oh no. One has to be extra vigilant with the pattern, because one probably wants to split on a particular item but forgets that it is surrounded by, and may contain, whitespace and line feeds.

For example, since people work with textual items and not esoteric numeric examples, say you wanted to remove certain lines from paragraphs. Say this is the originating text:

Cinthia Blake
Olso Norway
You finished in 1st place.
Fred Alter
Chicago USA
You finished in 4th place.

The goal is to remove the “You finished…” lines. The natural thing to do is to create a match for the line. A pattern such as

Y[^\.]+\.

will find text that starts with a Y and match/consume until it finds a period. Run that pattern and the data through a regex pattern testing tool and we get this result:

You finished in 1st place.
You finished in 4th place.

Great! So one loads it into code and runs it such as:

string input = @"Cinthia Blake
Olso Norway
You finished in 1st place.
Fred Alter
Chicago USA
You finished in 4th place.";

string pattern = @"Y[^\.]+\.";

foreach (string str in Regex.Split(input, pattern))
    Console.WriteLine( str );

/* Outputs:

Cinthia Blake[CR][LF]
Olso Norway[CR][LF]
[CR][LF]
[CR][LF]
Fred Alter[CR][LF]
Chicago USA[CR][LF]
[CR][LF]

*/

I have shown the \r\n whitespace as [CR][LF] to make the problem visible. Instead of a clean list of name, location, name, location…we now have name, location, blank line, blank line, name, location, blank line. Whoa! Where did those lines come from?

The user thinks that Regex.Split is not working because it is returning extra lines….and gives up!

No. The problem is that the whitespace was not accounted for when making the pattern for the split. Yes, it matched the lines and dutifully split on them, but it left the surrounding whitespace behind, almost as an afterthought. Frustrating to the user.

Conclusion

You have to be extra vigilant about using Regex.Split and its pattern. Be painfully aware of whitespace. Here is a pattern to handle the whitespace and achieve the result intended:

string input = @"Cinthia Blake
Olso Norway
You finished in 1st place.
Fred Alter
Chicago USA
You finished in 4th place.";

string pattern =
@"[\r\n]{0,2}        # If there is Line feeds before then match it.
Y[^\.]+\.            # original pattern
\s*                  # Maybe there are spaces after the sentence...get those
[\r\n]{0,2}          # *If* there is Linefeeds after then match it.
";

// IgnorePatternWhitespace allows us to comment the code, it does
// NOT apply to the processing of spaces within the input text.

foreach (string str in Regex.Split(input, pattern, RegexOptions.IgnorePatternWhitespace))
    Console.WriteLine( str );

/* Outputs:

Cinthia Blake[CR][LF]
Olso Norway
Fred Alter[CR][LF]
Chicago USA

*/

Now we have what we expected: two groups, split on the appropriate sentences. The result comment shows where whitespace still remains.

But long story short, be extra vigilant about whitespace in real world Regex.Split work. Just matching on something will not fit the bill.
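
As a variation on the fix above, here is a minimal sketch that lets \s* soak up the whitespace on both sides of the sentence and then drops any empty pieces, since Regex.Split returns a trailing empty string when the input ends with a match. The class and method names (SplitSketch, SplitRecords) are made up for illustration, and the simplified pattern is my assumption, not the pattern used above.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitSketch
{
    // Splits on the "You finished ..." sentence together with the
    // whitespace around it, then removes the empty entries that
    // Regex.Split leaves when a match sits at the end of the input.
    public static string[] SplitRecords(string input)
    {
        string pattern = @"\s*Y[^.]+\.\s*";

        return Regex.Split(input, pattern)
                    .Where(piece => piece.Length > 0)
                    .ToArray();
    }
}
```

Calling SplitSketch.SplitRecords with the sample text returns two entries, one per person, each still containing the line break between name and location.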

Posted by OmegaMan at August 10, 2008

Category: Linq, Regular Expressions

Here is a code snippet which accomplishes the following goals:

  • It marries a C# Regular Expression MatchCollection to a property list using Linq.
  • It uses a Regex Pattern which creates named capture groups which Linq can easily exploit in the join of two data lists.

Let me show you the code. Don’t get hung up on the pattern or what it is doing. What needs to be known is that the pattern places the matched data into named capture groups called Key and Value. The text captured in Key corresponds to a property on a real class. Using reflection we will find that property on the class and link its property name to the value stored. That will allow us to set that property’s value on the class from the Value we get from the regex match.

The goal of the Linq code is to join the matches to another list, the list of properties from the class, where the common item is the PropertyInfo.Name found in that list. Once the data is joined, a new object is created which holds the actual property object and the text from Value. That new list allows the following operations to set each target property to the Value of its match in the match collection.

public static T ASCIISerializeOut<T>( string targetSerialized )
     where T : new()
{

     T targetInstance = new T();

     string pattern = string.Format( @"(?<Key>[^{0}]*)(?:{0})(?<Value>[^{1}]*)(?:{1}?)",
           Seperators.cnKVPSeperator,   // "±"
           Seperators.cnSeperator );    // "¶"

     MatchCollection mcKVPs = Regex.Matches( targetSerialized,
                                             pattern,
                                             RegexOptions.Compiled );

     var kvps = from Match m in mcKVPs
                where mcKVPs != null
                where mcKVPs.Count > 0
                join prp in GetPublicProperties<T>() on m.Groups["Key"].Value equals prp.Name
                select new
                {
                    prop  = prp,
                    Value = m.Groups["Value"].Value ?? string.Empty
                };

     foreach (var item in kvps)
         item.prop.SetValue( targetInstance, item.Value, null );

     return targetInstance;

 }

 /// <summary>
 /// Return all public properties which are of string type from T class.
 /// </summary>
 public static IEnumerable<PropertyInfo> GetPublicProperties<T>()
 {
     return from p in typeof( T ).GetProperties()
            where p.PropertyType == typeof( string )
            select p;
 }

  • Line 01: The function takes in text such as "AProp±AValue¶BProp±BValue" which needs to be deserialized into a newly created instance of type T. The first item in the text is the property name AProp, followed by the key/value separator ±, then the value of the property AValue, and finally the pair separator ¶. Our regex will create an individual match for each key/value pair.
  • Line 07: This pattern, when used, will get the key and value pair combinations and place them in the named groups Key and Value of the match.
  • Line 11: Get all the key/value pair combinations into the match collection.
  • Line 15: Linq starts here: we define a var object kvps (key value pairs) which will loop over each match from the match collection.
  • Line 16: Make sure the collection is not null.
  • Line 17: Make sure there are one or more matches.
  • Line 18: Get all the public string properties of class T and join them to our collection data. Key should match the property Name.
  • Line 19: Each match whose key equals a property name will create the new object below, which has two properties.
  • Line 21: Save the actual property object; we need that later to load data.
  • Line 22: Get the value out of the Value group and save that as well. Note, if it is null, just use string.Empty. Thanks, null coalescing operator (??).
  • Line 25: Now enumerate through each var object created and load the target values into our newly minted class object of T.
  • Line 26: Set the target item’s property to the value found from the regex matches.
  • Line 28: Return the new object with the original text data loaded in.
  • Line 35: Return an enumeration of all public string properties of the type T.
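
To see the technique end to end without the surrounding project, here is a condensed sketch. It hard-codes the two separators named in the comments above (± and ¶) instead of using the Seperators class, and SamplePerson, KeyValueSketch and Deserialize are hypothetical names introduced only for this illustration.

```csharp
using System.Linq;
using System.Reflection;
using System.Text.RegularExpressions;

// Hypothetical target type with two string properties.
public class SamplePerson
{
    public string Name { get; set; }
    public string City { get; set; }
}

public static class KeyValueSketch
{
    public static T Deserialize<T>(string serialized) where T : new()
    {
        T target = new T();

        // Key runs up to the key/value separator, Value up to the pair separator.
        string pattern = @"(?<Key>[^±]*)±(?<Value>[^¶]*)¶?";

        // Join each match to the public string property whose name equals Key.
        var pairs = from Match m in Regex.Matches(serialized, pattern)
                    join prop in typeof(T).GetProperties()
                                          .Where(p => p.PropertyType == typeof(string))
                        on m.Groups["Key"].Value equals prop.Name
                    select new { Property = prop, Value = m.Groups["Value"].Value };

        foreach (var pair in pairs)
            pair.Property.SetValue(target, pair.Value, null);

        return target;
    }
}
```

With this sketch, KeyValueSketch.Deserialize<SamplePerson>("Name±Ada¶City±London") gives an object whose Name is Ada and whose City is London. Keys that do not correspond to a string property simply fall out of the join.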

For completeness see my post entitled A C# ASCII Serializer Generic Method for Class Objects which has the actual downloaded project and working test example. (Post coming soon!)

Posted by OmegaMan at August 4, 2008

Category: Regular Expressions

The option IgnorePatternWhitespace only applies to how the regex processor reads the pattern, not to how it handles the input text. The option allows one to document a pattern by placing items on different lines and adding comments after the # comment character. Using the option I can document my regex as follows:


string pattern = @"
^                # Beginning of The Line
(?<Text>[^\d]+)  # Move all non numbers into the Text group.
(?<Number>\d+)   # Get all numbers into the Number Capture group.
";

Without the option, the pattern must be written like this:


string pattern = @"^(?<Text>[^\d]+)(?<Number>\d+)";

Hence the readability for patterns is greatly increased.
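
A small sketch to confirm that the two forms behave the same. The class and method names are hypothetical. The second method shows one consequence of the option worth knowing: because unescaped whitespace in the pattern is thrown away, a literal space has to be escaped.

```csharp
using System.Text.RegularExpressions;

public static class PatternWhitespaceSketch
{
    // Commented, multi-line form; only valid with IgnorePatternWhitespace.
    private const string Commented = @"
^                # Beginning of the line
(?<Text>[^\d]+)  # All non-digits go into the Text group
(?<Number>\d+)   # All digits go into the Number group
";

    // The same pattern on one line, no option needed.
    private const string Compact = @"^(?<Text>[^\d]+)(?<Number>\d+)";

    // Returns "Text|Number" for the first match, using either form.
    public static string Describe(string input, bool useCommentedPattern)
    {
        Match m = useCommentedPattern
            ? Regex.Match(input, Commented, RegexOptions.IgnorePatternWhitespace)
            : Regex.Match(input, Compact);

        return m.Groups["Text"].Value + "|" + m.Groups["Number"].Value;
    }

    // With the option on, "a b" is read as "ab"; an escaped space stays literal.
    public static bool SpaceDemo()
    {
        bool unescapedIgnored = Regex.IsMatch("ab", "a b", RegexOptions.IgnorePatternWhitespace);
        bool escapedKept = Regex.IsMatch("a b", @"a\ b", RegexOptions.IgnorePatternWhitespace);
        return unescapedIgnored && escapedKept;
    }
}
```

For an input such as Widget42, both forms return Widget|42.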

Posted by OmegaMan at December 27, 2007

Category: .Net, Regular Expressions

Here are some things which can make working with C# regular expression patterns much easier, in both reading and processing.

Suggestion 1 – Avoid C# Text Escape Pollution

When working in C# it can become confusing when one has to deal with string literals and escapes even before dealing with the regular expression escapes. For example if we have an escape such as word boundary in regex (\b) we have to escape the escape in C# such as

string pattern = "\\b";

That gets confusing because we don’t want to have to deal with C#…we are working in regex, and \\b does not look like what we mean (though it gets sent to the parser appropriately). What we should do is use the C# verbatim literal convention (@) in front of the string, such as

string pattern = @"\b";

The two shown C# lines are functionally equivalent…but now we can concentrate on the regex pattern with no pollution from C# escapes.

Suggestion 2 – Use Regex Ascii Escapes for quotes

Some people will go out of their way to use doubled double quotes (@” “” “) in a C# verbatim string to search for a double quote in a regex pattern. This is confusing; try using the regex ASCII escape instead. Below is a code sample showing the two equivalent forms:

string pattern = @""""; // I am only searching for a quote
 
// VS
 
string pattern2 = @"\x22"; // Much better
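
Both suggestions so far can be checked with a few lines. This is only a sketch with hypothetical names (EscapeSketch and its methods): the first method compares the two C# literals for the word boundary, the second counts double quotes using either form of the quote pattern.

```csharp
using System.Text.RegularExpressions;

public static class EscapeSketch
{
    // Suggestion 1: the escaped literal and the verbatim literal
    // are the same two characters, a backslash followed by b.
    public static bool WordBoundaryLiteralsMatch()
    {
        return "\\b" == @"\b";
    }

    // Suggestion 2: both patterns find the double quotes in the input.
    public static int CountQuotes(string input, bool useHexEscape)
    {
        string pattern = useHexEscape ? @"\x22" : @"""";
        return Regex.Matches(input, pattern).Count;
    }
}
```

Either value of useHexEscape gives the same count for the same input; the hex form is just easier on the eyes.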
 

Note: if you are using Expresso as your regex editor, it provides a handy way of finding those escapes.

Suggestion 3 – Use the IgnorePatternWhitespace option

This option confuses beginning regex-ers because they think it applies to what the pattern does….when in fact it is solely a preprocessing instruction for the regex parser! It allows you to put whitespace in a pattern and spread it over several lines for easier reading. Here is a sample I created for a forum post where I was able to break out a long pattern, thereby commenting it and making it easier to read. Without the IgnorePatternWhitespace option, one would have to remove the comments and put it all on one line:

string text =
@"5:16:04.859 PM:  07:18:12p  2.33   0.45   NH4                      9558    WORK
5:16:06.000 PM:  07:18:13p  2.29   0.31   RIN                     10554    WORK
5:16:07.625 PM:  07:18:15p  2.33   0.44   NH4                      9645    WORK
5:16:09.125 PM:  07:18:16p  2.29   0.32   RIN                     10400    WORK";

 
string pattern =
@"^(?<Time1>[^\s]*)  # Start of line, capture first time and place into Time1
   (?:\s*)           # Match but don't capture (MBDC) the space (Used as an anchor)
   (?<AmPm1>[AP]M)   # Get the AM | PM and put it into AmPm1 capture group.
   (?:\:\s*)         # MBDC : and space
   (?<Time2>[^ap]*)  # Time 2 Capture
   (?<AmPm2>[ap])    # AmPm capture
   (?:\s*)
   (?<Col1>[^\s]*)   # Data column 1
   (?:\s*)
   (?<Col2>[^\s]*)   # Data column 2
   (?:\s*)
   (?<Col3>[^\s]*)   # Data column 3
   (?:\s*)
   (?<Col4>[^\s]*)   # Data column 4
   (?:\s*)
   (?<Col5>[^\s]*)   # Data column 5";

 
Regex rgx = new Regex(pattern,
                  RegexOptions.Multiline | // ^ and $ match Beginning and EOL.
                  RegexOptions.IgnorePatternWhitespace); // Allows us to do the comments.
 
 
string[] groupNames = rgx.GetGroupNames();
 
Console.WriteLine("Groups: ({0}){1}", string.Join(") (", groupNames), System.Environment.NewLine);
 
MatchCollection mc = rgx.Matches(text);
 
foreach (Match m in mc)
    if (m.Success)
    {
        Console.WriteLine("Match:");
        foreach (string name in groupNames)
            Console.WriteLine("{0,10} : {1}", name, m.Groups[name]);
 
        Console.WriteLine("{0}Time1 ({1}) Time2 ({2}){0}",
            System.Environment.NewLine,
            m.Groups["AmPm1"].Value,
            ( ( m.Groups["AmPm2"].Value == "a" ) ? "AM" : "PM" ));
    }
 
 
 
Posted by OmegaMan at July 23, 2007

Category: Regular Expressions

The regex match evaluator gives one the ability to do a post-processing step for each match found. It is a handy way to normalize the match before sending it on. In that step one can easily change or alter the match when needed. It also allows us to eat the match and have it return nothing!

Here is an example which I had from the boards. The user wanted to use regex replace to remove all alphabetic characters but keep all numbers and a decimal point. But he had situations where there were two decimal points; in that situation, only the first one should be returned.

12abc.def34 becomes 12.34

.56a.d78 becomes .5678

Here is the code to accomplish that

// Only worry about decimals and letters.
string pattern = @"
(?<Decimal>\.) |       # Check for decimal
(?<Letter>[A-Za-z]+)    # Check for letter
";

string data = "0.ab.c1d.23"; // We want 0.123

int decimalPointCount = 0;

// Here is the Match Evaluator for Post Processing.
// We will eat the letters and return the decimals.
// The match evaluator will feed us every match found as it finds it.
MatchEvaluator CatchMultipleDecimals = delegate( Match m )
{
// Check for a decimal match only and return only the first one found
if (string.IsNullOrEmpty(m.Groups["Decimal"].Value) == false)
{
   if (decimalPointCount++ > 0) // We have gone over…return nothing!
      return string.Empty;
   else
      return m.Groups["Decimal"].Value; // Return the .
}

// We are only capturing text...so return nothing
// on any other match we may get.
   return string.Empty;

};

MatchEvaluator myEvaluator = new MatchEvaluator(CatchMultipleDecimals);

// Remember the IgnorePatternWhitespace option only applies to how the regex
// parser processes the pattern and not the data. Since I have created my
// pattern split over multiple lines for readability I need to tell
// the regex parser to strip all whitespace from my pattern *before* it
// looks at any data. It has nothing to do with how the data is matched or processed.
Console.WriteLine(Regex.Replace(data, pattern, myEvaluator, RegexOptions.IgnorePatternWhitespace));

// Outputs 0.123 from the text 0.ab.c1d.23

Explanation

  • Line 2 : The pattern will only capture a decimal point or letter(s).
  • Line 14 : Here is the delegate that has the code which will be called whenever a match occurs.
  • Line 17 : Since we have placed the items into groups, we check the Decimal group for any data. If data exists, we return a decimal point only the first time; otherwise we return string.Empty.
  • Line 21 : We could check for the Letter group, but that is not needed. Since the decimal is handled already, we eat whatever we have at this point and return string.Empty so that the match is removed from the result.
  • Line 27 : We use regex replace to return the numbers with only one decimal point, thanks to the Match Evaluator we have created and used.
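
The same idea can be wrapped in a method so that the decimal counter is local and resets on every call. This is a sketch under that assumption; FirstDecimalSketch and KeepFirstDecimal are hypothetical names, and it uses a C# 3 lambda in place of the anonymous delegate.

```csharp
using System.Text.RegularExpressions;

public static class FirstDecimalSketch
{
    public static string KeepFirstDecimal(string data)
    {
        // Local counter: each call starts counting from zero.
        int decimalPointCount = 0;

        return Regex.Replace(data, @"(?<Decimal>\.)|(?<Letter>[A-Za-z]+)", m =>
        {
            // Keep only the first decimal point that is matched.
            if (m.Groups["Decimal"].Success && decimalPointCount++ == 0)
                return ".";

            // Later decimal points and all letters are eaten.
            return string.Empty;
        });
    }
}
```

Digits are never matched by the pattern, so they pass through untouched; only the letters and the surplus decimal points are replaced with nothing.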

Posted by OmegaMan at June 20, 2007

Category: .Net, Regular Expressions

This is a process to parse CSV data and place it into a dictionary for storage in C# in .Net 2 and above. The dictionary will have a key of the actual line number, zero based, and a list of the data items with the commas and quotes removed. The regex pattern can handle data that looks like this

“xxx” or “xxx,xxx” or ‘xxx’ or ‘xxx,xxx’ or xxx

It will then be placed into a dictionary where each entry is an actual row. The regex below is designed to do these things:

  1. Each match represents one data line or row.
  2. Each of the data items is inserted into the Column capture to keep the data consistent with the current match.
  3. It handles both the single and the double quote.
  4. The pattern can handle a comma within the quotes.
  5. The pattern uses an if conditional; see my blog entitled Regular Expressions and the If Conditional.
  6. The named capture group Column holds the data.

Regex rx = new Regex(
@"((?([\x27\x22])         # Regex If: is the next character a single/double quote?
   (?:[\x27\x22])         # \x27 and \x22 are the single and double quotes
   (?<Column>[^\x27\x22]*)# Match this in the quotes
   (?:[\x27\x22])         # Closing quote
|
   (?<Column>[^,\r\n]*))  # Else not within quotes
(?:,?))+                  # Either a comma or EOL
(?:$|[\r\n]{0,2})         # Handle EOL or EOB",
                 RegexOptions.IgnorePatternWhitespace);

Dictionary<int, List<string>> data
    = new Dictionary<int, List<string>>();

string text =
@"'1','01000000043','2','4',20061102
'2',333,444,'555'";

int lineNumber = 0;

foreach (Match m in rx.Matches(text))
    if (m.Success)
    {
        List<string> line = new List<string>();

        // Every data item of the row is a capture of the Column group.
        foreach (Capture cp in m.Groups["Column"].Captures)
            if (string.IsNullOrEmpty(cp.Value) == false)
                line.Add(cp.Value);

        // Skip the empty match found at the end of the text.
        if (line.Count > 0)
            data.Add(lineNumber++, line);
    }

foreach (KeyValuePair<int, List<string>> kvp in data)
    Console.WriteLine("Line {0} : {1}",
        kvp.Key.ToString(),
        string.Join(" ", kvp.Value.ToArray()));

Console Output

Line 0 : 1 01000000043 2 4 20061102
Line 1 : 2 333 444 555

Posted by OmegaMan at June 16, 2007

Category: Regular Expressions

(Updated 3/26/2011 This is geared towards .Net Regex v1-4.0)

As one passes from beginner to intermediate in regular expression parsing, the if conditional (? () () | () ) really opens up what one can do with regular expressions. This is a short article on how to use the if conditional and why it is valuable.

Anecdotal Evidence…Kind of

On MSDN, where I am a moderator of the .Net Regular Expression Forum, I responded to a forum post in which the user asked a question concerning string parsing. He had a sentence and was using the basic string functionality to parse it, but wanted to use a regular expression instead to break out tokens. My suggestion was to use regex on the whole sentence, dividing what was needed into regex named capture groups. To do this, the regex pattern has to work on two levels: one to tokenize the sentence and two to process certain items. The if conditional was on its way to being used.

IF Usage Plan of Attack

That scenario cried out for the if conditional because the user wanted to split out numbers from the other words in the sentence and then subsequently tokenize out numbers of different sizes. The sentence looked like this:

This is a test 100 9999 22

His goal was to break out the numbers, save those of three digits or fewer, and ignore the bigger numbers and the words. So in my mind, to do this in regex, this must be done:

  1. Break the sentence up in word boundaries.
  2. If it is a word place it into the words category.
  3. If it is a number place it into the numbers category.
  4. For a number, check to see if it is a small number or a too big number.

See the If’s above…that is an indicator that a regex If is needed as well.

Regex If Overview

Here is how the if conditional works:

(? (Match Conditional Check no capture) (If true then match/capture this) | (else match/capture this when conditional match is false) )

Above, I have mixed the actual regex syntax (parentheses, question mark and pipe) with textual descriptions of each part. What it says is:

If the first match check (the condition) matches, do the next match, which is the true branch.

When it does not match, skip the true branch and do the match that follows the pipe, which is the false branch.

The pipe is of course unix speak for or, and it works the same way in .Net regex.

So here is my regex, with IgnorePatternWhitespace turned on to allow the comments (# …):

(?(\b\d+\b)            # If it is a number
  ()                   # Removed for now.
|                      # Or from the first if
  (?<Word>[^\s.!?]*)   # It is a word
)

  • Line 1 : I have declared the if conditional and have created a match to look for a number.
  • Line 2 : This is blank ( … ) for now, but when the conditional is true, this is the match target which will be captured. I show the actual match capture below.
  • Line 3 : This tells the regex that we have an else condition to do if the if condition match fails.
  • Line 4 : When line 1’s if conditional fails, processing goes to here. At this point the regex parser will do a true match and place the word into the Word group, capturing words like This and is.
  • Line 5 : Ending syntactic grouping for the if conditional.

Back to The User’s Request

So with the above example we have created a framework for routing either a digit word or an alpha word from the example text.

Just to clarify again, remember the if’s conditional match (the check) is not kept! That is a good thing. That gives us the freedom to

  1. Recreate the if conditional if needed, so we can match it.
  2. Or, more importantly, create a new match condition which frees us from the if conditional’s match.

What that means is that we can do a basic find in the if, but a more robust match for the actual capture. That is what we will do to fill in line 2 above: we will find/match in the if, but create a more robust match/capture below it for the true conditional.

We will now work on the submatch for numbers. The rule provided by the original user was that a number either fits or it is too big…again, does this sound like another if conditional?

Yes it does.

Here is the full match/capture for the missing line 2 above. It contains another if conditional to split out the numbers, since we don’t have to adhere to the if match/find:

(?(\b\d+\b)            # If it is a number (The match / Find )
   (?(\b\d{1,3}\b)      # Then If it is 1-3 digits  ->A new if in our original Match / Capture for True
    (?<Number>\b\d+\b) # Capture to the Number named group
    |
     (?<TooBig>[^\s]*) # Too big of a number.
    )
  |                    # Or from the first if
   (?<Word>[^\s.!?]*)  # It is a word
 )
 (?:\s?|$|[.?!])       # Match but don't capture the nonwords which will not be presented in the final output (this is not a part of the if)

  • Line 2 : Now that we know it’s a number, let us check the size in a new if conditional. That condition is 1-3 digits in size.
  • Line 3 : If line 2’s conditional is true, we actively match and place the number into the Number named match group.
  • Line 4 : This is the or that goes with our sub if from line 2 above.
  • Line 5 : Line 2’s conditional failed, so we have a number that is too big. Place it into the named capture group TooBig.
  • Line 8 : This was explained above and has not changed, nor is it associated with the new processing above.
  • Line 10 : This is outside all of the conditionals and simply picks up the next whitespace after the word or number. I have now added it to complete the actual regex to use on the sentence.
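
To try the finished pattern against the example sentence, here is a minimal sketch. ConditionalSketch and Classify are hypothetical names, and the output format (group name, colon, value) is my own choice for illustration. The pattern is the one built up above with shortened comments.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConditionalSketch
{
    private const string Pattern = @"
(?(\b\d+\b)              # If it is a number
   (?(\b\d{1,3}\b)       # Then, if it is 1-3 digits
      (?<Number>\b\d+\b) # capture it in Number
    |
      (?<TooBig>[^\s]*)  # otherwise it is too big
   )
 |
   (?<Word>[^\s.!?]*)    # Else it is a word
)
(?:\s?|$|[.?!])          # Trailing separator, matched but not captured
";

    // Returns entries such as "Word:This" or "Number:100".
    // Groups that captured nothing, including the empty match
    // at the very end of the input, are skipped.
    public static List<string> Classify(string sentence)
    {
        var result = new List<string>();
        Regex rgx = new Regex(Pattern, RegexOptions.IgnorePatternWhitespace);

        foreach (Match m in rgx.Matches(sentence))
        {
            foreach (string name in new[] { "Number", "TooBig", "Word" })
            {
                Group g = m.Groups[name];
                if (g.Success && g.Length > 0)
                    result.Add(name + ":" + g.Value);
            }
        }

        return result;
    }
}
```

For the sentence This is a test 100 9999 22, the four words land in Word, 100 and 22 land in Number, and 9999 lands in TooBig.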

Summary

So one can see how the if conditional in regexes can break down large issues. Here are some thoughts on if conditionals:

  1. Always use the IgnorePatternWhitespace option so you can visually see the operations unfold and comment your regular expression.
  2. Remember the if match condition does not have to be repeated. One can then do things outside of the conditional, so you are not limited to its conditional match.
  3. Also, one does not have to have the else target, which might come into play in certain circumstances.

Posted by OmegaMan at May 25, 2007

Category: Regular Expressions

(Updated 3/26/2011 works with .Net 1-4)
I recently answered a post by a user who did not understand why one pattern returned a match but a second one did not. What this person did not understand was that the second pattern did succeed, but the match he was looking for was actually five matches away from where he expected it!
That was due to the fact that the regex parser was faithfully reporting the null matches. In this post I will discuss why the null comes up in .Net regular expression parsing and explain my suggestion to Microsoft to add a regular expression flag such as IgnoreNullMatchesAndCaptures.

Example of Null Matches

The original issue can be boiled down to this: he had a regex pattern of

(b*)

Now the * in the pattern says zero or more instances, and that is the crux of the issue. When that pattern is run against data such as

ab

there are three matches. If one looks at the matches returned (as described in my blog .Net Regex Capture Groups), this is what is shown. Note that 0 and 1 under each match are the group (match capture) indexes:

Match (1):
0 :
1 :
Match (2):
0 : b
1 : b
Match (3):
0 :
1 :

Group index zero of a match is always the whole match.

This is what the parser is doing: for the first match, the parser is looking at the a. Because a is not b, but the * in the pattern specifies zero or more b’s; bingo, there is a null match. The same is true for match 3, which is an empty match at the very end of the input, after the b. The second match works because b is b…no explanation needed.

Now I know the proper way to get around this is to change the pattern to (b+), and then no null matches are returned. But I present this problem in the scope of a larger regex pattern where one may have grouping issues which capture a null inadvertently. Also, when doing a match across \r\n boundaries, sometimes one gets a null or the end of buffer null…That is where I think the suggestion below would be useful.
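
Until such a flag exists, the null matches can be filtered by hand. A minimal sketch, with hypothetical names (NullMatchSketch, MatchValues), that lists the match values for a pattern and can skip the zero-length ones:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NullMatchSketch
{
    // Returns every match value, optionally skipping the zero-length ones.
    public static List<string> MatchValues(string input, string pattern, bool skipEmpty)
    {
        var values = new List<string>();

        foreach (Match m in Regex.Matches(input, pattern))
        {
            if (skipEmpty && m.Length == 0)
                continue;

            values.Add(m.Value);
        }

        return values;
    }
}
```

For the pattern (b*) against ab, the unfiltered call returns three values, two of them empty, while the filtered call returns only the b.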

Suggestion Made to Microsoft

I suggested to Microsoft, in a connect issue entitled Regular Expression (Regex) Improvements – Null Value Ignore, to create a flag on the parser that would ignore all nulls. If there was nothing in the match or captures, then the match would not even be presented. In the above example, if my suggestion was applied, only one match would be returned. I have never used the null in regex parsing…maybe a reader could enlighten me…but for the most part, if there are nulls, don’t report them. Your thoughts?
If you find it compelling, go to the connect issue and vote on it! Thanks.