Posts Tagged Regular Expressions

Regex Split Pitfalls Over Multiple Lines

This article covers how to use the Regex.Split function properly and the pitfalls one may run into when using it. It is based on .Net 3.5 using C#, but it applies to any version of .Net and carries over to other languages.

Overview

Regex.Split is a great tool which allows one to do more than the simple string.Split, but it has some serious pitfalls for the uninitiated. Let us first review how it works in code. I term this the Kool-Aid example, for it looks very much like string.Split and conveys that it is easy to use.

foreach (string str in Regex.Split("Linq-to-SQL", "-"))
    Console.WriteLine(str);

/* Writes

Linq
to
SQL

*/

Pretty obvious, and it appears to work like string.Split: we are splitting on the dash and it works. But one might as well use string.Split for the easy examples, for in real life one doesn’t use Regex.Split on basic patterns.
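Where Regex.Split starts to earn its keep is when the delimiter is not a fixed string. Here is a minimal sketch, assuming a made-up input (not from the forum post) where fields are separated by a comma or a semicolon with optional whitespace around it. The class and method names are only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitOnVariableDelimiter
{
    // Splits on a comma or semicolon and swallows any whitespace around it.
    public static string[] SplitFields(string text)
    {
        return Regex.Split(text, @"\s*[,;]\s*");
    }

    static void Main()
    {
        // Illustrative input only.
        foreach (string field in SplitFields("Linq , SQL;XML ;  Entities"))
            Console.WriteLine("[" + field + "]");

        // Prints [Linq], [SQL], [XML] and [Entities], one per line.
    }
}
```

Note that the whitespace is part of the pattern here, which is exactly the habit the rest of this article argues for.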

Pitfalls

Since a regex pattern is used to match specific text, one believes that because a pattern picks up valid matches, it can be transferred into Regex.Split as is. Oh no. One has to be extra vigilant with the pattern, because one probably wants to split on a particular item but forgets that it is surrounded by, and may contain, whitespace and line feeds.

For example, since people work with text and not esoteric numeric examples, say you wanted to remove certain lines from a block of text. Say this is the originating text:

Cinthia Blake
Olso Norway
You finished in 1st place.
Fred Alter
Chicago USA
You finished in 4th place.

The goal is to remove the finished-place lines (the ones beginning with “You finished”). The natural thing to do is to create a match for the line. A pattern such as

Y[^\.]+\.

will find text that starts with a Y and match/consume until it finds a period. Run that and the data through a regex pattern testing tool and it shows that we get this result:

You finished in 1st place.
You finished in 4th place.

Great! So one loads it into code and runs it such as:

string input = @"Cinthia Blake
Olso Norway
You finished in 1st place.
Fred Alter
Chicago USA
You finished in 4th place.";

string pattern = @"Y[^\.]+\.";

foreach (string str in Regex.Split(input, pattern))
    Console.WriteLine( str );

/* Outputs:

Cinthia Blake[CR][LF]
Olso Norway[CR][LF]
[CR][LF]
[CR][LF]
Fred Alter[CR][LF]
Chicago USA[CR][LF]
[CR][LF]

*/

I have shown the \r\n’s as [CR][LF] to make the problem visible. Instead of a clean list of name, location, name, location, we now have name, location, blank line, blank line, name, location, blank line. Whoa! Where did those lines come from?

The user thinks that Regex.Split is not working; it’s returning extra lines...and gives up!

No. The problem is that the whitespace was not accounted for when making the pattern for the split. Yes, it matched the lines and dutifully split on them, but it left the surrounding line breaks behind in the pieces, almost as an afterthought. Frustrating to the user.
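One way to see exactly what the split hands back is to print each array element with its line breaks made visible. This is a small sketch; the Visible helper is a hypothetical name, and the input is written with explicit \r\n so the result does not depend on how the source file was saved.

```csharp
using System;
using System.Text.RegularExpressions;

static class ShowSplitPieces
{
    // Replaces carriage returns and line feeds with visible markers.
    public static string Visible(string s)
    {
        return s.Replace("\r", "[CR]").Replace("\n", "[LF]");
    }

    // Splits with the naive pattern that ignores surrounding line breaks.
    public static string[] NaiveSplit(string input)
    {
        return Regex.Split(input, @"Y[^\.]+\.");
    }

    static void Main()
    {
        string input = "Cinthia Blake\r\nOlso Norway\r\nYou finished in 1st place.\r\n"
                     + "Fred Alter\r\nChicago USA\r\nYou finished in 4th place.";

        string[] pieces = NaiveSplit(input);

        for (int i = 0; i < pieces.Length; i++)
            Console.WriteLine(i + ": \"" + Visible(pieces[i]) + "\"");

        // 0: "Cinthia Blake[CR][LF]Olso Norway[CR][LF]"
        // 1: "[CR][LF]Fred Alter[CR][LF]Chicago USA[CR][LF]"
        // 2: ""
    }
}
```

The line breaks on either side of each removed sentence stay attached to the neighbouring pieces, and the match at the very end of the input produces a trailing empty string.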

Conclusion

You have to be extra vigilant about using Regex.Split and its pattern. Be painfully aware of whitespace. Here is a pattern to handle the whitespace and achieve the result intended:

string input = @"Cinthia Blake
Olso Norway
You finished in 1st place.
Fred Alter
Chicago USA
You finished in 4th place.";

string pattern =
@"[\r\n]{0,2}        # If there is Line feeds before then match it.
Y[^\.]+\.            # original pattern
\s*                  # Maybe there are spaces after the sentence...get those
[\r\n]{0,2}          # *If* there are line feeds after, then match them.
";

// IgnorePatternWhitespace allows us to comment the code, it does
// NOT apply to the processing of spaces within the input text.

foreach (string str in Regex.Split(input, pattern, RegexOptions.IgnorePatternWhitespace))
    Console.WriteLine( str );

/* Outputs:

Cinthia Blake[CR][LF]
Olso Norway
Fred Alter[CR][LF]
Chicago USA

*/

Now we have what we expected, two groups split on the appropriate sentences. I have shown where there is still whitespace in the result comment.

But long story short, be extra vigilant about whitespace in real-world Regex.Split work. Just matching on something will not fit the bill.
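Even with the whitespace handled, a match that sits at the very end of the input still leaves an empty string as the last element of the array. If that is not wanted, one option is to filter the pieces afterwards. This is a sketch only: SplitNonEmpty is a hypothetical helper name, and the pattern is the same one as above written on a single line.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SplitAndDropEmpties
{
    // Splits the input and discards zero-length pieces.
    public static string[] SplitNonEmpty(string input, string pattern)
    {
        return Regex.Split(input, pattern)
                    .Where(piece => piece.Length > 0)
                    .ToArray();
    }

    static void Main()
    {
        string input = "Cinthia Blake\r\nOlso Norway\r\nYou finished in 1st place.\r\n"
                     + "Fred Alter\r\nChicago USA\r\nYou finished in 4th place.";

        // The whitespace-aware pattern, without comments, on one line.
        string pattern = @"[\r\n]{0,2}Y[^\.]+\.\s*[\r\n]{0,2}";

        Console.WriteLine(Regex.Split(input, pattern).Length);   // 3, the last piece is ""
        Console.WriteLine(SplitNonEmpty(input, pattern).Length); // 2
    }
}
```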


.Net Regex IgnorePatternWhitespace Only Applies to the Regex Processor

The option IgnorePatternWhitespace only applies to how the regex processor reads the pattern, not to how it handles the input text. The option allows one to document a pattern by placing items on different lines and adding comments after the # comment character. Using the option I can document my regex as follows:


string pattern = @"
^                # Beginning of The Line
(?<Text>[^\d]+)  # Move all non numbers into the Text group.
(?<Number>\d+)   # Get all numbers into the Number Capture group.
";

Otherwise without the option the pattern must be written like this:


string pattern = @"^(?<Text>[^\d]+)(?<Number>\d+)";

Hence the readability for patterns is greatly increased.
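To make the point concrete, here is a short sketch showing that the option only strips whitespace from the pattern. The input strings and the helper name are made up for illustration. Unescaped spaces in the pattern are discarded, so a literal space in the input has to be matched explicitly, for example with \x20 or \s.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternWhitespaceDemo
{
    // Runs IsMatch with IgnorePatternWhitespace switched on.
    public static bool IsMatchIgnoringPatternSpace(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern, RegexOptions.IgnorePatternWhitespace);
    }

    static void Main()
    {
        // The space in the pattern is discarded, so it behaves like "helloworld".
        Console.WriteLine(IsMatchIgnoringPatternSpace("hello world", @"hello world")); // False
        Console.WriteLine(IsMatchIgnoringPatternSpace("helloworld", @"hello world"));  // True

        // The space in the input still counts; match it explicitly.
        Console.WriteLine(IsMatchIgnoringPatternSpace("hello world", @"hello \x20 world")); // True
        Console.WriteLine(IsMatchIgnoringPatternSpace("hello world", @"hello \s world"));   // True
    }
}
```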


C# Regular Expression Suggestions when working with C#

Here are some suggestions which can make working with regular expression patterns in C# much easier, both in reading and in processing.

Suggestion 1 – Avoid C# Text Escape Pollution

When working in C# it can become confusing when one has to deal with string literals and escapes even before dealing with the regular expression escapes. For example if we have an escape such as word boundary in regex (\b) we have to escape the escape in C# such as

string pattern = "\\b";

That gets confusing because we don’t want to have to deal with C# escapes; we are working in regex, and \\b does not look like what we mean (though it gets sent to the parser appropriately). What we should do is use the C# verbatim string literal convention (@) in front of the string, such as

string pattern = @"\b";

The two shown C# lines are functionally equivalent…but now we can concentrate on the regex pattern with no pollution from C# escapes.
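A quick sketch to confirm the equivalence, and to show what happens when the escape is forgotten altogether. In a regular C# string literal "\b" is a single backspace character, so the regex parser never sees a word boundary. The sample text is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class VerbatimVersusEscaped
{
    public static readonly string Escaped = "\\b";   // regular literal, escape doubled
    public static readonly string Verbatim = @"\b";  // verbatim literal, written as regex

    static void Main()
    {
        Console.WriteLine(Escaped == Verbatim); // True
        Console.WriteLine(Verbatim.Length);     // 2 (a backslash and a b)
        Console.WriteLine("\b".Length);         // 1 (a single backspace character)

        // The word boundary works as intended with the verbatim literal...
        Console.WriteLine(Regex.IsMatch("a cat sat", @"\bcat\b")); // True

        // ...but the unescaped regular literal looks for backspace characters.
        Console.WriteLine(Regex.IsMatch("a cat sat", "\bcat\b"));  // False
    }
}
```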

Suggestion 2 – Use Regex ASCII Escapes for quotes

Some people will go out of their way to use doubled double quotes (@"""") in C# to search for a double quote in a regex pattern. This is confusing; try using the regex ASCII escape \x22 instead. Below is a code sample showing the two equivalent patterns:

string pattern = @""""; // I am only searching for a quote
 
// VS
 
string pattern2 = @"\x22"; // Much better
 

Note: if you are using Expresso as your regex editor, it provides a handy way of finding those escapes.
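As a sanity check, the two quote patterns can be run against the same text to confirm they find the same thing. This is a minimal sketch with a made-up sample sentence; CountMatches is a hypothetical helper name.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuotePatterns
{
    // Counts how many times the pattern matches in the text.
    public static int CountMatches(string text, string pattern)
    {
        return Regex.Matches(text, pattern).Count;
    }

    static void Main()
    {
        string text = "She said \"hi\" and left."; // contains two double quotes

        Console.WriteLine(CountMatches(text, @""""));   // 2
        Console.WriteLine(CountMatches(text, @"\x22")); // 2
    }
}
```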

Suggestion 3 – Use the IgnorePatternWhitespace option

This option confuses beginning regex-ers because they think it applies to what the pattern does, when in fact it is solely a preprocessing instruction for the regex parser! It allows you to put whitespace in a pattern and spread it over several lines for easier reading. Here is a sample I created for a forum post where I was able to break out a long pattern, thereby commenting it and making it easier to read. Without the IgnorePatternWhitespace option, one would have to remove the comments and make it all one line:

string text =
@"5:16:04.859 PM:  07:18:12p  2.33   0.45   NH4                      9558    WORK
5:16:06.000 PM:  07:18:13p  2.29   0.31   RIN                     10554    WORK
5:16:07.625 PM:  07:18:15p  2.33   0.44   NH4                      9645    WORK
5:16:09.125 PM:  07:18:16p  2.29   0.32   RIN                     10400    WORK";

 
string pattern =
@"^(?<Time1>[^\s]*)  # Start of line, capture first time and place into Time1
   (?:\s*)           # Match but don't capture (MBDC) the space (Used as an anchor)
   (?<AmPm1>[AP]M)   # Get the AM | PM and put it into AmPm1 capture group.
   (?:\:\s*)         # MBDC : and space
   (?<Time2>[^ap]*)  # Time 2 Capture
   (?<AmPm2>[ap])    # AmPm capture
   (?:\s*)
   (?<Col1>[^\s]*)   # Data column 1
   (?:\s*)
   (?<Col2>[^\s]*)   # Data column 2
   (?:\s*)
   (?<Col3>[^\s]*)   # Data column 3
   (?:\s*)
   (?<Col4>[^\s]*)   # Data column 4
   (?:\s*)
   (?<Col5>[^\s]*)   # Data column 5";

 
Regex rgx = new Regex(pattern,
                  RegexOptions.Multiline | // ^ and $ match Beginning and EOL.
                  RegexOptions.IgnorePatternWhitespace); // Allows us to do the comments.
 
 
string[] groupNames = rgx.GetGroupNames();
 
Console.WriteLine("Groups: ({0}){1}", string.Join(") (", groupNames), System.Environment.NewLine);
 
MatchCollection mc = rgx.Matches(text);
 
foreach (Match m in mc)
    if (m.Success)
    {
        Console.WriteLine("Match:");
        foreach (string name in groupNames)
            Console.WriteLine("{0,10} : {1}", name, m.Groups[name]);
 
        Console.WriteLine("{0}Time1 ({1}) Time2 ({2}){0}",
            System.Environment.NewLine,
            m.Groups["AmPm1"].Value,
            ( ( m.Groups["AmPm2"].Value == "a" ) ? "AM" : "PM" ));
    }
 
 
 

.Net Regex MatchEvaluator

The regex MatchEvaluator gives one the ability to post-process each match as it is found. It is a handy way to normalize a match before sending it on, and in that step one can easily change or alter the match when needed. It also allows us to eat the match and have it return nothing!

Here is an example which I had from the boards. The user wanted to use Regex.Replace to remove all alphabetic characters but keep all numbers and a decimal point. But he had situations where there were two decimal points, and in that situation only the first one should be returned.

12abc.def34 becomes 12.34

.56a.d78 becomes .5678

Here is the code to accomplish that

// Only worry about decimals and letters.
string pattern = @"
(?<Decimal>\.) |       # Check for decimal
(?<Letter>[A-Za-z]+)    # Check for letter
";

string data = "0.ab.c1d.23"; // We want 0.123

int decimalPointCount = 0;

// Here is the Match Evaluator for Post Processing.
// We will eat the letters and return the decimals.
// The match evaluator will feed us every match found as it finds it.
MatchEvaluator CatchMultipleDecimals = delegate( Match m )
{
    // Check for a decimal match only and return only the first one found
    if (string.IsNullOrEmpty(m.Groups["Decimal"].Value) == false)
    {
        if (decimalPointCount++ > 0) // We have gone over...return nothing!
            return string.Empty;
        else
            return m.Groups["Decimal"].Value; // Return the .
    }

    // We are only capturing text...so return nothing
    // on any other match we may get.
    return string.Empty;
};

MatchEvaluator myEvaluator = new MatchEvaluator(CatchMultipleDecimals);

// Remember the IgnorePatternWhitespace option only applies to how the regex
// parser processes the pattern and not the data. Since I have created my
// pattern split over multiple lines for readability I need to tell
// the regex parser to strip all whitespace from my pattern *before* it
// looks at any data. It has nothing to do with how the data is matched or processed.
Console.WriteLine(Regex.Replace(data, pattern, myEvaluator, RegexOptions.IgnorePatternWhitespace));

// Outputs 0.123 from the text 0.ab.c1d.23

Explanation

  • The pattern : It will only capture a decimal point or letter(s).
  • The CatchMultipleDecimals delegate : Here is the code which will be called whenever a match occurs.
  • The Decimal group check : Since we have placed the items into groups, we check the Decimal group for any data. If data exists, we return a decimal point only once; otherwise we return string.Empty.
  • The final return : We could check for the Letter group, but that is not needed. Since the decimal is handled already, we eat whatever is matched at this point and return string.Empty so that it is removed from the result.
  • The Regex.Replace call : We use Regex.Replace to return the numbers with only one decimal point, thanks to the MatchEvaluator we have created and used.
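The same idea can be wrapped in a method with the evaluator written as a lambda, so that the "have we seen a decimal yet" state is reset on every call rather than living in an outer counter. This is a sketch of one possible arrangement, not the code from the forum answer; KeepFirstDecimal is a hypothetical name, and the two inputs are the examples given earlier.

```csharp
using System;
using System.Text.RegularExpressions;

static class FirstDecimalOnly
{
    // Removes letters and all decimal points except the first one.
    public static string KeepFirstDecimal(string data)
    {
        bool seenDecimal = false; // fresh for every call

        return Regex.Replace(data, @"(?<Decimal>\.)|(?<Letter>[A-Za-z]+)",
            m =>
            {
                if (m.Groups["Decimal"].Success && !seenDecimal)
                {
                    seenDecimal = true;
                    return ".";      // keep the first decimal point
                }
                return string.Empty; // eat letters and any later decimal points
            });
    }

    static void Main()
    {
        Console.WriteLine(KeepFirstDecimal("12abc.def34")); // 12.34
        Console.WriteLine(KeepFirstDecimal(".56a.d78"));    // .5678
        Console.WriteLine(KeepFirstDecimal("0.ab.c1d.23")); // 0.123
    }
}
```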

Regular Expression and the Ubiquitous Null

(Updated 3/26/2011 works with .Net 1-4)
I recently answered a post by a user who did not understand why one pattern returned a match but a second one did not. What this person did not understand was that the second pattern did succeed, but the match he was looking for was actually five matches in from where he expected it!
That was due to the fact that the regex parser was faithfully reporting the null matches. In this post I will discuss why the null comes up in .Net regular expression parsing and explain a suggestion to Microsoft to have a regular expression flag such as IgnoreNullMatchesAndCaptures.

Example of Null Matches

The original issue can be boiled down to this: he had a regex pattern of

(b*)

Now the * in the pattern says zero or more instances, and that is the crux of the issue. When that pattern is brought against data such as

ab

there are three matches. If one looks at the matches returned (as defined in my blog .Net Regex Capture Groups), this is what is shown. Note that 0 and 1 under each match are the group (match capture) indexes:

Match (1):
0 :
1 :
Match (2):
0 : b
1 : b
Match (3):
0 :
1 :

Group index zero of a match is always the whole match.

This is what the parser is doing: for the first match, the parser is looking at the a. Because a is not b, but the * on the b specifies zero or more, bingo, there is a null match. The same is true for match 3, which is an empty match at the very end of the input, after the b. The second match works because b is b...no explanation needed.

Now I know the proper way to get around this is to change it to (b+), and there will not be a return of null matches. But I present this problem in the scope of a larger regex pattern where one may have grouping issues which capture a null inadvertently. Also, when doing a match across \r\n boundaries, one sometimes gets a null or the end-of-buffer null. That is where I think the suggestion below would be useful.
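Printing the index and length of each match makes the null matches easy to spot. This is a small sketch; the Describe helper is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

static class NullMatchDemo
{
    // Formats a match as its position, length and value.
    public static string Describe(Match m)
    {
        return "index " + m.Index + ", length " + m.Length + ", value \"" + m.Value + "\"";
    }

    static void Main()
    {
        foreach (Match m in Regex.Matches("ab", "(b*)"))
            Console.WriteLine(Describe(m));

        // index 0, length 0, value ""
        // index 1, length 1, value "b"
        // index 2, length 0, value ""

        // With + instead of *, only the real match is reported.
        Console.WriteLine(Regex.Matches("ab", "(b+)").Count); // 1
    }
}
```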

Suggestion Made to Microsoft

I suggested to Microsoft, in a Connect issue entitled Regular Expression (Regex) Improvements – Null Value Ignore, to create a flag on the parser that would ignore all nulls. If there was nothing in the match or captures, then the match would not even be presented. In the above example, if my suggestion were applied, only one match would be returned. I have never used the null in regex parsing (maybe a reader could enlighten me), but for the most part if there are nulls, don’t report them. Your thoughts?
If you find it compelling, go to the Connect issue and vote on it! Thanks.
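Until such a flag exists, one can approximate it in user code by filtering out zero-length matches. The helper below is purely hypothetical (the name NonEmptyMatches is mine, and it is not part of .Net), and it only drops matches whose whole value is empty; it does not touch empty captures inside an otherwise non-empty match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class RegexNullFilter
{
    // Hypothetical stand-in for the suggested flag: skips zero-length matches.
    public static List<Match> NonEmptyMatches(Regex rgx, string input)
    {
        return rgx.Matches(input)
                  .Cast<Match>()
                  .Where(m => m.Length > 0)
                  .ToList();
    }

    static void Main()
    {
        Regex rgx = new Regex("(b*)");

        Console.WriteLine(rgx.Matches("ab").Count);          // 3
        Console.WriteLine(NonEmptyMatches(rgx, "ab").Count); // 1
    }
}
```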