This article covers how to use the Regex.Split function properly and the pitfalls one may run into when using it. It is based on .NET 3.5 and C#, but the points apply to any version of .NET and any .NET language.

Overview

Regex.Split is a great tool that lets one do more than the simple string.Split, but it has some serious pitfalls for the uninitiated. Let us first review how it works in code. I call this the Kool-Aid example because it looks so much like string.Split that it suggests Regex.Split is just as easy to use.

foreach (string str in Regex.Split("Linq-to-SQL", "-"))
    Console.WriteLine(str);

/* Writes

Linq
to
SQL

*/

Pretty obvious, and it appears to work like string.Split: we split on the dash and get the three words. But one might as well use string.Split for such easy cases; in real life one reaches for Regex.Split only when the delimiter is a pattern rather than a fixed string.
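To make that concrete, here is a minimal sketch of a case where a pattern earns its keep: fields separated by a comma or a semicolon, with an unpredictable amount of whitespace around the delimiter. The class name, the method name SplitFields, and the sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Hypothetical helper: splits on a comma or semicolon and swallows
    // any whitespace on either side of the delimiter.
    public static string[] SplitFields(string text)
    {
        return Regex.Split(text, @"\s*[,;]\s*");
    }

    public static void Main()
    {
        foreach (string field in SplitFields("Linq ,to;   SQL"))
            Console.WriteLine("[" + field + "]");
    }
}

/* Writes

[Linq]
[to]
[SQL]

*/
```

The brackets are only there to prove that no stray spaces survive in the pieces.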

Pitfalls

Because a regex pattern is used to match specific text, one assumes that a pattern which picks up valid matches can be transferred into Regex.Split as is. Oh no. One has to be extra vigilant with the pattern, because one usually wants to split on a particular item but forgets that it is surrounded by, or contains, whitespace and line feeds.

For example, since people work with text and not esoteric numeric examples, say you want to remove certain lines from a block of text. Say this is the original text:

Cinthia Blake
Olso Norway
You finished in 1st place.
Fred Alter
Chicago USA
You finished in 4th place.

The goal is to remove the lines that report the finishing place (the ones that start with "You finished"). The natural thing to do is to create a pattern that matches such a line. A pattern such as

Y[^\.]+\.

will find text that starts with a Y and match/consume until it finds a period. Running the pattern and the data through a regex testing tool gives this result:

You finished in 1st place.
You finished in 4th place.
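If no testing tool is at hand, the same check can be sketched in code with Regex.Matches. The class and method names below are hypothetical, and the input uses explicit \r\n sequences on the assumption that the text has Windows line endings.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MatchDemo
{
    // Hypothetical helper: returns the text of every match of the
    // sentence pattern, without splitting anything.
    public static string[] FindSentences(string input)
    {
        return Regex.Matches(input, @"Y[^\.]+\.")
                    .Cast<Match>()          // MatchCollection is non-generic in .NET 3.5
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string input = "Cinthia Blake\r\nOlso Norway\r\nYou finished in 1st place.\r\n" +
                       "Fred Alter\r\nChicago USA\r\nYou finished in 4th place.";

        foreach (string sentence in FindSentences(input))
            Console.WriteLine(sentence);
    }
}

/* Writes

You finished in 1st place.
You finished in 4th place.

*/
```

Note that the matches themselves contain no line breaks, which is exactly why the pattern looks correct at this stage.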

Great! So one loads it into code and runs it, like this:

string input = @"Cinthia Blake
Olso Norway
You finished in 1st place.
Fred Alter
Chicago USA
You finished in 4th place.";

string pattern = @"Y[^\.]+\.";

foreach (string str in Regex.Split(input, pattern))
    Console.WriteLine( str );

/* Outputs:

Cinthia Blake[CR][LF]
Olso Norway[CR][LF]
[CR][LF]
[CR][LF]
Fred Alter[CR][LF]
Chicago USA[CR][LF]
[CR][LF]

*/

I have shown the \r\n line breaks as [CR][LF] to make the problem visible. Instead of a clean list of name, location, name, location, we now have name, location, blank line, blank line, name, location, blank line. Whoa! Where did those blank lines come from?

The user thinks that Regex.Split is not working because it's returning extra lines, and gives up!

No. The problem is that the whitespace was not accounted for when making the pattern for the split. Yes, it matched the lines and dutifully split on them, but it left the surrounding line breaks behind in the pieces, almost as an afterthought. Frustrating to the user.
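One way to see exactly what each piece contains is to print the pieces with the line breaks made visible. The following is a minimal sketch; the class name and the ShowLineBreaks helper are hypothetical, and the input is written with explicit \r\n so the result does not depend on how the source file stores its line endings.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Hypothetical helper: replaces carriage returns and line feeds
    // with visible markers so they show up in console output.
    public static string ShowLineBreaks(string text)
    {
        return text.Replace("\r", "[CR]").Replace("\n", "[LF]");
    }

    public static void Main()
    {
        string input = "Cinthia Blake\r\nOlso Norway\r\nYou finished in 1st place.\r\n" +
                       "Fred Alter\r\nChicago USA\r\nYou finished in 4th place.";

        string[] pieces = Regex.Split(input, @"Y[^\.]+\.");

        // Print each piece on one line, quoted, with its index.
        for (int i = 0; i < pieces.Length; i++)
            Console.WriteLine(i + ": \"" + ShowLineBreaks(pieces[i]) + "\"");
    }
}

/* Writes

0: "Cinthia Blake[CR][LF]Olso Norway[CR][LF]"
1: "[CR][LF]Fred Alter[CR][LF]Chicago USA[CR][LF]"
2: ""

*/
```

Seen this way, the leftover line breaks at the edges of each piece are obvious, as is the empty third piece that Regex.Split returns because the last match ends the input.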

Conclusion

You have to be extra vigilant about using Regex.Split and its pattern. Be painfully aware of whitespace. Here is a pattern to handle the whitespace and achieve the result intended:

string input = @"Cinthia Blake
Olso Norway
You finished in 1st place.
Fred Alter
Chicago USA
You finished in 4th place.";

string pattern =
@"[\r\n]{0,2}        # If there are line feeds before the sentence, match them.
Y[^\.]+\.            # Original pattern
\s*                  # Maybe there are spaces after the sentence; get those
[\r\n]{0,2}          # *If* there are line feeds after the sentence, match them.
";

// IgnorePatternWhitespace allows us to comment the pattern; it does
// NOT affect how whitespace in the input text is processed.

foreach (string str in Regex.Split(input, pattern, RegexOptions.IgnorePatternWhitespace))
    Console.WriteLine( str );

/* Outputs:

Cinthia Blake[CR][LF]
Olso Norway
Fred Alter[CR][LF]
Chicago USA

*/

Now we have what we expected: two groups, split on the appropriate sentences. The result comment shows where whitespace still remains inside each group.

But long story short, be extra vigilant about whitespace in real-world Regex.Split work. Just matching on something will not fit the bill.
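One more detail is worth checking. When a match sits at the very end of the input, as the second sentence does here, Regex.Split includes an empty string as the last element of the result. The sketch below drops such empty pieces with LINQ. The class name and the SplitWithoutEmpties helper are hypothetical, and the input again assumes \r\n line endings.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CleanSplitDemo
{
    // Hypothetical helper: splits with the whitespace-aware pattern and
    // discards any empty pieces produced by matches at the edges.
    public static string[] SplitWithoutEmpties(string input, string pattern)
    {
        return Regex.Split(input, pattern, RegexOptions.IgnorePatternWhitespace)
                    .Where(piece => piece.Length > 0)
                    .ToArray();
    }

    public static void Main()
    {
        string input = "Cinthia Blake\r\nOlso Norway\r\nYou finished in 1st place.\r\n" +
                       "Fred Alter\r\nChicago USA\r\nYou finished in 4th place.";

        // Same pattern as above, written on one line; the spaces are
        // ignored because of IgnorePatternWhitespace.
        string pattern = @"[\r\n]{0,2} Y[^\.]+\. \s* [\r\n]{0,2}";

        string[] groups = SplitWithoutEmpties(input, pattern);
        Console.WriteLine(groups.Length);   // 2
    }
}
```

Filtering after the split keeps the pattern itself unchanged, which seemed the least intrusive choice here.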
