I recently answered a post from a user who did not understand why one pattern returned a match but a second one did not. What he had missed was that the second pattern did succeed, but the match he was looking for was five matches further along than he expected! The regex parser was faithfully reporting the null (empty) matches ahead of it. In this post I will discuss why the null comes up in .Net regular expression parsing, and explain a suggestion I made to Microsoft to add a regular expression flag such as IgnoreNullMatchesAndCaptures.
Example of Null Matches
The original issue can be boiled down to this: he had a regex pattern of
(b*)
Now the * in the pattern says zero or more instances, and that is the crux of the issue. When that pattern is run against data such as
ab
the result is three matches. Looking at the matches returned (as defined in my blog post .Net Regex Capture Groups), this is what is shown; note that 0 and 1 under each match are the group indexes:
Match (1):
0 :
1 :
Match (2):
0 : b
1 : b
Match (3):
0 :
1 :
Index zero of the match is always the whole match. This is what the parser is doing: for the first match, the parser is looking at the a. The a is not a b, but because the * on the b specifies zero or more, zero b's is an acceptable result; bingo, there is a null match at that position. The same is true for match 3, which is an empty match at the very end of the input, after the b. The second match works because b is b; no explanation needed. Now I know the proper way to get around this is to change the pattern to (b+), and then no null matches are returned. But I present this problem in the scope of a larger regex pattern, where one may have grouping issues which inadvertently capture a null. Also, when matching across \r\n boundaries, one sometimes gets a null or the end-of-buffer null. Those are the cases where I think the suggestion below would be useful.
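The behaviour described above is easy to reproduce. The sketch below uses Python's re module rather than .Net (an assumption made purely for illustration; for this pattern and input, Python 3.7 or later reports the same three matches), and list_matches is a hypothetical helper name, not part of any library.

```python
import re

def list_matches(pattern, text):
    """Return (start index, whole-match text) for every match the engine reports."""
    return [(m.start(), m.group(0)) for m in re.finditer(pattern, text)]

# (b*) accepts zero b's, so the engine reports an empty match at the 'a',
# the real match on 'b', and another empty match at the end of the input.
print(list_matches(r"(b*)", "ab"))   # [(0, ''), (1, 'b'), (2, '')]

# (b+) requires at least one b, so only the real match is reported.
print(list_matches(r"(b+)", "ab"))   # [(1, 'b')]
```

The start indexes make it clear where each empty match sits: one before the a and one after the b.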
Suggestion Made to Microsoft
I suggested to Microsoft, in a Connect issue entitled Regular Expression (Regex) Improvements – Null Value Ignore, that they create a flag on the parser that would ignore all nulls. If there was nothing in the match or its captures, then the match would not even be presented. In the above example, if my suggestion were applied, only one match would be returned. I have never had a use for the null in regex parsing (maybe a reader could enlighten me), so for the most part, if there are nulls, don't report them. Your thoughts? If you find it compelling, go to the Connect issue and vote on it! Thanks.
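Until such a flag exists, the effect can be approximated by filtering the results after the fact. The following is only a rough sketch of the idea, again in Python's re module rather than .Net; non_empty_matches is a hypothetical helper, and it treats a match as "null" when the whole match (group 0) is empty, which is my reading of the suggestion rather than a defined behaviour.

```python
import re

def non_empty_matches(pattern, text):
    """Yield only the matches whose whole-match text (group 0) is non-empty."""
    for m in re.finditer(pattern, text):
        if m.group(0):          # an empty string is falsy, so null matches are skipped
            yield m

# With the null matches filtered out, (b*) against "ab" yields a single match.
for m in non_empty_matches(r"(b*)", "ab"):
    print(m.start(), repr(m.group(0)))   # 1 'b'
```

Note that this simply discards empty results; it does not change how the engine searches, so it is not a substitute for writing the pattern so that it cannot match the empty string.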
#1 by Susan Mackay on July 12, 2007 - 11:55 pm
PCRE already has this functionality with the PCRE_NOTEMPTY option to the pcre_exec function. Therefore there is already a precedent for MS to follow.
Susan
#2 by omegaman on July 13, 2007 - 5:37 am
I had wondered if there was something out there that did this. I will have to play with Perl's implementation to see how it behaves. Thanks for the post!
#3 by Steven Levithan on November 28, 2007 - 10:22 am
Why not just write the regex you actually meant to write, which doesn’t allow empty string matches? The ability for regexes or regex tokens to match the empty string is such a fundamental (though often misunderstood) mechanic that messing with it is bound to confuse you more in the long run.
#4 by omegaman on November 28, 2007 - 1:00 pm
I can't argue with your logic, Steven. In the MSDN regex forums I deal with people whose knowledge of regex patterns is at a beginner's level. In some situations it would be nice to be able to say "set this flag" so that they see data on the first pull of the enumeration of the results. Other than that, yes, one should write the regex pattern to avoid that circumstance.