.Net Regex: Can Regular Expression Parsing be Faster than XmlDocument or Linq to Xml?
Most of the time one needs the power of the xml parser whether it is the XmlDocument or Linq to Xml to manipulate and extract data. But what if I told you that in some circumstances regular expressions might be faster?
Conventional development thinking has branded regex processing as slow, and the thought of using regex on xml might seem counterintuitive. Continuing my series of articles, I again want to dispel those notions and provide a real world example where regular expression parsing is not only on par with the other xml tools in the .Net world, but sometimes faster. The results of my speed test may surprise you, and hopefully show that regular expressions are not as slow as believed.
See: Are C# .Net Regular Expressions Fast Enough for You?
Real World Scenario
There was a developer on the MSDN forums who needed the ability to count URLs in multiple xml files. (See the actual post count the urls in xml file on Msdn.) The poster received three distinct replies: one using XmlDocument, another providing a Linq to Xml solution, and I chimed in with the regular expression method. The poster took the XmlDocument method and marked it as the answer, but could he have done better?
I thought so…
So I took the three replies, distilled them down to their core processing, wrapped them in the same IO extraction layer, and proceeded to time them. I created 48 xml files containing over one hundred thousand urls to find, for a total of 13 MB on disk. I then ran the tests, all in release mode, to get the results. (See the Setup section below for a Gist repository of the code.)
Real World Result
Five test runs; each test name combines the technology and the user who posted it on the original MSDN thread. The slowest and fastest times are marked in red. Remember, XmlDoc is the one the poster chose as the answer.
Test 1
Regex found 116736 urls in 00:00:00.1843576
XmlLinq_Link_FR found 116736 urls in 00:00:00.2662190
XmlDoc_Hasim() found 116736 urls in 00:00:00.3534628
Test 2
Regex found 116736 urls in 00:00:00.2317883
XmlLinq_Link_FR found 116736 urls in 00:00:00.2792730
XmlDoc_Hasim() found 116736 urls in 00:00:00.2694969
Test 3
Regex found 116736 urls in 00:00:00.1646719
XmlLinq_Link_FR found 116736 urls in 00:00:00.2333891
XmlDoc_Hasim() found 116736 urls in 00:00:00.2625176
Test 4
Regex found 116736 urls in 00:00:00.1677931
XmlLinq_Link_FR found 116736 urls in 00:00:00.2258825
XmlDoc_Hasim() found 116736 urls in 00:00:00.2590841
Test 5
Regex found 116736 urls in 00:00:00.1668231
XmlLinq_Link_FR found 116736 urls in 00:00:00.2278445
XmlDoc_Hasim() found 116736 urls in 00:00:00.2649262
Wow! Regex consistently performed better within every run, even on the first test run before the operating system had cached the files! Note that the time format is Hours : Minutes : Seconds, and regex's best is 164 milliseconds to parse all 48 files. Even regex's worst run (232 milliseconds in Test 2) was still faster than either of the other methods in that same run.
How was this all done? Let me show you.
Setup
Ok, what magic or trickery have I played? All tests are run in a C# .Net 4 console application in release mode. I have created a public Gist (Regex vs Xml) repository of the code and data, which is actually a valid Git repository for anyone who may want to add their own tests, but let me detail what I did here on the blog as well.
The top-level operation found in Main looks like this, where I run the tests five times:
Enumerable.Range( 1, 5 )
    .ToList()
    .ForEach( tstNumber =>
    {
        Console.WriteLine( "Test " + tstNumber );
        Time( "Regex", RegexFindXml );
        Time( "XmlLinq_Link_FR", XmlLinq_Link_FR );
        Time( "XmlDoc_Hasim()", XmlDoc_Hasim );
        Console.WriteLine( Environment.NewLine );
    } );
while the Time generic method looks like this, dutifully running the target work and reporting the results as "{name} found {count} urls in {time}":
public static void Time<T>( string what, Func<T> work )
{
    var sw = Stopwatch.StartNew();
    var result = work();
    sw.Stop();
    Console.WriteLine( "\t{0,-15} found {1} urls in {2}", what, result, sw.Elapsed );
}
Now, in the MSDN post the different methods had differing ways of finding and opening each xml file, so I made them all adhere to the same way of opening the files and summing the URL counts. Here is that snippet:
return Directory.EnumerateFiles( @"D:\temp", "*.xml" )
    .ToList()
    .Sum( fl =>
    {
        // per-file url counting goes here
    } );
Contender – XML Document
This is the one the poster marked as the answer and used, and I copied it to the best of my ability.
public static int XmlDoc_Hasim()
{
    return Directory.EnumerateFiles( @"D:\temp", "*.xml" )
        .ToList()
        .Sum( fl =>
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml( System.IO.File.ReadAllText( fl ) );

            // ChildNodes[0] is the xml declaration; ChildNodes[1] is <urlset>.
            if (doc.ChildNodes.Count > 0)
                if (doc.ChildNodes[1].HasChildNodes)
                    return doc.ChildNodes[1].ChildNodes.Count;

            return 0;
        } );
}
I used the Sum extension method, which is a little different from the original summing operation, but it brings the tests closer in line with each other.
Contender – Linq to Xml
Of the other two attempts, I felt this was the more robust, because it actually handles the xml namespace. Sadly, it appears to have been ignored by the original poster. Here is his code:
public static int XmlLinq_Link_FR()
{
    XNamespace xn = "http://www.sitemaps.org/schemas/sitemap/0.9";

    return Directory.EnumerateFiles( @"D:\temp", "*.xml" )
        .Sum( fl => XElement.Load( fl ).Descendants( xn + "loc" ).Count() );
}
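The namespace handling matters more than it might look: against a sitemap file, calling Descendants with the bare name "loc" silently finds nothing, because every element lives in the sitemap default namespace. A minimal sketch of the difference (the inline XML fragment is my own illustration, not one of the benchmark files):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class NamespaceDemo
{
    static void Main()
    {
        // A trimmed-down sitemap fragment with the default namespace declared.
        var xml = XElement.Parse(
            "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">" +
            "<url><loc>http://example.com/a.html</loc></url>" +
            "<url><loc>http://example.com/b.html</loc></url>" +
            "</urlset>" );

        XNamespace xn = "http://www.sitemaps.org/schemas/sitemap/0.9";

        // Unqualified name: no matches. Namespace-qualified: both urls found.
        Console.WriteLine( xml.Descendants( "loc" ).Count() );      // 0
        Console.WriteLine( xml.Descendants( xn + "loc" ).Count() ); // 2
    }
}
```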
Contender – Regular Expression
Finally, here is the speed test winner. I came up with the pattern by looking at the xml: it appeared one didn't need to match the actual url, just the two preceding tags and any possible whitespace between them. That is the key to regex: a well-designed pattern achieves fast results.
public static int RegexFindXml()
{
    string pattern = @"(<url>\s*<loc>)";

    return Directory.EnumerateFiles( @"D:\temp", "*.xml" )
        .Sum( fl => Regex.Matches( File.ReadAllText( fl ), pattern )
                         .OfType<Match>()
                         .Count() );
}
Xml 1 (Shortened)
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.linkedin.com/directory/companies/internet-web2.0-startups-social-networking/barcelona.html</loc><changefreq>weekly</changefreq></url>
  <url><loc>http://www.linkedin.com/directory/companies/internet-web2.0-startups-social-networking/basel.html</loc><changefreq>weekly</changefreq></url>
  <url><loc>http://www.linkedin.com/directory/companies/internet-web2.0-startups-social-networking/bath.html</loc><changefreq>weekly</changefreq></url>
  <url><loc>http://www.linkedin.com/directory/companies/computer-networking/sheffield.html</loc><changefreq>weekly</changefreq></url>
  <url><loc>http://www.linkedin.com/directory/companies/computer-networking/singapore.html</loc><changefreq>weekly</changefreq></url>
  <url><loc>http://www.linkedin.com/directory/companies/computer-networking/slough.html</loc><changefreq>weekly</changefreq></url>
  <url><loc>http://www.linkedin.com/directory/companies/computer-networking/slovak-republic.html</loc><changefreq>weekly</changefreq></url>
</urlset>
Xml 2 (Shortened)
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.linkedin.com/groups/gid-2431604</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/gid-2430868</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/Wireless-Carrier-Reps-Past-Present-2430807</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/gid-2430694</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/gid-2430575</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/gid-2431452</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/gid-2432377</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/gid-2428508</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/gid-2432379</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/gid-2432380</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/gid-2432381</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/gid-2432383</loc><changefreq>monthly</changefreq></url>
  <url><loc>http://www.linkedin.com/groups/gid-2432384</loc><changefreq>monthly</changefreq></url>
</urlset>
Summary
It really comes down to the right tool for the right situation, and in this one, regex did well. Regex is not suited to most xml parsing needs, but in certain scenarios it really shines. Had the xml been malformed, or the namespace wrong, each technology would have hit its own unique problems and produced a bad count. All the technologies had to do some up-front loading, and that is key to how they performed. The .Net regex engine handles large input efficiently, and as long as the pattern is spot on, it can be really quick.
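To make that fragility concrete, here is a small sketch (the inputs are my own illustrations, not the benchmark files) showing how the benchmark's pattern undercounts as soon as the markup drifts from the exact shape it expects, while a real parser would still find the url:

```csharp
using System;
using System.Text.RegularExpressions;

class FragilityDemo
{
    static void Main()
    {
        // The pattern from the benchmark: <url>, optional whitespace, <loc>.
        string pattern = @"(<url>\s*<loc>)";

        // The shape the benchmark files used: counted correctly.
        string clean = "<url><loc>http://example.com/a</loc></url>";

        // Same data, but an attribute on <url>, or a comment between the
        // tags: the pattern no longer matches, though the url is still there.
        string attr    = "<url id=\"1\"><loc>http://example.com/b</loc></url>";
        string comment = "<url><!-- note --><loc>http://example.com/c</loc></url>";

        Console.WriteLine( Regex.Matches( clean, pattern ).Count );   // 1
        Console.WriteLine( Regex.Matches( attr, pattern ).Count );    // 0
        Console.WriteLine( Regex.Matches( comment, pattern ).Count ); // 0
    }
}
```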
My thought is: don't dismiss regular expression parsing out of hand; learning it can pay off in some unique text parsing situations.
Thanks!
I’m really surprised by this benchmark but as you said “It really comes down to the right tool for the right situation and this one regex really did well. But Regex is not good at most xml parsing needs, but for certain scenarios it really shines.”
Is Linq to Xml always more performant than XmlDocument?
The XmlDocument code in this test could possibly be factored better…but your question is a good one Link.fr. I believe that since Linq to Xml is the newer technology, it will predominantly be faster for most Xml situations. This one was unique, and regex was a good candidate.
And the worst result was: 353 milliseconds.
To me the takeaways are don’t prematurely optimize and use the tools designed for the job (such as an XML parser for parsing XML).
While there are always edge cases, the answer to 98% of the “how can I parse XML/HTML with regular expressions” questions on Stack Overflow remains: Don’t Use Regular Expressions, Use a Parser.
Well said Bill. I run into the same issue in the MSDN Regex forum, where users want to use regex for html node processing. Ack…use the Html Agility Pack.
Sorry for the tangent, but I was wondering about this:
Enumerable.Range( 1, 5 )
    .ToList()
    .ForEach( tstNumber =>
    {
        Console.WriteLine( "Test " + tstNumber );
    } );
Is there any benefit to using this pattern, rather than the more traditional for(int i=1; i <=5; i++) { … } ?
I moved away from for loops and chose foreach[es] instead, so as not to have to worry about an index. Once I began using Linq extensively, I came to prefer enumerating over enumerable lists. That pattern speaks to both of my current programming methodologies. It is solely a personal preference, and both (the for and the .ForEach) achieve the same result in the end.
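For comparison, here are the two loop styles from the question side by side; they produce identical output, so the choice is stylistic (this snippet is my own, not from the benchmark code):

```csharp
using System;
using System.Linq;

class LoopStyles
{
    static void Main()
    {
        // Traditional indexed for loop.
        for ( int i = 1; i <= 5; i++ )
            Console.WriteLine( "Test " + i );

        // Linq-style enumeration over the same range. The .ToList() call is
        // needed because ForEach is defined on List<T>, not IEnumerable<T>.
        Enumerable.Range( 1, 5 )
            .ToList()
            .ForEach( tstNumber => Console.WriteLine( "Test " + tstNumber ) );
    }
}
```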