Posts tagged ‘Regular Expressions’

C# Regex for Parsing Known texts

A user wanted to parse text with a known layout. Here is a regex pattern which breaks the user's text out into named groups.

 

string pattern = @"
(?:OS\s)                     # Match but don't capture (MDC) OS, used as an anchor
(?<Version>\d\.\d+)          # Version of OS
(?:;)                        # MDC ;
(?<Phone>[^;]+)              # Get phone name up to ;
(?:;)                        # MDC ;
(?<Type>[^;]+)               # Get phone type up to ;
(?:;)                        # MDC ;
(?<Major>\d\.\d+)            # Major version
(?:;)                        # MDC ;
(?<Minor>\d+)                # Minor Version
";

string data = 
@"Windows Phone Search (Windows Phone OS 7.10;Acer;Allegro;7.10;8860)
Windows Phone Search (Windows Phone OS 7.10;HTC;7 Mozart T8698;7.10;7713)
Windows Phone Search (Windows Phone OS 7.10;HTC;Radar C110e;7.10;7720)";

// IgnorePatternWhitespace allows us to comment the pattern; it does not change how the regex processes the data
var phones = Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace)
                  .OfType<Match>()
                  .Select (mt => new 
                  {
                    Name = mt.Groups["Phone"].Value,
                    Type = mt.Groups["Type"].Value,
                    Version = string.Format( "{0}.{1}", mt.Groups["Major"].Value,
                                                        mt.Groups["Minor"].Value)
                  }
                  );
                  
Console.WriteLine ("Phones Supported are:");            

phones.Select(ph => string.Format("{0} of type {1} version ({2})", ph.Name, ph.Type, ph.Version))
      .ToList()
      .ForEach(Console.WriteLine);
      
/* Output
Phones Supported are:
Acer of type Allegro version (7.10.8860)
HTC of type 7 Mozart T8698 version (7.10.7713)
HTC of type Radar C110e version (7.10.7720)
*/

.Net Regex: Can Regular Expression Parsing be Faster than XmlDocument or Linq to Xml?

Most of the time one needs the power of an xml parser, whether it is XmlDocument or Linq to Xml, to manipulate and extract data. But what if I told you that in some circumstances regular expressions might be faster?

Most conventional development thinking has branded regex processing as slow, and the thought of using regex on xml might seem counterintuitive. In a continuation of articles I again want to dispel those thoughts and provide a real world example where regular expression parsing is not only on par with other tools in the .Net world but sometimes faster. The results of my speed test may surprise you, and hopefully show that regular expressions are not as slow as believed and can even be faster!

See: Are C# .Net Regular Expressions Fast Enough for You?

Real World Scenario

There was a developer on the MSDN forums who needed the ability to count URLs in multiple xml files. (See the actual post, count the urls in xml file, on MSDN.) The poster received three distinct replies: one to use XmlDocument, another provided a Linq to Xml solution, and I chimed in with the regular expression method. The poster took the XmlDocument method and marked it as the answer, but could he have done better?

I thought so…

So I took the three replies, distilled them down to their core processing, wrapped them in a similar IO extraction layer and proceeded to time them. I created 48 xml files with over one hundred thousand urls to find, for a total of 13 MB on disk. I then ran all the tests in release mode to get the results. (See the Setup section below for a gist repository of the code.)

Real World Result

Five tests were run; each test name is the technology and the user as found on the original MSDN post. (In the original post the slowest and fastest times are shown in red.) Remember that XmlDoc is the one the user chose as the answer.

Test 1
Regex           found 116736 urls in 00:00:00.1843576
XmlLinq_Link_FR found 116736 urls in 00:00:00.2662190
XmlDoc_Hasim()  found 116736 urls in 00:00:00.3534628

Test 2
Regex           found 116736 urls in 00:00:00.2317883
XmlLinq_Link_FR found 116736 urls in 00:00:00.2792730
XmlDoc_Hasim()  found 116736 urls in 00:00:00.2694969

Test 3
Regex           found 116736 urls in 00:00:00.1646719
XmlLinq_Link_FR found 116736 urls in 00:00:00.2333891
XmlDoc_Hasim()  found 116736 urls in 00:00:00.2625176

Test 4
Regex           found 116736 urls in 00:00:00.1677931
XmlLinq_Link_FR found 116736 urls in 00:00:00.2258825
XmlDoc_Hasim()  found 116736 urls in 00:00:00.2590841

Test 5
Regex           found 116736 urls in 00:00:00.1668231
XmlLinq_Link_FR found 116736 urls in 00:00:00.2278445
XmlDoc_Hasim()  found 116736 urls in 00:00:00.2649262

 

Wow! Regex was the fastest of the three in every test, even on the first run when there was no caching of the files! Note that the time format is Hours : Minutes : Seconds, and regex's fastest is 164 milliseconds to parse 48 files! Regex's uncached first run of 184 milliseconds is still better than the other two's best times; only its slowest run (232 milliseconds in Test 2) lands just behind Linq to Xml's best of 226 milliseconds.

How was this all done? Let me show you.

Setup

Ok, what magic or trickery have I played? All tests are run in a C# .Net 4 Console application in release mode. I have created a public Gist (Regex vs Xml) repository of the code and data, which is actually a valid Git repository for anyone who may want to add their own tests, but let me detail what I did here on the blog as well.

The top level operation found in Main looks like this, where I run the tests 5 times:

Enumerable.Range( 1, 5 )
            .ToList()
            .ForEach( tstNumber =>
            {
                Console.WriteLine( "Test " + tstNumber );
                Time( "Regex", RegexFindXml );
                Time( "XmlLinq_Link_FR", XmlLinq_Link_FR );
                Time( "XmlDoc_Hasim()", XmlDoc_Hasim );
                Console.WriteLine( Environment.NewLine );
            } );

while the generic Time method looks like this; it dutifully runs the target work and reports the results as “[name] found [count] urls in [time]”:

public static void Time<T>( string what, Func<T> work )
{
    var sw = Stopwatch.StartNew();
    var result = work();
    sw.Stop();
    Console.WriteLine( "\t{0,-15} found {1} urls in {2}", what, result, sw.Elapsed );
}

Now in the MSDN post the different methods had differing ways of finding each xml file and opening it, so I made them all adhere to the way I open the files and sum the URL counts. Here is that snippet:

return Directory.EnumerateFiles( @"D:\temp", "*.xml" )
            .ToList()
            .Sum( fl =>
            {
                // ... each contender's per-file url counting goes here ...
            } );

Contender – XML Document

This is the one which the poster marked as the answer and used, and I dutifully copied it to the best of my ability.

public static int XmlDoc_Hasim()
{
    return Directory.EnumerateFiles( @"D:\temp", "*.xml" )
                .ToList()
                .Sum( fl =>
                {
                    XmlDocument doc = new XmlDocument();
                    doc.LoadXml( System.IO.File.ReadAllText( fl ) );

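                    // ChildNodes[0] is the <?xml ?> declaration and ChildNodes[1] is <urlset>,
                    // so make sure the second node exists before indexing into it.
                    if (doc.ChildNodes.Count > 1)
                        if (doc.ChildNodes[1].HasChildNodes)
                            return doc.ChildNodes[1].ChildNodes.Count;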

                    return 0;

                } );

}

I used the Sum extension method, which is a little different from the summing done in the original reply, but it brings the tests closer in line with each other.

Contender – Linq to Xml

Of the other two attempts, this one I felt was the more robust of the two, because it actually handled the xml namespace. Sadly it appeared to be ignored by the original poster. Here is his code:

public static int XmlLinq_Link_FR()
{
    XNamespace xn = "http://www.sitemaps.org/schemas/sitemap/0.9";

    return Directory.EnumerateFiles( @"D:\temp", "*.xml" )
                    .Sum( fl => XElement.Load( fl ).Descendants( xn + "loc" ).Count() );

}

Contender – Regular Expression

Finally here is the speed test winner. I came up with the pattern design by looking at the xml; it appeared one didn’t need to match the actual url, but just the two preceding tags and any possible space between them. That is the key to regex: a good pattern can achieve fast results.

public static int RegexFindXml()
{
    string pattern = @"(<url>\s*<loc>)";

    return Directory.EnumerateFiles( @"D:\temp", "*.xml" )
                    .Sum( fl => Regex.Matches( File.ReadAllText( fl ), pattern ).OfType<Match>().Count() );

}
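
When the same pattern is applied to many files, one option is to build the Regex object once and reuse it. The sketch below is not the code that was timed above; the class and method names are made up for illustration, and whether RegexOptions.Compiled actually helps depends on the workload, so measure before adopting it.

```csharp
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original benchmark.
public static class UrlCounterSketch
{
    // Same pattern as RegexFindXml, minus the capture group, built once and reused.
    private static readonly Regex UrlStart =
        new Regex( @"<url>\s*<loc>", RegexOptions.Compiled );

    // Counts the <url><loc> openings in one chunk of xml text.
    public static int CountUrls( string xml )
    {
        return UrlStart.Matches( xml ).Count;
    }

    // Sums the counts over every *.xml file in a caller supplied folder.
    public static int CountUrlsInFolder( string folder )
    {
        return Directory.EnumerateFiles( folder, "*.xml" )
                        .Sum( fl => CountUrls( File.ReadAllText( fl ) ) );
    }
}
```

Calling UrlCounterSketch.CountUrlsInFolder( @"D:\temp" ) would mirror the folder used in the tests above.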

XML1 (Shortened)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.linkedin.com/directory/companies/internet-web2.0-startups-social-networking/barcelona.html</loc><changefreq>weekly</changefreq></url>
<url><loc>http://www.linkedin.com/directory/companies/internet-web2.0-startups-social-networking/basel.html</loc><changefreq>weekly</changefreq></url>
<url><loc>http://www.linkedin.com/directory/companies/internet-web2.0-startups-social-networking/bath.html</loc><changefreq>weekly</changefreq></url>
<url><loc>http://www.linkedin.com/directory/companies/computer-networking/sheffield.html</loc><changefreq>weekly</changefreq></url>
<url><loc>http://www.linkedin.com/directory/companies/computer-networking/singapore.html</loc><changefreq>weekly</changefreq></url>
<url><loc>http://www.linkedin.com/directory/companies/computer-networking/slough.html</loc><changefreq>weekly</changefreq></url>
<url><loc>http://www.linkedin.com/directory/companies/computer-networking/slovak-republic.html</loc><changefreq>weekly</changefreq></url>
</urlset>

XML2 (Shortened)

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.linkedin.com/groups/gid-2431604</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/gid-2430868</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/Wireless-Carrier-Reps-Past-Present-2430807</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/gid-2430694</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/gid-2430575</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/gid-2431452</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/gid-2432377</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/gid-2428508</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/gid-2432379</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/gid-2432380</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/gid-2432381</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/gid-2432383</loc><changefreq>monthly</changefreq></url>
<url><loc>http://www.linkedin.com/groups/gid-2432384</loc><changefreq>monthly</changefreq></url>
</urlset>

Summary

It really comes down to the right tool for the right situation, and in this one regex really did well. Regex is not good at most xml parsing needs, but for certain scenarios it really shines. If the xml is malformed or the namespace is wrong, then the parser has its own unique problems which would lead to a bad count. All the technologies had to do some upfront loading and that is key to how they performed. Regex handles large data efficiently and, as long as the pattern is spot on, it can really be quick.

My thought is don’t dismiss regular expression parsing out of hand; learning it can pay off in some unique text parsing situations.
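
To make the namespace point concrete, here is a small sketch that counts the same two urls three ways. The sample xml, the class name and the method names are invented for illustration; only the sitemap namespace and the regex pattern come from the article.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical demo class; the sample document below is made up.
public static class NamespacePitfallSketch
{
    public const string Sample =
        "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">" +
        "<url><loc>http://example.com/a</loc></url>" +
        "<url><loc>http://example.com/b</loc></url>" +
        "</urlset>";

    // Linq to Xml without the namespace: the loc elements are not found.
    public static int CountWithoutNamespace()
    {
        return XElement.Parse( Sample ).Descendants( "loc" ).Count();
    }

    // Linq to Xml with the namespace: both loc elements are found.
    public static int CountWithNamespace()
    {
        XNamespace xn = "http://www.sitemaps.org/schemas/sitemap/0.9";
        return XElement.Parse( Sample ).Descendants( xn + "loc" ).Count();
    }

    // The regex only looks at the text, so the namespace does not matter to it.
    public static int CountWithRegex()
    {
        return Regex.Matches( Sample, @"<url>\s*<loc>" ).Count;
    }
}
```

The flip side is that the regex would just as happily count tags inside a comment or CDATA block, which is part of why it only suits certain scenarios.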


C# Extracting CSV data into Linq and a Dictionary using Regular Expressions

I had written a post a while back which detailed a regular expression pattern, used by the .Net regex parser, which parsed a Comma Separated Value file, or CSV file for short. Upon looking at the pattern again I came to realize that it didn’t work for all situations. So I have created a new pattern which will extract all items from the CSV data into a dynamic anonymous Linq entity. Following that example I will show how to use the same Linq entity to put that CSV data into a dictionary, a hash table, where the key of each entry is the first column’s data.

CSV Considerations

  1. Data is separated by commas.
  2. Quotes, single or double, are an optional encapsulation of data.
  3. Any data which has a comma must be encased in quotes.
  4. Quoted data can use single or double quotes.
  5. Data rows can be ragged.
  6. Null data is handled, except in the last column.
  7. The last data column cannot be null.

'Alpha',,'01000000043','2','4',Regex Space
'Beta',333,444,"Other, Space",No Quote Space,'555'

Regular Expression Pattern

The things of note about the below pattern are:

  • Pattern needs Regex Options. Those options for this article are defined both in the pattern and in the call to the regular expression parser; normally it is done only once.
    1. Pattern is commented, so the IgnorePatternWhitespace option is needed. Note that option does not affect the regex parsing of the data.
    2. Multiline option is needed so ^ matches the beginning of each line and $ matches the end of each line (in .Net that is just before the \n).
  • A regular expression if condition is used to test whether the individual column data is enclosed in quotes. If it finds a quote it consumes the quotes but does not pass them on to the final data processing.
  • Each line will correspond to one match.
  • All data is put into a named match capture called Column; hence the match will have all line values in the capture collection named Column.

(?xm)                        # Tell the compiler we are commenting (x = IgnorePatternWhitespace)
                             # and tell the compiler this is multiline (m),
                             # In Multiline the ^ matches each start line and $ is each EOL
                             # -Pattern Start-
^(                           # Start at the beginning of the line always
 (?![\r\n]|$)                # Stop the match if EOL or EOF found.
 (?([\x27\x22])              # Regex If to check for single/double quotes
      (?:[\x27\x22])         # \\x27\\x22 are single/double quotes
      (?<Column>[^\x27\x22]*)# Match this in the quotes and place in Named match Column
      (?:[\x27\x22])

  |                          # or (else) part of If when Not within quotes

     (?<Column>[^,\r\n]*)    # Not within quotes, but put it in the column
  )                          # End of Pattern OR

(?:,?)                       # Either a comma or EOL/EOF
)+                           # 1 or more columns of data.
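
The if condition is the least familiar part of the pattern, so here is a cut down sketch of just that construct reading a single column. The class name, method name and the group name Value are assumptions made for this example; the quote test itself is the same one used in the pattern above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper showing only the (?(test) yes | no) construct.
public static class QuoteConditionalSketch
{
    private const string Pattern = @"
^(?([\x27\x22])              # test: is the next character a single or double quote
    [\x27\x22]               # yes: consume the opening quote
    (?<Value>[^\x27\x22]*)   #      keep the text between the quotes
    [\x27\x22]               #      consume the closing quote
  |                          # no:
    (?<Value>[^,\r\n]*)      #      keep everything up to a comma or line end
 )";

    // Returns the first column of a line with any surrounding quotes removed.
    public static string ReadFirstColumn( string line )
    {
        Match m = Regex.Match( line, Pattern, RegexOptions.IgnorePatternWhitespace );
        return m.Success ? m.Groups["Value"].Value : null;
    }
}
```

Because both branches write to the same named group, the calling code never needs to know which branch ran.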

Regex to Linq

Here is the code which will enumerate over each match and add the contents of the match capture collection into a dynamic linq entity. Notes:

  1. The code below uses the regex pattern mentioned above but does not show it for brevity.
  2. The regex options are set twice as an example. One only needs to set them once.

string pattern = @" ... ";


string text = /* Note the ,, as a null situation */
@"'Alpha',,'01000000043','2','4',Regex Space
'Beta',333,444,""Other, Space"",No Quote Space,'555'";

// We specified the Regex options in the pattern, but we can also specify them here.
// Both are redundant, decide which you prefer and use one.
var CSVData = from Match m in Regex.Matches( text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline )
              select new
              {
                  Data = from Capture cp in m.Groups["Column"].Captures
                         select cp.Value,
              };

int lineNo = 0;

foreach ( var line in CSVData )
    Console.WriteLine( string.Format("Line #{0}:  {1}", ++lineNo, string.Join( "|", line.Data.ToArray() ) ));

/* Output

Line #1:  Alpha||01000000043|2|4|Regex Space
Line #2:  Beta|333|444|Other, Space|No Quote Space|555

*/
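
The reason m.Groups["Column"].Captures holds every column is that a named group placed inside a repeated group records one capture per repetition. A minimal sketch of that behaviour, using a simplified pattern and names (Item, SplitItems) that are assumptions for this example only:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo of a capture collection filled by a repeated group.
public static class CaptureCollectionSketch
{
    // Returns every value the named group Item matched while the outer group repeated.
    public static string[] SplitItems( string line )
    {
        Match m = Regex.Match( line, @"^(?:(?<Item>\w+),?)+$" );

        // Groups["Item"].Value would only give the last item;
        // the Captures collection keeps all of them in order.
        return m.Groups["Item"].Captures
                .Cast<Capture>()
                .Select( cp => cp.Value )
                .ToArray();
    }
}
```

If the line does not match at all, the capture collection is empty and an empty array comes back.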

Linq To Dictionary

Taking the same code above, specifically the dynamic Linq entity holder CSVData, we will transform it into a dictionary where the key into the hashtable is the first CSV column data item.

// Put into Dictionary where the key is the first csv column data.
// Note the below creates a KeyValuePair using an integer index for the
// key, which is extracted as the parsing goes on. It is not used. It
// is simply shown as an example of getting the index from Linq and
// could be changed to use the first column instead.

Dictionary<string, List<string>> items2 =
    CSVData.Select( ( a, index ) => new KeyValuePair<int, List<string>>( index, a.Data.ToList() ) )
           .ToDictionary( kvp => kvp.Value[0], kvp => kvp.Value );


foreach ( KeyValuePair<string, List<string>> kvp in items2 )
      Console.WriteLine( "Key {0} : {1}", kvp.Key, string.Join( "|", kvp.Value.ToArray() ) );

/*
Key Alpha : Alpha||01000000043|2|4|Regex Space
Key Beta : Beta|333|444|Other, Space|No Quote Space|555
*/
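
One caveat: ToDictionary throws an ArgumentException if two rows share the same first column. If the data may contain repeated keys, one possible approach is to group the rows first. The sketch below is an assumption about how that could look, not part of the original post; the class and method names are invented.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for CSV data whose first column is not unique.
public static class DuplicateKeySketch
{
    // Groups rows by their first column so repeated keys do not throw.
    public static Dictionary<string, List<List<string>>> ByFirstColumn(
        IEnumerable<List<string>> rows )
    {
        return rows.Where( row => row.Count > 0 )          // skip rows with no columns
                   .GroupBy( row => row[0] )               // first column is the key
                   .ToDictionary( grp => grp.Key, grp => grp.ToList() );
    }
}
```

With the code above it could be called as DuplicateKeySketch.ByFirstColumn( CSVData.Select( a => a.Data.ToList() ) ).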

Full Code

string pattern = @"
(?xm)                        # Tell the compiler we are commenting (x = IgnorePatternWhitespace)
                             # and tell the compiler this is multiline (m),
                             # In Multiline the ^ matches each start line and $ is each EOL
                             # Pattern Start
^(                           # Start at the beginning of the line always
 (?![\r\n]|$)                # Stop the match if EOL or EOF found.
 (?([\x27\x22])              # Regex If to check for single/double quotes
      (?:[\x27\x22])         # \\x27\\x22 are single/double quotes
      (?<Column>[^\x27\x22]*)# Match this in the quotes and place in Named match Column
      (?:[\x27\x22])

  |                          # or (else) part of If when Not within quotes

     (?<Column>[^,\r\n]*)    # Not within quotes, but put it in the column
  )                          # End of Pattern OR

(?:,?)                       # Either a comma or EOL/EOF
)+                           # 1 or more columns of data.";


string text = /* Note the ,, as a null situation */
@"'Alpha',,'01000000043','2','4',Regex Space
'Beta',333,444,""Other, Space"",No Quote Space,'555'";

// We specified the Regex options in the pattern, but we can also specify them here.
// Both are redundant, decide which you prefer and use one.
var CSVData = from Match m in Regex.Matches( text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline )
              select new
              {
                  Data = from Capture cp in m.Groups["Column"].Captures
                         select cp.Value,
              };

int lineNo = 0;

foreach ( var line in CSVData )
    Console.WriteLine( string.Format("Line #{0}:  {1}", ++lineNo, string.Join( "|", line.Data.ToArray() ) ));

/* Output

Line #1:  Alpha||01000000043|2|4|Regex Space
Line #2:  Beta|333|444|Other, Space|No Quote Space|555

*/

// Put into Dictionary where the key is the first csv column data.
// Note the below creates a KeyValuePair using an integer index for the
// key, which is extracted as the parsing goes on. It is not used. It
// is simply shown as an example of getting the index from Linq and
// could be changed to use the first column instead.

Dictionary<string, List<string>> items2 =
    CSVData.Select( ( a, index ) => new KeyValuePair<int, List<string>>( index, a.Data.ToList() ) )
           .ToDictionary( kvp => kvp.Value[0], kvp => kvp.Value );


foreach ( KeyValuePair<string, List<string>> kvp in items2 )
      Console.WriteLine( "Key {0} : {1}", kvp.Key, string.Join( "|", kvp.Value.ToArray() ) );

/*
Key Alpha : Alpha||01000000043|2|4|Regex Space
Key Beta : Beta|333|444|Other, Space|No Quote Space|555
*/

C# Regex Linq: Extract an Html Node with Attributes of Varying Types


The premise of this article and subsequent code sample is that one has an html node to parse and needs the parsed node’s attributes accessible in a handy fashion. Using Regular Expressions with Linq we can achieve our goal and examine all attributes of the html node. I will show the steps to take and the pitfalls of using other methods.

Data

<INPUT onblur=google&&google.fade&&google.fade() class=lst title='Google Search' value=TESTING maxLength=2048 size=55 name=q autocomplete='off' init='true'/>

Why Not Use XElement’s Attributes?

Because of the free-form text found in the html the following code throws an exception on the first attribute encountered:

string test = @"<INPUT onblur=google&&google.fade&&google.fade() class=lst title='Google Search' value=TESTING maxLength=2048 size=55 name=q autocomplete='off' init='true'>";

// Fails saying google is unexpected token!
var input = XElement.Parse( test )
                    .Attributes()
                    .Select( vl => new KeyValuePair<string, string>( vl.Name.ToString(), vl.Value.ToString() ) );

foreach ( KeyValuePair<string, string> item in input )
    Console.WriteLine( "Key: {0,15} Value: {1}", item.Key, item.Value );
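
If you want to confirm that behaviour without crashing the program, the failure surfaces as a System.Xml.XmlException. A minimal sketch, with a helper name that is made up for this example:

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper: reports whether XElement can parse the text at all.
public static class LooseHtmlSketch
{
    // Returns true when the text parses as xml; false when XElement rejects it.
    public static bool IsWellFormedXml( string text )
    {
        try
        {
            XElement.Parse( text );
            return true;
        }
        catch ( XmlException )
        {
            // Unquoted attribute values such as class=lst end up here.
            return false;
        }
    }
}
```

Html like the INPUT node above is simply not xml, which is why the rest of the article reaches for a regular expression.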

Step 1: Regex

Our first step is to create a regular expression which can handle the node and its attributes. What is interesting about the below regex pattern is that it uses an if clause to discriminate if the attribute contains the value in quotes, single or double, and will put them into the captures collection.

(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
# -- Extract Attributes Key Value Pairs  --

((?:\s+)             # One to many spaces start the attribute
 (?<Key>[^=]+)       # Name/key of the attribute
 (?:=)               # Equals sign needs to be matched, but not captured.

(?([\x22\x27])              # If quotes are found
  (?:[\x22\x27])
  (?<Value>[^\x22\x27]+)    # Place the value into named Capture
  (?:[\x22\x27])
 |                          # Else no quotes
   (?<Value>[^\s/>]*)       # Place the value into named Capture
 )
)+                  # -- One to many attributes found!

The above will find a match on a node and place the tag name into the named capture Tag. Then each attribute will be split across two named capture collections, Key and Value.

Regex Returns A Match…Now What?

We need to extract the items into a Dictionary of key value pairs. The following code works with the named match captures and their indexed captures and extracts all attributes:

var attributes = ( from Match mt in Regex.Matches( node, pattern, RegexOptions.IgnorePatternWhitespace )
                   select new
                   {
                       Name = mt.Groups["Tag"],
                       Attrs = ( from cpKey in mt.Groups["Key"].Captures.Cast<Capture>().Select( ( a, i ) => new { a.Value, i } )
                                 join cpValue in mt.Groups["Value"].Captures.Cast<Capture>().Select( ( b, i ) => new { b.Value, i } ) on cpKey.i equals cpValue.i
                                 select new KeyValuePair<string, string>( cpKey.Value, cpValue.Value ) ).ToDictionary( kvp => kvp.Key, kvp => kvp.Value )
                   } ).First().Attrs;

What the above is doing is enumerating over all the matches; in this case there is only one. Then we work through all the keys in the “Key” captures array and marry them to the “Value” value in that array on a one-to-one basis. Notice how we can join the two arrays by position thanks to the specialized Select overload which also returns the index value. Finally we express those combined items as key value pairs.
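
As an aside, on .Net 4 and later the same positional pairing can also be sketched with Enumerable.Zip. This is an alternative I am offering as an assumption, not the method used in this article; the class and method names are invented, and it relies on the same premise as the join, namely that the Key and Value collections line up one-to-one.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical alternative to the index join, using Zip.
public static class ZipCapturesSketch
{
    // Pairs the n-th Key capture with the n-th Value capture of a match.
    public static Dictionary<string, string> PairUp( Match mt )
    {
        var keys   = mt.Groups["Key"].Captures.Cast<Capture>().Select( cp => cp.Value );
        var values = mt.Groups["Value"].Captures.Cast<Capture>().Select( cp => cp.Value );

        return keys.Zip( values, ( k, v ) => new KeyValuePair<string, string>( k, v ) )
                   .ToDictionary( kvp => kvp.Key, kvp => kvp.Value );
    }
}
```

Zip stops at the shorter of the two sequences, so a mismatch in counts would silently drop items rather than throw.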

Full Code and Result

string node = @"<INPUT onblur=google&&google.fade&&google.fade() class=lst title='Google Search' value=TESTING maxLength=2048 size=55 name=q autocomplete='off' init='true'/>";
string pattern =@"
(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
                     # -- Extract Attributes Key Value Pairs  --

((?:\s+)             # One to many spaces start the attribute
 (?<Key>[^=]+)       # Name/key of the attribute
 (?:=)               # Equals sign needs to be matched, but not captured.

(?([\x22\x27])              # If quotes are found
  (?:[\x22\x27])
  (?<Value>[^\x22\x27]+)    # Place the value into named Capture
  (?:[\x22\x27])
 |                          # Else no quotes
   (?<Value>[^\s/>]*)       # Place the value into named Capture
 )
)+                  # -- One to many attributes found!";

var attributes = ( from Match mt in Regex.Matches( node, pattern, RegexOptions.IgnorePatternWhitespace )
                   select new
                   {
                       Name = mt.Groups["Tag"],
                       Attrs = ( from cpKey in mt.Groups["Key"].Captures.Cast<Capture>().Select( ( a, i ) => new { a.Value, i } )
                                 join cpValue in mt.Groups["Value"].Captures.Cast<Capture>().Select( ( b, i ) => new { b.Value, i } ) on cpKey.i equals cpValue.i
                                 select new KeyValuePair<string, string>( cpKey.Value, cpValue.Value ) ).ToDictionary( kvp => kvp.Key, kvp => kvp.Value )
                   } ).First().Attrs;


foreach ( KeyValuePair<string, string> kvp in attributes )
    Console.WriteLine( "Key {0,15}    Value: {1}", kvp.Key, kvp.Value );

/* Output:
Key          onblur    Value: google&&google.fade&&google.fade()
Key           class    Value: lst
Key           title    Value: Google Search
Key           value    Value: TESTING
Key       maxLength    Value: 2048
Key            size    Value: 55
Key            name    Value: q
Key    autocomplete    Value: off
Key            init    Value: true
*/

INI Files Meet Regex and Linq in C# to Avoid the WayBack Machine of Kernel32.dll

What if you are stuck having to deal with older technology such as INI files while using the latest and greatest C# and .Net there is available? This article discusses an alternate way to read INI files and extract the data from those dusty tomes while easily accessing the resulting data from dictionaries. Once the data resides in the dictionaries we can easily extract it using the power of the indexer on section name followed by key name within the section, such as IniFile["TargetSection"]["TargetKey"], which will return a string holding the value of that key in the ini file for that section.

Note that all the code is in one easy code section at the bottom of the article, so don’t feel you have to copy each section’s code.

Overview

If you are reading this, chances are you know what INI files are and don’t need a refresher. You may have looked into using the Win32 Kernel32.dll function GetPrivateProfileSection to achieve your goals. Ack!  “Set the Wayback machine Sherman!” Thanks but no thanks.

Here is how to do this operation using Regular Expressions (kind of a way back machine itself, but very useful) and Linq to Objects to get the values into a dictionary format so we can write this line of code to access the data within the INI file:

string myValue = IniFile["SectionName"]["KeyName"];

The Pattern

Let me explain the Regex Pattern. If you are not so inclined to understand the semantics of it skip to the next section.

string pattern = @"
^                           # Beginning of the line
((?:\[)                     # Section Start
 (?<Section>[^\]]*)         # Actual Section text into Section Group
 (?:\])                     # Section End then EOL/EOB
 (?:[\r\n]{0,}|\Z))         # Match but don't capture the CRLF or EOB
 (                          # Begin capture groups (Key Value Pairs)
   (?!\[)                    # Stop capture groups if a [ is found; new section
   (?<Key>[^=]*?)            # Any text before the =, matched few as possible
   (?:=)                     # Get the = now
   (?<Value>[^\r\n]*)        # Get everything that is not a line break
   (?:[\r\n]{0,4})           # MDC \r\n
  )+                        # End Capture groups";

Our goal is to use named match groups. Each match will have its section name in the named group called “Section”, and all of the data, which is the key and value pairs, will be in the groups named “Key” and “Value” respectively. The trick to the above pattern is found in line eight. That stops the match when a new section is hit, using the negative lookahead (?!\[), which I call the Match Invalidator. Otherwise our key/values would bleed into the next section.
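
To see that stop condition in isolation, here is a stripped down sketch that collects lines until one starts with a bracket. The pattern is a simplified stand-in for the one above, and the class, method and group names are assumptions for this example.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo of a negative lookahead ending a repeated group.
public static class LookaheadStopSketch
{
    // Collects lines from the start of the text until a line begins with [.
    public static string[] LinesBeforeNextSection( string text )
    {
        // (?!\[) refuses to start another repetition when the next character is [.
        Match m = Regex.Match( text, @"^(?:(?!\[)(?<Line>[^\r\n]+)[\r\n]*)+" );

        return m.Groups["Line"].Captures
                .Cast<Capture>()
                .Select( cp => cp.Value )
                .ToArray();
    }
}
```

Remove the (?!\[) and the same call would also pick up the section header and everything after it, which is exactly the bleed the article's pattern avoids.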

The Data

Here is the data for your perusal.

string data = @"[WindowSettings]
Window X Pos=0
Window Y Pos=0
Window Maximized=false
Window Name=Jabberwocky

[Logging]
Directory=C:\Rosetta Stone\Logs
";

We are interested in “Window Name” and “Directory”.

The Linq

Ok, if you thought the regex pattern was complicated, the Linq to Objects has some tricks up its sleeve as well. Our pattern creates a single match per section, with the accompanying key and value data in two separate named match capture collections, and that presents a problem. We need to join the capture collections together, but there is no direct way to do that in a Linq join because the only link between them is indirect: each collection’s index number.

How do we get the two collections to be joined?

Here is the code:

Dictionary<string, Dictionary<string, string>> InIFile
= ( from Match m in Regex.Matches( data, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline )
 select new
 {
  Section = m.Groups["Section"].Value,

  kvps = ( from cpKey in m.Groups["Key"].Captures.Cast<Capture>().Select( ( a, i ) => new { a.Value, i } )
     join cpValue in m.Groups["Value"].Captures.Cast<Capture>().Select( ( b, i ) => new { b.Value, i } ) on cpKey.i equals cpValue.i
     select new KeyValuePair<string, string>( cpKey.Value, cpValue.Value ) ).ToDictionary( kvp => kvp.Key, kvp => kvp.Value )

  } ).ToDictionary( itm => itm.Section, itm => itm.kvps );

Explanation:

  • Line 1: Our end goal object is a Dictionary where the key is the Section name and the value is a sub-dictionary with all the keys and values found in that section.
  • Line 2: The regex needs IgnorePatternWhitespace because we have commented the pattern. It needs Multiline because we are spanning multiple lines and need ^ to match the start of each individual line and not just the beginning of the text.
  • Line 5: This is the easiest item, simply access the named capture group “Section” for the section name.
  • Line 7 (.Captures) : Each one of the keys and values is in the specialized capture collection property off of the matched group.
  • Line 7 (.Cast<Capture>) : Since the capture collection is a specialized list and not a true generic list such as List<string>, we call Cast<TResult> to turn it into an IEnumerable<T> so we can access the standard query operators, i.e. the extension methods which are available to IEnumerable<T>. Short answer: so we can call .Select.
  • Line 7 (.Select): Because each list does not have a direct way to associate the data, we are going to create a new object that has a property which will have that index number, along with the target data value. That will allow us to join it to the other list.
  • Line 7 (Lambda) : The lambda has two parameters; the first is our actual regex Capture object represented by a. The i is the index value which we need for the join. We then call new and create a new entity with two properties: the first is the actual value of the Key, taken from the Capture class property “Value”, and the second is i, the index value.
  • Line 8 (Join) : We are going to join the data together using the direct properties of our new entity, but first we need to recreate the magic found in Line 7 for our Values capture collection. It is the same logic as the previous line so I will not delve into its explanation in detail.
  • Line 8 (on cpKey.i equals cpValue.i) : This is our association for the join on the new entities: where one index value i equals the other index value i, the key and value belong together. This is the keystone of all we are doing.
  • Line 9 (new KeyValuePair) : Ok we are now creating each individual linq projection item of the data as a KeyValuePair object. This could be replaced with a new entity, but I chose to use the KeyValuePair class.
  • Line 9 (ToDictionary) : We want to easily access these key value pairs in the future, so we place each Key into the key of a dictionary and each Value into that key’s value.
  • Line 11 (ToDictionary) : Here is where we take the projection of the previous lines of code and create the end goal dictionary where the key name is the section and the value is the sub dictionary created in Line 9.
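
The index join described above is easier to see in isolation. Here is a minimal sketch using two plain string arrays as stand-ins for the Key and Value capture collections; the arrays and the variable names are mine, made up for illustration, and are not part of the original listing. It also shows Enumerable.Zip, which pairs items by position and may be a simpler fit when both lists are known to have the same length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-ins for the Key and Value capture collections (sample data, assumed for illustration).
string[] keys = { "Window Name", "Window Maximized" };
string[] values = { "Jabberwocky", "false" };

// Same idea as Lines 7 to 9: tag each item with its index, then join on that index.
Dictionary<string, string> joined =
    ( from k in keys.Select( ( text, i ) => new { text, i } )
      join v in values.Select( ( text, i ) => new { text, i } ) on k.i equals v.i
      select new KeyValuePair<string, string>( k.text, v.text ) )
    .ToDictionary( kvp => kvp.Key, kvp => kvp.Value );

// Alternative: Zip pairs the two sequences by position without an explicit index.
Dictionary<string, string> zipped =
    keys.Zip( values, ( k, v ) => new { k, v } )
        .ToDictionary( p => p.k, p => p.v );

Console.WriteLine( joined["Window Name"] );      // Jabberwocky
Console.WriteLine( zipped["Window Maximized"] ); // false
```

Both approaches give the same dictionary here. The join spells out the association, while Zip simply stops at the end of the shorter sequence.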

Whew…what is the result?

Console.WriteLine( InIFile["WindowSettings"]["Window Name"] ); // Jabberwocky
Console.WriteLine( InIFile["Logging"]["Directory"] );          // C:\Rosetta Stone\Logs

Summary

Thanks to the power of regular expressions and Linq we don’t have to use the old methods to extract and process the data. We can easily access the information using the newer structures. Hope this helps and that you may have learned something new from something old.

Code All in One Place

Here is all the code so you don’t have to copy it from each section above. Don’t forget the using directives for System.Text.RegularExpressions, System.Linq and System.Collections.Generic to do it all.

string data = @"[WindowSettings]
Window X Pos=0
Window Y Pos=0
Window Maximized=false
Window Name=Jabberwocky

[Logging]
Directory=C:\Rosetta Stone\Logs
";
string pattern = @"
^                           # Beginning of the line
((?:\[)                     # Section Start
     (?<Section>[^\]]*)     # Actual Section text into Section Group
 (?:\])                     # Section End then EOL/EOB
 (?:[\r\n]{0,}|\Z))         # Match but don't capture the CRLF or EOB
 (                          # Begin capture groups (Key Value Pairs)
  (?!\[)                    # Stop capture groups if a [ is found; new section
  (?<Key>[^=]*?)            # Any text before the =, matched few as possible
  (?:=)                     # Get the = now
  (?<Value>[^\r\n]*)        # Get everything that is not an Line Changes
  (?:[\r\n]{0,4})           # MBDC \r\n
  )+                        # End Capture groups";

Dictionary<string, Dictionary<string, string>> InIFile
= ( from Match m in Regex.Matches( data, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline )
    select new
    {
        Section = m.Groups["Section"].Value,

        kvps = ( from cpKey in m.Groups["Key"].Captures.Cast<Capture>().Select( ( a, i ) => new { a.Value, i } )
                 join cpValue in m.Groups["Value"].Captures.Cast<Capture>().Select( ( b, i ) => new { b.Value, i } ) on cpKey.i equals cpValue.i
                 select new KeyValuePair<string, string>( cpKey.Value, cpValue.Value ) ).ToDictionary( kvp => kvp.Key, kvp => kvp.Value )

    } ).ToDictionary( itm => itm.Section, itm => itm.kvps );

Console.WriteLine( InIFile["WindowSettings"]["Window Name"] ); // Jabberwocky
Console.WriteLine( InIFile["Logging"]["Directory"] );          // C:\Rosetta Stone\Logs

Are C# .Net Regular Expressions Fast Enough for You?

It is generally accepted that there is an overhead in using regular expression parsing, and there is truth to that statement. But the premise of this article is that the difference is usually negligible, and using it as an excuse not to learn regex pattern processing is just plain foolish. Just like any high-level programming construct which gives the developer a quicker development time, the price paid is in the extra cycles it takes to run. But are regular expressions really as slow as they are perceived to be? Let me show you by example…

The MSDN forums are littered with vague warnings such as “Don’t use regex, it’s slow”. I have seen that advice given, and yes, it is based on a truth as mentioned before, but those giving it never add in the time it takes to subsequently process the information. They forget that in most cases regular expressions already provide the post-processing needs, such as storage and data extraction, built in.

It comes down to…Is it fast enough for you?

If one needs to shave milliseconds off a multi-million-iteration operation, then don’t use regular expressions, or at least run tests first. But for day-to-day use, I believe it’s always the right answer. With that premise, let us test some code.

Premise

The usual contender against a regular expression is string.Split. Now string.Split is a fast little function and very useful, but one then has to consider the ancillary processing around it. I found a real example culled from the forums.

The Test

A user asked what could be used to parse specific text and whether regular expressions were an option. The example text looked like this (changed slightly: I used the value 41 instead of 0):

name="rating_count" value="41"

The user wanted to extract the value 41 as an integer and wondered which approach was better.

The Opponent

Right out of the gate there was an answer saying Regex is slower and gave an example which actually failed. I have modified it to work. The originator had tested zero and didn’t realize they were getting a default value instead of an extracted value because it was only splitting on the ‘=’ character. In my test it is fixed and placed into a static method called Highway:

public static int Highway( string text )
{
    // Split on spaces, equals signs and double quotes (\x22)
    string[] parts = text.Split( new char[] { ' ', '=', '\x22' }, StringSplitOptions.RemoveEmptyEntries );
    int value = 0;

    // Find the "value" token; the token right after it holds the number
    for ( int index = 0; index < parts.Length - 1; index++ )
    {
        if ( parts[index].ToLower() == "value" )
        {
            string tempValue = parts[index + 1];
            int.TryParse( tempValue, out value );
            break;
        }
    }

    return value;
}

Note that \x22 is the hex escape for the double quote character (").
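
One caution on that escape, sketched below with throwaway variable names of my own: in C# the \x escape takes anywhere from one to four hex digits, so it will absorb a following character if that character happens to be a hex digit. The \u escape always takes exactly four digits, which makes it the safer choice when the next character could be 0-9 or a-f.

```csharp
using System;

string quote = "\x22";                    // one character: the double quote
Console.WriteLine( quote == "\"" );       // True

// \x is variable length, so the A below becomes part of the escape.
string absorbed = "\x22A";                // a single character, U+022A
// \u is fixed at four digits, so this is a quote followed by the letter A.
string fixedWidth = "\u0022A";

Console.WriteLine( absorbed.Length );     // 1
Console.WriteLine( fixedWidth.Length );   // 2
```

The strings used in this article are safe because \x22 is always followed by a non-hex character such as ) or ].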

The Contender

Here is what I wrote to do the same job with regular expressions, which I called MyWay (get it? MyWay or the Highway…bwhahahaha…nevermind):

public static int MyWay( string text )
{
int value = 0;

int.TryParse( Regex.Match( text, "(?:value=\x22)([^\x22]+)", RegexOptions.Compiled ).Groups[1].Value, out value );

return value;
}

Now I knew that this would be run multiple times, so I told .Net to compile the expression for reuse after the first call; if this is a one-off operation, one should not do that.
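
Another way to get that reuse, shown here as a sketch rather than as what the test below measures, is to build the Regex object once and keep calling its Match method. The names valueRegex and ExtractValue are hypothetical; the pattern is the same one used in MyWay.

```csharp
using System;
using System.Text.RegularExpressions;

// Built once, then reused on every call (hypothetical name).
Regex valueRegex = new Regex( "(?:value=\x22)([^\x22]+)", RegexOptions.Compiled );

// Hypothetical helper: returns the number after value=" or 0 when nothing usable is found.
int ExtractValue( string text )
{
    Match m = valueRegex.Match( text );
    return m.Success && int.TryParse( m.Groups[1].Value, out int parsed ) ? parsed : 0;
}

Console.WriteLine( ExtractValue( "name=\"rating_count\" value=\"41\"" ) ); // 41
Console.WriteLine( ExtractValue( "no match here" ) );                       // 0
```

Holding the instance makes the reuse explicit instead of relying on the cache behind the static Regex.Match call.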

The Cage

Here is the testing arena for the two operations. I throw away the first run of each, which does help regex in the long run due to the compilation, but frankly a one-off test without the compilation flag is not too shabby. If you try this at home, don’t forget the using for System.Diagnostics.

string data = string.Format("name={0}rating_count{0} value={0}41{0}", "\x22");
Stopwatch st = new Stopwatch();
int index;
int totalRuns = 100000;

Highway( data ); // Do a test and throw it out

st.Start();
for (index = 0; index < totalRuns; index++)
    Highway( data );
st.Stop();

Console.WriteLine( "Non Regex:\t{0}\tAvg Per Run:\t{1}", st.Elapsed.TotalMilliseconds, st.Elapsed.TotalMilliseconds / totalRuns );

MyWay( data ); // Throw out the first

st.Reset();
st.Start();
for ( index = 0; index < totalRuns; index++ )
    MyWay( data );
st.Stop();

Console.WriteLine( "Regex:\t\t{0}\tAvg Per Run:\t{1}", st.Elapsed.TotalMilliseconds, st.Elapsed.TotalMilliseconds / totalRuns );
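
If you want to time more than two candidates, the warm-up and loop can be folded into one helper. This is only a sketch: AverageMilliseconds is a name I made up, and the lambda passed in at the bottom is a trivial placeholder standing in for Highway or MyWay.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: runs one warm-up call, then times the given number of runs.
double AverageMilliseconds( Func<string, int> parser, string text, int runs )
{
    parser( text );                          // warm-up call, result thrown away
    Stopwatch timer = Stopwatch.StartNew();
    for ( int i = 0; i < runs; i++ )
        parser( text );
    timer.Stop();
    return timer.Elapsed.TotalMilliseconds / runs;
}

string sample = string.Format( "name={0}rating_count{0} value={0}41{0}", "\x22" );

// Placeholder parser; pass Highway or MyWay here instead.
double avg = AverageMilliseconds( s => s.Length, sample, 100000 );
Console.WriteLine( "Avg Per Run:\t{0}", avg );
```

The numbers will differ from machine to machine, so treat any single run as a rough indication only.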

Results

So what happens? In Release mode, 100000 runs on a dual core machine produced results like these (total milliseconds):

Non Regex:      213.9509        Avg Per Run:    0.002139509
Regex:          226.7564        Avg Per Run:    0.002267564

So the difference was not really that great. The non-regex version was usually faster overall, but in this run the regex version was only about 6 percent slower.

So one has to ask, “Is Regex fast enough for you?”

I believe the answer is yes! Note, in fairness, that poorly formed regex patterns will slow the parser down, but garbage in, garbage out; so yes, your mileage will vary.


Linq Orderby a Better IComparer in C#

Sometimes IComparer falls short when one needs to sort on different, for lack of a better term, data columns. Before writing an IComparer implementation for the sort, try using Linq’s OrderBy.

In the forums the user had data, in string lines, which looked like this

3 months ending 9/30/2007
9 months ending 9/30/2007
3 months ending 9/30/2008
9 months ending 9/30/2008

The user needed the leading month counts (shown in white in the original post) sorted first in ascending order, and the years (shown in red) sorted descending. Because the data was all in one string and needed differing sorts, he was having problems sorting with a custom IComparer class.

I recommended that he use regex to parse out the items and then use Linq to sort. Here is the result. Note that I merged all the data into one string where each line is a true line.

string input =
@"3 months ending 9/30/2007
9 months ending 9/30/2007
3 months ending 9/30/2008
9 months ending 9/30/2008";

string pattern = @"(?<Total>\d\d?)(?:[^\d]+)(?<Date>[\d/]+)";

var items =
    from Match m in Regex.Matches( input, pattern )
    select new
    {
        Total = int.Parse( m.Groups["Total"].Value ), // Parsed as a number so that 12 sorts after 9
        Date = DateTime.Parse( m.Groups["Date"].Value ),
        Full = m.Groups[0].Value
    };

var values = from p in items
             orderby p.Total, p.Date.Year descending
             select p;

foreach ( var itm in values )
    Console.WriteLine( itm.Full );

/* Outputs
3 months ending 9/30/2008
3 months ending 9/30/2007
9 months ending 9/30/2008
9 months ending 9/30/2007
             */
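
The same sort can also be written in method syntax, where ThenByDescending makes the second sort key explicit. This is a sketch under a few assumptions of mine: the month count is parsed to an int, the dates are always in M/d/yyyy form, and the variable name sorted is made up. ParseExact with the invariant culture is used so the result does not depend on the machine’s regional settings.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

string input = "3 months ending 9/30/2007\n9 months ending 9/30/2007\n"
             + "3 months ending 9/30/2008\n9 months ending 9/30/2008";

string[] sorted = Regex.Matches( input, @"(?<Total>\d\d?)(?:[^\d]+)(?<Date>[\d/]+)" )
    .Cast<Match>()
    .Select( m => new
    {
        Total = int.Parse( m.Groups["Total"].Value ),
        // Assumes M/d/yyyy input; the invariant culture keeps parsing machine independent.
        Date = DateTime.ParseExact( m.Groups["Date"].Value, "M/d/yyyy", CultureInfo.InvariantCulture ),
        Full = m.Value
    } )
    .OrderBy( p => p.Total )                 // month count ascending
    .ThenByDescending( p => p.Date.Year )    // then year descending
    .Select( p => p.Full )
    .ToArray();

foreach ( string line in sorted )
    Console.WriteLine( line );
```

It prints the four lines in the same order as the query syntax version above.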

Regex To Linq to Dictionary in C#

This article demonstrates these concepts:

  1. Regex extraction of Key Value pairs and placing them into named capture groups.
  2. Linq extraction of the Key Value pairs extracted from the matches of Regex.
  3. Dictionary creation from Linq using the ToDictionary method.

I answered this on the MSDN forums. The user had this data in key value pairs delimited by the pipe:

abc:1|bbbb:2|xyz:45|p:120

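In that line the text before each colon is a key, the number after it is the value, and the colon and pipe are the separators.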

The need was to get the keys and values into a dictionary. The following code uses named regex capture groups, which Linq then reads to extract the keys and their values. Once that is done, the Linq extension method ToDictionary creates the dictionary on the fly. Here is the code:

string input = "abc:1|bbbb:2|xyz:45|p:120";
string pattern = @"(?<Key>[^:]+)(?:\:)(?<Value>[^|]+)(?:\|?)";

Dictionary<string, string> KVPs
    = ( from Match m in Regex.Matches( input, pattern )
      select new
      {
          key = m.Groups["Key"].Value,
          value = m.Groups["Value"].Value
       }
       ).ToDictionary( p => p.key, p => p.value );

foreach ( KeyValuePair<string, string> kvp in KVPs )
    Console.WriteLine( "{0,6} : {1,3}", kvp.Key, kvp.Value );

/* Outputs:
   abc :   1
  bbbb :   2
   xyz :  45
     p : 120
 */
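
If the values are going to be used as numbers, the conversion can happen inside ToDictionary itself. Below is a minimal sketch; the tightened pattern (keys may not contain a pipe, values must be digits) and the name totals are my own assumptions, not part of the forum answer. Keep in mind that ToDictionary throws an ArgumentException when the same key appears twice, so this only suits input with unique keys.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

string input = "abc:1|bbbb:2|xyz:45|p:120";

// Keys stop at a colon or pipe; values are digits only (assumed for this sketch).
Dictionary<string, int> totals = Regex.Matches( input, @"(?<Key>[^:|]+):(?<Value>\d+)" )
    .Cast<Match>()
    .ToDictionary( m => m.Groups["Key"].Value,
                   m => int.Parse( m.Groups["Value"].Value ) );

Console.WriteLine( totals["xyz"] + totals["p"] ); // 165
```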