Posts tagged ‘Linq Index’

C# Regex Linq: Extract an Html Node with Attributes of Varying Types

iStock_000008717494XSmall

The premise of this article and subsequent code sample is that one has an html node to parse and needs the parsed node’s attributes accessible in a handy fashion. Using Regular Expressions with Linq  we can achieve our goal and examine all attributes of the html node. I will show the steps to take and pitfalls on using other methodology.

Data

<INPUT onblur=google&&google.fade&&google.fade() class=lst title='Google Search' value=TESTING maxLength=2048 size=55 name=q autocomplete='off' init='true'/>

Why Not Use XElement’s Attributes?

Because of the free-form text found in the html the following code throws an exception on the first attribute encountered:

string test = @"<INPUT onblur=google&&google.fade&&google.fade() class=lst title='Google Search' value=TESTING maxLength=2048 size=55 name=q autocomplete='off' init='true'>";

// Fails saying google is unexpected token!
var input = XElement.Parse( test )
                    .Attributes()
                    .Select( vl => new KeyValuePair<string, string>( vl.Name.ToString(), vl.Value.ToString() ) );

foreach ( KeyValuePair<string, string> item in input )
    Console.WriteLine( "Key: {0,15} Value: {1}", item.Key, item.Value );

Step 1: Regex

Our first step is to create a regular expression which can handle the node and its attributes. What is interesting about the below regex pattern is that it uses an if clause to discriminate if the attribute contains the value in quotes, single or double, and will put them into the captures collection.

(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
# -- Extract Attributes Key Value Pairs  --

((?:\s+)             # One to many spaces start the attribute
 (?<Key>[^=]+)       # Name/key of the attribute
 (?:=)               # Equals sign needs to be matched, but not captured.

(?([\x22\x27])              # If quotes are found
  (?:[\x22\x27])
  (?<Value>[^\x22\x27]+)    # Place the value into named Capture
  (?:[\x22\x27])
 |                          # Else no quotes
   (?<Value>[^\s/>]*)       # Place the value into named Capture
 )
)+                  # -- One to many attributes found!

The above will find a match on a node, place the tag into the named capture of Tag. Then each attribute will be in two named capture collections of Key Value

Regex Returns A Match…Now What?

We need to extract the items into a Dictionary of key value pairs. The following code works with the name match captures and its indexed captures and extracts all attributes (Note copy code to clipboard or view to get alignment):

var attributes = ( from Match mt in Regex.Matches( node, pattern, RegexOptions.IgnorePatternWhitespace )
                   select new
                   {
                       Name = mt.Groups["Tag"],
                       Attrs = ( from cpKey in mt.Groups["Key"].Captures.Cast<Capture>().Select( ( a, i ) => new { a.Value, i } )
                                 join cpValue in mt.Groups["Value"].Captures.Cast<Capture>().Select( ( b, i ) => new { b.Value, i } ) on cpKey.i equals cpValue.i
                                 select new KeyValuePair<string, string>( cpKey.Value, cpValue.Value ) ).ToDictionary( kvp => kvp.Key, kvp => kvp.Value )
                   } ).First().Attrs;

What the above is doing is enumerating over all the matches, in this case there is only one. Then we work through all the keys in the “Key” captures array and marry them to the “Value” value in that array on a one-to-one basis. Notice how we can index into a joined array via its index thanks to the specialized select which returns the index value. Finally we express those combined items  into a key value pair.

Full Code and Result

string node = @"<INPUT onblur=google&amp;&amp;google.fade&amp;&amp;google.fade() class=lst title='Google Search' value=TESTING maxLength=2048 size=55 name=q autocomplete='off' init='true'/>";
string pattern =@"
(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
                     # -- Extract Attributes Key Value Pairs  --

((?:\s+)             # One to many spaces start the attribute
 (?<Key>[^=]+)       # Name/key of the attribute
 (?:=)               # Equals sign needs to be matched, but not captured.

(?([\x22\x27])              # If quotes are found
  (?:[\x22\x27])
  (?<Value>[^\x22\x27]+)    # Place the value into named Capture
  (?:[\x22\x27])
 |                          # Else no quotes
   (?<Value>[^\s/>]*)       # Place the value into named Capture
 )
)+                  # -- One to many attributes found!";

var attributes = ( from Match mt in Regex.Matches( node, pattern, RegexOptions.IgnorePatternWhitespace )
                   select new
                   {
                       Name = mt.Groups["Tag"],
                       Attrs = ( from cpKey in mt.Groups["Key"].Captures.Cast<Capture>().Select( ( a, i ) => new { a.Value, i } )
                                 join cpValue in mt.Groups["Value"].Captures.Cast<Capture>().Select( ( b, i ) => new { b.Value, i } ) on cpKey.i equals cpValue.i
                                 select new KeyValuePair<string, string>( cpKey.Value, cpValue.Value ) ).ToDictionary( kvp => kvp.Key, kvp => kvp.Value )
                   } ).First().Attrs;


foreach ( KeyValuePair<string, string> kvp in attributes )
    Console.WriteLine( "Key {0,15}    Value: {1}", kvp.Key, kvp.Value );

/* Output:
Key          onblur    Value: google&amp;&amp;google.fade&amp;&amp;google.fade()
Key           class    Value: lst
Key           title    Value: Google Search
Key           value    Value: TESTING
Key       maxLength    Value: 2048
Key            size    Value: 55
Key            name    Value: q
Key    autocomplete    Value: off
Key            init    Value: true
*/
Share