C# Regex Linq: Extract an Html Node with Attributes of Varying Types
This article and its accompanying code sample start from the premise that you have an HTML node to parse and need the parsed node's attributes accessible in a handy fashion. Using regular expressions with LINQ, we can achieve that goal and examine every attribute of the node. I will show the steps to take and the pitfalls of using other approaches.
Data
<INPUT onblur=google&&google.fade&&google.fade() class=lst title='Google Search' value=TESTING maxLength=2048 size=55 name=q autocomplete='off' init='true'/>
Why Not Use XElement’s Attributes?
HTML permits free-form attribute text, such as unquoted values, that is not well-formed XML. Because of this, the following code throws an exception on the first attribute encountered:
string test = @"<INPUT onblur=google&&google.fade&&google.fade() class=lst title='Google Search' value=TESTING maxLength=2048 size=55 name=q autocomplete='off' init='true'>";

// Fails saying google is unexpected token!
var input = XElement.Parse( test )
                    .Attributes()
                    .Select( vl => new KeyValuePair<string, string>( vl.Name.ToString(), vl.Value.ToString() ) );

foreach ( KeyValuePair<string, string> item in input )
    Console.WriteLine( "Key: {0,15} Value: {1}", item.Key, item.Value );
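To confirm that the failure comes from the markup itself and not from the LINQ query, here is a minimal sketch that tries the same kind of node with and without quoted attribute values. The class name `XElementCheck` and the helper `CanParse` are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XElementCheck
{
    // Hypothetical helper: true when XElement accepts the markup as well-formed XML.
    public static bool CanParse(string markup)
    {
        try
        {
            XElement.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for anything that is not well-formed XML.
            return false;
        }
    }

    public static void Main()
    {
        // Unquoted attribute values are accepted by browsers but are not well-formed XML.
        Console.WriteLine(CanParse("<INPUT class=lst name=q/>"));      // False

        // The same attributes, quoted, load without complaint.
        Console.WriteLine(CanParse("<INPUT class='lst' name='q'/>"));  // True
    }
}
```

So XElement is a fine choice when the markup is known to be XHTML-clean, but not for HTML as it is found in the wild.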
Step 1: Regex
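Our first step is to create a regular expression that can handle the node and its attributes. What is interesting about the pattern below is that it uses a conditional construct, (?(test)yes|no), to check whether the attribute's value is enclosed in quotes, single or double. Either way, the value is placed in the same named group, Value, and added to that group's captures collection. The pattern relies on RegexOptions.IgnorePatternWhitespace so that its layout and # comments are ignored.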
(?:<)(?<Tag>[^\s/>]+)          # Extract the tag name.
(?![/>])                       # Stop if /> is found
                               # -- Extract Attributes Key Value Pairs --
((?:\s+)                       # One to many spaces start the attribute
 (?<Key>[^=]+)                 # Name/key of the attribute
 (?:=)                         # Equals sign needs to be matched, but not captured.
 (?([\x22\x27])                # If quotes are found
     (?:[\x22\x27])
     (?<Value>[^\x22\x27]+)    # Place the value into named Capture
     (?:[\x22\x27])
   |                           # Else no quotes
     (?<Value>[^\s/>]*)        # Place the value into named Capture
 )
)+                             # -- One to many attributes found!
The above will find a match on a node and place the tag name into the named capture Tag. Each attribute then contributes one entry to each of two named capture collections, Key and Value.
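The conditional construct is easier to follow in isolation. The sketch below lifts just the value portion of the pattern and applies it to single key=value fragments. The class `ConditionalDemo` and the helper `ValueOf` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalDemo
{
    // Value portion of the attribute pattern, on its own.
    private const string Pattern = @"
        =                          # The equals sign before the value
        (?([\x22\x27])             # If the next character is a quote
            [\x22\x27]
            (?<Value>[^\x22\x27]+) # capture what lies between the quotes
            [\x22\x27]
          |                        # Else no quotes
            (?<Value>[^\s/>]*)     # capture up to whitespace, / or >
        )";

    // Hypothetical helper: returns the value from a single key=value fragment.
    public static string ValueOf(string fragment)
    {
        Match m = Regex.Match(fragment, Pattern, RegexOptions.IgnorePatternWhitespace);
        return m.Groups["Value"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(ValueOf("title='Google Search'")); // Google Search
        Console.WriteLine(ValueOf("class=lst"));             // lst
        Console.WriteLine(ValueOf("name=\"q\""));            // q
    }
}
```

Both branches write to the same group name, so the calling code never needs to know which branch matched. One limitation to keep in mind is that the quoted branch accepts either quote character as the closer, so a value that itself contains the other kind of quote will not match.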
Regex Returns A Match…Now What?
We need to extract the items into a Dictionary of key value pairs. The following code works with the named match groups and their indexed captures to extract all attributes:
var attributes = ( from Match mt in Regex.Matches( node, pattern, RegexOptions.IgnorePatternWhitespace )
                   select new
                   {
                       Name  = mt.Groups["Tag"],
                       Attrs = ( from cpKey in mt.Groups["Key"].Captures.Cast<Capture>().Select( ( a, i ) => new { a.Value, i } )
                                 join cpValue in mt.Groups["Value"].Captures.Cast<Capture>().Select( ( b, i ) => new { b.Value, i } )
                                   on cpKey.i equals cpValue.i
                                 select new KeyValuePair<string, string>( cpKey.Value, cpValue.Value )
                               ).ToDictionary( kvp => kvp.Key, kvp => kvp.Value )
                   }
                 ).First().Attrs;
The code above enumerates all the matches; in this case there is only one. It then works through every capture in the "Key" group and marries it to the capture at the same position in the "Value" group, on a one-to-one basis. The join is possible because the Select overload that supplies the element index lets each capture carry its position along with its value. Finally, each joined pair is projected into a KeyValuePair and the pairs are collected into a dictionary.
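The pairing by position can also be written with Enumerable.Zip (available from .NET 4.0), which walks two sequences in step. This is a swapped-in alternative to the index join, not the code used in this article, and the sketch uses a deliberately simplified pattern. The class `ZipDemo` and the method `Pair` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ZipDemo
{
    // Hypothetical helper: pairs the nth Key capture with the nth Value capture.
    public static Dictionary<string, string> Pair(Match mt)
    {
        var keys   = mt.Groups["Key"].Captures.Cast<Capture>().Select(c => c.Value);
        var values = mt.Groups["Value"].Captures.Cast<Capture>().Select(c => c.Value);

        return keys.Zip(values, (k, v) => new KeyValuePair<string, string>(k, v))
                   .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);
    }

    public static void Main()
    {
        // Simplified pattern for illustration only: word-character keys and values.
        Match mt = Regex.Match("<a href=x id=y>", @"<\w+(?:\s+(?<Key>\w+)=(?<Value>\w+))+");

        Dictionary<string, string> attrs = Pair(mt);
        Console.WriteLine(attrs["href"]); // x
        Console.WriteLine(attrs["id"]);   // y
    }
}
```

Zip stops at the shorter sequence, whereas the join silently drops any index that lacks a partner. With this pattern every repetition of the attribute group adds exactly one Key and one Value, so the two approaches give the same result.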
Full Code and Result
// Requires: using System; using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;

string node = @"<INPUT onblur=google&&google.fade&&google.fade() class=lst title='Google Search' value=TESTING maxLength=2048 size=55 name=q autocomplete='off' init='true'/>";

string pattern = @"
(?:<)(?<Tag>[^\s/>]+)          # Extract the tag name.
(?![/>])                       # Stop if /> is found
                               # -- Extract Attributes Key Value Pairs --
((?:\s+)                       # One to many spaces start the attribute
 (?<Key>[^=]+)                 # Name/key of the attribute
 (?:=)                         # Equals sign needs to be matched, but not captured.
 (?([\x22\x27])                # If quotes are found
     (?:[\x22\x27])
     (?<Value>[^\x22\x27]+)    # Place the value into named Capture
     (?:[\x22\x27])
   |                           # Else no quotes
     (?<Value>[^\s/>]*)        # Place the value into named Capture
 )
)+                             # -- One to many attributes found!";

var attributes = ( from Match mt in Regex.Matches( node, pattern, RegexOptions.IgnorePatternWhitespace )
                   select new
                   {
                       Name  = mt.Groups["Tag"],
                       Attrs = ( from cpKey in mt.Groups["Key"].Captures.Cast<Capture>().Select( ( a, i ) => new { a.Value, i } )
                                 join cpValue in mt.Groups["Value"].Captures.Cast<Capture>().Select( ( b, i ) => new { b.Value, i } )
                                   on cpKey.i equals cpValue.i
                                 select new KeyValuePair<string, string>( cpKey.Value, cpValue.Value )
                               ).ToDictionary( kvp => kvp.Key, kvp => kvp.Value )
                   }
                 ).First().Attrs;

foreach ( KeyValuePair<string, string> kvp in attributes )
    Console.WriteLine( "Key {0,15} Value: {1}", kvp.Key, kvp.Value );

/* Output:
Key          onblur Value: google&&google.fade&&google.fade()
Key           class Value: lst
Key           title Value: Google Search
Key           value Value: TESTING
Key       maxLength Value: 2048
Key            size Value: 55
Key            name Value: q
Key    autocomplete Value: off
Key            init Value: true
*/
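The query above assumes that the node matches and that no attribute name repeats: First() throws on an empty sequence and ToDictionary throws on a duplicate key. If the input is less predictable, one possible way to wrap the same idea in a reusable method is sketched below. The class `HtmlNodeAttributes` and the method `Parse` are hypothetical names, the pattern is a condensed single-line form of the one in this article, and the case-insensitive dictionary and "last duplicate wins" behavior are assumptions of this sketch, not requirements of the technique.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HtmlNodeAttributes
{
    // Condensed form of the article's pattern (no comments, so no IgnorePatternWhitespace needed).
    private const string Pattern =
        @"<(?<Tag>[^\s/>]+)(?![/>])(?:\s+(?<Key>[^=]+)=(?([\x22\x27])[\x22\x27](?<Value>[^\x22\x27]+)[\x22\x27]|(?<Value>[^\s/>]*)))+";

    // Hypothetical helper: returns the attributes of the first matching node,
    // or an empty dictionary when nothing matches.
    public static Dictionary<string, string> Parse(string node)
    {
        // Assumption: attribute names are compared without regard to case.
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        Match mt = Regex.Match(node, Pattern);
        if (!mt.Success)
            return result;

        CaptureCollection keys   = mt.Groups["Key"].Captures;
        CaptureCollection values = mt.Groups["Value"].Captures;

        // Pair captures by position; the indexer overwrites, so a repeated name keeps its last value.
        for (int i = 0; i < keys.Count && i < values.Count; i++)
            result[keys[i].Value] = values[i].Value;

        return result;
    }

    public static void Main()
    {
        var attrs = Parse(@"<INPUT onblur=google&&google.fade&&google.fade() class=lst title='Google Search' value=TESTING maxLength=2048 size=55 name=q autocomplete='off' init='true'/>");

        Console.WriteLine(attrs["title"]);     // Google Search
        Console.WriteLine(attrs["MAXLENGTH"]); // 2048
        Console.WriteLine(attrs.Count);        // 9

        // A node without attributes does not match the pattern at all.
        Console.WriteLine(Parse("<br/>").Count); // 0
    }
}
```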