C# Extracting CSV data into Linq and a Dictionary using Regular Expressions
I had written a post a while back which detailed a regular expression pattern used by the .Net regex parser which parsed a Comma Separated Value file, or a CSV file for short. Upon looking at the pattern I came to realize that the pattern didn’t work for all situations. So I have created a new pattern which will extract all items from the CSV data into into a dynamic anonymous Linq entity. Following that example I will show one how to use the same Linq entity to put that CSV data into a dictionary, a hash table, where the key of entry is the first column’s data.
CSV Considerations
- Data separated by a comma.
- Quotes, single or double are an optional encapsulation of data.
- Any data which has a comma must be encased in quotes.
- Quoted data can be single or double quote.
- Data rows can be ragged.
- Null data handled except for last column
- Last data column cannot be null.
'Alpha',,'01000000043','2','4',Regex Space 'Beta',333,444,"Other, Space",No Quote Space,'555'
Regular Expression Pattern
The things of note about the below pattern are
- Pattern needs Regex Options. Those options for this article are defined both in the pattern and the call to the regular expression parser; normalcy only its done only once.
- Pattern commented so IgnorePatternWhitespace option is needed. Note that option does not affect the regex parsing of the data.
- Multiline option needed so ^ matches the beginning of each line and $ matches the end, after the \r\n.
- Regular Expression if condition is used to test if the indivudal column data is enclosed in quotes. If it finds a quote it consumes the quotes but does not pass them on to the final data processing.
- Each line will correspond to one match
- All data put into named match capture called Column; hence the match will have all line values in the capture collection named Column.
(?xm) # Tell the compiler we are commenting (x = IgnorePatternWhitespace) # and tell the compiler this is multiline (m), # In Multiline the ^ matches each start line and $ is each EOL # -Pattern Start- ^( # Start at the beginning of the line always (?![\r\n]|$) # Stop the match if EOL or EOF found. (?([\x27\x22]) # Regex If to check for single/double quotes (?:[\x27\x22]) # \\x27\\x22 are single/double quotes (?<Column>[^\x27\x22]*)# Match this in the quotes and place in Named match Column (?:[\x27\x22]) | # or (else) part of If when Not within quotes (?<Column>[^,\r\n]*) # Not within quotes, but put it in the column ) # End of Pattern OR (?:,?) # Either a comma or EOL/EOF )+ # 1 or more columns of data.
Regex to Linq
Here is the code which will enumerate over each match and add the contents of the match capture collection into a dynamic linq entity. Notes:
- tThe code below uses the regex pattern mentioned above but does not show it for brevity.
- The regex options are set twice for example. One only needs to set them once.
string pattern = @" ... "; string text = /* Note the ,, as a null situation */ @"'Alpha',,'01000000043','2','4',Regex Space 'Beta',333,444,""Other, Space"",No Quote Space,'555'"; // We specified the Regex options in teh pattern, but we can also specify them here. // Both are redundant, decide which you prefer and use one. var CSVData = from Match m in Regex.Matches( text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline ) select new { Data = from Capture cp in m.Groups["Column"].Captures select cp.Value, }; int lineNo = 0; foreach ( var line in CSVData ) Console.WriteLine( string.Format("Line #{0}: {1}", ++lineNo, string.Join( "|", line.Data.ToArray() ) )); /* Output Line #1: 1||01000000043|2|4|Regex Space Line #2: 2|333|444|Other, Space|No Quote Space|555 */
Linq To Dictionary
Taking the same code above, specifically the dynamic Linq entity holder CSVData, we will transform it into a dictionary where the key into the hashtable is the first CSV column data item.
// Put into Dictionary where the key is the first csv column data. // Note the below creates a KeyValuePair using an integer for the // key whichextracted as the parsing goes on. It is not used. It // is simply shown for example of getting the index from Linq and // could be change to use the first column instead. Dictionary<string, List<string>> items2 = CSVData.Select( ( a, index ) => new KeyValuePair<int, List<string>>( index, a.Data.ToList() ) ) .ToDictionary( kvp => kvp.Value[0], kvp => kvp.Value ); foreach ( KeyValuePair<string, List<string>> kvp in items2 ) Console.WriteLine( "Key {0} : {1}", kvp.Key, string.Join( "|", kvp.Value.ToArray() ) ); /* Key Alpha : Alpha||01000000043|2|4|Regex Space Key Beta : Beta|333|444|Other, Space|No Quote Space|555 */
Full Code
string pattern = @" (?xm) # Tell the compiler we are commenting (x = IgnorePatternWhitespace) # and tell the compiler this is multiline (m), # In Multiline the ^ matches each start line and $ is each EOL # Pattern Start ^( # Start at the beginning of the line always (?![\r\n]|$) # Stop the match if EOL or EOF found. (?([\x27\x22]) # Regex If to check for single/double quotes (?:[\x27\x22]) # \\x27\\x22 are single/double quotes (?<Column>[^\x27\x22]*)# Match this in the quotes and place in Named match Column (?:[\x27\x22]) | # or (else) part of If when Not within quotes (?<Column>[^,\r\n]*) # Not within quotes, but put it in the column ) # End of Pattern OR (?:,?) # Either a comma or EOL/EOF )+ # 1 or more columns of data."; string text = /* Note the ,, as a null situation */ @"'Alpha',,'01000000043','2','4',Regex Space 'Beta',333,444,""Other, Space"",No Quote Space,'555'"; // We specified the Regex options in teh pattern, but we can also specify them here. // Both are redundant, decide which you prefer and use one. var CSVData = from Match m in Regex.Matches( text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline ) select new { Data = from Capture cp in m.Groups["Column"].Captures select cp.Value, }; int lineNo = 0; foreach ( var line in CSVData ) Console.WriteLine( string.Format("Line #{0}: {1}", ++lineNo, string.Join( "|", line.Data.ToArray() ) )); /* Output Line #1: 1||01000000043|2|4|Regex Space Line #2: 2|333|444|Other, Space|No Quote Space|555 */ // Put into Dictionary where the key is the first csv column data. // Note the below creates a KeyValuePair using an integer for the // key whichextracted as the parsing goes on. It is not used. It // is simply shown for example of getting the index from Linq and // could be change to use the first column instead. Dictionary<string, List<string>> items2 = CSVData.Select( ( a, index ) => new KeyValuePair<int, List<string>>( index, a.Data.ToList() ) ) .ToDictionary( kvp => kvp.Value[0], kvp => kvp.Value ); foreach ( KeyValuePair<string, List<string>> kvp in items2 ) Console.WriteLine( "Key {0} : {1}", kvp.Key, string.Join( "|", kvp.Value.ToArray() ) ); /* Key Alpha : Alpha||01000000043|2|4|Regex Space Key Beta : Beta|333|444|Other, Space|No Quote Space|555 */