Posts tagged ‘CSV’

C# Extracting CSV data into Linq and a Dictionary using Regular Expressions

One is uniqueI had written a post a while back which detailed a regular expression pattern used by the .Net regex parser which parsed a Comma Separated Value file, or a CSV file for short. Upon looking at the pattern I came to realize that the pattern didn’t work for all situations. So I have created a new pattern which will extract all items from the CSV data into into a dynamic anonymous Linq entity.  Following that example I will show one how to use the same Linq entity to put that CSV data into a dictionary, a hash table, where the key of entry is the first column’s data.

CSV Considerations

  1. Data separated by a comma.
  2. Quotes, single or double are an optional encapsulation of data.
  3. Any data which has a comma must be encased in quotes.
  4. Quoted data can be single or double quote.
  5. Data rows can be ragged.
  6. Null data handled except for last column
  7. Last data column cannot be null.
'Alpha',,'01000000043','2','4',Regex Space
'Beta',333,444,"Other, Space",No Quote Space,'555'

Regular Expression Pattern

The things of note about the below pattern are

  • Pattern needs Regex Options. Those options for this article are defined both in the pattern and the call to the regular expression parser; normalcy only its done only once.
    1. Pattern commented so IgnorePatternWhitespace option is needed. Note that option does not affect the regex parsing of the data.
    2. Multiline option needed so ^ matches the beginning of each line and $ matches the end, after the \r\n.
  • Regular Expression if condition is used to test if the indivudal column data is enclosed in quotes. If it finds a quote it consumes the quotes but does not pass them on to the final data processing.
  • Each line will correspond to one match
  • All data  put into named match capture called Column; hence the match will have all line values in the capture collection named Column.
(?xm)                        # Tell the compiler we are commenting (x = IgnorePatternWhitespace)
                             # and tell the compiler this is multiline (m),
                             # In Multiline the ^ matches each start line and $ is each EOL
                             # -Pattern Start-
^(                           # Start at the beginning of the line always
 (?![\r\n]|$)                # Stop the match if EOL or EOF found.
 (?([\x27\x22])              # Regex If to check for single/double quotes
      (?:[\x27\x22])         # \\x27\\x22 are single/double quotes
      (?<Column>[^\x27\x22]*)# Match this in the quotes and place in Named match Column
      (?:[\x27\x22])

  |                          # or (else) part of If when Not within quotes

     (?<Column>[^,\r\n]*)    # Not within quotes, but put it in the column
  )                          # End of Pattern OR

(?:,?)                       # Either a comma or EOL/EOF
)+                           # 1 or more columns of data.

Regex to Linq

Here is the code which will enumerate over each match and add the contents of the match capture collection into a dynamic linq entity. Notes:

  1. tThe code below uses the regex pattern mentioned above but does not show it for brevity.
  2. The regex options are set twice for example. One only needs to set them once.
string pattern = @" ... ";


string text = /* Note the ,, as a null situation */
@"'Alpha',,'01000000043','2','4',Regex Space
'Beta',333,444,""Other, Space"",No Quote Space,'555'";

// We specified the Regex options in teh pattern, but we can also specify them here.
// Both are redundant, decide which you prefer and use one.
var CSVData = from Match m in Regex.Matches( text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline )
              select new
              {
                  Data = from Capture cp in m.Groups["Column"].Captures
                         select cp.Value,
              };

int lineNo = 0;

foreach ( var line in CSVData )
    Console.WriteLine( string.Format("Line #{0}:  {1}", ++lineNo, string.Join( "|", line.Data.ToArray() ) ));

/* Output

Line #1:  1||01000000043|2|4|Regex Space
Line #2:  2|333|444|Other, Space|No Quote Space|555

*/

Linq To Dictionary

Taking the same code above, specifically the dynamic Linq entity holder CSVData, we will transform it into a dictionary where the key into the hashtable is the first CSV column data item.

// Put into Dictionary where the key is the first csv column data.
// Note the below creates a KeyValuePair using an integer for the
// key whichextracted as the parsing goes on. It is not used. It
// is simply shown for example of getting the index from Linq and
// could be change to use the first column instead.

Dictionary<string, List<string>> items2 =
    CSVData.Select( ( a, index ) => new KeyValuePair<int, List<string>>( index, a.Data.ToList() ) )
           .ToDictionary( kvp => kvp.Value[0], kvp => kvp.Value );


foreach ( KeyValuePair<string, List<string>> kvp in items2 )
      Console.WriteLine( "Key {0} : {1}", kvp.Key, string.Join( "|", kvp.Value.ToArray() ) );

/*
Key Alpha : Alpha||01000000043|2|4|Regex Space
Key Beta : Beta|333|444|Other, Space|No Quote Space|555
*/

Full Code

string pattern = @"
(?xm)                        # Tell the compiler we are commenting (x = IgnorePatternWhitespace)
                             # and tell the compiler this is multiline (m),
                             # In Multiline the ^ matches each start line and $ is each EOL
                             # Pattern Start
^(                           # Start at the beginning of the line always
 (?![\r\n]|$)                # Stop the match if EOL or EOF found.
 (?([\x27\x22])              # Regex If to check for single/double quotes
      (?:[\x27\x22])         # \\x27\\x22 are single/double quotes
      (?<Column>[^\x27\x22]*)# Match this in the quotes and place in Named match Column
      (?:[\x27\x22])

  |                          # or (else) part of If when Not within quotes

     (?<Column>[^,\r\n]*)    # Not within quotes, but put it in the column
  )                          # End of Pattern OR

(?:,?)                       # Either a comma or EOL/EOF
)+                           # 1 or more columns of data.";


string text = /* Note the ,, as a null situation */
@"'Alpha',,'01000000043','2','4',Regex Space
'Beta',333,444,""Other, Space"",No Quote Space,'555'";

// We specified the Regex options in teh pattern, but we can also specify them here.
// Both are redundant, decide which you prefer and use one.
var CSVData = from Match m in Regex.Matches( text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline )
              select new
              {
                  Data = from Capture cp in m.Groups["Column"].Captures
                         select cp.Value,
              };

int lineNo = 0;

foreach ( var line in CSVData )
    Console.WriteLine( string.Format("Line #{0}:  {1}", ++lineNo, string.Join( "|", line.Data.ToArray() ) ));

/* Output

Line #1:  1||01000000043|2|4|Regex Space
Line #2:  2|333|444|Other, Space|No Quote Space|555

*/

// Put into Dictionary where the key is the first csv column data.
// Note the below creates a KeyValuePair using an integer for the
// key whichextracted as the parsing goes on. It is not used. It
// is simply shown for example of getting the index from Linq and
// could be change to use the first column instead.

Dictionary<string, List<string>> items2 =
    CSVData.Select( ( a, index ) => new KeyValuePair<int, List<string>>( index, a.Data.ToList() ) )
           .ToDictionary( kvp => kvp.Value[0], kvp => kvp.Value );


foreach ( KeyValuePair<string, List<string>> kvp in items2 )
      Console.WriteLine( "Key {0} : {1}", kvp.Key, string.Join( "|", kvp.Value.ToArray() ) );

/*
Key Alpha : Alpha||01000000043|2|4|Regex Space
Key Beta : Beta|333|444|Other, Space|No Quote Space|555
*/
Share