C# Extracting CSV data into Linq and a Dictionary using Regular Expressions

Posted by OmegaMan at March 18, 2010

Category: Regular Expressions

Tags: , , ,

One is uniqueI had written a post a while back which detailed a regular expression pattern used by the .Net regex parser which parsed a Comma Separated Value file, or a CSV file for short. Upon looking at the pattern I came to realize that the pattern didn’t work for all situations. So I have created a new pattern which will extract all items from the CSV data into into a dynamic anonymous Linq entity.  Following that example I will show one how to use the same Linq entity to put that CSV data into a dictionary, a hash table, where the key of entry is the first column’s data.

CSV Considerations

  1. Data separated by a comma.
  2. Quotes, single or double are an optional encapsulation of data.
  3. Any data which has a comma must be encased in quotes.
  4. Quoted data can be single or double quote.
  5. Data rows can be ragged.
  6. Null data handled except for last column
  7. Last data column cannot be null.
'Alpha',,'01000000043','2','4',Regex Space
'Beta',333,444,"Other, Space",No Quote Space,'555'

Regular Expression Pattern

The things of note about the below pattern are

  • Pattern needs Regex Options. Those options for this article are defined both in the pattern and the call to the regular expression parser; normalcy only its done only once.
    1. Pattern commented so IgnorePatternWhitespace option is needed. Note that option does not affect the regex parsing of the data.
    2. Multiline option needed so ^ matches the beginning of each line and $ matches the end, after the \r\n.
  • Regular Expression if condition is used to test if the indivudal column data is enclosed in quotes. If it finds a quote it consumes the quotes but does not pass them on to the final data processing.
  • Each line will correspond to one match
  • All data  put into named match capture called Column; hence the match will have all line values in the capture collection named Column.
(?xm)                        # Tell the compiler we are commenting (x = IgnorePatternWhitespace)
                             # and tell the compiler this is multiline (m),
                             # In Multiline the ^ matches each start line and $ is each EOL
                             # -Pattern Start-
^(                           # Start at the beginning of the line always
 (?![\r\n]|$)                # Stop the match if EOL or EOF found.
 (?([\x27\x22])              # Regex If to check for single/double quotes
      (?:[\x27\x22])         # \\x27\\x22 are single/double quotes
      (?<Column>[^\x27\x22]*)# Match this in the quotes and place in Named match Column
      (?:[\x27\x22])

  |                          # or (else) part of If when Not within quotes

     (?<Column>[^,\r\n]*)    # Not within quotes, but put it in the column
  )                          # End of Pattern OR

(?:,?)                       # Either a comma or EOL/EOF
)+                           # 1 or more columns of data.

Regex to Linq

Here is the code which will enumerate over each match and add the contents of the match capture collection into a dynamic linq entity. Notes:

  1. tThe code below uses the regex pattern mentioned above but does not show it for brevity.
  2. The regex options are set twice for example. One only needs to set them once.
string pattern = @" ... ";


string text = /* Note the ,, as a null situation */
@"'Alpha',,'01000000043','2','4',Regex Space
'Beta',333,444,""Other, Space"",No Quote Space,'555'";

// We specified the Regex options in teh pattern, but we can also specify them here.
// Both are redundant, decide which you prefer and use one.
var CSVData = from Match m in Regex.Matches( text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline )
              select new
              {
                  Data = from Capture cp in m.Groups["Column"].Captures
                         select cp.Value,
              };

int lineNo = 0;

foreach ( var line in CSVData )
    Console.WriteLine( string.Format("Line #{0}:  {1}", ++lineNo, string.Join( "|", line.Data.ToArray() ) ));

/* Output

Line #1:  1||01000000043|2|4|Regex Space
Line #2:  2|333|444|Other, Space|No Quote Space|555

*/

Linq To Dictionary

Taking the same code above, specifically the dynamic Linq entity holder CSVData, we will transform it into a dictionary where the key into the hashtable is the first CSV column data item.

// Put into Dictionary where the key is the first csv column data.
// Note the below creates a KeyValuePair using an integer for the
// key whichextracted as the parsing goes on. It is not used. It
// is simply shown for example of getting the index from Linq and
// could be change to use the first column instead.

Dictionary<string, List<string>> items2 =
    CSVData.Select( ( a, index ) => new KeyValuePair<int, List<string>>( index, a.Data.ToList() ) )
           .ToDictionary( kvp => kvp.Value[0], kvp => kvp.Value );


foreach ( KeyValuePair<string, List<string>> kvp in items2 )
      Console.WriteLine( "Key {0} : {1}", kvp.Key, string.Join( "|", kvp.Value.ToArray() ) );

/*
Key Alpha : Alpha||01000000043|2|4|Regex Space
Key Beta : Beta|333|444|Other, Space|No Quote Space|555
*/

Full Code

string pattern = @"
(?xm)                        # Tell the compiler we are commenting (x = IgnorePatternWhitespace)
                             # and tell the compiler this is multiline (m),
                             # In Multiline the ^ matches each start line and $ is each EOL
                             # Pattern Start
^(                           # Start at the beginning of the line always
 (?![\r\n]|$)                # Stop the match if EOL or EOF found.
 (?([\x27\x22])              # Regex If to check for single/double quotes
      (?:[\x27\x22])         # \\x27\\x22 are single/double quotes
      (?<Column>[^\x27\x22]*)# Match this in the quotes and place in Named match Column
      (?:[\x27\x22])

  |                          # or (else) part of If when Not within quotes

     (?<Column>[^,\r\n]*)    # Not within quotes, but put it in the column
  )                          # End of Pattern OR

(?:,?)                       # Either a comma or EOL/EOF
)+                           # 1 or more columns of data.";


string text = /* Note the ,, as a null situation */
@"'Alpha',,'01000000043','2','4',Regex Space
'Beta',333,444,""Other, Space"",No Quote Space,'555'";

// We specified the Regex options in teh pattern, but we can also specify them here.
// Both are redundant, decide which you prefer and use one.
var CSVData = from Match m in Regex.Matches( text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline )
              select new
              {
                  Data = from Capture cp in m.Groups["Column"].Captures
                         select cp.Value,
              };

int lineNo = 0;

foreach ( var line in CSVData )
    Console.WriteLine( string.Format("Line #{0}:  {1}", ++lineNo, string.Join( "|", line.Data.ToArray() ) ));

/* Output

Line #1:  1||01000000043|2|4|Regex Space
Line #2:  2|333|444|Other, Space|No Quote Space|555

*/

// Put into Dictionary where the key is the first csv column data.
// Note the below creates a KeyValuePair using an integer for the
// key whichextracted as the parsing goes on. It is not used. It
// is simply shown for example of getting the index from Linq and
// could be change to use the first column instead.

Dictionary<string, List<string>> items2 =
    CSVData.Select( ( a, index ) => new KeyValuePair<int, List<string>>( index, a.Data.ToList() ) )
           .ToDictionary( kvp => kvp.Value[0], kvp => kvp.Value );


foreach ( KeyValuePair<string, List<string>> kvp in items2 )
      Console.WriteLine( "Key {0} : {1}", kvp.Key, string.Join( "|", kvp.Value.ToArray() ) );

/*
Key Alpha : Alpha||01000000043|2|4|Regex Space
Key Beta : Beta|333|444|Other, Space|No Quote Space|555
*/
Share

7 Comments

  1. Diogo Haas says

    I like the Linq/Regex approach you chose.

    The only thing I missed is the embedded quotes: “I have “”embedded”” quotes”.

    Regards,
    Diogo.

    Reply
  2. Chris Malone says

    Awesome work, thank you You’re a Regex wizard!

    Reply
  3. Chuck says

    This is great. However, there is a slight problem with it. If the last column in the CSV is empty then it doesn’t include that column though it does if the empty column is not the last column like in your current example. This causes a problem if the code is expecting a fixed number of columns. I’ve been beating my head against the wall trying to figure out a fix for it but I’m really not very good at regex. Any ideas?

    Reply
    • OmegaMan says

      That is the achilles’ heel of that pattern. What I recommend that you do is catch that scenario by checking the min and max lengths of the return. If you have a scenario where they don’t match…simply add the missing null to the list. See below to check for the condition. HTH

      var CSVData = Regex.Matches( text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline )
                          .OfType<Match>()
                          .Select( m => m.Groups["Column"].Captures.OfType<Capture>().Select( cp => cp.Value ) );
      
      if ( CSVData.Max( cp => cp.Count() ) != CSVData.Min( cp => cp.Count() ) )
          Console.WriteLine( "Counts are not the same" );
      
      Reply
  4. Scott Merrick says

    Hi

    We had a problem if there was a single quote inside the data, for instance:

    “test”,”test’s”

    we removed all of the x27 references and was fine. i think as long as the file isn’t single quote delimited, should be fine ?

    Reply
  5. Peter says

    I’m running into some weird problems trying to use this with some unicode characters. Do you have any useful tips on handling unicode in c#?

    Reply
    • OmegaMan says

      Can you provide an example on the MSDN regex forum please.

      Reply

Leave a comment

(required)
(required) (will not be published)

This site uses Akismet to reduce spam. Learn how your comment data is processed.