I had written a post a while back which detailed a regular expression pattern used by the .Net regex parser which parsed a Comma Separated Value file, or a CSV file for short. Upon looking at the pattern I came to realize that the pattern didn’t work for all situations. So I have created a new pattern which will extract all items from the CSV data into into a dynamic anonymous Linq entity. Following that example I will show one how to use the same Linq entity to put that CSV data into a dictionary, a hash table, where the key of entry is the first column’s data.
CSV Considerations
- Data separated by a comma.
- Quotes, single or double are an optional encapsulation of data.
- Any data which has a comma must be encased in quotes.
- Quoted data can be single or double quote.
- Data rows can be ragged.
- Null data handled except for last column
- Last data column cannot be null.
'Alpha',,'01000000043','2','4',Regex Space 'Beta',333,444,"Other, Space",No Quote Space,'555'
Regular Expression Pattern
The things of note about the below pattern are
- Pattern needs Regex Options. Those options for this article are defined both in the pattern and the call to the regular expression parser; normalcy only its done only once.
- Pattern commented so IgnorePatternWhitespace option is needed. Note that option does not affect the regex parsing of the data.
- Multiline option needed so ^ matches the beginning of each line and $ matches the end, after the \r\n.
- Regular Expression if condition is used to test if the indivudal column data is enclosed in quotes. If it finds a quote it consumes the quotes but does not pass them on to the final data processing.
- Each line will correspond to one match
- All data put into named match capture called Column; hence the match will have all line values in the capture collection named Column.
(?xm) # Tell the compiler we are commenting (x = IgnorePatternWhitespace)
# and tell the compiler this is multiline (m),
# In Multiline the ^ matches each start line and $ is each EOL
# -Pattern Start-
^( # Start at the beginning of the line always
(?![\r\n]|$) # Stop the match if EOL or EOF found.
(?([\x27\x22]) # Regex If to check for single/double quotes
(?:[\x27\x22]) # \\x27\\x22 are single/double quotes
(?<Column>[^\x27\x22]*)# Match this in the quotes and place in Named match Column
(?:[\x27\x22])
| # or (else) part of If when Not within quotes
(?<Column>[^,\r\n]*) # Not within quotes, but put it in the column
) # End of Pattern OR
(?:,?) # Either a comma or EOL/EOF
)+ # 1 or more columns of data.
Regex to Linq
Here is the code which will enumerate over each match and add the contents of the match capture collection into a dynamic linq entity. Notes:
- tThe code below uses the regex pattern mentioned above but does not show it for brevity.
- The regex options are set twice for example. One only needs to set them once.
string pattern = @" ... ";
string text = /* Note the ,, as a null situation */
@"'Alpha',,'01000000043','2','4',Regex Space
'Beta',333,444,""Other, Space"",No Quote Space,'555'";
// We specified the Regex options in teh pattern, but we can also specify them here.
// Both are redundant, decide which you prefer and use one.
var CSVData = from Match m in Regex.Matches( text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline )
select new
{
Data = from Capture cp in m.Groups["Column"].Captures
select cp.Value,
};
int lineNo = 0;
foreach ( var line in CSVData )
Console.WriteLine( string.Format("Line #{0}: {1}", ++lineNo, string.Join( "|", line.Data.ToArray() ) ));
/* Output
Line #1: 1||01000000043|2|4|Regex Space
Line #2: 2|333|444|Other, Space|No Quote Space|555
*/
Linq To Dictionary
Taking the same code above, specifically the dynamic Linq entity holder CSVData, we will transform it into a dictionary where the key into the hashtable is the first CSV column data item.
// Put into Dictionary where the key is the first csv column data.
// Note the below creates a KeyValuePair using an integer for the
// key whichextracted as the parsing goes on. It is not used. It
// is simply shown for example of getting the index from Linq and
// could be change to use the first column instead.
Dictionary<string, List<string>> items2 =
CSVData.Select( ( a, index ) => new KeyValuePair<int, List<string>>( index, a.Data.ToList() ) )
.ToDictionary( kvp => kvp.Value[0], kvp => kvp.Value );
foreach ( KeyValuePair<string, List<string>> kvp in items2 )
Console.WriteLine( "Key {0} : {1}", kvp.Key, string.Join( "|", kvp.Value.ToArray() ) );
/*
Key Alpha : Alpha||01000000043|2|4|Regex Space
Key Beta : Beta|333|444|Other, Space|No Quote Space|555
*/
Full Code
string pattern = @"
(?xm) # Tell the compiler we are commenting (x = IgnorePatternWhitespace)
# and tell the compiler this is multiline (m),
# In Multiline the ^ matches each start line and $ is each EOL
# Pattern Start
^( # Start at the beginning of the line always
(?![\r\n]|$) # Stop the match if EOL or EOF found.
(?([\x27\x22]) # Regex If to check for single/double quotes
(?:[\x27\x22]) # \\x27\\x22 are single/double quotes
(?<Column>[^\x27\x22]*)# Match this in the quotes and place in Named match Column
(?:[\x27\x22])
| # or (else) part of If when Not within quotes
(?<Column>[^,\r\n]*) # Not within quotes, but put it in the column
) # End of Pattern OR
(?:,?) # Either a comma or EOL/EOF
)+ # 1 or more columns of data.";
string text = /* Note the ,, as a null situation */
@"'Alpha',,'01000000043','2','4',Regex Space
'Beta',333,444,""Other, Space"",No Quote Space,'555'";
// We specified the Regex options in teh pattern, but we can also specify them here.
// Both are redundant, decide which you prefer and use one.
var CSVData = from Match m in Regex.Matches( text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline )
select new
{
Data = from Capture cp in m.Groups["Column"].Captures
select cp.Value,
};
int lineNo = 0;
foreach ( var line in CSVData )
Console.WriteLine( string.Format("Line #{0}: {1}", ++lineNo, string.Join( "|", line.Data.ToArray() ) ));
/* Output
Line #1: 1||01000000043|2|4|Regex Space
Line #2: 2|333|444|Other, Space|No Quote Space|555
*/
// Put into Dictionary where the key is the first csv column data.
// Note the below creates a KeyValuePair using an integer for the
// key whichextracted as the parsing goes on. It is not used. It
// is simply shown for example of getting the index from Linq and
// could be change to use the first column instead.
Dictionary<string, List<string>> items2 =
CSVData.Select( ( a, index ) => new KeyValuePair<int, List<string>>( index, a.Data.ToList() ) )
.ToDictionary( kvp => kvp.Value[0], kvp => kvp.Value );
foreach ( KeyValuePair<string, List<string>> kvp in items2 )
Console.WriteLine( "Key {0} : {1}", kvp.Key, string.Join( "|", kvp.Value.ToArray() ) );
/*
Key Alpha : Alpha||01000000043|2|4|Regex Space
Key Beta : Beta|333|444|Other, Space|No Quote Space|555
*/
#1 by Diogo Haas on March 19, 2010 - 9:50 am
Quote
I like the Linq/Regex approach you chose.
The only thing I missed is the embedded quotes: “I have “”embedded”" quotes”.
Regards,
Diogo.