Posts Tagged String.Split

Are C# .Net Regular Expressions Fast Enough for You?

FastEnough It is generally accepted that there is an overhead in using regular expression parsing and  there is truth to that statement. But the premise of this article is that the difference is really negligible and if its an excuse to not learn regex pattern processing because of that, well that is just plain foolish.  Just like any high level language programming construct which gives the developer a quicker development time, the price paid is in extra cycles it takes to complete it. But is the perception that usage of regular expressions are really that slow? Let me  show you by example….

The MSDN forums are littered with the vague warnings “Don’t use regex, its slow”. I have seen that advice given and yes its based on a truth as mentioned before, but they never add in the time it takes to subsequently process the information.  They forget that in most cases Regular Expressions already provides the post processing needs such storage and data extraction abilities built in.

It comes down to…Is it fast enough for you?

If one needs to shave off milliseconds from a multi-million operation, then don’t use regular expressions or at least do tests first. But for day to day use, I believe its always the right answer. With that premise, let us test some code.

Premise

The usual contender for a regular expression is string.Split. Now string.Split is a fast little function and very useful, but one has to then  consider the ancillary processing and I have found a real example culled from the forums.

The Test

A user asked what could be used to parse specific text and whether regular expressions could be used. The example text, changed slightly, the value 41 was used instead of 0, looked like this

name="rating_count" value="41"

The user was interested in achieving the value of 41 as an integer and wondered which is better.

The Opponent

Right out of the gate there was an answer saying Regex is slower and gave an example which actually failed. I have modified it to work. The originator had tested zero and didn’t realize they were getting a default value instead of an extracted value because it was only splitting on the ‘=’ character. In my test it is fixed and placed into a static method called Highway:

public static int Highway(string text)
{

string []parts = text.Split( new char[] { ' ', '=', '\x22' }, StringSplitOptions.RemoveEmptyEntries );
int value = 0;
for(int index = 0; index < parts.Length-1; index++)
   if(parts[index].ToLower() == "value") {
      string tempValue = parts[index+1];
      int.TryParse(tempValue, out value);
      break;
    }

return value;

}

Note that \x22 is hex for quotes(“).

The Contender

Here is what I wrote to do the same job in Regular Expressions which I called MyWay (get it MyWay or the Highway…bwhahahaha…nevermind)

public static int MyWay( string text )
{
int value = 0;

int.TryParse( Regex.Match( text, "(?:value=\x22)([^\x22]+)", RegexOptions.Compiled ).Groups[1].Value, out value );

return value;
}

Now I knew that this would be run multiple times so I told .Net to compile the expression for future uses after the first, but if this is a one off operation one should not do that.

The Cage

Here is the testing arena for the two operations. I throw away the first value, which does help regex in the long run due to the compilation, but frankly a one off test without the compilation flag is not to shabby. If you try this at home don’t forget System.Diagnostics using.

string data = string.Format("name={0}rating_count{0} value={0}41{0}", "\x22");
Stopwatch st = new Stopwatch();
int index;
int totalRuns = 100000;

Highway( data ); // Do a test and throw it out

st.Start();
for (index = 0; index < totalRuns; index++)
    Highway( data );
st.Stop();

Console.WriteLine( "Non Regex:\t{0}\tAvg Per Run:\t{1}", st.Elapsed.TotalMilliseconds, st.Elapsed.TotalMilliseconds / totalRuns );

MyWay( data ); // Throw out the first

st.Reset();
st.Start();
for ( index = 0; index < totalRuns; index++ )
    MyWay( data );
st.Stop();

Console.WriteLine( "Regex:\t\t{0}\tAvg Per Run:\t{1}", st.Elapsed.TotalMilliseconds, st.Elapsed.TotalMilliseconds / totalRuns );

Results

So what happens? Well in Release mode for 100000 times produces results like this result on a dual core machine (Total Milliseconds values):

Non Regex:      213.9509        Avg Per Run:    0.002139509
Regex:          226.7564        Avg Per Run:    0.002267564

So the difference was not really that great…and though the times for the non regex were usually faster overall, there wasn’t too great of a difference between the two.

So one has to ask, “Is Regex fast enough for you?”

I believe that to be yes! Note, in fairness, poorly formed regex patterns will slow the parser down, but garbage in garbage out; so yes your mileage will vary.

Share

Tags: ,

C#: Splitting Data From a String and Extracting out Decimals and Integers into Separate Lists Using Extension Methods

Sometimes you have the need to extract text out of a string, once extracted you need to get the right data into lists. Say one list of integers and one list of decimals. Using the extension methods now found in .Net one can easily do those operations.

Example

Here is the data in a string, “12 34 56.1 18 19.25 41”. We want the integers separated out from the decimals into their own lists respectively as their native values (int/decimal) and not strings. Using Where, Select and a couple of Lambdas we can get to the intended  result.

string data =@"12 34 56.1 18 19.25 41";

string[] dataAsStrings = data.Split( ' ' );

List<decimal> decs = ( dataAsStrings.Where( itm => itm.Contains( '.' ) ) )
                  .Select( itm => Decimal.Parse( itm ) )
                  .ToList();

List<int> ints = ( dataAsStrings.Where( itm => itm.Contains( '.' ) == false ) ).Select( itm => int.Parse( itm ) ).ToList();

foreach ( decimal dc in decs )
    Console.WriteLine( dc );
/* Outputs 56.1 19.25 (on seperate lines)*/

foreach ( int it in ints)
    Console.WriteLine( it );
/* Outputs 12 34 18 41 (on seperate lines)*/
  • Step one is top split the string using string.Split.
  • Step 2 is take that split data and enumerate over it using the Where method.
  • Within the Where we use a lambda which basically says for each item in the list, if it contains a period select it and return it. Note the result of most Extension methods is deemed a projection which is really a list of the values. Where, Select and ToList all return projections.
  • Once the Where is done we call/chain the Select Extension. The select is saying take the projection from the Where and on each item parse out the value and return it in its native, non string format.
  • Using the ToList on the projection created by the Select we take the items and specify it to be in a List<> format.

We didn’t have to do the ToList and could have said something like this:

var decs = ( dataAsStrings.Where( itm => itm.Contains( '.' ) ) )
        .Select( itm => Decimal.Parse( itm ) );
foreach ( decimal dc in decs )
    Console.WriteLine( dc );
/* Outputs 56.1 19.25 (on seperate lines)*/

Which would have been just as vaild. At runtime the var would have been internally known as IEnumerable<decimal> under the covers and as you can see that is just a different type of list and works just the same.

Share

Tags: , , , , , ,