Bizarre benchmark result -- C# hundreds of times slower than Java?

Michael A. Covington

unread,

Mar 11, 2008, 10:21:02 AM3/11/08

to

While asking some Java enthusiasts what they think about C#, I came across
this:

http://www.manageability.org/blog/archive/20030520%23p_the_problem_with_cameron

Reportedly, the (essentially) same program in C# is much, much slower than
in Java.

This is a program that is heavy on regexes (which I'm not an expert on) and
am wondering if the C# version makes an elementary blunder. Do any experts
want to have a look? See also comp.lang.java.

(Query: Is he compiling the regex once in Java, but every time through the
loop in C#?)

Jon Skeet [C# MVP]

unread,

Mar 11, 2008, 10:33:52 AM3/11/08

to

Looks like he's compiling it every time in both cases - which is stupid
on both platforms.

One thing to note is that this test almost certainly says *nothing*
about the performance of the languages. Instead, it is a benchmark of
the regular expression implementations.

Oh, and it was created in 2003. Worth trying again, preferably having
fixed the code to not be braindead.

--
Jon Skeet - <sk...@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk

Michael A. Covington

unread,

Mar 11, 2008, 12:13:31 PM3/11/08

to

Thanks!

"Jon Skeet [C# MVP]" <sk...@pobox.com> wrote in message
news:MPG.2240a8a...@msnews.microsoft.com...

Jon Skeet [C# MVP]

unread,

Mar 11, 2008, 12:32:04 PM3/11/08

to

Michael A. Covington <lo...@ai.uga.edu.for.address> wrote:

> Thanks!

Having looked again at the code, you can't actually compile a single
regular expression and reuse it - they're creating a new regular
expression on each iteration.

However, this shows:

1) You shouldn't compile the regex unless you're going to reuse it
2) You shouldn't use regexes for cases where simple string matches work
just as well

It really is a completely bogus test. The claim that "You could build a
spam filter, a RSS categorization engine or even a rule based message
broker" is rubbish - because in those cases there would be a relatively
low number of regular expressions, tested against a high number of
messages. In those cases you *would* want to compile the regex, and you
could easily get completely different performance out.

To make the C# version run comparably with the Java version (still
slower, admittedly - a factor of 2 on my box) just take out
RegexOptions.Compile.

Lasse Vågsæther Karlsen

unread,

Mar 11, 2008, 12:51:38 PM3/11/08

to

Jon Skeet [C# MVP] wrote:
> Michael A. Covington <lo...@ai.uga.edu.for.address> wrote:
>> While asking some Java enthusiasts what they think about C#, I came across
>> this:
>>
>> http://www.manageability.org/blog/archive/20030520%23p_the_problem_with_cameron
>>
>> Reportedly, the (essentially) same program in C# is much, much slower than
>> in Java.
>>
>> This is a program that is heavy on regexes (which I'm not an expert on) and
>> am wondering if the C# version makes an elementary blunder. Do any experts
>> want to have a look? See also comp.lang.java.
>>
>> (Query: Is he compiling the regex once in Java, but every time through the
>> loop in C#?)
>
> Looks like he's compiling it every time in both cases - which is stupid
> on both platforms.
>
> One thing to note is that this test almost certainly says *nothing*
> about the performance of the languages. Instead, it is a benchmark of
> the regular expression implementations.
>
> Oh, and it was created in 2003. Worth trying again, preferably having
> fixed the code to not be braindead.
>

On my laptop the code never finishes as posted (at least not in the time
I allowed it to run), but if I remove the RegexOptions.Compiled, it
completes in about 6.5 seconds. I have no idea how the java program
would fare on my machine so I won't vouch for its performance compared
to the "equivalent" java program.

Let's see what the documentation says about .Compiled:

"Specifies that the regular expression is compiled to an assembly. This
yields faster execution but increases startup time. This value should
not be assigned to the Options property when calling the
CompileToAssembly method."

Ok, so lets check what Pattern.compile does in java:

"Compiles the given regular expression into a pattern."

Hm, not saying much exactly but from docs and all the sources I can
find, Pattern.compile(...) is the same as new Regex(...) in .NET.

So in .NET, it is compiled into a binary assembly running the same kind
of code as though you built the state machine from scratch, and in java
it builds the data structures.

Well, duh, this will take more time. There is no point in compiling the
regular expression into an assembly if its to be executed once, like the
code posted does.

This is the main problem with most benchmarks that try to prove one
language better than another, they invariably fall short on some kind of
user problem.

In any case, as Jon pointed out, this isn't a benchmark that tells you
anything useful. If anything, it tells you that of the two programs, the
java one completed, and the C# one didn't and ran a lot slower to boot.
It doesn't say anything about what can be accomplished if you use the
platform to its fullest, or at least avoid mistakes like this.

And besides, measuring the performance of C# would be a whole other
task. What the program *attempts* to do is measure one part of the BCL
against a similar one in Java.

And a rather bad attempt at that.

--
Lasse Vågsæther Karlsen
mailto:la...@vkarlsen.no
http://presentationmode.blogspot.com/
PGP KeyID: 0xBCDEA2E3

Ethan Strauss

unread,

Mar 11, 2008, 4:04:00 PM3/11/08

to

That code is very interesting. I have written for similar code for a real DNA
sequence manipulation application and I found that Regex's have a large
amount of variability in the time it takes them to run. Most were taking 1-3
ticks and occassionally one would take 3 seconds! I never did track down
exactly what was going on, but I switched from using a Regex to using a
series of string.IndexOf() statements. and that consistently runs much
faster.
Has anyone else seen this sort of variability in Regex time to run?

Ethan
Ethan Strauss Ph.D.
Bioinformatics Scientist
Promega Corporation
2800 Woods Hollow Rd.
Madison, WI 53711
608-274-4330
800-356-9526
ethan....@promega.com

Jeroen Mostert

unread,

Mar 11, 2008, 4:36:02 PM3/11/08

to

Ethan Strauss wrote:
> That code is very interesting. I have written for similar code for a real DNA
> sequence manipulation application and I found that Regex's have a large
> amount of variability in the time it takes them to run. Most were taking 1-3
> ticks and occassionally one would take 3 seconds! I never did track down
> exactly what was going on, but I switched from using a Regex to using a
> series of string.IndexOf() statements. and that consistently runs much
> faster.
> Has anyone else seen this sort of variability in Regex time to run?
>

Plenty, usually caused by the fact that regexes aren't regular expressions
(the theoretical constructs, which always match in linear time). See, e.g.,
http://www.codinghorror.com/blog/archives/000488.html and
http://www.regular-expressions.info/catastrophic.html.

The bottom line is that advanced regexes need optimization just like code
does. People often assume that anything they can throw in a regex will
magically run fast, but alas that's not the case.

--
J.

Alun Harford

unread,

Mar 11, 2008, 7:40:48 PM3/11/08

to

Ethan Strauss wrote:
> That code is very interesting. I have written for similar code for a real DNA
> sequence manipulation application and I found that Regex's have a large
> amount of variability in the time it takes them to run. Most were taking 1-3
> ticks and occassionally one would take 3 seconds! I never did track down
> exactly what was going on, but I switched from using a Regex to using a
> series of string.IndexOf() statements. and that consistently runs much
> faster.
> Has anyone else seen this sort of variability in Regex time to run?

Yes. C# Regular Expressions can potentially take exponential time to run
(IIRC they're NP-hard).

And I hope you realise that using strings to represent DNA sequences
except for input/output makes baby Jesus cry. Your programs would be
faster and more maintainable using your own type (probably backed by a
byte[] and some unsafe code). The best way to represent it normally
depends on what you're doing and whether you need to consider SNPs, but
Unicode strings are a bad idea!

Alun Harford

Shalin Shah

unread,

Mar 12, 2008, 11:54:35 PM3/12/08

to

> This is a program that is heavy on regexes (which I'm not an expert on) and
> am wondering if the C# version makes an elementary blunder. Do any experts
> want to have a look? See also comp.lang.java.
>
> (Query: Is he compiling the regex once in Java, but every time through the
> loop in C#?)

I think he is compiling the regular expression each time in the loop.
A good benchmark would be compiling it once and matching it in the
loop. Maybe C# uses a DFA-NFA hybrid (which might explain the large
memory usage, as the author of the article claims) which has the
potential of matching a regular expression several times faster than a
backtracking implementation, but it would compile the regular
expression slower than backtracking.

Another good benchmark would include the use of backreferences, which
forces a regex implementation to use backtracking.

FYI, egrep uses a DFA-NFA hybrid.

Ethan Strauss

unread,

Mar 13, 2008, 11:49:01 AM3/13/08

to

I don't want to be the one making Jesus cry, but I am not sure that my code
is what is doing it. I see that representing DNA as strings is not going to
make the cpu as happy as it could be, but it is not obvious to me how to
represent DNA (and RNA and protein, so I can't use a byte array anymore) as a
numeric array and still get relatively programmer friendly functionality.
I started yesterday (code below...) and stopped pretty rapidly because I
don't see a way to recreate IndexOf or Regex type functionality without a lot
of work! If you have anything more complete I would be interested!
Thanks,
Ethan

using System;
using System.Collections.Generic;
using System.Text;

namespace TestSequence
{
public struct DNA
{
private DNABase[] _Sequence;
public DNA(string sequence)
{
List<DNABase> ThisSequence = new List<DNABase>();
foreach (char thisBase in sequence.ToUpper().ToCharArray())
{
DNABase NextBase;
switch (thisBase)
{
case "G":
{
NextBase = DNABase.G;
break;
}
case "A":
{
NextBase = DNABase.A;
break;
}
case "T":
{
NextBase = DNABase.T;
break;
}
case "C":
{
NextBase = DNABase.C;
break;
}
default:
{
continue;
}
}
ThisSequence.Add(NextBase);
}
_Sequence = ThisSequence.ToArray();
}
}
public enum DNABase : byte
{
N = 0,
G = 1,
A = 2,
T = 3,
C = 4

Jon Skeet [C# MVP]

unread,

Mar 13, 2008, 12:36:20 PM3/13/08

to

Ethan Strauss <EthanS...@discussions.microsoft.com> wrote:
> I don't want to be the one making Jesus cry, but I am not sure that my code
> is what is doing it. I see that representing DNA as strings is not going to
> make the cpu as happy as it could be, but it is not obvious to me how to
> represent DNA (and RNA and protein, so I can't use a byte array anymore) as a
> numeric array and still get relatively programmer friendly functionality.

I'd create a type (probably *not* a value type, btw) representing a DNA
sequence, which *internally* used an efficient format, but which could
also present its contents as strings.

Basically it looks like you've started the right way, and you just need
to write a byte equivalent of IndexOf. That shouldn't be too hard to do
- and Array.IndexOf may sort it out for you. What kind of regex
capabilities do you need?

Ethan Strauss

unread,

Mar 13, 2008, 12:56:05 PM3/13/08

to

Thanks,
I will think about it some more. In general, I am much more concerned with
usability and correctness than effeciency, but I certainly won't turn down
effeciency gains if they are just sitting there. If I end up just storing DNA
as an int array and then, for most uses, convert it back to a string, will
that actually gain anything?

The biggest Regex need is shown below. I do use some other Regex's, but this
is the one I don't think I could really replace. I guess I can always convert
to a string and then create the Regex from that is need be.
Ethan

/// <summary>
/// Creates a regular expression of all sequences which could code
for a specific amino acid Sequence
/// </summary>
/// <param name="residues">A string containing the amino acid
Sequence</param>
/// <returns>a string which represents the Regular
expression</returns>
public static string BackTranslateToRegex(string residues)
{
if (residues == null)
{
throw new ArgumentNullException("residues", "you need a
Sequence from which to make a regex.");
}
Hashtable BackTranslations = new Hashtable();
BackTranslations.Add('*', "T(GA|A[GA])");
BackTranslations.Add('A', "GC[GATC]");
BackTranslations.Add('C', "TG[TC]");
BackTranslations.Add('D', "GA[TC]");
BackTranslations.Add('E', "GA[GA]");
BackTranslations.Add('F', "TT[TC]");
BackTranslations.Add('G', "GG[GATC]");
BackTranslations.Add('H', "CA[TC]");
BackTranslations.Add('I', "AT[ATC]");
BackTranslations.Add('K', "AA[GA]");
BackTranslations.Add('L', "(TT[GA]|CT[GATC])");
BackTranslations.Add('M', "ATG");
BackTranslations.Add('N', "AA[TC]");
BackTranslations.Add('P', "CC[GATC]");
BackTranslations.Add('Q', "CA[GA]");
BackTranslations.Add('R', "(AG[GA]|CG[GATC])");
BackTranslations.Add('S', "(AG[TC]|TC[GATC])");
BackTranslations.Add('T', "AC[GATC]");
BackTranslations.Add('V', "GT[GATC]");
BackTranslations.Add('W', "TGG");
BackTranslations.Add('Y', "TA[TC]");
StringBuilder Expression = new StringBuilder();
char[] Residues = residues.ToCharArray();
foreach (char Residue in Residues)
{
Expression.Append(BackTranslations[Residue]);
}
return Expression.ToString();
}

"Jon Skeet [C# MVP]" wrote:

Jon Skeet [C# MVP]

unread,

Mar 13, 2008, 1:08:14 PM3/13/08

to

Ethan Strauss <EthanS...@discussions.microsoft.com> wrote:
> I will think about it some more. In general, I am much more concerned with
> usability and correctness than effeciency, but I certainly won't turn down
> effeciency gains if they are just sitting there. If I end up just storing DNA
> as an int array and then, for most uses, convert it back to a string, will
> that actually gain anything?

Well, I'd say that the biggest gain is in terms of concepts. Your
sequence should logically be a sequence of individual strongly typed
elements, not arbitrary characters in a string.

> The biggest Regex need is shown below. I do use some other Regex's, but this
> is the one I don't think I could really replace. I guess I can always convert
> to a string and then create the Regex from that is need be.

Yes, if necessary. Are those A-Y sequences standardised amino acid
sequence identifiers? You could potentially create another type for
those, of course... but that would end up being quite a bit more work,
I suspect.

Jesse Houwing

unread,

Mar 14, 2008, 4:04:00 PM3/14/08

to

Hello Michael,

Replacing
Regex regexpr = new Regex(matchthis, RegexOptions.Compiled);

with
Regex regexpr = new Regex(matchthis, RegexOptions.None);

made it fly.

--
Jesse Houwing
jesse.houwing at sogeti.nl