Row comparisons for Rhino ETL


webpaul

Feb 21, 2009, 11:13:52 AM
to Rhino Tools Dev
I looked at some recent changes and one of them was for checking row
equality. I noticed there was a specific test for an (int)1 not being
equal to a (byte)1 - is that the desired behavior or was the test put
in there just to demonstrate that subtlety? I did a little test and
was surprised by the .NET Framework behavior below; I would have
thought they would be equal:

object a = (int)1;
object b = (byte)1;

Assert.IsFalse(a.Equals(b));

I'm guessing the framework just returns false if the types are
different in the Equals implementation.

So I understand why the test behaves how it does, just curious if that
is the desired effect or just due to the above and you wanted it to be
clear.
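A minimal sketch of the behavior described above, for readers who want to reproduce it. The class and method names are hypothetical and are not part of Rhino ETL. It relies only on the documented behavior of Int32.Equals(object), which returns true only when the argument is itself a boxed int.

```csharp
using System;

public static class BoxedEqualityDemo
{
    // Int32.Equals(object) returns false unless the argument is a boxed int.
    public static bool BoxedEquals(object a, object b) => a.Equals(b);

    // Unboxing to the concrete types lets C# apply its usual numeric promotion.
    public static bool UnboxedEquals(object a, object b) => (int)a == (byte)b;

    public static void Main()
    {
        object a = (int)1;
        object b = (byte)1;

        Console.WriteLine(BoxedEquals(a, b));   // False
        Console.WriteLine(UnboxedEquals(a, b)); // True
    }
}
```

The second method only works here because the concrete types are known at compile time, which is exactly what a generic row comparison does not have.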

Simone Busoli

Feb 21, 2009, 11:43:11 AM
to rhino-t...@googlegroups.com
That was to point out the subtlety in the .NET fx. I already discussed it; please look up "row equality" on the mailing list. I think this can be addressed in several ways, but I haven't taken the time to do it yet.

webpaul

Feb 21, 2009, 9:54:42 PM
to Rhino Tools Dev
OK, mission accomplished then. It makes sense once you think about it.
I certainly don't have any burning need for it to work, and the easy
workaround is to cast one of the items as they are read in if it
becomes an issue, so I think it's fine. Just wanted to check if that
was a desired thing or not.

Simone Busoli

Feb 22, 2009, 11:36:54 AM
to rhino-t...@googlegroups.com
Actually, when you're doing a join it would be a very cool feature to have. I spent quite some time wondering why the rows didn't join correctly, and it was because the field on which it was performing the join was an integer on one side and a byte on the other. So far, the solution has been to write tests which ensure that the two sides of the join have the same field types, but I would like to solve it at the RhinoETL level.

Ayende Rahien

Feb 22, 2009, 12:33:38 PM
to rhino-t...@googlegroups.com
+1

webpaul

Feb 22, 2009, 1:49:49 PM
to Rhino Tools Dev
How are you thinking of doing it? Casting up should always be safe, so
you could always cast any numeric type to double or something like
that in order to compare. That way you could compare 1 with 1.00 also.
Not sure if that is a perf problem or not though.
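One caveat on widening everything to double, sketched below with hypothetical names: double has a 53-bit significand, so large long values can collapse onto the same double. This is only an illustration of that limit, not a statement about what Rhino ETL does.

```csharp
using System;

public static class WideningDemo
{
    // Compares two longs after widening both to double.
    public static bool EqualAsDouble(long x, long y) => (double)x == (double)y;

    // decimal can represent every long exactly, so no collapse happens here.
    public static bool EqualAsDecimal(long x, long y) => (decimal)x == (decimal)y;

    public static void Main()
    {
        long big = long.MaxValue;
        Console.WriteLine(EqualAsDouble(big, big - 1));  // True: both round to 2^63
        Console.WriteLine(EqualAsDecimal(big, big - 1)); // False
        Console.WriteLine(EqualAsDouble(1, 1));          // True
    }
}
```

For small values such as the 1 versus 1.00 case the double comparison behaves as expected; the issue only appears near the top of the long range.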


Ayende Rahien

Feb 22, 2009, 1:52:57 PM
to rhino-t...@googlegroups.com
Custom Comparators for the join.
We can detect them not being of the same type and coerce them to the bigger type
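One way the "coerce to the bigger type" idea could look, as a rough sketch. The Rank table and the WiderTypeComparer name are assumptions made for illustration; the ranking mirrors the order of C#'s implicit numeric conversions for a handful of types and deliberately ignores unsigned types and decimal.

```csharp
using System;

public static class WiderTypeComparer
{
    // Hypothetical ranking for illustration only: a higher rank means "wider".
    private static int Rank(Type type)
    {
        switch (Type.GetTypeCode(type))
        {
            case TypeCode.Byte:   return 1;
            case TypeCode.Int16:  return 2;
            case TypeCode.Int32:  return 3;
            case TypeCode.Int64:  return 4;
            case TypeCode.Single: return 5;
            case TypeCode.Double: return 6;
            default: throw new NotSupportedException("No rank defined for " + type);
        }
    }

    public static bool AreEqual(object first, object second)
    {
        if (first == null || second == null)
            return first == null && second == null;

        Type firstType = first.GetType();
        Type secondType = second.GetType();
        if (firstType == secondType)
            return first.Equals(second);

        // Convert both values to whichever type ranks higher, then compare.
        Type wider = Rank(firstType) >= Rank(secondType) ? firstType : secondType;
        object left = Convert.ChangeType(first, wider);
        object right = Convert.ChangeType(second, wider);
        return left.Equals(right);
    }
}
```

Because the conversion always goes toward the wider type, comparing byte.MaxValue with int.MaxValue converts the byte to int and returns false without any overflow.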

Simone Busoli

Feb 22, 2009, 1:56:48 PM
to rhino-t...@googlegroups.com
What about LCG with expressions? They know how to compare each other, when they know who they are :)

Ayende Rahien

Feb 22, 2009, 2:01:23 PM
to rhino-t...@googlegroups.com
Go for it :-)
That would actually keep us consistent with the appropriate C# behavior, which is the expected one.

Simone Busoli

Feb 22, 2009, 2:03:51 PM
to rhino-t...@googlegroups.com
Right :)  I'm not sure I can take the time in the next few days, though, but it's on my todo list.

Simone Busoli

Mar 3, 2009, 9:06:08 PM
to rhino-t...@googlegroups.com
Committed in rev. 2086.

webpaul

Mar 3, 2009, 9:40:53 PM
to Rhino Tools Dev
Do you have a link to something that explains the general concept of
what is going on there? I haven't ever seen anything like that before.


Simone Busoli

Mar 4, 2009, 3:31:13 AM
to rhino-t...@googlegroups.com
I'm creating a method on the fly which compares the two values by coercing them to the same type.

Say you have:

object a = (int)1;
object b = (byte)1;

You'd get a.Equals(b) to be false, which is somewhat unexpected.

What I'm doing is this:

((int)a).Equals((int)(byte)b), which returns true, as expected.

There might be other ways, however.

webpaul

Mar 4, 2009, 4:19:02 AM
to Rhino Tools Dev
Is there a framework you are using that supports this, or is that
something you guys came up with? I see a .Compile() call in there when
it is generating that final Equal expression.

Simone Busoli

Mar 4, 2009, 7:23:01 AM
to rhino-t...@googlegroups.com
It's the .NET fx, and it's LCG, lightweight code generation.

webpaul

Mar 5, 2009, 10:56:26 PM
to Rhino Tools Dev
What kind of performance gain is there from what you did compared to
using a TypeConverter on both objects and then comparing the results?
I'm curious whether you have looked at that before.

That is creating an expression for each row and compiling it for each
row, right? I'm guessing the main factor here is performance so if it
was that big of a concern I wonder if you wouldn't want to make a
dictionary of them for the unique types encountered, especially since
most rows are going to have the same columns.
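The dictionary idea above could be sketched roughly as follows. This assumes a newer C# and .NET than the thread had available (value tuples and ConcurrentDictionary), and the CachedComparers name and CompileCount counter are hypothetical, added only to make the caching visible.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq.Expressions;
using System.Threading;

public static class CachedComparers
{
    private static readonly ConcurrentDictionary<(Type, Type), Func<object, object, bool>> Cache =
        new ConcurrentDictionary<(Type, Type), Func<object, object, bool>>();

    // Counts how many comparers were compiled; for demonstration only.
    public static int CompileCount;

    public static Func<object, object, bool> Get(Type firstType, Type secondType)
    {
        // The comparer is compiled on the first request for a type pair and reused afterwards.
        return Cache.GetOrAdd((firstType, secondType), key => Build(key.Item1, key.Item2));
    }

    private static Func<object, object, bool> Build(Type firstType, Type secondType)
    {
        Interlocked.Increment(ref CompileCount);
        var first = Expression.Parameter(typeof(object), "first");
        var second = Expression.Parameter(typeof(object), "second");

        // Unbox each argument to its own type, then convert the second to the first's type.
        var body = Expression.Equal(
            Expression.Convert(first, firstType),
            Expression.Convert(Expression.Convert(second, secondType), firstType));

        return Expression.Lambda<Func<object, object, bool>>(body, first, second).Compile();
    }
}
```

With this shape the cost of Compile() is paid once per distinct pair of column types rather than once per row.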

I think I'm going to work up an example app with a few of these
methods on lots of rows, to try some of these out and see the
difference. I'll post it here within a few days.

Ayende Rahien

Mar 6, 2009, 12:46:29 AM
to rhino-t...@googlegroups.com
Stopwatch.StartNew

Please measure.
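A minimal sketch of measuring with Stopwatch.StartNew, as suggested. The Measure helper and the warm-up call are assumptions for illustration, not part of the thread's benchmark.

```csharp
using System;
using System.Diagnostics;

public static class TimingDemo
{
    // Runs the action the given number of times and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action, int iterations)
    {
        action(); // warm-up call so first-call JIT cost is not part of the measurement

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            action();
        watch.Stop();

        return watch.Elapsed;
    }

    public static void Main()
    {
        object a = 1;
        object b = (byte)1;

        TimeSpan elapsed = Measure(() => a.Equals(b), 10000);
        Console.WriteLine("10000 comparisons took " + elapsed.TotalMilliseconds + "ms");
    }
}
```

Whether to include a warm-up depends on what is being measured; leaving it out folds one-time setup costs into the result.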

webpaul

Mar 6, 2009, 10:10:22 AM
to Rhino Tools Dev
It's pretty slow as is - takes 6.5 seconds for 10K iterations. I'm
going to try it a few different ways and see which is fastest.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq.Expressions;

class Program
{
    static void Main(string[] args)
    {
        RunComparisons(CreateComparerLCG);
    }

    // Comparison types

    private static Func<object, object, bool> CreateComparerLCG(Type firstType, Type secondType)
    {
        if (firstType == secondType)
            return Equals;

        var firstParameter = Expression.Parameter(typeof(object), "first");
        var secondParameter = Expression.Parameter(typeof(object), "second");

        // Unbox each argument to its own type, then convert the second to the first's type.
        var equalExpression = Expression.Equal(
            Expression.Convert(firstParameter, firstType),
            Expression.Convert(Expression.Convert(secondParameter, secondType), firstType));

        return Expression.Lambda<Func<object, object, bool>>(
            equalExpression, firstParameter, secondParameter).Compile();
    }

    private static void RunComparisons(Func<Type, Type, Func<object, object, bool>> createComparer)
    {
        List<Comparison> comparisonsToMake = new List<Comparison>
        {
            new Comparison { item = (byte)1, otherItem = (int)1 },
            new Comparison { item = (int)1, otherItem = (long)1 },
            new Comparison { item = (long)1, otherItem = (float)1 },
            new Comparison { item = (float)1, otherItem = (double)1 },
        };

        Program program = new Program();

        Stopwatch watch = new Stopwatch();
        watch.Start();
        for (int i = 0; i < 10000; i++)
        {
            foreach (var comparison in comparisonsToMake)
            {
                if (!program.Compare(comparison, createComparer))
                    throw new ApplicationException("Comparison didn't work");
            }
        }
        watch.Stop();
        Console.WriteLine("All comparisons took " + watch.ElapsedMilliseconds + "ms");
    }

    private class Comparison
    {
        public object item;
        public object otherItem;
    }

    private bool Compare(Comparison comparison, Func<Type, Type, Func<object, object, bool>> createComparer)
    {
        object item = comparison.item;
        object otherItem = comparison.otherItem;

        if (item == null || otherItem == null)
            return item == null && otherItem == null;

        // A new comparer is created (and, for LCG, compiled) on every call.
        var equalityComparer = createComparer(item.GetType(), otherItem.GetType());

        return equalityComparer(item, otherItem);
    }
}

webpaul

Mar 6, 2009, 10:23:30 AM
to Rhino Tools Dev
This does it in 93ms vs. 6500ms on my box, could be improved with a
Dictionary as well. Also, it always converts to the first type so if
you compare byte.MaxValue with int.MaxValue you get an exception. It
should convert to the largest type so I'll see what I can do on that
as well.

private static Func<object, object, bool> CreateComparerTypeConverter(Type firstType, Type secondType)
{
    if (firstType == secondType)
        return Equals;

    TypeConverter converterFirstType = TypeDescriptor.GetConverter(firstType);
    // Was GetConverter(firstType) as posted; the second converter belongs to secondType.
    TypeConverter converterSecondType = TypeDescriptor.GetConverter(secondType);

    return delegate(object first, object second)
    {
        // Both sides end up as firstType, so this throws if second does not fit in firstType.
        return object.Equals(
            converterFirstType.ConvertTo(first, firstType),
            converterFirstType.ConvertTo(
                converterSecondType.ConvertTo(second, secondType),
                firstType));
    };
}

webpaul

Mar 6, 2009, 10:46:44 AM
to Rhino Tools Dev
This one takes 1000ms vs 6500ms, but it handles the case where one
value overflows the other's type. I couldn't find a way to detect
whether a specific value will overflow without actually catching the
overflow exception, which makes this take much longer. Any ideas?

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Diagnostics;
using System.Linq.Expressions;

class Program
{
    static void Main(string[] args)
    {
        RunComparisons(CreateComparerLCG);
        RunComparisons(CreateComparerTypeConverter);
    }

    // Comparison types

    private static Func<object, object, bool> CreateComparerTypeConverter(Type firstType, Type secondType)
    {
        if (firstType == secondType)
            return Equals;

        TypeConverter converterFirstType = TypeDescriptor.GetConverter(firstType);
        TypeConverter converterSecondType = TypeDescriptor.GetConverter(secondType);

        return delegate(object first, object second)
        {
            try
            {
                // Convert the second value to the first value's type and compare.
                return object.Equals(first, converterFirstType.ConvertTo(second, firstType));
            }
            catch (OverflowException)
            {
                // The second value does not fit in firstType; convert the other way.
                return object.Equals(second, converterSecondType.ConvertTo(first, secondType));
            }
        };
    }

    private static Func<object, object, bool> CreateComparerLCG(Type firstType, Type secondType)
    {
        if (firstType == secondType)
            return Equals;

        var firstParameter = Expression.Parameter(typeof(object), "first");
        var secondParameter = Expression.Parameter(typeof(object), "second");

        var equalExpression = Expression.Equal(
            Expression.Convert(firstParameter, firstType),
            Expression.Convert(Expression.Convert(secondParameter, secondType), firstType));

        return Expression.Lambda<Func<object, object, bool>>(
            equalExpression, firstParameter, secondParameter).Compile();
    }

    private static void RunComparisons(Func<Type, Type, Func<object, object, bool>> createComparer)
    {
        List<Comparison> comparisonsToMake = new List<Comparison>
        {
            new Comparison { item = (byte)1, otherItem = (int)1 },
            new Comparison { item = (int)1, otherItem = (long)1 },
            new Comparison { item = (long)1, otherItem = (float)1 },
            new Comparison { item = (float)1, otherItem = (double)1 },
            new Comparison { item = byte.MaxValue, otherItem = int.MaxValue, expectedValue = false },
        };

        Program program = new Program();

        Stopwatch watch = new Stopwatch();
        watch.Start();
        for (int i = 0; i < 10000; i++)
        {
            foreach (var comparison in comparisonsToMake)
            {
                if (program.Compare(comparison, createComparer) != comparison.expectedValue)
                    throw new ApplicationException("Comparison didn't work");
            }
        }
        watch.Stop();
        Console.WriteLine("All comparisons took " + watch.ElapsedMilliseconds + "ms");
    }

    private class Comparison
    {
        public object item;
        public object otherItem;
        public bool expectedValue = true;

Simone Busoli

Mar 6, 2009, 11:02:14 AM
to rhino-t...@googlegroups.com
Try caching the LCG comparer, you'll be surprised.

webpaul

Mar 7, 2009, 9:32:35 AM
to Rhino Tools Dev
Type converter is still faster than dictionary cached LCG (without
overflow problem solved) and I'm not even caching the type converter
yet... The LCG solution does handle the overflow condition gracefully
and I don't fully understand how it isn't overflowing, can you explain
how what you are doing doesn't fail when converting int.MaxValue to a
byte?

Also, how do you guys handle multi-key dictionaries typically? I have
a way but am curious how you usually do it.
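One common answer to the multi-key dictionary question is a small value type that implements equality over all of its parts. The sketch below is an assumption about how that might look; the TypePair name is hypothetical and is not taken from the Rhino ETL code base.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical composite key made of two types.
public struct TypePair : IEquatable<TypePair>
{
    public readonly Type First;
    public readonly Type Second;

    public TypePair(Type first, Type second)
    {
        First = first;
        Second = second;
    }

    public bool Equals(TypePair other)
    {
        return First == other.First && Second == other.Second;
    }

    public override bool Equals(object obj)
    {
        return obj is TypePair && Equals((TypePair)obj);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            // Combine both hash codes so that (int, byte) and (byte, int) usually differ.
            int hash = First == null ? 0 : First.GetHashCode();
            return (hash * 397) ^ (Second == null ? 0 : Second.GetHashCode());
        }
    }
}

public static class TypePairDemo
{
    public static void Main()
    {
        var cache = new Dictionary<TypePair, string>();
        cache[new TypePair(typeof(int), typeof(byte))] = "int/byte comparer";

        string found;
        Console.WriteLine(cache.TryGetValue(new TypePair(typeof(int), typeof(byte)), out found)); // True
        Console.WriteLine(cache.TryGetValue(new TypePair(typeof(byte), typeof(int)), out found)); // False
    }
}
```

The key is order-sensitive on purpose, since the generated comparers in this thread convert toward the first type.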

Simone Busoli

Mar 7, 2009, 9:36:49 AM
to rhino-t...@googlegroups.com
Sorry Paul, I'm not going to reply anymore. Write a test, write a patch and submit it if you think that using type converters is a better option.

webpaul

Mar 7, 2009, 10:45:03 AM
to Rhino Tools Dev
Until I know how you want to handle multi key dictionaries in your
code base I'm not going to submit a patch. Below is how I did it with
a class I usually use for that, more details at:
http://www.codeproject.com/KB/recipes/ClassKey.aspx

If anyone is interested, the LINQ expressions used in the LCG have a
checked/unchecked version which is why it didn't have the overflow
issue. As far as I can tell type converters don't have that feature
and using an unchecked block doesn't stop it from throwing an
exception. When the type converters are cached they are better than
the cached LCG unless I'm doing something wrong here.
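The checked/unchecked distinction mentioned above can be seen in isolation with Expression.Convert and Expression.ConvertChecked. This is a small sketch with hypothetical names, separate from the comparer code in the thread.

```csharp
using System;
using System.Linq.Expressions;

public static class CheckedConversionDemo
{
    // Builds an int-to-byte conversion, either checked or unchecked.
    public static Func<int, byte> BuildNarrowing(bool useChecked)
    {
        ParameterExpression value = Expression.Parameter(typeof(int), "value");
        Expression body = useChecked
            ? Expression.ConvertChecked(value, typeof(byte))
            : Expression.Convert(value, typeof(byte));
        return Expression.Lambda<Func<int, byte>>(body, value).Compile();
    }

    public static void Main()
    {
        // Unchecked: the high bits are discarded instead of throwing.
        Console.WriteLine(BuildNarrowing(false)(int.MaxValue)); // 255

        try
        {
            BuildNarrowing(true)(int.MaxValue);
        }
        catch (OverflowException)
        {
            Console.WriteLine("ConvertChecked threw OverflowException");
        }
    }
}
```

Note that the unchecked result for int.MaxValue is 255, which is also byte.MaxValue, so it may be worth verifying what an unchecked narrowing comparer returns for that particular pair.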

Performance from worst to best, with the original test rows plus one
more for overflow condition:

10K iterations * 5 tests per iteration:

LCG compiled each time: 8043ms
TypeConverter with try/catch: 869ms
LCG cached: 143ms
TypeConverter without try/catch (exception on overflow): 82ms
TypeConverter with try/catch, cached: 23ms

class Program
{
    static void Main(string[] args)
    {
        RunComparisons(CreateComparerLCG);
        RunComparisons(CreateComparerTypeConverter);
        RunComparisons(CreateComparerLCGCachedClass);
        RunComparisons(CreateComparerTypeConverterCached);
    }

    private class CacheKey : ClassKey<CacheKey>
    {
        public Type FirstType = null;
        public Type SecondType = null;

        public override object[] GetKeyValues()
        {
            return new object[] { FirstType, SecondType };
        }
    }

    private class ComparerCacheClass : Dictionary<CacheKey, Func<object, object, bool>> { }

    private static readonly ComparerCacheClass comparerCacheClass = new ComparerCacheClass();

    private static Func<object, object, bool> CreateComparerLCGCachedClass(Type firstType, Type secondType)
    {
        if (firstType == secondType)
            return Equals;

        var firstParameter = Expression.Parameter(typeof(object), "first");
        var secondParameter = Expression.Parameter(typeof(object), "second");

        var equalExpression = Expression.Equal(
            Expression.Convert(firstParameter, firstType),
            Expression.Convert(Expression.Convert(secondParameter, secondType), firstType));

        Func<object, object, bool> compareExpression = null;
        CacheKey key = new CacheKey { FirstType = firstType, SecondType = secondType };
        if (!comparerCacheClass.TryGetValue(key, out compareExpression))
        {
            compareExpression = Expression.Lambda<Func<object, object, bool>>(
                equalExpression, firstParameter, secondParameter).Compile();
            comparerCacheClass.Add(key, compareExpression);
        }

        return compareExpression;
    }

    private static Func<object, object, bool> CreateComparerTypeConverterCached(Type firstType, Type secondType)
    {
        if (firstType == secondType)
            return Equals;

        Func<object, object, bool> compareExpression = null;
        CacheKey key = new CacheKey { FirstType = firstType, SecondType = secondType };
        if (!comparerCacheClass.TryGetValue(key, out compareExpression))
        {
            TypeConverter converterFirstType = TypeDescriptor.GetConverter(firstType);
            TypeConverter converterSecondType = TypeDescriptor.GetConverter(secondType);

            compareExpression = delegate(object first, object second)
            {
                try
                {
                    return object.Equals(first, converterFirstType.ConvertTo(second, firstType));
                }
                catch (OverflowException)
                {
                    return object.Equals(second, converterSecondType.ConvertTo(first, secondType));
                }
            };

            comparerCacheClass.Add(key, compareExpression);
        }

        return compareExpression;
    }

Simone Busoli

Mar 7, 2009, 11:29:14 AM
to rhino-t...@googlegroups.com
Did you realize that you're using a shared cache, which is already filled with delegates when you run the fourth step?

webpaul

Mar 7, 2009, 12:34:41 PM
to Rhino Tools Dev
Doh! Nice catch.

Looks like it is really 800ms, the try/catch is a killer because it's
in the delegate so caching it doesn't give a big benefit. If there was
an unchecked type converter or some way to tell which type was bigger
so I could always cast to that then I suspect the type converter would
be better. But as-is (including the overflow scenario) the cached LCG
is the best.

webpaul

Mar 8, 2009, 8:15:27 PM
to Rhino Tools Dev
Found a way to compare the type sizes. Using this method for previous
examples, it returns in 37ms which would seem to be faster than the
LCG. I wanted to rule out a larger startup time for LCG as opposed to
execution time, so for 1 million rows the cached LCG takes 13 seconds
and the type converter method below takes 3 seconds.

Separate caches of course. :)
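A short sketch of what Marshal.SizeOf reports for a few numeric types, with hypothetical helper names. It is meant to show that marshaled size is only a proxy for "bigger": types of equal size can have different ranges, and a larger integer type can still lose a fractional value converted into it.

```csharp
using System;
using System.Runtime.InteropServices;

public static class SizeOfDemo
{
    public static int SizeOf(Type type) => Marshal.SizeOf(type);

    // Mirrors the "convert toward the larger size" choice for a float/long pair.
    public static bool FloatEqualsLongBySize(float f, long l)
    {
        // long (8 bytes) is larger than float (4 bytes), so the float is converted to long.
        object converted = Convert.ChangeType(f, typeof(long));
        return converted.Equals(l);
    }

    public static void Main()
    {
        foreach (Type type in new[] { typeof(byte), typeof(int), typeof(uint), typeof(long), typeof(float), typeof(double) })
            Console.WriteLine(type.Name + ": " + SizeOf(type) + " bytes");

        Console.WriteLine(SizeOf(typeof(int)) == SizeOf(typeof(uint))); // True: same size, different ranges
        Console.WriteLine(FloatEqualsLongBySize(1f, 1L));               // True
        Console.WriteLine(FloatEqualsLongBySize(0.5f, 0L));             // True: 0.5 rounds to 0
    }
}
```

Marshal.SizeOf is intended for interop layouts and may throw for types without a marshaled size, so a column of strings or dates would presumably need separate handling.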

private static Func<object, object, bool> CreateComparerTypeConverterCached(Type firstType, Type secondType)
{
    if (firstType == secondType)
        return Equals;

    Func<object, object, bool> compareExpression = null;
    CacheKey key = new CacheKey { FirstType = firstType, SecondType = secondType };
    if (!comparerCacheClass.TryGetValue(key, out compareExpression))
    {
        TypeConverter converterFirstType = TypeDescriptor.GetConverter(firstType);
        TypeConverter converterSecondType = TypeDescriptor.GetConverter(secondType);

        int firstTypeSize = System.Runtime.InteropServices.Marshal.SizeOf(firstType);
        int secondTypeSize = System.Runtime.InteropServices.Marshal.SizeOf(secondType);

        // Always convert toward the type with the larger marshaled size.
        if (secondTypeSize >= firstTypeSize)
        {
            compareExpression = delegate(object first, object second)
            {
                return object.Equals(second, converterSecondType.ConvertTo(first, secondType));
            };
        }
        else
        {
            compareExpression = delegate(object first, object second)
            {
                return object.Equals(first, converterFirstType.ConvertTo(second, firstType));
            };
        }

        comparerCacheClass.Add(key, compareExpression);
    }

    return compareExpression;
}

webpaul

Mar 8, 2009, 8:19:24 PM
to Rhino Tools Dev
Even the uncached type converter is faster than cached LCG, 11 seconds
instead of 13. Perhaps the LINQ expression implementation for
conversion is more expensive than the standard type converter.

private static Func<object, object, bool> CreateComparerTypeConverter(Type firstType, Type secondType)
{
    if (firstType == secondType)
        return Equals;

    TypeConverter converterFirstType = TypeDescriptor.GetConverter(firstType);
    TypeConverter converterSecondType = TypeDescriptor.GetConverter(secondType);

    int firstTypeSize = System.Runtime.InteropServices.Marshal.SizeOf(firstType);
    int secondTypeSize = System.Runtime.InteropServices.Marshal.SizeOf(secondType);

    // Always convert toward the type with the larger marshaled size.
    if (secondTypeSize >= firstTypeSize)
    {
        return delegate(object first, object second)
        {
            return object.Equals(second, converterSecondType.ConvertTo(first, secondType));
        };
    }
    else
    {
        return delegate(object first, object second)
        {

webpaul

Mar 15, 2009, 3:47:41 PM
to Rhino Tools Dev
Uploaded RowComparePerformanceFix.patch, previous code did not
complete in under 500ms, took 1200ms (on my computer). Just did multi
keys as string concatenation since I never heard any feedback on how
you want to handle multi key dictionaries.