new release?


Sharique

May 14, 2009, 10:28:14 AM
to DbLinq
Hi,
It has been almost a year since the last release; I think it is time to make
a new one (0.19, I guess). If there are any blocking issues, please post
them here for discussion so that we can resolve them quickly.

--
Sharique

Giacomo Tesio

May 14, 2009, 10:55:40 AM
to dbl...@googlegroups.com
I have two needs:
- Thread safety of the static caches: this should be done for QueryCache, but Jon has encountered a strange bug (not yet reproduced in tests) while working on NerdDinner. If no other static caches exist, they are OK.

- XmlMappingSource working correctly: right now, associations are not loaded from external mappings. This fix requires modifications to IDataMapper and DataMapper and fixes in DataContext.


The first is absolutely needed (but it should already work right; we're just missing a true multithread test on multi-core machines).
The second is, I think, really important, but it requires a bit of work.



Giacomo

Jonathan Pryor

May 14, 2009, 11:19:42 AM
to dbl...@googlegroups.com
This is bound to be a stupid/silly question, but why do the caches need to be static?  Static data is effectively global, i.e. a GC root in and of itself, and thus will never be collected.  Even with a good policy, it's possible that this could use more memory than people would expect.

See also:
    http://blogs.msdn.com/oldnewthing/archive/2006/05/02/588350.aspx
    http://blogs.msdn.com/ricom/archive/2004/01/19/60280.aspx

Is there really a need for a cache that's static (i.e. shared amongst all DataContext instances)?  Or can it just be non-static and attached to the DataContext (which would also remove all thread safety requirements)?

Put another way, with non-shared caches if the DataContext gets collected then the cache is also collected, thus providing a natural mechanism to clear the cache.  With shared (static) caches, they're not connected to the DataContext, and thus it could be holding cached data for a DataContext that no longer exists.  (This may not be the case anyway; I haven't fully read and understood the code.  I'm just trying to make clear that preferring shared caches isn't an open and shut easy decision.)
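As an illustration of that alternative, here is a minimal sketch (with invented names, not DbLinq's actual types) of a cache attached to its owning context, so it becomes collectible together with the context:

```csharp
using System;
using System.Collections.Generic;

// Sketch: a query cache held as an instance field of its context. It is only
// reachable through the context, so when the context is collected, so is the
// cache. No static (GC-rooted) data involved.
class ContextWithCache
{
    readonly Dictionary<string, string> queryCache = new Dictionary<string, string>();

    // Returns the cached SQL for a key, generating it at most once.
    public string GetOrGenerate(string expressionKey, Func<string> generateSql)
    {
        string sql;
        if (!queryCache.TryGetValue(expressionKey, out sql))
        {
            sql = generateSql();
            queryCache[expressionKey] = sql;
        }
        return sql;
    }
}
```

Since each context owns its cache, no cross-instance locking is needed as long as a single context isn't shared between threads.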

Thanks,
- Jon

Giacomo Tesio

May 15, 2009, 4:00:26 AM
to dbl...@googlegroups.com
That's a good question, I think.

QueryCache is a cache of the generated query for each expression tree evaluated.
It actually has to be static to improve the hit rate, since in the most common DataContext use case it represents a unit of work and is short-lived. (It could be at least thread-static, but that would multiply memory usage by the number of threads, reduce hits, and tie the cache's lifetime to the thread's.)
If the QueryCache's lifecycle matched the DataContext's, it would probably have no reason to exist.

In "our" DbLinq use case, a read-only DataContext is used from all threads and lives as long as the AppDomain, while single units of work are created per request or on an as-needed basis.

The global read-only one must have no "instance caches" at all (no object tracking, for example; and I hope there are no other caches, but I should investigate this more), yet it still needs a QueryCache, because it can share its already-parsed expression trees among threads.

Moreover, all DbLinq DataContexts would benefit from such static query caching, greatly increasing performance (I hope! :-D).

Other caches (like the entity-tracking one) must not be static, since they are conceptually linked to the DataContext that fills and uses them.




Giacomo

Giacomo Tesio

May 15, 2009, 4:25:39 AM
to dbl...@googlegroups.com
OFF TOPIC, about the interfaces:
IQueryCache is an example of an interface that must not be removed. This way, if a skilled programmer really needed it, he could reimplement it to better fit his requirements.

Even using ReaderWriterLockSlim, thread safety has a cost: someone could need speed rather than thread safety.
Then it should be possible to reimplement it.
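For the record, such a thread-safe reimplementation guarded by ReaderWriterLockSlim might be sketched like this (generic placeholder types, not the real IQueryCache members):

```csharp
using System.Collections.Generic;
using System.Threading;

// Sketch of a thread-safe cache guarded by ReaderWriterLockSlim: many
// concurrent readers, exclusive writers. Key/value types are placeholders.
class RwLockedCache<TKey, TValue>
{
    readonly Dictionary<TKey, TValue> store = new Dictionary<TKey, TValue>();
    readonly ReaderWriterLockSlim rwLock = new ReaderWriterLockSlim();

    public bool TryGet(TKey key, out TValue value)
    {
        rwLock.EnterReadLock();             // shared: lookups don't block each other
        try { return store.TryGetValue(key, out value); }
        finally { rwLock.ExitReadLock(); }
    }

    public void Set(TKey key, TValue value)
    {
        rwLock.EnterWriteLock();            // exclusive: only one writer at a time
        try { store[key] = value; }
        finally { rwLock.ExitWriteLock(); }
    }
}
```

A lock-free reimplementation behind the same interface could then trade this safety for raw speed.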

Another use of such an interface could be to move the cache out of the AppDomain (onto dedicated cache servers) and share it among servers. This would also make the cache's lifecycle longer than the IIS application's.
I'm not a fan of this solution (I'm not sure it would lead to better performance), but in an enterprise environment like ours, it has to be possible to take such a decision later.

That's why I think good interfaces are a good thing: we should not decide how DbLinq has to be used (technologically speaking, of course).

Having GOOD internal interfaces allows better flexibility in the long run and produces a better open source product.


BAD interfaces, on the other hand, reduce flexibility and increase development effort, but I think they always point to a wrong analysis or design (if not a completely missing one).


I often encounter .NET programmers speaking against interfaces. But so far, I've always noticed they are talking about wrong interfaces they designed bottom-up. It can take a long time to explain to a Microsoft .NET developer that the problem is not the interfaces; the problem is their design.

More or less like explaining to them that a Domain Model IS Object Orientation, not just one way of doing Object Orientation.
Or explaining that an object-oriented language does not by itself lead to object-oriented software...


Ok... I don't like Microsoft. :-D



Giacomo

Jonathan Pryor

May 15, 2009, 7:37:29 AM
to dbl...@googlegroups.com
I think you'll find that you're doing it wrong. :-)

First, I'm not sure that the assertion that all apps have short-lived DataContexts is correct.  That's certainly not the case for NerdDinner, which has only one DataContext for the lifetime of the app.  Many apps may have short-lived DataContexts, but many won't.

Secondly, and primarily, Microsoft doesn't do implicit query caching.  They do explicit query caching:
http://blogs.msdn.com/ricom/archive/2008/01/11/performance-quiz-13-linq-to-sql-compiled-queries-cost.aspx
http://blogs.msdn.com/ricom/archive/2008/01/14/performance-quiz-13-linq-to-sql-compiled-query-cost-solution.aspx
For example:
var fq = CompiledQuery.Compile 
( 
    (Northwinds nw) => 
            (from o in nw.Orders 
            select new 
                   { 
                       OrderID = o.OrderID, 
                       CustomerID = o.CustomerID, 
                       EmployeeID = o.EmployeeID, 
                       ShippedDate = o.ShippedDate 
                   }).Take(5) 
);
The result of CompiledQuery.Compile is a pre-compiled, pre-analyzed query, which the user is responsible for caching and dealing with.  Query caches are not part of DataContext itself, precisely because it's a recipe for a giant memory leak.

(Consider a fictional app which uses Linq-to-SQL once at startup, or otherwise very infrequently.  The DbLinq approach would assure that the original cached queries would never be freed; it would be a permanent memory tax on the app.)

I would strongly suggest that you follow Microsoft's approach, drop the DataContext query caching, and use CompiledQuery instead.

- Jon

Giacomo Tesio

May 15, 2009, 10:35:28 AM
to dbl...@googlegroups.com
I really hope it is right... :-|
It would be a great problem otherwise.

AFAIK, the DataContext represents a unit of work. If the unit of work lasts as long as the application, then keeping it alive is right.
But in an internet application delivered via HTTP, I have my doubts.

If it's not read-only, the tracked entities would become an unaligned copy of the full database... in memory.


That said, does DbLinq support compiled queries?

The problem with this approach would be that our queries are not static: we progressively add clauses to the IQueryable<T>.
And how do we handle closures?


Giacomo

Jonathan Pryor

May 15, 2009, 1:34:48 PM
to dbl...@googlegroups.com
I think the correct approach is to use CompiledQuery, for precisely the cache management reasons I originally outlined.

Though long-lived DataContexts would be problematic, as you point out, as you'd effectively clone the entire DB into memory given enough time and queries...  I don't see how a connected database approach (which DataContext follows) could behave otherwise.

Does DbLinq support compiled queries?  find says:
$ find src -name CompiledQuery.cs
src/DbLinq/System.Data.Linq/CompiledQuery.cs
Survey says...No.  I believe we should.

I'm not sure what you mean by our queries not being static.  Once we cache a query, we don't change it anymore.  (If we did, we'd be altering the original query, which would be...bad.)

I'm also not sure what you're getting at with closures.

The way I envision CompiledQuery working is very similar/identical to how your existing query caching works, except instead of storing the SelectQuery type within QueryCache, it would instead be stored within the delegate returned from CompiledQuery.Compile().  Thus, the user actually deals with pre-parsed query statements, and DbLinq doesn't need to re-parse the LINQ statement again (as the delegate returned by CompiledQuery.Compile() stores the pre-parsed SelectQuery instance).
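The "parse once, reuse many times" idea can be sketched without any DbLinq types at all: the expensive step runs at compile time, its result lives in the returned delegate's closure, and only parameters vary per call (everything below is illustrative, not the real CompiledQuery):

```csharp
using System;

// Sketch of a Compile() that does the expensive work once and stores the
// result in the returned delegate's closure.
static class CompileSketch
{
    public static int ParseCount;   // counts how often the "expensive parse" runs

    public static Func<string, string> Compile(string table)
    {
        ParseCount++;                                         // expression parsing + SQL generation happen here, once
        string sqlTemplate = "SELECT * FROM " + table + " WHERE LastName = {0}";
        return lastName =>                                    // closure holds sqlTemplate
            string.Format(sqlTemplate, "'" + lastName + "'");
    }
}
```

Calling the delegate repeatedly never re-parses; the caller decides how long to keep the delegate, which is exactly the user-controlled cache policy described above.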

I'm not sure how much work this would take, but I'm hopeful that it wouldn't be too much work.

However, this does bring up two related issues.

1. I still need to figure out wtf is going on with the NerdDinner caching bug I was seeing last week (and better, how to reproduce in the unit tests so that you can take a better look at it).

2. SelectQuery.GetCommand() gives me really bad feelings, because:
  1. It takes no arguments.
  2. It calls InputParameterExpression.GetValue() with no parameters.
  3. (1) and (2) together imply that, even though the underlying SELECT takes named parameters (yay), there's no way to actually provide or alter them for the current SelectQuery instance (wtf?).
I think this is why issue 1 above (the NerdDinner bug) fails for me (but again, I still need to debug).

In any event, it makes no sense to me at all.  The point to having a SELECT with parameters is so that you can cache the expression itself but vary the parameters.  But since we're not providing any parameters, the parameters can't vary.  So...

It makes my head hurt, if nothing else.

I would instead expect AbstractQuery.GetCommand() to take an 'object[] parameters' argument (or similar) so that we can cache the actual select statement w/o associated parameters.  This would also dovetail nicely with the semantics CompiledQuery.Compile(), as you can provide parameters to the expression you're compiling:
var peopleWithLastName = CompiledQuery.Compile(
	(PeopleDb db, string lastName, int start, int count) =>
		(from p in db.People
		 where p.LastName == lastName
		 select p)
		.Skip(start)
		.Take(count));
foreach (var p in peopleWithLastName(myDB, "Foo", 0, 1)) ...
Alas, CompiledQuery.Compile() will only create delegates accepting 4 parameters, but there are workarounds...
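One common workaround (sketched here with a plain delegate instead of a real CompiledQuery, so the names are illustrative) is to bundle extra values into a single argument object, spending only one of the available parameter slots:

```csharp
using System;
using System.Linq;

// Hypothetical bundle of query parameters: one delegate slot carries them all.
struct PersonFilter
{
    public string LastName;
    public int Start;
    public int Count;
}

static class PeopleQueries
{
    // Stand-in for a compiled query over a data source; a real one would take
    // the DataContext and return query results.
    public static readonly Func<string[], PersonFilter, string[]> ByLastName =
        (db, f) => db.Where(n => n == f.LastName)
                     .Skip(f.Start)
                     .Take(f.Count)
                     .ToArray();
}
```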
- Jon

Giacomo Tesio

May 16, 2009, 3:33:03 AM
to dbl...@googlegroups.com
AFAIK, closures are a sort of "pointer" to the value (actually to a wrapper of the context being used).

I suppose that in this case someone (the CLR, say) would update the closure's values so that the already-compiled query would find the correct values in its environment.


I promise I'll take another look at QueryCache on Monday, but I don't believe we can actually remove the cache, since without it DbLinq's performance would make it unusable.

The problem here is fully understanding how the CLR manages closures and why you get different closures at different times. We could also include the closures in the expression comparison (which currently works only on the string representation of the expression itself).


Giacomo
PS: as to the connected DataContext approach: I think long-lived DataContexts should not be "connected", so ours will have ObjectTrackingEnabled == false.

Jonathan Pryor

May 16, 2009, 7:55:56 AM
to dbl...@googlegroups.com
Behold the actual underlying problem to why QueryCache is broken:
var id = "foo";
var a = db.Customers.Where(c => c.CustomerID == id);
id = "bar";
var b = db.Customers.Where(c => c.CustomerID == id);
Console.WriteLine("HashCodes: a={0}; b={1}", a.GetHashCode(), b.GetHashCode());
Console.WriteLine("a == b? {0}", a.Equals(b));
Output while running under .NET:
HashCodes: a=38068051; b=17286255
a == b? False
In short, it is not appropriate to use an expression tree as a key into a Dictionary, as they do not demonstrate value equality properties.

QueryCache is broken (and effectively unfixably broken) because the only useful key into the cache that we have...can't actually be used as a key.

(Though I suppose we could instead use the result of ToString() as the key into the cache, but more on that below.)

However, just using ToString() won't fix the breakage in the unit tests I wrote yesterday.  It will instead just make the existing (un-patched) ReadTest.A5_SelectSingleOrDefault() test fail (for precisely the same "input parameter values are also cached" issue I referred to yesterday).

Finally, I'd like to make one last effort at pointing out why CompiledQuery.Compile() is a better approach.  Executing a Linq-to-SQL query takes 3 steps:
  1. Create the expression tree (done by the compiler and at runtime via Queryable extension methods).
  2. Convert the expression tree into SQL code (done by DbLinq).
  3. Execute the SQL against the database.
The current QueryCache approach only caches the result of (2), and if you've ever profiled expression trees you'll find that (1) can also be quite slow (often unexpectedly so).  (This is also why using ToString(), while it would "fix" the key lookup compared to using the expression tree itself, still isn't an ideal fix: ToString() requires traversing the (possibly quite large) expression tree, and would thus add additional overhead just to perform the cache lookup.)

Meanwhile, CompiledQuery.Compile() allows caching both steps (1) and (2).  There is no need for the delegate returned by CompiledQuery.Compile() to re-create the expression tree, nor to re-parse it, which saves time (which is why CompiledQuery exists in the first place).

Please consider implementing CompiledQuery.

- Jon

Giacomo Tesio

May 16, 2009, 9:38:03 AM
to dbl...@googlegroups.com
I'm considering the approach you suggest. Really.

The problem I can't figure out is how to handle closures: we use DbLinq as a DAL, integrating data access with a good Query Object (http://www.martinfowler.com/eaaCatalog/queryObject.html).

The IQueryable<T> is passed through a series of methods, each of which adds its relevant part to the tree (joining and filtering as each function knows how). The final IQueryable<T> could contain different closures from different scopes.

Then, after all this, the IQueryable<T> is simply executed to fetch all the T which match all the rules that the functions (configured by the user) have added to its tree.

As far as I understand, the CompiledQuery approach would allow reusing a query many times if all the information it needs is known when it's to be executed. This is not our case, since each Strategy could know only a part of it (I hope I'm being understandable).

Maybe the (I hope fixable) bug is not in QueryCache: it's in its usage.
After retrieving the SelectQuery from the cache, we could traverse the query again, just searching for the new closure values and setting them in the SelectQuery. But then I foresee a new thread-safety bug, which would probably lead to a thread-static query cache, since the SelectQuery should not be shared once returned from the cache.
An alternative would be to deep-copy the SelectQuery before returning it from QueryCache.

That smells of a workaround... I know... :-(

Giacomo

Pascal Craponne

May 16, 2009, 9:50:12 AM
to dbl...@googlegroups.com
Hi,

right: two expressions in different places have different signatures (hash and reference). Even the same expression evaluated twice has a different signature. This is why, when I wrote the expression comparer, I chose the option of comparing the comparable contents.
I'm pretty sure it was working at that time.
Someone probably added a new comparison, which now causes comparisons to fail.

Do we have simple samples of a comparison returning equality where it should not?
Do we also have samples of the opposite?

I don't like the ToString() approach, since it is not proven to be 100% reliable (I'm not sure all sub-expressions generate something, and something unique, when converted to a string).

Pascal.

Pascal Craponne

May 16, 2009, 9:58:34 AM
to dbl...@googlegroups.com
I took a look at the history, and it is now just a string comparison.
Was it failing before the end of April (the 27th)?

Giacomo Tesio

May 16, 2009, 2:54:57 PM
to dbl...@googlegroups.com
No, it probably wasn't...

But with the old approach the query cache only hit on queries with identical parameter values. This led to a performance problem: memory grew too much (a query is cached for each combination of values), and (since the hash code provided by ExpressionEqualityComparer didn't include the parameters' values) after a while the queries compared for equality during the dictionary lookup would become too many.

When I added the string comparison, I actually didn't go deep enough into the code to understand that the parameters' values were cached too (even though I did some tests, which didn't show this problem... wrong tests!).

Give me a few hours to look for a better approach.
Probably the best one would be to split the SelectQuery into two parts: one immutable and cacheable (without the parameter values), and one with the parameters' values included.

A cache hit would then provide a select query that still requires a parse of the expression tree (unless I can find a way to cache the closures) to fill in those values.
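The split could look roughly like this (all names invented for illustration; the real SelectQuery is more involved):

```csharp
using System;
using System.Collections.Generic;

// Sketch: an immutable, cacheable query part (shareable across threads)
// paired with a per-execution parameter set.
sealed class CachedSelect
{
    public readonly string Sql;                    // parameterized SQL, no values baked in
    public readonly IList<string> ParameterNames;  // e.g. "@p0", "@p1"

    public CachedSelect(string sql, IList<string> parameterNames)
    {
        Sql = sql;
        ParameterNames = parameterNames;
    }
}

sealed class ExecutableSelect
{
    public readonly CachedSelect Query;                  // shared, cached part
    public readonly IDictionary<string, object> Values;  // bound fresh per execution

    public ExecutableSelect(CachedSelect query, IDictionary<string, object> values)
    {
        Query = query;
        Values = values;
    }
}
```

Only CachedSelect instances would live in the static cache; ExecutableSelect instances are throwaway, so no shared mutable state crosses threads.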
 
 
That said, there are surely cases (even for other uses we make of DbLinq) where CompiledQuery would be a better approach, and, time allowing, I'll try to implement it.
 
 
 
Giacomo

Giacomo Tesio

May 16, 2009, 3:08:53 PM
to dbl...@googlegroups.com
A note: on a cache hit, the expression tree visitor should be far faster than the one that created the cached select query (the one without the parameter values), because it only has to handle the parameters.

I'll probably implement a tree visitor focused on handling them.

What do you think about this?
 
 
Giacomo

Pascal Craponne

May 16, 2009, 4:03:07 PM
to dbl...@googlegroups.com
OK, so I suggest switching back to the tree visitor and ignoring closure values (not their types) in the comparison process.

Jonathan Pryor

May 16, 2009, 9:35:56 PM
to dbl...@googlegroups.com
Looks like the string comparison was done in r1060:
r1060 | gia...@tesio.it | 2009-04-27 16:53:24 -0400 (Mon, 27 Apr 2009) | 3 lines

Improving performances of ExpressionEqualityComparer.Equals() by comparing Expression.ToString().
If I check out and build r1053 (April 24) and replace ReadTest.A5_SelectSingleOrDefault() with my new tests, the tests pass.

Checking some more revisions, I find that r1056 is the last revision to have a "working" QueryCache implementation ("working" defined as "my new tests pass").

r1057 fails to compile, while r1059 is the first to actually work, and the test fails.  So one of the commits in r1057, r1058, or r1059 broke QueryCache.  (Which further implies that the string comparison change isn't directly responsible...)

I haven't read the diffs on those revisions to further narrow things down.

- Jon

Jonathan Pryor

May 16, 2009, 10:18:00 PM
to dbl...@googlegroups.com
You've asked how to deal with closures.  After spending some time thinking about this, I don't think it's possible to support closures, assuming we're thinking of the same thing.

I'm assuming you have code vaguely like this:
public IQueryable<Person> GetPeopleWithFirstName(string firstName)
{
  return GetPeople().Where(e => e.FirstName == firstName);
}
that is, some form of method/property that returns an IQueryable<T>, which subsequent code applies further extension methods to.

Furthermore, you then want to cache the underlying queries, without the cache growing too large.

The problem, as you say, is closures -- '.Where(e => e.FirstName == firstName)' creates a closure on firstName -- the compiler is, quite literally, both permitting closures to exist and screwing things up for you.

To facilitate the above, the compiler will create a new class to contain the captured firstName value, instantiate an instance of this new class, populate the firstName field of the new class, and further pass this class to the .Where() extension method.  This class is obviously garbage collected (how could it not be?).
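Written out by hand, the compiler's transformation looks roughly like this (the real generated type has an unspeakable name like <>c__DisplayClass; everything here is illustrative):

```csharp
using System.Linq;

// Hand-written equivalent of what the compiler generates for
//     .Where(e => e.FirstName == firstName)
class CapturedVariables               // stands in for the compiler's display class
{
    public string firstName;          // the captured local becomes a field
    public bool Matches(string candidate) { return candidate == firstName; }
}

static class ClosureDemo
{
    public static string[] WithFirstName(string[] people, string firstName)
    {
        var captured = new CapturedVariables();   // one instance per invocation
        captured.firstName = firstName;           // value copied into the field
        return people.Where(captured.Matches).ToArray();
    }
}
```

Once WithFirstName returns, nothing but the resulting query holds the CapturedVariables instance, which is why the captured value can't be updated from outside.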

So, by the time we get a SelectQuery instance to insert into the cache, it must be referring to this captured firstName value.  Furthermore, there's no way to modify the firstName value (because we never keep a reference around to permit modifying it).

This is why, with the current QueryCache code, subsequent uses of the same lambda expression return the original parameter values -- the captured values are inexorably linked to the SelectQuery instance, and since we're not taking those values into consideration (to keep the cache smaller), we get major breakage like what I was seeing in NerdDinner.

I suspect that there is no solution to this: you can't say that set A of captured values should stay with the resulting SelectQuery, while set B of captured values needs to be overridden later.  Even if you could say this, there's no provided mechanism to get the updated values for use with a new SQL generation step, as the only way to get any value is via closure capturing.

QueryCache is busted.  The only way to make it work is to do what was originally done (make values part of the cached values), which results in large memory use as every query variation is stored uniquely.

(Which is why caches and cache policy are actually quite hard to design and implement, as a poorly written cache is indistinguishable from a memory leak, and it looks like our existing cache is indistinguishable from a memory leak. :-(  Furthermore, given the unbounded memory properties of our cache, and the CPU-bound nature of SQL generation, there is probably a point where always generating the Query+SQL "from scratch" is more performant than using the cache, because the cache sucked up several GB of space and is slowing down the entire process due to VM memory being swapped around...)

The only solution to this that I see is CompiledQuery, which does provide a mechanism to distinguish between captured variables and explicit parameters, e.g.
string implicitlyCaptured = "Jon";
Func<PeopleDB, string, Person> compiledQuery = CompiledQuery.Compile(
	(PeopleDB db, string explicitParameter) =>
		db.People.Where(p => p.FirstName == implicitlyCaptured &&
			p.LastName == explicitParameter)
		.SingleOrDefault()
);
To return to your previous question:

The IQueryable<T> is passed through a series of methods which add their relevant part to the tree (joining and filtering as each function knows how). The final IQueryable<T> could contain different closures from different scopes.
I don't think you can actually do this, unless the different methods accept their closure variables as parameters.  If that's the case, you could conceivably write a CompiledQuery implemented in terms of these other methods, taking explicit parameters to delegate to them.  This is untested, but what I'm thinking may be possible:
var people = CompiledQuery.Compile(
	(PeopleDB db, string firstName, string lastName) =>
		// see earlier definition
		GetPeopleWithFirstName(firstName)
		.Where(p => p.LastName == lastName)
);
You could then store people as a class field (thus caching the pre-parsed query and intermediate SQL with the containing object, so the user has full control over cache policy), and parameters Just Work w/o worrying about a cache lookup using the previous values (as 1. there is no centralized cache, and 2. all needed parameters are explicitly provided).

Hope this provides some food for thought.
- Jon

Giacomo Tesio

May 17, 2009, 6:49:13 AM
to dbl...@googlegroups.com
I'm still analyzing the question.

QueryCache is busted.  The only way to make it work is to do what was originally done (make values part of the cached values), which results in large memory use as every query variation is stored uniquely.

If InputParameterExpression held an Expression<Func<Expression, object>> (a lambda able to find the parameter's value in the expression), this would allow storing input parameter expressions in the select query and refilling the parameters' values with the ones provided by the current expression (without revisiting the whole tree).

I really think that, since our SQL generation process takes so much time, we should try this approach (which would require adding an Expression<Func<Expression, object>> to many ExpressionDispatcher.Analizer methods, I think...)
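A sketch of that shape (hypothetical members, not today's InputParameterExpression; a plain Func stands in for the Expression<Func<...>>):

```csharp
using System;
using System.Linq.Expressions;

// Sketch: the input parameter stores *how* to fetch its current value from a
// freshly built expression tree, rather than caching the value itself.
class InputParameterSketch
{
    public readonly string Alias;                        // SQL parameter name, e.g. "@p0"
    public readonly Func<Expression, object> GetValue;   // extracts the value from a new tree

    public InputParameterSketch(string alias, Func<Expression, object> getValue)
    {
        Alias = alias;
        GetValue = getValue;
    }
}
```

On a cache hit, the cached query would run each GetValue against the new expression tree to rebind the parameter values, without regenerating the SQL.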


Giacomo
(PS: I'm not sure I have been clear)

Jonathan Pryor

May 17, 2009, 9:12:06 AM
to dbl...@googlegroups.com
Even if we change InputParameterExpression so that it accepts the object from which to retrieve a value instead of caching the value itself (which will be required for implementing CompiledQuery), we have a more important profiling question:

Which takes more time: the expression tree generation, the optimization passes, or the SQL generation itself?

This is important, because if we continue using QueryCache but change it to accept parameters from a new expression tree (which would be needed to prevent caching the values with the InputParameterExpression), then we'll still need to go through the expression tree generation phase, just so that we can later extract the current parameter values, before we continue to the SQL generation phase.

I, of course, haven't yet profiled this.  (Perhaps I will on Monday.)

But I do know that expression trees are slow.

How slow?  Ever hear of "static reflection"?  It's a way to get Reflection information in a compiler-checked fashion.  (Google it if you haven't seen it before.)  For example:
static class Reflection {
	public static MethodInfo GetMethod (Expression lambda)
	{
		LambdaExpression le = lambda as LambdaExpression;
		if (le == null)
			throw new ArgumentException ("Not a LambdaExpression.", "lambda");
		MethodCallExpression mce = le.Body as MethodCallExpression;
		if (mce == null)
			throw new ArgumentException ("Does not contain a MethodCallExpression.", "lambda");
		return mce.Method;
	}
}

static class Reflection<T> {
	public static MethodInfo GetMethod<S,R> (Expression<Func<S,R>> lambda)
	{
		return Reflection.GetMethod(lambda);
	}
}
Usage is really cool:
// Gets the MethodInfo for string.CompareTo(string):
var readKey = Reflection<string>.GetMethod(s => s.CompareTo("foo"));
// vs.
var oldReadKey = typeof(string).GetMethod("CompareTo", new[] { typeof(string) });
Performance, on the other hand, sucks.  For my stupid test, Type.GetMethod() is ~4481ms.  Reflection<T>.GetMethod() is 7755ms (for 1000000 iterations), so 75% slower.  (It gets worse if the class has fewer methods; on another class with only 15 members, Type.GetMethod() was ~1000ms, while static reflection was over 5000ms, again for 1000000 iterations.)

Look at the above code.  The overhead isn't in parsing the expression tree (which is what Reflection.GetMethod() does), the overhead is in generating the expression tree.

Our SQL generation is basically just string concats and appends.  Again, I haven't profiled it, but I really, really, doubt that our SQL generation will be slower than either the expression tree generation or the parsing.  I'm not sure it's sensible to try to just cache the results of SQL generation, though it may be sensible to try to cache the results of our "optimized" expression types.

- Jon

Giacomo Tesio

May 17, 2009, 10:27:32 AM
to dbl...@googlegroups.com
Please let me know about the profiling results, once you have them.

If an SQL generation cache can give any speed-up to query execution, I will do the required refactoring.

Otherwise we can remove the cache. But even then we will need to work a lot on performance, since in an enterprise environment (such as the one in which our company will use DbLinq) 7 seconds per query is not acceptable (we work in a web environment with many different queries per request)...


Giacomo