But my more complex application still crashes with TimeoutExceptions
or OOMEs after running for several hours.
I had it running over the weekend with Ants, taking a snapshot every 30
min.
What I see is that the memory consumption grows exponentially. During
the first 5 hours, it stays below 85 MB (total size of objects). After
8 hours I have 180 MB, 379 MB after 10 hours, and finally after 13 hours
it's 712 MB (at this point my test loop breaks with a TimeoutException
from WaitForNonStaleResults()).
Ants tells me there's no memory fragmentation issue.
I would like to share the profiler results. Unfortunately, a snapshot
every 30 minutes is way too much data, and Ants does not seem to allow
me to delete snapshots.
But here's a CSV of the diff between the snapshots at 5h and 13h:
http://www.filedropper.com/snapshot-comparison
At the top I have a
Dictionary<TKey, TValue>+Entry<Object,
LinkedListNode<SimpleLRUCache+ListValueEntry>>[]
with an increase of about 380MB.
This seems to be something that happens inside Lucene.Net.
I hope to find some more time this week to dig deeper into this.
Tobias
> Does your "complex" app handle significantly more documents, or larger
> documents?
What's a good way to measure the document size?
I have basically two types of docs involved. "Product" translates to a
formatted JSON doc of about 3 KB, and "Sale" is about 40 KB.
And the real app has some more indexes involved (but just 8 indexes, no
map/reduce).
> Can you make a dump of that dictionary object, or rather the SimpleLRUCache
> ?
How can I create a dump with the Ants Memory Profiler?
> What I'm interested to see is the count of objects it holds references to.
> Its size should be 1024 - if it's significantly larger, then it doesn't do
> removals correctly. Otherwise it leaks memory.
It doesn't look like there's a SimpleLRUCache with more than 1024
entries. But there are a lot of SimpleLRUCache instances (13,016; this
was just 229 at the second snapshot).
What I also see are whopping 274,083 instances of
"ThreadLocal<T>+Boxed<Object>". This seems to relate to
CloseableThreadLocal in Lucene.Net. Looks like the ThreadLocal instance
created there is never disposed.
> Lastly, can you create a simple application that mimics this behavior you
> have in your project?
I'm working on this - it's not that easy if you have to wait hours to
see a significant increase in memory usage.
Tobias
On 15.08.2011 13:11, Itamar Syn-Hershko wrote the questions quoted above.
> I think that I know what is going on, although I am not so sure about why.
> If the number of SimpleLRUCache grows, it is likely that there is an issue
> with disposing them, the question is _why_ they are still being held.
> We recreate a search after every indexing batch, but I am pretty sure we
> properly dispose of them.
> Tobi,
> Can you give me the rooted information for all the SimpleLRUCache and for
> the Thread Local stuff ?
That's all I can give you right now:
http://www.filedropper.com/ants_1
I had too many snapshots and Ants couldn't save them and just died.
I'll have to start a new test today before leaving.
Could it be a problem that the ThreadLocal<> in CloseableThreadLocal
never gets disposed? (It's listed with >274,000 instances in Ants.)
Tobias
> I think that I know the bug, I wasn't disposing of the ThreadLocal instance
> properly,
Great, so my guess was right...
> I did a better code review of our usage of that and I think that I caught
> all of the places where this is happening.
> New build should be out in a few hours, and hopefully that would resolve
> that issue once and for all.
I eagerly await the commit!
> Thanks a lot for being so thorough in finding and tracking this. Hopefully we
> can now put it to rest.
I'll start another long-running test as soon as you've committed your
changes.
Tobias
Are you going to push the ThreadLocal-changes today?
Tobias
> New build is out with this fix
Thanks! - I'm already building it and will start a new test tonight, so I
can hopefully tell you tomorrow afternoon if this solves the memory trouble.
Tobias
> Thanks! - I'm already building it
... got an assertion error when building RavenDB:
http://e-tobi.net/stuff/ravendb/assertion1.png
(This is about accessing an already disposed ThreadLocal)
It didn't happen when I ran the build a second time.
Tobias
> ... got an assertion error when building RavenDB:
> http://e-tobi.net/stuff/ravendb/assertion1.png
> It didn't happen when I ran the build a second time.
But it happened again during the third build, at a different place:
http://e-tobi.net/stuff/ravendb/assertion2.png
On the fourth build I got:
Tobias
I still have an increasing number of ThreadLocal<T>+Boxed<Object>
instances (>17,000 right now), but I have no idea where they come
from. Ants shows them as children of object[] instances (which are
the GC root), but they seem to be created by Lucene (Value is
FieldsReader or TermInfosReader or just null).
I guess they will be collected at some point...
Tobias
> Tobi, thanks for that, I am going to fix that, but it will probably only
> be next week, since I am traveling currently
Pity! I have a lot of trouble with the memory leak issues, but don't trust
the current unstable build.
I tried to fix the ThreadLocal issues myself, but it's kinda hard to
understand what's going on there and why ThreadLocal members are accessed
after the object has been disposed.
Here's what I tried so far:
https://github.com/e-tobi/ravendb/commit/ec17e91e9a1bac4bcc5dad2f813545bb0f3ac623
The same needs to be done in the Esent storage. It kinda works, but I
don't know how to cover this with tests, and I'm not sure if there isn't a
better way to fix these problems. And there are more problems with
the ThreadLocal disposal in other places.
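The failure mode behind those assertion screenshots can be reproduced in a few lines (a minimal sketch, not RavenDB's actual code): any access to a ThreadLocal<T> after it has been disposed throws an ObjectDisposedException.

```csharp
using System;
using System.Threading;

// Minimal repro of the failure mode: touching ThreadLocal<T>.Value after
// Dispose() throws ObjectDisposedException.
var local = new ThreadLocal<object>(() => new object());
var before = local.Value;        // fine: lazily creates this thread's value
local.Dispose();

bool threwAfterDispose = false;
try
{
    var after = local.Value;     // any access after Dispose() blows up
}
catch (ObjectDisposedException)
{
    threwAfterDispose = true;
}
Console.WriteLine(threwAfterDispose); // prints True
```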
Tobias
> That is why it is called unstable
I know. But someone has to test unstable and figure out all those nasty
little problems :-)
BTW: What happened to the unstable build/commit logs posted to the mailing
list? There hasn't been one since build-428.
> Next week, this has the first prio
Ok. Looking forward to it!
I'm testing the current unstable now for >24 hours. I get some index
errors because of the ThreadLocal disposal issues, but I never got
this far before. Memory usage is now much more sane.
I still have a growing number of ThreadLocal<T>+Boxed<Object> instances
(>30,000 at the moment). They are rooted in object[] instances.
And I guess, running this for some more days, I would get an OOME again.
The only explanation I have is that CloseableThreadLocal.Close() isn't
called in all cases.
If time permits I'll try to track this down.
Tobias
On 18.08.2011 03:07, Ayende Rahien wrote the replies quoted above.
> I don't think that I follow why this change would work, since this is
> already covered by the lock, it shouldn't be any different.
No. The first access to current.Value in Batch() is NOT covered by the lock.
https://github.com/ayende/ravendb/blob/master/Raven.Storage.Managed/TransactionalStorage.cs#L88
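A sketch of why that matters and how the access could be guarded (all names here are illustrative, not RavenDB's actual code): every read of the ThreadLocal has to happen under the same lock that Dispose() takes, otherwise a thread can race the disposal and hit a disposed instance.

```csharp
using System;
using System.Threading;

// Hypothetical sketch (names are mine, this is not RavenDB's code): in the
// racy version, the first read of current.Value happens *before* the lock
// is taken, so Dispose() can run in between. In the guarded version, the
// disposed flag and the ThreadLocal are only touched under the lock.
var sync = new object();
var disposed = false;
var current = new ThreadLocal<int>(() => 0);

// Guarded read: lock first, check the disposed flag, only then touch the
// ThreadLocal.
Func<int> guardedRead = () =>
{
    lock (sync)
    {
        if (disposed) throw new ObjectDisposedException("TransactionalStorage");
        return current.Value;
    }
};

Action dispose = () =>
{
    lock (sync)                 // same lock: serializes with guardedRead
    {
        disposed = true;
        current.Dispose();
    }
};

int value = guardedRead();      // works before disposal
dispose();

bool threw = false;
try { guardedRead(); }
catch (ObjectDisposedException) { threw = true; }
Console.WriteLine(threw);       // prints True: fails cleanly instead of racing
```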
Tobias
> Actually, that is what 3 AM is for, right?
:-)
>You should have a new build in an hour or so.
Great!
> That is interesting, I am pretty sure that we are disposing of all of our
> thread local instances.
> Do you have any idea who created those thread local instances?
For about 1,700 of the 30,000 instances, Value is != null and contains
Lucene-related stuff like FieldsReader or TermInfosReader.
Ants doesn't show me who created the instances; I think dotTrace can do
this - maybe I'll have to try that one.
Tobias
> Okay, that probably means that we need to do a code review of Lucene and see
> if they aren't closing the thread local stuff.
At first sight, it looks like Lucene.Net.Analysis.Analyzer.Close()
isn't used at all.
Tobias
> At first sight, it looks like, Lucene.Net.Analysis.Analyzer.Close()
> isn't used at all.
Forget this - Analyzers are not instantiated from within Lucene.Net.
I had been looking at the QueryParser's Main() without noticing.
Tobias
> Okay, that probably means that we need to do a code review of Lucene and see
> if they aren't closing the thread local stuff.
> If they do close it, it might mean that we aren't properly disposing of
> Lucene resources.
Ok. I did some "brute-force" analysis of this problem by adding some
code to track when a CloseableThreadLocal is created and which instances
are not closed. The number of unclosed CloseableThreadLocals is
increasing significantly over time:
2011-08-18T17:28:40Z, 238
2011-08-18T17:30:42Z, 312
2011-08-18T17:32:45Z, 375
2011-08-18T17:34:47Z, 470
2011-08-18T17:36:49Z, 455
2011-08-18T17:38:52Z, 560
2011-08-18T17:40:55Z, 595
2011-08-18T17:42:57Z, 683
2011-08-18T17:44:58Z, 739
2011-08-18T17:46:59Z, 748
2011-08-18T17:49:01Z, 825
2011-08-18T17:51:05Z, 887
2011-08-18T17:53:07Z, 984
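The tracking code itself isn't shown above; a minimal sketch of how such "brute-force" leak accounting could look (the table and helper names are mine):

```csharp
using System;
using System.Collections.Generic;

// "Brute-force" leak accounting sketch: remember a stack trace for every
// created instance, forget it again on Close(), and treat whatever is left
// in the table as a leak candidate.
var liveInstances = new Dictionary<object, string>();
var tableLock = new object();

Action<object> trackCreate = instance =>
{
    lock (tableLock)
        liveInstances[instance] = Environment.StackTrace; // who created it
};

Action<object> trackClose = instance =>
{
    lock (tableLock)
        liveInstances.Remove(instance);                   // properly closed
};

// Simulate three creations and one Close(): two leak candidates remain.
var a = new object(); var b = new object(); var c = new object();
trackCreate(a); trackCreate(b); trackCreate(c);
trackClose(b);

Console.WriteLine(liveInstances.Count); // prints 2
```

Dumping the surviving entries (and their creation stack traces) periodically gives exactly the kind of growing count and stack-trace list shown here.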
Here's a list of stack traces from within the CloseableThreadLocal
ctor where no Close() has been called yet:
http://www.filedropper.com/dump1
The first one belongs to a SimpleAnalyzer created by
Raven.Database.Indexing.IndexStorage. This is created per
DocumentStore, so nothing to worry about - this is expected.
The second-oldest stack trace looks more interesting.
FieldsReader gets cloned here and assigned to a FieldsReaderLocal,
which itself is a CloseableThreadLocal - feels like having a knot in my
brain :-)
I guess FieldsReaderLocal should Close its wrapped value when being
closed itself:
+ public override void Close()
+ {
+     ((FieldsReader)Get()).Close();
+     base.Close();
+ }
This doesn't seem to hurt, but it doesn't help either. It's hard to
follow what's going on inside Lucene.Net :-(
I'll now try to test this in isolation somehow, maybe this brings me
closer to the root of all evil.
Tobias
PS: This is how I created the above stack traces - "brute force" - I
told you :-):
> I guess FieldsReaderLocal should Close its wrapped value when being
> closed itself:
>
> + public override void Close()
> + {
> + ((FieldsReader)Get()).Close();
> + base.Close();
> + }
Gotcha!
Took me some amount of thread-thinking, but this is indeed the "problem
zone". My solution above only works partially: because of the nature of
ThreadLocal, the code will only close the FieldsReader assigned to the
thread that calls FieldsReaderLocal.Close(), not the ones that have
been created by other threads.
So assigning any IDisposable (or, as in the Lucene example, something
where Close() needs to be called) to ThreadLocal.Value is generally a
bad idea if you can't make each thread responsible for
disposing/closing its value.
Currently I see no other solution than to track all created
FieldsReader instances across all threads and to close them explicitly
in FieldsReaderLocal.Close().
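A sketch of that idea (illustrative only, not Lucene.Net's actual implementation): keep a hard reference to every per-thread value next to the ThreadLocal, so a single Close() can reach the values created by all threads, not just the calling one.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Sketch: ThreadLocal<T>.Dispose() does not close the values other threads
// created, so remember a "close my value" action for each thread and run
// them all from Close().
var local = new ThreadLocal<object>();
var closeActions = new List<Action>();   // one per thread that set a value
var sync = new object();

Action<object, Action> set = (value, closeValue) =>
{
    local.Value = value;
    lock (sync) closeActions.Add(closeValue); // remembered across threads
};

Action closeAll = () =>
{
    lock (sync)
    {
        foreach (var close in closeActions) close(); // every thread's value
        closeActions.Clear();
    }
    local.Dispose();
};

// Two worker threads each install a value; Close() from the main thread
// still reaches both of them.
int closed = 0;
var t1 = new Thread(() => set(new object(), () => Interlocked.Increment(ref closed)));
var t2 = new Thread(() => set(new object(), () => Interlocked.Increment(ref closed)));
t1.Start(); t2.Start(); t1.Join(); t2.Join();

closeAll();
Console.WriteLine(closed); // prints 2
```

The cost of this design is that the hard references keep every per-thread value alive until Close(), which is exactly the trade-off a CloseableThreadLocal has to make.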
Hope this helps...
Tobias
> I modified our fork of Lucene.NET to reflect those behaviors, and it will
> now properly dispose of all of those values.
Thanks!
> There is a new build getting ready to pop now, and I would be grateful if
> you can run the same test cases on that.
That's the first thing I'll do tomorrow!
Tobias