NPE in SailSourceModel


Arjohn Kampman

Apr 6, 2021, 10:39:20 AM
to RDF4J Users

Hi all,

Long time since I've been in this space, hope you all are doing well :)

We've recently upgraded from openrdf 4.1.2 to rdf4j 3.4.3 and we're now encountering the following NullPointerException:

java.lang.NullPointerException: null
    at org.eclipse.rdf4j.sail.nativerdf.SailSourceModel.remove(SailSourceModel.java:291)
    at org.eclipse.rdf4j.sail.nativerdf.MemoryOverflowModel.remove(MemoryOverflowModel.java:134)
    at org.eclipse.rdf4j.model.impl.AbstractModel.remove(AbstractModel.java:171)
    at org.eclipse.rdf4j.sail.base.Changeset.deprecate(Changeset.java:286)
    at org.eclipse.rdf4j.sail.base.SailSourceBranch.flush(SailSourceBranch.java:487)
    at org.eclipse.rdf4j.sail.base.SailSourceBranch.compressChanges(SailSourceBranch.java:325)
    at org.eclipse.rdf4j.sail.base.SailSourceBranch$2.close(SailSourceBranch.java:215)
    at org.eclipse.rdf4j.sail.base.SailSourceConnection.endUpdateInternal(SailSourceConnection.java:646)
    at org.eclipse.rdf4j.sail.helpers.AbstractSailConnection.endUpdate(AbstractSailConnection.java:535)
    at org.eclipse.rdf4j.sail.helpers.AbstractSailConnection.flush(AbstractSailConnection.java:351)
    at org.eclipse.rdf4j.sail.helpers.AbstractSailConnection.flushPendingUpdates(AbstractSailConnection.java:827)
    at org.eclipse.rdf4j.sail.helpers.AbstractSailConnection.addStatement(AbstractSailConnection.java:443)
    at org.eclipse.rdf4j.repository.sail.SailRepositoryConnection.addWithoutCommit(SailRepositoryConnection.java:383)
    at org.eclipse.rdf4j.repository.base.AbstractRepositoryConnection.add(AbstractRepositoryConnection.java:428)
    ...

In the issue tracker there are #693 and #774, which are related to this, but both were closed in earlier releases. We have now upgraded to rdf4j 3.6.2, but the problem remains.

The calling code doesn't actually do anything fancy. The relevant part of the algorithm starts a READ_COMMITTED transaction and then calls getStatements(s,p,o), add(s,p,o) and remove(s,p,o) a lot of times. The entire algorithm is single threaded and runs on a simple NativeStore. The process runs fine for 1 to 2 hours, after which these exceptions start to appear.

Is this a known issue? Any suggestions how to fix or where to look? The NativeStore code looks quite different from when I last worked on it...

Cheers,

Arjohn Kampman

Jeen Broekstra

Apr 6, 2021, 7:14:37 PM
to RDF4J Users


On Wed, 7 Apr 2021, at 00:39, Arjohn Kampman wrote:

Hi all,

Long time since I've been in this space, hope you all are doing well :)

Hey Arjohn, this is a nice surprise :) Welcome back! 


We've recently upgraded from openrdf 4.1.2 to rdf4j 3.4.3 and we're now encountering the following NullPointerException:

java.lang.NullPointerException: null
    at org.eclipse.rdf4j.sail.nativerdf.SailSourceModel.remove(SailSourceModel.java:291)
    at org.eclipse.rdf4j.sail.nativerdf.MemoryOverflowModel.remove(MemoryOverflowModel.java:134)
    at org.eclipse.rdf4j.model.impl.AbstractModel.remove(AbstractModel.java:171)

[...]

In the issue tracker there are #693 and #774, which are related to this, but both were closed in earlier releases. We have now upgraded to rdf4j 3.6.2, but the problem remains.

The calling code doesn't actually do anything fancy. The relevant part of the algorithm starts a READ_COMMITTED transaction and then calls getStatements(s,p,o), add(s,p,o) and remove(s,p,o) a lot of times. The entire algorithm is single threaded and runs on a simple NativeStore. The process runs fine for 1 to 2 hours, after which these exceptions start to appear.

Is this a known issue? Any suggestions how to fix or where to look? The NativeStore code looks quite different from when I last worked on it...


I don't believe this is a known issue, and I think you're right that it's related to #693 in particular - it may be a special case that we've overlooked. I'm looking at SailSourceModel.remove(). On line 285 it calls the private dataset() method, and one of that method's side effects is that it resets the private sink field to null (lines 393-399). This looks odd to me, because on line 291 the remove method then uses that private field directly. I haven't had time yet to dive in more deeply, but my guess is that either line 291 should use the sink() method to obtain a reference to the sink (instead of using the field directly), or dataset() should not touch the sink field as a side effect, or both. Looking at it, I'm actually a bit surprised that this NPE doesn't occur sooner/more often.
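To make the suspected interaction concrete, here is a minimal standalone sketch of the pattern described above. The class and method names only mirror the ones in the thread (this is not the actual RDF4J code, and the Sink type here is a hypothetical stand-in): dataset() nulls out the sink field as a side effect, so using the field directly afterwards throws, while going through the sink() accessor would re-open it.

```java
import java.util.HashSet;
import java.util.Set;

// Standalone illustration of the suspected bug, not RDF4J source.
class SailSourceModelSketch {

    // Hypothetical stand-in for the real Sink.
    static class Sink {
        final Set<String> approved = new HashSet<>();
        void approve(String st) { approved.add(st); }
        void deprecate(String st) { approved.remove(st); }
    }

    private Sink sink;

    // Accessor that lazily (re)opens the sink.
    private Sink sink() {
        if (sink == null) {
            sink = new Sink();
        }
        return sink;
    }

    // Models dataset(): flushes and, as a side effect, resets the
    // private sink field to null (the behavior around lines 393-399).
    private Object dataset() {
        sink = null;             // side effect: sink field is now null
        return new Object();     // stand-in for the returned dataset
    }

    // Buggy variant: uses the field directly after dataset() nulled it,
    // like line 291 in the stack trace.
    boolean removeBuggy(String st) {
        dataset();               // resets this.sink to null
        sink.deprecate(st);      // NullPointerException here
        return true;
    }

    // Candidate fix from the message above: go through sink() so a
    // fresh sink is obtained instead of touching the stale field.
    boolean removeFixed(String st) {
        dataset();
        sink().deprecate(st);    // accessor re-opens the sink
        return true;
    }
}
```

Run against this toy model, removeBuggy() reliably throws the NPE while removeFixed() succeeds, which matches the intuition that routing line 291 through sink() would avoid the crash.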

I've logged a bug report for this: https://github.com/eclipse/rdf4j/issues/2967 . Do you think you are available to have a look at this yourself, and provide a regression test + fix? We would really appreciate it. Have a look at https://github.com/eclipse/rdf4j/blob/main/.github/CONTRIBUTING.md for a quick rundown on how to get started providing patches to RDF4J, in particular the point on signing the Eclipse Contributor Agreement is kinda important. Shouldn't take more than a few minutes to sort it out (the "signing" is just ticking a box on a web form), let me know if anything is unclear. 

Cheers,

Jeen 


Arjohn Kampman

Apr 7, 2021, 4:04:44 PM
to rdf4j...@googlegroups.com

Not sure if I'll have the time to debug it properly now, but if I find anything I'll definitely let you know. My observation so far is that this NPE is triggered for the first time shortly after the model "overflows" to disk, or at least that's my understanding of what happens in the code so far:

* Initially the algorithm starts with a very good speed, but the memory usage is growing quickly. I guess in this phase everything is still stored in RAM. This takes about 12 minutes.

* Then when the memory usage gets close to 80 or 90%, the process slows down a lot. Memory usage stays at 80-90%, but CPU usage goes up a lot. We're not seeing any progress during this time, but we're not sure whether the process has completely stalled or is just very slow. This takes about 40 minutes.

* Then after this period the memory usage drops drastically to a steady 1 GB and we start to see progress again. The first NPE is triggered some 90 seconds after this.

Does this help?


Jeen Broekstra

Apr 7, 2021, 6:34:00 PM
to RDF4J Users
That's helpful, thanks. It confirms that the cause is in the memory overflow / disk sync part of the transaction handling.

To be honest, there are a number of outstanding issues with the memory overflow approach in the native store: we've had a few bug reports with exceptions and other strange behaviors, and the significant slowdown close to 90% memory usage (when the disk sync kicks in) is a known issue as well.

I'll try and find time myself to dive in a bit deeper as well, and see about a fix. 

I must admit I'm starting to wonder whether we should be a bit more drastic about this and completely rethink the memory overflow approach in the native store.

Jeen

Arjohn Kampman

Apr 8, 2021, 5:19:07 PM
to rdf4j...@googlegroups.com

The reason that syncing the data to disk takes so long when the overflow is triggered is also related to a call to dataset(). When overflowToDisk() calls disk.addAll(memory), this triggers a call to SailSourceModel.add(s,p,o,c...) for each statement. That method calls both contains(s,p,o,c...) and sink().approve(s,p,o,c) for each statement. The latter call starts a new transaction and updates the txn-status file, but the contains() call then commits the transaction for the previous statement via a call to dataset(), again updating the txn-status file. So for every cached statement, rdf4j does two I/O calls. On a spinning disk with an average write time of 10 ms, this limits the overflow process to at most 50 statements per second.
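The I/O pattern described above can be counted with a small standalone sketch (again, not RDF4J code; the method names only mirror the ones in the thread, and the txn-status write counter is a stand-in for the actual disk writes). Each add() triggers a contains() that commits the previous statement's transaction, followed by an approve() that starts a new one, so every cached statement after the first costs two txn-status writes:

```java
// Standalone sketch of the overflow I/O pattern: each write to the
// counter models one synchronous update of the txn-status file.
class OverflowIoSketch {
    int txnStatusWrites = 0;     // models writes to the txn-status file
    boolean txnActive = false;

    // Models dataset(): commits any active transaction as a side effect.
    void dataset() {
        if (txnActive) {
            txnStatusWrites++;   // write txn-status: committed
            txnActive = false;
        }
    }

    // Models contains(s,p,o,c...): commits the previous statement's
    // transaction via dataset() before checking.
    boolean contains(String st) {
        dataset();
        return false;
    }

    // Models sink().approve(s,p,o,c): starts a new transaction.
    void approve(String st) {
        txnStatusWrites++;       // write txn-status: active
        txnActive = true;
    }

    // Models SailSourceModel.add(s,p,o,c...) as described above.
    void add(String st) {
        if (!contains(st)) {
            approve(st);
        }
    }

    // Simulates disk.addAll(memory) for n cached statements and
    // returns the number of txn-status writes it would cost.
    static int writesFor(int n) {
        OverflowIoSketch m = new OverflowIoSketch();
        for (int i = 0; i < n; i++) {
            m.add("st" + i);
        }
        return m.txnStatusWrites;
    }
}
```

For n statements this comes out to 2n - 1 writes, i.e. roughly two per statement, which at 10 ms per write gives the 1000 / (2 * 10) = 50 statements per second ceiling mentioned above.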

Jeen Broekstra

Apr 8, 2021, 6:49:50 PM
to RDF4J Users
Thanks for the analysis. I'm a little rusty on the inner workings of the native store myself, but this sounds like something we can trace and fix. Do you think you could help sketch out what such a fix should look like? At a guess, we would need to find a way to avoid doing an intermittent commit for each statement.

Regarding the NPE, there's a pull request up with a fix: https://github.com/eclipse/rdf4j/pull/2977 . Can you take a look and see what you think?

Cheers,

Jeen 