java.lang.OutOfMemoryError - configuration advice?


Bill Roberts

Aug 3, 2013, 11:34:27 AM
to sta...@clarkparsia.com
Hi

I'm new to Stardog and working with an evaluation copy of Stardog 1.2.3. Overall, I'm really liking it so far, but I'm having some problems with data loading that I would welcome advice on.

I'm trying to load a biggish set of test data (35 million triples in total), but I've been having problems with out-of-memory errors and would be glad of any advice on either configuration or alternative loading strategies that might avoid the problem.

I'm running on a MacBook Pro with 8GB RAM, OS X 10.8.4.

My Java version is:
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06-451-11M4406)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01-451, mixed mode)

I have this data as a collection of about 1000 N-Triples files, the biggest of which is about 1.4 million triples. (I also have it as a single big file but, figuring that would have higher memory requirements, I started by loading the smaller files one by one.)

I created a new empty on-disk database:

stardog-admin db create -n test

Then I used a script to call this command in turn for each file in the collection (all files are going into the same named graph):

stardog data add test -g http://postcodes {filename}
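
For reference, the script is essentially a loop along these lines (a rough sketch only; the directory name is a placeholder, the command is exactly the one above):

for f in postcodes/*.nt; do
  stardog data add test -g http://postcodes "$f"
done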

Via ps, the Java config for the Stardog server is:

/usr/bin/java -Xmx2g -Xms2g -Dstardog.install.location=/Users/bill/code/stardog/stardog-1.2.3 -XX:SoftRefLRUPolicyMSPerMB=1 -XX:+UseParallelOldGC -XX:+UseCompressedOops -server -classpath /Users/bill/code/stardog/stardog-1.2.3/lib/stardog-cli.jar com.clarkparsia.stardog.cli.admin.CLI server start

When I start the test, there is about 5GB free on the machine.

After about half an hour, I get an error in the log:

Exception in thread "QuartzScheduler_QuartzSchedulerThread" java.lang.OutOfMemoryError: Java heap space
at java.util.TreeMap.put(TreeMap.java:518)
at java.util.TreeSet.add(TreeSet.java:238)
at org.quartz.simpl.RAMJobStore.acquireNextTriggers(RAMJobStore.java:1439)
at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:264)
Exception in thread "Stardog.Executor-141" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.concurrent.SynchronousQueue$TransferStack.snode(SynchronousQueue.java:280)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:322)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:874)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:955)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:917)
at java.lang.Thread.run(Thread.java:680)
[WARNING org.jboss.netty.channel.socket.nio.AbstractNioWorker.null - Aug 3, 2013 03:48:00.721] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded

(which leads to various other errors about problems with committing transactions).

At this point it is attempting to load the biggest individual file in the collection (about 1.4 million triples).

I appreciate that in general, more memory is better - and when we run this in anger, it would be on a Linux server with lots of RAM. But I figured that by loading small files one at a time, the overall memory requirement wouldn't be too much.

Do you have any suggestions on how I could fix this?  Is it perhaps because the GC is not getting the chance to clear old stuff out quickly enough?  Happy to provide further diagnostic info if you can advise what would be most useful.

The data is a copy of the UK Ordnance Survey postcodes linked data. In the full set, there are something like 1.7 million distinct subjects, so about 20 triples per subject.  None of the literals are particularly big.  If anyone would like to attempt to reproduce, the full set of data can be downloaded from s3://opendatacommunities-dumps/dataset_data_postcodes_20130506183000.nt.zip

Thanks in advance for any advice you can offer.

Best regards

Bill






Zachary Whitley

Aug 3, 2013, 4:24:29 PM
to sta...@clarkparsia.com

Bill Roberts

Aug 3, 2013, 7:11:46 PM
to sta...@clarkparsia.com
Hi Zachary

Thanks for the pointer to the previous thread - apologies for not having searched the forum more thoroughly before posting.

Anyway, just to confirm what you no doubt expected: I can load 35 million triples pretty quickly by using the bulk loader as part of a db create. Also, I realise I can load a TriG file via db create, which is handy for me as I would generally want to initialise a new db with data in various named graphs.
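
For anyone else reading along, the commands I ended up with were roughly these (filenames are placeholders and I'm going from memory on the exact syntax, so check the admin docs):

# bulk load the N-Triples files at creation time
stardog-admin db create -n test postcodes/*.nt

# or, to get several named graphs in one go, create from a TriG file
stardog-admin db create -n test postcodes.trig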

Part of the reason I was trying out the 'data add' option was that an important use case for me will be adding reasonably big files to an already running database, where using the bulk loader isn't really a viable option. I appreciate that this requires a decent chunk of memory.

I've done a few experiments with my test data and found that a 2GB heap will happily load a 120MB/750k-triple file using stardog data add, but runs out of memory with a 160MB/1-million-triple file. So that gives me a useful rule of thumb - does it correspond roughly to what you would expect? In the cases where it doesn't run out of memory, it loads the triples pretty quickly.
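
For the bigger experiments I'll just give the server a larger heap. I think the server picks up JVM options from the STARDOG_JAVA_ARGS environment variable, though I'm not certain 1.2.3 honours it - if not, I'll edit the -Xmx/-Xms values in the start command shown earlier. Something like:

export STARDOG_JAVA_ARGS="-Xmx4g -Xms4g"   # assumed env var; otherwise edit the start script
stardog-admin server start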

I'll see if I can set up stardog on a machine with more RAM than my laptop and try some bigger experiments.

Zachary Whitley

Aug 3, 2013, 9:16:54 PM
to sta...@clarkparsia.com


On Saturday, August 3, 2013 7:11:46 PM UTC-4, Bill Roberts wrote:
Hi Zachary

Thanks for the pointer to the previous thread - apologies for not having searched the forum more thoroughly before posting.

No worries. 
 

Anyway, just to confirm what you no doubt expected: I can load 35 million triples pretty quickly by using the bulk loader as part of a db create. Also, I realise I can load a TriG file via db create, which is handy for me as I would generally want to initialise a new db with data in various named graphs.

Part of the reason I was trying out the 'data add' option was that an important use case for me will be adding reasonably big files to an already running database, where using the bulk loader isn't really a viable option. I appreciate that this requires a decent chunk of memory.

I've done a few experiments with my test data and found that a 2GB heap will happily load a 120MB/750k-triple file using stardog data add, but runs out of memory with a 160MB/1-million-triple file. So that gives me a useful rule of thumb - does it correspond roughly to what you would expect? In the cases where it doesn't run out of memory, it loads the triples pretty quickly.

One of the Stardog devs will have to let you know if that sounds right or not. You can try taking a look at the resource requirements section of the docs (http://stardog.com/docs/admin/#resource-requirements). You may want to try just running it in batches of 100k triples, as the previous post suggested: "100k triple chunks is a good starting point for the batch size, but you can try larger or smaller batches to see what works best for your system & dataset."
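
Since N-Triples is one statement per line, splitting the big files by line count should give you valid chunks to feed to data add. A rough sketch (filenames and chunk prefix are just placeholders):

split -l 100000 big-file.nt chunk_
for f in chunk_*; do
  stardog data add test -g http://postcodes "$f"
done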

Kendall Clark

Aug 5, 2013, 11:38:02 AM
to stardog
Evren is going to be addressing this issue with Stardog intermediate writes (i.e., large writes after database creation) to avoid forcing people to do awkward wipe-and-loads.

If that makes it into the 2.0 release, it will have some limitations (only local adds, not remote) which we will work on lifting in the 2.x cycle.

Cheers,
Kendall



Bill Roberts

Aug 5, 2013, 1:06:52 PM
to sta...@clarkparsia.com

On 5 Aug 2013, at 16:38, Kendall Clark <ken...@clarkparsia.com> wrote:

> Evren is going to be addressing this issue with Stardog intermediate writes (i.e., large writes after database creation) to avoid forcing people to do awkward wipe-and-loads.

Thanks Kendall - that would definitely be useful. But as long as I know roughly what to expect in terms of max file sizes, the current approach should work ok for me. I can just throw enough RAM at it to cover the majority of cases, and split files into chunks for edge cases.

One follow-up question: if I use /transaction/begin and /transaction/commit in the HTTP protocol, would all data added during the transaction be held in memory until it is committed? Hence I'd have to bear in mind heap sizes if grouping together several biggish data adds?
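
What I have in mind is roughly the following (just a sketch: the port, the add path, the graph parameter and the content type are guesses on my part, which I'll check against the HTTP protocol docs):

# begin a transaction; I'm assuming the transaction id comes back in the response body
TX=$(curl -s -u admin:admin -X POST http://localhost:5822/test/transaction/begin)

# add a chunk of N-Triples within the transaction (path and graph parameter assumed)
curl -s -u admin:admin -X POST -H "Content-Type: text/plain" \
  --data-binary @chunk_aa.nt \
  "http://localhost:5822/test/$TX/add?graph-uri=http://postcodes"

# commit (assuming the transaction id goes on the end of the commit path)
curl -s -u admin:admin -X POST "http://localhost:5822/test/transaction/commit/$TX"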


>
> If that makes it into the 2.0 release, it will have some limitations (only local adds, not remote) which we will work on lifting in the 2.x cycle.

My ideal solution would be remote adds using HTTP, but I fully understand that a step-by-step development approach is best!
>
> Cheers,
> Kendall
>

Mike Grove

Aug 6, 2013, 8:51:17 AM
to stardog
On Mon, Aug 5, 2013 at 1:06 PM, Bill Roberts <bi...@swirrl.com> wrote:

On 5 Aug 2013, at 16:38, Kendall Clark <ken...@clarkparsia.com> wrote:

> Evren is going to be addressing this issue with Stardog intermediate writes (i.e., large writes after database creation) to avoid forcing people to do awkward wipe-and-loads.

Thanks Kendall - that would definitely be useful.  But as long as I know roughly what to expect in terms of max file sizes, the current approach should work ok for me.  I can just throw enough RAM at it to cover the majority of cases, and split files into chunks for edge cases.

One follow-up question: if I use /transaction/begin and /transaction/commit in the HTTP protocol, would all data added during the transaction be held in memory until it is committed? Hence I'd have to bear in mind heap sizes if grouping together several biggish data adds?

Correct. Transactions are in-memory only, which is the reason for the memory footprint. As Kendall said, we're currently addressing this, but in the meantime, you'll have to be aware of how much data you've added into the transaction.

Cheers,

Mike
 


>
> If that makes it into the 2.0 release, it will have some limitations (only local adds, not remote) which we will work on lifting in the 2.x cycle.

My ideal solution would be remote adds using HTTP, but I fully understand that a step-by-step development approach is best!
>
> Cheers,
> Kendall
>
