|Unexplainable difference in loading times||Laurens De Vocht||6/22/12 6:50 AM|
I am evaluating Stardog for loading a large number of triples (up to 5 billion). As a first step, I am testing how well it performs on a 4 GB, 4-CPU virtual machine running in VirtualBox under Windows 7.
I am loading the LODIB generated datasets (5M triples) and I get the following strange results:
Source 1 564.1 MB
Data load complete. Loaded 1,626,795 triples in 00:14:30 @ 1.9K triples/sec -> ?
Source 2 447.9 MB
Data load complete. Loaded 1,786,369 triples in 00:00:44 @ 40.4K triples/sec. -> that's fast
Source 3 569.5 MB
Data load complete. Loaded 1,584,828 triples in 00:11:09 @ 2.4K triples/sec.
A second run for source 3 gave:
Data load complete. Loaded 1,584,828 triples in 00:07:29 @ 3.5K triples/sec.
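The rates in the reports above can be cross-checked from the triple counts and durations alone. The following is a minimal sketch, not part of Stardog: the regular expression and the `throughput` helper are hypothetical names introduced here, and the input lines are copied from this post. Since the reported durations are rounded to whole seconds, the recomputed rates are only approximate.

```python
import re

# Matches the count and HH:MM:SS duration in a "Data load complete" line.
LOAD_LINE = re.compile(r"Loaded ([\d,]+) triples in (\d+):(\d+):(\d+)")


def throughput(line):
    """Return (triples, seconds, triples_per_second) for one report line."""
    match = LOAD_LINE.search(line)
    if match is None:
        raise ValueError(f"not a load report line: {line!r}")
    triples = int(match.group(1).replace(",", ""))
    hours, minutes, secs = (int(match.group(i)) for i in (2, 3, 4))
    seconds = hours * 3600 + minutes * 60 + secs
    if seconds == 0:
        raise ValueError("duration rounds to zero seconds")
    return triples, seconds, triples / seconds


reports = [
    "Data load complete. Loaded 1,626,795 triples in 00:14:30 @ 1.9K triples/sec",
    "Data load complete. Loaded 1,786,369 triples in 00:00:44 @ 40.4K triples/sec.",
    "Data load complete. Loaded 1,584,828 triples in 00:11:09 @ 2.4K triples/sec.",
    "Data load complete. Loaded 1,584,828 triples in 00:07:29 @ 3.5K triples/sec.",
]

for line in reports:
    triples, seconds, rate = throughput(line)
    print(f"{triples:>9,} triples in {seconds:>4d} s = {rate / 1000:.1f}K triples/sec")
```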
I don't understand how the difference can be explained, but it probably has something to do with garbage collection? During the loading of the triples, memory consumption was at 100%, and Source 2 probably just 'fitted' in memory.
Also, when Stardog is running idle with the data loaded, memory usage is about 3 GB! Will it actually work in the cloud on a small instance with, let's say, 512 MB available for Stardog?
|Re: [stardog-users] Unexplainable difference in loading times||Kendall||6/22/12 7:04 AM|
On Fri, Jun 22, 2012 at 9:50 AM, Laurens De Vocht <laur...@gmail.com> wrote:
That is enormous variation, far beyond what we've ordinarily seen.
My first guess is that you should run a database for which you want good performance on a real computer, not on a virtual one (where I/O can be very bad, etc).
Beyond that, it could be GC, it could be something else.
We'll take a look and see if we can reproduce.
Al Baker, who's on this list, is the world's expert on running Stardog on memory constrained systems, so maybe he'll chime in here.
|Re: [stardog-users] Unexplainable difference in loading times||Robert Butler||6/22/12 8:38 AM|
I thought I would weigh in on running Stardog in the cloud as well since that is what we do at Pancake Technology:
In the cloud in general, your mileage is going to vary with your cloud provider. I've run Stardog instances in both Amazon EC2 and Rackspace. Amazon's performance in general is not great for disk I/O based systems and can be very unpredictable on small instances. Rackspace has much better performance with only small increases in price per unit of performance. Comparable machines in Rackspace perform better hands down. So, it's critical to evaluate your cloud provider (or private cloud infrastructure) and test Stardog on it directly.
I currently have Stardog running in the Rackspace cloud on the following instance sizes (most of the memory is reserved for Stardog on these boxes):
- 512 MB dev box
- 1024 MB dev box
- 256 MB prod box
- 2048 MB prod box
The query rates and data sizes on these boxes are relatively small for what Stardog can handle with that much memory. Stardog does run on < 256 MB RAM on a production machine with no stability/lag issues. The key is to figure out the memory size needed for your particular data set and performance requirements.
As a side note, I've personally seen huge performance/lag issues when running disk I/O intensive apps inside a VM on non-server grade hardware and software. I would expect the VM to be causing your intermittent performance issues.
Hope that helps,
|Re: [stardog-users] Unexplainable difference in loading times||Evren Sirin||6/22/12 9:21 AM|
I'm not sure what is causing the slowness but it probably has
something to do with how much resource virtual machine can spare for
Loading these three data sources on my desktop (iMac running OSX
10.6.8 2.8Ghz Intel i7, 16G RAM) with default Stardog settings (JVM
memory set to 2GB) gave the following results where data source 2
loads faster but the difference is not that much:
~/programs/stardog/stardog-1.0$ ./stardog-admin create -n lodib5m_1
Bulk loading data to new database.
Data load complete. Loaded 1,626,795 triples in 00:00:34 @ 46.7K triples/sec.
Successfully created database 'lodib5m_1'.
~/programs/stardog/stardog-1.0$ ./stardog-admin create -n lodib5m_2
Bulk loading data to new database.
Data load complete. Loaded 1,786,369 triples in 00:00:28 @ 63.8K triples/sec.
Successfully created database 'lodib5m_2'.
~/programs/stardog/stardog-1.0$ ./stardog-admin create -n lodib5m_3
Bulk loading data to new database.
Data load complete. Loaded 1,584,828 triples in 00:00:36 @ 43.7K triples/sec.
Successfully created database 'lodib5m_3'.
It is not unusual to see a 10% or 20% difference in loading times,
especially when the same data source is loaded repeatedly, due to how
the OS caches pages from disk. It is also usual to see loading times
change between different data sources. For example, see the loading
times we report in . The nearly 20-fold difference you see when loading
different data sources is very unusual, though. Maybe the load on the
machine varied between runs, affecting the load performance?
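To put numbers on the gap between the two setups, the ratio of the fastest to the slowest load rate can be computed for each. This is a rough sketch: `spread` is a hypothetical helper, and the (triples, seconds) pairs are taken from the reports quoted in this thread, whose durations are rounded to whole seconds, so the ratios are approximate.

```python
def spread(loads):
    """Ratio of fastest to slowest rate; loads is a list of (triples, seconds)."""
    rates = [triples / seconds for triples, seconds in loads]
    return max(rates) / min(rates)


# First runs on the 4 GB VirtualBox VM, sources 1 to 3.
vm_loads = [
    (1_626_795, 14 * 60 + 30),
    (1_786_369, 44),
    (1_584_828, 11 * 60 + 9),
]

# Runs on the desktop with default Stardog settings, sources 1 to 3.
desktop_loads = [
    (1_626_795, 34),
    (1_786_369, 28),
    (1_584_828, 36),
]

print(f"VM spread:      {spread(vm_loads):.1f}x")
print(f"desktop spread: {spread(desktop_loads):.1f}x")
```

On these figures the VM spread comes out at roughly 20x or a little more, while the desktop spread stays below 1.5x.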
|Re: [stardog-users] Unexplainable difference in loading times||Laurens De Vocht||6/25/12 1:04 AM|
Is there any difference between the configuration of the dev and prod boxes? Or is it just the JVM settings that are adapted to the available memory?
|Re: [stardog-users] Unexplainable difference in loading times||Laurens De Vocht||6/25/12 1:09 AM|
Maybe, yes, but I did the same test with Jena TDB (default Tomcat settings, same JVM configuration) and the file with 5 million triples loaded in 86 seconds (all sources loaded in under 90 seconds). So as far as I can see, Stardog could be faster than Jena TDB, but for some reason it hangs.
I wonder what's happening.
|Re: [stardog-users] Unexplainable difference in loading times||Kendall||6/25/12 3:37 AM|
It's hard for us to address an issue we can't reproduce, and we've failed to reproduce your report.
|Re: [stardog-users] Unexplainable difference in loading times||Laurens De Vocht||6/25/12 4:15 AM|
OK, thx, will try again in another configuration.
|Re: [stardog-users] Unexplainable difference in loading times||Robert Butler||6/25/12 4:24 AM|
The only difference w.r.t. configuration is max memory size passed to the JVM.
|Re: [stardog-users] Unexplainable difference in loading times||Laurens De Vocht||6/25/12 5:19 AM|
The problem is related to memory swapping (in the virtual machine).
When I load the triples, more than 1 million at once (JVM configured with 2 GB and the VM with 4 GB), memory fills up and at that point the system starts swapping.
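One way to confirm this kind of diagnosis is to watch swap usage in the guest while a load is running. The sketch below assumes a Linux guest that exposes `/proc/meminfo` (the thread does not say which guest OS was used), and `parse_meminfo` and `swap_used_kb` are hypothetical helper names, not part of Stardog or the JVM.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_used_kb(info):
    """Swap currently in use, in kB (0 if the swap fields are missing)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


# Assumption: Linux guest. On other systems this block is simply skipped.
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        current = parse_meminfo(f.read())
    print(f"MemTotal:  {current.get('MemTotal', 0) // 1024} MB")
    print(f"MemFree:   {current.get('MemFree', 0) // 1024} MB")
    print(f"Swap used: {swap_used_kb(current) // 1024} MB")
```

Running this before and during a bulk load would show whether swap use grows as the load progresses; steadily growing swap use alongside a collapsing load rate points to memory pressure rather than garbage collection alone.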
Increasing the VM's memory to 5 GB (the swap file was around 500 MB) while keeping 2 GB for the JVM did the trick.
Data load complete. Loaded 1,626,795 triples in 00:00:55 @ 29.1K triples/sec.
Successfully created database 'ldb5ms1t1'.
Data load complete. Loaded 1,786,369 triples in 00:00:24 @ 73K triples/sec.
Successfully created database 'ldb5ms2t1'.
Data load complete. Loaded 1,584,828 triples in 00:00:52 @ 30.4K triples/sec.
Successfully created database 'ldb5ms3t1'.
So that's twice as fast as Jena TDB. Note, though, that Jena TDB had the Tomcat 7 memory set to 1.2 GB (which is the default).
I am not sure whether the use of Tomcat is an advantage or a disadvantage.