Import data in batch mode


Sotiris Beis

Sep 25, 2014, 10:13:03 AM
to spar...@googlegroups.com
Hi all,

Is there any interface to import data into Sparksee in batch mode? I am looking for something like the BatchInserter used in Neo4j.

Thanks in advance,
Sotiris

c3po.ac

Sep 25, 2014, 10:37:47 AM
to spar...@googlegroups.com

Hi,

You can use Sparksee scripts to load CSV files:

http://www.sparsity-technologies.com/UserManual/Scripting.html

To load big files fast, I would recommend backing up your database first and then using a script with recovery disabled (the default, unless you enable it in the "sparksee.cfg" file).
The script should only be used to add new data; if some of the data is not valid, very big log files can be created.
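
For reference, these settings live in the configuration file. A minimal sketch, assuming the property names from the configuration chapter of the manual (they are not quoted in this thread, so please verify them there):
----------------------------------------------
# recovery is off by default; keep it off while bulk loading
sparksee.io.recovery=false
# cache size in MB
sparksee.io.cache.maxsize=4096
----------------------------------------------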

If you want to update existing data, the normal API is a better option.

Best regards.


On Thursday, 25 September 2014 at 16:13:03 UTC+2, Sotiris Beis wrote:

Sotiris Beis

Sep 26, 2014, 5:25:42 AM
to c3po.ac, spar...@googlegroups.com
Is there a way to check whether an object exists before I create it with the script import? My graph dataset is an edge list, e.g.

0   1
0   2
1   4
1   5
2   0
......

so I want something like the findOrCreateObject function that the Sparksee Java API already has.

Sotiris


c3po.ac

Sep 26, 2014, 6:11:33 AM
to spar...@googlegroups.com
Hi,

Unfortunately, the scripts can only be used to add completely new data.
To update existing data you can only use the API.
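
For an edge list like the one above, the same pattern can be written with the Java API. The sketch below is only an illustration, not code from this thread: the class, type and attribute identifiers are made up, and it assumes the ID attribute was created as Unique or Indexed so that findObject can use an index. Where the API already offers findOrCreateObject, that call does this in one step.

import com.sparsity.sparksee.gdb.*;

public class LoadHelper {
    // Return the node with the given external ID, creating it if missing.
    public static long findOrCreateNode(Graph graph, int nodeType, int idAttr, long id) {
        Value v = new Value();
        v.setLong(id);
        long oid = graph.findObject(idAttr, v); // Objects.InvalidOID when not found
        if (oid == Objects.InvalidOID) {
            oid = graph.newNode(nodeType);
            graph.setAttribute(oid, idAttr, v);
        }
        return oid;
    }
}

For each "tail head" line you would then call findOrCreateNode for both endpoints and create the edge between them with graph.newEdge.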

We will definitely take this into consideration for improving the scripts.

Best regards.

Laura Daian

Feb 3, 2015, 3:25:10 PM
to spar...@googlegroups.com
Hi,

Are there any scripts to load data from a .nt file, or is there only loading from CSV? I tried reading each line and importing it with the Java API. It works OK for a file of 10M lines, but importing 100M lines takes some hours to finish. Is there a faster way to do it?

Thanks,
Laura

c3po.ac

Feb 4, 2015, 4:44:47 AM
to spar...@googlegroups.com
Hi,

Unfortunately, the only loader provided right now is for importing from and exporting to CSV files.

Loading your data using the Java API, as you have mentioned, is the right approach. However, there are a few recommendations you can try to improve the performance:
  • Disable the recovery (the default is already disabled).
  • Disable the rollback (the default is enabled).
  • Set a proper cache size (the default may be too much).

And here is an example of use:


SparkseeConfig cfg = new SparkseeConfig();
cfg.setRecoveryEnabled(false);
cfg.setRollbackEnabled(false);
cfg.setCacheMaxSize(THE_SIZE_IN_MB);
Sparksee sparksee = new Sparksee(cfg);

As a good starting point for the cache size (setCacheMaxSize), we suggest giving it half the total amount of RAM you have available.

Also take into account that, even with the settings mentioned above, it is normal for the time required for new inserts to increase as the database grows.
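
Putting the recommendations together, a loading program could be structured roughly as follows. This is a sketch under assumptions, not code from the thread: the database path, file name and cache size are placeholders, and the per-line parsing is reduced to a comment.

import com.sparsity.sparksee.gdb.*;
import java.io.*;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        SparkseeConfig cfg = new SparkseeConfig();
        cfg.setRecoveryEnabled(false); // recovery off, as recommended above
        cfg.setRollbackEnabled(false); // rollback off, as recommended above
        cfg.setCacheMaxSize(4096);     // placeholder: about half the available RAM, in MB

        Sparksee sparksee = new Sparksee(cfg);
        Database db = sparksee.open("triples.gdb", false);
        Session sess = db.newSession();
        Graph graph = sess.getGraph();

        BufferedReader in = new BufferedReader(new FileReader("data.nt"));
        String line;
        while ((line = in.readLine()) != null) {
            // parse the line and insert its nodes/edges into graph here,
            // e.g. with a find-or-create helper as sketched earlier
        }
        in.close();

        sess.close();
        db.close();
        sparksee.close();
    }
}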

Best regards,


On Tuesday, 3 February 2015 at 21:25:10 UTC+1, Laura Daian wrote:

oj

Feb 10, 2015, 2:28:22 AM
to spar...@googlegroups.com
Hi, 

How many nodes/edges per second can we generally import using the ScriptParser?
Are there any other settings, apart from the cache (in sparksee.cfg), that we can configure?

I'm trying to load a dataset of about 200 million edges and 20 million nodes. 

Thanks!

c3po.ac

Feb 10, 2015, 9:33:25 AM
to spar...@googlegroups.com

Hi,

We can't give you a general number because insert performance depends on too many variables. One of the most important is the number, type and indexing of the attributes of the nodes/edges you are inserting. Other very important factors are the cache size and the disks used. In addition, the insert rate may slow down as the database grows.

We suggest that you try it with your own data first, considering the recommendations about cache size, recovery and rollback given earlier in this thread. If you find it too slow for your requirements, there is the "extent size" setting, which you could try for a significant performance boost. Take into account, however, that this setting in particular has its drawbacks: it will not allow you to have the recovery functionality activated once the database is loaded.

If you want to try it, set it to 64 with the "setExtentSize" SparkseeConfig API.
Or, if you are using the ScriptParser, you can add this line to the "sparksee.cfg" file:
----------------------------------------------
sparksee.storage.extentsize=64
----------------------------------------------
The method description in the reference documentation is wrong, but the argument description is OK.
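
For reference, the equivalent call on the configuration object from the earlier example:

SparkseeConfig cfg = new SparkseeConfig();
cfg.setExtentSize(64); // must then be used on every later opening of this database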

Changing the extent size as in the examples above means that the database must always be opened with this same setting.

Please consider it thoroughly, and use it only if you definitely reach an unacceptable loading rate, which should be a very rare case. Also, consider contacting us first for some assistance with your loading or configuration process.

Best regards

On Tuesday, 10 February 2015 at 8:28:22 UTC+1, oj wrote:

Harsh Thakkar

May 17, 2016, 5:16:15 AM
to Sparksee
Hi Laura,

Would you be able to share the .nt loading script with me? I am working on benchmarking different DBs across the domain. For this purpose we use (for now) BSBM, which is available in .nt, CSV, and other formats. Also, we could have a discussion if you are interested.

Regards,