Problem with big data source!


Reihaneh Amini

Jan 3, 2017, 4:59:33 PM
to Silk
Hi All,

I have an N-Triples file that is 27 GB. I cannot upload it in the Workbench!
Does anyone have a suggestion?
This is just one of my sources, and my target source is as large as this file.



Jindřich Mynarz

Jan 3, 2017, 5:12:49 PM
to silk-di...@googlegroups.com
Hi,

you can load your data into an RDF store and access it from Silk via a SPARQL endpoint. Among the open-source offerings on the RDF store market, Virtuoso (https://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtBulkRDFLoader) is one example capable of handling data of this size.
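Once the data is in an RDF store, Silk reads it through the store's SPARQL endpoint. A minimal sketch of querying such an endpoint over HTTP with only the standard library; the endpoint URL below is the Virtuoso default and an assumption, so adjust it to your installation:

```python
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:8890/sparql"  # assumed Virtuoso default; adjust as needed

def sparql_url(query, endpoint=ENDPOINT):
    """Build the GET URL for a SPARQL query, asking for JSON results."""
    params = urllib.parse.urlencode({
        "query": query,
        "format": "application/sparql-results+json",
    })
    return f"{endpoint}?{params}"

def run(query, endpoint=ENDPOINT):
    """Execute the query (requires a running SPARQL endpoint)."""
    with urllib.request.urlopen(sparql_url(query, endpoint)) as resp:
        return resp.read().decode("utf-8")
```

For example, `run("SELECT * WHERE { ?s ?p ?o } LIMIT 10")` returns the first ten triples as JSON once the store is up.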

You can also reduce your dataset to only the resources that you link (perhaps instances of a given class) and the properties that you use in your linkage rule. Depending on your dataset, this may lead to a significant decrease in size.
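Because N-Triples is line-oriented (one complete triple per line), this reduction can be done with a simple streaming pass. A sketch that keeps only triples whose predicate appears in the linkage rule; the predicate URIs in the whitelist are examples, not your actual rule:

```python
# Example predicates to keep; replace with the properties your linkage rule uses.
KEEP = {
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>",
    "<http://www.w3.org/2000/01/rdf-schema#label>",
}

def filter_nt(src, dst):
    """Copy only lines whose predicate (second whitespace-delimited token,
    always a space-free IRI in N-Triples) is in KEEP."""
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.split(None, 2)  # subject, predicate, rest
            if len(parts) == 3 and parts[1] in KEEP:
                fout.write(line)
```

The same pattern extends to keeping only instances of a given class: first collect the subjects of matching `rdf:type` triples in one pass, then keep their triples in a second pass.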

As a last resort, you can split your source file into smaller chunks and load each individually.
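Splitting is safe for N-Triples precisely because every line is a self-contained triple, so a file can be cut at any line boundary. A minimal sketch, with the chunk size as an assumed placeholder to tune against what the Workbench can load:

```python
CHUNK_TRIPLES = 1_000_000  # assumed; adjust to what your Workbench instance handles

def split_nt(src, prefix, chunk_size=CHUNK_TRIPLES):
    """Split src into prefix-000.nt, prefix-001.nt, ... with at most
    chunk_size lines each. Returns the list of chunk paths."""
    out, n, part, paths = None, 0, 0, []
    with open(src, encoding="utf-8") as fin:
        for line in fin:
            if out is None or n >= chunk_size:
                if out:
                    out.close()
                path = f"{prefix}-{part:03d}.nt"
                out = open(path, "w", encoding="utf-8")
                paths.append(path)
                part += 1
                n = 0
            out.write(line)
            n += 1
    if out:
        out.close()
    return paths
```

Note that this would not work for formats with multi-line statements or shared prefixes, such as Turtle.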

Best,

Jindřich

-- 
Jindřich Mynarz

--
You received this message because you are subscribed to the Google Groups "Silk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to silk-discussion+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reihaneh Amini

Jan 3, 2017, 9:53:20 PM
to Silk
Thanks for the response!
Do you know the maximum size of data that I can upload there?
Is it fine if I use the .nt format?

_Reihan

Robert Isele

Jan 4, 2017, 4:42:02 AM
to silk-discussion
Dear Reihan,

the RDF dataset in Silk currently loads all data into an in-memory Jena Model and is thus limited by the available memory. I just improved the plugin description to make this clear.
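A rough way to check whether a file will fit is to count its triples and multiply by an assumed per-triple in-memory cost. The 200 bytes/triple figure below is purely illustrative, not a measured Jena number; real overhead varies with term sharing and JVM settings:

```python
BYTES_PER_TRIPLE = 200  # assumed, illustrative average; not a measured Jena figure

def estimate_heap_gb(nt_path):
    """Stream an N-Triples file, count triples (one per non-empty,
    non-comment line), and return an order-of-magnitude heap estimate in GB."""
    triples = 0
    with open(nt_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                triples += 1
    return triples * BYTES_PER_TRIPLE / 1e9
```

If the estimate exceeds the heap you can give the JVM, loading via an RDF store is the safer route.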

As Jindřich wrote, the preferred way of handling large RDF datasets is to load them into an RDF store, such as Virtuoso.

For large datasets, there is also commercial support for processing them on a Spark Cluster. Please contact me in case you are interested in that.

Kind regards,
Robert


