SilkMR with files stored on HDFS?


george....@gmail.com

May 7, 2013, 3:27:01 AM
to silk-di...@googlegroups.com
Hello!

My name is George. I'm trying to build a link map from the music artists in DBpedia to the music artists in MusicBrainz. Silk seems like a great tool for that.

Unfortunately, the datasets are rather large (~5 million artists in MusicBrainz). I tried running the single-machine version on a maxed-out Amazon machine (64 GB of RAM), but I was getting out-of-memory errors even with -Xmx set to 60 GB.

My next step is to try the MR version. I've got a cluster spun up on Amazon, but I'm struggling to figure out how to make it work. I've got a full export of DBpedia and a full RDF-ization of MusicBrainz stored on HDFS, but I can't figure out how to make Silk read the files off of HDFS. When I try to specify data source addresses on HDFS, I get back an error message:

Exception in thread "main" java.util.NoSuchElementException: No plugin called 'file' found.
at de.fuberlin.wiwiss.silk.util.plugin.PluginFactory.apply(PluginFactory.scala:36)
at de.fuberlin.wiwiss.silk.datasource.Source$.fromXML(Source.scala:71)
...

I can't find any documentation about the plugin system, and I don't know Scala, so I can't really figure it out from the source code. I'm hoping I just need to tweak the syntax of my LSL file to make it go. The example LSL file for the MR version in the documentation uses SPARQL endpoints instead of RDF files, so I'm not sure how it's supposed to look.
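
For reference, here is roughly what the data source section of my LSL file looks like. The sparqlEndpoint form mirrors the endpoint-based examples in the documentation; the 'file' form is pieced together from the single-machine docs, so the parameter names and the hdfs:// path are my own guesses at what SilkMR might accept:

  <DataSources>
    <!-- SPARQL endpoint form, as in the documented MR example -->
    <DataSource id="dbpedia" type="sparqlEndpoint">
      <Param name="endpointURI" value="http://dbpedia.org/sparql" />
    </DataSource>
    <!-- File-based form I'm trying to use; this is the declaration that
         triggers the "No plugin called 'file' found" error under SilkMR -->
    <DataSource id="musicbrainz" type="file">
      <Param name="file" value="hdfs:///data/musicbrainz.nt" />
      <Param name="format" value="N-TRIPLE" />
    </DataSource>
  </DataSources>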


Unfortunately, there is no SPARQL endpoint for MusicBrainz, so I can't substitute in an endpoint without setting one up from scratch. Also, I realize this task requires quite a lot of comparisons and would probably place an unreasonable load on the public DBpedia endpoint.

I'd really appreciate any help! I'm considering doing a blog post on how to use SilkMR, so perhaps I could apply any lessons learned here to that.

Thanks!

-George

-------------------------------------------
T: @rogueleaderr
B: rogueleaderr.com
-------------------------------------------

Robert Isele

Jul 11, 2013, 1:25:32 PM
to silk-di...@googlegroups.com
Hi,

The problem with the current implementation of the 'file' plugin, which loads all data from an RDF dump, is that it holds all entities in memory and is thus not suitable for large files. At the moment, there are two options for using Silk with large files:
1) Loading the dump into an RDF store, such as Sesame, and using the SPARQL data source (see the sketch below).
2) Using the Linked Data Integration Framework on Hadoop [1]
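
As a rough illustration of option 1 (the exact endpoint URL depends on how your Sesame server is set up, so treat this only as a sketch), the data source would then point at the local SPARQL endpoint instead of the dump file:

  <!-- assumes a local Sesame HTTP server with the MusicBrainz dump
       loaded into a repository named "musicbrainz" -->
  <DataSource id="musicbrainz" type="sparqlEndpoint">
    <Param name="endpointURI"
           value="http://localhost:8080/openrdf-sesame/repositories/musicbrainz" />
  </DataSource>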

Cheers,
Robert

[1] http://ldif.wbsg.de/


