Hello!
My name is George. I'm trying to build a link map from the music artists in Dbpedia to the music artists in Musicbrainz. Silk seems like a great tool for that.
Unfortunately, the datasets are rather large (~5 million artists in Musicbrainz). I tried running the single-machine version on a maxed-out Amazon machine (64 GB of RAM), but I was getting out-of-memory errors even with -Xmx set to 60 GB.
My next step is to try the MR version. I've got a cluster spun up on Amazon, but I'm struggling to figure out how to make it work. I've got a full export of Dbpedia and a full RDF-ization of Musicbrainz stored on HDFS, but I can't figure out how to make Silk read the files off of HDFS. When I try to specify data source addresses on HDFS, I get back an error message:
Exception in thread "main" java.util.NoSuchElementException: No plugin called 'file' found.
at de.fuberlin.wiwiss.silk.util.plugin.PluginFactory.apply(PluginFactory.scala:36)
at de.fuberlin.wiwiss.silk.datasource.Source$.fromXML(Source.scala:71)
...
I can't find any documentation about the plugin system, and I don't know Scala so I can't really figure it out from the source code. I'm hoping I just need to tweak the syntax of my LSL file to make it go. The example LSL file for the MR example in the documentation uses SPARQL endpoints instead of RDF files so I'm not sure how it's supposed to look.
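For reference, here's roughly what my data source definition looks like — the paths are shortened, and `type="file"` is just my guess at the plugin name (it's the value that produces the error above):

```xml
<DataSources>
  <!-- "file" is a guess at the plugin name; this is what triggers
       the "No plugin called 'file' found" exception -->
  <DataSource id="musicbrainz" type="file">
    <!-- HDFS path abbreviated; the actual file is an N-Triples dump -->
    <Param name="file" value="hdfs://..." />
    <Param name="format" value="N-TRIPLE" />
  </DataSource>
</DataSources>
```

If the plugin is actually called something else for the MR version, or HDFS paths need different handling entirely, that would explain the error.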
Unfortunately, there is no SPARQL endpoint for Musicbrainz, so I can't substitute in an endpoint without setting one up from scratch. Also, I realize this task requires quite a lot of comparisons and would probably place an unreasonable load on the public Dbpedia endpoint.
I'd really appreciate any help! I'm considering writing a blog post on how to use SilkMR, so any lessons learned here could go into that.
Thanks!
-George
-------------------------------------------
T: @rogueleaderr
B:
rogueleaderr.com
-------------------------------------------