t20130917: sieve3 horizontal subdivision slashes the size of Freebase
Paul Houle | 9/17/13 1:37 PM
There is a first draft of 'sieve3', which splits an RDF data set into mutually exclusive parts. A list of rules is applied to each triple: matching a rule diverts the triple to a particular output, and triples that match no rule fall through to a default output.
The horizontal subdivision breaks the data into the following segments:
`a` -- rdf:type
`key` -- keys represented as expanded strings
`keyNs` -- keys represented in the key namespace
`label` -- rdfs:label
`name` -- type.object.name entries that are probably duplicative of rdfs:label
`text` -- additional large text blobs
`web` -- links to external web sites
`links` -- all other triples where ?o is a URI
`other` -- all other triples where ?o is not a Literal
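The rule-driven split described above can be sketched roughly like this. This is a hypothetical illustration, not the actual sieve3 code: the class names and the reduced rule set are my own, and the point is only the mechanism (rules tried in order, first match claims the triple, unmatched triples fall through to a default segment).

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of sieve3-style routing: an ordered rule list where
// the first matching rule claims the triple, and triples that match no
// rule fall through to a default segment.
public class TripleRouter {

    // A triple as three N-Triples terms: URIs in <>, literals in quotes.
    public record Triple(String s, String p, String o) {}

    public record Rule(String segment, Predicate<Triple> test) {}

    static final String RDF_TYPE =
        "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>";
    static final String RDFS_LABEL =
        "<http://www.w3.org/2000/01/rdf-schema#label>";

    // A few of the segments from the post, simplified. Ordering matters:
    // rdf:type triples also have a URI object, but the `a` rule fires
    // before `links` ever sees them.
    static final List<Rule> RULES = List.of(
        new Rule("a",     t -> t.p().equals(RDF_TYPE)),
        new Rule("label", t -> t.p().equals(RDFS_LABEL)),
        new Rule("links", t -> t.o().startsWith("<"))  // ?o is a URI
    );

    static final String DEFAULT_SEGMENT = "other";

    public static String route(Triple t) {
        for (Rule r : RULES) {
            if (r.test().test(t)) {
                return r.segment();
            }
        }
        return DEFAULT_SEGMENT;
    }

    public static void main(String[] args) {
        Triple typeTriple = new Triple(
            "<http://rdf.freebase.com/ns/m.02mjmr>", RDF_TYPE,
            "<http://rdf.freebase.com/ns/people.person>");
        Triple literalTriple = new Triple(
            "<http://rdf.freebase.com/ns/m.02mjmr>",
            "<http://rdf.freebase.com/ns/common.topic.description>",
            "\"a text blob\"");
        System.out.println(route(typeTriple));    // a
        System.out.println(route(literalTriple)); // other
    }
}
```

In a Hadoop job the returned segment name would presumably select which output file the triple is written to, which is what makes the parts mutually exclusive.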
Overall this segmentation isn't all that different from how DBpedia is broken down.
Last night I downloaded 4.5 GB of data from `links` and `other` out of the 20 GB dump supplied by Freebase, and I expect to be able to write interesting SPARQL queries against it. The process is fast, completing in about half an hour with a smallAwsCluster. I think all of these data sets could be of interest to people working with triple stores and with Hadoop, since the physical separation can speed up most operations considerably.
The future plan for firming up sieve3 is to get Spring configuration working inside Hadoop (I probably won't put Spring in charge of Hadoop at first) so that it will be easy to create new rule sets, either by writing Java or XML.
This data can be downloaded from the requester-pays bucket