|New rules added to FreebasePrefilter eliminates 62 GB of 'junk'||Paul Houle||8/14/13 11:40 AM|
From the Pig work yesterday, I now know why the Freebase dump suddenly got bigger in the spring.
Early versions of the RDF dump lacked the very valuable "notable types" information, which helps you display a sensible subtitle for a topic like :Alyssia_Milano far better than picking one of its types at random.
They stirred this in, but they did it in a way that was very expensive. Freebase does publish a link from each topic to its notable type (a real RDF type), from which one could look up a type label. But it also publishes a full-text label for the notable type of each and every common.topic, and when you add it up, that bulks up the expanded version of the Freebase dump by 62 GB. Compression hides this from you, but it doesn't hide it from the memory bus of your CPU.
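To make the idea concrete, here is a minimal sketch of that kind of rule in Python. It is not the FreebasePrefilter code itself. It assumes one triple per line with tab-separated subject, predicate and object, and the predicate URI in REJECTED_PREDICATES is a made-up placeholder, not the real Freebase predicate name.

```python
from typing import Iterable, Iterator, Set

# Hypothetical predicate URI, for illustration only. Substitute whichever
# predicate carries the redundant notable-type label in your copy of the dump.
REJECTED_PREDICATES: Set[str] = {"<http://example.org/ns/notable_type_label>"}


def is_rejected(line: str, rejected: Set[str] = REJECTED_PREDICATES) -> bool:
    """Return True if the line is a triple whose predicate is on the reject list."""
    fields = line.split("\t")
    return len(fields) >= 3 and fields[1] in rejected


def prefilter(lines: Iterable[str],
              rejected: Set[str] = REJECTED_PREDICATES) -> Iterator[str]:
    """Yield every line except the triples with a rejected predicate."""
    for line in lines:
        if not is_rejected(line, rejected):
            yield line


def bytes_dropped(lines: Iterable[str],
                  rejected: Set[str] = REJECTED_PREDICATES) -> int:
    """Count the UTF-8 bytes that the filter would remove."""
    return sum(len(line.encode("utf-8"))
               for line in lines if is_rejected(line, rejected))
```

The link to the notable type is kept, so the label can still be recovered by looking up the label of the type itself. Only the per-topic copy of the text goes away.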
I think it will be possible to load this into a triple store on a reasonable machine. I'm going to lay off the chopper while I focus on smoothing out operations, particularly automating the jobs. The prefilter and PSE3 will still exist as separate steps, but if I run them together, that is one less step that can go wrong.