Requesting clarification on 4Store internal structure with regards to compression


Fredah B

May 5, 2015, 6:46:52 AM
to 4store-...@googlegroups.com

Dear Team,

My name is Fredah. I am studying at Oxford and plan to use your SPARQL engine for my project implementation. I'm impressed by the tremendous work you have put in to make this engine a success; however, I noticed that the underlying infrastructure and the compression technique used are encapsulated. I need to fully understand how the data is processed from start to finish, especially with regard to compression. Are there any papers that cover the compression and decompression used in your engine, or could you refer me to someone who may be able to explain it to me?

Also, is compression on by default, or is it turned on and off depending on the data load of the system? I was also wondering how you store the data internally: what format is it stored in? Is it an internally created representation or one of the standard RDF representations?

I would really appreciate your assistance in answering these questions and look forward to hearing from you soon.

Best Regards,

Fredah

swh

May 5, 2015, 7:20:08 AM
to 4store-...@googlegroups.com
Wow, big question!

There's a paper that covers the broad architecture: http://4store.org/publications/harris-ssws09.pdf (it's quite old now, but still basically true).

In terms of resource compression and data storage, the code is all built around the rhash structure: https://github.com/garlik/4store/blob/master/src/backend/rhash.c

The inline data structure is fs_rhash_entry, and there are a number of specific compression techniques, e.g. one for fixed-bit date symbol coding and a few for different numeric types. It's extensible to allow different compression schemes. It would probably have made sense to add an English-language Huffman coding too, but the inline space is quite limited.
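To give a feel for what fixed-bit date coding means, here is a rough sketch in Python. It is not the 4store code: the field widths, the field order, and the names FIELDS, pack_datetime and unpack_datetime are all assumptions made up for illustration. The idea is only that each date component gets a fixed number of bits, so the whole value fits in a small inline slot instead of a 19-character string.

```python
from datetime import datetime

# Hypothetical field widths in bits (40 bits total); NOT 4store's real layout.
FIELDS = (
    ("year", 14),
    ("month", 4),
    ("day", 5),
    ("hour", 5),
    ("minute", 6),
    ("second", 6),
)


def pack_datetime(dt):
    """Pack a datetime into a single integer using fixed-width bit fields."""
    value = 0
    for name, width in FIELDS:
        field = getattr(dt, name)
        if field >= (1 << width):
            raise ValueError(f"{name}={field} does not fit in {width} bits")
        value = (value << width) | field
    return value


def unpack_datetime(value):
    """Reverse pack_datetime by peeling fields off from the low bits."""
    parts = {}
    for name, width in reversed(FIELDS):
        parts[name] = value & ((1 << width) - 1)
        value >>= width
    return datetime(**parts)


dt = datetime(2015, 5, 5, 6, 46, 52)
packed = pack_datetime(dt)
inline_bytes = packed.to_bytes(5, "big")  # 40 bits fit in 5 bytes
restored = unpack_datetime(packed)
```

In this sketch the lexical form "2015-05-05T06:46:52" takes 19 bytes, while the packed form takes 5, which is the kind of saving that lets a value stay inline.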

There's also some common-prefix compression, e.g. you get a lot of URIs like http://somedomain.example/ontologies/myontology#Class, which get compressed down to <symbol23>Class, using a prefix trie. Sometimes the prefix-compressed URI will fit inline; sometimes it overflows. The code is at https://github.com/garlik/4store/blob/master/src/backend/prefix-trie.c
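The prefix idea can be sketched roughly as follows. This is a minimal illustration in Python, not a port of prefix-trie.c: the classes TrieNode and PrefixTrie, their methods, and the way codes are assigned are assumptions for the example. It registers known prefixes in a trie, finds the longest registered prefix of a URI, and replaces it with a small integer code.

```python
class TrieNode:
    """One trie node; code is set only where a registered prefix ends."""
    __slots__ = ("children", "code")

    def __init__(self):
        self.children = {}
        self.code = None


class PrefixTrie:
    """Hypothetical prefix table: maps common URI prefixes to small codes."""

    def __init__(self):
        self.root = TrieNode()
        self.prefixes = []  # code -> prefix string

    def add(self, prefix):
        node = self.root
        for ch in prefix:
            node = node.children.setdefault(ch, TrieNode())
        if node.code is None:
            node.code = len(self.prefixes)
            self.prefixes.append(prefix)
        return node.code

    def compress(self, uri):
        """Return (code, suffix) for the longest known prefix, or (None, uri)."""
        node, best = self.root, None
        for i, ch in enumerate(uri):
            node = node.children.get(ch)
            if node is None:
                break
            if node.code is not None:
                best = (node.code, i + 1)
        if best is None:
            return None, uri
        code, length = best
        return code, uri[length:]

    def decompress(self, code, suffix):
        return suffix if code is None else self.prefixes[code] + suffix


trie = PrefixTrie()
code = trie.add("http://somedomain.example/ontologies/myontology#")
uri = "http://somedomain.example/ontologies/myontology#Class"
compressed = trie.compress(uri)
```

Here the compressed form is a code plus the suffix "Class", which is short enough that it has a chance of fitting in a small inline slot; a URI with no registered prefix comes back unchanged.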

Pure text data, or URIs with uncommon prefixes, overflow into an external file (the "lex" file). Things that overflow and are also over a certain length (currently 100 bytes, more-or-less arbitrarily picked by trial and error IIRC) are compressed using deflate before being written to the lex file: https://github.com/garlik/4store/blob/master/src/backend/rhash.c#L422 The compressed form is used if it saves bytes overall.
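The overflow rule above amounts to something like the following sketch. It uses Python's zlib module (which wraps deflate in a zlib header) rather than whatever rhash.c actually calls, and the names MIN_COMPRESS_LEN, encode_for_lex and decode_from_lex, as well as the flag-plus-payload return shape, are assumptions for illustration only.

```python
import zlib

# Threshold from the description above; the exact comparison is an assumption.
MIN_COMPRESS_LEN = 100


def encode_for_lex(text):
    """Return (is_compressed, payload) for a string headed to the overflow file."""
    raw = text.encode("utf-8")
    if len(raw) > MIN_COMPRESS_LEN:
        packed = zlib.compress(raw)
        # Keep the compressed form only when it actually saves bytes.
        if len(packed) < len(raw):
            return True, packed
    return False, raw


def decode_from_lex(is_compressed, payload):
    """Reverse encode_for_lex."""
    data = zlib.decompress(payload) if is_compressed else payload
    return data.decode("utf-8")


short_flag, short_payload = encode_for_lex("short literal")
long_text = "the quick brown fox jumps over the lazy dog " * 10
long_flag, long_payload = encode_for_lex(long_text)
```

Short strings skip compression entirely, since the deflate overhead would usually outweigh any saving, and long strings fall back to the raw bytes whenever compression does not help.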

Hope that helps.

- Steve