Hi all,
It was good to hear of the positive reception at WolframDataSummit! I've been busy ironing out the bugs that Tom identified with large documents. It was a great use case for testing the limits of the system: it raised a few questions about where the bottlenecks are and how to lazily evaluate some computationally expensive operations, and it also exposed the error on the document list and document view pages where far too much data was being loaded into memory.
To solve the excessive computation problem I've implemented a state variable, used when loading a Document, that describes which fields (hashes, bloom filter, text, cleaned text and meta) are actually needed for the particular operation. I've also revisited the hashing function (after speaking to Austin Appleby of MurmurHash fame) and found that a rolling hash is much faster, with the added benefit of constant cost per position whatever the window size. Out of interest, here it is (37 is a very magical number!):
#include <cstdint>
#include <string>
#include <vector>
using std::string;
using std::vector;

// Rolling (Rabin-Karp) hash: returns one 32-bit hash per window position in the text.
inline vector<uint32_t> RabinKarp(const string& text, const size_t window_size, const uint64_t base = 37) {
    vector<uint32_t> hashes;
    if (text.size() < window_size) return hashes;
    const size_t limit = text.size() - window_size + 1;
    const unsigned char* front = reinterpret_cast<const unsigned char*>(text.data());
    const unsigned char* end = front + window_size;
    hashes.resize(limit);
    // Hash of the first window, plus base^(window_size-1) for rolling the leading character out.
    uint64_t highBase = 1;
    uint64_t hash = 0;
    for (size_t i = 0; i < window_size; i++) {
        highBase = (i == 0) ? 1 : highBase * base;
        hash += front[window_size - i - 1] * highBase;
    }
    hashes[0] = static_cast<uint32_t>(hash);
    // Slide the window one character at a time: drop the leading character, bring in the next.
    // Truncating to 32 bits is deliberate; the recurrence stays consistent modulo 2^32.
    for (vector<uint32_t>::iterator it = hashes.begin() + 1, ite = hashes.end(); it != ite; ++it) {
        hash = *it = static_cast<uint32_t>((hash - (*front++) * highBase) * base + (*end++));
    }
    return hashes;
}
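For reference, here's roughly how it gets called (the string and window size are just for illustration, and this assumes the function above is in the same file); hashes[i] is the hash of the window starting at character i:

#include <cassert>
#include <iostream>

int main() {
    const std::string text = "the quick brown fox jumps over the lazy dog";
    const size_t window = 20;
    std::vector<uint32_t> hashes = RabinKarp(text, window);
    // One hash per window position: hashes[i] covers text[i .. i + window - 1].
    assert(hashes.size() == text.size() - window + 1);
    std::cout << "first window hash: " << hashes[0] << "\n";
    return 0;
}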
This can hash the Bible in a constant 11ms regardless of window size, which is 5x faster than MurmurHash at window size 20 and 17x faster at window size 100. I've also been looking at the text clean-up algorithm, which lower-cases everything and transforms punctuation to whitespace, and I'm getting some significant speed increases there too.
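For context, the clean-up pass is essentially a single sweep along these lines (a rough sketch rather than the exact code in the repo):

#include <cctype>
#include <string>

// Lower-case every character and turn punctuation into whitespace, in one pass.
inline std::string CleanText(const std::string& text) {
    std::string cleaned;
    cleaned.reserve(text.size());
    for (std::string::const_iterator it = text.begin(); it != text.end(); ++it) {
        const unsigned char c = static_cast<unsigned char>(*it);
        cleaned += std::ispunct(c) ? ' ' : static_cast<char>(std::tolower(c));
    }
    return cleaned;
}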
Why am I doing this? Good question! Profiling has shown that hashing and clean-up are the two limiting factors in the association task. One option is to cache the documents so the same computation never happens twice; another is to make the computation itself as quick as possible. It's an open question which is faster, but caching can make debugging harder, and there's nothing to stop us adding it later as well :)
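Going back to the field-selection state variable mentioned at the top, the idea is roughly a bitmask that each operation passes to the document loader so only the fields it actually needs are pulled into memory (names here are illustrative, not the actual code):

#include <cstdint>

// Hypothetical flags for the document fields an operation needs.
enum DocumentFields : uint32_t {
    FIELD_HASHES       = 1u << 0,
    FIELD_BLOOM_FILTER = 1u << 1,
    FIELD_TEXT         = 1u << 2,
    FIELD_CLEANED_TEXT = 1u << 3,
    FIELD_META         = 1u << 4
};

// e.g. the association task would only ask for hashes and the bloom filter:
// Document doc = LoadDocument(doctype, docid, FIELD_HASHES | FIELD_BLOOM_FILTER);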
The other item that came up with the document page is how to get a slice of the documents in a specified order quickly and with minimal memory use. The current REST specification allows one document type and, optionally, one document id. That's limiting for interesting use cases such as associating two newspapers with all press releases, or searching against all congress bills from between 2006 and 2010. The same problem arises when serialising an association source and target into the queue. To solve these problems I've come up with this spec:
GET  http://server/document/1:10;32/
     List documents with doctypes in the set (1,2,3,4,5,6,7,8,9,10,32)
GET  http://server/document/1:10;32/1:1000/
     List documents with doctypes in the set (1,2,3,4,5,6,7,8,9,10,32) and docids in the set (1,2,...,999,1000)
GET  http://server/document/?order_by=-year
     List all documents by year in descending order
POST http://server/association/1:10/11:20/
     Associate documents with doctypes (1,2,3,4,5,6,7,8,9,10) with documents with doctypes (11,12,13,14,15,16,17,18,19,20)
......
......
It's basically a simple range specification language which, combined with careful selection of doctypes when adding documents, should make for a very powerful tool. It also helps massively with unifying how the different parts speak to each other during the association task: the meta table index can be scanned very quickly for the correct set of documents from this form, and the form itself is a very lightweight thing to pass between the template renderers, the queue, the postings initialiser and the association task.
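To make the range syntax concrete, here's a minimal sketch of the kind of parser I have in mind for a doctype/docid spec like "1:10;32" (the real version will live wherever the routing happens):

#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Parse a range spec such as "1:10;32" into the explicit set {1,...,10,32}.
// Semicolons separate terms; "a:b" is an inclusive range and a bare number is a single id.
inline std::vector<uint32_t> ParseRangeSpec(const std::string& spec) {
    std::vector<uint32_t> ids;
    std::istringstream terms(spec);
    std::string term;
    while (std::getline(terms, term, ';')) {
        const std::string::size_type colon = term.find(':');
        if (colon == std::string::npos) {
            ids.push_back(static_cast<uint32_t>(std::stoul(term)));
        } else {
            const uint32_t lo = static_cast<uint32_t>(std::stoul(term.substr(0, colon)));
            const uint32_t hi = static_cast<uint32_t>(std::stoul(term.substr(colon + 1)));
            for (uint32_t id = lo; id <= hi; ++id) ids.push_back(id);
        }
    }
    return ids;
}

In practice the spec string itself is what gets passed around (e.g. onto the queue), and it only gets expanded into ids at the point where the meta table index is scanned.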
The consequence of all this development is that master on GitHub is stuck with the bugs Tom found until I get this piece merged, which I estimate will be done by Monday (next week). It's a question for everyone whether we're happy to announce the current version, given its bugs, or wait for this release.
Cheers,
Donny.