SuperFastMatch status

Donovan Hide

Jun 30, 2011, 9:53:12 AM

to superfa...@googlegroups.com

Hi,

First, and once again, apologies for the radio silence. I've been deeply
submerged in hardcore hacking! To explain what I've been doing and
where I've got to, it's probably worth comparing the specification of
the existing Churnalism server with the new superfastmatch
requirements for an 8GB corpus:

Churnalism server:

64GB memory
256GB SSD drive
48GB resident memory cost
48GB kyoto cabinet on-disk size

Current superfastmatch:

8GB memory for doc ids + 2GB memory overhead
Any speed disk drive is suitable
6GB kyoto cabinet on-disk size (due to heavy compression of original
docs and hashes)

As you can see, the hardware requirements have fallen considerably,
which makes superfastmatch much more accessible to developers. This is
largely thanks to a data structure released as open source by Google
Employee #1 (Craig Silverstein):

http://google-sparsehash.googlecode.com/svn/trunk/doc/implementation.html

I've been using this, along with the kyoto cabinet library, to get
very low memory usage for large corpora. What this means is that the
whole index can be kept in memory and no disk seek will ever occur. This
makes it possible for the association step to occur at much higher
frequencies and to complete in much less time. This has a lot of uses,
especially in the case of rolling news. It also facilitates the
concept of alerts: if a new association appears for a document
compared to its previous set of associations, an email could be sent.
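
A minimal sketch of the alert idea, not the superfastmatch code itself.
It assumes associations can be represented as a set of document ids;
the names Associations and newAssociations are hypothetical. Anything
in the current set that was absent from the previous set is a
candidate for an email:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical type: the ids of the documents a given document is
// associated with. std::set keeps the ids sorted.
using Associations = std::set<std::uint32_t>;

// Returns the ids present in `current` but absent from `previous`,
// i.e. the associations that appeared since the last run.
std::vector<std::uint32_t> newAssociations(const Associations& previous,
                                           const Associations& current) {
  std::vector<std::uint32_t> added;
  std::set_difference(current.begin(), current.end(),
                      previous.begin(), previous.end(),
                      std::back_inserter(added));
  return added;
}
```

Calling newAssociations after each association run and sending mail
only when the result is non-empty would give the alert behaviour
described above.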

The whole server is now a single executable, with 5 library
dependencies. It is written in C++ and is multithreaded, so should be
able to deal with high incoming document load and high arbitrary
search load. The incoming documents are queued for either addition to
or deletion from the index. This means that they can be added
asynchronously and the order of submission will be respected. This is
important for news where an article gets edited later in the day. It
also relieves the document submitter of tracking changes, which is
useful when writing a stateless scraper.
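
The queueing behaviour can be sketched roughly as follows. This is an
illustration under stated assumptions, not the server's actual code:
Command, Index and CommandQueue are hypothetical names, the index is
reduced to a map from doc id to text, and a single indexing thread is
assumed to call drain():

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical command: a document is either added to or deleted from the index.
enum class Action { Add, Delete };

struct Command {
  Action action;
  std::uint32_t docId;
  std::string text;  // ignored for Delete
};

// Stand-in for the real index: doc id -> document text.
using Index = std::map<std::uint32_t, std::string>;

class CommandQueue {
 public:
  // Called by submitters, possibly from several threads; returns immediately.
  void submit(Command command) {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_.push_back(std::move(command));
  }

  // Called by the single indexing thread: applies the queued commands in
  // submission order and returns how many were applied.
  std::size_t drain(Index& index) {
    std::deque<Command> batch;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      batch.swap(pending_);
    }
    for (const Command& command : batch) {
      if (command.action == Action::Add) {
        index[command.docId] = command.text;  // a re-add replaces the old text
      } else {
        index.erase(command.docId);
      }
    }
    return batch.size();
  }

 private:
  std::mutex mutex_;
  std::deque<Command> pending_;
};
```

Because commands are applied first in, first out, a scraper can simply
resubmit an article whenever it sees it; the latest submission wins
without the scraper having to remember what it sent earlier.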

I've attached a screenshot which shows a histogram of the index's
collision rates. It needs more explanation, but basically the higher
the line, the more collisions and possibly churn per doc. The more
collisions, the better the compression rate. However, with too many
collisions, false positives are returned. Finding the best balance
for a particular size and nature of corpus is a very interesting
field of study.
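
The message does not say how the histogram is computed, so the
following is only a sketch of one way to look at the trade-off. It
assumes 32-bit hashes that are truncated to a configurable number of
bits before being used as index slots; occupancyHistogram is a
hypothetical helper. Fewer bits means more hashes sharing a slot
(more collisions), more bits means fewer:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

// Truncates each hash to its low `bits` bits (1 to 32) and counts how many
// hashes land in each slot. Returns a histogram mapping
// slot occupancy -> number of slots with that occupancy.
std::map<std::size_t, std::size_t> occupancyHistogram(
    const std::vector<std::uint32_t>& hashes, unsigned bits) {
  const std::uint32_t mask =
      bits >= 32 ? 0xFFFFFFFFu : ((std::uint32_t{1} << bits) - 1);

  // Slot -> number of hashes that were truncated to that slot.
  std::unordered_map<std::uint32_t, std::size_t> slots;
  for (std::uint32_t hash : hashes) {
    ++slots[hash & mask];
  }

  // Occupancy -> number of slots holding exactly that many hashes.
  std::map<std::size_t, std::size_t> histogram;
  for (const auto& slot : slots) {
    ++histogram[slot.second];
  }
  return histogram;
}
```

Running this over the same hashes with different bit widths shows how
the occupancy distribution shifts as the hash is narrowed, which is
the balance between compression and false positives discussed above.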

All sounds great, you're probably thinking. Next question: when can I
see it? Well, I've been leaving the visible interface to last, as is
the habit of most developers... I keep committing to dates and then
missing the deadline. This may be an annoying thing to do, but I think
it's best to just say it will be ready soon, but the wait will be
worth it. The current specification matches the use cases of both a
browser-extension-fed search engine and a back-end for associating
Congress bills.

The templates for the web interface will be overridable as well,
which will allow for effective white-labelling.

Hope this helps and sorry for the continual missed deadlines...

Cheers,
Donovan.

histogram.tiff

Tom Lee

Jun 30, 2011, 12:46:46 PM

to superfa...@googlegroups.com

This sounds nearly miraculous! There's certainly no need to apologize -- I'm thrilled to hear about the progress that's been made, and that it means the technology will be so much more accessible. Naturally we're anxious to try running the code ourselves (interface or no), but from the activity on github I take it that there's still a little ways to go before these changes get pushed.

Donovan Hide

Jun 30, 2011, 1:05:44 PM

to superfa...@googlegroups.com

Hi Tom,

you can track the current changes and read the source on this branch and path:

http://github.com/mediastandardstrust/superfastmatch/tree/mapreduce/server2/src

Mapreduce is a bit of a misnomer now! I am anxious to give you something
to use too...

Cheers,
Donny.

James Turk

Jun 30, 2011, 2:56:52 PM

to superfa...@googlegroups.com

Thanks Donny,

I'd basically echo Tom's comments: this all sounds really great and I
can't wait to get a chance to play with it. I'll keep an eye on the
branch and look forward to giving it a try.

-James
