As promised, the (hopefully!) Paver-friendly release is available. You
should be able to follow these instructions and get the legislation
example running:
http://superfastmatch.org/install.html
and then:
http://superfastmatch.org/examples.html
The long URLs for the GitHub pages were causing me concern, so I
sorted out the new domain! Don't worry, I'll get some logos on the
title page tomorrow :)
Let me know how you get on. Some more thoughts here too:
http://superfastmatch.org/diary.html#week-4
Cheers,
Donny.
I've been working to get the example running and recently ran into a
few issues during the scrape step.
When I run ``./manage.py scrape``, it runs mostly fine through the
109th Congress; then, during the 110th, it slows down dramatically
(>1 min between log lines from the script).
I'm not familiar enough with Kyoto to know if that is the issue, but
here are the last few lines of logging output from a recent run (the
response time seems like a likely hint):
I'm running this on a desktop with 8GB of RAM and a fairly fast (but
not SSD) hard drive. Any tips on where I should look to debug this?
Oh also, I turned off the multithreading as it was taking over my
machine completely -- it runs into this issue either way.
-James
2011-05-19T11:21:06.023760-05:00: [INFO]: [SCRIPT]: Update Response Time: 9669.177 secs Hash Time: 0.864 secs Text Length: 195073 Document Id: 27044 Document Type: 10 Added: 171885 Deleted: 0 Verify Exists: 0
2011-05-19T11:21:06.023886-05:00: [INFO]: (127.0.0.1:40599): POST /rpc/play_script HTTP/1.1: 200
2011-05-19T11:21:06.023926-05:00: [INFO]: disconnecting: expr=127.0.0.1:40599
2011-05-19T11:31:30.842236-05:00: [INFO]: connected: expr=127.0.0.1:49025
2011-05-19T12:04:52.315752-05:00: [INFO]: connected: expr=127.0.0.1:41711
2011-05-19T12:38:15.635286-05:00: [INFO]: connected: expr=127.0.0.1:44111
2011-05-19T12:41:38.728008-05:00: [INFO]: [SCRIPT]: Update Response Time: 203.024 secs Hash Time: 0.007 secs Text Length: 2435 Document Id: 27066 Document Type: 10 Added: 2387 Deleted: 0 Verify Exists: 0
You've hit both a memory ceiling and a current limitation of
superfastmatch for bulk loading, one that the bulkload tool solved for
churnalism.com in a hacky way. Because you have 8GB of RAM, though,
you should be able to get round it by substituting

paver kyototycoon

with

ktserver -port 1978 -tout 2000 -onr -scr superfastmatch/scripts/search.lua index.kct#ktopts=p#bnum=16777216#msiz=3g#pccap=3g
where you are dedicating 3GB to the memory-mapped region and 3GB to
the page cache. Docs here:
http://fallabs.com/kyototycoon/command.html#ktserver
http://fallabs.com/kyotocabinet/spex.html#tips
Tuning Kyoto Cabinet is a bit of an art form, and there is an
additional caveat: both ext3/4 and HFS+ have journaling and
access-time updates enabled by default, and after a certain large
number of writes the file system starts to chug a lot!
We solved this on our server by formatting the SSD drive with
reiserfs, although tweaking the writeback and noatime settings in
fstab for ext3/4 would have been another option.
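For example, a hypothetical /etc/fstab entry along those lines (the device and mount point are placeholders for wherever the index lives):

# hypothetical entry: disable atime updates and relax journal ordering
/dev/sdb1  /data  ext4  defaults,noatime,data=writeback  0  2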
The better solution is to order all the keys before insertion and then
leverage the sequential-insert benefits of the B-tree KC file. This is
something that the bulk load tool does, in combination with
pre-processing of the document files, but an even better solution is to
use the built-in map-reduce functionality of KC, discussed near the
bottom of this page:
http://fallabs.com/mikio/tech-en/promenade.cgi?id=24
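To illustrate the sorted-insert idea, here's a minimal Python sketch (assuming the kyotocabinet Python bindings and a made-up postings dict; the actual bulkload tool works differently):

import kyotocabinet

# hypothetical pre-computed postings: hash key -> serialised (doctype, docid) list
postings = {
    '0009c3': '10:27066',
    '000172': '10:27044,10:27066',
}

db = kyotocabinet.DB()
# same file and tuning options as the ktserver invocation above
if not db.open('index.kct#bnum=16777216#msiz=3g#pccap=3g',
               kyotocabinet.DB.OWRITER | kyotocabinet.DB.OCREATE):
    raise IOError(str(db.error()))

# iterating the keys in sorted order turns random B-tree updates into
# near-sequential page writes, which is what keeps memory use flat
for key in sorted(postings):
    db.set(key, postings[key])

db.close()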
I've talked to Martin about doing this work to move some of the heavy
processing out of the Python realm and into the Kyoto Tycoon Lua/C++
realm and that's something we might do to try and make superfastmatch
a bit more friendly to lower-spec machines.
Are you using OS X or Ubuntu on your desktop? Let me know if the
ktserver tweak works for you; if not, do you have space for an extra
partition on your desktop drive?
Cheers,
Donny.
No worries, these things take time :)
I too have acquired a 16GB RAM machine to run the legislation example.
In reality, this is probably too high a memory requirement; it is a
product of the inefficient insertion of (docid, doctype) pairs into
each hash key. If the inserts were pre-sorted by hash, the insert time
would be a fraction of the current wait and the memory requirements
would evaporate. I am currently working on implementing this using the
built-in map-reduce functionality of Kyoto Cabinet to sort the hashes
and then do the inserts in sequential order:
http://fallabs.com/kyotocabinet/api/classkyotocabinet_1_1MapReduce.html
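As a rough sketch of what that map-reduce pass amounts to (pure Python standing in for the Kyoto Cabinet API, with made-up document data):

from itertools import groupby

# made-up (hash, doctype, docid) records, as the map phase might emit them
emitted = [
    (0x9c3, 10, 27066),
    (0x172, 10, 27044),
    (0x9c3, 10, 27044),
]

# Kyoto Cabinet's MapReduce sorts the emitted pairs by key, so the reduce
# phase sees each hash's postings grouped together
emitted.sort(key=lambda rec: rec[0])
for hash_key, group in groupby(emitted, key=lambda rec: rec[0]):
    merged = ','.join('%d:%d' % (doctype, docid) for _, doctype, docid in group)
    # one near-sequential write per key instead of many random updates
    print('%x -> %s' % (hash_key, merged))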
The association task in the current superfastmatch is different from
the churnalism.com one in that it stores every repeated fragment of
text individually, rather than in a per-document pickled JSON format.
This is more useful for analysis, but the current implementation keeps
duplicates as separate records, which is grossly inefficient. The
bottleneck is the Postgres inserts of these fragments. I'm proposing
to shift the association and fragment parts out of Postgres into Kyoto
Cabinet, to improve insert times and decouple the indexing logic from
Django and Python. This means that the algorithm for finding the
longest common substrings can run in C++ with Lua bindings. So far,
I've succeeded in reducing the time to compare the Bible with the
Koran from 20 seconds to 2 seconds, and I believe I can get it lower.
It should also make it easier to convert the fragment index to a hash
table for much greater efficiency.
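For reference, the textbook dynamic-programming form of longest common substring is sketched below in Python; the real implementation runs in C++ against the hash index, so this only pins down the problem being solved, not the method:

def longest_common_substring(a, b):
    """Classic O(len(a) * len(b)) dynamic programming; illustrative only."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

print(longest_common_substring('the quick brown fox', 'a quick brown dog'))  # ' quick brown '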
There was a bug with the associate and scrape tasks where they were
keeping the SQL logs in memory due to DEBUG=True in settings.py. The
latest commits should rectify this.
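(For the curious: with DEBUG enabled, Django appends every executed query to django.db.connection.queries, which grows without bound in a long-running task, so the usual fix is simply:)

# settings.py
DEBUG = False  # stops Django recording every SQL query in connection.queries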
If you re-run, it's worth doing the following to clear the existing index:
cd superfastmatch
rm index.kct
source .env/examples_env/bin/activate
cd examples/legislation
./manage.py dbshell < sql/reset.sql
If you do this and re-run, the scraper and associate tasks should
finish, albeit in an inefficient manner. I'm currently working on
improving these bottlenecks, but I probably need a further two weeks
to complete them. The basic idea is to transplant all the indexing
logic (the Content, Fragment and Association models in models.py) into
Lua running on Kyoto Tycoon, and to put the algorithms (longest common
substrings, and merging new doctypes and docids with existing ones)
into C++ with Lua bindings, making use of map-reduce to diminish the
current high memory requirements and make the tasks more scalable. I
strongly believe that superfastmatch will deserve its name once these
changes are made.
So, to conclude, I'd say that the current legislation example is more
of a taste of the results than the final implementation. The greater
variation in document length in Congress bills highlighted some
untested bottlenecks, which are a fun challenge to solve!
Hope that helps,
Donny.
Just wanted to check in on how the updated version was coming.
Managed to get some sample content running, but in the next week or
two I'm hoping to work on connecting some of our state legislation to
a superfastmatch backend, and if it's close I figure it'd be best to
wait until some of the decoupling you mentioned is in place.
Thanks,
James
Forming a habit of going quiet for a week or two :)
Have been working hard on getting superfastmatch to index, associate
and search as fast as possible using only the capabilities of the host
machine. It's been a learning experience in terms of C++ profiling and
why the stack is faster than the heap!
The other part of this improvement is to make the software RESTful,
which should make it a lot easier to integrate with any frontend (such
as browser extensions or straight web pages) and any existing
scrapers. Have attached the spec that I have been working to. You
could code against this spec if you want, but I hope to have something
fully demonstrable for Martin's visit to Boston next week which might
be easier to try out.
Out of interest, what platform are you developing on and what platform
is the server you hope to deploy to? Will help me test my Makefile!
Cheers,
Donny.
Both my development machine and our server run the latest Ubuntu.
-James