Just a few notes to share from a rainy afternoon's experiments with the software. I spun up an EC2 instance with 15.7GB of RAM in order to try a congressional record matching task that had proved to be too ambitious for my laptop. I used the stock Amazon Linux 64 bit AMI.
First, some notes on necessary dependencies (the Amazon AMI is very bare-bones). These commands got me to a point where the bootstrap script could complete -- I might have left something out, having reconstructed this from my command history, but this is at least most of it:
sudo yum update
sudo yum install git
sudo yum install svn
sudo yum install gcc
sudo yum install gcc-c++
sudo yum install zlib-devel
wget http://download.savannah.gnu.org/releases/libunwind/libunwind-0.99.tar.gz
tar xzf libunwind-0.99.tar.gz
cd libunwind-0.99
./configure && make && sudo make install
cd ..
rm -rf libunwind-0.99
However, I still ran into a bit of trouble with the dynamic libraries. For some reason I couldn't cajole ldconfig into processing /usr/local/lib, so ultimately I just copied the necessary libraries into /usr/lib (I would be more chagrined if this machine was going to exist beyond tonight).
I loaded the files from my local machine over SSH without incident -- ~130 files, 2.7G total. When I tried to view the result in my browser (over the tunneled connection), I got an error. Here's the console output from the server:
Last login: Mon Sep 5 19:57:57 2011 from xxxx.comcast.net
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/system-release-2011.02 for latest release notes. :-)
[ec2-user@ip-10-82-33-207 ~]$ cd sfm/superfastmatch/
[ec2-user@ip-10-82-33-207 superfastmatch]$ ./superfastmatch -reset -debug
2011-09-05T19:58:22.062895Z: [SYSTEM]: ================ [START]: pid=26216
2011-09-05T19:58:22.063061Z: [SYSTEM]: starting the server: expr=127.0.0.1:8080
2011-09-05T19:58:22.063152Z: [SYSTEM]: server socket opened: expr=127.0.0.1:8080 timeout=1.0
2011-09-05T19:58:22.063178Z: [SYSTEM]: listening server socket started: fd=9
2011-09-05T19:58:35.264731Z: [INFO]: connected: expr=127.0.0.1:36980
2011-09-05T19:58:37.157300Z: [INFO]: Queued document: (1,1) for indexing queue id:1 Response Time: 0.0303 secs
2011-09-05T19:58:37.157350Z: [INFO]: (127.0.0.1:36980): POST /document/1/1/ HTTP/1.1: 202
2011-09-05T19:58:37.176983Z: [INFO]: disconnecting: expr=127.0.0.1:36980
2011-09-05T19:58:38.110923Z: [INFO]: connected: expr=127.0.0.1:36981
2011-09-05T19:58:38.617096Z: [INFO]: Finished processing command queue
2011-09-05T19:58:47.070580Z: [INFO]: Queued document: (1,2) for indexing queue id:2 Response Time: 0.1344 secs
2011-09-05T19:58:47.070640Z: [INFO]: (127.0.0.1:36981): POST /document/1/2/ HTTP/1.1: 202
2011-09-05T19:58:47.110924Z: [INFO]: disconnecting: expr=127.0.0.1:36981
2011-09-05T19:58:48.718894Z: [INFO]: connected: expr=127.0.0.1:36982
2011-09-05T19:58:49.912017Z: [INFO]: Finished processing command queue
2011-09-05T19:59:09.690416Z: [INFO]: Queued document: (1,3) for indexing queue id:3 Response Time: 0.2751 secs
2011-09-05T19:59:09.690474Z: [INFO]: (127.0.0.1:36982): POST /document/1/3/
[ ...and so on ]
2011-09-05T20:57:52.335592Z: [INFO]: Queued document: (1,132) for indexing queue id:128 Response Time: 0.4621 secs
2011-09-05T20:57:52.335653Z: [INFO]: (127.0.0.1:48230): POST /document/1/132/ HTTP/1.1: 202
2011-09-05T20:57:52.453131Z: [INFO]: disconnecting: expr=127.0.0.1:48230
2011-09-05T20:57:54.302223Z: [INFO]: connected: expr=127.0.0.1:52435
2011-09-05T20:58:00.700804Z: [INFO]: Finished processing command queue
2011-09-05T20:58:18.218228Z: [INFO]: Queued document: (1,133) for indexing queue id:129 Response Time: 0.1626 secs
2011-09-05T20:58:18.218291Z: [INFO]: (127.0.0.1:52435): POST /document/1/133/ HTTP/1.1: 202
2011-09-05T20:58:18.235074Z: [INFO]: disconnecting: expr=127.0.0.1:52435
2011-09-05T20:58:19.750966Z: [INFO]: connected: expr=127.0.0.1:52436
2011-09-05T20:58:24.564755Z: [INFO]: Finished processing command queue
2011-09-05T20:58:41.833929Z: [INFO]: Queued document: (1,134) for indexing queue id:130 Response Time: 0.1415 secs
2011-09-05T20:58:41.833991Z: [INFO]: (127.0.0.1:52436): POST /document/1/134/ HTTP/1.1: 202
2011-09-05T20:58:41.856879Z: [INFO]: disconnecting: expr=127.0.0.1:52436
2011-09-05T20:58:43.314449Z: [INFO]: connected: expr=127.0.0.1:52437
2011-09-05T20:58:47.789669Z: [INFO]: Finished processing command queue
2011-09-05T20:59:05.197038Z: [INFO]: Queued document: (1,135) for indexing queue id:131 Response Time: 0.1715 secs
2011-09-05T20:59:05.197094Z: [INFO]: (127.0.0.1:52437): POST /document/1/135/ HTTP/1.1: 202
2011-09-05T20:59:05.248224Z: [INFO]: disconnecting: expr=127.0.0.1:52437
2011-09-05T20:59:13.959023Z: [INFO]: Finished processing command queue
2011-09-05T21:13:15.298532Z: [INFO]: connected: expr=127.0.0.1:50132
2011-09-05T21:13:15.331125Z: [INFO]: Response Time: 0.0010 secs
2011-09-05T21:13:15.331163Z: [INFO]: (127.0.0.1:50132): GET /queue/ HTTP/1.1: 200
2011-09-05T21:13:15.557704Z: [INFO]: Response Time: 0.0001 secs
2011-09-05T21:13:15.557755Z: [INFO]: (127.0.0.1:50132): GET /favicon.ico HTTP/1.1: 200
2011-09-05T21:13:20.912353Z: [INFO]: Response Time: 0.0001 secs
2011-09-05T21:13:20.912410Z: [INFO]: (127.0.0.1:50132): GET / HTTP/1.1: 200
2011-09-05T21:13:21.046808Z: [INFO]: Response Time: 0.0000 secs
2011-09-05T21:13:21.046854Z: [INFO]: (127.0.0.1:50132): GET /favicon.ico HTTP/1.1: 200
terminate called after throwing an instance of 'std::bad_alloc'
The debug document doesn't provide much help. Here's how it ends:
[...]
2011-09-05T20:58:46.116312Z: [DEBUG]: Saved: Document(1,134)
2011-09-05T20:58:47.735790Z: [DEBUG]: Adding: Document(1,134)
2011-09-05T20:59:10.745038Z: [DEBUG]: Saved: Document(1,135)
2011-09-05T20:59:13.910768Z: [DEBUG]: Adding: Document(1,135)
The server was running with only the -reset and -debug flags set. I then restarted it with just -debug -- after a long wait, I got:
[ec2-user@ip-10-82-33-207 superfastmatch]$ ./superfastmatch -debug
channel 2: open failed: connect failed: Connection refused
channel 2: open failed: connect failed: Connection refused
Error opening databases
2011-09-05T21:27:25.707812Z: [SYSTEM]: ================ [START]: pid=12379
2011-09-05T21:27:25.707954Z: [SYSTEM]: starting the server: expr=127.0.0.1:8080
2011-09-05T21:27:25.708041Z: [SYSTEM]: server socket opened: expr=127.0.0.1:8080 timeout=1.0
2011-09-05T21:27:25.708067Z: [SYSTEM]: listening server socket started: fd=5
2011-09-05T21:29:58.978421Z: [INFO]: connected: expr=127.0.0.1:56950
2011-09-05T21:29:59.017554Z: [INFO]: Response Time: 0.0141 secs
2011-09-05T21:29:59.017606Z: [INFO]: (127.0.0.1:56950): GET / HTTP/1.1: 200
2011-09-05T21:29:59.209966Z: [INFO]: Response Time: 0.0014 secs
2011-09-05T21:29:59.210011Z: [INFO]: (127.0.0.1:56950): GET /favicon.ico HTTP/1.1: 200
channel 3: open failed: connect failed: Connection refused
Killed
[ec2-user@ip-10-82-33-207 superfastmatch]$
The "channel" messages are from my attempts to refresh in my local machine's browser, of course. Interestingly, I managed to catch the process just as it died. I was trying to load /document/, which took a long time. top revealed that the process was steadily growing in memory requirements; as it hit 100%, the process died.
A bug or simply too much data? I certainly couldn't say, but given that I hadn't yet run the associate task -- which seems more likely to have a problematically O(N^2) flavor to it than just displaying an index of 130 loaded documents -- I'm hopeful that it's the former.
Out of curiousity, is there a rule of thumb that can be used for estimating physical memory requirements based on the document corpus and the various tunable hash parameters? I believe you started to walk me through some of this via Skype on Friday, but I'm afraid I don't remember the details.
Anyway, I hope these notes are of some use. I think we're in great shape for Thursday, and I have no need to get this particular task running -- it was just an exploratory exercise for me.
I should note that this was a fresh checkout of SFM, not the wolfram tag, which now that I think about it might have been a silly mistake on my part.
Tom