some notes on using SFM on an EC2 high-memory instance

Tom Lee

Sep 5, 2011, 5:42:04 PM
to superfa...@googlegroups.com
Just a few notes to share from a rainy afternoon's experiments with the software.  I spun up an EC2 instance with 15.7GB of RAM in order to try a congressional record matching task that had proved too ambitious for my laptop.  I used the stock 64-bit Amazon Linux AMI.

First, some notes on necessary dependencies (the Amazon AMI is very bare-bones).  These commands got me to a point where the bootstrap script could complete -- I might have left something out, having reconstructed this from my command history, but this is at least most of it:

sudo yum update
sudo yum install git
sudo yum install svn
sudo yum install gcc
sudo yum install gcc-c++
sudo yum install zlib-devel
wget http://download.savannah.gnu.org/releases/libunwind/libunwind-0.99.tar.gz
tar xzf libunwind-0.99.tar.gz
cd libunwind-0.99
./configure && make && sudo make install
cd ..
rm -rf libunwind-0.99

However, I still ran into a bit of trouble with the dynamic libraries.  For some reason I couldn't cajole ldconfig into processing /usr/local/lib, so ultimately I just copied the necessary libraries into /usr/lib (I would be more chagrined if this machine were going to exist beyond tonight).
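
For anyone retracing this, the usual fix on Fedora-family systems (including this AMI) is to register the directory with the dynamic linker rather than copying libraries around; the conf file name below is arbitrary:

```shell
# Register /usr/local/lib with the dynamic linker's search path
# (standard approach on Fedora-family distros such as the Amazon Linux AMI).
echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/usr-local.conf
sudo ldconfig
# Confirm libunwind is now visible to the linker:
ldconfig -p | grep libunwind
```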

I loaded the files from my local machine over SSH without incident -- ~130 files, 2.7G total.  When I tried to view the result in my browser (over the tunneled connection), I got an error. Here's the console output from the server:

Last login: Mon Sep  5 19:57:57 2011 from xxxx.comcast.net
      __|  __|_  )  Amazon Linux AMI
      _|  (     /     Beta
     ___|\___|___|
See /usr/share/doc/system-release-2011.02 for latest release notes. :-)
[ec2-user@ip-10-82-33-207 ~]$ cd sfm/superfastmatch/
[ec2-user@ip-10-82-33-207 superfastmatch]$ ./superfastmatch -reset -debug
2011-09-05T19:58:22.062895Z: [SYSTEM]: ================ [START]: pid=26216
2011-09-05T19:58:22.063061Z: [SYSTEM]: starting the server: expr=127.0.0.1:8080
2011-09-05T19:58:22.063152Z: [SYSTEM]: server socket opened: expr=127.0.0.1:8080 timeout=1.0
2011-09-05T19:58:22.063178Z: [SYSTEM]: listening server socket started: fd=9
2011-09-05T19:58:35.264731Z: [INFO]: connected: expr=127.0.0.1:36980
2011-09-05T19:58:37.157300Z: [INFO]: Queued document: (1,1) for indexing queue id:1 Response Time: 0.0303 secs
2011-09-05T19:58:37.157350Z: [INFO]: (127.0.0.1:36980): POST /document/1/1/ HTTP/1.1: 202
2011-09-05T19:58:37.176983Z: [INFO]: disconnecting: expr=127.0.0.1:36980
2011-09-05T19:58:38.110923Z: [INFO]: connected: expr=127.0.0.1:36981
2011-09-05T19:58:38.617096Z: [INFO]: Finished processing command queue
2011-09-05T19:58:47.070580Z: [INFO]: Queued document: (1,2) for indexing queue id:2 Response Time: 0.1344 secs
2011-09-05T19:58:47.070640Z: [INFO]: (127.0.0.1:36981): POST /document/1/2/ HTTP/1.1: 202
2011-09-05T19:58:47.110924Z: [INFO]: disconnecting: expr=127.0.0.1:36981
2011-09-05T19:58:48.718894Z: [INFO]: connected: expr=127.0.0.1:36982
2011-09-05T19:58:49.912017Z: [INFO]: Finished processing command queue
2011-09-05T19:59:09.690416Z: [INFO]: Queued document: (1,3) for indexing queue id:3 Response Time: 0.2751 secs
2011-09-05T19:59:09.690474Z: [INFO]: (127.0.0.1:36982): POST /document/1/3/

[ ...and so on ]
2011-09-05T20:57:52.335592Z: [INFO]: Queued document: (1,132) for indexing queue id:128 Response Time: 0.4621 secs
2011-09-05T20:57:52.335653Z: [INFO]: (127.0.0.1:48230): POST /document/1/132/ HTTP/1.1: 202
2011-09-05T20:57:52.453131Z: [INFO]: disconnecting: expr=127.0.0.1:48230
2011-09-05T20:57:54.302223Z: [INFO]: connected: expr=127.0.0.1:52435
2011-09-05T20:58:00.700804Z: [INFO]: Finished processing command queue
2011-09-05T20:58:18.218228Z: [INFO]: Queued document: (1,133) for indexing queue id:129 Response Time: 0.1626 secs
2011-09-05T20:58:18.218291Z: [INFO]: (127.0.0.1:52435): POST /document/1/133/ HTTP/1.1: 202
2011-09-05T20:58:18.235074Z: [INFO]: disconnecting: expr=127.0.0.1:52435
2011-09-05T20:58:19.750966Z: [INFO]: connected: expr=127.0.0.1:52436
2011-09-05T20:58:24.564755Z: [INFO]: Finished processing command queue
2011-09-05T20:58:41.833929Z: [INFO]: Queued document: (1,134) for indexing queue id:130 Response Time: 0.1415 secs
2011-09-05T20:58:41.833991Z: [INFO]: (127.0.0.1:52436): POST /document/1/134/ HTTP/1.1: 202
2011-09-05T20:58:41.856879Z: [INFO]: disconnecting: expr=127.0.0.1:52436
2011-09-05T20:58:43.314449Z: [INFO]: connected: expr=127.0.0.1:52437
2011-09-05T20:58:47.789669Z: [INFO]: Finished processing command queue
2011-09-05T20:59:05.197038Z: [INFO]: Queued document: (1,135) for indexing queue id:131 Response Time: 0.1715 secs
2011-09-05T20:59:05.197094Z: [INFO]: (127.0.0.1:52437): POST /document/1/135/ HTTP/1.1: 202
2011-09-05T20:59:05.248224Z: [INFO]: disconnecting: expr=127.0.0.1:52437
2011-09-05T20:59:13.959023Z: [INFO]: Finished processing command queue
2011-09-05T21:13:15.298532Z: [INFO]: connected: expr=127.0.0.1:50132
2011-09-05T21:13:15.331125Z: [INFO]:  Response Time: 0.0010 secs
2011-09-05T21:13:15.331163Z: [INFO]: (127.0.0.1:50132): GET /queue/ HTTP/1.1: 200
2011-09-05T21:13:15.557704Z: [INFO]:  Response Time: 0.0001 secs
2011-09-05T21:13:15.557755Z: [INFO]: (127.0.0.1:50132): GET /favicon.ico HTTP/1.1: 200
2011-09-05T21:13:20.912353Z: [INFO]:  Response Time: 0.0001 secs
2011-09-05T21:13:20.912410Z: [INFO]: (127.0.0.1:50132): GET / HTTP/1.1: 200
2011-09-05T21:13:21.046808Z: [INFO]:  Response Time: 0.0000 secs
2011-09-05T21:13:21.046854Z: [INFO]: (127.0.0.1:50132): GET /favicon.ico HTTP/1.1: 200
terminate called after throwing an instance of 'std::bad_alloc'
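
For reference, the load script was roughly a loop of this shape (a reconstruction, not my exact commands; the /document/1/<docid>/ endpoint matches the POST lines in the log above, and the echo stands in for the real HTTP request, whose form fields would depend on SFM's POST API):

```shell
# Hypothetical reconstruction of the load loop: each local file is POSTed
# to /document/1/<docid>/ with sequential docids, as seen in the log.
# Shown as a dry run; the real version would replace `echo` with e.g.
#   curl -s --data-urlencode "text@$f" "http://127.0.0.1:8080/document/1/$id/"
# (the field name is an assumption, not confirmed against SFM's API).
id=1
for f in doc_001.txt doc_002.txt doc_003.txt; do
  echo "POST /document/1/$id/ <- $f"
  id=$((id + 1))
done
```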

The debug log doesn't provide much help. Here's how it ends:

[...]
2011-09-05T20:58:46.116312Z: [DEBUG]: Saved: Document(1,134)
2011-09-05T20:58:47.735790Z: [DEBUG]: Adding: Document(1,134)
2011-09-05T20:59:10.745038Z: [DEBUG]: Saved: Document(1,135)
2011-09-05T20:59:13.910768Z: [DEBUG]: Adding: Document(1,135)

The server was running with only the -reset and -debug flags set.  I then restarted it with just -debug -- after a long wait, I got:

[ec2-user@ip-10-82-33-207 superfastmatch]$ ./superfastmatch -debug
channel 2: open failed: connect failed: Connection refused
channel 2: open failed: connect failed: Connection refused
Error opening databases
2011-09-05T21:27:25.707812Z: [SYSTEM]: ================ [START]: pid=12379
2011-09-05T21:27:25.707954Z: [SYSTEM]: starting the server: expr=127.0.0.1:8080
2011-09-05T21:27:25.708041Z: [SYSTEM]: server socket opened: expr=127.0.0.1:8080 timeout=1.0
2011-09-05T21:27:25.708067Z: [SYSTEM]: listening server socket started: fd=5
2011-09-05T21:29:58.978421Z: [INFO]: connected: expr=127.0.0.1:56950
2011-09-05T21:29:59.017554Z: [INFO]:  Response Time: 0.0141 secs
2011-09-05T21:29:59.017606Z: [INFO]: (127.0.0.1:56950): GET / HTTP/1.1: 200
2011-09-05T21:29:59.209966Z: [INFO]:  Response Time: 0.0014 secs
2011-09-05T21:29:59.210011Z: [INFO]: (127.0.0.1:56950): GET /favicon.ico HTTP/1.1: 200
channel 3: open failed: connect failed: Connection refused
Killed
[ec2-user@ip-10-82-33-207 superfastmatch]$

The "channel" messages are from my attempts to refresh the page in my local machine's browser, of course.  Interestingly, I managed to catch the process just as it died.  I was trying to load /document/, which took a long time.  top revealed that the process's memory usage was growing steadily; when it hit 100%, the process died.

A bug, or simply too much data? I certainly couldn't say, but given that I hadn't yet run the associate task -- which seems more likely to have a problematic O(N^2) flavor than simply displaying an index of 130 loaded documents -- I'm hopeful that it's the former.

Out of curiosity, is there a rule of thumb for estimating physical memory requirements based on the document corpus and the various tunable hash parameters?  I believe you started to walk me through some of this via Skype on Friday, but I'm afraid I don't remember the details.

Anyway, I hope these notes are of some use.  I think we're in great shape for Thursday, and I have no need to get this particular task running -- it was just an exploratory exercise for me.

I should note that this was a fresh checkout of SFM, not the wolfram tag, which now that I think about it might have been a silly mistake on my part.

Tom

Donovan Hide

Sep 5, 2011, 6:02:28 PM
to superfa...@googlegroups.com
Hi Tom,

thanks for the detailed logs! Well done for getting it to work on the Fedora/Amazon distro!

Looking at the log, I think you've found a bug with the queue page: in trunk it loads the payloads (i.e. the contents of each POST) into memory, which eats a huge amount of RAM. I've fixed this in my branch -- it should be pushed tomorrow or Wednesday.

I'd need access to the server and gdb to determine for certain whether that is the cause of the bad_alloc error.

You might have spotted "Error opening databases" on the second run; this happens after a bad shutdown. The process should really exit with an error code when this happens. The reason for the bad shutdown is the bad_alloc error! If it takes ages to load, then running with -reset and reloading is the best route for now.

The hash_width setting is the biggest influence on memory usage. It defaults to 28; 24 is good for a small memory footprint, at the cost of slower searches due to hash collisions. htop is the best tool for keeping an eye on memory usage -- I did have memory usage reported on the status page, but tcmalloc doesn't support memory leak checking on OS X, which is where I've been developing. That will be reinstated soon.
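
To make the trade-off concrete: the bucket count doubles with every extra bit of hash_width, so dropping from 28 to 24 shrinks the table to a sixteenth of its size. The bucket counts below are exact arithmetic; bytes per bucket depend on SFM internals, so I'm only showing counts:

```shell
# Number of hash buckets implied by each hash_width value.
# (Counts only -- any bytes-per-bucket figure would be an assumption
# about SFM's internal data structures.)
for width in 24 26 28; do
  buckets=$((1 << width))
  echo "hash_width=$width -> $buckets buckets"
done
# hash_width=24 -> 16777216 buckets
# hash_width=26 -> 67108864 buckets
# hash_width=28 -> 268435456 buckets
```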

If you give me ssh access to the machine, I can definitely get it running well for you. If you want to try again yourself, run with -reset and don't look at the queue page for now!

Cheers,
Donny.

Tom Lee

Sep 5, 2011, 6:20:05 PM
to superfa...@googlegroups.com
Thanks, Donny! I haven't yet shut down the machine, so sure: if you can send me a public key, I think I can get you into the instance.  I was in the process of pulling down the ./data directory for future use before shutting down the instance, but from what you're saying it sounds like it may be corrupt and unusable anyway.

At this point it doesn't matter a tremendous amount to me whether this particular installation gets fixed or not, but if it's of forensic interest to you then I'll be happy to leave it online and get you access.  Just let me know what you'd like to do.

Donovan Hide

Sep 5, 2011, 6:24:22 PM
to superfa...@googlegroups.com
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAsRsMV4ndPJYZHjyfa/5Cbg8/Qlt2jLujz5UbHrvmh3IrOT8OHUl5OViNrD9EwiWTxi9z/8JQS+SJxeOemfPAwQhje8cvqI5VTCXIZ0DbqXSH3p7MK3LQ+TbVNihfb2gsPDm5gri6+fob5dEPfup3mOv+J/K0xPtA5C0PhcQMj25mgazePWKXPt8srhmHlJGkEj8ZGNIROWO3rLYC4XCr6D88FoWHfJQG8yyb27SRRFuT/YNlwsuYMZsc98gYytMitTbRHRAOCGjnDsYoDiR1j9DE3kQ/d5czIWm6VWYL3pIhK/IrndHfCShuefde/P1jxi5PbtyFvAz15iNrKl2xdQ== donov...@donovan-hides-macbook-2.local

Definitely not worth saving the data directory. Reloading should be much quicker than copying. Will get it working well!

Tom Lee

Sep 5, 2011, 6:37:45 PM
to superfa...@googlegroups.com
Great! I'm sending you the machine name via email directly to you rather than to the SFM list.