I'm going to write up Wide Finder, need code


Tim Bray

Oct 14, 2008, 6:24:54 PM
to wide-...@googlegroups.com
This has two steps. I'm presenting the work at an internal Sun
conference next week, just a poster session. Then, it would be nice
to turn this into a real paper and present it somewhere. I'll start
the writing and anybody who's contributed code and is willing to
review can be a co-author.

So, right now, I'd like to read the code. Currently, I can't find:

- Mauricio Fernandez, various variations of wf2_multicore.ml
- Eric Wong.. um, I seem to have the git incantation wrong. I'm a git
n00b. How do I get the code?
- Ray Waldin - there are highlights & snippets at
http://blog.waldin.net/2008/06/diminishing-returns.html, can I see it
all?
- Alastair Rankine - uh,
http://girtby.net/archives/2008/7/3/wide-finder-2-the-widening ?
- Erik Engbrecht - no links?
- Alex - http://grep.ro/blog/2008/06/wide_finder_looking_for_bottlenecks
no code?
- Preston Bannister -
http://bannister.us/weblog/2008/06/12/wider-finder-final-result/ no
code?

I assume that the code is all scattered around /export/home on the
wf2.network.com machine, but I'd be happier if I knew I was looking at
the right stuff.

And if anyone thinks there are some obvious conclusions that have to
go in the initial write-up, please say so.

And... thanks for contributing. I think this is important work. -Tim

Ray Waldin

Oct 14, 2008, 7:54:20 PM
to wide-...@googlegroups.com
Yes, of course. The executable jar file referenced in my blog entries contains both class files and .scala source files. For convenience, I've created a source-only tarball which you can get here:

http://waldin.net/downloads/WideFinder2c.tgz 

Also, the same files are in ~rwaldin/projects/wf2/src on wf2.network.com, in case that's easier for you to get at.

I'd be happy to review or contribute to the write-up in any way I can, as time permits.

-Ray

Ray Waldin

Oct 15, 2008, 4:15:55 PM
to wide-...@googlegroups.com, tim...@gmail.com
BTW, I found your email in my gmail spam bin.  I think the recent series of spam on the widefinder mailing list might be causing similar problems for other folks.  Maybe that explains the lack of response so far...

-Ray

Alastair Rankine

Oct 15, 2008, 7:02:26 PM
to wide-...@googlegroups.com
Tim,

To answer your question, my code is in /export/home/alastair/wf2/wf2.cpp

I too was planning to give a talk on my experience with WF2 but canned
it, mainly due to positive-result bias on my part.

In summary though, my experience with WF2 was one of mostly incorrectly
guessed bottlenecks.

My initial implementations of WF2 were developed using the smaller 1m
and 10m data sets. I spent some time profiling these using Apple's
(excellent) Shark tool, and was hoping that this would result in
reasonable times for the full data set. I was wrong.

My problem was that I had assumed that memory-mapped I/O would be
close to optimal for large data files, as it is for small ones. I still
think this was a reasonable assumption, and indeed it may hold on
other kernels.

In an attempt to improve I/O performance, I changed my code to use mmap
on sections of the input file as required, instead of the whole file at
once. This worked, and I believe is responsible for the timed run that I
posted on the wiki, but isn't as competitive as other implementations,
most notably Christoph Bartoschek's:
http://www.pontohonk.de/wide-finder2/wide-finder2.html.
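
For illustration, the section-at-a-time mmap idea can be sketched in Python (my entry is C++; the 64 MiB window size and the helper name here are hypothetical choices, not from my code):

```python
import mmap
import os

# Sketch of mapping a large file one window at a time instead of all at
# once.  The 64 MiB window size is an arbitrary illustrative choice.
WINDOW = 64 * 1024 * 1024

def windowed_chunks(path, window=WINDOW):
    """Yield the file's contents as a sequence of mmap'd windows."""
    size = os.path.getsize(path)
    # mmap offsets must be multiples of the allocation granularity
    window -= window % mmap.ALLOCATIONGRANULARITY
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            length = min(window, size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as m:
                yield bytes(m)
            offset += length
```

Note that log records straddling two windows still have to be stitched together by the caller, which is part of what makes this approach fiddlier than it first appears.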

Christoph had experimented with different I/O models and seemed to have
identified the POSIX API as the best-performing for the target machine
and dataset. After reading his writeup I started to adapt my
implementation to use the POSIX API (and the current version in the
source file above uses this), but realised after a while that there was
very little point proceeding, because a) Christoph had already come up
with a near-optimal solution, and also b) I was a bit burned out on the
project.

If I were doing it again I think I would approach it differently.
Traditional wisdom dictates that you get it working and *then* optimise
for performance based on analysis of bottlenecks, etc. However for this
problem there is a level of performance below which you can consider the
code to be simply not working. The naive Ruby implementation is an
example of this: a 24-hour run time is not really usable and hence might as
well be non-functional.

So I think the way to approach this problem is to establish a baseline
level of performance based on I/O, exploring how much parallelism can be
used, and then build onto that incrementally until the desired
functionality is in place. At each iteration, re-test to ensure that you
haven't suffered significant performance loss. I think this would have
saved significant development time; changing the I/O model was a very
time-consuming process and in retrospect I should have established the
best model up-front before attempting the task at hand. If I were
attempting it again I might look at something like Boost Asio, which is
an asynchronous I/O library.

Then there is the goal of the project as a whole, namely exploring
different programming models which allow developers to take advantage of
parallel hardware with minimal effort. I don't think I have anything to
add here, given how much time and effort I expended battling I/O models.

Despite not having any useful results to show, overall I found the
experience of writing a WF2 implementation to be incredibly valuable,
and it taught me a fair bit about coding for high performance and
scalability.

Wide Finder 3, anyone?

James Aylett

Oct 15, 2008, 7:31:34 PM
to wide-...@googlegroups.com
On Wed, Oct 15, 2008 at 9:15 PM, Ray Waldin <r...@waldin.net> wrote:

> BTW, I found your email in my gmail spam bin. I think the recent series of
> spam on the widefinder mailing list might be causing similar problems for
> other folks. Maybe that explains the lack of response so far...

This is certainly what happened to me - for some reason Alastair's got
through, however, which prompted me to go looking...

There isn't much I'd highlight, except that I got stuck with Python on
a combination of incomplete support for OS-level things like mmap()
and the complexity of identifying where my actual bottlenecks were (as
I didn't seem to be maxing out CPU at any time, and I wasn't hitting
the raw performance of the disks either). A case of the wrong tools,
really; but I had no idea at the beginning that they'd turn out to be
wrong. My takeaway: even beyond expected performance differences, all
platforms are not equal by any measure, and they may differ in ways
you don't notice until it's too late...

I believe Alex's python implementation was considerably simpler than
mine, but ran quite a bit faster, suggesting that most of the things I
was doing beyond the basic parallelisation were actually causing
things to run slower; but I wasn't able to figure out exactly what
before I ran out of time.

WF2 was a good dose of humility-inspiring failure for me :-)

James

Eric Wong

Oct 16, 2008, 4:49:52 AM
to wide-...@googlegroups.com
On 10/14/08, Tim Bray <tim...@gmail.com> wrote:
>
> This has two steps. I'm presenting the work at an internal Sun
> conference next week, just a poster session. Then, it would be nice
> to turn this into a real paper and present it somewhere. I'll start
> the writing and anybody who's contributed code and is willing to
> review can be a co-author.
>
> So, right now, I'd like to read the code. Currently, I can't find:
>
> - Maurico Fernandez, various variations of wf2_multicore.ml
> - Eric Wong.. um, I seem to have the git incantation wrong. I'm a git
> n00b. How do I get the code?

git clone http://bogomips.org/wf2/wf2.git

To get to any arbitrary revision (especially those I mentioned in posts here)
in my repo:

git reset --hard <sha1>

I think my favorite[1] was with:
e80624c7513c15a4d6154fd4c0c18927cd5b1b5c

In case you're still having trouble, I've also packaged a tarball here:
http://bogomips.org/wf2/wf2-e80624c7513c15a4d6154fd4c0c18927cd5b1b5c.tar.gz

> I assume that the code is all scattered around /export/home on the
> wf2.network.com machine, but I'd be happier if I knew I was looking at
> the right stuff.

The stuff in my home directory was using awka, which I wasn't overly
happy with (it was faster, just not enough to really matter).

> And if anyone thinks there are some obvious conclusions that have to
> go in the initial write-up, please say so.

To me, the biggest challenge was dealing with a single huge log file on
one machine instead of split out files. Otherwise, I certainly didn't
need fancy tools to get good results.

The online/pipelined reduce logic using fifos and tail(1) is probably
the least straightforward part of it all; let me know if you need help
understanding it.
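
My sh + Make + fifo version isn't reproduced here, but the online-reduce idea itself can be sketched in Python (the function names and the log-field assumption below are illustrative, not taken from my code):

```python
from collections import Counter
from multiprocessing import Pool

# Sketch of an online/pipelined reduce: workers emit partial counts and
# the reducer folds each result in as soon as it arrives, instead of
# waiting for every worker to finish.

def count_chunk(lines):
    """Map step: count hits per request path in one chunk of log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > 6:
            counts[fields[6]] += 1  # request path in common log format
    return counts

def online_reduce(chunks, workers=4):
    """Reduce step: merge partial counts in completion order."""
    total = Counter()
    with Pool(workers) as pool:
        # imap_unordered yields each partial result as it completes
        for partial in pool.imap_unordered(count_chunk, chunks):
            total.update(partial)  # fold in immediately: the "online" part
    return total
```

The real thing does the same folding over fifos fed by tail(1), which lets the reduce overlap with the map instead of running after it.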

> And... thanks for contributing. I think this is important work. -Tim

You're welcome, and thank you for pulling this all together.

At $DAYJOB, I've had to semi-regularly hack together one-off Makefile +
awk + ssh/curl/netcat solutions to do log processing. We're also moving
towards MogileFS to more consistently handle storage of all the huge
files we get (data storage is/was the biggest challenge IMHO).

<bragging>
With MogileFS, I did a quick one-off last week and managed to rip
through 38G of gzipped Rails logs in ~7.5 minutes (on a first try :)

Granted I had access to some decently fancy x86-64 hardware (2 machines
with 16 cores each, GigE network), but I was still just using curl |
gawk (was too lazy to install mawk, actually) and using shell and make
as glue.

I'll try to put up better writeups somewhere of the stuff I'm doing with
MogileFS in the near future.
</bragging>

I've also been starting to implement the glue logic in Ruby (outsourcing
heavy-duty stuff to awk) to avoid needing the file timestamps that Make
relies on. The online/pipeline-reduce logic I implemented in sh + Make
would be much easier given Ruby's excellent threads API[1].

--
Eric Wong

[1] Ruby's threads API is really the best one I've seen. And for
glue code, the green-threaded MRI implementation doesn't bother me at
all, either, since I fork() within the threads themselves to spawn
ssh/curl/awk :)

Eric Wong

Oct 16, 2008, 4:56:47 AM
to wide-finder
On Oct 16, 1:49 am, "Eric Wong" <normalper...@gmail.com> wrote:
> I think my favorite[1] was with:
>   e80624c7513c15a4d6154fd4c0c18927cd5b1b5c

I managed to corrupt my pointers; I meant to reference the message
I got about awka there:

http://groups.google.com/group/wide-finder/browse_thread/thread/c8e5c7f331ffe087/6cfb572e9d783507?lnk=gst&q=wong+awka#6cfb572e9d783507

Erik Engbrecht

Oct 16, 2008, 9:12:47 AM
to wide-...@googlegroups.com
Tim,
  I'll get my code posted this weekend.

  Incidentally, this thread got marked as spam by Gmail. Wonder why...
-Erik

Mauricio Fernandez

Oct 17, 2008, 7:29:50 AM
to wide-...@googlegroups.com
On Tue, Oct 14, 2008 at 03:24:54PM -0700, Tim Bray wrote:
>
> This has two steps. I'm presenting the work at an internal Sun
> conference next week, just a poster session. Then, it would be nice
> to turn this into a real paper and present it somewhere. I'll start
> the writing and anybody who's contributed code and is willing to
> review can be a co-author.
>
> So, right now, I'd like to read the code. Currently, I can't find:
>
> - Maurico Fernandez, various variations of wf2_multicore.ml

These are the versions I made, in chronological order (please find them
attached), along with some comments:

* wf2_simple.ml: naïve port of the reference implementation,
single-core and with line-oriented input. It is an order of magnitude faster
than the original Ruby version.

* wf2_multicore.ml: uses process-based parallelism. This was the first program
I executed on the server; it ran in ~10 minutes on its very first run, and in
8 minutes after a few tweaks, by the third run.

* wf2_multicore2.ml: better parallelism obtained by processing the results
from the worker processes as they arrive instead of at once when they're all
done. Ran in 7 minutes.

* wf2_multicore2_block.ml: uses block- instead of line-oriented IO, unlike all
the previous entries. Processes the 42GB logs in 5 minutes, around 300 times
faster than the reference implementation.

* wf2_multicore2_block_small.ml + framework.ml:
the reusable functions from wf2_multicore2_block.ml are extracted into
a generic Framework module, leaving a minimal 54-line driver with the
code specific to Widefinder2; otherwise identical to
wf2_multicore2_block.ml.

The multicore versions use one of the two generic functions for parallel
processing in parallel.ml (wf2_multicore.ml uses invoke; the remaining
versions use invoke').
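
parallel.ml itself is attached rather than quoted, but the general shape of a fork-based invoke (the Python below is a rough analogue of the idea, not a translation of the OCaml) is roughly:

```python
import os
import pickle

# Rough analogue of a fork-based `invoke`: run f(x) in a child process
# and return a join function that waits for and unpickles the result.
# No shared data, no locks, no threads.

def invoke(f, x):
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:                  # child: compute, send result, exit
        os.close(r)
        with os.fdopen(w, "wb") as out:
            pickle.dump(f(x), out)
        os._exit(0)
    os.close(w)                   # parent: hand back a join function
    def join():
        with os.fdopen(r, "rb") as inp:
            result = pickle.load(inp)
        os.waitpid(pid, 0)
        return result
    return join
```

A caller forks one child per chunk and then joins them all, e.g. `joins = [invoke(work, c) for c in chunks]` followed by `[j() for j in joins]`, which is essentially the whole "parallel tax".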

Some conclusions:

* There's as much to gain from parallelism as from more efficient
single-core execution at this stage. The fastest programs are roughly 300X
faster than the reference implementation in Ruby, and need 15 to 30 times
less CPU time, with a gain attributable to parallelism ranging from 8X to
17X.

In fact, at least in the OCaml case, most of the engineering effort goes
into optimizations that apply to both the single- and the multi-core
cases. In the OCaml programs, parallelism imposes a minimal overhead; this
is shown clearly in wf2_multicore2_block_small.ml, where the "parallel tax"
is around 5 lines of code. There is no shared data, no locks, and none of
the complexity associated with threads.

* The processing speed is ultimately determined by the IO subsystem. Only C++
and OCaml were able to reach that bottleneck. Note the relatively large
difference in user CPU time between the top three entries, which all run in
~5 minutes clock time nevertheless. The entry taking the least CPU time (15%
less than the closest one) is not the fastest (it's 7% slower than the fastest
entry). The OCaml entry is only 10% slower, even though it takes over twice
as much CPU time! When I optimized a single function (the one that splits a
line) in wf2_multicore2_block.ml, putting the CPU time below 1 hour (i.e.,
comparable to the fastest C++ implementations), I observed no noticeable
difference in the final execution speed. This shows that the bottleneck is
indeed IO.

To sum up, the final speedup can be broken down into these factors:
* 1 order of magnitude: use of a more efficient language implementation.
Does not necessarily cause an explosion in code size, as shown by the
wf2_simple.ml OCaml version which is in fact shorter than the reference
Ruby implementation by line count (this is far from being the case in
C++, though).
* 1 order of magnitude: parallelism. Not too onerous to achieve either, in
this case.
* a smaller factor of around 2-4X: optimizations that apply regardless of
parallelism, the IO mechanism employed, etc.
(Combined, 10 x 10 x 2-4 gives roughly 200-400X, consistent with the ~300X
observed.)

--
Mauricio Fernandez - http://eigenclass.org

framework.ml
parallel.ml
wf2_multicore2_block.ml
wf2_multicore2_block_small.ml
wf2_multicore2.ml
wf2_multicore.ml
wf2_simple.ml

Erik Engbrecht

Oct 19, 2008, 8:59:17 PM
to wide-...@googlegroups.com
/export/home/erik/WideFinder2.zip

It contains a NetBeans project, so if you use NetBeans and have the Scala plugin installed, it should be fairly easy to get working. Otherwise it should be buildable with ant and Scala.