done testing

Justin Sheehy

unread,

Jun 12, 2008, 4:05:25 PM6/12/08

to wide-...@googlegroups.com

Discovered a couple of lines in the log that caused problems - no
protocol after the URI, ridiculously long and invalid method, etc.

I'm off for now.

-Justin

Tim Bray

unread,

Jun 12, 2008, 4:38:59 PM6/12/08

to wide-...@googlegroups.com

On Thu, Jun 12, 2008 at 1:05 PM, Justin Sheehy <jus...@iago.org> wrote:
>
> Discovered a couple of lines in the log that caused problems - no
> protocol after the URI, ridiculously long and invalid method, etc.

Should we agree, for benchmarking purposes, that approximate results
are OK for Wide Finder 2? I.e. go ahead and split on spaces? -Tim

Justin Sheehy

unread,

Jun 12, 2008, 4:45:25 PM6/12/08

to wide-...@googlegroups.com

On Jun 12, 2008, at 4:38 PM, Tim Bray wrote:

> Should we agree, for benchmarking purposes, that approximate results
> are OK for Wide Finder 2? I.e. go ahead and split on spaces?

I suspect that the difference between approximate results and correct
results is likely to affect either speed or beauty of a given
implementation.

I wouldn't find an approximate solution objectionable, but would be much
more impressed by one that is beautiful, uses the performance aspects of
the 32-core system well, and is also correct.

-Justin

Erik Engbrecht

unread,

Jun 12, 2008, 4:51:29 PM6/12/08

to wide-...@googlegroups.com

-1

Real world code often has to deal with crap into, and the input we have here isn't really even crap. It just takes a regex to split apart instead of doing a quick-and-dirty split on whitespace, so requiring correct line processing is doing is knocking out an easy optimization that has limited applicability.

Brian Frank

unread,

Jun 12, 2008, 6:16:44 PM6/12/08

to wide-finder

> Should we agree, for benchmarking purposes, that approximate results
> are OK for Wide Finder 2? I.e. go ahead and split on spaces? -Tim

I vote for just splitting on spaces. To me this challenge about
how to tackle concurrency; error handling for a couple records
out of 218 million kind seems like a distraction. Then again
I just don't want to have to rework/rerun my program either :)

Ray Waldin

unread,

Jun 12, 2008, 6:50:10 PM6/12/08

to wide-...@googlegroups.com

Is there a correct way to parse the log files? I don't think so. Choose one regex, and you'll miss some records that another regex finds. If we require that submissions use a regular expression, then which regular expression? I know mine misses a couple of log entries that are not in the top ten.

I'm okay with leaving the reference implementation as is. Differences between each submission's results and the reference's results should be explained within reason. It should be obvious from the explanation whether the difference was caused by a fundamentally unsustainable design, where "correcting" the code would result in a major degradation in performance.

-Ray

Erik Engbrecht

unread,

Jun 12, 2008, 6:57:36 PM6/12/08

to wide-...@googlegroups.com

Frank,

It's not really error handling. There's no predefined grammar to which the input file is expected to conform. It just kind of is. The offending lines aren't invalid, just different.

Anyway, I think it should be handled because, as you say, this is about concurrency. Concurrency is real-world, simple, common applications. Taking on a problem that is too simple allows for easy optimizations that don't have much to do with concurrency but have a very noticeable impact on performance.

Of course I'm not sure how many people are actually optimizing that piece. It doesn't look like you are. I haven't, and I'll admit the idea debugging a regex to do it doesn't sound like fun. I think Ray put together a proper regex to do the splitting, so he's not cheating.

I don't know. Maybe I'm being too anal, because I intentionally coded mine so it wouldn't be fault tolerant (I think I'm a masochist) so I've spent a lot of time debugging mis-processed lines. I still vote for doing it right, but I won't lose any sleep if people decide we should go the other way.

Anyway, I looked at your code and have a couple comments:

1. You can use Runtime.availableProcessors to automatically size your workers according to the system you are running on.

2. I noticed some problems in my code on splitting on single space that are fixed by splitting on "\\s+"

Brian Frank

unread,

Jun 12, 2008, 6:58:06 PM6/12/08

to wide-finder

Maybe we should generate a new test file with the offending lines
removed?

Erik Engbrecht

unread,

Jun 12, 2008, 7:07:55 PM6/12/08

to wide-...@googlegroups.com

That could be a second benchmark. Write a program that strips out offending lines and writes out the result.

Having multiple benchmarks would encourage making pieces of the code reusable.

It could add some unicode characters while it's at it... ;-)

Of course everyone replicating 42GB files would probably use too much storage...

...or I could just put away my whip and say we can split on spaces...

On 6/12/08, Brian Frank <brian...@gmail.com> wrote:

Mauricio Fernandez

unread,

Jun 12, 2008, 7:42:08 PM6/12/08

to wide-...@googlegroups.com

On Thu, Jun 12, 2008 at 07:07:55PM -0400, Erik Engbrecht wrote:
> Of course everyone replicating 42GB files would probably use too much
> storage...
>
> ...or I could just put away my whip and say we can split on spaces...

FWIW, I did a quick run against O.10m (which involves no actual I/O) with a
regexp-based version and the one that splits on spaces; the difference is
below 5%, with both running in around 20s (my best time is ~18.5s with some
additional micro-optimization), so it's not really very important, as far as
optimizations go. Switching from line-oriented to block I/O makes a larger
difference, but I don't need that to be I/O bound anyway for O.all (as in:
iostat showing 100% use on c0t2d0 and c0t3d0).

IMO there's nothing wrong with the reference implementation, and a couple
incorrectly handled lines out of over 200 million aren't worth the bother of
running the Ruby script for >24H and having everybody run their programs
again.

--
Mauricio Fernandez - http://eigenclass.org

Preston L. Bannister

unread,

Jun 12, 2008, 11:19:57 PM6/12/08

to wide-finder

I am not much interested in approximate solutions, for my part.

If for this exercise someone else wants to do something different,
that's fine. The code is public, so you can interpret the results as
you wish.

Eric Wong

unread,

Jun 13, 2008, 4:13:32 AM6/13/08

to wide-finder

On Jun 12, 1:38 pm, "Tim Bray" <timb...@gmail.com> wrote:

> On Thu, Jun 12, 2008 at 1:05 PM, Justin Sheehy <jus...@iago.org> wrote:
>
> > Discovered a couple of lines in the log that caused problems - no
> > protocol after the URI, ridiculously long and invalid method, etc.
>
> Should we agree, for benchmarking purposes, that approximate results
> are OK for Wide Finder 2? I.e. go ahead and split on spaces? -Tim

I strongly concur. Outliers and corner cases for statistics are
exactly
that and spending too much time on that stuff is a harmful waste of
programmer time[1] and should not be encouraged.

This is especially true of web log processing due to uncontrollable
variables like bots, unread reader subscriptions, network/computer
(un)reliability, various forms of caching/readahead, etc. Thus data
gathered from web logs will never be a 100% accurate representation of
what users of your site are actually doing, but a good enough
approximation.

[1] - which is arguably the most important resource of all

James Aylett

unread,

Jun 13, 2008, 5:25:44 AM6/13/08

to wide-...@googlegroups.com

On Thu, Jun 12, 2008 at 06:57:36PM -0400, Erik Engbrecht wrote:

> Anyway, I think it should be handled because, as you say, this is about
> concurrency. Concurrency is real-world, simple, common applications.
> Taking on a problem that is too simple allows for easy optimizations that
> don't have much to do with concurrency but have a very noticeable impact on
> performance.

I think the same argument applies in the other direction: unless
someone can think of a clever way of parallelising the parsing of an
individual line, correct versus loose parsing is just an optimisation.

(That's only true if the cost of correct processing is so high as to
require task decomposition as well as data decomposition, of
course. That probably doesn't apply here, and in any case I think some
people are already decomposing reading and parsing/processing.)

James

--
/--------------------------------------------------------------------------\
James Aylett xapian.org
ja...@tartarus.org uncertaintydivision.org

Reply all

Reply to author

Forward