splitting on spaces doesn't work


Erik Engbrecht

Jun 10, 2008, 11:43:28 PM
to wide-...@googlegroups.com
I was fortunate enough to have my process error out on this line (notice that it's near the end of the dataset):

customer-reverse-entry.208.96.54.93 - - [12/Oct/2007:18:41:47 -0700] "GET /ongoing/What/The World/Products HTTP/1.0" 404 371 "-" "disco/Nutch-1.0-dev (experimental crawler; www.discoveryengine.com; disco...@discoveryengine.com)"

What you'll notice is that there is a space in the URL.  If you split on spaces, HTTP/1.0 ends up in the status field instead of 404.  Ruby, and I imagine many other languages, happily parses HTTP/1.0 as 0.  Of course java.lang.Integer.parseInt() gives you a NumberFormatException, thus proving that just because your program successfully processed 40+ GB of data doesn't mean your program is correct.
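The failure mode is easy to reproduce. A quick Java sketch (the user-agent field is abbreviated here, and field index 8 assumes the usual combined-log layout):

```java
public class SplitBug {
    // The offending line, abbreviated; note the space inside the URL.
    static final String LINE =
        "customer-reverse-entry.208.96.54.93 - - [12/Oct/2007:18:41:47 -0700] "
        + "\"GET /ongoing/What/The World/Products HTTP/1.0\" 404 371 \"-\" "
        + "\"disco/Nutch-1.0-dev (experimental crawler)\"";

    // With no space in the URL, the status code would be field 8 of a
    // plain space split; here the extra space shifts everything right.
    static String naiveStatus(String line) {
        return line.split(" ")[8];
    }

    public static void main(String[] args) {
        System.out.println(naiveStatus(LINE)); // prints HTTP/1.0" rather than 404
    }
}
```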

So the "reference implementation" has a bug, even though this minor detail probably won't affect the actual output.

--
http://erikengbrecht.blogspot.com/

Tim Bray

Jun 11, 2008, 12:37:37 AM
to wide-...@googlegroups.com
Yeah, Ray found a couple of others, also having to do with odd spaces
appearing here and there. I'll correct and re-run, but as you point
out, it probably doesn't affect the timing much. -T

Erik Engbrecht

Jun 11, 2008, 1:10:46 AM
to wide-...@googlegroups.com
It could affect the timing of other implementations that may be using hand-optimized parsers instead of regular expressions...

But more importantly, at least in my opinion, it affects the "beautiful code" aspect.  Beautiful code is only really beautiful if it is correct.  A function that splits a string on whitespace is pretty short and possibly worth it for a speed boost, but once you add additional rules about quotes and such it quickly becomes ugly, and the regex becomes more attractive.
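For instance, a sketch of what such a regex might look like for the combined log format (this is just an illustration, not anyone's actual implementation; quoted fields are matched with [^"]* so embedded spaces are fine, though escaped quotes inside a field would still defeat it):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRegex {
    // Quoted fields are matched with [^"]* so an embedded space in the URL
    // can't shift the status code out of group 6.
    static final Pattern LOG = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"");

    public static void main(String[] args) {
        String line = "host - - [12/Oct/2007:18:41:47 -0700] "
                + "\"GET /ongoing/What/The World/Products HTTP/1.0\" 404 371 \"-\" \"crawler\"";
        Matcher m = LOG.matcher(line);
        if (m.find()) {
            System.out.println(m.group(6)); // prints 404, even with a space in the URL
        }
    }
}
```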

Erich

Jun 11, 2008, 9:24:34 AM
to wide-finder

I think this is where I have to disagree... In my opinion, when
parsing 42G of data, if you miss a few lines here and there, no big
loss, as you're really looking for big patterns, not 100.0000%
accuracy.

99.9% accuracy would seem fine for log rolling. If this was banking
data, not so much. :-)

And okay, I don't want to rewrite my parser - I want to spend time
figuring out my IO/memory bottlenecks, because those are the more
interesting part of this project :-)

--Erich

Justin Sheehy

Jun 11, 2008, 9:28:59 AM
to wide-...@googlegroups.com

On Jun 11, 2008, at 9:24 AM, Erich wrote:

> I think this is where I have to disagree... In my opinion, when
> parsing 42G of data, if you miss a few lines here and there, no big
> loss, as you're really looking for big patterns, not 100.0000%
> accuracy.
>
> 99.9% accuracy would seem fine for log rolling. If this was banking
> data, not so much. :-)

There are plenty of companies where the two aren't that far apart.

If your business is the web, your server logs are (part of) your
accounting.

> And okay, I don't want to rewrite my parser - I want to spend time
> figuring out my IO/memory bottlenecks, because those are the more
> interesting part of this project :-)

Of course, you might have different bottlenecks if you had to have a
more interesting parser...

But I'm just being contrary.

-Justin

Erich

Jun 11, 2008, 9:36:42 AM
to wide-finder
> If your business is the web, your server logs are (part of) your
> accounting.

We're on the web, and our server logs are rolled every 30 days, never
to be seen again :-) But yes, other businesses may be different

> Of course, you might have different bottlenecks if you had to have a
> more interesting parser...

What differences in processing time have you seen, if you've had both
a space-delimited and "more interesting" parser?

--Erich

Justin Sheehy

Jun 11, 2008, 9:45:00 AM
to wide-...@googlegroups.com
On Jun 11, 2008, at 9:36 AM, Erich wrote:

>> If your business is the web, your server logs are (part of) your
>> accounting.
>
> We're on the web, and our server logs are rolled every 30 days, never
> to be seen again :-) But yes, other businesses may be different

Some fairly large businesses depend on web server logs for billing.

Of course, they tend to pre-process them heavily and stuff the content
of the logs into databases before trying to extract this kind of
information.
*shrug*

>> Of course, you might have different bottlenecks if you had to have a
>> more interesting parser...
>
> What differences in processing time have you seen, if you've had both
> a space-delimited and "more interesting" parser?

I haven't done a comparison, since I've made tweaks all along when
these kinds of things have come up. And since I haven't had much time
to work on WF2, I haven't taken slots away from more active participants
to do anything interesting on the full dataset.

My first implementation used Erlang binaries and did simple byte-splitting,
but that looked ugly. Looking at the state of things again now, I'm not
sure how trivial the parsing can get and still be correct for the purposes
of WF2.

-Justin

Erik Engbrecht

Jun 11, 2008, 10:15:34 AM
to wide-...@googlegroups.com
Yes, processing logs often does not need to be an exact science, but in my opinion what we're trying to accomplish here should be applicable to more exacting domains.  This is really a toy problem intended to be representative of other problems.
 
All too often, moving from the 80% solution to the 90% solution, or maybe in this case from the 99% solution to the 100% solution, involves an exponential increase in complexity.
 
In this particular case it seems like one of the most effective ways to improve performance is to strip away layers of abstraction.  That's not surprising, but ultimately we don't want to have to work at a lower level of abstraction (or with significantly more complicated abstractions) in order to achieve good performance with lots of cores.
 
So, in my opinion a solution that almost works in this case is a little deceptive, because the full solution might be slower (e.g. use a regex instead of a hand-rolled splitter) or contain a lot more LOCs.  Really smart people can always code something that is going to be really fast.  I'd like to see the beginnings of methods for making it so above-average programmers can do it, and do it reliably.
 
But anyway, that's because of what I want to get out of it.  Others may differ...

 

James Aylett

Jun 11, 2008, 10:49:31 AM
to wide-...@googlegroups.com
On Wed, Jun 11, 2008 at 09:45:00AM -0400, Justin Sheehy wrote:

> Some fairly large businesses depend on web server logs for billing.

Speaking as someone who's been in this situation: the effort involved
in 100% accuracy was never worth it for us. (Also, the effort in
*chasing invoices* wasn't always worth the time, but that's another
story.) YMMV, of course; a lot of it depends on the value of each line.

James

--
/--------------------------------------------------------------------------\
James Aylett xapian.org
ja...@tartarus.org uncertaintydivision.org

Erich

Jun 11, 2008, 11:00:38 AM
to wide-finder

I added in some logic to handle spaces in URLs. Timing for O.10m is
the same as without it.

I'll try another full run at some point to verify, but I'll still be
around 15 minutes. I'm IO bound, getting only approx. 50M/sec :-(
Granted, I'm only using a single thread for IO - not sure what the
7min guy(s) are using :-)

--Erich

Tim Bray

Jun 11, 2008, 11:42:30 AM
to wide-...@googlegroups.com
On Wed, Jun 11, 2008 at 6:45 AM, Justin Sheehy <jus...@iago.org> wrote:

> Looking at the state of things again now, I'm
> not sure
> how trivial the parsing can get and still be correct for the purposes
> of WF2.

I'm pretty convinced that a regex-based approach is probably optimal;
you can handle all these complexities and corner cases with
straightforward regex tweaks. Yeah, it's ugly, but it cordons off the
ugly part of the problem into a constrained and efficient solution.
-Tim


Justin Sheehy

Jun 11, 2008, 12:38:11 PM
to wide-...@googlegroups.com

Sure. I just meant that comparisons with purely space-based tokenizing
might not make sense.

-Justin


Erich

Jun 11, 2008, 12:46:41 PM
to wide-finder

If you've got a regex for me to use, I'll throw it in and
see what happens :-)

Erich

Jun 11, 2008, 3:22:01 PM
to wide-finder

I "borrowed" Ray's regex and plugged it into my java code... Bleah!
It's about twice as slow, whether I'm regex matching huge chunks all
at once, or a line at a time. Is everybody else really using regex
matching for their code?

--Erich

Ray Waldin

Jun 11, 2008, 3:37:26 PM
to wide-...@googlegroups.com
Make sure you're not doing redundant work now, since my regex is designed to already do newline matching, field extraction, hit detection, and GET request filtering.  If your code does this too, that's way too much work.  I wrote up a little posting about my thoughts on this debate and about my regex in general here: http://blog.waldin.net/2008/06/regular-expression-vs-space-delimiting.html

-Ray

Erik Engbrecht

Jun 11, 2008, 4:01:39 PM
to wide-...@googlegroups.com
I haven't done any performance profiling for wf2 yet, but for wf1 I found the trick to optimizing performance basically boiled down to reducing the number of complete scans over the data to some minimum number.
 
The fastest wf1 solutions did a single pass over the raw data coming from the OS I/O routines.  They avoided regex, used raw ASCII, and didn't use normal native strings.
 
Java is a beast, because in idiomatic usage with standard libraries you pass over the data repeatedly.  Load it using a direct byte buffer, copy it to a regular byte buffer, transcode it into a char buffer, split the char buffer into lines, copy the lines into strings, split the strings into fields, do some matching/parsing on the fields.
 
It's amazing the JVM-based implementations are performing as well as they are.
 
In theory I think you could hand write a processor to do it in 1 or 2 total passes that would only be mildly more expensive than the decode step in the idiomatic Java solution.  Alternatively you could use a regex that operates on raw byte arrays.
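A rough sketch of what that byte-level, single-pass style might look like (a hypothetical helper that just counts GET requests by scanning raw ASCII, never decoding to chars or building per-line String objects):

```java
public class ByteScan {
    // Count lines whose request method is GET by scanning raw bytes:
    // find the first '"' on each line and check the four bytes after it.
    static int countGets(byte[] data) {
        int count = 0;
        int i = 0;
        while (i < data.length) {
            int lineEnd = i;
            while (lineEnd < data.length && data[lineEnd] != '\n') lineEnd++;
            for (int j = i; j + 4 < lineEnd; j++) {
                if (data[j] == '"') {
                    if (data[j + 1] == 'G' && data[j + 2] == 'E'
                            && data[j + 3] == 'T' && data[j + 4] == ' ') count++;
                    break; // only inspect the first quoted field per line
                }
            }
            i = lineEnd + 1;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] sample = ("a - - [d] \"GET /x HTTP/1.0\" 200 1\n"
                       + "b - - [d] \"POST /y HTTP/1.0\" 200 1\n").getBytes();
        System.out.println(countGets(sample)); // one GET out of two lines
    }
}
```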

 
--
http://erikengbrecht.blogspot.com/

Erich

Jun 11, 2008, 4:03:41 PM
to wide-finder

I pretty much just ported your matching to my code, so I'm not doing
anything redundant. I tweaked a few things, and it actually runs
about the same speed on my MacBook. Waiting to see, but the regex
version has been much slower on the big box...

Hmf.

Ray Waldin

Jun 11, 2008, 4:09:10 PM
to wide-...@googlegroups.com
Also, the code and regex have been modified quite a bit since this blog posting: http://blog.waldin.net/2008/06/multicore-widefinding.html.  The most recent code and regex are all in my executable jar: http://blog.waldin.net/2008/06/results-so-far.html

-Ray

Erich

Jun 11, 2008, 4:26:37 PM
to wide-finder

Yes, I copied the regex and code out of your jar - thanks :-)

On the big box, O.10m took 110sec with regex, while space splitting is
a nice 40sec. *sigh*

And to keep my sanity, I reran them both on my MacBook - both run in
13sec for O.1m... I don't understand what's going on here...

--Erich

Erich

Jun 11, 2008, 4:30:20 PM
to wide-finder

On Jun 11, 3:01 pm, "Erik Engbrecht" <erik.engbre...@gmail.com> wrote:
> Java is a beast, because in idiomatic usage with standard libraries you pass
> over the data repeatedly. Load it using a direct byte buffer, copy it to a
> regular byte buffer, transcode it into a char buffer, split the char buffer
> into lines, copy the lines into strings, split the strings into fields, do
> some matching/parsing on the fields.
>
> It's amazing the JVM-based implementations are performing as well as they
> are.

Ha! :-) Well, my implementation WAS performing well until my 99.9%
accuracy was called into question ;-)

Oh, and Java's beast factor might be interesting to revisit - to see
if it sucks less if we were doing something with Unicode strings
instead of just bytes... Maybe :-)

--Erich

Erik Engbrecht

Jun 11, 2008, 5:39:02 PM
to wide-...@googlegroups.com
Anyone want to volunteer to write a program that goes through the log files and rewrites a bunch of the stuff to include unicode characters?

I think Java could use some libraries that maintain a similar level of abstraction without all the layers.  For example, splitting a stream into lines is a common enough task that I think you could push it into a special decoder and save a pass.  Ditto for doing some rudimentary field splitting.  You could eliminate two passes that way.  But the decoders don't do that.  They take a ByteBuffer and spit out a CharBuffer.

On the flip side, a lot of the baggage is pretty easy to parallelize.  Of course unicode must be decoded sequentially (bummer!), but there is an opportunity to do everything in a simple pipeline and keep several cores busy.  If cores are cheaper than programmers, then it is a win.  It just looks embarrassing on a benchmark.

James Aylett

Jun 11, 2008, 6:12:20 PM
to wide-...@googlegroups.com
On Wed, Jun 11, 2008 at 05:39:02PM -0400, Erik Engbrecht wrote:

> On the flip side, a lot of the baggage is pretty easy to parallelize. Of
> course unicode must be decoded sequentially (bummer!), but there is an
> opportunity to do everything in a simple pipeline and keep several cores
> busy. If cores are cheaper than programmers, then it is a win. It just
> looks embarrassing on a benchmark.

:)

UTF-8 is re-synchronisable, isn't it? I thought that was one of the
design goals. So a file in UTF-8 can be decoded in parallel without
undue effort.
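Every continuation byte matches 10xxxxxx, so a worker handed an arbitrary byte offset can back up to a character boundary with something like this (a Java sketch, hypothetical helper name):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Split {
    // UTF-8 continuation bytes all match 10xxxxxx, so any byte where
    // (b & 0xC0) != 0x80 begins a character. To split a buffer at an
    // arbitrary offset, back up to the nearest character boundary.
    static int safeBoundary(byte[] buf, int pos) {
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80) pos--;
        return pos;
    }

    public static void main(String[] args) {
        byte[] b = "héllo".getBytes(StandardCharsets.UTF_8); // é encodes as 0xC3 0xA9
        System.out.println(safeBoundary(b, 2)); // offset 2 lands mid-character; backs up to 1
    }
}
```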

Erik Engbrecht

Jun 11, 2008, 6:39:01 PM
to wide-...@googlegroups.com
Awesome!  You're right.

Looks like I'm going to be writing a new UTF8 decoder...



--
http://erikengbrecht.blogspot.com/

Justin Sheehy

Jun 11, 2008, 9:11:00 PM
to wide-...@googlegroups.com
While thinking about the space-splitting issue, I noticed an
interesting line in the log file. There's no spaces-related issue,
but it might throw off a naive attempt at grabbing referers.

proxy1.telecom.co.nz - - [18/Mar/2003:13:18:04 -0800]
"GET /ongoing/When/200x/2003/03/16/XML-Prog HTTP/1.0" 200 11672
"https://146.171.123.211/BM-Login/?\"http://www.tbray.org/ongoing/When/200x/2003/03/16/XML-Prog
\""
"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"

(I added the three \n's)
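A naive "([^"]*)" pattern would truncate that referer at the escaped quote. A sketch of a pattern that also allows backslash escapes (an illustration, not from anyone's actual implementation):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedQuotes {
    // "([^"]*)" stops at the first quote, even an escaped one, so it
    // truncates this referer. Allowing backslash escapes keeps it whole.
    static final Pattern NAIVE = Pattern.compile("\"([^\"]*)\"");
    static final Pattern ESCAPED = Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"");

    public static void main(String[] args) {
        // The referer field from the log line above, abbreviated.
        String field = "\"https://146.171.123.211/BM-Login/?\\\"http://www.tbray.org/ongoing\\\"\"";
        Matcher n = NAIVE.matcher(field);
        if (n.find()) System.out.println(n.group(1)); // truncated at the \" escape
        Matcher e = ESCAPED.matcher(field);
        if (e.find()) System.out.println(e.group(1)); // full referer, escapes intact
    }
}
```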

-Justin


Preston L. Bannister

Jun 12, 2008, 3:47:24 AM
to wide-finder
FYI - the full(?) regex I am using (for Perl) is in:
http://svn.bannister.us/public/wide-finder-1/scripts/report.pl

Note that I am writing for re-use, not for the fastest possible
performance. That's my take on the purpose of Tim's example problem,
somewhat in line with Erik's comments: going general rather than
"simple" just for speed.

