> I think this is where I have to disagree... In my opinion, when
> parsing 42G of data, if you miss a few lines here and there, no big
> loss, as you're really looking for big patterns, not 100.0000%
> accuracy. 99.9% accuracy would seem fine for log rolling. If this were banking
> data, not so much. :-)
There are plenty of companies where the two aren't that far apart.
If your business is the web, your server logs are (part of) your business records.
> And okay, I don't want to rewrite my parser - I want to spend time
> figuring out my IO/memory bottlenecks, because those are the more
> interesting part of this project :-)
Of course, you might have different bottlenecks if you had to have a
more interesting parser...
But I'm just being contrary.
>> If your business is the web, your server logs are (part of) your business records.
> We're on the web, and our server logs are rolled every 30 days, never
> to be seen again :-) But yes, other businesses may be different.
Some fairly large businesses depend on web server logs for billing.
Of course, they tend to pre-process them heavily and stuff the content
of the logs into databases before trying to extract this kind of information.
>> Of course, you might have different bottlenecks if you had to have a
>> more interesting parser...
> What differences in processing time have you seen, if you've had both
> a space-delimited and "more interesting" parser?
I haven't done a comparison, since I've made tweaks all along when
these kinds of things have come up. And since I haven't had much time
to work on WF2, I haven't taken slots away from more active participants
to do anything interesting on the full dataset.
My first implementation used Erlang binaries and did simple byte-matching,
but that looked ugly. Looking at the state of things again now, I'm not sure
how trivial the parsing can get and still be correct for the purposes
of WF2.
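For what the trivial end of that spectrum looks like: a plain whitespace split already recovers the interesting fields of the sample line quoted later in this thread, as long as nothing before them contains a space. A hypothetical Python sketch (not the Erlang code under discussion):

```python
# Hypothetical sketch of the "trivial" parser: split on whitespace and
# trust the field positions. parts[6] is the request path and parts[8]
# the status code *only because* nothing before them contains a space;
# a quoted field with an embedded space is exactly where this breaks.
def parse_trivial(line):
    parts = line.split()
    return {"host": parts[0], "path": parts[6], "status": int(parts[8])}

line = ('proxy1.telecom.co.nz - - [18/Mar/2003:13:18:04 -0800] '
        '"GET /ongoing/When/200x/2003/03/16/XML-Prog HTTP/1.0" '
        '200 11672 "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"')
print(parse_trivial(line)["path"])  # /ongoing/When/200x/2003/03/16/XML-Prog
```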
> Some fairly large businesses depend on web server logs for billing.
Speaking as someone who's been in this situation: the effort involved
in 100% accuracy was never worth it for us. (Also, the effort in
*chasing invoices* wasn't always worth the time, but that's another
story.) YMMV, of course; a lot of it depends on the value of each line.
> Looking at the state of things again now, I'm
> not sure
> how trivial the parsing can get and still be correct for the purposes
> of WF2.
I'm pretty convinced that a regex-based approach is probably optimal,
since you can handle all these complexities and corner cases with
straightforward regex tweaks. Yeah, it's ugly, but it cordons off the
ugly part of the problem into a constrained and efficient solution.
Sure. I just meant that comparisons with purely space-based tokenizing
might not make sense.
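As a concrete illustration of the "cordoned-off ugliness", here is a hypothetical Python sketch (not WF2's actual parser): one pattern for common/combined log lines, with the referer made optional so that lines omitting it, like the sample quoted later in the thread, still match.

```python
import re

# Hypothetical sketch: one regex for Apache common/combined log lines.
# Field names are my own. The referer group is optional so that lines
# with only a trailing user-agent field still match.
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'(?:"(?P<referer>[^"]*)" )?'
    r'"(?P<agent>[^"]*)"$'
)

line = ('proxy1.telecom.co.nz - - [18/Mar/2003:13:18:04 -0800] '
        '"GET /ongoing/When/200x/2003/03/16/XML-Prog HTTP/1.0" '
        '200 11672 "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"')
m = LOG_RE.match(line)
print(m.group('path'), m.group('status'))
# /ongoing/When/200x/2003/03/16/XML-Prog 200
```

Corner cases (odd user agents, missing fields) then become local tweaks to one pattern rather than changes scattered through a hand-rolled tokenizer.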
> On the flip side, a lot of the baggage is pretty easy to parallelize. Of
> course Unicode must be decoded sequentially (bummer!) but there is an
> opportunity to do everything in a simple pipeline and keep several cores
> busy. If cores are cheaper than programmers then it is a win. It just
> looks embarrassing on a benchmark.
UTF-8 is re-synchronisable, isn't it? I thought that was one of the
design goals. So a file in UTF-8 can be decoded in parallel without
a sequential pass: continuation bytes always have the form 10xxxxxx,
so from any byte offset you can skip forward to the next character
boundary and start decoding there.
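Right: UTF-8 continuation bytes are self-identifying, which is what makes resynchronisation work. A minimal Python sketch, assuming the input is valid UTF-8:

```python
# UTF-8 continuation bytes all match 0b10xxxxxx, so from an arbitrary
# byte offset you skip at most three bytes forward to reach a character
# boundary -- the basis for splitting a file into independently
# decodable chunks. Minimal sketch; assumes valid UTF-8 input.
def next_char_boundary(buf: bytes, pos: int) -> int:
    """Return the smallest offset >= pos that starts a UTF-8 character."""
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

data = "naïve café".encode("utf-8")
# Offset 3 lands inside the two-byte sequence for 'ï'; resync skips it.
print(next_char_boundary(data, 3))  # 4
```

A parallel decoder would carve the file into roughly equal chunks, nudge each chunk start forward to a boundary like this, and decode the chunks independently.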
proxy1.telecom.co.nz - - [18/Mar/2003:13:18:04 -0800]
"GET /ongoing/When/200x/2003/03/16/XML-Prog HTTP/1.0" 200 11672
"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
(I added the three \n's)