Hi Noah!
First of all, thanks for reporting.
This bug was very subtle and hard to debug, but we got rid of it!
Let me try to explain what happened:
The extra CRLF appears only on Linux machines. It was caused by the
call to "ftruncate" (line 703 of lib/private/wfile.c),
which is supposed to truncate the double CRLF before the record's
payload is returned.
But unfortunately, for small files (we don't know why this behaviour
doesn't affect medium and large records), the truncation
doesn't take place immediately and is deferred. So you were right: on
Linux, you got this extra CRLF:
$ ./app/wrecordbody -o 688 -f wrecordbody-bug.warc.gz 2>/dev/null | od -c | tail -3
"
http://www.archive-it.org/robots.txt" 20090608192954 application/
http; msgtype=response 1270
0000720 \n D i s a l l o w : / p u b l
0000740 i c / a d v a n c e d ? \n \r \n \r
0000760 \n
But on OSX, you get this (which is correct):
$ ./app/wrecordbody -o 688 -f wrecordbody-bug.warc.gz 2>/dev/null | od -c | tail -3
0000720 \n D i s a l l o w : / p u b l
0000740 i c / a d v a n c e d ? \n
0000755
To fix the problem, we forced a "disk flush" right after the call to
"ftruncate". Problem solved!
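A minimal sketch of the idea, in C. This is not the code from
lib/private/wfile.c: the helper name "drop_tail" is hypothetical, and
it assumes the file is accessed through a stdio FILE* and that the
"disk flush" is done with fflush() plus fsync(), which may differ from
the actual fix.

```c
#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not the actual warc-tools code): drop the last
 * `nbytes` bytes of an open stream and make sure the change has been
 * flushed before anyone reads the file back.
 * Returns 0 on success, -1 on error. */
static int drop_tail(FILE *fp, off_t nbytes)
{
    struct stat st;
    int fd;

    /* Push data still sitting in the stdio buffer down to the kernel,
     * otherwise it could be written out after the truncation. */
    if (fflush(fp) != 0)
        return -1;

    fd = fileno(fp);
    if (fd < 0 || fstat(fd, &st) != 0 || st.st_size < nbytes)
        return -1;

    /* Cut the trailing bytes (e.g. 4 for "\r\n\r\n"). */
    if (ftruncate(fd, st.st_size - nbytes) != 0)
        return -1;

    /* The forced "disk flush"; fsync() is an assumption here. */
    if (fsync(fd) != 0)
        return -1;

    /* Rewind so the caller can read the payload from the start. */
    return fseek(fp, 0L, SEEK_SET);
}
```

With a payload followed by a double CRLF, calling drop_tail(fp, 4)
should leave only the payload in the file.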
To summarize, this bug only affects Linux. On other OSes (e.g.
OSX) the previous code works fine.
Anyway, it's fixed now. Thanks again for reporting.
N.B.: the "warc-tools" subversion repo is up to date, so please check
out the latest version.
Regards
Younès
On 8 June 09 at 21:46, unknown wrote: