wrecordbody prints trailing CRLF CRLF


siznax

Jun 5, 2009, 8:00:03 PM
to warc-tools
Hello,

wrecordbody seems to print the two blank lines ("\r\n\r\n") following
the warc record body. It shouldn't do that, should it?

Noah

WARC

Jun 7, 2009, 3:11:33 AM
to warc-...@googlegroups.com
Hi Noah,


> wrecordbody seems to print the two blank lines ("\r\n\r\n") following
> the warc record body. It shouldn't do that, should it?

Yes, it should. The spec says:
WARC Record = Header + "\r\n\r\n" + Body + "\r\n\r\n".

So two consecutive records are separated by a double CRLF.
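
As a rough illustration of that layout (a sketch in Python, not warc-tools code; the helper name build_record and the header fields are made up for the example), two records laid end to end look like this:

```python
CRLF = b"\r\n"

def build_record(header_lines, body):
    """Assemble one record as header + CRLF CRLF + body + CRLF CRLF."""
    header = CRLF.join(header_lines)
    return header + CRLF + CRLF + body + CRLF + CRLF

# Illustrative header fields only; a real WARC record carries more of them.
first = build_record([b"WARC/1.0", b"Content-Length: 3"], b"abc")
second = build_record([b"WARC/1.0", b"Content-Length: 2"], b"de")
stream = first + second

# The first body is followed by the double CRLF, and the next
# record's header starts immediately after it.
assert stream.startswith(b"WARC/1.0\r\nContent-Length: 3\r\n\r\nabc\r\n\r\nWARC/1.0")
```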

> Noah

Regards
Younès

unknown

Jun 8, 2009, 3:46:18 PM
to warc-tools, nle...@archive.org
After Steve kindly forwarded my original message for me, I discovered
that the behavior is inconsistent. Sometimes it prints "\r\n\r\n" and
sometimes it doesn't. Here is an example:

http://home.us.archive.org/~nlevitt/tmp/wrecordbody-bug.warc.gz
For me, this prints the extra "\r\n\r\n":
wrecordbody -o 688 -f wrecordbody-bug.warc.gz
but this one doesn't:
wrecordbody -o 1942 -f wrecordbody-bug.warc.gz

Also, it appears to never print the extra "\r\n\r\n" from an
uncompressed warc.

I think it should not print the "\r\n\r\n" because wrecordbody should
extract the body content block, but "\r\n\r\n" is a little piece of
the warc format. It's like printing the </closing-tag> when extracting
the body of an xml tag or something.
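
A minimal sketch of that idea, in Python and not taken from wrecordbody itself: the Content-Length header gives the size of the content block, so a reader can take exactly that many bytes and treat the "\r\n\r\n" after them as framing. The helper name extract_body and the sample record are hypothetical.

```python
TRAILER = b"\r\n\r\n"

def extract_body(record: bytes) -> bytes:
    """Return only the content block of a single record, without the trailer."""
    head, sep, rest = record.partition(TRAILER)
    if not sep:
        raise ValueError("no blank line between header and body")
    length = None
    for line in head.split(b"\r\n")[1:]:  # skip the version line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        raise ValueError("Content-Length header not found")
    if rest[length:length + len(TRAILER)] != TRAILER:
        raise ValueError("record is not followed by CRLF CRLF")
    return rest[:length]

# The body may itself contain "\r\n\r\n" (here, an HTTP response), which is
# why the length comes from the header rather than from searching for it.
body = b"HTTP/1.1 200 OK\r\n\r\nDisallow: /\n"
record = (b"WARC/1.0\r\nContent-Length: " + str(len(body)).encode()
          + TRAILER + body + TRAILER)
print(extract_body(record) == body)  # True
```

Slicing by length rather than splitting on "\r\n\r\n" keeps the trailer out of the output even when the payload contains blank lines of its own.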

Noah
nle...@archive.org

WARC

Jun 16, 2009, 11:08:29 AM
to warc-...@googlegroups.com, nle...@archive.org
Hi Noah !

First of all, thanks for reporting.

This bug was very subtle and hard to debug, but we got rid of it!

Let me try to explain what happens:
The extra CRLF appears only on Linux machines. It was due to the
call to "ftruncate" (line 703 of lib/private/wfile.c),
which is supposed to truncate the double CRLF before the
record's payload is returned.

Unfortunately, for small files (we don't know why this behaviour
doesn't affect medium and large records) the truncation
doesn't take place immediately and is deferred. So you were right: on
Linux, you got this extra CRLF:

$ ./app/wrecordbody -o 688 -f wrecordbody-bug.warc.gz 2>/dev/null | od -c | tail -3
"http://www.archive-it.org/robots.txt" 20090608192954 application/http; msgtype=response 1270
0000720 \n D i s a l l o w : / p u b l
0000740 i c / a d v a n c e d ? \n \r \n \r
0000760 \n

But on OSX, you got this (which is correct):
$ ./app/wrecordbody -o 688 -f wrecordbody-bug.warc.gz 2>/dev/null | od -c | tail -3
0000720 \n D i s a l l o w : / p u b l
0000740 i c / a d v a n c e d ? \n
0000755

To fix the problem, we forced a "disk flush" after the call to
"ftruncate". Problem solved!
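
The patch itself is not shown in this thread, so the following is only a sketch of the truncate-then-flush pattern, written in Python and not in the project's C. The function name body_via_truncate and the flush_before_truncate switch are made up for illustration. The switch demonstrates one plausible cause, which nothing here confirms: data still held in a user-space buffer gets written out after the truncation and puts the trailer back.

```python
import os
import tempfile

TRAILER = b"\r\n\r\n"

def body_via_truncate(data: bytes, flush_before_truncate: bool = True) -> bytes:
    """Write body + trailer to a temp file, cut the trailer off, read it back."""
    with tempfile.TemporaryFile() as f:  # buffered binary file
        f.write(data)                    # small writes may stay in the buffer
        if flush_before_truncate:
            f.flush()                    # push buffered bytes to the OS first
        os.ftruncate(f.fileno(), len(data) - len(TRAILER))
        f.flush()                        # write out anything still buffered
        os.fsync(f.fileno())             # force the file state to disk
        f.seek(0)
        return f.read()

payload = b"Disallow: /public/advanced?\n"
print(body_via_truncate(payload + TRAILER) == payload)  # True
```

With flush_before_truncate=False and a payload this small, CPython writes the buffered bytes only after the truncation, so the trailer comes back. That would be consistent with the bug affecting only small files, but it remains a guess about the original C code.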

To summarize, this bug only affects Linux. On other OSes (e.g.
OSX) the previous code works well.
Anyway, it's fixed now. Thanks again for reporting.

N.B.: the "warc-tools" subversion repo is up to date; please check out
the latest version.

Regards
Younès

On Jun 8, 2009, at 21:46, unknown wrote: