Could you take a look at issue 72 (http://code.google.com/p/warc-tools/issues/detail?id=72)?
I tried to convert a 5.5GB ARC file to WARC, and got the following error:
Command:
../warc-tools-read-only/app/arc2warc -a six.arc.gz -f siz.warc
Output:
> debug: lib/private/wfile.c :1971:"couldn't add record to the warc file, maximum size reached"
Is this a WARC limit or a libwarc limit? What is the maximum size? The WARC file seems to have been truncated at about 1.5GB. Your advice appreciated.
Thanks,
Gordon
You're the second person to ask about that!
Currently we use a warc_u32_t (i.e. a 32-bit unsigned int, maximum
4,294,967,295, so about 4 GB) to hold the size of a WARC file.
We thought it was a good strategy to nudge users away from very big
WARCs. Something between 100 MB and 600 MB (or even 1 GB) is a good
choice in my opinion (it minimizes the risk of data loss, keeps data
copying fast, ...).
This is what I.A., Hanzo and others use in general.
But as I see, you have to deal with ARC files bigger than that.
> Command:
> ../warc-tools-read-only/app/arc2warc -a six.arc.gz -f siz.warc
>
> Output:
>> debug: lib/private/wfile.c :1971:"couldn't add record to the warc
>> file, maximum size reached"
>
That's normal behaviour with the current settings, Gordon.
In the file "app/arc2warc.c", you can find a 32-bit integer constant
called:
#define WARC_MAX_SIZE 1629145600
This is your current limit. You can increase this value to at most
4 GB minus one byte (the largest value a 32-bit unsigned integer can
hold) and try again:
#define WARC_MAX_SIZE 4294967295U
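A 32-bit unsigned integer tops out at 4,294,967,295, so 4294967296 (exactly 2^32) is one more than it can hold. A minimal sketch of that boundary follows. It assumes the standard uint32_t as a stand-in for warc_u32_t, and both helpers are hypothetical, not part of libwarc.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if size can be stored in a 32-bit
 * unsigned integer without losing bits, 0 otherwise. */
static int fits_in_u32 (uint64_t size)
{
  return size <= UINT32_MAX;
}

/* Storing a 64-bit value in 32 bits keeps only the low 32 bits,
 * i.e. the value modulo 2^32. */
static uint32_t store_in_u32 (uint64_t size)
{
  return (uint32_t) size;
}
```

With these, fits_in_u32 accepts 1629145600 and 4294967295 but rejects 4294967296, which store_in_u32 would silently reduce to 0.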
>
> Is this a WARC limit or a libwarc limit? What is the maximum size?
> The warc file seems to have been truncated at about 1.5GB. Your
> advice appreciated.
>
With a limit of "1629145600", your WARC file appears to be truncated at
1.5GB (1629145600 bytes). In reality, the WARC tools library is just
telling you that this WARC file has reached its size limit. So open a
new one and, if you want, continue your conversion from where you left
off.
At the end, you can concatenate them all to obtain your final WARC
file. Name the parts so that the pattern does not also match the
output file (e.g. siz-1.warc, siz-2.warc, ...):
cat siz-*.warc > siz.warc
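The "open a new file and continue" approach can be sketched as below. This only illustrates the rollover logic and is not code from warc-tools: assign_segments and its parameters are hypothetical names, and it assumes the size of each record is known before it is written.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: given the sizes of the records to write, in order,
 * compute which output segment (0, 1, 2, ...) each record lands in when
 * no segment may exceed max_size bytes.
 * Returns the number of segments used, or 0 if there are no records or
 * if a single record is larger than max_size. */
static size_t assign_segments (const uint64_t *sizes, size_t n,
                               uint64_t max_size, size_t *segment_out)
{
  size_t   seg  = 0;   /* index of the segment currently being filled */
  uint64_t used = 0;   /* bytes already placed in that segment */
  size_t   i;

  for (i = 0; i < n; ++i)
    {
      if (sizes[i] > max_size)
        return 0;                      /* this record can never fit */

      if (sizes[i] > max_size - used)  /* would exceed the current segment */
        {
          ++seg;                       /* roll over to the next segment */
          used = 0;
        }

      used += sizes[i];
      segment_out[i] = seg;
    }

  return n ? seg + 1 : 0;
}
```

Each segment index would then map to its own output file, and the resulting files can be concatenated once the conversion is complete.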
Anyway, a change will be made to remove the 32-bit limit asap. We'll
use a warc_u64_t instead (i.e. a 64-bit unsigned integer, maximum
18,446,744,073,709,551,615, about 18.45 exabytes).
Hope this helps you understand the "warc-tools" behaviour regarding
your problem.
N.B.: in your example, you didn't compress the resulting WARC. In this
case, compression may help to save space, Gordon!
Regards
Younès
As you may have noticed, support for big (more than 4 GB) WARC files
was requested.
"warc-tools" was updated today to handle that.
All the "C" commands now default to a 16 GB maximum size. You can
increase this value to whatever you want by changing:
#ifndef WARC_MAX_SIZE
/* 16 GB by default */
#define WARC_MAX_SIZE 17179869184ULL
#endif
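Because of the #ifndef guard, the default only applies when WARC_MAX_SIZE has not been defined earlier. Presumably it could also be set on the compiler command line with a -D option, but that is an assumption and not something stated in this announcement. The sketch below repeats the guard and shows where 17179869184 comes from. The helper sixteen_gb is hypothetical, and uint64_t stands in for warc_u64_t.

```c
#include <stdint.h>

#ifndef WARC_MAX_SIZE
/* 16 GB by default; used only if nothing defined WARC_MAX_SIZE earlier */
#define WARC_MAX_SIZE 17179869184ULL
#endif

/* 16 * 2^30 bytes. The ULL suffix on the first operand forces the whole
 * product to be computed as unsigned long long (at least 64 bits). */
static uint64_t sixteen_gb (void)
{
  return 16ULL * 1024 * 1024 * 1024;
}
```

Without the ULL suffix, the product 16 * 1024 * 1024 * 1024 would be computed as a plain int and overflow on platforms where int is 32 bits.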
Please test and report any bugs or strange behaviour!
NOTE: for C developers only
=======================
When creating a WARC file object, you need to be careful to either
cast the maximum file size to (warc_u64_t) or use a constant value
with the "ULL" suffix.
For example, the following constructor calls are equivalent:
#define WARC_MAX_SIZE 200ULL
w = bless (WFile, fname, WARC_MAX_SIZE, WARC_FILE_WRITER, cmode, wdir);
w = bless (WFile, fname, 200ULL, WARC_FILE_WRITER, cmode, wdir);
w = bless (WFile, fname, (warc_u64_t) 200, WARC_FILE_WRITER, cmode, wdir);
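One plausible reason the cast or suffix is required (an assumption, since this note does not say how bless is declared): if the constructor receives its arguments through a C variadic parameter list, the compiler cannot widen a plain int such as 200 to 64 bits on its own, and reading it back as a 64-bit value is undefined behaviour. The sketch below uses a hypothetical variadic function, make_writer, with a local u64_t typedef standing in for warc_u64_t.

```c
#include <stdarg.h>

/* Stand-in for warc_u64_t (assumption: the real typedef is in the
 * warc-tools headers). */
typedef unsigned long long u64_t;

/* Hypothetical variadic "constructor": after the file name it expects
 * exactly one u64_t argument (the maximum size) and returns it. */
static u64_t make_writer (const char *fname, ...)
{
  va_list args;
  u64_t   max_size;

  (void) fname;                       /* unused in this sketch */

  va_start (args, fname);
  max_size = va_arg (args, u64_t);    /* the caller must pass a 64-bit value */
  va_end (args);

  return max_size;
}
```

Calling make_writer ("a.warc", 200ULL) or make_writer ("a.warc", (u64_t) 200) passes a value of exactly the type that va_arg reads. A bare 200 would be passed as an int, which is what the note warns against.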
NOTE: for Python/Ruby developers
===========================
Nothing changes for you, as these languages are not affected by this
change.
Regards
Younès
> Thanks — I was actually only trying to make a large WARC file so I
> could run some tests to see how fast the library was
I see. If speed matters (it does in general), you can get some
help from GCC itself. Edit
the "makefile" and do the following:
* Comment out the line:
DFLAG = -g
* Uncomment the line:
#CFLAGS_SPEED = -O3 -pipe
* Depending on your architecture (32 or 64 bits), uncomment one of the
2 directives:
# on 32bits machines, uncomment this too
#CFLAGS_SPEED_ARCH = -march=i686
# for 64bits machines uncomment this too
#CFLAGS_SPEED_ARCH = -march=x86-64
Re-build everything. You'll get bigger executables and libraries but
they'll run at max speed ;-)
Moreover, you can play with the GZIP compression/decompression buffers
too. Have a look at
the constants which control them in the makefile (but be careful when
choosing these buffer sizes).
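To compare a debug build against an optimized one, a small timing harness is enough. The sketch below uses only the standard C clock() function, so it measures CPU time rather than wall-clock time. cpu_seconds and busy_loop are hypothetical names, and busy_loop is a placeholder for whatever conversion routine you actually want to measure.

```c
#include <time.h>

/* Run work(arg) once and return the CPU time it consumed, in seconds. */
static double cpu_seconds (void (*work) (void *), void *arg)
{
  clock_t start = clock ();
  clock_t end;

  work (arg);
  end = clock ();

  return (double) (end - start) / CLOCKS_PER_SEC;
}

/* Placeholder workload: counts up to *(unsigned long *) arg.
 * The volatile accumulator keeps the loop from being optimized away. */
static void busy_loop (void *arg)
{
  unsigned long          n = *(unsigned long *) arg;
  volatile unsigned long sum = 0;
  unsigned long          i;

  for (i = 0; i < n; ++i)
    sum += i;
}
```

Because clock() reports CPU time, a conversion that spends most of its time waiting on disk I/O will look faster here than it feels in practice.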
> , and whether memory use grows over time.
From the start, we used Valgrind to track memory leaks in our
development process.
It always showed 0 errors/leaks. We believe that the C code is pretty
safe and stable.
I hope that your extensive tests and benchmarks will confirm that fact.
> With small WARCs it is too quick to time in any
> reasonable way. (That large ARC was actually made by concatenating
> many ARCs together.)
> I'd be inclined to go with a much larger file size limit, with a 1.5GB
> limit there's a risk that if you're writing a (say) 2GB data file into
> WARC you'll run into problems. Sure you can code around them, but that
> is inconvenient, and the warc-tools are meant to make this type of
> thing simple.
Totally agree.
Hope this helps.
Regards,
Younès
WarcTools has been updated with the following major features:
* Support for large WARC files, up to 18 exabytes in size.
* Support for large WARC records, up to 18 exabytes each.
* Native Java wrapper with JNA (see contrib/java/install)
* JHove integration (see contrib/jhove/install)
* Proxy mode for browsing
* Many improvements ...
Please report any bug.
Regards
Younès
This sounds interesting.
What does it do? Validate every object inside the WARC file?
ciao
--
<raff...@atomotic.com> [0x39C336BC]
> On Nov 17, 2008, at 4:10 AM, WARC wrote:
>> * JHove integration (see contrib/jhove/install)
>
> this sounds interesting.
> what it does? validate every object inside the warc file?
Basic validation. It does exactly what the C "app/warcvalidator" command
does.
It's up to you to modify the JHOVE code for more advanced validation.
Regards
Younès