File size limits


Gordon Paynter

Nov 6, 2008, 10:44:31 PM
to warc-...@googlegroups.com
Hi Younes:

Could you take a look at issue 72 (http://code.google.com/p/warc-tools/issues/detail?id=72)?

I tried to convert a 5.5GB ARC file to WARC, and got the following error:


Command:
../warc-tools-read-only/app/arc2warc -a six.arc.gz -f siz.warc

Output:
> debug: lib/private/wfile.c :1971:"couldn't add record to the warc file, maximum size reached"


Is this a WARC limit or a libwarc limit? What is the maximum size? The WARC file seems to have been truncated at about 1.5GB. Your advice would be appreciated.

Thanks,
Gordon

WARC

Nov 7, 2008, 4:33:06 AM
to warc-...@googlegroups.com
Hi Gordon,

You're the second person to ask about this!

Actually, we're using a warc_u32_t (i.e. a 32-bit unsigned int, with a
maximum of 4,294,967,295, so just under 4 GB) to hold the WARC file size.
We thought it was a good strategy to nudge you away from creating big
WARCs. Something between
100 MB and 600 MB (or even 1 GB) is a good choice in my opinion
(it minimizes the risk of data loss, keeps data copying fast, ...).
This is what I.A., Hanzo and others generally use.

But as I see, you have to deal with ARC files bigger than that.

> Command:
> ../warc-tools-read-only/app/arc2warc -a six.arc.gz -f siz.warc
>
> Output:
>> debug: lib/private/wfile.c :1971:"couldn't add record to the warc
>> file, maximum size reached"
>

That's normal behaviour with the current settings, Gordon.
In the file "app/arc2warc.c", you can find a 32-bit integer constant
called:

#define WARC_MAX_SIZE 1629145600

This is your current limit. You can increase this value up to the
32-bit maximum (just under 4 GB) and try again:

#define WARC_MAX_SIZE 4294967295
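
To make the 32-bit ceiling concrete, here is a minimal C sketch. It is not taken from warc-tools, and both helper names are hypothetical. It shows that the largest value an unsigned 32-bit integer can hold is 4294967295, and that 4294967296 wraps around to 0 when stored in one.

```c
#include <stdint.h>

/* Hypothetical helper, not part of warc-tools: returns 1 when `size`
   can be stored in an unsigned 32-bit integer without loss. */
int fits_in_u32(unsigned long long size)
{
    return size <= UINT32_MAX;
}

/* Hypothetical helper: stores `size` in 32 bits. Values above
   UINT32_MAX wrap modulo 2^32, as C defines for unsigned types. */
uint32_t store_as_u32(unsigned long long size)
{
    return (uint32_t) size;
}
```

So any limit meant to be held in a 32-bit variable should stay at or below 4294967295.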

>
> Is this a WARC limit or a libwarc limit? What is the maximum size?
> The warc file seems to have been truncated at about 1.5GB. Your
> advice appreciated.
>

With a limit of "1629145600", your WARC file appears to be truncated at
about 1.5GB (=1629145600 bytes). In reality, the WARC tools library is
just telling you that this WARC file has reached its size limit. So
open a new one and, if you want, continue your conversion from where you
left off.
At the end, you can concatenate them all to obtain your final WARC
file with:

cat siz*.warc > combined.warc
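
As a rough planning aid for this split-and-concatenate approach, the sketch below computes the minimum number of output files needed for a given amount of data and per-file limit. `parts_needed` is a hypothetical helper, not part of warc-tools, and it assumes both sizes are in bytes. Since a record is presumably not split across files, the real count may be higher.

```c
/* Hypothetical helper: minimum number of output files needed to hold
   `total_bytes` when each file is capped at `max_bytes`.
   Returns 0 when max_bytes is 0, since no valid split exists. */
unsigned long long parts_needed(unsigned long long total_bytes,
                                unsigned long long max_bytes)
{
    if (max_bytes == 0)
        return 0;
    /* Ceiling division, written so the addition cannot overflow. */
    return total_bytes / max_bytes + (total_bytes % max_bytes != 0);
}
```

For instance, 5.5 GiB of output (5905580032 bytes) with the 1629145600-byte limit needs at least 4 files.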


Anyway, a change will be made to fix the 32-bit limit asap. We'll
use instead a
warc_u64_t (i.e. a 64-bit unsigned integer, maximum
18,446,744,073,709,551,615, or about 18.45 exabytes).

I hope this helps you understand the "warc-tools" behaviour regarding
your problem.

N.B.: in your example, you didn't compress the resulting WARC. In this
case, compression may help to save space, Gordon!

Regards
Younès

WARC

Nov 7, 2008, 3:25:02 PM
to warc-...@googlegroups.com
Hi List,

As you may have noticed, support for big (more than 4 GB) WARC files
was requested.

"warc-tools" was updated today to handle that.
All the "C" commands now default to a 16 GB maximum size. You can
increase this value to whatever you want by changing:

#ifndef WARC_MAX_SIZE
/* 16 GB by default */
#define WARC_MAX_SIZE 17179869184ULL
#endif
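
Because of the `#ifndef` guard, the default can also be overridden at build time without editing the source, for example by passing `-DWARC_MAX_SIZE=...` to the compiler (a standard GCC option; whether the warc-tools makefile exposes a variable for this is not stated in this thread). Below is a minimal sketch of the pattern, with two hypothetical functions added for illustration.

```c
#include <stdint.h>

/* Same guard pattern as above: keep a value supplied by the build,
   otherwise fall back to 17179869184 bytes (16 * 1024^3). */
#ifndef WARC_MAX_SIZE
#define WARC_MAX_SIZE 17179869184ULL
#endif

/* Hypothetical accessor: returns the limit as a 64-bit value. */
uint64_t warc_max_size(void)
{
    return (uint64_t) WARC_MAX_SIZE;
}

/* Hypothetical check: 1 when the limit does not fit in 32 bits. */
int warc_max_size_needs_64_bits(void)
{
    return warc_max_size() > UINT32_MAX;
}
```

With the default value, the limit no longer fits in a 32-bit integer, which is why the 64-bit type and the "ULL" suffix matter.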

Please test and report any bugs or strange behaviour!

NOTE: for C developers only
=======================
When creating a WARC file object, you need to be careful to cast the
maximum file size to
(warc_u64_t) or use a constant value with the "ULL" suffix.

For example, the following constructor calls are equivalent:

#define WARC_MAX_SIZE 200ULL

w = bless (WFile, fname, WARC_MAX_SIZE, WARC_FILE_WRITER, cmode, wdir);
w = bless (WFile, fname, 200ULL, WARC_FILE_WRITER, cmode, wdir);
w = bless (WFile, fname, (warc_u64_t) 200, WARC_FILE_WRITER, cmode, wdir);
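
One likely reason the cast matters: if the size travels through a variadic argument list (an assumption here, since this thread does not show how `bless` is declared), C does not convert the argument to the type the callee expects. A plain `200` would be passed as an `int` while the callee reads a 64-bit value, which is undefined behaviour. The sketch below uses a hypothetical variadic function, `first_size`, to illustrate the safe forms.

```c
#include <stdarg.h>

/* Hypothetical variadic function standing in for a constructor that
   reads a 64-bit size from its argument list. */
unsigned long long first_size(int tag, ...)
{
    va_list ap;
    unsigned long long size;

    (void) tag; /* only used to anchor the variable argument list */

    va_start(ap, tag);
    /* The caller must have passed an unsigned long long here;
       passing a plain int such as 200 would be undefined behaviour. */
    size = va_arg(ap, unsigned long long);
    va_end(ap);
    return size;
}
```

Calling `first_size(0, 200ULL)` or `first_size(0, (unsigned long long) 200)` is well defined; `first_size(0, 200)` is not.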


NOTE: for Python/Ruby developers
===========================
Nothing changes for you, as these languages are not affected by this
change.


Regards
Younès

WARC

Nov 7, 2008, 5:40:49 PM
to warc-...@googlegroups.com, Andras A Benczur
Hi List,

As you may have noticed, support for big (more than 4 GB) WARC files
was requested.

"warc-tools" was updated today to handle that.

All the "C" commands now default to a 16 GB maximum size. You can
increase this value to whatever you want by changing:

#ifndef WARC_MAX_SIZE
/* 16 GB by default */
#define WARC_MAX_SIZE 17179869184ULL
#endif

Please test and report any bugs or strange behaviour!

NOTE: for C developers only
=======================
When creating a WARC file object, you need to be careful to cast the
maximum file size to
(warc_u64_t) or use a constant value with the "ULL" suffix.

For example, the following constructor calls are equivalent:

#define WARC_MAX_SIZE 200ULL

w = bless (WFile, fname, WARC_MAX_SIZE, WARC_FILE_WRITER, cmode, wdir);
w = bless (WFile, fname, 200ULL, WARC_FILE_WRITER, cmode, wdir);
w = bless (WFile, fname, (warc_u64_t) 200, WARC_FILE_WRITER,

Gordon Paynter

Nov 11, 2008, 3:45:07 PM
to warc-...@googlegroups.com
Hi Younes:

Thanks. I was actually only trying to make a large WARC file so I
could run some tests to see how fast the library is, and whether memory
use grows over time. With small WARCs it is too quick to time in any
reasonable way. (That large ARC was actually made by concatenating many
ARCs together.)

I'd be inclined to go with a much larger file size limit. With a 1.5GB
limit there's a risk that if you're writing a (say) 2GB data file into
a WARC you'll run into problems. Sure, you can code around them, but that
is inconvenient, and the warc-tools are meant to make this type of thing
simple.

Gordon




>>> WARC <voidp...@gmail.com> 07/11/08 10:33 p.m. >>>

WARC

Nov 11, 2008, 4:41:39 PM
to warc-...@googlegroups.com
Hi Gordon,

> Thanks — I was actually only trying to make a large WARC file so I
> could run some tests to see how fast the library was

I see. If speed matters (it generally does), you can get some
help from GCC itself. Edit
the "makefile" and do the following:

* Comment out the line:
DFLAG = -g

* Uncomment the line:
#CFLAGS_SPEED = -O3 -pipe

* Depending on your architecture (32 or 64 bits), uncomment one of
these 2 directives:
# on 32bits machines, uncomment this too
#CFLAGS_SPEED_ARCH = -march=i686
# for 64bits machines uncomment this too
#CFLAGS_SPEED_ARCH = -march=x86-64

Rebuild everything. You'll get bigger executables and libraries, but
they'll run at maximum speed ;-)
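
To compare a debug build against an optimized one, a small timing harness can help. The sketch below is only an illustration: `cpu_seconds` and `sum_below` are hypothetical names, not warc-tools functions. It measures CPU time with the standard `clock()` call around an arbitrary workload.

```c
#include <time.h>

/* Hypothetical helper: runs `work(arg)` once and returns the CPU time
   it consumed, in seconds, as reported by the standard clock(). */
double cpu_seconds(void (*work)(void *), void *arg)
{
    clock_t start = clock();
    work(arg);
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}

/* Example workload: sums the integers below *(unsigned long *) arg.
   `volatile` keeps the optimizer from removing the loop entirely. */
void sum_below(void *arg)
{
    unsigned long n = *(unsigned long *) arg;
    volatile unsigned long total = 0;
    unsigned long i;

    for (i = 0; i < n; i++)
        total += i;
}
```

In a real benchmark the workload would be a call into the library, for example converting one ARC file, repeated several times so that short runs become measurable.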

Moreover, you can play with the GZIP compression/decompression buffers
too. Have a look at
the constants which control them in the makefile (but be careful when
choosing the buffer sizes).

> , and whether memory use grows over time.


From the start, we used Valgrind to track memory leaks in our
development process.
It always showed 0 errors/leaks. We believe that the C code is pretty
safe and stable.
I hope that your extensive tests and benchmarks will confirm that fact.


> With small WARCs it is too quick to time in any
> reasonable way. (That large ARC was actually made by concatenating
> many
> ARCs together.)

> I'd be inclined to go with a much larger file size limit, with a 1.5GB
> limit there's a risk that if you're writing a (say) 2GB data file into
> WARC you'll run into problems. Sure you can code around them, but that
> is inconvenient, and the warc-tools are meant to make this type of
> thing
> simple.

Totally agree.

Hope this helps.

Regards,
Younès

WARC

Nov 16, 2008, 10:10:53 PM
to warc-...@googlegroups.com
Hi List,

WarcTools has been updated with the following major features:

* Support for large WARC files, up to 18 exabytes in size.
* Support for large WARC records, up to 18 exabytes each.
* Native Java wrapper with JNA (see contrib/java/install)
* JHove integration (see contrib/jhove/install)
* Proxy mode for browsing
* Many improvements ...

Please report any bug.

Regards
Younès

raffaele messuti

Nov 17, 2008, 6:41:37 AM
to warc-...@googlegroups.com

On Nov 17, 2008, at 4:10 AM, WARC wrote:
> * JHove integration (see contrib/jhove/install)

This sounds interesting.
What does it do? Does it validate every object inside the WARC file?


ciao


--
<raff...@atomotic.com> [0x39C336BC]


WARC

Nov 17, 2008, 8:13:46 AM
to warc-...@googlegroups.com
Hi Raffaele,

> On Nov 17, 2008, at 4:10 AM, WARC wrote:
>> * JHove integration (see contrib/jhove/install)
>
> This sounds interesting.
> What does it do? Does it validate every object inside the WARC file?


Basic validation. It does exactly what the C "app/warcvalidator" command
does.
It's up to you to modify the JHOVE code for more advanced validation.

Regards
Younès

Bjarne Andersen

Nov 25, 2008, 5:03:29 PM
to warc-...@googlegroups.com
I'm having some trouble running the warc-browser in proxy mode:

twistd -ny proxy.tac
Traceback (most recent call last):
  File "/usr/local/bin/twistd", line 20, in <module>
    from twisted.scripts.twistd import run
  File "/usr/local/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 11, in <module>
    from twisted.application import app
  File "/usr/local/lib/python2.5/site-packages/twisted/application/app.py", line 8, in <module>
    from twisted.persisted import sob
  File "/usr/local/lib/python2.5/site-packages/twisted/persisted/sob.py", line 23, in <module>
    from zope.interface import implements, Interface
ImportError: No module named zope.interface

Any ideas?

best
Bjarne Andersen

Bjarne Andersen

Nov 25, 2008, 5:15:56 PM
to warc-...@googlegroups.com
I got a little further by installing Zope 3.3.1 with: python install.py install
(The configure script in the Zope 3.3.1 distribution complained that it only runs with Python 2.4.3 (and some other 2.4 versions), not with my 2.5.2. But python install.py install finished without any problems.)

Starting up the warc-browser in proxy mode now gives another error:

 twistd -ny proxy.tac
/usr/local/lib/python2.5/site-packages/twisted/internet/address.py:98: ComponentsDeprecationWarning: components.backwardsCompatImplements doesn't do anything in Twisted 2.3, stop calling it.
  components.backwardsCompatImplements(IPv4Address)
/usr/local/lib/python2.5/site-packages/twisted/internet/address.py:99: ComponentsDeprecationWarning: components.backwardsCompatImplements doesn't do anything in Twisted 2.3, stop calling it.
  components.backwardsCompatImplements(UNIXAddress)

Traceback (most recent call last):
  File "/usr/local/lib/python2.5/site-packages/twisted/application/app.py", line 614, in run
    runApp(config)
  File "/usr/local/lib/python2.5/site-packages/twisted/scripts/twistd.py", line 23, in runApp
    _SomeApplicationRunner(config).run()
  File "/usr/local/lib/python2.5/site-packages/twisted/application/app.py", line 330, in run
    self.application = self.createOrGetApplication()
  File "/usr/local/lib/python2.5/site-packages/twisted/application/app.py", line 416, in createOrGetApplication
    application = getApplication(self.config, passphrase)
--- <exception caught here> ---
  File "/usr/local/lib/python2.5/site-packages/twisted/application/app.py", line 427, in getApplication
    application = service.loadApplication(filename, style, passphrase)
  File "/usr/local/lib/python2.5/site-packages/twisted/application/service.py", line 368, in loadApplication
    application = sob.loadValueFromFile(filename, 'application', passphrase)
  File "/usr/local/lib/python2.5/site-packages/twisted/persisted/sob.py", line 214, in loadValueFromFile
    exec fileObj in d, d
  File "proxy.tac", line 28, in <module>
    from twisted.internet import reactor, protocol
  File "/usr/local/lib/python2.5/site-packages/twisted/internet/reactor.py", line 11, in <module>
    from twisted.internet import selectreactor
  File "/usr/local/lib/python2.5/site-packages/twisted/internet/selectreactor.py", line 21, in <module>
    from twisted.internet import posixbase
  File "/usr/local/lib/python2.5/site-packages/twisted/internet/posixbase.py", line 51, in <module>
    if platform.isWindows():
exceptions.AttributeError: Platform instance has no attribute 'isWindows'

failed to load application: Platform instance has no attribute 'isWindows'

I'm running on Red Hat Linux, and it doesn't seem to be a Zope problem?

best
Bjarne Andersen

mark williamson

Nov 27, 2008, 3:18:30 AM
to warc-...@googlegroups.com
Hi Bjarne,

Sorry for the delay. I sent an answer when you posted, but for some
reason it didn't come through.

This problem has something to do with your Twisted installation. The
error is coming from the Twisted imports.

I would go back over your Twisted install step by step and, failing
that, perhaps ask on the Twisted mailing list.

cheers

mark

On Tue, Nov 25, 2008 at 10:15 PM, Bjarne Andersen