CRC-checksum failed in gzip

andrea crotti

Aug 1, 2012, 6:39:57 AM

to python-list

We're having some really obscure problems with gzip.
There is a program running with python2.7 on a 2.6.18-128.el5xen (red
hat I think) kernel.

Now this program does the following:
if filename == 'out2.txt':
    out2 = open('out2.txt')
elif filename == 'out2.txt.gz':
    out2 = open('out2.txt.gz')

text = out2.read()

out2.close()

Very simple, right? But sometimes we get a checksum error.
Reading the code I got the following:

- CRC is at the end of the file and is computed against the whole
file (last 8 bytes)
- after the CRC there is the \0000 marker for the EOF
- readline() doesn't trigger the checksum generation in the
beginning, but only when the EOF is reached
- until a file is flushed or closed you can't read the new content in it
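
For reference, here is a minimal sketch of how the trailer mentioned above can be inspected. It is not part of the original program and assumes Python 3 and a gzip file with a single member. Per RFC 1952, the last 8 bytes of a member are the CRC-32 of the uncompressed data followed by the uncompressed size modulo 2**32, both little-endian. The names read_gzip_trailer and trailer_matches are made up for illustration.

```python
import struct
import zlib


def read_gzip_trailer(path):
    """Return (crc32, isize) from the last 8 bytes of a single-member gzip file."""
    with open(path, "rb") as raw:
        raw.seek(-8, 2)  # whence=2 means "relative to the end of the file"
        crc32, isize = struct.unpack("<II", raw.read(8))
    return crc32, isize


def trailer_matches(path, data):
    """Check already-decompressed data against the stored CRC-32 and size."""
    crc32, isize = read_gzip_trailer(path)
    same_crc = (zlib.crc32(data) & 0xFFFFFFFF) == crc32
    same_size = (len(data) & 0xFFFFFFFF) == isize
    return same_crc and same_size
```

If the trailer of a file that "failed" is identical on a later, successful read, the bytes on disk probably did not change between the two attempts.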

But the problem is that we can't reproduce it: doing it
manually on the same files works perfectly,
and the same files sometimes work and sometimes don't.

The files are on a shared NFS drive. I'm starting to think that it's a
network/fs problem, which might truncate the file,
adding an EOF before the end and thus making the checksum fail.
But is that possible?
Or what else could it be?

Laszlo Nagy

Aug 1, 2012, 6:47:00 AM

to pytho...@python.org

On 2012-08-01 12:39, andrea crotti wrote:
> We're having some really obscure problems with gzip.
> There is a program running with python2.7 on a 2.6.18-128.el5xen (red
> hat I think) kernel.
>
> Now this program does the following:
> if filename == 'out2.txt':
>     out2 = open('out2.txt')
> elif filename == 'out2.txt.gz':
>     out2 = open('out2.txt.gz')

A gzip file is binary. You should open it in binary mode:

out2 = open('out2.txt.gz', "rb")

Otherwise carriage return and newline characters will be converted (depending on the platform).

andrea crotti

Aug 1, 2012, 6:58:10 AM

to Laszlo Nagy, pytho...@python.org

2012/8/1 Laszlo Nagy <gan...@shopzeus.com>:

> --
> http://mail.python.org/mailman/listinfo/python-list

Ah no, sorry, I just wrote that part of the code wrong. It was
out2 = gzip.open('out2.txt.gz'), because otherwise nothing could possibly work.

Laszlo Nagy

Aug 1, 2012, 7:11:18 AM

to pytho...@python.org

> very simple right? But sometimes we get a checksum error.

Do you have a traceback showing the actual error?

>
> - CRC is at the end of the file and is computed against the whole
> file (last 8 bytes)
> - after the CRC there is the \0000 marker for the EOF
> - readline() doesn't trigger the checksum generation in the
> beginning, but only when the EOF is reached
> - until a file is flushed or closed you can't read the new content in it

How do you write the file? Is it written from another Python program?
Can we see the source code of that?

>
> but the problem is that we can't reproduce it, because doing it
> manually on the same files it works perfectly,
> and the same files some time work some time don't work.

The problem might be with the saved file. Once you get an error for a
given file, can you reproduce the error using the same file?

>
> The files are on a shared NFS drive, I'm starting to think that it's a
> network/fs problem, which might truncate the file
> adding an EOF before the end and thus making the checksum fail..
> But is it possible?
> Or what else could it be?

Can you try to run the same program on a local drive?

andrea crotti

Aug 1, 2012, 9:01:45 AM

to Laszlo Nagy, pytho...@python.org

Full traceback:

Exception in thread Thread-8:
Traceback (most recent call last):
  File "/user/sim/python/lib/python2.7/threading.py", line 530, in __bootstrap_inner
    self.run()
  File "/user/sim/tests/llif/AutoTester/src/AutoTester2.py", line 67, in run
    self.processJobData(jobData, logger)
  File "/user/sim/tests/llif/AutoTester/src/AutoTester2.py", line 204, in processJobData
    self.run_simulator(area, jobData[1] ,log)
  File "/user/sim/tests/llif/AutoTester/src/AutoTester2.py", line 142, in run_simulator
    report_file, percentage, body_text = SimResults.copy_test_batch(log, area)
  File "/user/sim/tests/llif/AutoTester/src/SimResults.py", line 274, in copy_test_batch
    out2_lines = out2.read()
  File "/user/sim/python/lib/python2.7/gzip.py", line 245, in read
    self._read(readsize)
  File "/user/sim/python/lib/python2.7/gzip.py", line 316, in _read
    self._read_eof()
  File "/user/sim/python/lib/python2.7/gzip.py", line 338, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x4f675fba != 0xa9e45aL

- The file is written with the Linux gzip program.
- No, I can't reproduce the error with the exact same file that
failed, which is what is really puzzling.
There seems to be no clear pattern; it just fails randomly. The file
is also only opened for reading by this program,
so in theory there is no way that it can be corrupted.

I also checked with lsof whether any processes have it open, but
nothing appears.

- I can't really try on the local disk, it might take ages unfortunately
(we are rewriting this system from scratch anyway).

Laszlo Nagy

Aug 1, 2012, 9:27:26 AM

to andrea crotti, pytho...@python.org

> - The file is written with the linux gzip program.
> - no I can't reproduce the error with the same exact file that
> failed, that's what is really puzzling,

How do you make sure that no process is reading the file before it is
fully flushed to disk?

Possible way of testing for this kind of error: before you open a file,
use os.stat to determine its size, and write out the size and the file
path into a log file. Whenever an error occurs, compare the actual size
of the file with the logged value. If they are different, then you have
tried to read from a file that was growing at that time.
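
A minimal sketch of that size check, assuming Python 3 (where a bad CRC surfaces as an OSError subclass and a truncated stream as EOFError). The function name read_gzip_with_size_log and the logger name are made up for illustration; they are not from the program being debugged.

```python
import gzip
import logging
import os
import zlib

log = logging.getLogger("gzip_size_check")  # hypothetical logger name


def read_gzip_with_size_log(path):
    """Read a gzip file, logging its size before opening and again on failure."""
    size_before = os.stat(path).st_size
    log.info("opening %s (size %d bytes)", path, size_before)
    try:
        with gzip.open(path, "rb") as fz:
            return fz.read()
    except (OSError, EOFError, zlib.error):
        # Compare the size now with the size logged before the read started.
        size_after = os.stat(path).st_size
        if size_after != size_before:
            log.error("%s changed size during read: %d -> %d bytes",
                      path, size_before, size_after)
        else:
            log.error("%s failed with unchanged size %d bytes",
                      path, size_before)
        raise
```

A "changed size" entry in the log would point at a concurrent writer; an "unchanged size" entry would point away from it.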

Suggestion: from the other process, write the file into a different file
(for example, "file.gz.tmp"). Once the file is flushed and closed, use
os.rename() to give its final name. On POSIX systems, the rename()
operation is atomic.
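
A sketch of the write-then-rename idea, assuming Python 3 and that the temporary file lives in the same directory (and so on the same file system) as the final one. The helper name write_gzip_atomically is hypothetical. How quickly other NFS clients see the renamed file depends on their caching, which this sketch does not address.

```python
import gzip
import os


def write_gzip_atomically(final_path, payload):
    """Write payload to final_path + ".tmp", then rename it into place."""
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as raw:
        # Closing the GzipFile writes the CRC/size trailer into raw.
        with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
            gz.write(payload)
        raw.flush()
        os.fsync(raw.fileno())  # ask the OS to push the data to storage
    # Readers see either no file or the complete file, never a partial one.
    os.rename(tmp_path, final_path)
```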

> there seems to be no clear pattern and just randomly fails. The file
> is also just open for read from this program,
> so in theory no way that it can be corrupted.

Yes, there is. Gzip stores CRC for compressed *blocks*. So if the file
is not flushed to the disk, then you can only read a fragment of the
block, and that changes the CRC.

>
> I also checked with lsof if there are processes that opened it but
> nothing appears..

lsof doesn't work very well over nfs. You can have other processes on
different computers (!) writing the file. lsof only lists the processes
on the system it is executed on.

andrea crotti

Aug 1, 2012, 9:52:59 AM

to Laszlo Nagy, pytho...@python.org

2012/8/1 Laszlo Nagy <gan...@shopzeus.com>:

>> there seems to be no clear pattern and just randomly fails. The file
>> is also just open for read from this program,
>> so in theory no way that it can be corrupted.
>
> Yes, there is. Gzip stores CRC for compressed *blocks*. So if the file is
> not flushed to the disk, then you can only read a fragment of the block, and
> that changes the CRC.
>
>>
>> I also checked with lsof if there are processes that opened it but
>> nothing appears..
>
> lsof doesn't work very well over nfs. You can have other processes on
> different computers (!) writing the file. lsof only lists the processes on
> the system it is executed on.
>
>>
>> - can't really try on the local disk, might take ages unfortunately
>> (we are rewriting this system from scratch anyway)
>>
>

Thanks a lot. Someone writing to the file while it is being read might be
an explanation; the problem is that everyone claims that they are only
reading the file.

Apparently this file is generated once and, a long time afterwards, only read
by two different tools (in sequence), so in theory this should not be
possible either. I'll try to investigate more in this direction since
it's the only reasonable explanation I've found so far.

Laszlo Nagy

Aug 1, 2012, 10:17:25 AM

to andrea crotti, pytho...@python.org

>
> Thanks a lot, someone that writes on the file while reading might be
> an explanation, the problem is that everyone claims that they are only
> reading the file.

If that is true, then make that file system read only. Soon it will turn
out who is writing them. ;-)

>
> Apparently this file is generated once and a long time after only read
> by two different tools (in sequence), so this could not be possible
> either in theory.. I'll try to investigate more in this sense since
> it's the only reasonable explanation I've found so far.
>

A safe solution would be to develop a system where files go through
"states" in a predefined order:

* allow programs to write into files with .incomplete extension.
* allow them to rename the file to .complete.
* create a single program that renames .complete files to .gz files
AFTER making them read-only for everybody else.
* readers should only read .gz files
* .gz files are then guaranteed to be complete.
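
The publishing step of that scheme could be sketched as follows, assuming Python 3 on a POSIX system. The function name publish is hypothetical, and the sketch only covers the last transition (.complete to .gz); it does not enforce who is allowed to call it.

```python
import os
import stat


def publish(complete_path):
    """Make a finished ".complete" file read-only, then rename it to ".gz"."""
    base, ext = os.path.splitext(complete_path)
    if ext != ".complete":
        raise ValueError("expected a .complete file, got %r" % complete_path)
    # Drop all write permissions before the file becomes visible as .gz.
    read_only = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    os.chmod(complete_path, read_only)
    final_path = base + ".gz"
    os.rename(complete_path, final_path)
    return final_path
```

For example, publish("results/out2.txt.complete") would leave a read-only results/out2.txt.gz behind (the path is illustrative).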

Steven D'Aprano

Aug 1, 2012, 12:17:44 PM

to

On Wed, 01 Aug 2012 14:01:45 +0100, andrea crotti wrote:

> Full traceback:
>
> Exception in thread Thread-8:

"DANGER DANGER DANGER WILL ROBINSON!!!"

Why didn't you say that there were threads involved? That puts a
completely different perspective on the problem.

I *was* going to write back and say that you probably had either file
system corruption, or network errors. But now that I can see that you
have threads, I will revise that and say that you probably have a bug in
your thread handling code.

I must say, Andrea, your initial post asking for help was EXTREMELY
misleading. You over-simplified the problem to the point that it no
longer has any connection to the reality of the code you are running.
Please don't send us on wild goose chases after bugs in code that you
aren't actually running.

> there seems to be no clear pattern and just randomly fails.

When you start using threads, you have to expect these sorts of
intermittent bugs unless you are very careful.

My guess is that you have a bug where two threads read from the same file
at the same time. Since each read shares state (the position of the file
pointer), you're going to get corruption. Because it depends on timing
details of which threads do what at exactly which microsecond, the effect
might as well be random.

Example: suppose the file contains three blocks A B and C, and a
checksum. Thread 8 starts reading the file, and gets block A and B. Then
thread 2 starts reading it as well, and gets half of block C. Thread 8
gets the rest of block C, calculates the checksum, and it doesn't match.
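
The shared read position in the A/B/C example can be shown without any threads. This is only a single-threaded sketch, assuming Python 3; it does not reproduce the CRC failure itself, it just shows that two callers reading from one file object each receive a different slice of the stream. The variable names are made up.

```python
import gzip
import io

# Build a small gzip stream in memory holding three "blocks" A, B and C.
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
    gz.write(b"AAAA" + b"BBBB" + b"CCCC")
raw.seek(0)

shared = gzip.GzipFile(fileobj=raw, mode="rb")  # one object, one read position

reader_one = shared.read(4)   # first caller consumes block A
reader_two = shared.read(4)   # second caller continues where the first stopped
reader_one += shared.read(4)  # first caller resumes after the second one

shared.close()
```

Here the first caller ends up with blocks A and C and the second with block B, so neither sees the complete content. With real threads the interleaving can also happen in the middle of the decompressor's internal bookkeeping, which is where a checksum mismatch could come from.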

I recommend that you run a file system check on the remote disk. If it
passes, you can eliminate file system corruption. Also, run some network
diagnostics, to eliminate corruption introduced in the network layer. But
I expect that you won't find anything there, and the problem is a simple
thread bug. Simple, but really, really hard to find.

Good luck.

--
Steven

andrea crotti

Aug 1, 2012, 12:38:56 PM

to Steven D'Aprano, pytho...@python.org

2012/8/1 Steven D'Aprano <steve+comp....@pearwood.info>:

Thanks a lot, that makes a lot of sense. I didn't mention this detail
before because I didn't write this code, and I completely forgot that
threads were involved. I'm just trying to help fix this bug.

Your explanation makes a lot of sense, but it's still surprising that
just reading files, without ever writing them, can cause trouble
when using threads :/

Laszlo Nagy

Aug 1, 2012, 1:05:11 PM

to andrea crotti, pytho...@python.org

> Thanks a lot, that makes a lot of sense.. I haven't given this detail
> before because I didn't write this code, and I forgot that there were
> threads involved completely, I'm just trying to help to fix this bug.
>
> Your explanation makes a lot of sense, but it's still surprising that
> even just reading files without ever writing them can cause troubles
> using threads :/

Make sure that file objects are not shared between threads, if that is
possible. It will probably solve the problem (if it is related to
threads).

andrea crotti

Aug 1, 2012, 1:17:56 PM

to Laszlo Nagy, pytho...@python.org

2012/8/1 Laszlo Nagy <gan...@shopzeus.com>:

Well, I just have to create a lock, I guess, right?
with lock:
    # open file
    # read content

Laszlo Nagy

Aug 1, 2012, 1:57:19 PM

to andrea crotti, pytho...@python.org

>> Make sure that file objects are not shared between threads. If that is
>> possible. It will probably solve the problem (if that is related to
>> threads).
>
> Well I just have to create a lock I guess right?

That is also a solution. You need to call file.read() inside an acquired
lock.

> with lock:
>     # open file
>     # read content
>

But not that way! Your example will keep the lock acquired for the
lifetime of the file, so it cannot be shared between threads.

More likely:

## Open file
lock = threading.Lock()
fin = gzip.open(file_path...)
# Now you can share the file object between threads.

# and do this inside any thread:
## data needed. block until the file object becomes usable.
with lock:
    data = fin.read(....) # other threads are blocked while I'm reading
## use your data here, meanwhile other threads can read

Ulrich Eckhardt

Aug 2, 2012, 4:49:07 AM

to

Am 01.08.2012 19:57, schrieb Laszlo Nagy:
> ## Open file
> lock = threading.Lock()
> fin = gzip.open(file_path...)
> # Now you can share the file object between threads.
>
> # and do this inside any thread:
> ## data needed. block until the file object becomes usable.
> with lock:
>     data = fin.read(....) # other threads are blocked while I'm reading
> ## use your data here, meanwhile other threads can read

Technically, that is correct, but IMHO it's complete nonsense to share
the file object between threads in the first place. If you need the data
in two threads, just read the file once and then share the read-only,
immutable content. If the file is small or too large to be held in
memory at once, just open and read it on demand. This also saves you
from having to rewind the file every time you read it.

Am I missing something?

Uli
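
The "read once, share the immutable content" approach described above could look roughly like this. It is a sketch assuming Python 3; read_once, worker and run_workers are hypothetical names, and the worker only measures the length of the data as a stand-in for real processing.

```python
import gzip
import threading


def read_once(path):
    """Decompress the whole file in one thread and return immutable bytes."""
    with gzip.open(path, "rb") as fz:
        return fz.read()


def worker(content, results, index):
    # Each thread only reads the shared bytes object; nothing is mutated.
    results[index] = len(content)


def run_workers(path, count=4):
    content = read_once(path)  # the file object never leaves this function
    results = [None] * count
    threads = [threading.Thread(target=worker, args=(content, results, i))
               for i in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because the file object is opened, read and closed inside read_once, no thread can disturb another thread's read position, and no lock is needed.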

andrea crotti

Aug 2, 2012, 5:26:51 AM

to Steven D'Aprano, pytho...@python.org

2012/8/1 Steven D'Aprano <steve+comp....@pearwood.info>:
>

> When you start using threads, you have to expect these sorts of
> intermittent bugs unless you are very careful.
>
> My guess is that you have a bug where two threads read from the same file
> at the same time. Since each read shares state (the position of the file
> pointer), you're going to get corruption. Because it depends on timing
> details of which threads do what at exactly which microsecond, the effect
> might as well be random.
>
> Example: suppose the file contains three blocks A B and C, and a
> checksum. Thread 8 starts reading the file, and gets block A and B. Then
> thread 2 starts reading it as well, and gets half of block C. Thread 8
> gets the rest of block C, calculates the checksum, and it doesn't match.
>
> I recommend that you run a file system check on the remote disk. If it
> passes, you can eliminate file system corruption. Also, run some network
> diagnostics, to eliminate corruption introduced in the network layer. But
> I expect that you won't find anything there, and the problem is a simple
> thread bug. Simple, but really, really hard to find.
>
> Good luck.

One last thing I would like to do before I add this fix is to actually
be able to reproduce this behaviour, and I thought I could just do the
following:

import gzip
import threading

class OpenAndRead(threading.Thread):
    def run(self):
        fz = gzip.open('out2.txt.gz')
        fz.read()
        fz.close()

if __name__ == '__main__':
    for i in range(100):
        OpenAndRead().start()

But no matter how many threads I start, I can't reproduce the CRC
error. Any idea how I can make it happen?

The code in run should be shared by all the threads since there are no
locks, right?

Laszlo Nagy

Aug 2, 2012, 6:14:14 AM

to pytho...@python.org

> Technically, that is correct, but IMHO it's complete nonsense to share
> the file object between threads in the first place. If you need the
> data in two threads, just read the file once and then share the
> read-only, immutable content. If the file is small or too large to be
> held in memory at once, just open and read it on demand. This also
> saves you from having to rewind the file every time you read it.
>
> Am I missing something?

We suspect that his program reads the same file object from different
threads. At least this would explain his problem. I agree with you -
usually it is not a good idea to share a file object between threads.
This is what I told him the first time. But it is not in our hands - he
already has a program that needs to be fixed. It might be easier for him
to protect read() calls with a lock. Because it can be done
automatically, without thinking too much.

Laszlo Nagy

Aug 2, 2012, 6:21:24 AM

to andrea crotti, pytho...@python.org

> One last thing I would like to do before I add this fix is to actually
> be able to reproduce this behaviour, and I thought I could just do the
> following:
>
> import gzip
> import threading
>
>
> class OpenAndRead(threading.Thread):
>     def run(self):
>         fz = gzip.open('out2.txt.gz')
>         fz.read()
>         fz.close()
>
>
> if __name__ == '__main__':
>     for i in range(100):
>         OpenAndRead().start()
>
>
> But no matter how many threads I start, I can't reproduce the CRC
> error, any idea how I can try to help it happening?

Your example did not share the file object between threads. Here is an
example that does that:

class OpenAndRead(threading.Thread):
    def run(self):
        global fz
        fz.read(100)

if __name__ == '__main__':
    fz = gzip.open('out2.txt.gz')

    for i in range(10):
        OpenAndRead().start()

Try this with a huge file. And here is the one that should never throw
CRC error, because the file object is protected by a lock:

class OpenAndRead(threading.Thread):
    def run(self):
        global fz
        global fl
        with fl:
            fz.read(100)

if __name__ == '__main__':
    fz = gzip.open('out2.txt.gz')

    fl = threading.Lock()
    for i in range(2):
        OpenAndRead().start()

>
> The code in run should be shared by all the threads since there are no
> locks, right?

The code is shared but the file object is not. In your example, a new
file object is created, every time a thread is started.

andrea crotti

Aug 2, 2012, 6:57:06 AM

to Laszlo Nagy, pytho...@python.org

2012/8/2 Laszlo Nagy <gan...@shopzeus.com>:

OK, sure, that makes sense, but then this explanation may not be right
after all, because I'm quite sure that the file object is *not* shared
between threads; everything happens inside a thread.

I managed to get some errors doing this with a big file:

class OpenAndRead(threading.Thread):
    def run(self):
        global fz
        fz.read(100)

if __name__ == '__main__':
    fz = gzip.open('bigfile.avi.gz')
    for i in range(20):
        OpenAndRead().start()

and it doesn't fail without the *global*, but this is definitely not
what the code does, because every thread gets a new file object; it's
not shared.

Anyway we'll read once for all the threads or add the lock, and
hopefully it should solve the problem, even if I'm not convinced yet
that it was this.

andrea crotti

Aug 2, 2012, 6:59:28 AM

to Laszlo Nagy, pytho...@python.org

2012/8/2 andrea crotti <andrea....@gmail.com>:

>
> Ok sure that makes sense, but then this explanation is maybe not right
> anymore, because I'm quite sure that the file object is *not* shared
> between threads, everything happens inside a thread..
>
> I managed to get some errors doing this with a big file
> class OpenAndRead(threading.Thread):
>     def run(self):
>         global fz
>         fz.read(100)
>
> if __name__ == '__main__':
>
>     fz = gzip.open('bigfile.avi.gz')
>     for i in range(20):
>         OpenAndRead().start()
>
> and it doesn't fail without the *global*, but this is definitely not
> what the code does, because every thread gets a new file object, it's
> not shared..
>
> Anyway we'll read once for all the threads or add the lock, and
> hopefully it should solve the problem, even if I'm not convinced yet
> that it was this.

Just for completeness: as suggested, this also does not fail:

class OpenAndRead(threading.Thread):
    def __init__(self, lock):
        threading.Thread.__init__(self)
        self.lock = lock

    def run(self):
        global fz
        with self.lock:
            fz.read(100)

if __name__ == '__main__':
    lock = threading.Lock()

    fz = gzip.open('bigfile.avi.gz')
    for i in range(20):
        OpenAndRead(lock).start()