
Python3.1: gzip encoding with UTF-8 fails


Johannes Bauer

Dec 20, 2009, 11:08:33 AM
Hello group,

With the following program:

#!/usr/bin/python3
import gzip
x = gzip.open("testdatei", "wb")
x.write("ä")
x.close()

I get a broken .gzip file when decompressing:

$ cat testdatei |gunzip
ä
gzip: stdin: invalid compressed data--length error

As it only happens with UTF-8 characters, I suppose the gzip module
writes a length of 1 in the gzip file header (one character "ä"), but
then actually writes 2 characters (0xc3 0xa4).
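
A minimal sketch of the mismatch described above, assuming a Python version that provides `gzip.compress` (3.2 or later). It compares the character count of "ä" with its UTF-8 byte count, then reads the length that a correctly written gzip stream records. Per RFC 1952 that length (ISIZE) sits in the last four bytes of the trailer rather than in the header. The variable names are illustrative only.

```python
import gzip
import struct

text = "\u00e4"                    # the character "ä" from the post
payload = text.encode("utf-8")     # its UTF-8 representation

char_count = len(text)             # number of code points
byte_count = len(payload)          # number of bytes actually written

# Compress the bytes in memory and read ISIZE: the little-endian
# uncompressed length stored in the final four bytes of the stream.
compressed = gzip.compress(payload)
(isize,) = struct.unpack("<I", compressed[-4:])

print(char_count, byte_count, payload, isize)
```

If a writer recorded the character count instead of the byte count, gunzip would see a length that disagrees with the decompressed data, which is consistent with the "length error" message.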

Is there a solution?

Regards,
Johannes

--
"Strong potentials can result in strong earthquakes; but small ones can
also occur - and "you" won't believe it (!), but look: it is also
possible that no earthquakes result at all."
-- "Rüdiger Thomas" alias Thomas Schulz in dsa on his "predictions"
<1a30da36-68a2-4977...@q14g2000vbi.googlegroups.com>

Diez B. Roggisch

Dec 20, 2009, 11:52:24 AM
Johannes Bauer wrote:

> Hello group,
>
> with this following program:
>
> #!/usr/bin/python3
> import gzip
> x = gzip.open("testdatei", "wb")
> x.write("ä")
> x.close()
>
> I get a broken .gzip file when decompressing:
>
> $ cat testdatei |gunzip
> ä
> gzip: stdin: invalid compressed data--length error
>
> As it only happens with UTF-8 characters, I suppose the gzip module

UTF-8 is not Unicode. Even if the source encoding above is UTF-8, I'm
not sure which encoding is used when the Unicode string is written.

> writes a length of 1 in the gzip file header (one character "ä"), but
> then actually writes 2 characters (0xc3 0xa4).
>
> Is there a solution?

What about writing a bytestring by explicitly encoding the string as
UTF-8 first?

x.write("ä".encode("utf-8"))
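
As a rough, self-contained sketch of that suggestion: encode the string explicitly, write the bytes through `gzip.open`, then read them back and decode. The temporary directory and the `.gz` file name are assumptions made so the example cleans up after itself; they are not part of the original program.

```python
import gzip
import os
import tempfile

text = "\u00e4"  # the character "ä" from the thread

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "testdatei.gz")

    # Binary mode expects bytes, so encode before writing.
    with gzip.open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Reading in binary mode returns the same bytes.
    with gzip.open(path, "rb") as f:
        raw = f.read()

# Decode with the same encoding to recover the original string.
restored = raw.decode("utf-8")
print(raw, restored == text)
```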


Diez

Mark Tolonen

Dec 20, 2009, 1:01:49 PM
to pytho...@python.org
> "Diez B. Roggisch" <de...@nospam.web.de> wrote in message
> news:7p7328F...@mid.uni-berlin.de...

While that works, it still seems like a bug in gzip. If gzip.open is
replaced with a simple open:

# coding: utf-8
import gzip
x = open("testdatei", "wb")
x.write("ä")
x.close()

The result is:

Traceback (most recent call last):
  File "C:\dev\python3\Lib\site-packages\Pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec(codeObj, __main__.__dict__)
  File "<auto import>", line 1, in <module>
  File "y.py", line 4, in <module>
    x.write("ä")
TypeError: must be bytes or buffer, not str

Writing to a file opened in binary mode should require a bytes or buffer
object, and gzip.open should behave the same way.
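
A small sketch of that expectation, assuming a Python 3 release in which `gzip` rejects `str` in binary mode (3.2 or later). The helper `write_error` and the file names are hypothetical, introduced only for this illustration. The exact error message differs between versions, so only the exception type is checked.

```python
import gzip
import os
import tempfile

def write_error(opener, path, data):
    """Write data in binary mode; return the exception type name, or None."""
    try:
        with opener(path, "wb") as f:
            f.write(data)
    except TypeError as exc:
        return type(exc).__name__
    return None

with tempfile.TemporaryDirectory() as tmpdir:
    # A str passed to a plain binary file is rejected.
    plain_result = write_error(open, os.path.join(tmpdir, "plain.bin"), "\u00e4")
    # A str passed to a binary gzip file is rejected as well.
    gzip_result = write_error(gzip.open, os.path.join(tmpdir, "packed.gz"), "\u00e4")
    # Encoded bytes are accepted.
    bytes_result = write_error(
        gzip.open, os.path.join(tmpdir, "ok.gz"), "\u00e4".encode("utf-8")
    )

print(plain_result, gzip_result, bytes_result)
```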

-Mark


Antoine Pitrou

Dec 21, 2009, 7:24:09 AM
to pytho...@python.org

Hello,

On Sun, 20 Dec 2009 17:08:33 +0100, Johannes Bauer wrote:
>
> #!/usr/bin/python3
> import gzip
> x = gzip.open("testdatei", "wb")
> x.write("ä")

The problem here is that you are trying to write a Unicode text string
("ä") to a binary file (a gzip file). The gzip module used to accept this
silently; that has been fixed now, and in the next 3.x versions it will
raise a TypeError:

>>> x = gzip.open("testdatei", "wb")
>>> x.write("ä")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/antoine/py3k/__svn__/Lib/gzip.py", line 227, in write
    self.crc = zlib.crc32(data, self.crc) & 0xffffffff
TypeError: must be bytes or buffer, not str

You have to encode manually if you want to write text strings to a gzip
file:

>>> x = gzip.open("testdatei", "wb")
>>> x.write("ä".encode('utf8'))
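
For readers on later versions: since Python 3.3, `gzip.open` also accepts text modes such as "wt" and "rt" with an `encoding` argument, which performs the encoding step for you. The sketch below assumes such a version and uses a temporary directory and file name chosen only for illustration.

```python
import gzip
import os
import tempfile

text = "\u00e4"  # the character "ä" from the thread

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "testdatei.gz")

    # Text mode: the str is encoded to UTF-8 before compression.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(text)

    # Text mode on read: bytes are decompressed, then decoded.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        restored_text = f.read()

    # Binary mode shows the underlying UTF-8 bytes.
    with gzip.open(path, "rb") as f:
        raw_bytes = f.read()

print(restored_text == text, raw_bytes)
```

Manual encoding, as in the session above, remains the right choice when the file is opened in binary mode.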


Regards

Antoine.

