distinction between unzipping bytes and unzipping a file

webcomm

Jan 9, 2009, 2:49:23 PM

Hi,
In python, is there a distinction between unzipping bytes and
unzipping a binary file to which those bytes have been written?

The following code is, I think, an example of writing bytes to a file
and then unzipping...

import base64
import zipfile

decoded = base64.b64decode(datum)
# datum is a base64-encoded string of data downloaded from a web service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')

After looking at the preceding code, the provider of the web service
gave me this advice...
"Instead of trying to create a file, take the unzipped bytes and get a
Unicode string of text from it."

If so, I'm not sure how to do what he's suggesting, or if it's really
different from what I've done.

I find that I am able to unzip the resulting data.zip using the unix
unzip command, but the file inside contains some FFFD characters, as
described in this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
I don't know if the unwanted characters might be the result of my
trying to write and unzip a file, rather than unzipping the bytes.
The file does contain a semblance of what I ultimately want -- it's
not all garbage.

Apologies if it's not appropriate to start a new thread for this. It
just seems like a different topic than how to deal with the resulting
FFFD characters.

Thanks for your help,
Ryan

webcomm

Jan 9, 2009, 2:54:14 PM

On Jan 9, 2:49 pm, webcomm <rya...@gmail.com> wrote:
> decoded = base64.b64decode(datum)
> # datum is a base64-encoded string of data downloaded from a web service
> f = open('data.zip', 'wb')
> f.write(decoded)
> f.close()
> x = zipfile.ZipFile('data.zip', 'r')

Sorry, that code is not what I meant to paste. This is what I
intended...

import base64
import os

decoded = base64.b64decode(datum)
# datum is a base64-encoded string of data downloaded from a web service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()

x = os.popen("unzip data.zip")

Steve Holden

Jan 9, 2009, 3:15:29 PM

to pytho...@python.org

webcomm wrote:
> Hi,
> In python, is there a distinction between unzipping bytes and
> unzipping a binary file to which those bytes have been written?
>
> The following code is, I think, an example of writing bytes to a file
> and then unzipping...
>
> decoded = base64.b64decode(datum)
> # datum is a base64-encoded string of data downloaded from a web service
> f = open('data.zip', 'wb')
> f.write(decoded)
> f.close()
> x = zipfile.ZipFile('data.zip', 'r')
>
> After looking at the preceding code, the provider of the web service
> gave me this advice...
> "Instead of trying to create a file, take the unzipped bytes and get a
> Unicode string of text from it."
>

Not terribly useful advice, but one presumes he, she, or it was trying to
be helpful.

> If so, I'm not sure how to do what he's suggesting, or if it's really
> different from what I've done.
>

Well, what you have done appears pretty wrong to me, but let's take a
look. What's datum? You appear to be treating it as base64-encoded data;
is that correct? Have you examined it?

f = open('data.zip', 'wb')

opens the file data.zip for writing in binary. Not as a zip file, you
understand, just as a regular file. I suspect here you really needed

f = zipfile.ZipFile('data.zip', 'w')

Now, of course, you need to remember what zipfiles contain. Which is
other files. So the data you *write* to the zipfile has to be associated
with a filename in the archive. Of course you don't have the data in a
file, you have it in a string, so you would use

f.writestr("somefile.dat", decoded)
f.close()

You have now written a zip file containing a single "somefile.dat" file
with the decoded base64 data in it. Open it with Winzip or one of its
buddies and see if anyone barfs.

> I find that I am able to unzip the resulting data.zip using the unix
> unzip command, but the file inside contains some FFFD characters, as
> described in this thread...
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
> I don't know if the unwanted characters might be the result of my
> trying to write and unzip a file, rather than unzipping the bytes.
> The file does contain a semblance of what I ultimately want -- it's
> not all garbage.
>

But it's certainly not a zip file.

> Apologies if it's not appropriate to start a new thread for this. It
> just seems like a different topic than how to deal with the resulting
> FFFD characters.
>

Don't worry about it.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/

MRAB

Jan 9, 2009, 3:16:01 PM

to pytho...@python.org

webcomm wrote:
> Hi,
> In python, is there a distinction between unzipping bytes and
> unzipping a binary file to which those bytes have been written?
>

Python's zipfile module can only read and write zip files; it can't
compress or decompress data as a bytestring.

> The following code is, I think, an example of writing bytes to a file
> and then unzipping...
>
> decoded = base64.b64decode(datum)
> # datum is a base64-encoded string of data downloaded from a web service
> f = open('data.zip', 'wb')
> f.write(decoded)
> f.close()
> x = zipfile.ZipFile('data.zip', 'r')
>
> After looking at the preceding code, the provider of the web service
> gave me this advice...
> "Instead of trying to create a file, take the unzipped bytes and get a
> Unicode string of text from it."
>
> If so, I'm not sure how to do what he's suggesting, or if it's really
> different from what I've done.
>

If what you've been given is data which has been zipped and then base-64
encoded, then I can't see what you might be doing wrong.
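
On the title question: zipfile.ZipFile accepts a seekable file-like object as well as a filename, so the decoded bytes can be unzipped in memory without writing data.zip first. Below is a minimal sketch for Python 3, assuming the payload really is a zip archive. The function name unzip_base64 and the stand-in payload are made up for illustration and are not part of the web service's API.

```python
import base64
import io
import zipfile


def unzip_base64(datum: str) -> dict:
    """Decode a base64 string and return {member name: raw bytes} for the archive inside."""
    decoded = base64.b64decode(datum)
    # BytesIO wraps the bytes in a seekable file-like object, so no temp file is needed.
    with zipfile.ZipFile(io.BytesIO(decoded)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Build a stand-in for the web service's payload (hypothetical content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("data", "<Registration/>".encode("utf-16-le"))
datum = base64.b64encode(buffer.getvalue()).decode("ascii")

members = unzip_base64(datum)
text = members["data"].decode("utf-16-le")
```

Either route (temp file or BytesIO) hands zipfile the same bytes, so the extracted member should be identical; the in-memory version just skips the disk.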

webcomm

unread,

Jan 9, 2009, 3:32:28 PM1/9/09

to

It's data that has been compressed then base64 encoded by the web
service. I'm supposed to download it, then decode, then unzip. They
provide a C# example of how to do this on page 13 of
http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf

If you have a minute, see also this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/d72d883409764559/5b9eceeee3e77dd4?hl=en&lnk=gst&q=webcomm#5b9eceeee3e77dd4

Chris Mellon

Jan 9, 2009, 4:08:35 PM

to pytho...@python.org

When they say "zip", they're talking about a zlib compressed stream of
bytes, not a zip archive.

You want to base64 decode the data, then zlib decompress it, then
finally interpret it as (I think) UTF-16, as that's what Windows
usually means when it says "Unicode".

import base64
import zlib

decoded = base64.b64decode(datum)
decompressed = zlib.decompress(decoded)
result = decompressed.decode('utf-16')
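
When it is unclear whether "zip" means a zip archive or a bare zlib stream, the bytes themselves can usually settle it. The sketch below (Python 3) is one way to check, assuming only those two candidates: zipfile.is_zipfile looks for the archive's end record, and the zlib test follows the two-byte header rule in RFC 1950. The name sniff_container and the sample payload are hypothetical.

```python
import io
import zipfile
import zlib


def sniff_container(data: bytes) -> str:
    """Return 'zip', 'zlib', or 'unknown' based on the container's marker bytes."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip"
    # RFC 1950: the low nibble of the first byte is 8 (deflate) and the
    # first two bytes, read as a big-endian integer, are a multiple of 31.
    if len(data) >= 2 and data[0] & 0x0F == 8 and (data[0] * 256 + data[1]) % 31 == 0:
        return "zlib"
    return "unknown"


payload = "<Registration/>".encode("utf-16-le")

# The same payload packed both ways.
zlib_stream = zlib.compress(payload)

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("data", payload)
zip_archive = zip_buffer.getvalue()
```

The zlib check is only a heuristic: unrelated data can pass the header test by chance, so a result of 'zlib' is a hint to try zlib.decompress, not a guarantee.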

Chris Mellon

Jan 9, 2009, 4:12:42 PM

to pytho...@python.org

On Fri, Jan 9, 2009 at 3:08 PM, Chris Mellon <ark...@gmail.com> wrote:
> On Fri, Jan 9, 2009 at 2:32 PM, webcomm <rya...@gmail.com> wrote:

> When they say "zip", they're talking about a zlib compressed stream of
> bytes, not a zip archive.
>
> You want to base64 decode the data, then zlib decompress it, then
> finally interpret it as (I think) UTF-16, as that's what Windows
> usually means when it says "Unicode".
>
> decoded = base64.b64decode(datum)
> decompressed = zlib.decompress(decoded)
> result = decompressed.decode('utf-16')
>

And of course as *soon* as I write that, I read the appendix of the
documentation in full and turn out to be wrong. Ignore me *sigh*.

It would really help if you could post a sample file somewhere.

webcomm

Jan 9, 2009, 4:56:56 PM

On Jan 9, 4:12 pm, "Chris Mellon" <arka...@gmail.com> wrote:
> It would really help if you could post a sample file somewhere.

Here's a sample with some dummy data from the web service:
http://webcomm.webfactional.com/htdocs/data.zip

That's the zip created in this line of my code...

f = open('data.zip', 'wb')

If I open the file it contains as unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it. It looks
like valid XML. But if I return it to my browser with python+django,
there are bad characters every other character.

If I unzip it like this...
popen("unzip data.zip")
...then the bad characters are 'FFFD' characters as described and
pictured here...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#

If I unzip it like this...
getzip('data.zip', ignoreable=30000)
...using the function at...
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.

John Machin

Jan 9, 2009, 6:07:10 PM

On Jan 10, 8:56 am, webcomm <rya...@gmail.com> wrote:
> On Jan 9, 4:12 pm, "Chris Mellon" <arka...@gmail.com> wrote:
>
> > It would really help if you could post a sample file somewhere.
>
> Here's a sample with some dummy data from the web service: http://webcomm.webfactional.com/htdocs/data.zip
>
> That's the zip created in this line of my code...
> f = open('data.zip', 'wb')

Your original problem is identical to that already reported by Chris
Mellon (gratuitous \0 bytes appended to the real archive contents).
Here's the output of the diagnostic gadget that I posted a few minutes
ago:
..........................................................
C:\downloads>python zip_susser_v2.py data.zip
archive size is 1092
FileHeader at 0
CentralDir at 844
EndArchive at 894
using posEndArchive = 894
endArchive: ('PK\x05\x06', 0, 0, 1, 1, 50, 844, 0)
signature : 'PK\x05\x06'
this_disk_num : 0
central_dir_disk_num : 0
central_dir_this_disk_num_entries : 1
central_dir_overall_num_entries : 1
central_dir_size : 50
central_dir_offset : 844
comment_size : 0

expected_comment_size: 0
actual_comment_size: 176
comment is all spaces: False
comment is all '\0': True
comment (first 100 bytes):
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00'
...................................
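
The diagnostic output above shows a valid end-of-archive record at offset 894 followed by 176 zero bytes that the record's comment_size (0) does not account for. One way to work around that kind of padding is to cut the data off right after the record. This is a rough Python 3 sketch, not the zip_susser script: strip_trailing_padding is a made-up helper, and it assumes the last 'PK\x05\x06' in the data is the real end record.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"      # "end of central directory" marker
EOCD_FORMAT = "<4s4H2LH"            # fixed part of the record, 22 bytes
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)


def strip_trailing_padding(data: bytes) -> bytes:
    """Drop any bytes that follow the end-of-archive record and its declared comment."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_SIZE > len(data):
        raise ValueError("no end-of-central-directory record found")
    fields = struct.unpack(EOCD_FORMAT, data[pos:pos + EOCD_SIZE])
    comment_size = fields[-1]       # last field is the declared comment length
    return data[:pos + EOCD_SIZE + comment_size]


# Reproduce the symptom: a valid archive followed by 176 zero bytes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data", b"<\x00R\x00")
clean = buffer.getvalue()
padded = clean + b"\x00" * 176

repaired = strip_trailing_padding(padded)
with zipfile.ZipFile(io.BytesIO(repaired)) as archive:
    names = archive.namelist()
```

The eight unpacked fields line up with the endArchive tuple printed by the diagnostic: signature, two disk numbers, two entry counts, central directory size and offset, and comment size.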

>
> If I open the file it contains as unicode in my text editor (EditPlus)
> on Windows XP, there is ostensibly nothing wrong with it. It looks
> like valid XML.

Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
God^H^H^HGates intended:

>>> buff = open('data', 'rb').read()
>>> buff[:100]
'<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
>>> buff[:100].decode('utf_16_le')
u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'
>>>

> But if I return it to my browser with python+django,
> there are bad characters every other character

Please consider that we might have difficulty guessing what "return it
to my browser with python+django" means. Show actual code.

>
> If I unzip it like this...
> popen("unzip data.zip")
> ...then the bad characters are 'FFFD' characters as described and
> pictured here... http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

Yup, you've somehow pushed your utf_16_le-encoded data through some
decoder that doesn't like '\x00' and is replacing it with U+FFFD whose
name is (funnily enough) REPLACEMENT CHARACTER and whose meaning is
"big fat Unicode version of the question mark".

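Both symptoms can be reproduced in a few lines. The sketch below (Python 3) is illustrative only: it does not claim to identify which decoder the unzip or Django path actually used. It shows that the right codec gives clean text, that a single-byte codec leaves a '\x00' after every character, and that a decoder running in "replace" mode emits U+FFFD for input it rejects. The lone 0xFF byte is a stand-in chosen because it is never valid UTF-8.

```python
import unicodedata

raw = "<Registration>".encode("utf-16-le")   # bytes like b'<\x00R\x00e\x00...'

# Correct codec: clean text.
good = raw.decode("utf-16-le")

# Wrong single-byte codec: every other character is a NUL ('\x00').
interleaved = raw.decode("latin-1")

# A lenient decoder substitutes U+FFFD for any byte sequence it rejects.
replaced = b"A\xffB".decode("utf-8", errors="replace")

# The official Unicode name of the substitute character.
name = unicodedata.name("\ufffd")
```

Seeing either '\x00' or U+FFFD at every other position is therefore a strong hint that UTF-16 data went through the wrong decoder somewhere along the way.
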
>
> If I unzip it like this...
> getzip('data.zip', ignoreable=30000)
> ...using the function at... http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
> ...then the bad characters are \x00 characters.

Hmmm ... shouldn't make a difference how you extracted 'data' from
'data.zip'.

Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html

Cheers,
John

webcomm

Jan 10, 2009, 2:15:45 PM

On Jan 9, 6:07 pm, John Machin <sjmac...@lexicon.net> wrote:
> Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
> God^H^H^HGates intended:
>
> >>> buff = open('data', 'rb').read()
> >>> buff[:100]
> '<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
> >>> buff[:100].decode('utf_16_le')

There it is. Thanks.

> u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'
>
>
>
> > But if I return it to my browser with python+django,
> > there are bad characters every other character
>
> Please consider that we might have difficulty guessing what "return it
> to my browser with python+django" means. Show actual code.

I did stop and consider what code to show. I tried to show only the
code that seemed relevant, as there are sometimes complaints on this
and other groups when someone shows more than the relevant code. You
solved my problem with decode('utf_16_le'). I can't find any
description of that encoding on the WWW... and I thought *everything*
was on the WWW. :)

I didn't know the data was utf_16_le-encoded because I'm getting it
from a service. I don't even know if *they* know what encoding they
used. I'm not sure how you knew what the encoding was.

> Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html

Probably wouldn't hurt, though reading that HOWTO wouldn't have given
me the encoding, I don't think.

-Ryan

> Cheers,
> John

John Machin

Jan 10, 2009, 4:18:14 PM

Try searching using the official name UTF-16LE ... looks like a blind
spot in the approximate matching algorithm(s) used by the search engine(s)
that you tried :-(

> I didn't know the data was utf_16_le-encoded because I'm getting it
> from a service. I don't even know if *they* know what encoding they
> used. I'm not sure how you knew what the encoding was.

Actually looked at the raw data. Pattern appeared to be an alternation
of 1 "meaningful" byte and one zero ('\x00') byte: => UTF16*. No BOM
('\xFE\xFF' or '\xFF\xFE') at start of file: => UTF16-?E. First byte
is meaningful: => UTF16-LE.
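
The reasoning above can be written down as a small heuristic. This is a sketch for Python 3 that only makes sense for mostly-ASCII text such as this XML. The function name guess_utf16_variant, the 200-byte sample size, and the 90% threshold are all arbitrary assumptions for illustration.

```python
import codecs


def guess_utf16_variant(data: bytes) -> str:
    """Guess a UTF-16 codec name from a BOM or, failing that, the position of zero bytes.

    Only meaningful for mostly-ASCII text; returns 'unknown' otherwise.
    """
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"          # the generic codec reads the BOM and picks the byte order
    sample = data[:200]
    pairs = len(sample) // 2
    if pairs == 0:
        return "unknown"
    even_zeros = sample[0::2].count(0)   # zero bytes at positions 0, 2, 4, ...
    odd_zeros = sample[1::2].count(0)    # zero bytes at positions 1, 3, 5, ...
    if odd_zeros >= 0.9 * pairs and even_zeros == 0:
        return "utf-16-le"       # meaningful byte first, zero byte second
    if even_zeros >= 0.9 * pairs and odd_zeros == 0:
        return "utf-16-be"       # zero byte first, meaningful byte second
    return "unknown"


sample_le = "<Registration><BalanceDue>0.0000</BalanceDue>".encode("utf-16-le")
sample_be = "<Registration>".encode("utf-16-be")
sample_bom = "<Registration>".encode("utf-16")   # Python prepends a BOM here

decoded = sample_le.decode(guess_utf16_variant(sample_le))
```

A guess like this is no substitute for the service documenting its encoding, but it matches how the encoding was identified here: look at the raw bytes first.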

> > Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html
>
> Probably wouldn't hurt,

Definitely won't hurt. Could even help.

> though reading that HOWTO wouldn't have given
> me the encoding, I don't think.

It wasn't intended to give you the encoding. Just read it.

Cheers,
John