saving octet-stream png file

Larry Martell

Aug 19, 2016, 1:11:52 PM
I have some python code (part of a django app) that processes a
request that contains a png file. The request is sent with
content_type = 'application/octet-stream'

In the python code I want to write this data to a file and have
it still be a valid png file.

The data I get looks like this:

u'\ufffdPNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\ufffd\x00\x00\x01\ufffd
......'

If I try and write that to a file it fails with a UnicodeEncodeError.
If I write it with encode('utf8') it writes the file, but then it's no
longer a valid png file.

Anyone know how I can do this?

Chris Angelico

Aug 19, 2016, 1:24:44 PM
At that point, you've already lost information. Each U+FFFD (shown as
"\ufffd" above) is a marker saying "a byte here was not valid UTF-8"
(or whatever was being used). Something somewhere took the .png file's
bytes and tried to interpret them as text, which they're not.
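
A small sketch of what that loss looks like, assuming the bytes were decoded as UTF-8 with replacement of invalid bytes (the thread does not say which decoder was actually involved). The name `png_header` is only illustrative:

```python
# First bytes of a PNG file: the 8-byte signature, then the start of the IHDR chunk.
png_header = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"

# 0x89 cannot start a UTF-8 sequence, so a lenient decoder replaces it with U+FFFD.
text = png_header.decode("utf-8", errors="replace")
print(repr(text))

# Encoding the text again yields the UTF-8 form of U+FFFD, not the original 0x89.
round_trip = text.encode("utf-8")
print(round_trip == png_header)  # False
```

Once the replacement has happened, nothing in the text records which byte used to be there, so no later encode() call can restore the file.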

What sent you that data? How did you receive it?

ChrisA

Terry Reedy

Aug 19, 2016, 2:03:53 PM
On 8/19/2016 1:10 PM, Larry Martell wrote:
> I have some python code (part of a django app) that processes a
> request that contains a png file. The request is sent with
> content_type = 'application/octet-stream'

An 'octet' is a byte of 8 bits. So the content is a stream of bytes and
MUST NOT be decoded as unicode text.

> In the python code I want to write this data to a file and have
> it still be a valid png file.
>
> The data I get looks like this:
>
> u'\ufffdPNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\ufffd\x00\x00\x01\ufffd
> ......'

The data you got looked like b'...PNG...' where the *ascii* codes for
"PNG" identify it as a png byte stream. It was mistakenly decoded to
unicode text by something. Png bytes must be decoded, when decoded, to
a png image. You want to write the bytes to a file exactly as received,
without decoding.
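
As a minimal sketch of "write the bytes exactly as received", assuming the raw body is already available as a bytes object (how to obtain it from the framework is a separate question). `save_png_bytes` and `PNG_SIGNATURE` are hypothetical names used only for illustration:

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the fixed 8-byte PNG file signature


def save_png_bytes(data, path):
    """Write raw PNG bytes to path without any text decoding or encoding."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("data does not start with the PNG signature")
    # "wb" opens the file in binary mode, so the bytes go to disk unchanged.
    with open(path, "wb") as fd:
        fd.write(data)
```

The signature check is optional; it only catches the case where the data was already mangled before it reached this function.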

> If I try and write that to a file it fails with a UnicodeEncodeError.
> If I write it with encode('utf8') it writes the file, but then it's no
> longer a valid png file.

The data ceased representing a png image as soon as wrongfully decoded
as unicode text.


--
Terry Jan Reedy

Larry Martell

Aug 19, 2016, 3:01:18 PM
On Fri, Aug 19, 2016 at 1:24 PM, Chris Angelico <ros...@gmail.com> wrote:
> On Sat, Aug 20, 2016 at 3:10 AM, Larry Martell <larry....@gmail.com> wrote:
>> I have some python code (part of a django app) that processes a
>> request that contains a png file. The request is sent with
>> content_type = 'application/octet-stream'
>>
>> In the python code I want to write this data to a file and have
>> it still be a valid png file.
>>
>> The data I get looks like this:
>>
>> u'\ufffdPNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\ufffd\x00\x00\x01\ufffd
>> ......'
>>
>> If I try and write that to a file it fails with a UnicodeEncodeError.
>> If I write it with encode('utf8') it writes the file, but then it's no
>> longer a valid png file.
>>
>> Anyone know how I can do this?
>
> At that point, you've already lost information. Each U+FFFD (shown as
> "\ufffd" above) is a marker saying "a byte here was not valid UTF-8"
> (or whatever was being used). Something somewhere took the .png file's
> bytes and tried to interpret them as text, which they're not.
>
> What sent you that data? How did you receive it?

The request is sent by a client app written in C++ with Qt. It's
received by a django based server. I am trying to port a falcon server
to django. The falcon server code did this:

form = cgi.FieldStorage(fp=req.stream, environ=req.env)

and then wrote the png like this:

fd.write(form[key].file.read())

Whereas in the django server I am doing:

fd.write(request.POST[key])

I've never used the cgi module. I guess I can try that. I've written a
lot with django but never had to receive a PNG file.

Larry Martell

Aug 19, 2016, 3:46:37 PM
No joy using cgi.FieldStorage. The request I get (of type
django.core.handlers.wsgi.WSGIRequest) does not have a stream method.
I'm sure there's some way to do this, but I have not come up with
anything googling. Going to try the django list.

Chris Kaynor

Aug 19, 2016, 4:25:01 PM
On Fri, Aug 19, 2016 at 12:00 PM, Larry Martell <larry....@gmail.com>
I don't know Django, however a quick search makes it seem like you might
need to use request.FILES[key] [1] rather than request.POST[key]. You may
also be able to use request.POST if you set request.encoding first [2]. If
both of those fail, you may need to use request.body and parse the HTTP
form data manually, though I'd imagine there is an easier way.

[1]
https://docs.djangoproject.com/en/1.10/ref/request-response/#django.http.HttpRequest.FILES

[2]
https://docs.djangoproject.com/en/1.10/ref/request-response/#django.http.HttpRequest.encoding

Lawrence D’Oliveiro

Aug 19, 2016, 4:51:23 PM
On Saturday, August 20, 2016 at 6:03:53 AM UTC+12, Terry Reedy wrote:
>
> An 'octet' is a byte of 8 bits.

Is there any other size of byte?

Random832

Aug 19, 2016, 5:10:24 PM
Not very often anymore. Used to be some systems had 9-bit bytes, and of
course a lot of communication protocols only supported 7-bit data bytes.
"Byte" is a technical term in the C and C++ standards meaning the
smallest addressable unit even if that is a larger word.

Steve D'Aprano

Aug 19, 2016, 9:09:37 PM
Depends what you mean by "byte", but the short answer is "Yes".

In the C/C++ standard, bytes must be at least eight bits. As the below FAQ
explains, that means that on machines like the PDP-10 a C++ compiler will
define bytes to be 32 bits.

One common definition of "byte" is the smallest addressable unit of memory.
On that basis, there have been machines like the Control Data 6600 where a
byte was 60 bits. Honeywell machines used 9 bits.

Digital signal processors (DSPs) frequently have bytes with more than eight
bits, such as Texas Instruments C54x DSPs (16 bit bytes), BelaSigna DSPs
(24 bits) and DSP56K/Symphony Audio DSPs (24 bits).

The Saturn CPU (used in the HP-48SX/GX calculator line) addresses memory
in 4-bit bytes.

Windows CE took the unusual, and non-conformant, approach of running on
hardware with 16 bit bytes and simply not defining "char" (and
presumably "byte") in their C compiler.


See:

https://isocpp.org/wiki/faq/intrinsic-types
http://stackoverflow.com/questions/5516044/system-where-1-byte-8-bit
http://stackoverflow.com/questions/2098149/what-platforms-have-something-other-than-8-bit-char
http://programmers.stackexchange.com/questions/120126/what-is-the-history-of-why-bytes-are-eight-bits



--
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

Random832

Aug 19, 2016, 11:24:04 PM
On Fri, Aug 19, 2016, at 21:09, Steve D'Aprano wrote:
> Depends what you mean by "byte", but the short answer is "Yes".
>
> In the C/C++ standard, bytes must be at least eight bits. As the below
> FAQ
> explains, that means that on machines like the PDP-10 a C++ compiler will
> define bytes to be 32 bits.

I assume you mean 36, but I think this is mixing up two separate parts
of the FAQ along with some theory discussion. AFAIK all historical C
implementations for the PDP-10 and other 18- or 36-bit word systems have
used the "9-bit bytes, and an extra offset member in char and void
pointers if necessary" solution rather than the "word-sized byte"
solution.

In principle, I think a close reading of the standard would allow an
implementation to have 'skipped bits' as discussed there so long as
there is no significance to the 'missing' bits in *any* type - so you
could legally have 8-bit bytes on a PDP-10 so long as you also had
16-bit shorts, 32-bit long/pointer/float, and in general *never ever*
broke the illusion that only the bits addressable via char pointers
exist. Modern implementations aren't expected to expose ECC bits, after
all.

Such an implementation would probably be best done by ignoring the high
bits of each word, which are easily masked off and often have no real
use anyway in pointers (How many of these systems supported 2^36 words =
288 modern gigabytes of memory?), rather than by ignoring a single bit
"between" each byte. AIUI many PDP-10 applications only used 18 bits for
word pointers.

Marko Rauhamaa

Aug 20, 2016, 3:51:13 AM

Random832 <rand...@fastmail.com>:
> On Fri, Aug 19, 2016, at 16:51, Lawrence D’Oliveiro wrote:
>> On Saturday, August 20, 2016 at 6:03:53 AM UTC+12, Terry Reedy wrote:
>> > An 'octet' is a byte of 8 bits.
>> Is there any other size of byte?
> Not very often anymore.

The main difference between an octet and a byte is that a "byte" is used
when talking about computers while an "octet" is a term used in
telecommunications protocols.

(If I'm not mistaken, "octet" is also French for "byte".)

Somewhat analogously, a programmer can rely on integers being
2's-complement. IOW, even in Python,

-X == ~X + 1

for any integer.
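
The identity is easy to spot-check. Python integers are unbounded, but the `~` operator behaves as if it worked on an infinitely sign-extended two's-complement representation, so the check holds at any magnitude. The sample values below are arbitrary:

```python
# Spot-check -x == ~x + 1 for small, negative, and very large integers.
samples = (0, 1, -1, 255, -256, 2**64, -(2**64) - 7)

for x in samples:
    assert -x == ~x + 1, x

print("identity holds for all samples")
```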

2's-complement arithmetic is quite often taken advantage of in C
programming. Unfortunately, with the castration of signed integers with
the most recent C standards, 2's-complement has been dangerously broken.


Marko

Random832

Aug 20, 2016, 2:10:23 PM
On Sat, Aug 20, 2016, at 03:50, Marko Rauhamaa wrote:
> 2's-complement arithmetic is quite often taken advantage of in C
> programming. Unfortunately, with the castration of signed integers with
> the most recent C standards, 2's-complement has been dangerously broken.

No part of any version of the C standard has ever allowed signed integer
overflow to work as defined behavior the way a generation of programmers
assumed it did. What changed was advances in compiler optimization
technology, not a standards change.

Grant Edwards

Aug 20, 2016, 3:55:50 PM
On 2016-08-19, Random832 <rand...@fastmail.com> wrote:
> On Fri, Aug 19, 2016, at 16:51, Lawrence D’Oliveiro wrote:
> Not very often anymore. Used to be some systems had 9-bit bytes, and of
> course a lot of communication protocols only supported 7-bit data bytes.
> "Byte" is a technical term in the C and C++ standards meaning the
> smallest addressable unit even if that is a larger word.

Last time I looked a lot of DSP chips still have "byte" sizes larger
than 8 bits. IIRC, 16, 24, and 32 bits are common byte sizes.

Not that Python runs on any of them...

--
Grant Edwards               grant.b.edwards        Yow! A can of ASPARAGUS,
                                  at               73 pigeons, some LIVE ammo,
                              gmail.com            and a FROZEN DAQUIRI!!

Marko Rauhamaa

Aug 20, 2016, 4:36:25 PM
Random832 <rand...@fastmail.com>:

> On Sat, Aug 20, 2016, at 03:50, Marko Rauhamaa wrote:
>> 2's-complement arithmetic is quite often taken advantage of in C
>> programming. Unfortunately, with the castration of signed integers
>> with the most recent C standards, 2's-complement has been dangerously
>> broken.
>
> No part of any version of the C standard has ever allowed signed
> integer overflow to work as defined behavior the way a generation of
> programmers assumed it did.

Standard or no, it was widely taken advantage of and very useful.

> What changed was advances in compiler optimization technology, not a
> standards change.

I wonder how much is gained by those optimizations. The loss to code
quality is significant. What C standards have done is they have all but
deprecated the use of signed integers. If you have to do integer
arithmetic, cast everything into unsigned first.

I think it's terrible that in C,

x + y + z

might not yield

x + y + z

even if all of

{ x, y, z, x + y + z }

are inside the valid signed integer range.
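
A concrete instance of the complaint, sketched in Python under the assumption of a 32-bit C int. Python integers never overflow, so this only checks which values fall inside the C range; it does not reproduce C's behavior. In C the expression is evaluated as (x + y) + z, and the intermediate x + y is where the overflow (undefined behavior for signed int) would occur:

```python
# Assumed range of a 32-bit two's-complement C int.
INT_MIN = -(2**31)
INT_MAX = 2**31 - 1


def fits_in_int(value):
    """Return True if value is representable as the assumed 32-bit int."""
    return INT_MIN <= value <= INT_MAX


x, y, z = INT_MAX, 1, -1

# The operands and the final mathematical sum are all representable...
print(all(fits_in_int(v) for v in (x, y, z, x + y + z)))  # True
# ...but the intermediate sum x + y is not.
print(fits_in_int(x + y))  # False
```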


Marko

Jon Ribbens

Aug 21, 2016, 5:25:00 PM
On 2016-08-19, Larry Martell <larry....@gmail.com> wrote:
> fd.write(request.POST[key])

You could try:

request.encoding = "iso-8859-1"
fd.write(request.POST[key].encode("iso-8859-1"))

It's hacky and nasty and there might be a better "official" method
but I think it should work.
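
The reason ISO-8859-1 is a candidate for this kind of workaround is that it maps every byte value 0-255 to the code point with the same number, so a decode followed by an encode is lossless. The sketch below shows only that codec property; it says nothing about how Django handles `request.encoding`:

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(bytearray(range(256)))

# ISO-8859-1 decodes each byte to the code point with the same number...
as_text = all_bytes.decode("iso-8859-1")

# ...so encoding with the same codec gives back the original bytes exactly.
print(as_text.encode("iso-8859-1") == all_bytes)  # True
```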

Larry Martell

Aug 22, 2016, 10:38:01 AM
On Fri, Aug 19, 2016 at 4:24 PM, Chris Kaynor <cka...@zindagigames.com> wrote:
> On Fri, Aug 19, 2016 at 12:00 PM, Larry Martell <larry....@gmail.com>
Thanks for the reply. When I get the request, request.FILES is empty.
Yet the content type is multipart/form-data and the method is POST:

(Pdb) print request.META['CONTENT_TYPE']
multipart/form-data;
boundary="boundary_.oOo._NzEwNjIzMTM4MTI4NjUxOTM5OQ==MTY2NjE4MDk5Nw=="

(Pdb) print request.META['REQUEST_METHOD']
POST

(Pdb) print request.FILES
<MultiValueDict: {}>

Tried setting request.encoding, but that messes up the request structure:

(Pdb) type(request.POST[key])
<type 'unicode'>
(Pdb) request.encoding = "iso-8859-1"
(Pdb) type(request.POST[key])
*** MultiValueDictKeyError:
"u'right-carotidartery:63B2E474-D690-445F-B92A-31EBADDC9D93.png'"

Larry Martell

Aug 22, 2016, 10:38:55 AM
For some reason that messes up the request structure:

Larry Martell

Aug 22, 2016, 12:51:33 PM
> Tried setting request.encoding, but that messes up the request structure:
>
> (Pdb) type(request.POST[key])
> <type 'unicode'>
> (Pdb) request.encoding = "iso-8859-1"
> (Pdb) type(request.POST[key])
> *** MultiValueDictKeyError:
> "u'right-carotidartery:63B2E474-D690-445F-B92A-31EBADDC9D93.png'"


Thanks to everyone who replied. I solved this by changing the
client to set the file and filename fields in each part of the multipart
request, and then I was able to iterate through request.FILES and
successfully write the files as PNG.

Larry Martell

Aug 22, 2016, 1:22:39 PM
Many, many years ago, probably c. 1982, my Dad came into my house and
saw a Byte Magazine lying on the coffee table. He asked "What is a
byte?" I replied "Half a word." He then asked "What is the other half
of the word?" I said "That is also a byte." He thought for a moment,
then said "So the full word is 'byte byte'?"

Jon Ribbens

Aug 22, 2016, 1:25:26 PM
Sounds like you should be filing a bug report with Django.

Wildman

Aug 22, 2016, 3:46:14 PM
LOL! Did you explain to him that a full word could also be
'nibble nibble nibble nibble' or 'bit bit bit bit bit bit
bit bit bit bit bit bit bit bit bit bit'?

--
<Wildman> GNU/Linux user #557453
The cow died so I don't need your bull!

Larry Martell

Aug 22, 2016, 5:24:45 PM
Turns out that is expected behavior -- request.encoding clears
existing GET/POST data. Not sure how it's useful to set that then.

Jon Ribbens

Aug 22, 2016, 5:29:08 PM
Yeah it clears the cached parsed data, but when you fetch a value
afterwards it should automatically recalculate the data, not fail.