Problem with psycopg2, bytea, and memoryview

Frank Millman

unread,

Jul 31, 2013, 5:44:01 AM7/31/13

to pytho...@python.org

Hi all

I don't know if this question is more appropriate for the psycopg2 list, but
I thought I would ask here first.

I have some binary data (a gzipped xml object) that I want to store in a
database. For PostgreSQL I use a column with datatype 'bytea', which is
their recommended way of storing binary strings.

I use psycopg2 to access the database. It returns binary data in the form of
a python 'memoryview'.

My problem is that, after a roundtrip to the database and back, the object
no longer compares equal to the original.

>>> memoryview(b'abcdef') == b'abcdef'
True
>>> cur.execute('create table fmtemp (code int, xml bytea)')
>>> cur.execute('insert into fmtemp values (%s, %s)', (1, b'abcdef'))
>>> cur.execute('select * from fmtemp where code =1')
>>> row = cur.fetchone()
>>> row
(1, <memory at 0xb725f77c>)
>>> row[1] == b'abcdef'
False
>>> row[1].tobytes() == b'abcdef'
True
>>>

Using MS SQL Server and pyodbc, it returns a byte string, not a memoryview,
and it does compare equal with the original.

I can hack my program to use tobytes(), but it would add complication, and
it would be database-specific. I would prefer a cleaner solution.

Does anyone have any suggestions?

Versions - Python: 3.3.2 PostgreSQL: 9.2.4 psycopg2: 2.5

Frank Millman

Antoine Pitrou

unread,

Jul 31, 2013, 5:50:37 AM7/31/13

to pytho...@python.org

Frank Millman <frank <at> chagford.com> writes:
>
> I have some binary data (a gzipped xml object) that I want to store in a
> database. For PostgreSQL I use a column with datatype 'bytea', which is
> their recommended way of storing binary strings.
>
> I use psycopg2 to access the database. It returns binary data in the form of
> a python 'memoryview'.
>

[...]

>
> Using MS SQL Server and pyodbc, it returns a byte string, not a memoryview,
> and it does compare equal with the original.
>
> I can hack my program to use tobytes(), but it would add complication, and
> it would be database-specific. I would prefer a cleaner solution.

Just cast the result to bytes (`bytes(row[1])`). It will work both with bytes
and memoryview objcts.

Regards

Antoine.

Frank Millman

unread,

Jul 31, 2013, 7:43:05 AM7/31/13

to pytho...@python.org

"Antoine Pitrou" <soli...@pitrou.net> wrote in message
news:loom.2013073...@post.gmane.org...

Thanks for that, Antoine. It is an improvement over tobytes(), but i am
afraid it is still not ideal for my purposes.

At present, I loop over a range of columns, comparing 'before' and 'after'
values, without worrying about their types. Strings are returned as str,
integers are returned as int, etc. Now I will have to check the type of each
column before deciding whether to cast to 'bytes'.

Can anyone explain *why* the results do not compare equal? If I understood
the problem, I might be able to find a workaround.

Frank

Antoine Pitrou

unread,

Jul 31, 2013, 9:07:33 AM7/31/13

to pytho...@python.org

Frank Millman <frank <at> chagford.com> writes:
>
> Thanks for that, Antoine. It is an improvement over tobytes(), but i am
> afraid it is still not ideal for my purposes.

I would suggest asking the psycopg2 project why they made this choice, and
if they would reconsider. Returning a memoryview doesn't make much sense IMHO.

For example, the standard sqlite3 module returns bytes for BLOB columns,
and str for TEXT columns:
http://docs.python.org/3.4/library/sqlite3.html#introduction

> Can anyone explain *why* the results do not compare equal? If I understood
> the problem, I might be able to find a workaround.

Well, under recent Python versions, they should compare equal:

Python 3.2.3 (default, Oct 19 2012, 19:53:16)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> memoryview(b"abc") == b"abc"
True

Regards

Antoine.

Frank Millman

unread,

Jul 31, 2013, 9:41:41 AM7/31/13

to pytho...@python.org

"Antoine Pitrou" <soli...@pitrou.net> wrote in message
news:loom.2013073...@post.gmane.org...

> Frank Millman <frank <at> chagford.com> writes:
>>
>> Thanks for that, Antoine. It is an improvement over tobytes(), but i am
>> afraid it is still not ideal for my purposes.
>
> I would suggest asking the psycopg2 project why they made this choice, and
> if they would reconsider. Returning a memoryview doesn't make much sense
> IMHO.
>

I'll try it, and see what they say.

> For example, the standard sqlite3 module returns bytes for BLOB columns,
> and str for TEXT columns:
> http://docs.python.org/3.4/library/sqlite3.html#introduction
>
>> Can anyone explain *why* the results do not compare equal? If I
>> understood
>> the problem, I might be able to find a workaround.
>
> Well, under recent Python versions, they should compare equal:
>
> Python 3.2.3 (default, Oct 19 2012, 19:53:16)
> [GCC 4.7.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> memoryview(b"abc") == b"abc"
> True
>

I am using Python 3.3.2.

If I try your example above, it does work.

However, for some reason, after a round-trip to the server, they do not
compare equal.

See my original post for a full example.

Frank

Neil Cerutti

unread,

Jul 31, 2013, 10:08:44 AM7/31/13

to

On 2013-07-31, Frank Millman <fr...@chagford.com> wrote:
>
> "Antoine Pitrou" <soli...@pitrou.net> wrote in message
> news:loom.2013073...@post.gmane.org...
>> Frank Millman <frank <at> chagford.com> writes:
>>>
>>> I have some binary data (a gzipped xml object) that I want to store in a
>>> database. For PostgreSQL I use a column with datatype 'bytea', which is
>>> their recommended way of storing binary strings.
>>>
>>> I use psycopg2 to access the database. It returns binary data
>>> in the form of a python 'memoryview'.
>>>
>> [...]
>>>
>>> Using MS SQL Server and pyodbc, it returns a byte string, not
>>> a memoryview, and it does compare equal with the original.
>>>
>>> I can hack my program to use tobytes(), but it would add
>>> complication, and it would be database-specific. I would
>>> prefer a cleaner solution.
>>
>> Just cast the result to bytes (`bytes(row[1])`). It will work
>> both with bytes and memoryview objcts.
>

> Thanks for that, Antoine. It is an improvement over tobytes(),
> but i am afraid it is still not ideal for my purposes.
>
> At present, I loop over a range of columns, comparing 'before'
> and 'after' values, without worrying about their types. Strings
> are returned as str, integers are returned as int, etc. Now I
> will have to check the type of each column before deciding
> whether to cast to 'bytes'.
>
> Can anyone explain *why* the results do not compare equal? If I
> understood the problem, I might be able to find a workaround.

A memoryview will compare equal to another object that supports
the buffer protocol when the format and shape are also equal. The
database must be returning chunks of binary data in a different
shape or format than you are writing it.

Perhaps psycopg2 is returning a chunk of ints when you have
written a chunk of bytes. Check the .format and .shape members of
the return value to see.

>>> x = memoryview(b"12345")
>>> x.format
'B'
>>> x.shape
(5,)
>>> x == b"12345"
True

My guess is you're getting format "I" from psycopg2. Hopefully
there's a way to coerce your desired "B" format interpretation of
the raw data using psycopg2's API.

--
Neil Cerutti

Terry Reedy

unread,

Jul 31, 2013, 1:59:20 PM7/31/13

to pytho...@python.org

On 7/31/2013 9:07 AM, Antoine Pitrou wrote:
> Frank Millman <frank <at> chagford.com> writes:
>>
>> Thanks for that, Antoine. It is an improvement over tobytes(), but i am
>> afraid it is still not ideal for my purposes.
>

> I would suggest asking the psycopg2 project why they made this choice, and
> if they would reconsider. Returning a memoryview doesn't make much sense IMHO.

I agree.
"memoryview objects allow Python code to access the internal data of an
object that supports the buffer protocol without copying."
Example: the binary image data of an image object.
They are not intended to be a standalone objects when there is an
obvious alternative (in this case, bytes).

--
Terry Jan Reedy

dieter

unread,

Aug 1, 2013, 2:05:37 AM8/1/13

to pytho...@python.org

"Frank Millman" <fr...@chagford.com> writes:

> ...

> At present, I loop over a range of columns, comparing 'before' and 'after'
> values, without worrying about their types. Strings are returned as str,
> integers are returned as int, etc. Now I will have to check the type of each
> column before deciding whether to cast to 'bytes'.

Of course, you could implement your own "equality" function ("my_equal")
and replace "a == b" by "my_equal(a, b)".

In "my_equal", you could encapsulate whatever logic is necessary
for a reliable equality check.

Frank Millman

unread,

Aug 1, 2013, 4:03:03 AM8/1/13

to pytho...@python.org

"Neil Cerutti" <ne...@norwich.edu> wrote in message
news:b5sk3c...@mid.individual.net...

Thanks very much for the explanation, Neil.

I tried what you suggested, and the object returned by psycopg2 has a format
of 'c' and a shape of (5,).

I don't know what it means, but luckily I have found a workaround. I
enquired on the psycopg2 list, and someone explained how I can create an
extension that forces it to return 'bytes' instead of a 'memoryview'. I
tested it and it works. Problem solved :-)

For the record, I passed on the suggestion from Antoine and Terry that they
change their program to return 'bytes'. It will be interesting to see if
anyone responds.

Thanks again to all for your help.

Frank

Frank Millman

unread,

Aug 2, 2013, 2:23:01 AM8/2/13

to pytho...@python.org

"Terry Reedy" <tjr...@udel.edu> wrote in message
news:ktbj9i$4au$1...@ger.gmane.org...

For the record, I forwarded this to the psycopg2 list, and got the following
reply from Daniele Varrazzo -

<reply>
Hi Frank, thank you for forwarding the thread.

Until a not very long time ago, this was exactly the case: upon
reading bytea we were handed over some data to pass to a libpq
function to decode. The resulting decoded string was to be released by
PQfreemem, and the memoryview was the right object to reconcile the
lifetime of the python object with this deallocation requirement.
Later things have changed: because of the change in bytea format in
PostgreSQL 9.0 (the new format was the default and wasn't handled by
libpq < 9.0, which created widespread problems) we wrote our own
parser. As a consequence the memoryview is not really needed anymore,
but we preferred to avoid breaking programs already using the current
interface.

So yes: bytes (and str on Py2) is definitely a better object to get in
Python. It is already on the list of what I want in a version that
would break compatibility in order to gain feature or improve the
adapter in other ways (e.g. psycopg3).
</reply>

Frank