SQLite: OperationalError when decoding 0x92 in TEXT column

459 views
Skip to first unread message

Erich Blume

unread,
Feb 4, 2014, 5:15:47 AM2/4/14
to sqlal...@googlegroups.com
I am working on a binding to a SQLite database that I do not control the creation of, with the aid of reflection. I'm running in to what I believe are very basic UTF-8 decoding errors. For instance, a TEXT cell has the byte '0x92' in it and is causing an OperationalError. Presumably, this is because 0x92 (by itself) is not a valid encoding for any Unicode code point. I would prefer that the decoding from UTF-8 to be forced, perhaps by dropping the bad byte. How can I do this?

The database has a table with a column called 'description', which is of type TEXT. The "PRAGMA encoding" is left at 'UTF-8', thank goodness. One of the rows, however, contains within its otherwise ascii byte contents the singleton byte '0x92'. Based on the context of the sentence, it seems that this was intended to be encoded as a single quotation mark, some googling suggests 'RIGHT SINGLE QUOTATION MARK' in unicode, which is '0xE2 0x80 0x99'. I gather that MSSQL (which was the original source of the data in this database) uses Microsofts' infernal web encodings sometimes and that is probably the source of this byte.

The issue is this: I really need to read this data! It would be *ideal* to have the aid of something like python's 'replace' decoding handler but failing that just eliding the byte would do fine in a pinch.

When fetching this row in Python 3.3 with SQLAlchemy 0.9.1 my session looks vaguely like this (with the text and stack trace truncated out for brevity).

      File "/usr/local/Cellar/python3/3.3.3/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/sqlalchemy/engine/result.py", line 760, in <listcomp>
        return [process_row(metadata, row, processors, keymap)
      sqlalchemy.exc.OperationalError: (OperationalError) Could not decode to UTF-8 column 'description' with text <...>

Is there some way to accomplish this?

Thanks!

Simon King

unread,
Feb 4, 2014, 6:26:31 AM2/4/14
to sqlal...@googlegroups.com
The String-related column types have a "unicode_error" parameter which
sounds like it might be what you want:

http://docs.sqlalchemy.org/en/rel_0_9/core/types.html#sqlalchemy.types.String.params.unicode_error

Note the various warnings around it though...

Hope that helps,

Simon

Erich Blume

unread,
Feb 4, 2014, 6:28:47 AM2/4/14
to sqlal...@googlegroups.com
Thanks Simon,

Do you know how I might use that with reflection? There's several hundred of these columns, I'd hate to have to override each one individually - that sort of defeats the purpose of reflection.

One thought I just had was perhaps I could subclass the Text type and then override the ischema_names for SQLite for TEXT type. That'd do the trick, I suspect!



--
You received this message because you are subscribed to a topic in the Google Groups "sqlalchemy" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/sqlalchemy/T--Ftk5EVZg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to sqlalchemy+...@googlegroups.com.
To post to this group, send email to sqlal...@googlegroups.com.
Visit this group at http://groups.google.com/group/sqlalchemy.
For more options, visit https://groups.google.com/groups/opt_out.

Simon King

unread,
Feb 4, 2014, 7:33:46 AM2/4/14
to sqlal...@googlegroups.com
I've not done much with reflection, but perhaps you could use the
column_reflect event:

http://docs.sqlalchemy.org/en/rel_0_8/core/events.html#sqlalchemy.events.DDLEvents.column_reflect

Simon
> You received this message because you are subscribed to the Google Groups
> "sqlalchemy" group.
> To unsubscribe from this group and stop receiving emails from it, send an

Erich Blume

unread,
Feb 6, 2014, 6:59:03 PM2/6/14
to sqlal...@googlegroups.com
Hmm, this one has me stumped. As best I can tell after poking at it using the column_reflect event, a custom dialect, etc. - the issue here is that in pysqlite.py we (in my Python 3.3 install) are selecting `sqlite3.dbapi2` as the dbapi interface, but we aren't telling sqlite3 anything about how to treat unicode errors. From what I am reading (but it seems inconsistent, maybe?) sqlite3 automatically decodes all database retrieved values from their bytes for text fields, returning unicode strings. Except... that doesn't always seem to be true. I hex-edited a db file to change the utf-8 string "hello" to "hell" + 0x92 and sqlite3 switched from returning "hello" to b"hell\x92", or something like that - I've been poking at this for so long I've lost track of that transcript.

One can override sqlite3's text factory, apparently, with (for instance) `sqlite3.text_factory = lambda x: x.decode('utf-8', errors='ignore')`. Maybe the key is to try and find a way to trigger that from sqlalchemy? I tried and failed, maybe someone else can point me back to the path?

Just to re-summarize the problem: In python 3, I'm getting errors trying to read a row from a sqlite database that has a TEXT column with an invalid utf-8 sequence (specifically, the singleton bye '0x92'). I'd love to just have sqlalchemy move along and ignore the byte, but I'm not clear how to do that.


On Tuesday, February 4, 2014 4:33:46 AM UTC-8, Simon King wrote:
I've not done much with reflection, but perhaps you could use the
column_reflect event:
<snip>

Michael Bayer

unread,
Feb 6, 2014, 7:19:47 PM2/6/14
to sqlal...@googlegroups.com

On Feb 6, 2014, at 6:59 PM, Erich Blume <blume...@gmail.com> wrote:

> Hmm, this one has me stumped. As best I can tell after poking at it using the column_reflect event, a custom dialect, etc. - the issue here is that in pysqlite.py we (in my Python 3.3 install) are selecting `sqlite3.dbapi2` as the dbapi interface, but we aren't telling sqlite3 anything about how to treat unicode errors. From what I am reading (but it seems inconsistent, maybe?) sqlite3 automatically decodes all database retrieved values from their bytes for text fields, returning unicode strings. Except... that doesn't always seem to be true. I hex-edited a db file to change the utf-8 string "hello" to "hell" + 0x92 and sqlite3 switched from returning "hello" to b"hell\x92", or something like that - I've been poking at this for so long I've lost track of that transcript.
>
> One can override sqlite3's text factory, apparently, with (for instance) `sqlite3.text_factory = lambda x: x.decode('utf-8', errors='ignore')`. Maybe the key is to try and find a way to trigger that from sqlalchemy? I tried and failed, maybe someone else can point me back to the path?
>
> Just to re-summarize the problem: In python 3, I'm getting errors trying to read a row from a sqlite database that has a TEXT column with an invalid utf-8 sequence (specifically, the singleton bye '0x92'). I'd love to just have sqlalchemy move along and ignore the byte, but I'm not clear how to do that.

Pysqlite (e.g. sqlite3 module) returns TEXT as Python unicode out of the gate. That exception message is being raised by sqlite3 itself, SQLAlchemy is just a pass-through, as the string type knows on sqlite that the value is already unicode.

you might need to CAST() the value as BINARY perhaps, not sure. You’d first want to get a plain sqlite3 script to do what you want. Setting a “text_factory” at the module level of sqlite3 is certainly easy enough but that seems way too broad. Ideally you’d want to be able to get the value on a per-column basis.


signature.asc

Erich Blume

unread,
Feb 9, 2014, 5:34:53 AM2/9/14
to sqlal...@googlegroups.com
Thanks for the help on this, everyone! I found two ways to deal with this and figured I should share in case it comes up in the future.

The first approach, the one I went with (because in my case, fidelity was not as important) was to alter the 'text_factory' the sqlite3 uses. One trick here was that to access this parameter for a sqlalchemy connection object 'conn' as returned from an engine, I had to do it like so:

    conn.connection.connection.text_factory

The extra indirection on .connection* was because the connection's internal dbapi connection was in fact a proy '_ConnectionFairy' instance. A small issue but one that's difficult to notice at first as it raises no error to set a new property on an object that will never use that property.

The second approach I found was more involved but would be more appropriate for large projects that didn't want to have this behavior on every column. In this approach you still change the text_factory of sqlite3's connection, but instead change it to `bytes` (or perhaps an identity? ie lambda x: x - not sure what is best). This will cause sqlite3 to return encoded utf-8 bytes instead of unicode strings.

Then, you have to tell SQLAlchemy to convert these strings to unicode. I did not persue this approach far enough to find the right set of arguments but I imagine this would be very simple - set 'force_unicode' to True, I suspect, would be all you would need.

Finally, for the column with the invalid utf-8 sequences, just also set the `unicode_error` to your preferred resolution strategy - usually 'ignore' or 'replace'.

I suppose it is possible that this could incur a performance penalty - the sqlite3 de/encoding process is done in a compiled C module and as such could possibly be faster than using native python for the task. I suspect though that the module is just calling to the usual Python library functions for encoding/decoding (but did not check). So that may be of a concern if you follow this approach.

Thanks again!

Michael Bayer

unread,
Feb 9, 2014, 10:06:11 AM2/9/14
to sqlal...@googlegroups.com

On Feb 9, 2014, at 5:34 AM, Erich Blume <blume...@gmail.com> wrote:

>
> Then, you have to tell SQLAlchemy to convert these strings to unicode. I did not persue this approach far enough to find the right set of arguments but I imagine this would be very simple - set 'force_unicode' to True, I suspect, would be all you would need.

if the SQLite connection is set up to return bytes ahead of when SQLAlchemy does anything with the connection, this will be automatic. DBAPIs vary so much in this regard that we test the connection when the dialect first connects. but if you’re using connection events to achieve this, the event needs to be set in a certain way at the moment to make sure you get the connection before sqlalchemy does anything with it, there’s a ticket to document that.

>
> Finally, for the column with the invalid utf-8 sequences, just also set the `unicode_error` to your preferred resolution strategy - usually 'ignore' or 'replace'.
>
> I suppose it is possible that this could incur a performance penalty - the sqlite3 de/encoding process is done in a compiled C module and as such could possibly be faster than using native python for the task.

SQLAlchemy’s C extensions do the encoding/decoding and this process has been enhanced in 0.9 to also take the job of an expensive and sometimes-necessary “check if it’s already unicode” step. I’ve already observed that SQLA’s C exts seem to be faster than MySQLdb’s “use_unicode”, Postgresqls unicode extension (which is unfortunate, we use that anyway) and using a unicode type with a cx_oracle outputtypehandler (which we’ve also stopped using as users complained about performance).
signature.asc
Reply all
Reply to author
Forward
0 new messages