PEP 249 Compliant error handling

Israel Brewster

Oct 17, 2017, 1:43:16 PM

I have written and maintain a PEP 249 compliant (hopefully) DB API for the 4D database, and I've run into a situation where corrupted string data from the database can cause the module to error out. Specifically, when decoding the string, I get a "UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 86-87: illegal UTF-16 surrogate" error. This makes sense, given that the string data got corrupted somehow, but the question is "what is the proper way to deal with this in the module?" Should I just throw an error on bad data? Or would it be better to set the errors parameter to something like "replace"? The former feels a bit more "proper" to me (there's an error here, so we throw an error), but leaves the end user dead in the water, with no way to retrieve *any* of the data (from that row at least, and perhaps any rows after it as well). The latter option sort of feels like sweeping the problem under the rug, but does at least leave an error character in the string to let them know there was an error, and will allow retrieval of any good data.

Of course, if this were in my own code I could decide on a case-by-case basis what the proper action is, but since this is a module that has to work in any situation, it's a bit more complicated.
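
For reference, the two options weighed above can be compared with a minimal Python 3 sketch. The sample bytes are an assumption made up for illustration (the text "ok", an unpaired high surrogate, then "x"); they are not data from the 4D database.

```python
# Assumed sample: "ok", then an unpaired high surrogate (U+D800), then "x".
raw = "ok".encode("utf-16-le") + b"\x00\xd8" + "x".encode("utf-16-le")


def decode_strict(data):
    """Default behaviour: any malformed sequence raises UnicodeDecodeError."""
    return data.decode("utf-16-le")  # errors="strict" is the default


def decode_replace(data):
    """Lenient behaviour: malformed sequences are replaced with U+FFFD."""
    return data.decode("utf-16-le", errors="replace")


try:
    decode_strict(raw)
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)

print("replace:", repr(decode_replace(raw)))
```

With "strict" nothing from the field is returned; with "replace" the surrounding good text survives and the replacement character marks where the damage was.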
-----------------------------------------------
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
-----------------------------------------------

MRAB

Oct 17, 2017, 2:38:47 PM

If a particular text field is corrupted, then raising UnicodeDecodeError
when trying to get the contents of that field as a Unicode string seems
reasonable to me.

Is there a way to get the contents as a bytestring, or to get the
contents with a different errors parameter, so that the user has the
means to fix it (if it's fixable)?

Israel Brewster

Oct 17, 2017, 3:25:59 PM

That's certainly a possibility, if that behavior conforms to the DB API "standards". My concern on this front is that in my experience working with other PEP 249 modules (specifically psycopg2), I'm pretty sure that columns designated as type VARCHAR or TEXT are returned as strings (unicode in Python 2, although that may have been a setting I used), not bytes. The other complication here is that the 4D database doesn't use the UTF-8 encoding typically found, but rather UTF-16LE, and I don't know how well this is documented. So not only is the bytes representation completely unintelligible for human consumption, I'm not sure the average end-user would know what decoding to use.

In the end though, the main thing in my mind is to maintain "standards" compatibility - I don't want to be returning bytes if all other DB API modules return strings, or vice versa for that matter. There may be some flexibility there, but as much as possible I want to conform to the majority/standard/whatever.


Karsten Hilbert

Oct 17, 2017, 3:38:41 PM

The thing here is that you don't want to return data AS IF it was correct despite it having been
"corrected" by some driver logic.

What might be interesting to users is to set an attribute on the cursor, say,

cursor.faulty_data = unicode(faulty_data, errors='replace')

or some such in order to improve error messages to the end user.
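
A minimal Python 3 sketch of that idea follows. The class `SketchCursor`, its `_decode_text` method and the sample bytes are hypothetical names made up for illustration, and `faulty_data` is a non-standard attribute, not something PEP 249 defines.

```python
class SketchCursor:
    """Hypothetical cursor fragment; not the real 4D driver."""

    encoding = "utf-16-le"

    def __init__(self):
        # Non-standard attribute holding a lossy rendering of the last bad field.
        self.faulty_data = None

    def _decode_text(self, raw):
        try:
            return raw.decode(self.encoding)
        except UnicodeDecodeError:
            # Keep an approximate version for error messages, then re-raise
            # so the caller never mistakes it for correct data.
            self.faulty_data = raw.decode(self.encoding, errors="replace")
            raise


cursor = SketchCursor()
bad = "name".encode("utf-16-le") + b"\x00\xd8" + "!".encode("utf-16-le")
try:
    cursor._decode_text(bad)
except UnicodeDecodeError:
    print("could not decode field; approximate content:", repr(cursor.faulty_data))
```

The error still propagates, so nothing "corrected" is returned as if it were good; the attribute only feeds the error message.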

Karsten

MRAB

Oct 17, 2017, 4:02:46 PM

The average end-user might not know which encoding is being used, but
providing a way to read the underlying bytes will give a more
experienced user the means to investigate and possibly fix it: get the
bytes, figure out what the string should be, update the field with the
correctly decoded string using normal DB instructions.
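
As a rough illustration of the "investigate" step, here is a Python 3 sketch that uses the attributes of `UnicodeDecodeError` (`start`, `end`, `reason`) to show where the raw bytes go wrong. The helper name `describe_corruption` and the sample bytes are assumptions for illustration; how the raw bytes would be obtained from the driver is left open.

```python
def describe_corruption(raw, encoding="utf-16-le", context=4):
    """Return None if raw decodes cleanly, else details of the first bad span."""
    try:
        raw.decode(encoding)
        return None
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        hi = min(exc.end + context, len(raw))
        return {
            "start": exc.start,
            "end": exc.end,
            "reason": exc.reason,
            "bad_bytes": raw[exc.start:exc.end],
            "nearby_bytes": raw[lo:hi],
        }


# Assumed sample: "abc", an unpaired high surrogate, then "def".
raw = "abc".encode("utf-16-le") + b"\x00\xd8" + "def".encode("utf-16-le")
info = describe_corruption(raw)
print(info)
```

Once the intended string has been worked out, it can be written back with an ordinary parameterized UPDATE through the same driver.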

Abdur-Rahmaan Janhangeer

Oct 18, 2017, 5:47:17 AM

I would systematically ignore all corruption, but log the corrupted piece of data for analysis.

Abdur-Rahmaan Janhangeer,
Mauritius
abdurrahmaanjanhangeer.wordpress.com

Israel Brewster

Oct 18, 2017, 12:23:16 PM

> On Oct 18, 2017, at 1:46 AM, Abdur-Rahmaan Janhangeer <arj.p...@gmail.com> wrote:
>
> all corruption systematically ignored but data piece logged in for analysis

Thanks. Can you expound a bit on what you mean by "data piece logged in" in this context? I'm not aware of any logging specifications in PEP 249, and would think that would be more end-user configured rather than module level.
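
One possible reading of the suggestion, sketched in Python 3: the module logs through the standard `logging` package but installs only a `NullHandler`, so whether and where the messages appear remains up to the end user. The logger name `fourd_driver_sketch` and the function `decode_and_log` are hypothetical, and this is an assumption about what was meant, not behavior required by PEP 249.

```python
import logging

# Library side: a named logger with a NullHandler, so the module itself
# does not force any logging configuration on the application.
logger = logging.getLogger("fourd_driver_sketch")  # hypothetical logger name
logger.addHandler(logging.NullHandler())


def decode_and_log(raw, encoding="utf-16-le"):
    """Decode leniently, but record the corrupt bytes for later analysis."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        logger.warning(
            "corrupt text field (%s at bytes %d-%d): %r",
            exc.reason, exc.start, exc.end - 1, raw,
        )
        return raw.decode(encoding, errors="replace")


# Application side: the end user decides whether the warnings are shown.
logging.basicConfig(level=logging.WARNING)
text = decode_and_log("id".encode("utf-16-le") + b"\x00\xd8" + "7".encode("utf-16-le"))
print(repr(text))
```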


Israel Brewster

Oct 18, 2017, 12:33:06 PM

On Oct 17, 2017, at 12:02 PM, MRAB <pyt...@mrabarnett.plus.com> wrote:
> The average end-user might not know which encoding is being used, but providing a way to read the underlying bytes will give a more experienced user the means to investigate and possibly fix it: get the bytes, figure out what the string should be, update the field with the correctly decoded string using normal DB instructions.

I agree, and if I were just writing some random module I'd probably go with it, or perhaps with the suggestion offered by Karsten Hilbert. However, neither answer addresses my actual question, which is "how does the STANDARD (PEP 249 in this case) say to handle this, or, barring that (since the standard probably doesn't explicitly say), how do the MAJORITY of PEP 249 compliant modules handle this?" Not what is the *best* way to handle it, but rather what is the normal, expected behavior for a Python DB API module when presented with bad data? That is, how does psycopg2 behave? pyodbc? pymssql (I think)? Etc. Or is that portion of the behavior completely arbitrary and different for every module?

It may well be that one of the suggestions *IS* the normal, expected, behavior, but it sounds more like you are suggesting how you think would be best to handle it, which is appreciated but not actually what I'm asking :-) Sorry if I am being difficult.


Karsten Hilbert

Oct 18, 2017, 3:47:16 PM

On Wed, Oct 18, 2017 at 08:32:48AM -0800, Israel Brewster wrote:

> actual question, which is "how does the STANDARD (PEP 249 in
> this case) say to handle this, or, barring that (since the
> standard probably doesn't explicitly say), how do the
> MAJORITY of PEP 249 compliant modules handle this?" Not what
> is the *best* way to handle it, but rather what is the
> normal, expected behavior for a Python DB API module when
> presented with bad data? That is, how does psycopg2 behave?
> pyodbc? pymssql (I think)? Etc. Or is that portion of the
> behavior completely arbitrary and different for every module?

For what it is worth, psycopg2 does not give you bad data to
the best of my knowledge. In fact, given PostgreSQL's quite
tight checking of text data to be "good" psycopg2 hardly has
a chance to give you bad data. Most times the database itself
will detect the corruption and not even hand the data to
psycopg2.

IMHO a driver should not hand over to the client any bad data
unless explicitly told to do so, a case which does not seem
to be covered by the DB-API specs, regardless of what the
majority of drivers might do these days.
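
For context, PEP 249 does define an exception hierarchy that includes `DataError` (a subclass of `DatabaseError`) for problems with the processed data, although it does not say whether a decoding failure belongs there. The Python 3 sketch below shows one way a driver could refuse to hand over bad data while staying inside that hierarchy; the function `decode_column` and the `raw_bytes` attribute are hypothetical, and treating corrupt text as a `DataError` is an assumption, not something the spec mandates.

```python
class Error(Exception):
    """Base class of the PEP 249 exception hierarchy."""


class DatabaseError(Error):
    """PEP 249: errors related to the database."""


class DataError(DatabaseError):
    """PEP 249: errors due to problems with the processed data."""


def decode_column(raw, encoding="utf-16-le"):
    """Decode a text column strictly; wrap failures in a DB-API exception."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        error = DataError("corrupt text data: %s" % exc.reason)
        error.raw_bytes = raw  # hypothetical attribute, not part of PEP 249
        raise error from exc


bad = "row".encode("utf-16-le") + b"\x00\xd8" + "1".encode("utf-16-le")
try:
    decode_column(bad)
    caught = None
except DataError as exc:
    caught = exc
    print("DataError:", exc)
```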

2 cent...

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346

Cameron Simpson

Oct 18, 2017, 5:55:03 PM

On 17Oct2017 21:38, Karsten Hilbert <Karsten...@gmx.net> wrote:
>The thing here is that you don't want to return data AS IF it was correct despite it having been
>"corrected" by some driver logic.

I just want to say that I think this is correct and extremely important.

>What might be interesting to users is to set an attribute on the cursor, say,
> cursor.faulty_data = unicode(faulty_data, errors='replace')
>or some such in order to improve error messages to the end user.

Or perhaps more conveniently for the end user, possibly an option supplied at
db connect time, though I'd entirely understand wanting a cursor option so that
one can pick and choose in a fine grained fashion.
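
A Python 3 sketch of how the two levels could fit together, assuming a hypothetical `errors` keyword on both `connect()` and `cursor()`. PEP 249 leaves the parameters of `connect()` to the driver and specifies `cursor()` without arguments, so the per-cursor override in particular is a non-standard extension made up for illustration.

```python
class OptionCursor:
    """Hypothetical cursor that inherits or overrides the decode policy."""

    def __init__(self, connection, errors=None):
        # Fall back to the connection-wide setting unless overridden.
        self.errors = errors if errors is not None else connection.errors

    def decode_text(self, raw):
        return raw.decode("utf-16-le", errors=self.errors)


class OptionConnection:
    """Hypothetical connection carrying the default decode policy."""

    def __init__(self, errors="strict"):
        self.errors = errors

    def cursor(self, errors=None):
        return OptionCursor(self, errors)


def connect(errors="strict"):
    """Hypothetical connect(); real connection parameters are omitted."""
    return OptionConnection(errors=errors)


conn = connect()                         # strict by default
strict_cur = conn.cursor()               # inherits "strict"
lenient_cur = conn.cursor(errors="replace")  # opts in to lossy decoding

bad = "abc".encode("utf-16-le") + b"\x00\xd8" + "d".encode("utf-16-le")
print(repr(lenient_cur.decode_text(bad)))
```

Keeping "strict" as the default means bad data is only handed over when the user has explicitly asked for it.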

Cheers,
Cameron Simpson <c...@cskk.id.au> (formerly c...@zip.com.au)