Avoiding unicode in memo fields

Lucas Taylor

Feb 21, 2013, 3:04:17 PM
to python...@googlegroups.com
I am working with a FoxPro 2 dbf with an undeclared/default codepage ('\x00').  The originating application relies on control characters embedded in the memo field for print jobs, e.g. '\x8f' to start printing, '\x7f\x84' to begin bold text, and so on. These are outside the ASCII range, so whenever a memo field is fetched or written, the default behavior of encoding/decoding to unicode fails. What I want to do is declare that type 'M' memo fields should be treated as binary/general, but I can't alter the structure of the dbf or set the codepage (it would be meaningless to the originating application).

I have worked around this by creating a subclass of FpTable that overrides _field_types() and provides alternative methods for memo Retrieve and Update operations. These just get/set bytes instead of attempting to encode/decode anything.

'M': {
        'Type': 'Memo', 'Retrieve': retrieve_memo_bytes, 'Update': update_memo_bytes,
        'Blank': lambda x: '\x00\x00\x00\x00', 'Init': add_vfp_memo,
        'Class': unicode, 'Empty': unicode,
        'flags': ('binary', 'nocptrans', 'null'),
        },
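
For reference, a minimal sketch of what the two byte pass-through helpers can look like (Python 2; the argument lists are illustrative stand-ins, since the real Retrieve/Update hooks are called with dbf-internal arguments):

def retrieve_memo_bytes(data, *args):
    # hand the stored memo bytes back untouched -- no decode to unicode
    return data

def update_memo_bytes(data, *args):
    # write exactly the bytes supplied -- no encode from unicode
    return data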


My question is: is there a cleaner way to achieve the same results without this minor copy/paste surgery? Some way to declare "treat 'M' memos as bytes"?


Thanks,

Lucas

Ethan Furman

Feb 21, 2013, 9:07:51 PM
to python...@googlegroups.com
On 02/21/2013 12:04 PM, Lucas Taylor wrote:
> I am working with a FoxPro 2 dbf with an undeclared/default codepage ('\x00'). The originating application relies on
> control characters embedded in the memo field for print jobs, e.g. '\x8f' to start printing, '\x7f\x84' to begin bold
> text, and so on. These are outside the ASCII range, so whenever a memo field is fetched or written, the default
> behavior of encoding/decoding to unicode fails. What I want to do is declare that type 'M' memo fields should be
> treated as binary/general, but I can't alter the structure of the dbf or set the codepage (it would be meaningless to
> the originating application).
>
> I have worked around this by creating a subclass of FpTable that overrides _field_types() and provides alternative
> methods for memo Retrieve and Update operations. These just get/set bytes instead of attempting to encode/decode anything.
>
> 'M': {
>         'Type': 'Memo', 'Retrieve': retrieve_memo_bytes, 'Update': update_memo_bytes,
>         'Blank': lambda x: '\x00\x00\x00\x00', 'Init': add_vfp_memo,
>         'Class': unicode, 'Empty': unicode,
>         'flags': ('binary', 'nocptrans', 'null'),
>         },
>
>
> My question is: is there a cleaner way to achieve the same results without this minor copy/paste surgery? Some way to
> declare "treat 'M' memos as bytes"?

I don't remember for sure, but I don't think specifying a different code page when you open the table will save the
change to disk. In other words:

table = dbf.Table('ascii_table_with_control_chars', codepage='latin1')

will use the latin1 codepage (latin1 covers the first 256 Unicode code points), but the table on disk will still be marked ascii when you're done.
(Make a backup in case my memory is failing me!)

--
~Ethan~

Ethan Furman

Feb 28, 2013, 9:09:26 PM
to python...@googlegroups.com
Lucas,

Were you able to test this, and did it work?

Also, I just updated the dbf repository -- did you get a notification?

--
~Ethan~

Lucas Taylor

Mar 1, 2013, 1:20:46 PM
to python...@googlegroups.com
I did, but there are a few issues:

'latin1' isn't an available codepage. I thought cp1252 would be the best match, but it happens to have a few undefined/unused code points.
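
As it happens, '\x8f' (the start-printing control character) is one of those undefined code points; a quick check at the codec level (Python 2):

>>> '\x8f'.decode('cp1252')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 0: character maps to <undefined>
>>> '\x8f'.decode('latin1')
u'\x8f'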

However, it appears that regardless of the codepage specified, the ascii codec is used for decoding:

table = Table('test_cp1252', 'memo M', dbf_type='fp', codepage='cp1252')
with table:
    table.append({'memo': 'Test, NOPE' + chr(143)})
DbfError: unable to write updates to disk, original data restored: UnicodeDecodeError('ascii', 'Test, NOPE\x8f', 10, 11, 'ordinal not in range(128)')

table = Table('test_mac_roman', 'memo M', dbf_type='fp', codepage='mac_roman')
with table:
    table.append({'memo': 'Test, NOPE' + chr(143)})
DbfError: unable to write updates to disk, original data restored: UnicodeDecodeError('ascii', 'Test, NOPE\x8f', 10, 11, 'ordinal not in range(128)')


Now, in my case I don't want any decoding to occur... I just want to treat the memo as binary. I was looking for a way to pass a flag to Table(...), or somehow specify that the memo should be treated as binary.


(Yes, I did receive a notice from Bitbucket... thanks!)


Ethan Furman

Mar 1, 2013, 1:47:01 PM
to python...@googlegroups.com, et...@stoneleaf.us
On 03/01/2013 10:20 AM, Lucas Taylor wrote:
> On Feb 28, 2013, at 7:09 PM, Ethan Furman wrote:
>>
>> Were you able to test this, and did it work?
>
> I did, but there are a few issues:
>
> 'latin1' isn't an available codepage. I thought cp1252 would be the best match, but it happens to have a few
> undefined/unused code points.
>
> However, it appears that regardless of the codepage specified, the ascii codec is used for decoding:
>
> table = Table('test_cp1252', 'memo M', dbf_type='fp', codepage='cp1252')
> with table:
>     table.append({'memo': 'Test, NOPE' + chr(143)})
> DbfError: unable to write updates to disk, original data restored: UnicodeDecodeError('ascii', 'Test, NOPE\x8f', 10, 11, 'ordinal not in range(128)')
>
> table = Table('test_mac_roman', 'memo M', dbf_type='fp', codepage='mac_roman')
> with table:
>     table.append({'memo': 'Test, NOPE' + chr(143)})
> DbfError: unable to write updates to disk, original data restored: UnicodeDecodeError('ascii', 'Test, NOPE\x8f', 10, 11, 'ordinal not in range(128)')
>
>
> Now, in my case I don't want *any* decoding to occur...I just want to treat the Memo as binary. I was looking for a way
> to pass a flag to Table(...) or somehow specify that the Memo should be treated as binary.

You'll need to change the default encoding:

dbf.default_codepage = 'latin1'

Then, for a test, change the ascii codec out for latin1 as well:

dbf.code_pages['\x00'] = ('latin1', 'no translation')

This should make it so that any non-unicode data round-trips back to what it started as.
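
A quick sanity check of that round-trip at the codec level (Python 2):

>>> '\x8f'.decode('latin1').encode('latin1')
'\x8f'

latin1 maps every byte 0x00-0xFF to the same Unicode code point and back, so no byte value can get lost in translation.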

--
~Ethan~

Lucas Taylor

Mar 1, 2013, 4:32:48 PM
to python...@googlegroups.com

That is a fine workaround for utilizing latin1, and should help.

Testing this out has uncovered what looks to be a bug in the character/memo update functions (update_character, update_memo, update_vfp_memo).

These functions always use the default input_decoder (ascii) instead of the decoder derived from the codepage. So, if you attempt to update a field using bytes, it will first try to decode to unicode and fail unless the bytes are within the ASCII range. If you start out with unicode, the functions use the correct encoder:

table.append({'memo': u'\x8f'}) --> OK
table.append({'memo': '\x8f'}) --> Always fails, regardless of codepage setting

I've logged an issue in my fork w/ more info:
https://bitbucket.org/ltvolks/python-dbase/issue/5/updating-character-or-memo-fields-with
It looks to be an easy fix, and if I get time I'll send a pull request.
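
For illustration, the shape of the fix might be something like this (a hypothetical sketch; field_decoder stands in for whatever decoder the dbf package derives from the table's codepage, and the real update functions take library-specific arguments):

import codecs

field_decoder = codecs.getdecoder('cp1252')   # assumed: derived from the table's codepage

def coerce_field_value(value):
    # raw bytes should be decoded with the table's codepage decoder,
    # not the module-wide ascii input_decoder
    if isinstance(value, str):
        value = field_decoder(value)[0]
    return value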



Thanks for your help!



Ethan Furman

Mar 1, 2013, 4:59:49 PM
to python...@googlegroups.com
On 03/01/2013 01:32 PM, Lucas Taylor wrote:
> On Mar 1, 2013, at 11:47 AM, Ethan Furman wrote:
>>
>> You'll need to change the default encoding:
>>
>> dbf.default_codepage = 'latin1'
>>
>> Then, for a test, change the ascii codec out for latin1 as well:
>>
>> dbf.code_pages['\x00'] = ('latin1', 'no translation')
>>
>> This should make it so that any non-unicode data round-trips back to what it started as.
>
> That is a fine workaround for utilizing latin1, and should help.
>
> Testing this out has uncovered what looks to be a bug in the character/memo update functions (update_character,
> update_memo, update_vfp_memo).
>
> These functions always use the default input_decoder (ascii) instead of the decoder derived from the codepage. So, if
> you attempt to update a field using bytes, it will first try to decode to unicode and fail unless the bytes are within
> the ASCII range. If you start out with unicode, the functions use the correct encoder:
>
> table.append({'memo': u'\x8f'}) --> OK
> table.append({'memo': '\x8f'}) --> Always fails, regardless of codepage setting
>
> I've logged an issue in my fork w/ more info:
> https://bitbucket.org/ltvolks/python-dbase/issue/5/updating-character-or-memo-fields-with
> It looks to be an easy fix and if I get time I'll send a pull request.

The input_decoder is for decoding strings /from the program/, which could easily be different from whatever code page
the dbf is using.

The `dbf.default_codepage` in my earlier message should actually be `dbf.input_decoding` -- sorry for the mixup.
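
Putting the corrected pieces together, the whole workaround would look something like this (an untested sketch, reusing the field layout from your earlier test; Python 2):

import dbf

dbf.input_decoding = 'latin1'                          # decode program-supplied bytes as latin1
dbf.code_pages['\x00'] = ('latin1', 'no translation')  # map the '\x00' codepage to latin1

table = dbf.Table('test_latin1', 'memo M', dbf_type='fp')
with table:
    table.append({'memo': 'Test ' + chr(143)})         # '\x8f' should now round-trip
    print repr(table[0].memo)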

--
~Ethan~