openpyxl interprets Bashkir-currency-formatted values as dates?

484 views
Skip to first unread message

John Hays

unread,
Jun 16, 2015, 10:04:33 AM6/16/15
to openpyx...@googlegroups.com
So this is may sound like a wonky super-specific issue but it's a big deal in my particular implementation if the issue happens to be deeper than just this particular currency. Hopefully I'm just being dumb. As part of a larger program implementing openpyxl, I've included the ability to perform regex queries against cell values as read by openpyxl. It's been working fine against most spreadsheets but I ran across a strange problem with cells formatted as Bashkir currency. I checked to see if it happened with any numbers formatted as currency and I didn't have this problem with the few other formats I tested (Turkish, Konkani, Dollars). I don't necessarily need the ability to read Bashkir currency values...I just happened to stumble across this while I testing to see if the different Excel number formats would affect openpyxl and regex cell interpretation.

I'm using Python 3.4.3, Excel 2013, and openpyxl 2.2.3.

Here's the code I used to re-create the issue outside of my program:
wb = load_workbook(filename='myfile.xlsx',guess_types=False)
ws = wb.worksheets[0]
for row in ws.rows:
        for cell in row:
            print(cell.value)

Here's the source cell values (w/ Format Cells -> Currency -> Bashkir applied in Excel 2013).
9000
9200
800
9420
9422
32123

A simple regex query such as "\d{4}-\d{2}" will match all of the above. "\d{4}$" would match none; if the cell.values were being read as expected, I should have 4 hits (9000,9200,9420,9422). Here's the print(cell.value) output in console.

1924-08-21 00:00:00
1925-03-09 00:00:00
1902-03-10 00:00:00
1925-10-15 00:00:00
1925-10-17 00:00:00
1987-12-12 00:00:00

If I copy all these values to a new worksheet through openpyxl this is what I get back (sans any formatting):
9000
9200
800
9420
9422
32123

So while the end result looks fine, I use my regex query to determine if the cells get written to a new worksheet based on their cell.value. If they get mis-interpreted as dates, my regex queries won't match against values I'm needing to extract. I've attached the source worksheet .xml if it helps.
sheet1.xml

John Hays

unread,
Jun 16, 2015, 10:55:30 AM6/16/15
to openpyx...@googlegroups.com
It looks like I could just use cell.internal_value to ensure the number stored in the sheet1.xml doesn't get mangled but is there any downside to this? I think I remember reading somewhere on SO that cell.internal_value shouldn't be used except in very specific circumstances.

Charlie Clark

unread,
Jun 16, 2015, 1:25:13 PM6/16/15
to openpyx...@googlegroups.com
Hiya John,

Am .06.2015, 16:04 Uhr, schrieb John Hays <femur...@gmail.com>:

> So this is may sound like a wonky super-specific issue but it's a big
> deal
> in my particular implementation if the issue happens to be deeper than
> just
> this particular currency. Hopefully I'm just being dumb. As part of a
> larger program implementing openpyxl, I've included the ability to
> perform
> regex queries against cell values as read by openpyxl. It's been working
> fine against most spreadsheets but I ran across a strange problem with
> cells formatted as Bashkir currency. I checked to see if it happened with
> any numbers formatted as currency and I didn't have this problem with the
> few other formats I tested (Turkish, Konkani, Dollars). I don't
> necessarily
> need the ability to read Bashkir currency values...I just happened to
> stumble across this while I testing to see if the different Excel number
> formats would affect openpyxl and regex cell interpretation.

All these things are wonky until you come across in your work.

> I'm using Python 3.4.3, Excel 2013, and openpyxl 2.2.3.
> Here's the code I used to re-create the issue outside of my program:
> wb = load_workbook(filename='myfile.xlsx',guess_types=False) # False has
> been the default for a while
> ws = wb.worksheets[0]
> for row in ws.rows: # for row in ws: also works
> So while the *end* result looks fine, I use my regex query to determine
> if the cells get written to a new worksheet based on their cell.value.
> If they get mis-interpreted as dates, my regex queries won't match
> against values
> I'm needing to extract. I've attached the source worksheet .xml if it
> helps.

Unfortunately, you come across one of the bigger turds in the
specification. It makes me weep when I think of it: dates and times are
implemented in Excel via the number format. On their own they are just
numbers relevant to particular epochs. FWIW in response to a bug on the
lack of proper support for datetimes in the format I recently got this
response:
"""
Hi Charlie - after much discussion of your defect report
(https://onedrive.live.com/view.aspx/Public%20Documents/2014/DR-14-0012.docx?cid=C8BA0861DC5E4ADC),
WG4 have decided to close it without action. Honestly, though, I do feel
your pain about this one as the standard is somewhat confusing…
"""

The strict specification does actually support a datetime cell type but
you still have to assign a format to it and very few applications actually
support it. :-( The side-effect of this is a significant performance
impact for all these cells.

So, when reading a cell you *have* to take the formatting to account.
There are some (locale-specific for extra fun) builtins that correspond to
datetimes otherwise you have to rely on the power your regex to determine
whether a number format actually is datetime or just some special
accounting formatting from Timbuktu, Tashkent, Ouagadougou, or wherever!
It's almost certain that this is at fault here. Your sample sheet.xml is
of no use, puny earthling! You also need to look at the styles.xml and
work your way through the style chain. Or, simply look at the number
format:

cell.number_format
and whether openpyxl has decided this is a date or not
cell.is_date

In which the culprit is a sneaky function:
from openpyxl.styles.numbers import is_date_format

The code and associated regex could almost certainly be improved…

OTOH it's perfectly okay to use internal_value, if that's what you want.
It's a public attribute that most people won't want but it's there for a
reason.

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226

John Hays

unread,
Jun 17, 2015, 9:30:51 AM6/17/15
to openpyx...@googlegroups.com
Thanks for the explanation! I dug into my styles.xml and found that the Bashkir numFmtId is 164 which is also happens to be a hard-coded value inside the NumberFormatDescriptor class within openpyxl.styles.styleable. I'm fairly new to programming and a total Python greenhorn so my ability to follow other people's code is rather laughable but it did seem a bit coincidental as I wasn't able to repeat this issue with many other obscure currency formats. Could easily be nothing but I figured I'd mention it JIC.

In any case, internal_value seems to be safest option to run regex queries against.

Thanks again!

Charlie Clark

unread,
Jun 17, 2015, 9:39:10 AM6/17/15
to openpyx...@googlegroups.com
Am .06.2015, 15:30 Uhr, schrieb John Hays <femur...@gmail.com>:

> Thanks for the explanation!

> I dug into my styles.xml and found that the
> Bashkir numFmtId is 164 which is also happens to be a hard-coded value
> inside the NumberFormatDescriptor class within openpyxl.styles.styleable.

This is sort of random. There is no explicit upper limit for built-in
formats but the reverse-engineered convention is that everything up to and
including 164 are built-in. The documentation itself, however, only goes
up to 81.

What does the format actually look like? cell.number_format? Can you
provide the styles.xml?

> I'm fairly new to programming and a total Python greenhorn so my ability
> to follow other people's code is rather laughable but it did seem a bit
> coincidental as I wasn't able to repeat this issue with many other
> obscure currency formats. Could easily be nothing but I figured I'd
> mention it JIC.

This is one of the hairiest parts of the code because we convert some
weird denormalised pointer nonsense into Pythonic objects.

> In any case, internal_value seems to be safest option to run regex
> queries against.

That's too broad a definition. In almost all cases you won't want to do
that.

John Hays

unread,
Jun 17, 2015, 10:33:02 AM6/17/15
to openpyx...@googlegroups.com
> In any case, internal_value seems to be safest option to run regex  
> queries against.

That's too broad a definition. In almost all cases you won't want to do  
that.

Ah, I get what you're saying. For now I've built in the ability to run the query against a specific column to prevent unwanted matches and since I'm the only one running the code I can always double check to see how Excel is actually storing the value in the worksheet .xml file and adjust my regex query appropriately. As you suggested earlier I'll be putting in formatting checks in at some point to correct these shortcomings.

I've attached my styles.xml...

...and this is what gets returned from (cell.number_format).encode('utf-8'):
b'#,##0\\ [$\xe2\x82\xbd-46D]'

(I'm using Powershell at the moment and the .encode was the only way I could get output without an 'UnicodeEncodeError')
styles.xml

Charlie Clark

unread,
Jun 17, 2015, 10:51:02 AM6/17/15
to openpyx...@googlegroups.com
Am .06.2015, 16:33 Uhr, schrieb John Hays <femur...@gmail.com>:

> Ah, I get what you're saying. For now I've built in the ability to run
> the query against a specific column to prevent unwanted matches and
> since I'm
> the only one running the code I can always double check to see how Excel
> is actually storing the value in the worksheet .xml file and adjust my
> regex
> query appropriately. As you suggested earlier I'll be putting in
> formatting checks in at some point to correct these shortcomings.

> I've attached my styles.xml...
> ...and this is what gets returned from
> (cell.number_format).encode('utf-8'):
> b'#,##0\\ [$\xe2\x82\xbd-46D]'

Interesting. What does cell.is_date say?

John Hays

unread,
Jun 17, 2015, 10:59:45 AM6/17/15
to openpyx...@googlegroups.com
cell.is_date returns 'True'

Charlie Clark

unread,
Jun 17, 2015, 1:56:40 PM6/17/15
to openpyx...@googlegroups.com
Am .06.2015, 16:33 Uhr, schrieb John Hays <femur...@gmail.com>:

> b'#,##0\\ [$\xe2\x82\xbd-46D]'

You can actually see it in the XML source for the styles:

"#,##0\ [$₽-46D] - apparently that's the the unicode for Rouble

https://msdn.microsoft.com/en-us/library/dd318693(VS.85).aspx tells us
that the 46D is the Bashkir locale. Our naive lookup is interpreting the
"D" as indication of a date format. Looks like this needs improving.

Charlie Clark

unread,
Jun 17, 2015, 4:27:02 PM6/17/15
to openpyx...@googlegroups.com
Am .06.2015, 19:56 Uhr, schrieb Charlie Clark
<charli...@clark-consulting.eu>:

> https://msdn.microsoft.com/en-us/library/dd318693(VS.85).aspx tells us
> that the 46D is the Bashkir locale. Our naive lookup is interpreting the
> "D" as indication of a date format. Looks like this needs improving.

This now seems fixed and should be in the next release.

John Hays

unread,
Jun 17, 2015, 5:10:06 PM6/17/15
to openpyx...@googlegroups.com
Excellent! Thank you for your assistance; I'm looking forward to using the new version.

Charlie Clark

unread,
Jun 18, 2015, 8:50:09 AM6/18/15
to openpyx...@googlegroups.com
Am .06.2015, 23:10 Uhr, schrieb John Hays <femur...@gmail.com>:

> Excellent! Thank you for your assistance; I'm looking forward to using
> the
> new version.

In case you missed it, the new version was released yesterday.
Reply all
Reply to author
Forward
0 new messages