Problem reading xlsx file created from Google Docs

936 views
Skip to first unread message

edward....@gmail.com

unread,
Feb 5, 2014, 8:48:36 AM2/5/14
to openpyx...@googlegroups.com
I am using openpyxl==1.8.2.
When I attempt to read the attached file created in Google Docs, I get the following error:

  File "C:\Projects\CrossMgr\Excel.py", line 114, in __init__
    self.book = load_workbook( filename = filename, use_iterators = True )
  File "build\bdist.win32\egg\openpyxl\reader\excel.py", line 136, in load_workbook

  File "build\bdist.win32\egg\openpyxl\reader\excel.py", line 204, in _load_workbook

  File "build\bdist.win32\egg\openpyxl\reader\worksheet.py", line 310, in read_worksheet
  File "build\bdist.win32\egg\openpyxl\reader\iter_worksheet.py", line 137, in __init__
  File "build\bdist.win32\egg\openpyxl\reader\worksheet.py", line 78, in read_dimension
AttributeError: 'NoneType' object has no attribute 'split'

If I open this file in Excel, save it, then open it, it works fine.
Any ideas?

Thanks in advance.
EscapeVelocityRegistrationExport.xlsx

Charlie Clark

unread,
Feb 5, 2014, 8:54:57 AM2/5/14
to openpyx...@googlegroups.com
Am .02.2014, 14:48 Uhr, schrieb <edward....@gmail.com>:

> If I open this file in Excel, save it, then open it, it works fine.
> Any ideas?

Yes, this is the same as
https://bitbucket.org/ericgazoni/openpyxl/issue/260/182-crashes-when-trying-to-find-dimension

It is fixed in the 1.8 branch and we will be making a release soon.

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226

edward....@gmail.com

unread,
Feb 5, 2014, 9:31:24 AM2/5/14
to openpyx...@googlegroups.com
I will switch to the new version when it is available.
Thanks much for letting me know.

edward....@gmail.com

unread,
Feb 13, 2014, 10:30:57 PM2/13/14
to openpyx...@googlegroups.com, edward....@gmail.com
Just installed 1.8.3.
Unfortunately, the problem can still be seen with the attached file in my previous post.

The issue is that the rows from sheets generated in Google docs have no "spans" tag (el.get("spans") returns None in worksheet.py:77).
The sheet.xml from the xlmx file has the format below.  There is no <spans> tag on the row.

Perhaps the min and max columns can be determined by looping across the row entries if span == None.
Is that possible?

<sheetData>
<row r="1">
<c t="s" s="1" r="A1"><v>0</v></c>
<c t="s" s="2" r="B1"><v>1</v></c>
<c t="s" s="2" r="C1"><v>2</v></c>
<c t="s" s="2" r="D1"><v>3</v></c>
<c t="s" s="2" r="E1"><v>4</v></c>
<c t="s" s="2" r="F1"><v>5</v></c>
<c t="s" s="2" r="G1"><v>6</v></c>
<c t="s" s="3" r="H1"><v>7</v></c>
<c t="s" s="4" r="I1"><v>8</v></c>
<c t="s" s="2" r="J1"><v>9</v></c>
<c s="5" r="K1"><v>3.0</v></c>
<c t="s" s="6" r="L1"><v>10</v></c>
</row>

---------------------------

  File "C:\Python27\lib\site-packages\openpyxl\reader\excel.py", line 136, in load_workbook
    _load_workbook(wb, archive, filename, use_iterators, keep_vba)
  File "C:\Python27\lib\site-packages\openpyxl\reader\excel.py", line 203, in _load_workbook
    sheet_codename=sheet_codename)
  File "C:\Python27\lib\site-packages\openpyxl\reader\worksheet.py", line 312, in read_worksheet
    xml_source, string_table, style_table)
  File "C:\Python27\lib\site-packages\openpyxl\reader\iter_worksheet.py", line 136, in __init__
    min_col, min_row, max_col, max_row = read_dimension(xml_source=self.xml_source)
  File "C:\Python27\lib\site-packages\openpyxl\reader\worksheet.py", line 78, in read_dimension
    if ":" in span:
TypeError: argument of type 'NoneType' is not iterable

edward....@gmail.com

unread,
Feb 14, 2014, 8:25:10 AM2/14/14
to openpyx...@googlegroups.com, edward....@gmail.com
I found a fix (see below) that works with spreadsheets created in Google Docs that don't have <spans> tags on the rows.
The code below should replace the read_dimension function in worksheet.py (sorry I didn't make a fork).

def read_dimension(xml_source):
    min_row = min_col =  max_row = max_col = None
    source = _get_xml_iter(xml_source)
    it = iterparse(source)

    dimension_tag = '{%s}dimension' % SHEET_MAIN_NS
    row_tag = '{%s}row' % SHEET_MAIN_NS
    column_tag = '{%s}c' % SHEET_MAIN_NS

    alpha_cap = set( 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' )
    ord_A = ord('A')

    for event, el in it:
        if el.tag == dimension_tag:
            dim = el.get("ref")
            if ':' in dim:
                start, stop = dim.split(':')
            else:
                start = stop = dim
            min_col, min_row = coordinate_from_string(start)
            max_col, max_row = coordinate_from_string(stop)
            return min_col, min_row, max_col, max_row

        if el.tag == row_tag:
            row = el.get("r")
            if min_row is None:
                min_row = int(row)
            span = el.get("spans")
            if span is None:
                for child in el:
                    if child.tag == column_tag:
                        ref = child.get("r")
                        cur_col = 0
                        for ch in ref:
                            if ch not in alpha_cap:
                                break
                            cur_col = cur_col * 26 + ord(ch) - ord_A
                        cur_col += 1
                        if min_col is None:
                             min_col = cur_col
                             max_col = cur_col
                        else:
                             min_col = min( min_col, cur_col )
                             max_col = max( max_col, cur_col )
            elif ":" in span:
                start, stop = span.split(":")
                if min_col is None:
                    min_col = int(start)
                    max_col = int(stop)
                else:
                    min_col = min(min_col, int(start))
                    max_col = max(max_col, int(stop))
    max_row = int(row)
    warn("Unsized worksheet")
    return get_column_letter(min_col), min_row, get_column_letter(max_col),  max_row

Charlie Clark

unread,
Feb 14, 2014, 1:34:11 PM2/14/14
to openpyx...@googlegroups.com
Hi,

Am .02.2014, 04:30 Uhr, schrieb <edward....@gmail.com>:

> Just installed 1.8.3.
> Unfortunately, the problem can still be seen with the attached file in my
> previous post.

> The issue is that the rows from sheets generated in Google docs have no
> "spans" tag (el.get("spans") returns None in worksheet.py:77).
> The sheet.xml from the xlmx file has the format below. There is no
> <spans>
> tag on the row.

True, sorry about that.

> Perhaps the min and max columns can be determined by looping across the
> row
> entries if span == None.
> Is that possible?

Yes, it's possible but it's pointless: looping over *all* the cells so you
can loop over all the cells…

I've already fixed this in 1.9: you can simply loop through unsized
worksheets but an exception will be raised if you want to know how big
they are. I'll have to see whether I can backport something so that you
can use the IterableWorksheet with Google docs files, though really this
is a massive bug in Google docs.

Charlie Clark

unread,
Feb 14, 2014, 1:40:53 PM2/14/14
to openpyx...@googlegroups.com
Am .02.2014, 14:25 Uhr, schrieb <edward....@gmail.com>:

> I found a fix (see below) that works with spreadsheets created in Google
> Docs that don't have <spans> tags on the rows.
> The code below should replace the read_dimension function in worksheet.py
> (sorry I didn't make a fork).

Just a brief note, see my previous e-mail about why this is not going to
be fixed, this kind of code if far too complex to be useful and *nothing*
will be excepted without tests. This is not personal criticism, I just see
it is a good opportunity to point it out. If you start implementing
lookups in the function then it's time to pause. As it stands we do have
some utility functions for converting between Excel coordinates and
numeric indices and back but they're probably not documented as well as
they should be.

But thanks very much for pointing out the problem and trying to fix it! I
hope I haven't discouraged you from further contributions!

edward....@gmail.com

unread,
Feb 14, 2014, 3:44:40 PM2/14/14
to openpyx...@googlegroups.com
No offense taken.
I had a customer with a bug and I shared the fix.
What I have now will work fine until 1.9 is available.  Scanning the file twice is a bit slower, but not significantly.
Don't waste your time backporting - I can wait.

Regarding Google Docs and xlsx output, I did some research.
According to "Ecma Office Open XML Part 1 - Fundamentals And Markup Language Reference.pdf", section 18.3.1.73, on page 1671, the <spans> tag is "Optimization only, and not required".
Google's xlsx is compliant with the spec - although is lacks most (all?) of the optimization features.

Thanks again.

Charlie Clark

unread,
Feb 15, 2014, 9:38:09 AM2/15/14
to openpyx...@googlegroups.com
Am .02.2014, 21:44 Uhr, schrieb <edward....@gmail.com>:


> No offense taken.
> I had a customer with a bug and I shared the fix.
> What I have now will work fine until 1.9 is available. Scanning the file
> twice is a bit slower, but not significantly.

It will depend upon the size of the file and memory use is particularly
dependent upon the number of columns. But, yeah, basically it's more bad
practice than a real a bottleneck.

> Don't waste your time backporting - I can wait.

Unfortunately, it's not just your code that is broken as result. I'm going
to reinstate the 1.7 practice of parsing the whole damn file twice. I
didn't have a test file without spans when I was working on 1.8, and as it
was me that broke this, it'll be me that'll fix it.

> Regarding Google Docs and xlsx output, I did some research.
> According to "Ecma Office Open XML Part 1 - Fundamentals And Markup
> Language Reference.pdf", section 18.3.1.73, on page 1671, the <spans> tag
> is "Optimization only, and not required".
> Google's xlsx is compliant with the spec - although is lacks most (all?)
> of the optimization features.

Thanks for the lookup, openpyxl has historically been more reverse
engineering than working from the spec but I'm we can start to change that.

TBH the "spans" aren't much use here either. They're convenient when
you're working with columns but in this instance whether you parse all
rows or all rows and cells doesn't make much difference. The real
convenience is the "dimensions" tag at the head of the file because you
can break out of the parser as soon as you've found this. This is how 1.9
works and one of the reasons why it's about 25 % faster (for all cells,
even more so when you're just using a range) than 1.8 when using iterators.

For reference, we're already in an event parser so you can just add an
additional code branch for cells, though it does also appear that using
Element.iter(node, tagname) seems to be slightly more efficient.

openpyxl.cell.column_index_from_string() and
openpyxl.cell.get_column_letter() are the utility functions for the lookup.

edward....@gmail.com

unread,
Feb 15, 2014, 4:26:14 PM2/15/14
to openpyx...@googlegroups.com
Just for completeness, this is what the Ecma spec says about the "dimension" tag:

18.3.1.35 dimension (Worksheet Dimensions)
This element specifies the used range of the worksheet. It specifies the row and column bounds of used cells in
the worksheet. This is optional and is not required.

I appreciate the focus on efficiency.  However, I don't think you can do much better than:
  1. If  "dimension" tag is present, use that.
  2. If not, scan the rows, and if the "spans" tags are present, use those.  Also count the rows.
  3. If there are no "spans" tags, scan all the column entries in each row and record the min/max column.  The spec does not say whether the columns in the row are in left-to-right order, so I would assume you have to look at each of them.
If the openpyxl optimized reader can read the file without getting the dimensions, than that's great.  That may be all the caller wants.
However, if the caller needs the dimensions, and they are not stored in the sheet, then the sheet is going to have to be scanned twice anyway.
It would be nicer if openpyxl did the work here.

Ultimately, the Ecma standard is the root cause.  Unless the writer uses the optimization fields, the sheet needs to be scanned twice to get its dimensions.

Also, consider the following suggestion:

Since "dimensions" and "spans" tags are optional, would the openpyxl optimized writer run faster by not bothering with those tags in its output?
Excel can read the files perfectly well without those tags.

The output could be significantly smaller too.
For example, the attached file from Google Docs is 72k.
After opening it and saving it in Excel, it becomes 99k - about 37% larger.
The full extent of what Excel adds is unclear, but it does add the optional optimization fields.

Perhaps this is why the Google Docs team didn't include the optimization fields in their output.

Charlie Clark

unread,
Feb 16, 2014, 7:11:47 AM2/16/14
to openpyx...@googlegroups.com
Am .02.2014, 22:26 Uhr, schrieb <edward....@gmail.com>:

> Just for completeness, this is what the Ecma
> spec<http://www.ecma-international.org/publications/standards/Ecma-376.htm>
> <http://www.ecma-international.org/publications/standards/Ecma-376.htm>says
> about the "dimension" tag:

> 18.3.1.35 dimension (Worksheet Dimensions)
> This element specifies the used range of the worksheet. It specifies the
> row and column bounds of used cells in
> the worksheet. *This is optional and is not required.*

Yes, very much lipstick on the proverbial pig that this specification is.

> I appreciate the focus on efficiency. However, I don't think you can do
> much better than:
> 1. If "dimension" tag is present, use that.
> 2. If not, scan the rows, and if the "spans" tags are present, use
> those. Also count the rows.
> 3. If there are no "spans" tags, scan all the column entries in each
> row
> and record the min/max column. The spec does not say whether the
> columns
> in the row are in left-to-right order, so I would assume you have to
> look
> at each of them.

Indeed.

> If the openpyxl optimized reader can read the file without getting the
> dimensions, than that's great. That may be all the caller wants.

The iterator is there for users of extremely large files. It's a
potentially huge penalty on users to do anything else. This is essentially
current behaviour in the 1.9 branch. Dimensions are always at the start of
the file so reading them is not much of a penalty. Currently this happens
automatically but I'll probably defer it. Whether we provide a method that
will do a full scan or not will depend upon whether people need it.

> However, if the caller needs the dimensions, and they are not stored in
> the
> sheet, then the sheet is going to have to be scanned twice anyway.

Nope. As you iterate through the rows you can collect the coordinates of
each cell for the calculation in a single pass.

> It would be nicer if openpyxl did the work here.

Might add the method, depends on whether there is a real use case. For
example, do you need the dimensions for your customer?

> Ultimately, the Ecma standard is the root cause. Unless the writer uses
> the optimization fields, the sheet needs to be scanned twice to get its
> dimensions.

Nope, see above.

> Also, consider the following suggestion:

> Since "dimensions" and "spans" tags are optional, would the openpyxl
> optimized writer run faster by not bothering with those tags in its
> output?

Quite possibly. Eric told me that the writer has to jump some hurdles
because of the need to put the dimensions at the start of the file. I
haven't started looking at the optimised writer yet but there is certainly
huge potential for performance improvements. As things stand the
recommendation has to be use xlsxwriter for speed because it's about 5
times faster. The focus on 1.9 and beyond is harmonising behaviour across
reader and writer and removing duplication and cleaning up where possible.

> Excel can read the files perfectly well without those tags.

> The output could be significantly smaller too.

I suspect not given that these are repetitive strings that will compress
well.

> For example, the attached file from Google Docs is 72k.
> After opening it and saving it in Excel, it becomes 99k - about 37%
> larger.
> The full extent of what Excel adds is unclear, but it does add the
> optional optimization fields.


> Perhaps this is why the Google Docs team didn't include the optimization
> fields in their output.

No idea. I'm not going to start guessing why they do things but having it
at the start of a file certainly poses a problem in any kind of streaming
context. But I'm also not going to start making up for their omissions.

edward....@gmail.com

unread,
Feb 16, 2014, 11:37:00 AM2/16/14
to openpyx...@googlegroups.com
Sure, you know the dimensions of a sheet after reading it, one may want to know the dimensions without fully reading the sheet, if at all possible.
Three use cases are to preallocate space, initialize additional data structures based on the expected rows/cols and to check integrity based on the number of rows/cols.
I didn't convey this clearly in my previous post.

Consider a a maximum-performance/minimum-size streaming output option with no optimization tags, as permitted by the spec.  Let the caller decide the speed/size/optimization tradeoff.

Charlie Clark

unread,
Feb 16, 2014, 11:51:08 AM2/16/14
to openpyx...@googlegroups.com
Am .02.2014, 17:37 Uhr, schrieb <edward....@gmail.com>:

> Sure, you know the dimensions of a sheet after reading it, one may want
> to
> know the dimensions without fully reading the sheet, if at all possible.

Do you need to do this? It's also perfectly reasonable to want to know
that you don't know how big the worksheet is.

> Three use cases are to preallocate space, initialize additional data
> structures based on the expected rows/cols and to check integrity based
> on
> the number of rows/cols.

Using iterators is expressly against preallocating space and there's not
much you can do about integrity either. If you are worried about the
quality of sheets produced, say, by another library then that library is
the thing to fix. If you want a specific range of cells, including blanks,
then that's taken care of.

> I didn't convey this clearly in my previous post.
> Consider a a maximum-performance/minimum-size streaming output option
> with
> no optimization tags, as permitted by the spec. Let the caller decide
> the
> speed/size/optimization tradeoff.

How are they going to be able to assess? It's possible to calculate the
dimensions as a *side-effect* of reading all the cells.

Anyway, code speaks louder than words. I fixed 1.8 including added a test
and 1.9 is currently as described. Additional changes will depend upon
specific requirements.
Reply all
Reply to author
Forward
0 new messages