On .02.2014, 22:26, <edward....@gmail.com> wrote:
> Just for completeness, this is what the Ecma spec
> <http://www.ecma-international.org/publications/standards/Ecma-376.htm>
> says about the "dimension" tag:
> 18.3.1.35 dimension (Worksheet Dimensions)
> This element specifies the used range of the worksheet. It specifies the
> row and column bounds of used cells in
> the worksheet. *This is optional and is not required.*
Yes, very much lipstick on the proverbial pig that this specification is.
> I appreciate the focus on efficiency. However, I don't think you can do
> much better than:
> 1. If "dimension" tag is present, use that.
> 2. If not, scan the rows, and if the "spans" tags are present, use
> those. Also count the rows.
> 3. If there are no "spans" tags, scan all the column entries in each row
> and record the min/max column. The spec does not say whether the columns
> in the row are in left-to-right order, so I would assume you have to look
> at each of them.
Indeed.
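
For the record, the three-step fallback described above could be sketched roughly as follows. This is only an illustration using the standard library's ElementTree on a worksheet part that is already in memory; it is not how openpyxl does it. The helper names (`col_to_index`, `parse_ref`, `sheet_bounds`) are made up for the example, and it assumes every row and cell carries an `r` attribute.

```python
import re
import xml.etree.ElementTree as ET

# SpreadsheetML main namespace, in ElementTree's "{uri}tag" notation.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
CELL_RE = re.compile(r"([A-Z]+)(\d+)")


def col_to_index(letters):
    """Convert column letters ('A', 'AB') to a 1-based column index."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index


def parse_ref(ref):
    """Split a cell reference such as 'C7' into (row, column)."""
    match = CELL_RE.fullmatch(ref)
    if match is None:
        raise ValueError("unsupported cell reference: %r" % ref)
    return int(match.group(2)), col_to_index(match.group(1))


def sheet_bounds(xml_text):
    """Return (min_row, min_col, max_row, max_col), or None for an empty sheet."""
    root = ET.fromstring(xml_text)

    # Step 1: trust the optional <dimension ref="A1:C3"/> element if present.
    dim = root.find(NS + "dimension")
    if dim is not None and dim.get("ref"):
        first, _, last = dim.get("ref").partition(":")
        r1, c1 = parse_ref(first)
        r2, c2 = parse_ref(last or first)  # a single-cell ref has no ':'
        return r1, c1, r2, c2

    rows, cols = [], []
    for row in root.iter(NS + "row"):
        rows.append(int(row.get("r")))  # assumes the row number is written
        spans = row.get("spans")
        if spans:
            # Step 2: "spans" holds one or more "first:last" column ranges.
            for span in spans.split():
                lo, _, hi = span.partition(":")
                cols.extend((int(lo), int(hi)))
        else:
            # Step 3: no spans, so look at every cell; order is not assumed.
            for cell in row.iter(NS + "c"):
                cols.append(parse_ref(cell.get("r"))[1])

    if not rows or not cols:
        return None
    return min(rows), min(cols), max(rows), max(cols)
```

Note that this loads the whole part into memory, which is exactly what the iterator is meant to avoid, so treat it as a description of the logic only.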
> If the openpyxl optimized reader can read the file without getting the
> dimensions, then that's great. That may be all the caller wants.
The iterator is there for users of extremely large files. Doing anything
beyond streaming the rows is a potentially huge penalty for them. This is
essentially current behaviour in the 1.9 branch. When the dimension element
is present it comes at the start of the file, so reading it is not much of a
penalty. Currently this happens automatically but I'll probably defer it.
Whether we provide a method that will do a full scan or not will depend upon
whether people need it.
> However, if the caller needs the dimensions, and they are not stored in
> the sheet, then the sheet is going to have to be scanned twice anyway.
Nope. As you iterate through the rows you can collect the coordinates of
each cell for the calculation in a single pass.
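
A minimal sketch of what I mean by a single pass. The row format here, a list of `(row, column, value)` tuples, is just a hypothetical stand-in for whatever the iterator yields, and `BoundsTracker` and `iter_rows_tracking` are invented names, not openpyxl API.

```python
class BoundsTracker:
    """Running min/max of the row and column indices seen so far."""

    def __init__(self):
        self.min_row = self.min_col = None
        self.max_row = self.max_col = None

    def update(self, row, col):
        if self.min_row is None:
            # First cell seen: it defines all four bounds.
            self.min_row = self.max_row = row
            self.min_col = self.max_col = col
            return
        self.min_row = min(self.min_row, row)
        self.max_row = max(self.max_row, row)
        self.min_col = min(self.min_col, col)
        self.max_col = max(self.max_col, col)

    def bounds(self):
        """Return (min_row, min_col, max_row, max_col), or None if no cells."""
        if self.min_row is None:
            return None
        return self.min_row, self.min_col, self.max_row, self.max_col


def iter_rows_tracking(rows, tracker):
    """Yield each row unchanged while recording its cell coordinates."""
    for row in rows:
        for row_idx, col_idx, _value in row:
            tracker.update(row_idx, col_idx)
        yield row
```

The caller processes rows as usual and asks the tracker for the bounds once the iterator is exhausted, so nothing is read twice:

```python
tracker = BoundsTracker()
for row in iter_rows_tracking([[(2, 3, "a"), (2, 1, "b")], [(5, 2, "c")]], tracker):
    pass  # normal per-row processing goes here
print(tracker.bounds())
```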
> It would be nicer if openpyxl did the work here.
I might add the method; it depends on whether there is a real use case. For
example, do you need the dimensions for your customer?
> Ultimately, the Ecma standard is the root cause. Unless the writer uses
> the optimization fields, the sheet needs to be scanned twice to get its
> dimensions.
Nope, see above.
> Also, consider the following suggestion:
> Since "dimensions" and "spans" tags are optional, would the openpyxl
> optimized writer run faster by not bothering with those tags in its
> output?
Quite possibly. Eric told me that the writer has to jump some hurdles
because of the need to put the dimensions at the start of the file. I
haven't started looking at the optimised writer yet but there is certainly
huge potential for performance improvements. As things stand, the
recommendation has to be to use xlsxwriter for speed, because it's about 5
times faster. The focus for 1.9 and beyond is harmonising behaviour across
reader and writer, removing duplication, and cleaning up where possible.
> Excel can read the files perfectly well without those tags.
> The output could be significantly smaller too.
I suspect not, given that these are repetitive strings that will compress
well.
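
If anyone wants to check that intuition, something like the following would do. It is a sketch on synthetic data, not a measurement of real files: `build_sheet` and `size_report` are names made up for the example, the row layout is invented, and zlib stands in for the deflate compression used inside the xlsx ZIP container.

```python
import zlib


def build_sheet(n_rows, with_spans):
    """Build a synthetic <sheetData> fragment with one cell per row."""
    span_attr = ' spans="1:10"' if with_spans else ""
    rows = [
        '<row r="%d"%s><c r="A%d"><v>%d</v></c></row>' % (i, span_attr, i, i)
        for i in range(1, n_rows + 1)
    ]
    return ("<sheetData>" + "".join(rows) + "</sheetData>").encode("utf-8")


def size_report(n_rows=10000):
    """Return {label: (raw_bytes, compressed_bytes)} for both variants."""
    report = {}
    for label, flag in (("with_spans", True), ("without_spans", False)):
        raw = build_sheet(n_rows, flag)
        report[label] = (len(raw), len(zlib.compress(raw)))
    return report


if __name__ == "__main__":
    for label, (raw, packed) in size_report().items():
        print("%-14s raw=%d compressed=%d" % (label, raw, packed))
```

Comparing the two compressed sizes, rather than the raw ones, is what matters for the size of the file on disk.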
> For example, the attached file from Google Docs is 72k.
> After opening it and saving it in Excel, it becomes 99k - about 37% larger.
> The full extent of what Excel adds is unclear, but it does add the
> optional optimization fields.
> Perhaps this is why the Google Docs team didn't include the optimization
> fields in their output.
No idea. I'm not going to start guessing why they do things, but having the
dimensions at the start of a file certainly poses a problem in any kind of
streaming context. But I'm also not going to start making up for their
omissions.