Some basic information, please:
What version of Python
What version and SP of Windows
What version of xlrd
What does "freezes up" mean? Does it freeze inside open_workbook or
inside your own code? Does it freeze if you run
\python2X\python \python2X\scripts\runxlrd.py -f1 ov yourfile.xls
in a Command Prompt window [where "python2X" is where your Python is
installed]?
What does task manager show is the percentage of CPU that Python is
getting? What else is running?
How much physical memory?
Size of pagefile (shown as "Commit charge (limit)" in task manager on
Win XP)?
Either answer these or (better) send me a copy of your XLS file plus the
shortest script that demonstrates the problem:
How many worksheets in XLS file? How many rows and columns in each? Size
of XLS file in MB? What software created the XLS file? What is your
script doing after it opens the workbook?
Here's a reference point for you: On my setup (Win XP SP2, Python 2.7,
latest release xlrd) running the above command on a large file (120 MB)
takes max extra memory of 150MB with -f0 and 183MB with -f1. This runs
comfortably on a 1GB memory box with 128MB allocated to screen memory
and the task manager graph showing about 637MB in use before I run the
command.
The difference in memory consumption is mostly attributable to having
[with default open_workbook args] an array.array('h') per row to contain
XF indexes. In other words, 2 bytes per cell plus small change. Double
that to allow for Windows mem mgt silliness. So 10 MB -> 500 MB
certainly indicates a problem somewhere.
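That 2-bytes-per-cell arithmetic can be sketched as follows; the function name and the example dimensions are illustrative only, not part of xlrd:

```python
def xf_index_overhead_bytes(nrows, ncols, nsheets=1):
    """Estimate the extra memory used for XF indexes when a workbook
    is opened with formatting_info=True (other args defaulted): xlrd
    keeps an array.array('h') per row, i.e. 2 bytes per cell, ignoring
    the small per-row object overhead ("small change")."""
    return nsheets * nrows * ncols * 2

# A single fully-populated Excel 97 sheet (65536 rows x 256 columns):
print(xf_index_overhead_bytes(65536, 256))  # 33554432, i.e. exactly 32 MB
```

Double that, per the rule of thumb above, and one pathologically full sheet still only accounts for ~64 MB, so 10 MB -> 500 MB needs another explanation.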
Cheers,
John
Any particular reason why you aren't using 2.6 or 2.7?
> XP SP 3
> xlrd 0.7.1
>
>> What does "freezes up" mean? Does it freeze inside open_workbook or
>> inside your own code?
It helps if you answer all questions, especially ones that you should
have asked yourself before asking the forum. I presume from later
information that the answers are (1) "runs slowly" (2) "inside your own
code"
>> Does it freeze if you run
>>
>> \python2X\python \python2X\scripts\runxlrd.py -f1 ov yourfile.xls
>>
>> in a Command Prompt window [where "python2X" is where your Python is
>> installed]?
>
> This command completes after a while.
How long is "a while"??? Note: it actually prints the time to do each of
open_workbook() and the requested command.
> It consumes a comparable amount of resources to my script (nearly
> 600MB)
>
>>
>> What does task manager show is the percentage of CPU that Python is
>> getting? What else is running?
>
> 13%
This is rather unlikely ... after opening your physically-small file and
getting the contents into memory, open_workbook should be getting 100%
of an otherwise idle CPU; likewise with your loop-over-all-cells code.
Are you sure it's not swapping?
> nothing else is consuming large amounts of resources. Just the
> standard web browser etc.
What is "the standard web browser"?? IE? Firefox? Opera? Chrome?
something else? Note: the latest Firefox seems to have a habit of using
99% of the CPU for minutes at a time when a complex page is in the
actively-displayed tab ... were you actually watching Task Manager all
the time?
>
>>
>> How much physical memory?
>> Size of pagefile (shown as "Commit charge (limit)" in task manager on
>> Win XP)?
>
> 3MB RAM
3GB, perhaps?
> 6121128
>
>
>> How many worksheets in XLS file? How many rows and columns in each? Size
>> of XLS file in MB? What software created the XLS file? What is your
>> script doing after it opens the workbook?
>
> 39 worksheets. ~65000 rows and 16 columns each. The file is about
> 2MB. I'm not sure how the file was created. The script is basically
> just scanning the whole file and reading in most of the cells.
Potential slowdowns: (1) getting cell(r,c).value, cell(r,c).ctype, and
cell(r,c).xf_index for each cell (2) not making a set of xf_indexes of
XFs whose Font contains bold ONCE before you start iterating over sheets
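Point (2) can be sketched like this. To keep the sketch self-contained it accepts any Book-like object; the attribute names (`xf_list`, `font_list`, `font_index`, `weight`) are xlrd's, and 700 is the BIFF font weight for bold:

```python
def bold_xf_indexes(book):
    """Given an xlrd Book opened with formatting_info=True, return the
    set of XF indexes whose font is bold (weight >= 700 in BIFF terms).
    Build this set ONCE, before iterating over sheets and cells."""
    return set(
        xfx
        for xfx, xf in enumerate(book.xf_list)
        if book.font_list[xf.font_index].weight >= 700
    )

# Then the per-cell work shrinks to a single set-membership test:
#     if sheet.cell_xf_index(rowx, colx) in bold:
#         ...  # cell is bold
```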
>
> The vast majority of these rows are blank rows. I don't have a lot of
> experience with Excel, but I've tried deleting them and had no
> success. According to the documentation for xlrd, without
> formatting_info on it will ignore blank rows at the end, but with it,
> it keeps formatting information for them.
Actually, it treats blank cells as empty cells. Ignoring wholly blank
rows ANYWHERE in the file would be consequential: every subsequent row's
index would no longer match its position in the file.
> Would this be why it's consuming so much memory?
Yes.
> If so, is there a way to remove the excess rows from the Excel file,
You might like to try xlutils.filter.
> or get xlrd to not load them?
Not at the moment.
You could open it first with formatting_info=False and grab nrows and
ncols for each sheet, then open it again with formatting_info=True and
use the saved limits in your loops.
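A sketch of that two-pass idea. The first pass is factored into a function that accepts any Book-like object so it is self-contained here; the method names (`sheet_names`, `sheet_by_name`, `unload_sheet`) and attributes (`nrows`, `ncols`) are xlrd's, and the file name in the comment is hypothetical:

```python
def trimmed_dimensions(book):
    """First pass: with formatting_info=False, trailing blank rows are
    discarded, so nrows/ncols reflect the data actually present.
    Open with on_demand=True so each sheet can be unloaded as we go."""
    dims = {}
    for name in book.sheet_names():
        sheet = book.sheet_by_name(name)
        dims[name] = (sheet.nrows, sheet.ncols)
        book.unload_sheet(name)
    return dims

# Second pass (sketch):
#     import xlrd
#     dims = trimmed_dimensions(
#         xlrd.open_workbook("yourfile.xls", on_demand=True))
#     book = xlrd.open_workbook("yourfile.xls", formatting_info=True)
#     for name in book.sheet_names():
#         nrows, ncols = dims[name]
#         ...  # loop only over range(nrows) x range(ncols)
```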
> I changed my script to go down a
> maximum of 2000 rows, which is an acceptable upper bound for the size
> of my spreadsheet, and I can cut the memory consumption to normal
> values with the on_demand attribute and unloading sheets when I'm
> done, but it's still taking an unacceptably long time to run because
> of the amount of row traversal.
In that case I'm definitely suspecting that you must be doing something
horrid in your looping code. Can you send me your script and your file
under an NDA (informal one preferred)?
> I guess I'll write something to
> detect the end of the file by looking for three consecutive blank rows
> or something, unless you have a better suggestion.
See above.
"blank" can mean many things:
(1) cell type is XL_CELL_EMPTY
(2) cell type is XL_CELL_BLANK
(3) cell type is XL_CELL_TEXT and value == u""
(4) cell type is XL_CELL_TEXT and value.isspace()
After a bit of trying various things on the weekend:
You have one or more XL_CELL_BLANK cells in the last row, causing
sheet.nrows to be so large. However I now suspect that many of the extra
cells are actually XL_CELL_EMPTY empty cells. It is very easy to create
such a pathological sheet: open Excel, go to cell IV65536, and don't
enter any data, just format it (e.g. change the background to blue), and
save the file. You will have one ..._BLANK cell and (2 ** 24 - 1)
..._EMPTY cells.
EMPTY cells don't have a valid xf_index; the value -1 is used internally
in xlrd to denote this. When you call sheet.cell_xf_index(rowx, colx) on
an empty cell, xlrd finds the effective xf_index by checking for row and
column default XFs (with up to two exceptions being thrown and caught
internally when those defaults don't exist). This is, relatively
speaking, very, very slow.
Suggestion: if sheet.cell_type(rowx, colx) == xlrd.XL_CELL_EMPTY, the
cell can't have any data to display, so it certainly can't be bold, so
"continue" immediately; don't call sheet.cell_xf_index(rowx, colx).
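The suggestion above, sketched as a loop. The sketch is duck-typed so it runs stand-alone: the constant 0 mirrors xlrd.XL_CELL_EMPTY's value, and bold_xfs is assumed to be a precomputed set of bold XF indexes (the name is hypothetical):

```python
XL_CELL_EMPTY = 0  # same value as xlrd.XL_CELL_EMPTY

def bold_cells(sheet, bold_xfs):
    """Yield (rowx, colx) for cells whose XF index is in bold_xfs,
    skipping empty cells up front so the slow cell_xf_index() default-XF
    lookup for empty cells is never triggered."""
    for rowx in range(sheet.nrows):
        for colx in range(sheet.ncols):
            if sheet.cell_type(rowx, colx) == XL_CELL_EMPTY:
                continue  # nothing to display, so it can't be bold
            if sheet.cell_xf_index(rowx, colx) in bold_xfs:
                yield (rowx, colx)
```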
>> The memory consumed
>> isn't causing the issue. It's being caused by the fact that the
>> program loops through every row of the spreadsheet, meaning it's
>> looping through all the blank rows at the end. When formatting_info
>> is False, these rows aren't loaded and sheet.nrows is set to the
>> actual number of used rows. When it's True, sheet.nrows is set to
>> ~65000, which makes these loops take a long time. Is there any way to
>> only load the rows loaded by formatting_info=False, but load them with
>> formatting_info? Or to determine the actual number of rows used on
>> the fly and stop scanning there?
>
> You could open it first with formatting_info=False and grab nrows and
> ncols for each sheet, then open it again with formatting_info=True and
> use the saved limits in your loops.
>
>> I changed my script to go down a
>> maximum of 2000 rows, which is an acceptable upper bound for the size
>> of my spreadsheet, and I can cut the memory consumption to normal
>> values with the on_demand attribute and unloading sheets when I'm
>> done, but it's still taking an unacceptably long time to run because
>> of the amount of row traversal.
>
> In that case I'm definitely suspecting that you must be doing something
> horrid in your looping code. Can you send me your script and your file
> under an NDA (informal one preferred)?
I'd still like very much to see your script and your file ...
Would the ragged rows stuff help with this?
Chris
--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk
"you" == Chris?
"the ragged rows stuff" is an unreleased enhancement, available from
SVN. It in effect saves memory by ignoring trailing XL_CELL_EMPTY cells
in each row. A new method sheet.row_length(rowx) needs to be used
instead of sheet.ncols (which tells the MAXIMUM number of columns). It
would certainly help in the pathological case that I mentioned -- all
rows except the last would be reported as having zero length. It would
not help if the rightmost column in almost every row contained an
XL_CELL_BLANK cell. A combination of ragged_rows and ignoring
XL_CELL_EMPTY cells would be best.
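The looping pattern that goes with ragged rows would look roughly like this. I've used row_len, the name the method carries in released xlrd (0.7.2 and later, with open_workbook(..., ragged_rows=True)); the text above refers to the pre-release name row_length, so treat the name as an assumption. The sketch is duck-typed so it runs stand-alone:

```python
def iter_used_cells(sheet):
    """Iterate (rowx, colx) over only the cells actually stored in each
    row, using the per-row length instead of the sheet-wide maximum
    ncols. With ragged rows, trailing empty cells are simply absent, so
    a wholly empty row reports length zero and costs nothing to scan."""
    for rowx in range(sheet.nrows):
        for colx in range(sheet.row_len(rowx)):
            yield (rowx, colx)
```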
> The XL_CELL_EMPTY approach worked
> perfectly! I continue to the next sheet on 3 or more consecutive
> XL_CELL_EMPTY or XL_CELL_BLANK cells. Perhaps there's a slightly more
> elegant solution, but this works quite fine for my purposes.
"The" XL_CELL_EMPTY approach? What I suggested was in effect ignoring
all such cells. Your "3 or more consecutive" approach is as you say
inelegant, AND more importantly depends heavily on foreknowledge or
assumption about your file. Also I'd dearly like to see the "3 or more
consecutive" code ... surely you'd just continue to the next sheet after
seeing 3 consecutive whatevers.
> Unfortunately, I can't send you my script, even with an NDA, as it is
> for work, and that's against company policy.
Please consider that without the actual script, the process of
determining what your problem is has been complicated by the fact that
your problem description has been very woolly (and continues to be so;
see "database" below). Is it your company's policy that its employees
hide behind gmail etc addresses when soliciting free support for FOSS
packages?
> But yeah, it's doing
> several database calls for every cell, so each cell consumes a
> significant amount of time.
"database calls"? You said you were testing for bold text ...