Some basic information, please:
What version of Python
What version and SP of Windows
What version of xlrd
What does "freezes up" mean? Does it freeze inside open_workbook or
inside your own code? Does it freeze if you run
\python2X\python \python2X\scripts\runxlrd.py -f1 ov yourfile.xls
in a Command Prompt window [where "python2X" is where your Python is
installed]?
What does task manager show is the percentage of CPU that Python is
getting? What else is running?
How much physical memory?
Size of pagefile (shown as "Commit charge (limit)" in task manager on
Win XP)?
Either answer these or (better) send me a copy of your XLS file plus the
shortest script that demonstrates the problem:
How many worksheets in XLS file? How many rows and columns in each? Size
of XLS file in MB? What software created the XLS file? What is your
script doing after it opens the workbook?
Here's a reference point for you: On my setup (Win XP SP2, Python 2.7,
latest release xlrd) running the above command on a large file (120 MB)
takes max extra memory of 150MB with -f0 and 183MB with -f1. This runs
comfortably on a 1GB memory box with 128MB allocated to screen memory
and the task manager graph showing about 637MB in use before I run the
command.
The difference in memory consumption is mostly attributable to having
[with default open_workbook args] an array.array('h') per row to contain
XF indexes. In other words, 2 bytes per cell plus small change. Double
that to allow for Windows mem mgt silliness. So 10 MB -> 500 MB
certainly indicates a problem somewhere.
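That 2-bytes-per-cell arithmetic can be sketched as follows; the function name and the example dimensions are illustrative only, not part of xlrd:

```python
def xf_index_overhead_bytes(nrows, ncols, nsheets=1):
    """Estimate the extra memory used for XF indexes when a workbook
    is opened with formatting_info=True (other args defaulted): xlrd
    keeps an array.array('h') per row, i.e. 2 bytes per cell, ignoring
    the small per-row object overhead ("small change")."""
    return nsheets * nrows * ncols * 2

# A single fully-populated Excel 97 sheet (65536 rows x 256 columns):
print(xf_index_overhead_bytes(65536, 256))  # 33554432, i.e. exactly 32 MB
```

Double that, per the rule of thumb above, and one pathologically full sheet still only accounts for ~64 MB, so 10 MB -> 500 MB needs another explanation.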
Cheers,
John
Any particular reason why you aren't using 2.6 or 2.7?
> XP SP 3
> xlrd 0.7.1
>
>> What does "freezes up" mean? Does it freeze inside open_workbook or
>> inside your own code?
It helps if you answer all questions, especially ones that you should
have asked yourself before asking the forum. I presume from later
information that the answers are (1) "runs slowly" (2) "inside your own
code"
>> Does it freeze if you run
>>
>> \python2X\python \python2X\scripts\runxlrd.py -f1 ov yourfile.xls
>>
>> in a Command Prompt window [where "python2X" is where your Python is
>> installed]?
>
> This command completes after a while.
How long is "a while"??? Note: it actually prints the time to do each of
open_workbook() and the requested command.
> It consumes a comparable amount of resources to my script (nearly
> 600MB)
>
>>
>> What does task manager show is the percentage of CPU that Python is
>> getting? What else is running?
>
> 13%
This is rather unlikely ... after opening your physically-small file and
getting the contents into memory, open_workbook should be getting 100%
of an otherwise idle CPU; likewise with your loop-over-all-cells code.
Are you sure it's not swapping?
> nothing else is consuming large amounts of resources. Just the
> standard web browser etc.
What is "the standard web browser"?? IE? Firefox? Opera? Chrome?
something else? Note: the latest Firefox seems to have a habit of using
99% of the CPU for minutes at a time when a complex page is in the
actively-displayed tab ... were you actually watching Task Manager all
the time?
>
>>
>> How much physical memory?
>> Size of pagefile (shown as "Commit charge (limit)" in task manager on
>> Win XP)?
>
> 3MB RAM
3GB, perhaps?
> 6121128
>
>
>> How many worksheets in XLS file? How many rows and columns in each? Size
>> of XLS file in MB? What software created the XLS file? What is your
>> script doing after it opens the workbook?
>
> 39 worksheets. ~65000 rows and 16 columns each. The file is about
> 2MB. I'm not sure how the file was created. The script is basically
> just scanning the whole file and reading in most of the cells.
Potential slowdowns: (1) getting cell(r,c).value, cell(r,c).ctype, and
cell(r,c).xf_index for each cell (2) not making a set of xf_indexes of
XFs whose Font contains bold ONCE before you start iterating over sheets
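Point (2) can be sketched like this. To keep the sketch self-contained it accepts any Book-like object; the attribute names (`xf_list`, `font_list`, `font_index`, `weight`) are xlrd's, and 700 is the BIFF font weight for bold:

```python
def bold_xf_indexes(book):
    """Given an xlrd Book opened with formatting_info=True, return the
    set of XF indexes whose font is bold (weight >= 700 in BIFF terms).
    Build this set ONCE, before iterating over sheets and cells."""
    return set(
        xfx
        for xfx, xf in enumerate(book.xf_list)
        if book.font_list[xf.font_index].weight >= 700
    )

# Then the per-cell work shrinks to a single set-membership test:
#     if sheet.cell_xf_index(rowx, colx) in bold:
#         ...  # cell is bold
```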
>
> The vast majority of these rows are blank rows. I don't have a lot of
> experience with Excel, but I've tried deleting them and had no
> success. According to the documentation for xlrd, without
> formatting_info on it will ignore blank rows at the end, but with it,
> it keeps formatting information for them.
Actually, it treats blank cells as empty cells. Ignoring wholly blank
rows ANYWHERE in the file would be consequential: every subsequent row's
index would no longer match its position in the file.
> Would this be why it's consuming so much memory?
Yes.
> If so, is there a way to remove the excess rows from the Excel file,
You might like to try xlutils.filter.
> or get xlrd to not load them?
Not at the moment.
You could open it first with formatting_info=False and grab nrows and
ncols for each sheet, then open it again with formatting_info=True and
use the saved limits in your loops.
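A sketch of that two-pass idea. The first pass is factored into a function that accepts any Book-like object so it is self-contained here; the method names (`sheet_names`, `sheet_by_name`, `unload_sheet`) and attributes (`nrows`, `ncols`) are xlrd's, and the file name in the comment is hypothetical:

```python
def trimmed_dimensions(book):
    """First pass: with formatting_info=False, trailing blank rows are
    discarded, so nrows/ncols reflect the data actually present.
    Open with on_demand=True so each sheet can be unloaded as we go."""
    dims = {}
    for name in book.sheet_names():
        sheet = book.sheet_by_name(name)
        dims[name] = (sheet.nrows, sheet.ncols)
        book.unload_sheet(name)
    return dims

# Second pass (sketch):
#     import xlrd
#     dims = trimmed_dimensions(
#         xlrd.open_workbook("yourfile.xls", on_demand=True))
#     book = xlrd.open_workbook("yourfile.xls", formatting_info=True)
#     for name in book.sheet_names():
#         nrows, ncols = dims[name]
#         ...  # loop only over range(nrows) x range(ncols)
```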
> I changed my script to go down a
> maximum of 2000 rows, which is an acceptable upper bound for the size
> of my spreadsheet, and I can cut the memory consumption to normal
> values with the on_demand attribute and unloading sheets when I'm
> done, but it's still taking an unacceptably long time to run because
> of the amount of row traversal.
In that case I'm definitely suspecting that you must be doing something
horrid in your looping code. Can you send me your script and your file
under an NDA (informal one preferred)?
> I guess I'll write something to
> detect the end of the file by looking for three consecutive blank rows
> or something, unless you have a better suggestion.
See above.
"blank" can mean many things:
(1) cell type is XL_CELL_EMPTY
(2) cell type is XL_CELL_BLANK
(3) cell type is XL_CELL_TEXT and value == u""
(4) cell type is XL_CELL_TEXT and value.isspace()
After a bit of trying various things on the weekend:
You have one or more XL_CELL_BLANK cells in the last row, causing
sheet.nrows to be so large. However I now suspect that many of the extra
cells are actually XL_CELL_EMPTY empty cells. It is very easy to create
such a pathological sheet: open Excel, go to cell IV65536, and don't
enter any data, just format it (e.g. change the background to blue), and
save the file. You will have one ..._BLANK cell and (2 ** 24 - 1)
..._EMPTY cells.
EMPTY cells don't have a valid xf_index; the value -1 is used internally
in xlrd to denote this. When you call sheet.cell_xf_index(rowx, colx) on
an empty cell, xlrd finds the effective xf_index by checking for row and
column default XFs (with up to two exceptions being thrown and caught
internally when those defaults don't exist). This is, relatively
speaking, very, very slow.
Suggestion: if sheet.cell_type(rowx, colx) == xlrd.XL_CELL_EMPTY, the
cell can't have any data to display, so it certainly can't be bold, so
"continue" immediately; don't call sheet.cell_xf_index(rowx, colx).
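The suggestion above, sketched as a loop. The sketch is duck-typed so it runs stand-alone: the constant 0 mirrors xlrd.XL_CELL_EMPTY's value, and bold_xfs is assumed to be a precomputed set of bold XF indexes (the name is hypothetical):

```python
XL_CELL_EMPTY = 0  # same value as xlrd.XL_CELL_EMPTY

def bold_cells(sheet, bold_xfs):
    """Yield (rowx, colx) for cells whose XF index is in bold_xfs,
    skipping empty cells up front so the slow cell_xf_index() default-XF
    lookup for empty cells is never triggered."""
    for rowx in range(sheet.nrows):
        for colx in range(sheet.ncols):
            if sheet.cell_type(rowx, colx) == XL_CELL_EMPTY:
                continue  # nothing to display, so it can't be bold
            if sheet.cell_xf_index(rowx, colx) in bold_xfs:
                yield (rowx, colx)
```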
>> The memory consumed
>> isn't causing the issue. It's being caused by the fact that the
>> program loops through every row of the spreadsheet, meaning it's
>> looping through all the blank rows at the end. When formatting_info
>> is False, these rows aren't loaded and sheet.nrows is set to the
>> actual number of used rows. When it's True, sheet.nrows is set to
>> ~65000, which makes these loops take a long time. Is there any way to
>> only load the rows loaded by formatting_info=False, but load them with
>> formatting_info? Or to determine the actual number of rows used on
>> the fly and stop scanning there?
>
> You could open it first with formatting_info=False and grab nrows and
> ncols for each sheet, then open it again with formatting_info=True and
> use the saved limits in your loops.
>
>> I changed my script to go down a
>> maximum of 2000 rows, which is an acceptable upper bound for the size
>> of my spreadsheet, and I can cut the memory consumption to normal
>> values with the on_demand attribute and unloading sheets when I'm
>> done, but it's still taking an unacceptably long time to run because
>> of the amount of row traversal.
>
> In that case I'm definitely suspecting that you must be doing something
> horrid in your looping code. Can you send me your script and your file
> under an NDA (informal one preferred)?
I'd still like very much to see your script and your file ...
Would the ragged rows stuff help with this?
Chris
--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk
"you" == Chris?
"the ragged rows stuff" is an unreleased enhancement, available from
SVN. It in effect saves memory by ignoring trailing XL_CELL_EMPTY cells
in each row. A new method sheet.row_length(rowx) needs to be used
instead of sheet.ncols (which tells the MAXIMUM number of columns). It
would certainly help in the pathological case that I mentioned -- all
rows except the last would be reported as having zero length. It would
not help if the rightmost column in almost every row contained an
XL_CELL_BLANK cell. A combination of ragged_rows and ignoring
XL_CELL_EMPTY cells would be best.
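The looping pattern that goes with ragged rows would look roughly like this. I've used row_len, the name the method carries in released xlrd (0.7.2 and later, with open_workbook(..., ragged_rows=True)); the text above refers to the pre-release name row_length, so treat the name as an assumption. The sketch is duck-typed so it runs stand-alone:

```python
def iter_used_cells(sheet):
    """Iterate (rowx, colx) over only the cells actually stored in each
    row, using the per-row length instead of the sheet-wide maximum
    ncols. With ragged rows, trailing empty cells are simply absent, so
    a wholly empty row reports length zero and costs nothing to scan."""
    for rowx in range(sheet.nrows):
        for colx in range(sheet.row_len(rowx)):
            yield (rowx, colx)
```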
> The XL_CELL_EMPTY approach worked
> perfectly! I continue to the next sheet on 3 or more consecutive
> XL_CELL_EMPTY or XL_CELL_BLANK cells. Perhaps there's a slightly more
> elegant solution, but this works quite fine for my purposes.
"The" XL_CELL_EMPTY approach? What I suggested was in effect ignoring
all such cells. Your "3 or more consecutive" approach is as you say
inelegant, AND more importantly depends heavily on foreknowledge or
assumption about your file. Also I'd dearly like to see the "3 or more
consecutive" code ... surely you'd just continue to the next sheet after
seeing 3 consecutive whatevers.
> Unfortunately, I can't send you my script, even with an NDA, as it is
> for work, and that's against company policy.
Please consider that without the actual script, the process of
determining what your problem is has been complicated by the fact that
your problem description has been very woolly (and continues to be so;
see "database" below). Is it your company's policy that its employees
hide behind gmail etc addresses when soliciting free support for FOSS
packages?
> But yeah, it's doing
> several database calls for every cell, so each cell consumes a
> significant amount of time.
"database calls"? You said you were testing for bold text ...