EnvironmentError: [Errno 22] Invalid argument

823 views
Skip to first unread message

jennyshk

unread,
Aug 17, 2009, 2:00:39 AM8/17/09
to python-excel
Hi,

An error 'EnvironmentError: [Errno 22] Invalid argument' occurs when I
run my script which basically opens up file and writes to it until the
loop has finished. The script runs fine after several times, but then
in the middle it just throws this error and I suspect its something to
do with opening the workbook in the method below. I'm unsure why? I'm
still quite new to python but I've followed this code closely to the
examples provided in the tutorials, maybe I'm missing something?? Do I
need to close the workbook at the end or does saving it enough? Please
help!

def collectDBSubtotal(database, subtotal, filename, lastmonth):
#todaysDate = datetime.date(2000,1,1).today()

rb = open_workbook(save_file+filename,formatting_info=True) <-----
Error exists here
readSheet = rb.sheet_by_index(0) #reference to read subtotal
print "get sheet by index"
wb = copy(rb)
print "Copied workbook"
writeSheet = wb.get_sheet(0) #reference to write subtotal
print "About to writesheet"
if readSheet.nrows != 0 :
found = 0
for rowx in xrange(readSheet.nrows) :
if rowx != 0 :
readValue = readSheet.cell_value(rowx, 0)
rowNum = str(rowx+1)
if readValue == database.lower() :
writeSheet.write(rowx, lastmonth, subtotal)
writeSheet.write(rowx, 13, Formula('SUM($B$'+rowNum
+':$M$'+rowNum+')'))
found = 1
else :
writeSheet.write(rowx, 13, Formula('SUM($B$'+rowNum
+':$M$'+rowNum+')'))
if not found : #db not found, add at the end
writeSheet.write(readSheet.nrows, 0, database.lower())
#write dbname
writeSheet.write(readSheet.nrows, lastmonth, subtotal)
#write subtotal
rowNum = str(readSheet.nrows+1)
writeSheet.write(readSheet.nrows, 13, Formula('SUM($B
$'+rowNum+':$M$'+rowNum+')')) #write formula
else :#start writing some rows
writeSheet.write(1, 0, database.lower())
writeSheet.write(1, lastmonth, subtotal)
writeSheet.write(1, 13, Formula('SUM($B$1:$M$1'))
print "Saving file and leaving collectDBSubtotal" +save_file
wb.save(save_file+filename)

jennyshk

unread,
Aug 17, 2009, 2:36:05 AM8/17/09
to python-excel
In addition, the script runs ok when the data (web logfile) that gets
feed into the for loop is relatively small but once it gets larger say
at 11MB it throws an error. OS Unix 5.9, Python version 2.4.2.

Chris Withers

unread,
Aug 17, 2009, 3:35:27 AM8/17/09
to python...@googlegroups.com
jennyshk wrote:
> An error 'EnvironmentError: [Errno 22] Invalid argument'

Can you provide the full traceback please?

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk

John Machin

unread,
Aug 17, 2009, 5:01:13 AM8/17/09
to python...@googlegroups.com
On 17/08/2009 4:36 PM, jennyshk wrote:
> In addition, the script runs ok when the data (web logfile) that gets
> feed into the for loop is relatively small but once it gets larger say
> at 11MB it throws an error.

But you suspect that the error is happening when the input xls is opened
(which appears to be BEFORE the data is accessed) ... which for loop
is that?


> OS Unix 5.9, Python version 2.4.2.

What is "Unix 5.9"? Please tell us what you get when you run this:
import sys
print sys.platform
print sys.version

>
> On Aug 17, 4:00 pm, jennyshk <jenny...@gmail.com> wrote:
>> Hi,
>>
>> An error 'EnvironmentError: [Errno 22] Invalid argument' occurs when I
>> run my script which basically opens up file and writes to it until the
>> loop has finished. The script runs fine after several times,

after several times of what?

>> but then
>> in the middle

in the middle of what?

>> it just throws this error and I suspect its something to
>> do with opening the workbook in the method below.

WHY do you so suspect?

>> I'm unsure why? I'm
>> still quite new to python but I've followed this code closely to the
>> examples provided in the tutorials, maybe I'm missing something?? Do I
>> need to close the workbook at the end or does saving it enough?

There is no close() method for either input or output workbook, because
none is (should be?) needed. The input file is closed when reading is
finished. The output file is closed when writing is finished.

>>
>> def collectDBSubtotal(database, subtotal, filename, lastmonth):
>> #todaysDate = datetime.date(2000,1,1).today()
>>
>> rb = open_workbook(save_file+filename,formatting_info=True) <-----

A high proportion of results from a Google search for the error message
"EnvironmentError: [Errno 22] Invalid argument" mention mmap; please add
", use_mmap=False" to the open_workbook args and try again and let us
know the result.

>> Error exists here

It would help enormously (with any problem with an unexpected or
mystifying Python exception) if you were to give the full traceback and
error message.

What is in "save_file"? Is it a global?

>> readSheet = rb.sheet_by_index(0) #reference to read subtotal
>> print "get sheet by index"
>> wb = copy(rb)

This would be xlutils.copy(), would it?

>> print "Copied workbook"
>> writeSheet = wb.get_sheet(0) #reference to write subtotal
>> print "About to writesheet"
>> if readSheet.nrows != 0 :
>> found = 0
>> for rowx in xrange(readSheet.nrows) :
>> if rowx != 0 :

Nothing to do with your problem, but consider compressing the previous
two lines into one line by using xrange(1, readSheet.nrows)

>> readValue = readSheet.cell_value(rowx, 0)
>> rowNum = str(rowx+1)
>> if readValue == database.lower() :
>> writeSheet.write(rowx, lastmonth, subtotal)
>> writeSheet.write(rowx, 13, Formula('SUM($B$'+rowNum
>> +':$M$'+rowNum+')'))
>> found = 1
>> else :
>> writeSheet.write(rowx, 13, Formula('SUM($B$'+rowNum
>> +':$M$'+rowNum+')'))
>> if not found : #db not found, add at the end
>> writeSheet.write(readSheet.nrows, 0, database.lower())
>> #write dbname
>> writeSheet.write(readSheet.nrows, lastmonth, subtotal)
>> #write subtotal
>> rowNum = str(readSheet.nrows+1)
>> writeSheet.write(readSheet.nrows, 13, Formula('SUM($B
>> $'+rowNum+':$M$'+rowNum+')')) #write formula
>> else :#start writing some rows
>> writeSheet.write(1, 0, database.lower())
>> writeSheet.write(1, lastmonth, subtotal)
>> writeSheet.write(1, 13, Formula('SUM($B$1:$M$1'))
>> print "Saving file and leaving collectDBSubtotal" +save_file
>> wb.save(save_file+filename)

Ummmm ... saving the output file with the same name as the input file?

......

OK, let me see if my reading of the above is correct:

This function collectDBSubtotal has args:

database: key which may match a value in column 0.
lastmonth: values are 1, ..., 12
subtotal: a number to be put in column[lastmonth]

Each time the function is called:
(1) the input xls is opened, and the xlrd Book object is copied to an
xlwt Workbook object
(2) you loop over all the rows in the first worksheet; if column[0]
matches database, you update column[lastmonth] = subtotal.
(3) if database was not found, you write a new row at the end.
(4) not sure why you want to plug a row-total Formula in at all, let
alone why you rewrite it on non-database rows, but we'll ignore that for
the moment.
(5) the output workbook is saved.

Now that's a whole lot of pushing and shoving to update (effectively)
ONE cell[database][lastmonth] = subtotal. You mentioned 11 Mb of data
... does that mean that you are calling this function in a loop, with
many thousands of different (database, lastmonth, subtotal) tuples?

If so, I'm not altogether surprised that something gagged on that.
Perhaps xlrd needs to shut down the mmap object explicitly; perhaps not
every *x implementation of the underlying functionality works properly
when under extreme stress.

If you are actually doing what I fear, you should consider changing it
so that:

(1) your input data is a big list of (database, lastmonth, subtotal) tuples.

(2) you do these things exactly ONCE each: open input workbook, copy to
output workbook, read through the xlrd worksheet.cellvalue(rowx, 0)
making a dict which maps from database to rowx

(3) then you iterate over the list of tuples; for each tuple you do this:

if database in the_map:
rowx = the_map[database]
else:
the_map[database] = rowx = lastrowx = lastrowx + 1
writeSheet.write(rowx, 0, database)
writeSheet.write(rowx, lastmonth, subtotal)
# mucking about with formulas is left as an exercise

HTH,
John

jennyshk

unread,
Aug 23, 2009, 8:14:42 PM8/23/09
to python-excel
Hi John and Chris,

Thank you very much for your prompt replies. Sorry I wasn't very
clear in my explanation. In general, my script produces a monthly
summary report (excel format), of the types of databases that have
been accessed through EZProxy webserver and finds out which University
faculty they're from.

That def collectDBSubtotal() method gets called from another method
reportMonthlyStatistics() which pulls data from a class object
FacultyBreakdown. Before reportMonthlyStatistics() is invoked to do a
big stats report, the class object is called throughout the script to
store information such as the database names and faculty names in the
class constructor by x.incrDB(databaseName, facultyName). I feed the
class each time with the 2 variables which are extracted from certain
loglines in the logfile. About a months worth of web logfiles are
searched where 1 daily gzipped logfile is about 12MB.

I suspected that the problem came from biff2_8_load() due to the
traceback call (posted below) which occurs after a number of times the
workbook is opened and rewritten to in the collectDBSubtotal(). Yes I
too think its because of the sheer volume of data that finally, the
workbook can no longer run and decides to fails to open. I have used
parameter

I have used mmap=false in open_workbook() and it worked! It took 78
hours to run however so I will have to try cut some things out of my
script to make it shorter. Anyway thanks greatly John and Chris for
your input, I can finally move onto the next bits!

John Machin

unread,
Aug 23, 2009, 9:23:26 PM8/23/09
to python...@googlegroups.com
On 24/08/2009 10:14 AM, jennyshk wrote:
> Hi John and Chris,

Hi jennyshk,

>
> Thank you very much for your prompt replies. Sorry I wasn't very
> clear in my explanation.

The following explanation seems to have little bearing bearing on your
problem ...

> In general, my script produces a monthly
> summary report (excel format), of the types of databases that have
> been accessed through EZProxy webserver and finds out which University
> faculty they're from.
>
> That def collectDBSubtotal() method gets called from another method
> reportMonthlyStatistics() which pulls data from a class object
> FacultyBreakdown. Before reportMonthlyStatistics() is invoked to do a
> big stats report, the class object is called throughout the script to
> store information such as the database names and faculty names in the
> class constructor by x.incrDB(databaseName, facultyName). I feed the
> class each time with the 2 variables which are extracted from certain
> loglines in the logfile. About a months worth of web logfiles are
> searched where 1 daily gzipped logfile is about 12MB.

... except to tell us that you have a large quantity of data.


>
> I suspected that the problem came from biff2_8_load() due to the
> traceback call (posted below)

Posted below where? I see no traceback output.

> which occurs after a number of times the
> workbook is opened and rewritten to in the collectDBSubtotal(). Yes I
> too think its because of the sheer volume of data that finally, the
> workbook can no longer run and decides to fails to open. I have used
> parameter
>
> I have used mmap=false in open_workbook() and it worked! It took 78
> hours to run however

I'm rather boggled that you would let it run for so long ... don't you
check performance on a subset of your data? E.g. work out how many
whatevers per minute it is doing and extrapolate that to a month's worth?

And how many whatevers per minute was it achieving with use_mmap=True
before it fell over?

> so I will have to try cut some things out of my
> script to make it shorter. Anyway thanks greatly John and Chris for
> your input, I can finally move onto the next bits!

Presumably the "next bits" include looking through the list of questions
that I asked you in my previous message and addressing some of them, e.g.

(1) This one appears highly relevant to your 78-hour run time: """Now

that's a whole lot of pushing and shoving to update (effectively) ONE
cell[database][lastmonth] = subtotal. You mentioned 11 Mb of data ...
does that mean that you are calling this function in a loop, with many
thousands of different (database, lastmonth, subtotal) tuples?"""

(2) """What is "Unix 5.9"? Please tell us what you get when you run this:


import sys
print sys.platform
print sys.version
"""

This question wasn't posed out of idle curiosity -- quite apart from the
fact that that sort of info should be supplied up front without being
asked, there's the question of whether mmap is behaving badly on your
system and whether xlrd can be changed to work around or bypass the problem.


jennyshk

unread,
Aug 24, 2009, 2:26:37 AM8/24/09
to python-excel
Hi John,

I thought I included that in my previous email but obviously the
traceback and platform/version result did not copy over. Anyway, I've
posted them again. Before running the script against the monthly logs
with mmap=True, I did several tests with different subsets of data,
ranging from a single modified logfile (reduced down to 50% in size)
and then went onto test logfiles for a day, week and month over the
weekend. The script ran OK when the dataset was less than 11MB or a
single logfile but anything over that I get "Invalid argument" error.
However, after your advice to use mmap=False, it ran successfully
minus the length of time it takes. It wasn't accurate for me to say
that the part of the script I was concerned with took exactly 3-days
to run because it also prints results to CSV and HTML format. But even
so, I did a seperate run for printing CSV and HTML outputs and this
took only 30 minutes.

(1) Traceback (most recent call last):
File "./ezproxy-testerror.py", line 1414, in ?
main()
File "./ezproxy-testerror.py", line 1406, in main
reportExcel(year, lastmonth)
File "./ezproxy-testerror.py", line 1329, in reportExcel
reportMonthlyStatistics(summarykeys, areaTotals, facultyTotals,
subtotal_file, useage_file2, lastmonth) #writes month stats,subtotals
File "./ezproxy-testerror.py", line 1196, in reportMonthlyStatistics
collectDBSubtotal(DBname, facTotalDB, subtotal_file, lastmonth)
File "./ezproxy-testerror.py", line 1266, in collectDBSubtotal
rb = open_workbook(save_file+filename,formatting_info=True)
File "/usr/local/lib/python2.4/site-packages/xlrd/__init__.py", line
425, in open_workbook
on_demand=on_demand,
File "/usr/local/lib/python2.4/site-packages/xlrd/__init__.py", line
897, in biff2_8_load
EnvironmentError: [Errno 22] Invalid argument

(2) """import sys
print sys.platform
print sys.version """

sunos5
2.4.2 (#1, Oct 3 2005, 01:00:07)
[GCC 3.3.2]
Reply all
Reply to author
Forward
0 new messages