Welcome to the right place to deal with Python and Excel.
Perhaps you can make available on your website an example of the Excel
files you have; or upload it to something like http://drop.io/ for us
to download?
The general workflow I would use for this:
1. In a loop, use xlrd to read the Excel file, row-by-row:
1a) Read each row into a Python list
1b) Transfer that Python list into the next row of your pre-allocated*
Numpy array, http://numpy.org/,
2. Perform the data manipulation you describe.
3. Then use xlwt to export the results out again to Excel.
* Use sheet.nrows to figure that out, accounting for any header rows
I'm sure others may have different approaches.
Kevin
Hi ... Here's a smidgeon of devil's advocacy:
So you have multiple 4xN arrays, where N could be between 120,000 and
500,000 and you want to process pairs of arrays ...
Has your Python course covered scipy/numpy yet?
It might help if you were to explain:
(1) "base calls"
(2) "exported as DNA sequences"
(3) "associated in a meaningful way"
Most importantly: I have difficulty comprehending why you might want to
use Excel. How do your 4 x 500,000 arrays get into Excel files in the
first place? Who is doing that, and why? What it is that you are
actually planning to do with Excel files i.e. user-friendly interface to
WHAT? Are you going to type in the up to 2 million numbers in each
array? Are you going to browse through Excel files containing 500,000
rows of data?
Please note that currently xlrd and xlwt support only XLS files (those
compatible with Excel 97 to Excel 2003) which are limited to 65536 rows
per worksheet. Splitting your arrays into 65K chunks probably wouldn't
be user-friendly.
What did your bioinformatics course have to say about data storage?
HTH,
John
Any chance you could find a mail client that knows how to do reply
quoting properly? ;-)
> Has your Python course covered scipy/numpy yet?
>
> # No, but I've now looked them up. Like Alice, I could fold up like a
> telescope if I knew how to start...
Reference is lost on me, but numpy arrays do seem better suited to your
task in hand...
> #user friendly interface
Excel is *not* user friendly for the amounts of data you're talking about...
> How do your 4 x 500,000 arrays get into Excel files in the
> first place?
>
> # Data are provided that way by an external proprietary program, easily
> cut-and-pasted into Excel
If they need to be cut and pasted into excel, maybe you could explain
what "that way" provided by the external program is? What other export
options does that external proprietary program have?
> # Yup. DNA 101 - We are looking at an experimental vector that agrees with a
> reference vector ca 99.5% of lines, disagrees unambiguously ca. 0.2% of
> lines, and is ambiguous 0.3%. Many of the ambiguities are resolvable by
> automatic decision rules. Maybe 0.2% require inspection and hand-curation.
This really sounds like a task for numpy arrays, with maybe the 0.2%
being spat out to excel files.
> *All but the last step is programmed, albeit in a series of cludgy VBA Ifs,
> rather than a single elegant Python control structure *. As well, some
> repetitive Excel tasks are done by hand, as there don't seem to be macros to
> handle them. This wastes everybody's time.
Dump excel for this part, look at numpy, maybe look at matplotlib too...
> What did your bioinformatics course have to say about data storage?
>
> We do this on PCs, storage isn't a problem, but execution time is slow
> enough to be seriously annoying. Excel 2007's compulsive need to save an
> unlimited number of undo steps really slows things down.
Seriously, I feel the need to scream "numpy" at you very loudly,
resisting the urge to insert the "t". Wrong tools for the job. If you
need to store data, use a database. Sqlite via sqlalchemy is trivial to
get going with nowadays... *only* the small sets of data that really
need human intervention should be chucked in excel files, and even then
only if you can't find some better way of dealing with them...
Chris :-)
--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk
Please don't ignore Chris's suggestion aboutr getting a mail client that
can quote properly. E.g. Thunderbird is free and not too stupid ;-)
>> #user friendly interface
>
> Excel is *not* user friendly for the amounts of data you're talking about...
>
>> How do your 4 x 500,000 arrays get into Excel files in the
>> first place?
>
> If they need to be cut and pasted into excel, maybe you could explain
> what "that way" provided by the external program is? What other export
> options does that external proprietary program have?
>
> #Proprietary program supplies them as 4 x n arrays, with its own analysis
> based on a propriety likelihood algorithm, in two more columns, which we
> ignore. My (much simpler arithmetic) algorithm is demonstrably superior at
> recovering accurate data. No other option of data export
So write a short chunk of Python to read the provided file, rather than
manually cut and paste into anything.
>
> Dump excel for this part, look at numpy, maybe look at matplotlib too...
>
> #OK - I'm relearning control structures now, then I require a means
"a means" ??
>
> Seriously, I feel the need to scream "numpy" at you very loudly,
> resisting the urge to insert the "t".
+1
>
> #What I think I know at this point:
> I need a spreadsheet interface, like Excel, or OpenOffice (the latter
> lacks some useful features). I would gladly use a Python spreadsheet, if
> such exists.
[caveat lector: my quoting the following URL implies nothing beyond what
you asked viz the product exists]
http://www.resolversystems.com/products/resolver-one
HTH,
John