New member: interested in writing Python bioinformatic scripts, to be fed from and to an Excel spreadsheet


codfather

Sep 14, 2009, 1:04:44 PM9/14/09
to python-excel, sc...@mun.ca
Hi Folks
I'm a genetics professor with a 37-year background in Fortran, just
back from a bioinformatics course, and have just taken my first two
lectures in Python. My interest is in writing Python scripts that can
be fed spreadsheet data from Excel, and feed processed data back to Excel.
Excel is the user-friendly interface; Python is the way of doing
manipulations way better than in Visual Basic. My data are numerical
arrays of length N = 120,000 ~ 500,000, in 4xN arrays in duplicate, where
each element in the quartet is the signal strength of a DNA base,
A C G or T. The intensity data are to be compared between the
duplicates, base calls made, and exported as DNA sequences, with the
signal strengths associated with them in a meaningful way.

I look forward to working with folks who have experience at the
Python / Excel interface. Right now I have no idea how to do it.

Kevin Dunn

Sep 14, 2009, 1:30:48 PM9/14/09
to python...@googlegroups.com
Hi there,

Welcome to the right place to deal with Python and Excel.

Perhaps you can make available on your website an example of the Excel
files you have; or upload it to something like http://drop.io/ for us
to download?

The general workflow I would use for this:

1. In a loop, use xlrd to read the Excel file, row-by-row:
1a) Read each row into a Python list
1b) Transfer that Python list into the next row of your pre-allocated*
NumPy array (http://numpy.org/).
2. Perform the data manipulation you describe.
3. Then use xlwt to export the results out again to Excel.

* Use sheet.nrows to figure that out, accounting for any header rows
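A minimal sketch of steps 1-3, assuming xlrd, xlwt and NumPy are installed. The file names, the sheet layout, and the "manipulation" (a simple strongest-base pick per row) are illustrative only:

```python
# Fabricate a tiny 4-column intensity file with xlwt (a stand-in for the
# real export), then read it back row-by-row with xlrd into a
# pre-allocated NumPy array, process it, and write the results out again.
import numpy as np
import xlrd
import xlwt

HEADER = ["A", "C", "G", "T"]  # one column per DNA base

# --- stand-in for the real data file ---
wb = xlwt.Workbook()
ws = wb.add_sheet("signals")
for c, name in enumerate(HEADER):
    ws.write(0, c, name)
data = [[10.0, 2.0, 1.0, 0.5],
        [0.1, 9.0, 0.2, 0.3],
        [1.0, 1.0, 8.0, 2.0]]
for r, row in enumerate(data, start=1):
    for c, val in enumerate(row):
        ws.write(r, c, val)
wb.save("intensities.xls")

# --- 1. read row-by-row into a pre-allocated array ---
sheet = xlrd.open_workbook("intensities.xls").sheet_by_index(0)
n_header = 1  # sheet.nrows includes the header row
signal = np.empty((sheet.nrows - n_header, 4))
for r in range(n_header, sheet.nrows):
    signal[r - n_header, :] = sheet.row_values(r)[:4]  # 1a) list -> 1b) array row

# --- 2. the data manipulation (here: strongest base per row) ---
calls = [HEADER[i] for i in signal.argmax(axis=1)]

# --- 3. export the results with xlwt ---
out = xlwt.Workbook()
outsheet = out.add_sheet("calls")
for r, base in enumerate(calls):
    outsheet.write(r, 0, base)
out.save("calls.xls")
```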

I'm sure others may have different approaches.

Kevin

John Machin

Sep 14, 2009, 7:04:09 PM9/14/09
to python...@googlegroups.com

Hi ... Here's a smidgeon of devil's advocacy:

So you have multiple 4xN arrays, where N could be between 120,000 and
500,000 and you want to process pairs of arrays ...

Has your Python course covered scipy/numpy yet?

It might help if you were to explain:

(1) "base calls"
(2) "exported as DNA sequences"
(3) "associated in a meaningful way"

Most importantly: I have difficulty comprehending why you might want to
use Excel. How do your 4 x 500,000 arrays get into Excel files in the
first place? Who is doing that, and why? What is it that you are
actually planning to do with Excel files, i.e., a user-friendly interface to
WHAT? Are you going to type in the up to 2 million numbers in each
array? Are you going to browse through Excel files containing 500,000
rows of data?

Please note that currently xlrd and xlwt support only XLS files (those
compatible with Excel 97 to Excel 2003) which are limited to 65536 rows
per worksheet. Splitting your arrays into 65K chunks probably wouldn't
be user-friendly.

What did your bioinformatics course have to say about data storage?

HTH,
John

Dr Steven M Carr

Sep 16, 2009, 11:30:09 AM9/16/09
to python...@googlegroups.com
Responses interleaved below. I appreciate the commentary.


________________________________________
Dr Steven M Carr, Professor of Biology
Department of Biology
Memorial University of Newfoundland
St John's NL A1B 3X9 CANADA

e-mail: sc...@mun.ca
http://www.mun.ca/biology/scarr/Research.html
(709) 737 - 4776 office / -4713 lab / -3018 FAX
________________________________________
# N is a fixed number 115984 and in a later iteration will be ca. 480000

Has your Python course covered scipy/numpy yet?

# No, but I've now looked them up. Like Alice, I could fold up like a
telescope if I knew how to start...

It might help if you were to explain:

(1) "base calls"
(2) "exported as DNA sequences"
(3) "associated in a meaningful way"

# in the 4xN arrays, the 4 columns refer to the four letters of the DNA code
("bases"), A C G T, each of which is associated with a number
# There are two replicates of each experiment, which should be identical,
but sometimes aren't
# In each of the N rows, one of the four bases is "best" {highest number}:
that's the "call", with a confidence score relative to 'second best'
# Having identified 'best' in each row, export the result as a vector length
N ("the DNA sequence") of the 'best' in each row, with a second vector of
the confidence scores
# First-order complication: the replicates don't agree in ca. 0.5% of
cases, resulting in an 'ambiguity'. Employ hard decision rules based on
confidence scores, and fuzzy logic based on knowledge of the system, to
reduce ambiguities to A C G T.
# Second-order complication: remaining ambiguities must be hand-curated,
subject to experience and fuzzier logic.

Most importantly: I have difficulty comprehending why you might want to
use Excel.

#user friendly interface for handling what eventually become vector strings
exported to other programs, with easy sorting functions and conditional
formatting, etc.

How do your 4 x 500,000 arrays get into Excel files in the
first place?

# Data are provided that way by an external proprietary program, easily
cut-and-pasted into Excel

Who is doing that, and why?

# Me or a student, publication or a degree; see next answer

What is it that you are
actually planning to do with Excel files, i.e., a user-friendly interface to
WHAT? Are you going to type in the up to 2 million numbers in each
array? Are you going to browse through Excel files containing 500,000
rows of data?

# Yup. DNA 101 - We are looking at an experimental vector that agrees with a
reference vector ca 99.5% of lines, disagrees unambiguously ca. 0.2% of
lines, and is ambiguous 0.3%. Many of the ambiguities are resolvable by
automatic decision rules. Maybe 0.2% require inspection and hand-curation.
*All but the last step is programmed, albeit in a series of kludgy VBA Ifs,
rather than a single elegant Python control structure.* As well, some
repetitive Excel tasks are done by hand, as there don't seem to be macros to
handle them. This wastes everybody's time.

We compile scores of such vectors, one for each individual in a species. The
biological signal is in the pattern of disagreements, against each other
(evolution) and against functional patterns (molecular biology). This we
evaluate elsewhere with other software.

Please note that currently xlrd and xlwt support only XLS files (those
compatible with Excel 97 to Excel 2003) which are limited to 65536 rows
per worksheet. Splitting your arrays into 65K chunks probably wouldn't
be user-friendly.

#Just so. This wasn't really practical until Excel 2007; previously we used
a 4x16000 array, and had to break the large arrays into subsets of ca. 16000
rows, by hand. Presumably VBA can break our subsets out automatically, but
we haven't seen how. Then we would have a set of seven species, each 16000
long or so, and xlrd and xlwt would work?
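On the 65K question: rather than breaking the subsets out by hand, the chunking can be automated at write time. A hedged sketch (the helper and the sheet names are invented; assumes xlwt and NumPy are installed):

```python
import numpy as np
import xlwt

MAX_ROWS = 65536  # XLS worksheet row limit (Excel 97-2003)

def write_chunked(array, path):
    """Write a 2-D array across as many XLS worksheets as needed;
    return the number of sheets used."""
    wb = xlwt.Workbook()
    n_chunks = -(-len(array) // MAX_ROWS)   # ceiling division
    for k in range(n_chunks):
        chunk = array[k * MAX_ROWS:(k + 1) * MAX_ROWS]
        ws = wb.add_sheet("chunk%d" % k)
        for r, row in enumerate(chunk):
            for c, val in enumerate(row):
                ws.write(r, c, float(val))
    wb.save(path)
    return n_chunks
```

A 120,000-row array would land on two sheets this way; 500,000 rows would need eight.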

What did your bioinformatics course have to say about data storage?

# We do this on PCs; storage isn't a problem, but execution time is slow
enough to be seriously annoying. Excel 2007's compulsive need to save an
unlimited number of undo steps really slows things down.

HTH,
John

Ttfn
Steve




Dr Steven M Carr

Sep 16, 2009, 11:37:16 AM9/16/09
to python...@googlegroups.com
Thanks Kevin. I am creeping towards the suspicion that I can write a VBA macro to export a column to a Python script, which will evaluate my problem all at once with a more elegant Python control structure, then run another macro to import the Python results back into Excel as the final "call" with a quality score and perhaps conditional formatting.

I uploaded my Excel spreadsheet (one generation back) as you suggested, and will probably cut it down to one data subset, since the whole thing is like drinking from a spittoon.

Steve





Chris Withers

Sep 18, 2009, 6:44:26 AM9/18/09
to python...@googlegroups.com
Dr Steven M Carr wrote:
> Responses interleaved below.

Any chance you could find a mail client that knows how to do reply
quoting properly? ;-)

> Has your Python course covered scipy/numpy yet?
>
> # No, but I've now looked them up. Like Alice, I could fold up like a
> telescope if I knew how to start...

Reference is lost on me, but numpy arrays do seem better suited to the
task at hand...

> #user friendly interface

Excel is *not* user friendly for the amounts of data you're talking about...

> How do your 4 x 500,000 arrays get into Excel files in the
> first place?
>
> # Data are provided that way by an external proprietary program, easily
> cut-and-pasted into Excel

If they need to be cut and pasted into excel, maybe you could explain
what "that way" provided by the external program is? What other export
options does that external proprietary program have?

> # Yup. DNA 101 - We are looking at an experimental vector that agrees with a
> reference vector ca 99.5% of lines, disagrees unambiguously ca. 0.2% of
> lines, and is ambiguous 0.3%. Many of the ambiguities are resolvable by
> automatic decision rules. Maybe 0.2% require inspection and hand-curation.

This really sounds like a task for numpy arrays, with maybe the 0.2%
being spat out to excel files.

> *All but the last step is programmed, albeit in a series of cludgy VBA Ifs,
> rather than a single elegant Python control structure *. As well, some
> repetitive Excel tasks are done by hand, as there don't seem to be macros to
> handle them. This wastes everybody's time.

Dump excel for this part, look at numpy, maybe look at matplotlib too...

> What did your bioinformatics course have to say about data storage?
>
> We do this on PCs, storage isn't a problem, but execution time is slow
> enough to be seriously annoying. Excel 2007's compulsive need to save an
> unlimited number of undo steps really slows things down.

Seriously, I feel the need to scream "numpy" at you very loudly,
resisting the urge to insert the "t". Wrong tools for the job. If you
need to store data, use a database. Sqlite via sqlalchemy is trivial to
get going with nowadays... *only* the small sets of data that really
need human intervention should be chucked in excel files, and even then
only if you can't find some better way of dealing with them...
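For the storage half of that advice, a minimal illustration using the standard-library sqlite3 module (sqlalchemy would wrap the same idea; the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path in real use
conn.execute("""CREATE TABLE signals (
    line INTEGER PRIMARY KEY,
    a REAL, c REAL, g REAL, t REAL,
    call TEXT)""")
rows = [(1, 10.0, 1.0, 1.0, 1.0, "A"),
        (2, 2.0, 2.1, 1.9, 2.0, "N"),   # "N" = needs hand-curation
        (3, 0.5, 9.0, 0.4, 0.6, "C")]
conn.executemany("INSERT INTO signals VALUES (?, ?, ?, ?, ?, ?)", rows)

# Only the ambiguous lines need to reach a spreadsheet for inspection.
ambiguous = conn.execute("SELECT line FROM signals WHERE call = 'N'").fetchall()
```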

Chris :-)

--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk

Dr Steven M Carr

Sep 18, 2009, 11:51:32 AM9/18/09
to python...@googlegroups.com

Continuing thread ...

> #user friendly interface

Excel is *not* user friendly for the amounts of data you're talking about...

> How do your 4 x 500,000 arrays get into Excel files in the
> first place?

If they need to be cut and pasted into excel, maybe you could explain
what "that way" provided by the external program is? What other export
options does that external proprietary program have?

# The proprietary program supplies them as 4 x n arrays, with its own analysis,
based on a proprietary likelihood algorithm, in two more columns, which we
ignore. My (much simpler arithmetic) algorithm is demonstrably superior at
recovering accurate data. There is no other data-export option.

> # Yup. DNA 101 - We are looking at an experimental vector that agrees with a
> reference vector ca 99.5% of lines, disagrees unambiguously ca. 0.2% of
> lines, and is ambiguous 0.3%. Many of the ambiguities are resolvable by
> automatic decision rules. Maybe 0.2% require inspection and hand-curation.

This really sounds like a task for numpy arrays, with maybe the 0.2% being
spat out to excel files.

# The 0.2%, and all the data, have to be analyzed in context, e.g.:
1) Line 5678 might have associated info "2nd position of 43rd codon for
15th gene"
2) Interpretation of line n is influenced by data at lines +/- 12 from n.
3) E.g., a 'mutation' at line 10000 may affect signal from lines 9988 to
10012, but that effect is not apparent at those lines analyzed individually.

> *All but the last step is programmed, albeit in a series of cludgy VBA Ifs,
> rather than a single elegant Python control structure *. As well, some
> repetitive Excel tasks are done by hand, as there don't seem to be macros to
> handle them. This wastes everybody's time.

Dump excel for this part, look at numpy, maybe look at matplotlib too...

#OK - I'm relearning control structures now, then I require a means

Seriously, I feel the need to scream "numpy" at you very loudly,
resisting the urge to insert the "t". Wrong tools for the job. If you
need to store data, use a database.

Sqlite via sqlalchemy is trivial to get going with nowadays...

#OK, I'll read up on them.

*only* the small sets of data that really
need human intervention should be chucked in excel files, and even then
only if you can't find some better way of dealing with them...

# Expressed another way: *any* of the 500,000 lines may require intervention;
I don't know which until I run the same algorithm on all of them. The
component 16K-line vector must be kept intact, because there is information
in the order.

Chris :-)


#What I think I know at this point:
I need a spreadsheet interface, like Excel, or OpenOffice (the latter
lacks some useful features). I would gladly use a Python spreadsheet, if
such exists.
I *think* I want to export, for each of 100 ~ 500K lines, two blocks of
four numbers each, to a python macro-like script, which will use a python
control structure to generate a 'call' and quality score for both the first
and second blocks, and a consensus 'call' for the two blocks, based on
decision rules I already know. I think I could write these out now, as IF or
ELIF blocks, with slightly more knowledge of python syntax. That script
would return the call/score for each block, plus the consensus call, to the
spreadsheet. Ideally, the spreadsheet would color-code the consensus call
(again according to set rules), or at least return a column of flags [e.g.,
which rule was used on this line?] that would let Excel do that via
conditional formatting.
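The IF/ELIF structure described above might look roughly like this in Python; the threshold, the "N" placeholder, and the rule flags below are invented stand-ins for the actual decision rules:

```python
BASES = "ACGT"

def call_block(intensities, min_ratio=2.0):
    """Return (call, score) for one block of four signal strengths."""
    ranked = sorted(range(4), key=lambda i: intensities[i], reverse=True)
    best, second = ranked[0], ranked[1]
    score = intensities[best] / max(intensities[second], 1e-9)
    return (BASES[best] if score >= min_ratio else "N"), score  # "N" = ambiguous

def consensus(block1, block2):
    """Consensus call for the duplicate blocks, plus a flag naming the rule used."""
    c1, _ = call_block(block1)
    c2, _ = call_block(block2)
    if c1 == c2 and c1 != "N":
        return c1, "agree"
    if c1 == "N" and c2 != "N":
        return c2, "one-ambiguous"
    if c2 == "N" and c1 != "N":
        return c1, "one-ambiguous"
    return "N", "hand-curate"  # disagreement or double ambiguity: flag for inspection
```

The flag column is exactly the per-line "which rule was used?" marker that Excel's conditional formatting could then colour.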

Even simpler:
If not Excel, is there a python spreadsheet?
Can an Excel VBA macro [or other spreadsheet macro] invoke a python
script, and can python return a block of values to the spreadsheet,
seamlessly?
I think you're telling me 'yes,' but my understanding of the named
scripts is inadequate as yet.

Steve

John Machin

Sep 18, 2009, 12:17:50 PM9/18/09
to python...@googlegroups.com
On 19/09/2009 1:51 AM, Dr Steven M Carr wrote:

Please don't ignore Chris's suggestion about getting a mail client that
can quote properly. E.g. Thunderbird is free and not too stupid ;-)

>> #user friendly interface
>
> Excel is *not* user friendly for the amounts of data you're talking about...
>
>> How do your 4 x 500,000 arrays get into Excel files in the
>> first place?
>
> If they need to be cut and pasted into excel, maybe you could explain
> what "that way" provided by the external program is? What other export
> options does that external proprietary program have?
>
> #Proprietary program supplies them as 4 x n arrays, with its own analysis
> based on a propriety likelihood algorithm, in two more columns, which we
> ignore. My (much simpler arithmetic) algorithm is demonstrably superior at
> recovering accurate data. No other option of data export

So write a short chunk of Python to read the provided file, rather than
manually cut and paste into anything.

>
> Dump excel for this part, look at numpy, maybe look at matplotlib too...
>
> #OK - I'm relearning control structures now, then I require a means

"a means" ??

>
> Seriously, I feel the need to scream "numpy" at you very loudly,
> resisting the urge to insert the "t".

+1

>
> #What I think I know at this point:
> I need a spreadsheet interface, like Excel, or OpenOffice (the latter
> lacks some useful features). I would gladly use a Python spreadsheet, if
> such exists.

[caveat lector: my quoting the following URL implies nothing beyond what
you asked viz the product exists]
http://www.resolversystems.com/products/resolver-one

HTH,
John

Dr Steven M Carr

Sep 18, 2009, 1:07:10 PM9/18/09
to python...@googlegroups.com

>
> #What I think I know at this point:
> I need a spreadsheet interface, like Excel, or OpenOffice (the latter
> lacks some useful features). I would gladly use a Python spreadsheet, if
> such exists.

[caveat lector: my quoting the following URL implies nothing beyond what
you asked viz the product exists]
http://www.resolversystems.com/products/resolver-one


# This looks useful. Thanks for the suggestion

Steve

HTH,
John



pyinex

Sep 19, 2009, 9:05:25 AM9/19/09
to python-excel

>     Can an Excel VBA macro [or other spreadsheet macro] invoke a python
> script, and can python return a block of values to the spreadsheet,
> seamlessly?

Yes (and you don't even need a macro; you can just call the function
directly from your spreadsheet).

Go to:

http://code.google.com/p/pyinex

