Help with refactoring program to average data in columns in a csv file based on name in a single column

Chris

Oct 30, 2013, 2:48:59 AM
to python...@googlegroups.com
Dear Pythonistas,

I was wondering if anyone could help me with rewriting a script to average (mean) rows in a csv file, based on the value (Gene Symbol) in a particular column? This script reads in a csv file, then sums each row that has the same "Gene Name" while keeping a counter of the number of instances of each "Gene Name" and finally divides the summed values by the counter giving the mean. This is then outputted to another csv file. I originally wrote this script (attached as Original.py) with the ability to only handle 12 columns of data, but I would now like to rewrite it to:
A) Handle any number of columns of data (preferably just taken from analysis of the input file) (probably between 3 and 50).
B) Output values to a second file for the medians of the rows, rather than only the mean.
C) Run as a script on the command line with three inputs for 1) Input file name 2) Output file mean 3) Output file median.

I have attached a sample input and output, the original program and the partially revised program. The first two columns will always contain the "Probe Set ID" and the "Gene Symbol". Each Gene Symbol appears an unknown and variable number of times in the file. Ideally, I would like to append the Probe Set IDs in the output file, as the TestOutputMean file currently does.

So far I have been able to solve C, but am struggling with getting A and B to work. Specifically, I am not sure how to write the section that will take an unknown number of columns and output mean and/or median values. If anyone could help me with this it would be very much appreciated. This program is used in my research on asthma to analyse microarray data files of ~50,000 rows. Unfortunately, at this point I am stuck, as I am still learning Python!

Kindest regards,

Chris

PhD Candidate, University of Calgary
TestOutputMean.csv
Original.py
New.py
TestInput.csv

Jerry Seutter

Oct 30, 2013, 11:39:52 AM
to pythoncalgary
Hey, it looks like you're off to a great start with Python.  Congratulations!

1. Using csv.reader and csv.writer will simplify your code and will work more reliably:

import csv
# "with" closes the file for you when the block ends
with open('TestInput.csv') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)  # each row is a list of strings

You can find documentation at http://docs.python.org/2/library/csv.html
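
To show how csv.reader helps with the "any number of columns" part, here is a minimal sketch (written for Python 3) that reads the header row separately and works out the number of data columns from it. The function name read_table and the sample values are made up for illustration, and I am assuming the first row of your file is a header row and the first two columns are the Probe Set ID and Gene Symbol, as you described.

```python
import csv
import io

def read_table(fileobj):
    """Return (headers, rows) from an open CSV file object."""
    reader = csv.reader(fileobj)
    headers = next(reader)                 # first row holds the column names
    rows = [row for row in reader if row]  # skip completely blank lines
    return headers, rows

# In-memory stand-in for a real file; the values are invented
sample = io.StringIO(
    "Probe Set ID,Gene Symbol,Sample 1,Sample 2\n"
    "1001_at,AGPAT9,1.5,1\n"
    "1002_at,AGPAT9,0.5,3\n"
)
headers, rows = read_table(sample)
n_data_columns = len(headers) - 2  # everything after the two ID columns
```

With a real file you would pass the object returned by open() instead of the StringIO. Note that every cell comes back as a string, so the numbers still need float() before you do arithmetic on them.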

2.  I think you are finding your problem hard because the data is not held in the right "structure" in your program.  Right now you are storing a dictionary keyed by gene name with the sums of the other columns.  That was enough for the mean, but a median cannot be computed from a sum, so this needs to change.  Instead, store a dictionary keyed by gene name that keeps the individual values of the other columns:


old_dict = {'AGPAT9': ['2', '4']}

new_dict = {'AGPAT9': { 'Column A': ['1.5', '0.5'], 'Column B': ['1', '3'] }}

Notice how the values are not summed in new_dict?  The extra inner dictionary keeps every individual value, which is what you need to calculate the median.  Unfortunately this complicates how you store and extract information in your dict:


To store the data:

gene_dict = {}
# Example header and data row; in your program these come from the csv reader
headers = ['Gene Symbol', 'Column A', 'Column B']
row = ['AGPAT9', '1.5', '1']
for i, val in enumerate(row):
    # The first column gets special treatment because it is our gene name
    if i == 0:
        gene_name = val
        continue

    # This part is just data structure maintenance...
    if gene_name not in gene_dict:
        gene_dict[gene_name] = {}
    column_name = headers[i]
    if column_name not in gene_dict[gene_name]:
        gene_dict[gene_name][column_name] = []

    # Now we can actually store data (as a number, so it can be summed later)...
    gene_dict[gene_name][column_name].append(float(val))


To read the data:

for gene_name, gene_data in sorted(gene_dict.items()):
    for column_name, column_values in sorted(gene_data.items()):
        total = sum(column_values)
        average = total / float(len(column_values))
        median_value = median(column_values) # You have to write your own function for this

See how much easier it is to do your calculations when you have your data structures right?  This is the hard part of writing computer programs and is usually a process of trial and error.
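
Putting the pieces together, here is one possible end-to-end sketch, not a drop-in replacement for your script. It assumes Python 3.4 or later (for the standard library's statistics module), a header row, the Probe Set ID and Gene Symbol in the first two columns, and numeric values everywhere else. The function name summarise, the choice of one list per column instead of an inner dictionary, and the "; " separator for the appended Probe Set IDs are my own guesses, since I have not matched them against your TestOutputMean file.

```python
import csv
import io
import statistics

def summarise(infile, mean_out, median_out):
    """Group rows by gene symbol and write per-column means and medians."""
    reader = csv.reader(infile)
    headers = next(reader)
    n_cols = len(headers) - 2          # data columns after the two ID columns
    probes = {}                        # gene -> list of probe set IDs
    columns = {}                       # gene -> one list of floats per column
    for row in reader:
        if not row:
            continue                   # skip blank lines
        probe_id, gene = row[0], row[1]
        probes.setdefault(gene, []).append(probe_id)
        per_column = columns.setdefault(gene, [[] for _ in range(n_cols)])
        for i, val in enumerate(row[2:2 + n_cols]):
            per_column[i].append(float(val))

    # Same loop for both outputs; only the statistic differs
    for out, func in ((mean_out, statistics.mean),
                      (median_out, statistics.median)):
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(headers)
        for gene in sorted(columns):
            stats = [func(values) for values in columns[gene]]
            writer.writerow(["; ".join(probes[gene]), gene] + stats)

# Invented sample data, held in memory instead of real files
infile = io.StringIO(
    "Probe Set ID,Gene Symbol,Sample 1,Sample 2\n"
    "1001_at,AGPAT9,1,4\n"
    "1002_at,AGPAT9,2,4\n"
    "1003_at,AGPAT9,6,10\n"
    "1004_at,ZZZ3,2,4\n"
)
mean_out, median_out = io.StringIO(), io.StringIO()
summarise(infile, mean_out, median_out)
```

For the command line version you would open the three file names from sys.argv and pass the file objects in place of the StringIO objects. Empty or non-numeric cells would make float() raise an error, so real data may need extra handling there.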


3. For calculating the median, I will refer you to Stack Overflow, which is an amazing resource: http://stackoverflow.com/questions/10482339/how-to-find-median .
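
If you would rather write it yourself, a median function can be quite short. This is a minimal sketch of the usual sort-and-pick-the-middle approach, assuming the list is non-empty and contains only numbers; it is not taken from the linked answer.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)          # sorted() leaves the caller's list alone
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the middle element is the median
        return ordered[middle]
    # Even count: average the two middle elements
    return (ordered[middle - 1] + ordered[middle]) / 2.0
```

For example, median([3, 1, 2]) gives 2 and median([1, 3]) gives 2.0. On Python 3.4 and later the standard library's statistics.median does the same job.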

4. This is just a style tip - don't name your dictionary "dict".  dict is the name of a Python built-in (you call dict() to create a dictionary), and by reusing the name you inadvertently make the built-in impossible to call in your program.  If you really want to use "dict", use "dict_".  Anybody with experience in Python will know why you did that.  (The same thing happens with the built-in list().)
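
Here is a small illustration of what goes wrong. The two function names are made up for the demonstration; the first reuses the name "dict" for a variable and the second uses "dict_" instead.

```python
def shadowed():
    dict = {}                  # this local name now hides the built-in
    try:
        return dict(a=1)       # tries to call our empty dictionary
    except TypeError as exc:
        return "TypeError: %s" % exc

def not_shadowed():
    dict_ = {"b": 2}           # trailing underscore leaves the built-in alone
    return dict(a=1, **dict_)  # the built-in dict() still works

shadowed_result = shadowed()
working_result = not_shadowed()
```

The first function ends up reporting a TypeError because it is trying to call a dictionary object, while the second one builds a new dictionary as expected.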


If you have more questions, feel free to ask!

Jerry


