Question on GEO data discussion

Atray Dixit

Mar 10, 2017, 2:00:31 PM

to pertu...@googlegroups.com

Below is a brief exchange re: the GEO data that others may find helpful:

Hello, I am trying to get a quick overview of the data available at GEO. A couple of questions:

(1) What does this file contain: GSM2396856_dc_3hr.mtx.txt.gz?

%%MatrixMarket matrix coordinate real general

17515 33063 44685757

1 1 1

1 14 2

1 16 1

1 22 1

1 25 1

1 33 1

(2) What is the difference between the lenient and strict assignment of guides?

One more question: how did you process the count matrix to get the expression data used to fit the linear model? I couldn’t find it in the methods section.

In particular:

- Did you apply some kind of library size normalisation?

- Variance stabilising transformation such as log transform?

- Did you remove genes with low variance?

Thank you and great work!

Ricard.

Atray Dixit

Mar 10, 2017, 2:00:46 PM

to pertu...@googlegroups.com

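(1) The file you refer to is in Matrix Market format and can be read with a function like scipy's mmread. After the banner line, the first line gives the number of rows, the number of columns, and the number of non-zero entries. On every line after that, the first two numbers give the (1-based) row and column indices and the last number gives the number of UMIs observed. For large sparse matrices, this format generally loads into memory faster.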
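To make the layout concrete, here is a minimal sketch that parses the lines quoted in the question with plain Python. The helper name parse_matrix_market_text is made up for illustration, and it only handles the coordinate layout shown above; in practice scipy's mmread does this for you. The last two lines are back-of-the-envelope arithmetic, assuming a dense copy would be stored as 8-byte floats.

```python
def parse_matrix_market_text(text):
    """Parse a coordinate-format Matrix Market snippet (hypothetical helper).

    Returns (n_rows, n_cols, n_entries, entries), where entries is a list of
    (row, col, value) tuples with 1-based indices, exactly as stored.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Lines starting with '%' are the banner or comments.
    data = [ln for ln in lines if not ln.startswith("%")]
    # Size line: rows, columns, number of stored (non-zero) entries.
    n_rows, n_cols, n_entries = (int(x) for x in data[0].split())
    entries = []
    for ln in data[1:]:
        row, col, value = ln.split()
        entries.append((int(row), int(col), float(value)))
    return n_rows, n_cols, n_entries, entries


# The first few lines of the file, copied from the question.
snippet = """%%MatrixMarket matrix coordinate real general
17515 33063 44685757
1 1 1
1 14 2
1 16 1
"""

n_rows, n_cols, n_entries, entries = parse_matrix_market_text(snippet)
density = n_entries / (n_rows * n_cols)  # fraction of cells that are non-zero
dense_bytes = n_rows * n_cols * 8        # size if held densely as float64
print(n_rows, n_cols, n_entries)
print(f"density: {density:.3f}, dense size: {dense_bytes / 1e9:.1f} GB")
```

So the file describes a 17515 x 33063 matrix with 44685757 stored entries, and the second entry line says that row 1, column 14 has 2 UMIs. If the file follows the usual 10x layout, rows are genes and columns are cell barcodes. Roughly 8% of the cells are non-zero, which is why a dense copy of the full matrix would take several gigabytes.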

A function like the following will be able to read it, provided the matrix file is named matrix.mtx (as the code expects) and the genes.tsv and barcodes.tsv files are in the same directory.

import csv
import os

import pandas as pd
import scipy.io

def read_10x(pathin):
    """Return a pandas DataFrame (genes x cell barcodes) for a 10x dataset."""
    mat = scipy.io.mmread(os.path.join(pathin, "matrix.mtx"))

    # genes.tsv: column 0 is the gene ID, column 1 is the gene name
    with open(os.path.join(pathin, "genes.tsv")) as f:
        gene_final = [row[0] + "_" + row[1] for row in csv.reader(f, delimiter="\t")]

    # barcodes.tsv: keep the first 14 characters of each barcode
    with open(os.path.join(pathin, "barcodes.tsv")) as f:
        barcodes = [row[0][0:14] for row in csv.reader(f, delimiter="\t")]

    # Note: toarray() builds a dense matrix, which can need a lot of memory
    DGE = pd.DataFrame(mat.toarray())
    DGE.index = gene_final
    DGE.columns = barcodes
    return DGE

(2) The difference between lenient and strict assignment has to do with how stringently molecules suspected of being PCR chimeras were removed before guides were assigned to cells. The threshold on the percentage of reads that a given guide accounted for in a particular cell was also slightly stricter in the strict assignment. See supplementary Figures S1 and S2.

As for processing: the count matrix is log transformed. Library size (which we sometimes call cell complexity or transcripts detected) is added as a covariate to the model rather than used for normalisation, and we removed genes that were essentially zero across all cells.
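For readers who want to try something similar, here is a rough sketch of those three steps using numpy. It is not the code used for the paper, and several details are assumptions that the thread does not specify: the log(1 + x) pseudocount, library size taken as total UMIs per cell, the min_cells cutoff for "essentially zero" genes, the 0/1 guide indicator matrix, and plain least squares standing in for whatever fitting procedure the authors actually used. The function names are made up for illustration.

```python
import numpy as np


def preprocess_counts(counts, min_cells=1):
    """Filter and log-transform a genes x cells UMI count matrix.

    Returns (log_expr, library_size, kept), where `kept` is a boolean
    mask over the original genes.
    """
    counts = np.asarray(counts, dtype=float)
    # Library size per cell (assumption: total UMIs, taken before filtering).
    library_size = counts.sum(axis=0)
    # Keep genes detected in at least `min_cells` cells (hypothetical cutoff).
    kept = (counts > 0).sum(axis=1) >= min_cells
    # log(1 + x) keeps zero counts finite; the pseudocount is an assumption.
    log_expr = np.log1p(counts[kept])
    return log_expr, library_size, kept


def fit_linear_model(log_expr, guides, library_size):
    """Least-squares fit of expression on guides plus a library-size covariate.

    guides: cells x guides 0/1 indicator matrix (hypothetical input).
    Returns coefficients with shape (n_guides + 2, n_genes); the last two
    rows are the library-size and intercept terms.
    """
    n_cells = log_expr.shape[1]
    # Library size enters as a covariate column; whether to log or scale
    # it is a modelling choice that the thread does not state.
    X = np.column_stack([guides, library_size, np.ones(n_cells)])
    beta, *_ = np.linalg.lstsq(X, log_expr.T, rcond=None)
    return beta


# Toy data: 3 genes x 4 cells, one guide present in cells 0 and 2.
toy_counts = [[0, 0, 0, 0],
              [1, 2, 0, 3],
              [5, 0, 4, 1]]
toy_guides = np.array([[1], [0], [1], [0]])

log_expr, library_size, kept = preprocess_counts(toy_counts)
beta = fit_linear_model(log_expr, toy_guides, library_size)
print(kept, beta.shape)
```

In the toy data the first gene is zero in every cell, so it is dropped before the fit, and the remaining two genes each get one coefficient per guide plus the covariate and intercept terms.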

Thanks for your interest,