(1) The file you refer to is in Matrix Market format and can be read with a function like SciPy's
mmread. On each data line, the first two numbers give the row and column indices and the last number gives the number of UMIs observed. For large sparse matrices, this format generally loads into memory faster.
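To make the layout concrete, here is a minimal sketch that writes a tiny Matrix Market file and reads it back with mmread. The 3 x 2 matrix and its counts are made up for illustration; they are not from the dataset.

```python
import os
import tempfile

import scipy.io

# Made-up example: 3 genes x 2 cells with 3 nonzero entries.
# After the header line and the size line (rows, columns, nonzeros),
# each line is: row index, column index, UMI count (indices are 1-based).
mtx_text = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "3 2 3\n"
    "1 1 5\n"
    "3 1 2\n"
    "2 2 7\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "matrix.mtx")
    with open(path, "w") as f:
        f.write(mtx_text)
    mat = scipy.io.mmread(path)  # returns a sparse matrix in COO format

dense = mat.toarray()  # rows are genes, columns are cells
print(dense)
```

Only the three nonzero entries are stored in the file; every other position is implicitly zero.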
A function like the following will be able to read it if you also include the genes.tsv and barcodes.tsv files in the same directory.
import csv
import os

import pandas as pd
import scipy.io


def read_10x(pathin):
    """Return a pandas DataFrame (genes x cells) containing a 10x dataset."""
    mat = scipy.io.mmread(os.path.join(pathin, "matrix.mtx"))
    # genes.tsv: column 0 is the gene ID, column 1 is the gene name
    with open(os.path.join(pathin, "genes.tsv")) as f:
        gene_final = [row[0] + "_" + row[1] for row in csv.reader(f, delimiter="\t")]
    # barcodes.tsv: keep the first 14 characters of each barcode
    with open(os.path.join(pathin, "barcodes.tsv")) as f:
        barcodes = [row[0][0:14] for row in csv.reader(f, delimiter="\t")]
    # Convert the sparse matrix to a dense array and label rows and columns
    DGE = pd.DataFrame(mat.toarray(), index=gene_final, columns=barcodes)
    return DGE
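If the dense array is too large for memory, one option is to keep the counts sparse. The following is only a sketch: read_10x_sparse is a hypothetical name, and it assumes the same three files and the same 14-character barcode trimming as the function above. It relies on pandas' DataFrame.sparse.from_spmatrix.

```python
import csv
import os

import pandas as pd
import scipy.io


def read_10x_sparse(pathin):
    """Hypothetical variant of read_10x that returns a sparse-backed DataFrame."""
    mat = scipy.io.mmread(os.path.join(pathin, "matrix.mtx"))
    with open(os.path.join(pathin, "genes.tsv")) as f:
        genes = [row[0] + "_" + row[1] for row in csv.reader(f, delimiter="\t")]
    with open(os.path.join(pathin, "barcodes.tsv")) as f:
        barcodes = [row[0][0:14] for row in csv.reader(f, delimiter="\t")]
    # from_spmatrix stores only the nonzero entries of each column
    return pd.DataFrame.sparse.from_spmatrix(
        mat.tocsc(), index=genes, columns=barcodes
    )
```

Most label-based indexing works the same way on the result, and calling .sparse.to_dense() on a subset recovers an ordinary DataFrame when needed.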
(2) The difference between lenient and strict assignment has to do with how stringently molecules suspected of being PCR chimeras were removed before guides were assigned to cells. The threshold on the percentage of reads in a cell that had to come from a given guide was also slightly stricter. See supplementary Figures S1 and S2.
The count matrix is log-transformed. Library size (which we sometimes call cell complexity or transcripts detected) is added as a covariate to the model, and genes that were essentially zero across all cells are removed.
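As a rough illustration of those preprocessing steps, here is a sketch, not the exact pipeline. The function name preprocess_counts, the min_total threshold, and the log2(1 + x) transform are all assumptions; the exact cutoff, log base, and pseudocount are not stated here.

```python
import numpy as np
import pandas as pd


def preprocess_counts(dge, min_total=1):
    """Hypothetical helper: filter genes, log-transform, compute library size.

    dge is a genes x cells DataFrame of raw UMI counts.
    """
    # Library size: total UMIs per cell, computed on the raw counts
    library_size = dge.sum(axis=0)
    # Drop genes whose total count across all cells is below min_total
    # (the threshold is an assumption, standing in for "essentially zero")
    kept = dge.loc[dge.sum(axis=1) >= min_total]
    # log2(1 + x) is an assumed transform; base and pseudocount may differ
    logged = np.log2(kept + 1)
    return logged, library_size


# Made-up counts: gene g2 is zero in every cell and gets removed
dge = pd.DataFrame(
    [[5, 0], [0, 0], [1, 3]], index=["g1", "g2", "g3"], columns=["c1", "c2"]
)
logged, library_size = preprocess_counts(dge)
```

The returned library_size would then enter the model as one covariate per cell, alongside whatever other covariates are used.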
Thanks for your interest,