cbc_gbc_dicts, batches

jan.gl...@gmail.com

May 15, 2017, 4:35:48 AM

to Perturb-seq

Hi Atray,

Inspecting the cell/guide mappings on GEO (the *_cbc_gbc_dict.csv.gz files), it appears to me that each cell contains only a single guide, in contrast to e.g. Fig. 1D of the paper.

Is this correct, and how are these files generated?

Another question I did not find answered anywhere is about batches. From the mimosca.py file I understand that `pt214d_A6` in `AAACATACACCTGA_pt214d_A6` represents some kind of batch. Are these only sequencing batches, or are they batches at the cell culture level? In other words, do equal cell barcodes from different batches correspond to the same cell or to different cells?

Thank you for taking the time to answer questions here,

kind regards,

Jan

Atray Dixit

May 15, 2017, 2:06:48 PM

to Perturb-seq

Hi Jan,

The *_cbc_gbc_dict.csv.gz files contain a mapping from each guide to the cells that have it. If you create the reciprocal mapping (for each cell, which guides it has), you should see some cells with 0 guides and others with more than 1 guide. Let me know if you still have trouble getting the inverse mapping, and I can share a function that might help.
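A minimal sketch of that inversion, for illustration only (this is not the function from the MIMOSCA repository). It assumes each row of a dict file has the form `guide,cell1,cell2,...`; that layout, the guide names, the helper names, and every cell ID other than `AAACATACACCTGA_pt214d_A6` are hypothetical toy values. Cells with 0 guides never appear in the dict files, so they only show up once the inverse mapping is compared with a full cell list, for example the columns of the expression matrix.

```python
from collections import defaultdict


def parse_dict_lines(lines):
    """Parse rows assumed to look like 'guide,cell1,cell2,...' (assumed layout)."""
    guide_to_cells = {}
    for line in lines:
        fields = [f.strip() for f in line.strip().split(",") if f.strip()]
        if fields:
            # First field is taken as the guide, the rest as cell IDs.
            guide_to_cells.setdefault(fields[0], []).extend(fields[1:])
    return guide_to_cells


def invert_guide_to_cells(guide_to_cells):
    """Turn {guide: [cell, ...]} into {cell: sorted list of guides}."""
    cell_to_guides = defaultdict(set)
    for guide, cells in guide_to_cells.items():
        for cell in cells:
            cell_to_guides[cell].add(guide)
    return {cell: sorted(guides) for cell, guides in cell_to_guides.items()}


# Toy input standing in for the contents of a decompressed dict file.
example_lines = [
    "guide_A,AAACATACACCTGA_pt214d_A6,AAACATTGAGCTAC_pt214d_A6",
    "guide_B,AAACATACACCTGA_pt214d_A6",
]
cell_to_guides = invert_guide_to_cells(parse_dict_lines(example_lines))

# Hypothetical full cell list; the last cell has no guide assigned.
all_cells = [
    "AAACATACACCTGA_pt214d_A6",
    "AAACATTGAGCTAC_pt214d_A6",
    "AAACCGTGCTTCCG_pt214d_A6",
]
n_guides = {cell: len(cell_to_guides.get(cell, [])) for cell in all_cells}
```

Counting the values of `n_guides` then gives the distribution of guides per cell, including the 0-guide cells.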

The batches are 10x channel batches. The cells are cultured together and then split across channels for 10x scRNA-seq. Cells with the same barcode from different batches most definitely correspond to different cells (the barcode complexity of the 10x v1 chemistry was such that you expect barcode collisions between different channels at a low but nonzero rate).
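In practice this means the full ID (barcode plus batch suffix) identifies a cell, not the barcode alone. A small sketch, assuming IDs always have the form `<barcode>_<batch>` as in `AAACATACACCTGA_pt214d_A6`; the helper name and the second batch label `pt214d_B6` are made up for illustration.

```python
from collections import defaultdict


def split_cell_id(cell_id):
    """Split 'BARCODE_batch' at the first underscore into (barcode, batch)."""
    barcode, _, batch = cell_id.partition("_")
    return barcode, batch


# Toy IDs: the same barcode observed in two different channels.
cell_ids = [
    "AAACATACACCTGA_pt214d_A6",
    "AAACATACACCTGA_pt214d_B6",  # hypothetical second batch label
]

batches_by_barcode = defaultdict(list)
for cell_id in cell_ids:
    barcode, batch = split_cell_id(cell_id)
    batches_by_barcode[barcode].append(batch)

# Barcodes seen in more than one batch are cross-channel collisions,
# i.e. distinct cells that happen to share a barcode sequence.
collisions = {bc: b for bc, b in batches_by_barcode.items() if len(b) > 1}
```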

Hope that helps. I'll respond to your GitHub inquiry later tomorrow (in brief: I forgot to upload the file; I will comment it carefully and upload it).

Best,

Atray

jan.gl...@gmail.com

May 15, 2017, 2:53:16 PM

to Perturb-seq

Hi Atray,

Thanks for the clarifications, they were very helpful already.

cbc_gbc mapping: I just realized that the problem I described (at most 1 distinct guide per cell) is only present in the dc_0hr, k562_tf_7, k562_tf_13 and k562_ccycle files, i.e. the dict files without the _lenient/_strict suffixes. In the files with a suffix I see up to 10 guides per cell in the inverse mapping. Perhaps there is a difference in how they are generated or saved? I am very much looking forward to the related notebook.

batches: Thanks, that clears things up for me. (On a side note, not requiring an answer: the cross-batch collision rate is ~0.5% for k562_tf_13. Can the within-batch collision rate be expected to be the same?)

all the best,

Jan

Atray Dixit

May 15, 2017, 3:49:45 PM

to Perturb-seq

Re: cbc_gbc mapping: interesting, they might indeed be subsetted to the cells with only 1 guide. It might take me some time to dig up the original files and post them, but I'll look into it.

Re: batches: yes, that is consistent with our empirical observations and with what the 10x Genomics support team had predicted. There are approximately 730,000 distinct barcodes, and you can think of each channel as a Poisson sampling of the barcodes. The more cells that are sampled, the higher the collision rate will be. A single channel holds roughly half as many cells as two channels combined, so the collision rate within a channel will be lower.
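A back-of-the-envelope sketch of that argument, assuming every cell draws its barcode uniformly and independently from the roughly 730,000 available barcodes. Under that assumption, the chance that a given cell shares its barcode with at least one of the other n - 1 cells is approximately 1 - exp(-(n - 1) / 730000). The function name and the cell count of 3,000 per channel are illustrative assumptions, not numbers from the datasets.

```python
import math

N_BARCODES = 730_000  # approximate figure quoted above


def collision_rate(n_cells, n_barcodes=N_BARCODES):
    """Expected fraction of cells sharing a barcode with at least one other cell.

    Poisson approximation to uniform, independent barcode sampling:
    1 - exp(-(n_cells - 1) / n_barcodes).
    """
    if n_cells < 1:
        raise ValueError("n_cells must be at least 1")
    return -math.expm1(-(n_cells - 1) / n_barcodes)


cells_per_channel = 3_000  # illustrative value only

within_one_channel = collision_rate(cells_per_channel)
across_two_channels = collision_rate(2 * cells_per_channel)

print(f"one channel:  {within_one_channel:.4%}")
print(f"two channels: {across_two_channels:.4%}")
```

For cell counts far below the number of barcodes the rate grows almost linearly with the number of cells, so pooling two equally sized channels roughly doubles it.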

Thanks,

Atray

jan.gl...@gmail.com

May 21, 2017, 3:27:46 PM

to Perturb-seq

Hi Atray,

Thank you again for your response.

It would be wonderful if you could upload these files at some point - I would like to use them and am afraid I would have a really hard time reproducing them from the raw reads - even with the dict_maker script.

all the best,

Jan

Atray Dixit

Jun 4, 2017, 12:23:37 PM

to Perturb-seq

I put them up on the github repo:

https://github.com/asncd/MIMOSCA/tree/master/GBC_CBC_pairing/gbc_cbc_dicts

Let me know if these are OK. I believe there might be some cases in which the dictionary contains more cells than the corresponding expression matrix, because of some filtering.

jan.gl...@gmail.com

Jun 8, 2017, 7:21:02 PM

to Perturb-seq

Thanks a lot for digging into your old files, Atray!

I found that "promoters_concat_all.csv" is a superset of the mappings in "GSM2396858_k562_tfs_7_cbc_gbc_dict.csv.gz" (which apparently also contains cells with more than one guide, but lists only one guide mapping for each), while "pt2_concat_all.csv" contains mappings for a superset of the cells in "GSM2396859_k562_tfs_13_cbc_gbc_dict.csv.gz", including cells with more than one guide.

"dc_0hr_concat_all.csv" and "ph_concat_all.csv" seem to be the same as "GSM2396857_dc_0hr_cbc_gbc_dict_strict.csv" and "GSM2396860_k562_tfs_highmoi_cbc_gbc_dict_lenient.csv", respectively, which already contain multiple mappings per cell.

Thus, only a multi-guide guide_cell_dict for the cell cycle dataset (GSM2396861_k562_ccycle_cbc_gbc_dict.csv) is missing. Could you check if you have this one as well?

Thank you again, this helps me a lot.

best,

Jan

Atray Dixit

Jul 31, 2017, 10:44:57 PM

to Perturb-seq

Hi Jan,

I'm so sorry for forgetting to reply to this post. I've attached a dictionary file that should contain the multi-guide mappings.

Please let me know if you have any more questions.

Best,

Atray

ccycle_concat_all.csv

jan.gl...@gmail.com

Aug 1, 2017, 4:16:43 AM

to Perturb-seq

Thank you, Atray!

Don't worry about the delay, I was on vacation anyway :).

best, Jan
