(sc)RNA-Seq data for intro course


lauren...@gmail.com

Feb 24, 2021, 4:19:02 PM
to bioconductor-teaching
Dear all,

I have looked more closely at the snRNASeq and RNASeq datasets with the goal of formatting them for the intro course (i.e. making them similar in structure to the ecology surveys data).

- As already discussed, the wide format (genes by samples, with the row data added via cbind()) isn't appropriate at all.

- I initially thought that taking a subset of genes, pivoting that into a long format and joining it with the colData would work, but it doesn't at all. That data fails at the very first plotting exercise, where we want to produce a scatter plot of one gene against another. This is only possible if we pivot the data back into a wide format, to recover genes as columns. Even though pivot_wider() and pivot_longer() are introduced before the plotting chapter, I feel this leads to cognitive overload for students, who would need to understand pivoting in order to re-format the data they have familiarised themselves with before being able to learn ggplot2.

- What would work is to start from the assay for a handful of genes, transpose it to have the n samples along the rows and the sample names plus the handful of genes as columns, and then join this with the colData. I think this would exactly match what we need to re-use the flow of the Data Carpentry lesson.

- Unfortunately, this approach generates datasets with 45 (bulk) or 384 (scRNA) rows, which is too small to really make a strong case for using R for data analysis. Compare this against the 30K+ rows of the surveys data. 
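
For reference, a minimal sketch of the two layouts discussed in the points above, on a toy expression matrix. The gene names, sample names, values and the `time` column are all made up for illustration; only the transpose-and-join idea and the use of pivot_longer()/pivot_wider() come from the discussion.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy assay: 3 genes (rows) x 4 samples (columns); values are made up.
assay <- matrix(
  c(5, 3, 8, 1,
    2, 7, 4, 6,
    9, 0, 3, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:4))
)

# Toy colData: one row of metadata per sample (hypothetical columns).
col_data <- tibble(sample = paste0("s", 1:4),
                   time   = c(0, 0, 4, 4))

# Proposed layout: transpose so samples are rows and genes are columns,
# then join with the sample metadata.
wide <- as.data.frame(t(assay)) %>%
  rownames_to_column("sample") %>%
  as_tibble() %>%
  left_join(col_data, by = "sample")

# Long layout: one row per sample/gene pair.
long <- wide %>%
  pivot_longer(starts_with("gene"),
               names_to = "gene", values_to = "expression")

# A gene-vs-gene scatter plot needs genes as columns again,
# so the long table has to be pivoted back first.
wide_again <- long %>%
  pivot_wider(names_from = gene, values_from = expression)
```

With `wide`, a call such as ggplot(wide, aes(geneA, geneB)) + geom_point() works directly, whereas `long` needs the pivot_wider() step first.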

Are you aware of a (sc)RNA-Seq dataset with 10K+ samples, ideally with a time component? 

Best wishes,

Laurent

Drnevich, Jenny

Feb 24, 2021, 4:56:19 PM
to lauren...@gmail.com, bioconductor-teaching

I’ve asked Aaron Lun on the singlecell-queries Slack. He was looking for data sets for his data package.

 

Jenny


Drnevich, Jenny

Feb 24, 2021, 11:41:28 PM
to Drnevich, Jenny, lauren...@gmail.com, bioconductor-teaching

Aaron pointed me to a promising data set: scRNAseq::BachMammaryData(). It has about 25K cells in total from 4 mammary developmental stages, with 2 replicates each! I made an attempt at reading in the data, computing log-normalized counts with logNormCounts(), pulling 13 genes that I picked from skimming the paper (https://www.nature.com/articles/s41467-017-02001-5), transposing the normalized values to cells × genes and then adding on the colData. The result has all 25,806 cell barcodes from Cell Ranger; the paper clearly lays out the extra cell filtering thresholds, but they only ended up filtering out ~800 cells, so I did not bother to apply them. I also have not attempted any plotting to see whether these genes are suitable – time for me to go to bed.
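
The steps described above might look roughly like this. It is only a sketch under stated assumptions, not the script in the repository: the helper name sce_to_cell_table() is hypothetical, it assumes that gene symbols are stored in a `Symbol` column of the rowData, and the genes in the usage comment are placeholders rather than the 13 genes actually chosen.

```r
library(SingleCellExperiment)
library(scuttle)

# Hypothetical helper: log-normalise, pull a few genes, transpose to
# cells x genes and bind on the per-cell metadata.
sce_to_cell_table <- function(sce, genes, symbol_col = "Symbol") {
  # Adds a "logcounts" assay (and a sizeFactor column in the colData).
  sce <- logNormCounts(sce)

  # Locate the requested genes by symbol (assumed rowData column).
  idx <- match(genes, rowData(sce)[[symbol_col]])
  if (anyNA(idx)) {
    stop("Genes not found: ", paste(genes[is.na(idx)], collapse = ", "))
  }

  # Subset first, then densify and transpose: cells become rows.
  expr <- t(as.matrix(logcounts(sce)[idx, , drop = FALSE]))
  colnames(expr) <- genes

  # One row per cell: colData columns followed by one column per gene.
  tibble::as_tibble(cbind(as.data.frame(colData(sce)),
                          as.data.frame(expr)))
}

# Usage (downloads the data on first call; gene names are placeholders):
# sce   <- scRNAseq::BachMammaryData()
# cells <- sce_to_cell_table(sce, c("Csn2", "Krt18"))
```

Subsetting to the selected genes before calling as.matrix() keeps the dense matrix small, which matters with roughly 25K cells.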

 

Knitting takes about 5 min for me, mostly due to the normalization, but downloading the data the first time could add more. I’m fairly old-school with base R coding, so feel free to update to “tidyr” ways. I did put it in a tibble at the end! The final .csv and .rds are < 1500 KB, an easy size for downloading.

 

I pushed everything to https://github.com/Bioconductor/bioconductor-teaching/tree/master/data/BachMammaryData

 

Goodnight,

Jenny

Charlotte Soneson

Mar 5, 2021, 3:33:22 AM
to bioconductor-teaching
Hi all,

quick reminder about the Bioconductor teaching committee meeting on Monday, March 8, 2pm CET. Jitsi link as usual: https://meet.jit.si/BioconductorTeaching. Based on the discussion below, I think that the data set selection for the lessons would benefit from some more discussion in this meeting. If you have additional topics, please feel free to add them to the agenda.

See you on Monday!
Charlotte

Laurent Gatto

Apr 11, 2021, 3:17:15 PM
to Drnevich, Jenny, bioconductor-teaching
Dear all, 

I think this data set would be a good fit for the intro course. I pushed a short script, rnaseq_dplyr_ggplot.R, that summarises what the dplyr and ggplot chapters would essentially contain. I'll get started with the development of the actual material using that data. 

Best wishes,

Laurent

Drnevich, Jenny

Apr 15, 2021, 10:59:58 AM
to Laurent Gatto, bioconductor-teaching

Good, I am glad that this data set fits our requirements. I just found that one sample from the data set has its own chapter in OSCA: http://bioconductor.org/books/release/OSCA/bach-mouse-mammary-gland-10x-genomics.html. I’m not sure we need to do the QC filtering that I skipped before, but the chapter has code to calculate the mitochondrial (MT) percentage and discard cells with a high MT percentage.

 

Jenny
