Fang made the libraries from four different experimental conditions, two replicates each, using either the UDI PCR primers or our old single-indexing PCR primers. Here's how the conditions are labeled in the analysis:
High: 10 ng Universal Human Reference RNA
Low: 100 pg Universal Human Reference RNA
IDC: invasive ductal carcinoma, FFPE LCM, ~500 cells
Lym: lymphocytes, FFPE LCM, ~500 cells
This is a mix of 1/4 good, 1/4 bad, and 1/2 borderline libraries, so the overall metrics of the sequencing run were unsurprisingly worse than usual. However, all the libraries were sequenced in the same run (the UDIs are 2x8 nt while the old 1x6 indices just had a bunch of extra bases that should always be the same), so it should be safe to compare them with one another. In general all the QC metrics (attached) look very similar between the new UDIs and the old single indices, but maybe slightly worse with the UDIs.
It's not surprising that the UDIs might perform a little worse. With limiting input material and damaged RNA, we have problems filtering out PCR primer dimers, and UDIs will produce larger dimers that are harder to eliminate by size-selection (UDI_diagram.pdf). So there's a little bit of a tradeoff there.
We can look at the other side of the tradeoff with a simple simulation: count the reads from this experiment that would be assigned to all 48 of the old indices (using only the first 6 nt of the i7 read) and all 96 of the new indices (using only the 8 nt i7 read), including the ones that weren't actually used in any library. Then we can assume all the hits to unused indices are spurious. What we see (collision_analysis.pdf) is that there are a few specific index sequences that attract a lot of spurious reads, and they tend to be A-rich or perhaps T-rich.
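A rough sketch of how that kind of count could be done, for anyone who wants to repeat it. This is not the script used for the attached analysis: it assumes exact matching on the index prefix, and the function names, the "toy_used" index and the toy reads are all made up for illustration (only I28 = CAAAAG comes from this thread).

```python
from collections import Counter

def count_index_hits(index_reads, indices, prefix_len):
    """Count index reads whose first prefix_len bases exactly match an index.

    index_reads: iterable of i7 index-read strings
    indices: dict mapping index name -> index sequence (length prefix_len)
    Exact matching is an assumption of this sketch.
    """
    name_by_seq = {seq: name for name, seq in indices.items()}
    hits = Counter({name: 0 for name in indices})  # keep zero-hit indices visible
    for read in index_reads:
        name = name_by_seq.get(read[:prefix_len])
        if name is not None:
            hits[name] += 1
    return hits

def spurious_hits(hits, used):
    """Hits to indices that were not used in any library are treated as spurious."""
    return {name: n for name, n in hits.items() if name not in used}

# Toy data, invented for illustration only.
indices = {"I28": "CAAAAG", "toy_used": "GATTAC"}
reads = ["CAAAAGTT", "GATTACAA", "CAAAAGAA", "AAAAAAAA"]
hits = count_index_hits(reads, indices, prefix_len=6)
print(spurious_hits(hits, used={"toy_used"}))
```

For the old indices prefix_len would be 6 and for the UDIs it would be 8, following the description above.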
It looks like the main artifact here is that a lot of clusters get poly(A) in their i7 reads:
AAAAAA 13115740
AAAAAT 1923848
ATAAAA 1586582
TAAAAA 1571272
AATAAA 1006144
TTATAA 773607
TTAAAA 763349
ATAAAT 760293
TTTATA 749611
AAATAA 718130
AATTAA 642070
ACATTT 640218
AAAATA 635863
ATAACT 518620
CAAAAA 509183
For example, CAAAAA is common and it's only one base error away from index 28, CAAAAG, so if we'd used I28 for a library it would have had all those reads misassigned to it. I'm not sure what molecular mechanism causes these homopolymer index reads, but curiously it doesn't seem to affect the i5 index read (base_composition.png). In the previous analyses we could see that the library with a lot more hits than the others, Lym3, had a dubious old-style index, TATAAT, but it didn't have vastly lower alignability than its replicates (in fact it was slightly higher). This suggests that the clusters with homopolymer i7 reads aren't necessarily from bad molecules that won't align anyway; it's a worse problem because reads assigned to the wrong library might just as frequently align to the genome and get counted in downstream analysis.
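To make the one-base-error point concrete, here is a small sketch that sums the listed counts falling within one mismatch of a candidate index. It assumes the demultiplexer tolerates a single mismatch, which is my assumption here rather than a stated setting; the function names are hypothetical, and the counts are a subset copied from the list above.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def reads_at_risk(index_seq, observed_counts, max_mismatches=1):
    """Total observed reads within max_mismatches of index_seq."""
    return sum(
        n for seq, n in observed_counts.items()
        if hamming(seq, index_seq) <= max_mismatches
    )

# Subset of the i7 counts listed above.
observed = {
    "AAAAAA": 13115740,
    "AAAAAT": 1923848,
    "TAAAAA": 1571272,
    "CAAAAA": 509183,
}

# I28 = CAAAAG: only CAAAAA is within one mismatch in this subset.
print(reads_at_risk("CAAAAG", observed))  # 509183
```

Running the same check over every candidate index against the full table of observed i7 sequences would show which indices sit next to a heavily populated artifact sequence.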