Moving to i5-indexing


Adam Passman

Apr 12, 2021, 8:30:17 AM
to Smart-3SEQ
Hi

I've been using 8 nt i7 indexes with the P5 universal primer and have purchased around 70 indexes. I noticed you use i5 indexes with the P7 universal primer, particularly for FFPE LCM, because of sequencing artefacts, and you evidently see an improvement from this change.

I'm just wondering about the cost/benefit here. In your opinion, is the improvement worth my purchasing a whole new set of primers? If the change gives just a marginal increase in mappable reads, perhaps it's not worth the switch. What are your thoughts?

Cheers,

Adam 

Joe Foley

Apr 12, 2021, 1:17:53 PM
to smart...@googlegroups.com
What we saw with low-quality libraries (FFPE LCM) was a lot of poly(A) artifacts in the i7 index read, but not in the i5 read. This didn't necessarily manifest as different proportions of alignable reads, but it would bias the number of reads assigned to each library depending on their index sequences (possibly misassigning some reads). It could also disqualify a lot of reads with undetermined i7 indexes that might be rescued if they had i5 indexes. Below is a summary of an experiment comparing the 6 nt i7 indexes with full 8+8 UDIs. Ultimately we decided that i5-only indexing might actually be superior to UDIs, in addition to costing half as much in oligos.

Fang made the libraries from four different experimental conditions, two replicates each, using either the UDI PCR primers or our old single-indexing PCR primers. Here's how the conditions are labeled in the analysis:
High: 10 ng Universal Human Reference RNA
Low: 100 pg Universal Human Reference RNA
IDC: invasive ductal carcinoma, FFPE LCM, ~500 cells
Lym: lymphocytes, FFPE LCM, ~500 cells
This is a mix of 1/4 good, 1/4 bad, and 1/2 borderline libraries, so the overall metrics of the sequencing run were unsurprisingly worse than usual. However, all the libraries were sequenced in the same run (the UDIs are 2x8 nt, while the old 1x6 indices just had a bunch of extra bases that should always be the same), so it should be safe to compare them with one another. In general, all the QC metrics (attached) look very similar between the new UDIs and the old single indices, though maybe slightly worse with the UDIs.

It's not surprising that the UDIs might perform a little worse. With limiting input material and damaged RNA, we have problems filtering out PCR primer dimers, and UDIs will produce larger dimers that are harder to eliminate by size-selection (UDI_diagram.pdf). So there's a little bit of a tradeoff there.

We can look at the other side of the tradeoff with a simple simulation: count the reads from this experiment that would be assigned to all 48 of the old indices (using only the first 6 nt of the i7 read) and all 96 of the new indices (using only the 8 nt i7 read), including the ones that weren't actually used in any library. Then we can assume all the hits to unused indices are spurious. What we see (collision_analysis.pdf) is that there are a few specific index sequences that attract a lot of spurious reads, and they tend to be A-rich or perhaps T-rich.
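The simulation described above could be sketched roughly as follows in Python. This is not the script used for the analysis: the function name `count_index_hits`, the index names, and the toy reads are all hypothetical, and matching here is exact on the first `k` bases, which is an assumption about how reads were assigned.

```python
from collections import Counter

def count_index_hits(index_reads, all_indexes, used_names, k):
    """Tally reads per index, split into used and unused (spurious) indexes.

    index_reads: iterable of i7 index read strings
    all_indexes: dict mapping index name -> index sequence
    used_names:  set of index names actually used in a library
    k:           number of leading bases of the index read to compare
    """
    # Look up each index by its first k bases (exact match only; an assumption).
    seq_to_name = {seq[:k]: name for name, seq in all_indexes.items()}
    hits = Counter()
    for read in index_reads:
        name = seq_to_name.get(read[:k])
        if name is not None:
            hits[name] += 1
    # Hits to indexes that were never used in a library are treated as spurious.
    used = {n: c for n, c in hits.items() if n in used_names}
    spurious = {n: c for n, c in hits.items() if n not in used_names}
    return used, spurious

# Toy data for illustration only (not from the experiment).
toy_indexes = {"used_A": "ACGTAC", "unused_B": "CAAAAG"}
toy_reads = ["ACGTACGG"] * 5 + ["CAAAAGTT"] * 2 + ["AAAAAAAA"] * 9
used, spurious = count_index_hits(toy_reads, toy_indexes, {"used_A"}, k=6)
print(used)      # {'used_A': 5}
print(spurious)  # {'unused_B': 2}
```

With real data, `index_reads` would come from the index FASTQ, and `k` would be 6 for the old indices and 8 for the new ones.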

It looks like the main artifact here is that a lot of clusters get poly(A) in their i7 reads:
AAAAAA  13115740
AAAAAT  1923848
ATAAAA  1586582
TAAAAA  1571272
AATAAA  1006144
TTATAA  773607
TTAAAA  763349
ATAAAT  760293
TTTATA  749611
AAATAA  718130
AATTAA  642070
ACATTT  640218
AAAATA  635863
ATAACT  518620
CAAAAA  509183
For example, CAAAAA is common and it's only one base error away from index 28, CAAAAG, so if we'd used I28 for a library it would have had all those reads misassigned to it. I'm not sure what molecular mechanism causes these homopolymer index reads, but curiously it doesn't seem to affect the i5 index read (base_composition.png). In the previous analyses we could see that the library with a lot more hits than the others, Lym3, had a dubious old-style index, TATAAT, but it didn't have vastly lower alignability than its replicates (in fact it was slightly higher). This suggests that the clusters with homopolymer i7 reads aren't necessarily from bad molecules that won't align anyway; it's a worse problem because reads assigned to the wrong library might just as frequently align to the genome and get counted in downstream analysis.
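To make the CAAAAA example concrete, here is a minimal Python sketch that checks which indexes sit within one mismatch of a frequent artifact sequence. The counts are copied from the table above and I28 = CAAAAG is from the message. The helper names, the second index, and the one-mismatch tolerance are assumptions for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def at_risk(indexes, artifacts, max_dist=1):
    """Sum artifact read counts within max_dist mismatches of each index."""
    risk = {}
    for name, seq in indexes.items():
        total = sum(count for art, count in artifacts.items()
                    if hamming(art, seq) <= max_dist)
        if total:
            risk[name] = total
    return risk

# A few of the most frequent i7 reads from the table above.
artifacts = {"AAAAAA": 13115740, "AAAAAT": 1923848, "CAAAAA": 509183}

# I28 is from the message; "hypothetical_GC" is made up for contrast.
indexes = {"I28": "CAAAAG", "hypothetical_GC": "GCTAGC"}

print(hamming("CAAAAA", "CAAAAG"))  # 1
print(at_risk(indexes, artifacts))  # {'I28': 509183}
```

A check like this could be run over a candidate index set before ordering primers, to avoid sequences that sit close to poly(A).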

base_composition.png
collision_analysis.pdf
UDI_diagram.pdf
insert_lengths.pdf
dedup.pdf
read_category_percent.pdf
read_category_count.pdf