Moving to i5-indexing

Adam Passman

Apr 12, 2021, 8:30:17 AM

to Smart-3SEQ

I've been using 8nt i7 indexes with P5 Universal and have purchased around 70 indexes or so. I noticed you use i5 with P7 universal, particularly for FFPE LCM due to sequencing artefacts and you obviously see improvement from this change.

I'm just wondering about the cost/benefit here. In your opinion, is the improvement worth me purchasing a whole new set of primers? If the change is just a marginal increase in mappable reads, perhaps it's not worth the switch? What are your thoughts?

Cheers,

Adam

Joe Foley

Apr 12, 2021, 1:17:53 PM

to smart...@googlegroups.com

What we saw with low-quality libraries (FFPE LCM) was a lot of poly(A) artifacts in the i7 index read but not in the i5 read. This didn't necessarily show up as a different proportion of alignable reads, but it would bias the number of reads assigned to each library depending on its index sequence (possibly misassigning some reads). It could also disqualify a lot of reads with undetermined i7 indexes that might have been rescued if they had an i5 index. Below is a summary of an experiment comparing the 6 nt i7 indexes with full 8+8 UDIs. Ultimately we decided that i5-only indexing might actually be superior to UDIs, in addition to costing half as much in oligos.

Fang made the libraries from four different experimental conditions, two replicates each, using either the UDI PCR primers or our old single-indexing PCR primers. Here's how the conditions are labeled in the analysis:

High: 10 ng Universal Human Reference RNA
Low: 100 pg Universal Human Reference RNA
IDC: invasive ductal carcinoma, FFPE LCM, ~500 cells
Lym: lymphocytes, FFPE LCM, ~500 cells

This is a mix of 1/4 good, 1/4 bad, and 1/2 borderline libraries, so the overall metrics of the sequencing run were unsurprisingly worse than usual. However, all the libraries were sequenced in the same run (the UDIs are 2x8 nt while the old 1x6 indices just had a bunch of extra bases that should always be the same), so it should be safe to compare them with one another. In general all the QC metrics (attached) look very similar between the new UDIs and the old single indices, but maybe slightly worse with the UDIs.

It's not surprising that the UDIs might perform a little worse. With limiting input material and damaged RNA, we have problems filtering out PCR primer dimers, and UDIs will produce larger dimers that are harder to eliminate by size-selection (UDI_diagram.pdf). So there's a little bit of a tradeoff there.

We can look at the other side of the tradeoff with a simple simulation: count the reads from this experiment that would be assigned to all 48 of the old indices (using only the first 6 nt of the i7 read) and all 96 of the new indices (using only the 8 nt i7 read), including the ones that weren't actually used in any library. Then we can assume all the hits to unused indices are spurious. What we see (collision_analysis.pdf) is that there are a few specific index sequences that attract a lot of spurious reads, and they tend to be A-rich or perhaps T-rich.
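The counting step of that simulation could be sketched roughly as follows. This is only an illustration, not the script used for the analysis: the function names, the toy index list, and the one-mismatch tolerance are all assumptions (only CAAAAG appears in this thread; the other indices are made up).

```python
from collections import Counter

def mismatches(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def assign_index(index_read, indices, max_mismatches=1):
    """Return the one index within max_mismatches of the read's prefix, else None.

    Only the first len(index) bases of the read are compared, so a longer
    index read can be scored against a shorter index set.
    """
    hits = []
    for idx in indices:
        prefix = index_read[:len(idx)]
        if len(prefix) == len(idx) and mismatches(prefix, idx) <= max_mismatches:
            hits.append(idx)
    # Ambiguous reads (more than one candidate) are left unassigned.
    return hits[0] if len(hits) == 1 else None

def spurious_hits(index_reads, all_indices, used_indices, max_mismatches=1):
    """Count reads assigned to indices that were not used in any library."""
    counts = Counter()
    for read in index_reads:
        idx = assign_index(read, all_indices, max_mismatches)
        if idx is not None and idx not in used_indices:
            counts[idx] += 1
    return counts

# Toy data: hypothetical 8 nt i7 reads scored against 6 nt indices.
all_indices = ["CAAAAG", "GGCCTT", "ACGTAC"]   # GGCCTT and ACGTAC are made up
used_indices = {"GGCCTT", "ACGTAC"}            # pretend CAAAAG was never used
reads = ["CAAAAAAT", "CAAAAAGG", "GGCCTTAT", "ACGTACCC", "AAAAAAAA", "GGCCTAAT"]
print(spurious_hits(reads, all_indices, used_indices))
```

Every read counted here is assumed to be spurious, because no library carried that index.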

It looks like the main artifact here is that a lot of clusters get poly(A) in their i7 reads:

AAAAAA 13115740
AAAAAT 1923848
ATAAAA 1586582
TAAAAA 1571272
AATAAA 1006144
TTATAA 773607
TTAAAA 763349
ATAAAT 760293
TTTATA 749611
AAATAA 718130
AATTAA 642070
ACATTT 640218
AAAATA 635863
ATAACT 518620
CAAAAA 509183

For example, CAAAAA is common and it's only one base error away from index 28, CAAAAG, so if we'd used I28 for a library it would have had all those reads misassigned to it. I'm not sure what molecular mechanism causes these homopolymer index reads, but curiously it doesn't seem to affect the i5 index read (base_composition.png). In the previous analyses we could see that the library with a lot more hits than the others, Lym3, had a dubious old-style index, TATAAT, but it didn't have vastly lower alignability than its replicates (in fact it was slightly higher). This suggests that the clusters with homopolymer i7 reads aren't necessarily from bad molecules that won't align anyway; it's a worse problem because reads assigned to the wrong library might just as frequently align to the genome and get counted in downstream analysis.
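To make the CAAAAA/CAAAAG example concrete, here is a small sketch that estimates how many of the common artifact reads would land on a given index if the demultiplexer tolerated one mismatch. The counts are copied from the list of common i7 reads above; the function name `reads_at_risk` and the one-mismatch tolerance are assumptions for illustration, and a real demultiplexer may handle ambiguous reads differently.

```python
def mismatches(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

# Most common 6 nt i7 reads and their counts, as listed in this post.
ARTIFACT_COUNTS = {
    "AAAAAA": 13115740, "AAAAAT": 1923848, "ATAAAA": 1586582,
    "TAAAAA": 1571272, "AATAAA": 1006144, "TTATAA": 773607,
    "TTAAAA": 763349, "ATAAAT": 760293, "TTTATA": 749611,
    "AAATAA": 718130, "AATTAA": 642070, "ACATTT": 640218,
    "AAAATA": 635863, "ATAACT": 518620, "CAAAAA": 509183,
}

def reads_at_risk(index, observed_counts, max_mismatches=1):
    """Sum the counts of observed sequences within max_mismatches of index."""
    return sum(
        n for seq, n in observed_counts.items()
        if len(seq) == len(index) and mismatches(seq, index) <= max_mismatches
    )

# Index 28 (CAAAAG) is one mismatch away from the common artifact CAAAAA.
print(reads_at_risk("CAAAAG", ARTIFACT_COUNTS))
```

Running the same check over a whole index set before ordering oligos would be one way to spot A-rich indices that are likely to attract these reads.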

base_composition.png

collision_analysis.pdf

UDI_diagram.pdf

insert_lengths.pdf

dedup.pdf

read_category_percent.pdf

read_category_count.pdf

OpenPGP_signature
