Data output comparison on Illumina HiSeq 2500 vs 4000: will it cause an issue for my Stacks pipeline?

nicolef...@gmail.com

Nov 26, 2018, 2:00:56 PM
to Stacks
I have just run into a problem: my sequencing facility has switched over to mainly running its Illumina HiSeq 4000 instead of the HiSeq 2500 (which they now run much less often). I have one more set of ddRAD-seq libraries that needs to be sequenced. Most online posts discuss the cost difference and the read number difference between the two platforms (4000: 350 million reads; 2500: 250 million reads), but I am mainly concerned about the downstream analysis and bioinformatics.

I have used STACKS to process my sequences for 2 previous sequencing sets (sequenced on a 2500). I am worried about batch effects between datasets if I run this last set on the Illumina HiSeq 4000. Has anyone else run into this problem?

Is it possible to compare these datasets? Or would I need to completely redo my de-multiplexing for all of the samples and re-run my STACKS pipeline in order to get a comparable dataset with the 2 different sequencing platforms?

Any advice would be greatly appreciated!

Thanks,

Nicole

Catchen, Julian

Nov 27, 2018, 4:16:29 PM
to stacks...@googlegroups.com, nicolef...@gmail.com
RADseq data should not be that susceptible to batch effects compared to,
say, RNAseq data (where individual read counts matter). For RAD, the
read counts only matter in terms of how many copies of a particular
allele were sequenced on the platform at each RAD locus, as these allele
counts will be fed into the SNP caller and considered collectively. If
you have very low coverage, you might see some alleles occur on one
sequencing platform more than the other, but if you have good coverage,
I wouldn't worry about it. This is a similar problem to very rare
alleles detected in the data -- they are likely to be errors, so it is
good to employ a modest minor allele frequency filter after you process
the data.
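
To make the minor allele frequency (MAF) filter concrete, here is a minimal
Python sketch of the arithmetic. It assumes diploid genotypes coded as
alternate-allele counts (0, 1, 2) with None for missing calls. The locus
names, the toy genotypes, and the 0.05 cutoff are all hypothetical. This is
only an illustration of the idea, not how Stacks implements its own filter.

```python
def minor_allele_frequency(genotypes):
    """Return the MAF for one SNP.

    genotypes: alternate-allele counts per diploid sample (0, 1 or 2),
    with None marking a missing call.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    # Each diploid sample contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_by_maf(loci, min_maf=0.05):
    """Keep only SNPs whose MAF is at least min_maf (assumed cutoff)."""
    return {name: g for name, g in loci.items()
            if minor_allele_frequency(g) >= min_maf}


# Hypothetical toy data: one common SNP, one singleton, one fixed site.
example_loci = {
    "snp_a": [0, 1, 2, 1, 0, 1, None, 2],
    "snp_b": [0] * 19 + [1],
    "snp_c": [2, 2, 2, 2, 2],
}

kept = filter_by_maf(example_loci, min_maf=0.05)
print(sorted(kept))
```

A singleton allele like the one in snp_b is exactly the kind of call that
could be either a real rare variant or a platform-specific error, which is
why filtering it out reduces the chance of a spurious difference between
sequencing runs.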

I think your library construction effects would likely outweigh any
effects from sequencing platform. The flowcell and chemistry are
different on the 4000, but again, I think with good coverage these
effects should not be a problem.

Anyway, you can always do a PCA of your genotypes after processing and
check whether any splits appear in the data based on which sequencing
platform was used.
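
A rough sketch of that PCA check in plain Python, assuming a complete
samples-by-SNPs matrix of 0/1/2 genotype counts with no missing data. The
function names, the toy matrix, and the platform labels are hypothetical,
and the first component is found by simple power iteration to keep the
example free of dependencies. In practice a numerical library or a
dedicated population genetics tool would be the more usual choice.

```python
import math
import random


def pc1_scores(genotypes, iterations=100, seed=0):
    """Project samples onto the first principal component.

    genotypes: one row per sample, each a list of 0/1/2 allele counts.
    """
    n_samples, n_snps = len(genotypes), len(genotypes[0])
    # Center each SNP (column) on its mean across samples.
    means = [sum(row[j] for row in genotypes) / n_samples
             for j in range(n_snps)]
    centered = [[row[j] - means[j] for j in range(n_snps)]
                for row in genotypes]

    # Power iteration on X^T X converges to the leading loading vector.
    rng = random.Random(seed)
    loadings = [rng.uniform(-1, 1) for _ in range(n_snps)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, loadings))
                  for row in centered]
        loadings = [sum(centered[i][j] * scores[i] for i in range(n_samples))
                    for j in range(n_snps)]
        norm = math.sqrt(sum(w * w for w in loadings))
        if norm == 0:  # no variation in the data at all
            break
        loadings = [w / norm for w in loadings]
    return [sum(x * w for x, w in zip(row, loadings)) for row in centered]


def mean_by_group(scores, labels):
    """Average PC1 score for each label (here, sequencing platform)."""
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    return {label: sum(v) / len(v) for label, v in groups.items()}


# Hypothetical toy data: 8 samples x 5 SNPs, half sequenced on each platform.
toy_genotypes = [
    [0, 0, 0, 2, 1], [0, 0, 1, 2, 1], [0, 1, 0, 2, 0], [0, 0, 0, 2, 0],
    [2, 2, 1, 0, 1], [2, 1, 2, 0, 1], [2, 2, 2, 0, 0], [2, 2, 2, 1, 1],
]
toy_platforms = ["2500", "4000", "2500", "4000",
                 "2500", "4000", "2500", "4000"]

scores = pc1_scores(toy_genotypes)
print(mean_by_group(scores, toy_platforms))
```

If the per-platform means sit far apart relative to the spread of the
scores, or a scatter plot of the first two components clusters by platform
rather than by population, that would point to a batch effect worth
investigating. The sign of a principal component is arbitrary, so only the
separation between groups is meaningful.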

julian

nicolef...@gmail.com

Nov 28, 2018, 10:19:07 AM
to Stacks
Thank you Julian. With the wait time that facilities are now running for the HiSeq 2500 (a matter of months before I get sequences back), it seems I need to go with the HiSeq 4000 so I can get my sequences back within a decent time frame. Ultimately, I will be getting more reads with the 4000, and I will make sure they sequence at my original read length. My coverage is decent (7x-10x mean coverage depth), so I believe it will be fine. And I will definitely try the PCA. Thank you for your advice!

Best regards,

Nicole