Batch effect when combining data from NovaSeq/HiSeq runs

Oksana Vernygora

Mar 14, 2022, 5:52:07 PM

to Stacks

Hi,

I am running Stacks on combined data from three ddRAD runs (one HiSeq run and two independent NovaSeq runs). When the samples are processed together, there is a clear batch effect: the novaseq1 samples have a substantial number of missing genotypes. I attached an IGV screenshot of the final filtered VCF file showing those distinct blocks of samples.

However, when I run Stacks on each group of samples separately (novaseq1, novaseq2, hiseq) with the exact same parameters, this doesn't happen, i.e. the novaseq1 samples genotype just fine (track_1 in the image below):

I have attached ref_map and gstacks.distributions files for these runs.

These are low-coverage samples, and I know that can cause issues, but I don't know what could be causing this batch effect when the samples are combined.
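For anyone who wants to put numbers on the batch effect described above rather than judge it from the IGV screenshot, here is a minimal sketch that computes the per-sample fraction of missing genotypes from a VCF and averages it per sequencing run. It uses only the Python standard library. The function names and the `batch_of` sample-to-run mapping are hypothetical, made up for illustration and not part of Stacks; a genotype is counted as missing if its GT field contains a `.`.

```python
from collections import defaultdict


def missing_rate_per_sample(vcf_lines):
    """Return {sample: fraction of sites whose GT field is missing}.

    vcf_lines is any iterable of VCF text lines (e.g. an open file).
    A call counts as missing if its GT contains '.', so './.', '.'
    and partially missing calls such as './1' are all counted.
    """
    samples = []
    missing = defaultdict(int)
    n_sites = 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        n_sites += 1
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[0]  # GT is the first FORMAT key when present
            if "." in gt:
                missing[sample] += 1
    return {s: (missing[s] / n_sites if n_sites else 0.0) for s in samples}


def mean_missing_by_batch(rates, batch_of):
    """Average the per-sample missing rates within each batch.

    batch_of is a hypothetical {sample: batch_name} dict supplied by the
    user; samples not listed in it are grouped under 'unknown'.
    """
    grouped = defaultdict(list)
    for sample, rate in rates.items():
        grouped[batch_of.get(sample, "unknown")].append(rate)
    return {batch: sum(v) / len(v) for batch, v in grouped.items()}
```

To use it, open the filtered VCF with `open(path)`, pass the file object to `missing_rate_per_sample`, and then pass the result to `mean_missing_by_batch` together with a dict that assigns each sample name to novaseq1, novaseq2, or hiseq. A large gap between the per-run means would confirm that the missingness tracks the sequencing run rather than individual samples.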

Any insight on this issue would be greatly appreciated!

Thank you,

-Oksana

ref_map_combined_30inds.log

vcf_novaseq1_10inds_0.5miss_mmDP3_0.01maf.log

ref_map_novaseq1_10inds.log

gstacks_combined_30inds.log.distribs

vcf_Alud_30inds_0.5miss_mmDP3_0.01maf.log

Victor Fitzgerald

Nov 13, 2022, 2:14:54 PM

to Stacks

Hi Oksana,

I'm having a similar issue. Did you ever find a solution to this?

Thank you,

Victor

Oksana Vernygora

Nov 13, 2022, 6:15:27 PM

to Stacks

Hi Victor,

Unfortunately, I never figured out what was causing this issue or how to fix it in Stacks. After trying multiple options, I decided to test alternative pipelines. Processing the same data with dDocent produced uniform genotype calls without any issues. I also tried TASSEL and got similarly uniform results without any batch effects.

I am not a bioinformatician, so I can't speak to the exact differences between these pipelines; I just hope this helps =)

-Oksana
