Does the de novo method miss some loci that are heterozygous?


Peter G

Jul 12, 2021, 9:16:52 AM
to Stacks

Hi everyone,

If someone could help me with this I would be very grateful, thank you for reading.

I have a RADseq dataset for 204 bumblebees. I want to find out whether any of them are completely homozygous, as this would imply they are haploid and thus male. I want to remove any males from the dataset, if there are any, and have confirmation that the others are diploid.

Originally, I carried out the de novo method, then used populations to export the SNPs to VCF format. I then used vcftools with the argument --het which "Calculates a measure of heterozygosity on a per-individual basis. Specifically, the inbreeding coefficient, F, is estimated for each individual using a method of moments." For this, I got one individual with an F equal to 1, so I assumed it to be haploid and male.
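For readers unfamiliar with the method-of-moments estimate, here is a minimal sketch of the calculation, F = (O - E) / (N - E), where O is the observed number of homozygous sites, E the number expected under Hardy-Weinberg proportions, and N the number of genotyped sites. The function name `inbreeding_f` and its inputs are hypothetical, and the sketch is a simplification: it is not vcftools' own code, and the real tool may apply additional corrections when computing E.

```python
import math
from typing import List, Optional


def inbreeding_f(genotypes: List[Optional[int]], alt_freqs: List[float]) -> float:
    """Simplified method-of-moments F for one individual (illustrative only).

    genotypes: per-site count of alternate alleles (0, 1 or 2), None if missing.
    alt_freqs: per-site alternate allele frequency p in the whole sample.
    """
    observed_hom = 0      # O: sites where this individual is homozygous
    expected_hom = 0.0    # E: expected homozygous sites under Hardy-Weinberg
    n_sites = 0           # N: sites with a genotype call for this individual

    for genotype, p in zip(genotypes, alt_freqs):
        if genotype is None:
            continue  # missing calls contribute to neither O, E nor N
        n_sites += 1
        if genotype != 1:
            observed_hom += 1
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)

    denominator = n_sites - expected_hom
    if denominator == 0:
        return math.nan  # no variable sites, F is undefined
    return (observed_hom - expected_hom) / denominator
```

In this simplified form, F equals exactly 1 only when every called site is homozygous, so a single heterozygous call among many homozygous ones is enough to pull the value just below 1.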

I then used the reference-based approach twice, using reference genomes for B. terrestris and B. hortorum. Again, I used populations to export in VCF format, and used vcftools --het. The results were similar to those from the de novo method, but differed slightly between the 3 protocols. Notably, the individual that had F=1 from the de novo method now has F=0.99853 with the B. terrestris reference genome and F=0.99698 with the B. hortorum reference genome.

So my question is, why are these numbers different? Are they different because the de novo method missed some loci that are heterozygous and therefore is less accurate? Or is it something else?

Or if someone knows of a better method for finding homozygosity of individuals, could you let me know please?
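One simple cross-check, sketched here under stated assumptions rather than as a recommended pipeline, is to count the fraction of called genotypes that are heterozygous for each individual directly from the VCF. The function name `het_fraction_per_sample` is hypothetical; the sketch assumes an uncompressed, tab-delimited VCF with diploid `GT` entries and skips missing calls.

```python
from typing import Dict, Iterable, List


def het_fraction_per_sample(vcf_lines: Iterable[str]) -> Dict[str, float]:
    """Fraction of called diploid genotypes that are heterozygous, per sample."""
    samples: List[str] = []
    het: Dict[str, int] = {}
    called: Dict[str, int] = {}

    for raw in vcf_lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            het = {name: 0 for name in samples}
            called = {name: 0 for name in samples}
            continue

        gt_index = fields[8].split(":").index("GT")
        for name, entry in zip(samples, fields[9:]):
            genotype = entry.split(":")[gt_index]
            alleles = genotype.replace("|", "/").split("/")
            if len(alleles) != 2 or "." in alleles:
                continue  # missing or non-diploid call
            called[name] += 1
            if alleles[0] != alleles[1]:
                het[name] += 1

    # Samples with no called genotypes are left out rather than reported as 0.
    return {name: het[name] / called[name] for name in samples if called[name] > 0}
```

It could be called as `het_fraction_per_sample(open(path))`, where `path` points at the VCF exported by populations. Unlike F, this fraction does not depend on allele frequencies estimated from the rest of the sample.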

I've attached the data files of the vcftools outputs and log files.

Thanks,

Peter G

Jul 12, 2021, 9:18:50 AM
to Stacks
For some reason the above wouldn't post with the attachments. Trying to attach to this post instead.
males_methods.png

Peter G

Jul 12, 2021, 9:20:25 AM
to Stacks
Attaching log files in this additional post. Sorry for this being spread between posts!
ref_map_bter.log
ref_map_bhor.log
populations_denovo.log

Julian Catchen

Jul 12, 2021, 5:58:48 PM
to stacks...@googlegroups.com, Peter G
Hi,

Any analysis method you apply has to account for noise and error in the
sequencing data. You will never find a 100% answer in any of these
metrics. Mostly, this is due to repetitive regions in the genome which
can result in homozygous loci collapsing into a single
heterozygous-looking locus. Or, with different reference genomes, you can
have different regions of the genome present/absent with variable
alignments of your RAD reads to the genomes. Or, you can have
complicated indels that fool the assembler/aligners into collapsing loci
together.

"Mostly homozygous" is the best answer any analysis will be able to give
you, however, you should be able to differentiate between your classes
of bees if you compare them to one another.
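The comparison between individuals can be sketched as an outlier check on the F column of the vcftools `.het` output, which is tab-delimited with the header `INDV`, `O(HOM)`, `E(HOM)`, `N_SITES`, `F`. The function name `flag_f_outliers` and the cutoff of 5 median absolute deviations are illustrative assumptions, not something suggested in this thread; any cutoff should be checked against a plot of the data.

```python
import csv
import io
import statistics
from typing import Dict, List


def flag_f_outliers(het_text: str, n_mads: float = 5.0) -> Dict[str, List[str]]:
    """Flag individuals whose F lies far from the sample median.

    het_text: contents of a vcftools .het file (tab-delimited, with header).
    n_mads: hypothetical cutoff, in median absolute deviations from the median.
    """
    reader = csv.DictReader(io.StringIO(het_text), delimiter="\t")
    f_values = {row["INDV"]: float(row["F"]) for row in reader}

    median_f = statistics.median(f_values.values())
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median(abs(f - median_f) for f in f_values.values())
    threshold = n_mads * mad

    return {
        # Unusually homozygous: candidate haploid males.
        "high": [name for name, f in f_values.items() if f - median_f > threshold],
        # Unusually heterozygous: worth a closer look.
        "low": [name for name, f in f_values.items() if median_f - f > threshold],
    }
```

It could be called as `flag_f_outliers(open(path).read())`, where `path` points at the `.het` file. Using the median rather than the mean keeps a few extreme individuals from shifting the baseline they are compared against.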

The same things occur if you sequence known-haploid DNA, or if you look
for sex-specific markers, or if you use the data to establish
parent/sibling relationships (you can actually find siblings that are
"more" related to a focal individual than their true parents -- due to
the average level of relatedness in the population). And there are many
other similar cases...

Best,

julian


Peter G

Jul 13, 2021, 1:22:19 PM
to Stacks
Thank you very much Julian, that is very helpful. Whilst we had presumed it would be down to small variations, we couldn't think of the underlying causes, but a small number of mis-grouped loci from repetitive regions etc. makes complete sense. It also fits with why, when we used the reference-based approaches, the homozygosity for some samples increased a little. Looking at the de novo approach, we have 3 clear outliers with very high homozygosity, and although some of the other samples are closer to the outliers when we compare this using a reference, it still seems those 3 high samples are best described as mostly homozygous.

On a similar note, we have another 3 samples on the opposite end of the spectrum (if you look at the graph in the previous post, almost all samples have F scores of 0.8, then 3 have 0.99 [presumed haploids], but another 3 have F values of 0.6). STRUCTURE analysis doesn't flag these samples as hybrids, as we first thought might be happening (our sample population consists of 3 different species, but the outliers are not linked to species or collection site). Do you have any thoughts on why these 3 samples could have markedly low F scores? All our samples have > 1 million retained reads and > 30x coverage depth. Given the pattern is present both de novo and when reference-aligned, we don't believe it's contamination, and we think the samples should not be filtered out. It's just not clear to us what could be causing the values.
