Intuitive description of freemix score


John Blischak

Aug 28, 2017, 10:10:52 PM
to verifyBamID
Hi,

I recently answered a question on the Bioinformatics StackExchange site about the freemix score calculation:

https://bioinformatics.stackexchange.com/a/2388/1302

I based my answer off of the original paper, the online documentation, and a previous response by Hyun Min Kang on this list:

https://groups.google.com/d/msg/verifybamid/oNMOfEZDwE4/AbVP_D-mbBEJ

I am cross-posting here in case 1) a verifyBamID user reading this list may find my explanation useful, and 2) Hyun Min Kang or any of the other verifyBamID authors would like to edit my explanation.

Thanks for creating such a useful tool!

John

kiran girdhar

May 14, 2018, 11:16:52 AM
to verifyBamID
Hey

I have a question about the FREEMIX parameter.

I have this quote from you, Hyun Min Kang:

"The key idea of FREEMIX estimate is to use excessive heterozygosity to estimate the level of contamination. Especially for common SNPs, you will observe higher fraction of heterozygous alleles than 2*p*(1-p), and it turns out that you can quantify the contamination very well if you know the population allele frequency already. If you do not have accurate population allele frequency information, then it would be harder to estimate FREEMIX parameters using verifyBamID."


My questions are:


1. If you don't give an input VCF file, then how does it estimate the population allele frequency to measure the heterozygosity?


1a. Does verifyBamID use only the BAM file to estimate contamination with the sequence-only method?


2. FREEMIX values can vary from 0 to 0.5 because the model treats contamination as a mixture of two samples. Is that right?


3. Is there any way to determine whether gender mixing happened during sequencing? I feel the total number of SNPs on chrX and chrY is not enough to get a good estimate of the FREEMIX parameter.


My second question is about CHIPMIX.


My understanding is that CHIPMIX comes from the sequence + array method and uses the allele frequency from the input VCF file. Is that true?


1. Let's say sample 1 is contaminated with 50% of sample 2. What would CHIPMIX look like?

Hyun Min Kang

May 15, 2018, 3:16:49 AM
to verif...@googlegroups.com
You need to provide an input VCF with allele frequencies specified. Currently, verifyBamID does not check for gender mismatch.

--
You received this message because you are subscribed to the Google Groups "verifyBamID" group.
To unsubscribe from this group and stop receiving emails from it, send an email to verifybamid...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

danli...@gmail.com

May 28, 2018, 10:48:54 AM
to verifyBamID
Hi John,

Thanks for your reply. However, I still have questions about equation (2) in the paper. The paper does mention which populations were used for estimating the contamination (Figure 4, CEU and YRI), but when I download VerifyBamID and only provide my VCF input file, how does VerifyBamID estimate the contamination using sequence data alone? In other words, how does VerifyBamID estimate P(g_i) in equation (2)?

danli...@gmail.com

May 28, 2018, 10:51:04 AM
to verifyBamID
Hi Hyun,

Could you please answer the question from Kiran: 

If you don't give an input VCF file, then how does it estimate the population allele frequency to measure the heterozygosity?

Thanks!


Dan

John Blischak

May 29, 2018, 10:34:46 AM
to verifyBamID
Hi Dan,

The VCF file is required. verifyBamID will not work without a VCF file. From the documentation on input files:

The input VCF file contains (1) external genotype information and/or (2) allele frequency information as AF entry or AC/AN entries in the INFO field. (See VCF specification for further details). If neither information is provided, verifyBamID will not work properly.


In other words, the VCF file needs to contain either 1) external genotypes from a population of individuals (in which case the allele frequency of each SNP is calculated directly from this population) or 2) allele frequency information provided in the INFO column.
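
If you want to check which of these two cases your own VCF falls into, here is a minimal Python sketch that pulls the allele frequency out of an INFO string, using AF when present and falling back to AC/AN. This is only an illustration of the INFO layout described above, not how verifyBamID itself parses the file, and the example record is made up.

```python
def allele_frequency(info):
    """Return the ALT allele frequency from a VCF INFO string, or None."""
    fields = {}
    for entry in info.split(";"):
        # Flags without "=" (e.g. "DB") end up with an empty value.
        key, _, value = entry.partition("=")
        fields[key] = value
    if "AF" in fields:
        # Multi-allelic sites list one AF per ALT allele; take the first.
        return float(fields["AF"].split(",")[0])
    if "AC" in fields and "AN" in fields and int(fields["AN"]) > 0:
        # AC = ALT allele count, AN = total number of called alleles.
        return int(fields["AC"].split(",")[0]) / int(fields["AN"])
    return None


# Hypothetical data line; INFO is the 8th tab-separated column.
record = "1\t10177\trs1\tA\tC\t100\tPASS\tAC=30;AN=200"
print(allele_frequency(record.split("\t")[7]))
```

A return value of None for most of your sites would mean the file carries neither AF nor AC/AN, so the frequencies would have to come from the genotype columns instead.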

On Monday, May 28, 2018 at 10:51:04 AM UTC-4, danli...@gmail.com wrote:
If you don't give an input VCF file, then how does it estimate the population allele frequency to measure the heterozygosity?


If you don't give an input VCF file, then it can't estimate the population allele frequency. Kiran and Dan, have you been able to successfully run verifyBamID without providing a VCF file?

When I run verifyBamID without providing a VCF file with the argument --vcf, I receive the following error:

FATAL ERROR -
--vcf [vcf file] required

John

danli...@gmail.com

May 29, 2018, 11:22:20 AM
to verifyBamID
Hi John,

Thanks for the reply. 
But I still have a question: if the allele frequency of each SNP is calculated directly from the population in the input VCF, does that mean P(g_i) in equation (2) is estimated from the input VCF? What if the input VCF only includes one individual? I assume the result in that situation could not be trusted?

Thanks!

Dan

John Blischak

May 29, 2018, 12:20:14 PM
to verifyBamID
Hi Dan,

On Tuesday, May 29, 2018 at 11:22:20 AM UTC-4, danli...@gmail.com wrote:
What if the input VCF only includes one individual? I assume the result in that situation could not be trusted?

I agree. If the VCF file only contains one individual, the results would be hard to interpret. In this case, I'd recommend including the INFO column field AF (for Allele Frequency) in the VCF file.


You can use the allele frequency estimate from the most closely related 1000 Genomes population:

http://www.internationalgenome.org/faq/which-populations-are-part-your-study/

A word of caution: I am a verifyBamID user, not a developer. I provide a VCF file with known genotypes of a population of individuals (> 100) with no variables defined in INFO. This has worked well for me in practice for multiple projects using different human populations. If you are going to manually set the AF INFO field but also provide known genotypes for one individual, I'd recommend verifying that this is working by also performing some tests with randomly generated allele frequency data to confirm the results change. I searched the code and can confirm that the AF INFO field is recognized, but I can't assure you that this won't be overridden by your 1 individual's genotypes.
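
For the random-AF test, something like the following Python sketch could generate the scrambled VCF lines. The function name and the 0.01 to 0.99 range are my own choices for illustration. It leaves every other INFO entry (including any AC/AN) untouched, so you may want to strip those as well if your file has them.

```python
import random


def with_random_af(vcf_line, rng):
    """Return a VCF line whose INFO AF entry is replaced by a random value."""
    line = vcf_line.rstrip("\n")
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    cols = line.split("\t")
    # Keep every INFO entry except an existing AF and the "." placeholder.
    info = [e for e in cols[7].split(";")
            if e and e != "." and not e.startswith("AF=")]
    info.append("AF={:.4f}".format(rng.uniform(0.01, 0.99)))
    cols[7] = ";".join(info)
    return "\t".join(cols)


rng = random.Random(42)  # fixed seed so the scrambled file is reproducible
# Hypothetical data line with one sample column.
example = "1\t100\t.\tA\tG\t50\tPASS\tAF=0.5;DP=12\tGT\t0/1\n"
print(with_random_af(example, rng))
```

If FREEMIX barely moves between the real and the scrambled file, that would suggest the AF field is not the value actually being used.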


Hope that helps,

John

danli...@gmail.com

May 29, 2018, 1:47:29 PM
to verifyBamID
Thank you John, 

This is really helpful, thank you a lot.
I think my real question is about the population diversity in my samples. I have > 100 (296) individuals in my VCF file, but they are from 4 populations. My concern is: will the mixed populations lead to a wrong or inaccurate estimate of AF? I will definitely add the AF INFO field to my VCF file.
One more question: when you mentioned different populations without AF INFO, did you mean mixed populations in one file, or do you keep one population per file?

Thanks!

Dan

kiran girdhar

May 29, 2018, 2:14:29 PM
to verif...@googlegroups.com
My guess is you would have to create 4 VCF files, each with its respective AF.



KG


John Blischak

May 29, 2018, 4:51:46 PM
to verifyBamID
Hi Dan,


On Tuesday, May 29, 2018 at 1:47:29 PM UTC-4, danli...@gmail.com wrote:
This is really helpful, thank you a lot.

Glad I could help!

I think my real question is about the population diversity in my samples. I have > 100 (296) individuals in my VCF file, but they are from 4 populations. My concern is: will the mixed populations lead to a wrong or inaccurate estimate of AF? I will definitely add the AF INFO field to my VCF file.

I think this depends on your use case. What is your main goal? Do you want to detect sample swaps? If yes, then the AF estimates are likely of little consequence.

However, if you want accurate estimates of contamination, this gets tricky. If you're concerned with contamination from the other 295 individuals in your study, then perhaps your empirical AF would make the most sense. If on the other hand you think the contamination arose only from samples of the same genetic background (e.g. maybe you received each population of individuals as cell lines from a separate lab), then it would likely make more sense to follow Kiran's suggestion of creating 4 separate VCF files and only calculating the contamination of each sample against the other samples from the same population.
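
To see how much the pooled and per-population frequencies can differ, here is a small Python sketch that computes the ALT allele frequency at a single site separately for each population. The sample names and the population labels "A" and "B" are hypothetical, and this is just arithmetic on GT strings, not anything verifyBamID does internally.

```python
from collections import defaultdict


def allele_frequencies_by_population(genotypes, population_of):
    """Compute the ALT allele frequency at one site for each population.

    genotypes: dict sample -> GT string such as "0/1" or "1|1"
    population_of: dict sample -> population label
    """
    alt = defaultdict(int)
    total = defaultdict(int)
    for sample, gt in genotypes.items():
        pop = population_of[sample]
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # skip missing calls
            total[pop] += 1
            alt[pop] += allele != "0"  # any non-reference allele counts as ALT
    return {pop: alt[pop] / total[pop] for pop in total}


# Hypothetical genotypes at one SNP for four samples from two populations.
genotypes = {"s1": "0/1", "s2": "1/1", "s3": "0/0", "s4": "0|1"}
population_of = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
print(allele_frequencies_by_population(genotypes, population_of))
```

In this toy case the pooled frequency would be 4/8 = 0.5, which matches neither population, and that is the situation where splitting the VCF by population matters.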
 
One more question: when you mentioned different populations without AF INFO, did you mean mixed populations in one file, or do you keep one population per file?

The latter. My projects have always only involved a single population. Also, my main goal was to detect sample swaps from RNA-seq data, which verifyBamID does an amazing job of. I don't have any direct experience with estimating contamination from DNA-seq data (apart from reading the original paper), so I am hesitant to give strong recommendations on how you proceed.
 
John

danli...@gmail.com

May 29, 2018, 5:25:42 PM
to verifyBamID
Hi John,

I really appreciate your suggestions and all the information you posted here.
At first my main goal was to test for sample swaps, but I noticed contamination in the results, which is why I am trying to figure out how VerifyBamID works, especially how FREEMIX is calculated.
Following your reply, I have checked that the contaminated samples are from the same population, so I will try using this population for the reference AF.
Thank you very much!

Dan

Hyun Min Kang

May 29, 2018, 6:48:44 PM
to verif...@googlegroups.com
Hi Dan -- 

* I think it would be best to read the paper to understand how FREEMIX works: https://www.ncbi.nlm.nih.gov/pubmed/23103226
* Conceptually, FREEMIX uses the excess heterozygosity (beyond what is expected) with respect to allele frequency.
* If you do not know the population allele frequency, try the new version of verifyBamID at https://github.com/Griffan/VerifyBamID , which jointly estimates genetic ancestry and contamination, so that you do not need to know the allele frequencies a priori.
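
To make the excess-heterozygosity idea concrete, here is a simplified Python sketch of a two-sample mixture likelihood. It assumes Hardy-Weinberg genotype priors from a known allele frequency, a fixed per-base error rate, and a plain grid search over the contamination fraction alpha. All names and numbers are illustrative, the data are simulated, and this is not the verifyBamID implementation (the paper's model uses per-base qualities, among other things).

```python
import math
import random

ERROR = 0.01  # assumed per-base error rate (illustrative)


def genotype_priors(p):
    """Hardy-Weinberg probabilities of carrying 0, 1 or 2 ALT alleles."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def alt_read_prob(g1, g2, alpha):
    """Chance a read shows ALT when a fraction alpha of reads is from sample 2."""
    frac = (1 - alpha) * g1 / 2 + alpha * g2 / 2
    return frac * (1 - ERROR) + (1 - frac) * ERROR


def log_likelihood(alpha, sites):
    """sites: list of (allele_frequency, alt_reads, depth) tuples."""
    total = 0.0
    for p, k, n in sites:
        prior = genotype_priors(p)
        site_lik = 0.0
        # Sum over the unknown genotypes of the sample and the contaminant.
        for g1 in range(3):
            for g2 in range(3):
                q = alt_read_prob(g1, g2, alpha)
                site_lik += prior[g1] * prior[g2] * q ** k * (1 - q) ** (n - k)
        total += math.log(site_lik)
    return total


def estimate_alpha(sites):
    """Grid search over alpha in [0, 0.5] in steps of 0.01."""
    grid = [i / 100 for i in range(51)]
    return max(grid, key=lambda a: log_likelihood(a, sites))


def simulate(alpha, n_sites, depth, rng):
    """Draw read counts from the same model with a known alpha."""
    sites = []
    for _ in range(n_sites):
        p = rng.uniform(0.1, 0.9)  # common SNPs only
        g1 = sum(rng.random() < p for _ in range(2))
        g2 = sum(rng.random() < p for _ in range(2))
        q = alt_read_prob(g1, g2, alpha)
        k = sum(rng.random() < q for _ in range(depth))
        sites.append((p, k, depth))
    return sites


sites = simulate(alpha=0.10, n_sites=1000, depth=30, rng=random.Random(1))
print(estimate_alpha(sites))
```

The grid stops at 0.5 because in a two-sample mixture alpha and 1 - alpha describe the same data with the sample labels swapped. If the allele frequencies fed into genotype_priors are wrong, the priors are wrong and the estimate drifts, which is why accurate population frequencies matter.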

Hyun.

