[Genome] minor allele frequency


Teresita Diaz De Ståhl

Apr 28, 2011, 3:14:08 PM
to gen...@soe.ucsc.edu
Hello UCSC,



I would appreciate it if you could help me. I am having trouble getting the correct information on minor allele frequency.

If I understand correctly:

* Common SNPs (132): uniquely mapped variants that appear in at least 1% of the population
* Flagged SNPs (132): uniquely mapped variants, excluding Common SNPs, that have been flagged by dbSNP as "clinically associated"
* Mult. SNPs (132): variants that have been mapped to more than one genomic location



* Common SNPs (132) and Flagged SNPs (132) should be mutually exclusive; that is, "clinically associated" SNPs are not "common", and common SNPs are not flagged as "clinically associated". But I downloaded the information for rs169547 (only as an example) through the Table Browser, choosing Common SNPs (132): http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&position=chr13:32929380-32929390&hgsid=193533287

* I get the results that I am sending you in the attached file.

* This SNP is flagged as clinically associated. Why, given that I downloaded it through the Common SNPs table?

* Then I get 3 frequencies for 3 alleles: G,A,C, 0.963316,0.033403,0.003281,

* Is the minor allele frequency 0.003281, for C?


However, in dbSNP build 132 (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?searchType=adhoc_search&type=rs&rs=rs169547) this SNP has MAF/MinorAlleleCount: T=0.025/32, that is, the value is given for T. Is the minor allele T or C?

Am I looking at the wrong columns? I would appreciate it if you could help me get on the right track.
Thanks

Kind regards /Teresita





Teresita Díaz de Ståhl, PhD., Assoc. Prof.

Department of Oncology-Pathology

Cancer Center Karolinska, CCK R8:04

Karolinska Institutet

SE-171 76 Stockholm, Sweden

+46-8-517 721 35 (office)

E-mail: Teresita.Di...@ki.se





-----Original Message-----
From: Brooke Rhead [mailto:rh...@soe.ucsc.edu]
Sent: 28 April 2011 01:29
To: Teresita Diaz De Ståhl
Cc: gen...@soe.ucsc.edu
Subject: Re: [Genome] minor allele frequency



Hello Teresita,



Allele frequencies are now included in our SNP tables (this was not true prior to dbSNP build 132). So, if you download the snp132 table from here:



http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/



(the filename is snp132.txt.gz), you will get allele frequencies. Note that this file is quite large, so you will probably want to use one of the alternate methods for downloading described at the top of that page.



The table schema is shown here:

http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=varRep&hgta_track=snp132&hgta_table=snp132&hgta_doSchema=describe+table+schema



Note that not all SNPs include allele frequency data.
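
As a rough illustration of reading those allele frequency columns, here is a minimal Python sketch. It assumes the alleles and their frequencies arrive as two comma-separated fields with a trailing comma (the form of the values reported for rs169547 in this thread), and it treats the second most frequent allele as the minor allele, which is a common convention rather than anything the table itself defines. The function names are hypothetical.

```python
def parse_allele_freqs(alleles_field, freqs_field):
    """Pair each allele with its frequency from two comma-separated fields."""
    # The fields end with a trailing comma, so drop the empty pieces.
    alleles = [a for a in alleles_field.split(",") if a]
    freqs = [float(f) for f in freqs_field.split(",") if f]
    if len(alleles) != len(freqs):
        raise ValueError("allele and frequency lists differ in length")
    return dict(zip(alleles, freqs))


def minor_allele(freq_by_allele):
    """Return (allele, frequency) for the second most frequent allele.

    Returns None when fewer than two alleles have frequency data,
    since not all SNPs include allele frequencies.
    """
    ranked = sorted(freq_by_allele.items(), key=lambda item: item[1], reverse=True)
    return ranked[1] if len(ranked) > 1 else None


# Values reported for rs169547 in this thread.
freqs = parse_allele_freqs("G,A,C,", "0.963316,0.033403,0.003281,")
print(minor_allele(freqs))
```

Under this convention the sketch picks A rather than C for rs169547. Whether a rarely reported third allele is real or an artifact still has to be judged separately.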



You might also be interested to know that we now provide 4 separate SNP tracks, described in our announcement here:

http://genome.ucsc.edu/goldenPath/newsarch.html#041811.2

(and also described on the SNP track descriptions in the Genome Browser).



If you have further questions, please feel free to contact us again at gen...@soe.ucsc.edu.



--

Brooke Rhead

UCSC Genome Bioinformatics Group





On 04/27/11 07:41, Teresita Diaz De Ståhl wrote:

> Hi,
>
> I wonder if it is possible to download the db132 including data of
> minor allele frequency for each SNP. Thanks in advance for some tips
> how to do it.
>
> Kind regards
>
> Teresita
>
> Teresita Díaz de Ståhl, PhD
>
> _______________________________________________
> Genome maillist - Gen...@soe.ucsc.edu
> https://lists.soe.ucsc.edu/mailman/listinfo/genome
rs169547.txt

Brooke Rhead

Apr 28, 2011, 4:44:40 PM
to Teresita Diaz De Ståhl, gen...@soe.ucsc.edu
Hi Teresita,

I am asking one of our engineers to look at your allele frequencies
question, but I can answer your question about Common vs. Flagged SNPs.

The Common SNPs and Flagged SNPs are not expected to be mutually
exclusive. There are longer explanations on the track controls pages
(click on the blue track names in the Variation and Repeats section on
the main Genome Browser page: http://genome.ucsc.edu/cgi-bin/hgTracks).

From the Flagged SNPs page:

"Only SNPs flagged as clinically associated by dbSNP, mapped to a single
location in the reference genome assembly, and *not* known to have a
minor allele frequency of at least 1%, are included in this subset."

From the Common SNPs page:

"Only SNPs that have a minor allele frequency of at least 1% and are
mapped to a single location in the reference genome assembly are
included in this subset.
...
The selection of SNPs with a minor allele frequency of 1% or greater is
an attempt to identify variants that appear to be reasonably common in
the general population. Taken as a set, common variants should be less
likely to be associated with severe genetic diseases due to the effects
of natural selection, following the view that deleterious variants are
not likely to become common in the population."

Also, if you scroll down to the description of "Miscellaneous Attributes
(dbSNP)" under "Interpreting and Configuring the Graphical Display" you
will see these two descriptions:

"Clinically Associated - SNP is in OMIM/OMIA and/or at least one
submitter is a Locus-Specific Database. This does not necessarily imply
that the variant causes any disease, only that it has been observed in
clinical studies.
Appears in OMIM/OMIA - SNP is mentioned in Online Mendelian Inheritance
in Man for human SNPs, or Online Mendelian Inheritance in Animals for
non-human animal SNPs. Some of these SNPs are quite common, others are
known to cause disease; see OMIM/OMIA for more information."

I hope this answers your first question. We will get back to you with
an answer to your other question.

--
Brooke Rhead
UCSC Genome Bioinformatics Group



Angie Hinrichs

Apr 29, 2011, 6:34:34 PM
to Teresita Diaz De Ståhl, gen...@soe.ucsc.edu
Hi Teresita,

Unfortunately I think rs169547's frequency data in dbSNP are a little jumbled because some submissions are on the forward strand, and some are on the reverse strand, and not all were corrected to be on the same strand before dbSNP combined the submissions' genotype and allele counts.

Here are several other details to note on the dbSNP page http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?searchType=adhoc_search&type=rs&rs=rs169547 :

1. In the top section where you see "MAF/MinorAlleleCount: T=0.025/32", in the same column it says "RefSNP Alleles: A/C/G/T" -- while this is possible, it is often a symptom of a strand error. Also in the same column it says "MAF Source: 1000 Genomes", so it is not using all available frequency data for the MAF that it quotes in that section.

2. Scroll down to the section "Submitter records for this RefSNP Cluster", and see the columns "ss to rs Orientation/Strand" and "Alleles". Most submissions have either "rev/B C/T" or "fwd/T A/G", which are reverse complements of each other. That's consistent with a bi-allelic SNP that is A/G on the forward strand of the reference assembly. But then at the bottom of the table, there is a group of submissions starting with three from COMPLETE_GENOMICS that report the opposite alleles for fwd or rev: "fwd/B C/T" or "rev/T A/G", or from 1000 Genomes, just "fwd/ C/T".

Isn't it strange, though, that the first group has "fwd/T" and "rev/B" (T for Top, B for Bottom I guess?) but the latter group has "fwd/B" and "rev/T"? Also, the flanking sequences given for the "fwd/B C/T" submissions in the second group are the same as the flanking sequences of the "rev/B C/T" submissions in the first group; likewise for A/G. To me, this looks like the result of a programming error, or perhaps a submission data error that dbSNP's programs should detect.

3. Scroll down further to the section "Population Diversity". In the brightly colored displays of genotypes and alleles, it is clear that some submissions are purely A/G, and some are purely C/T. Nobody has observed 3 or 4 alleles in the same population, only one or two, and it's always either A/G or C/T, and clearly the minor allele is A (or T on the rev strand) and the major allele is G (or C on the rev strand). The 1000 Genomes pilot submissions (no genotype bars, only alleles) are reported as C/T and that is why the top of the page's MAF from 1000 Genomes is for T.
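
The strand relationship in point 2 (C/T reported on one strand is A/G on the other) can be sketched in Python. This is only an illustration for single-base alleles; the function name is hypothetical, and longer alleles would also need to be reversed, which this sketch does not do.

```python
# Watson-Crick complements for single-base alleles.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_alleles(alleles):
    """Return a slash-separated allele string as seen on the opposite strand.

    Only single-base alleles are handled. The result is sorted so that
    "C/T" and "T/C" map to the same string.
    """
    return "/".join(sorted(COMPLEMENT[base] for base in alleles.split("/")))


print(complement_alleles("C/T"))  # A/G
print(complement_alleles("A/G"))  # C/T
```

A submission whose alleles only match the others after this flip, while its stated orientation says no flip is needed, is the kind of inconsistency described above.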

Now, on to UCSC's distillation of the vast amount of data in dbSNP:

> * Then I get 3 frequencies for 3 alleles: G,A,C,
> 0.963316,0.033403,0.003281,
>
> * Is the minor allele freq 0.003281? for C?

The alleleFreq columns in snp132 are extracted from dbSNP's SNPAlleleFreq table (ftp://ftp.ncbi.nlm.nih.gov/snp/database/organism_data/human_9606/SNPAlleleFreq.bcp.gz , columns defined in ftp://ftp.ncbi.nlm.nih.gov/snp/database/organism_schema/human_9606/human_9606_table.sql.gz), joined with dbSNP's Allele table (ftp://ftp.ncbi.nlm.nih.gov/snp/database/shared_data/Allele.bcp.gz , columns defined in ftp://ftp.ncbi.nlm.nih.gov/snp/database/shared_schema/dbSNP_main_table.sql.gz). Here is a SQL query that extracts the rows for rs169547 from those tables:

select snp_id, allele, chr_cnt, freq
from SNPAlleleFreq, Allele
where SNPAlleleFreq.allele_id = Allele.allele_id
and snp_id = 169547;

The result is

+--------+--------+---------+------------+
| snp_id | allele | chr_cnt | freq       |
+--------+--------+---------+------------+
| 169547 | G      |    3230 |   0.963316 |
| 169547 | A      |     112 |  0.0334029 |
| 169547 | C      |      11 | 0.00328064 |
+--------+--------+---------+------------+

Why are there no T's reported there, but a small number of C's? -- I wish I knew! (Even without joining with Allele, only three rows are returned from SNPAlleleFreq.) We will have to ask dbSNP, snp-...@ncbi.nlm.nih.gov.

For now, I suppose one could ignore the row for C and recompute from the remaining counts: 3230 G + 112 A = 3342 chromosomes, giving 0.9665 G and 0.0335 A.
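
That recalculation can be written as a short Python sketch. It uses the chr_cnt values from the query result above; the function name and the `exclude` parameter are hypothetical, and dropping an allele this way is a workaround, not a dbSNP or UCSC procedure.

```python
def renormalized_freqs(chr_counts, exclude=()):
    """Recompute allele frequencies from chromosome counts, skipping some alleles."""
    kept = {allele: n for allele, n in chr_counts.items() if allele not in exclude}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no chromosome counts left after exclusion")
    return {allele: n / total for allele, n in kept.items()}


# chr_cnt values for rs169547 from the SNPAlleleFreq query result.
counts = {"G": 3230, "A": 112, "C": 11}

freqs = renormalized_freqs(counts, exclude={"C"})
print({allele: round(f, 4) for allele, f in freqs.items()})
# {'G': 0.9665, 'A': 0.0335}
```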


> However in db132
> http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?searchType=adhoc_search&type=rs&rs=rs169547
> this SNP has a MAF/MinorAlleleCount: T=0.025/32 and for T. Is the
> minor allele T or C?

I'm pretty sure it's not C. I think the minor allele is A on the forward strand, T on the reverse strand of GRCh37. The minor allele frequency depends a lot on your population (0% for East Asians, up to almost 10% for Africans) but seems to be in the neighborhood of 3% overall in submissions to dbSNP.

Our snp132 table has a column 'exceptions' which includes "SingleClassQuadAllelic" for rs169547 and others like it. You might consider excluding SNPs like that from your analysis.
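
One way to apply that exclusion is sketched below in Python. It assumes the exceptions value is a comma-separated list of exception names, and it takes the position of that column as a parameter instead of hard-coding it; check the table schema for the real layout. The two-column rows, the second row's name, and the function names are all made up for illustration.

```python
def has_exception(exceptions_field, name="SingleClassQuadAllelic"):
    """True if the comma-separated exceptions field contains the given name."""
    return name in {e for e in exceptions_field.split(",") if e}


def drop_flagged(rows, exceptions_index, name="SingleClassQuadAllelic"):
    """Yield only the rows whose exceptions column lacks the given name.

    rows: iterable of lists of column values (e.g. split tab-separated lines).
    exceptions_index: position of the exceptions column (an assumption to verify).
    """
    for row in rows:
        if not has_exception(row[exceptions_index], name):
            yield row


# Toy rows of the form [name, exceptions]; only the first reflects this thread.
rows = [
    ["rs169547", "SingleClassQuadAllelic"],
    ["rsExample", ""],
]
print([row[0] for row in drop_flagged(rows, exceptions_index=1)])
# ['rsExample']
```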

Hope that helps (and please consider reporting this case to snp-...@ncbi.nlm.nih.gov),

Angie


