Phased genotypes for individuals with an NA---- number

J Ir

Mar 27, 2018, 5:52:59 PM

to gen...@soe.ucsc.edu

Thank you for this great online service.

I want to make a table with a few thousand non-consecutive rsnums (rows) and a few NA---- [individuals from 1000 Genomes that have phased genotypes] (columns). I notice that you have about 20 pgNA#'s in your download section. I was able to download the gz file for one of them and the genotypes seem to use "/" and not the pipe symbol, so I am unsure whether these are in fact phased genotypes.

Ensembl allows users to extract a single phased genotype, though it requires interfacing with their API to make the request that I want. There are nearly 4,000 phased individuals with full genome results to choose from.
http://grch37.ensembl.org/Homo_sapiens/Variation/Explore?db=core;r=7:24452405-24453405;v=rs111;vdb=variation;vf=107

I tried to use their data slicer to create a subset of rsnums, though they only allowed selection based upon a genomic region; not on a list of non-consecutive variants.

Your Genome Browser appears to have the functionality that I am interested in. My guess is that I will not even need to worry about accessing your API. I have been able to upload a custom track into the Genome Browser, and I was also able to choose an NA# using the Table Browser with these settings:
Clade: Mammal
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)
Group: Variation
Track: Genome Variants
Table: YRI NA18507 (pgYoruban3)

If you might be able to provide any guidance, it would be appreciated.

{It would also be helpful to know if I might be able to access NA# with phased genomes that are not on your dropdown menu in Table Browser. Perhaps there is a database (NCBI?) that could be accessed and the genotypes could be included on the UCSC Genome Browser?}

Thank you.

Brian Lee

Mar 29, 2018, 6:31:55 PM

to J Ir, UCSC Genome Browser Mailing List

Dear J Ir,

Thank you for using the UCSC Genome Browser and for your question about extracting data from it.

You may also be interested in trying our Data Integrator tool, which is available under the top Tools menu.

You could upload a custom track of rsIDs in BED format, with positions like:
chr21 33031973 33031974 rs7277748

If you only had the rsIDs, you could even use other tools like the Table Browser to extract their coordinates to build your custom track (using the identifiers option).
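As a rough sketch of that last step, the Table Browser output could be turned into a custom track with a couple of shell commands. The file names `snp_coords.txt` and `my_snps.bed` are placeholders, and the input layout (tab-separated chrom, chromStart, chromEnd, name with a header line starting with "#") is an assumption about what you chose as output fields:

```shell
# Hypothetical Table Browser output (tab-separated, header line starts with '#').
printf '#chrom\tchromStart\tchromEnd\tname\n' > snp_coords.txt
printf 'chr21\t33031973\t33031974\trs7277748\n' >> snp_coords.txt

# Write a track line, then the four BED columns, skipping the header line.
{
  echo 'track name="my SNPS" description="rsIDs of interest"'
  awk -F'\t' -v OFS='\t' '!/^#/ && NF >= 4 { print $1, $2, $3, $4 }' snp_coords.txt
} > my_snps.bed

cat my_snps.bed
```

The resulting file can then be pasted or uploaded on the custom track page.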

With your custom locations you could go to the Data Integrator and add them as a track to pull data from other tracks of interest. See the final image at the bottom of the Data Integrator Help page to see how secondary tracks are extracted:
http://genome.ucsc.edu/goldenPath/help/hgIntegratorHelp.html

Below is a session link in which a small number of SNPs were added as a custom track, "my SNPS", along with four other Genome Variants tracks. In the "Output Options" section, I clicked "Choose fields" and requested only certain fields for each table: the original rsID from my custom track and the named alleles from the four other tracks. NOTE: the Data Integrator allows a maximum of 5 tracks in total at a time. Clicking "get output" gives you the rsIDs as one column and each pgNA## table as a further column, with data displayed where it is available. Empty columns make the visual alignment confusing, but you can send the output to a file.tsv and load it in a spreadsheet.
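If you would rather inspect the output on the command line than in a spreadsheet, one option is to fill the empty cells with a placeholder first. This is only a sketch: the file name `integrator.tsv`, the column names, and the rsIDs and alleles are made up for illustration, and the real output layout may differ. Note that the "." placeholder only marks "no value returned"; it says nothing about the genotype itself.

```shell
# Hypothetical Data Integrator output: tab-separated, header first,
# empty cells where a secondary track returned nothing.
printf '#ct_mySNPS.name\tpgNA12878.alleles\tpgNA19240.alleles\n' > integrator.tsv
printf 'rsExampleA\tG/A\t\n' >> integrator.tsv
printf 'rsExampleB\t\tC/T\n' >> integrator.tsv

# Replace every empty field with "." so the columns line up.
awk -F'\t' -v OFS='\t' '{ for (i = 1; i <= NF; i++) if ($i == "") $i = "."; print }' \
  integrator.tsv > integrator.filled.tsv

cat integrator.filled.tsv
```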

http://genome.ucsc.edu/cgi-bin/hgIntegrator?hgS_doOtherUser=submit&hgS_otherUserName=brianlee&hgS_otherUserSessionName=hg19_DataIntegrator

We also have other tools that might be of interest. The Variant Annotation Integrator, under the Tools menu as well, allows you to input rsIDs and extract data about them. Also you can use MySQL and do command-line MySQL queries on these tables: http://genome.ucsc.edu/goldenPath/help/mysql.html
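For the MySQL route, a minimal sketch might look like the following. It assumes a file `rsIds.txt` with one rsID per line (the two IDs here are just examples) and uses the hg19 snp150Common table; the connection settings are the ones given on the help page above, so please check that page if they do not work. The query is only sent when RUN_QUERY=1 is set, since it needs a mysql client and network access.

```shell
# Hypothetical input: one rsID per line.
printf 'rs1799967\nrs57205909\n' > rsIds.txt

# Build a quoted, comma-separated list for the SQL IN clause.
id_list=$(sed "s/.*/'&'/" rsIds.txt | paste -sd, -)
query="SELECT chrom, chromStart, chromEnd, name FROM snp150Common WHERE name IN (${id_list});"
echo "$query" > query.sql
cat query.sql

# Set RUN_QUERY=1 to actually send the query to the public server.
if [ "${RUN_QUERY:-0}" = "1" ]; then
  mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e "$query"
fi
```

The four output columns are already in BED order, so the result can be reused as a custom track.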

Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UC Santa Cruz Genomics Institute

Training videos & resources: http://genome.ucsc.edu/training/index.html
Want to share the Browser with colleagues?
Host a workshop: http://bit.ly/ucscTraining

Christopher Lee

Apr 2, 2018, 1:36:24 PM

to Brian Lee, J Ir, UCSC Genome Browser Mailing List

Thank you very much! I was able to accomplish what I wanted to
quickly. Providing the custom track made it much easier for me to know
what to do.

My concerns now are to:

1-find out how to upload other genomes into the Genome Browser
(suggestions for urls would be welcome).

2-find phased genotypes

3-impute genotypes?

4-determine whether the blank genotypes in the results from Data
Integrator are not available or homozygous reference

(Details below)

Problems:
I did become confused, though, when trying to format the rsnums in the
Table Browser.
For quite some time I was trying to adjust the settings using the
intersection tool and
trying to use my custom track. When I finally simply added the rsnums
into identifiers,
there was no problem.

I also had some trouble with the Data Integrator. When I added the
tracks for the genotypes of individuals
the ADD button disappeared. I was not sure how to add more genotypes
without the ADD button. I finally realized
that the ADD button would only reactivate once another individual was
added. Same problem with the custom
track. I added the genotypes into the custom track, but then I was not
sure why the custom track was not added.
When I finally realized that the custom track needs to be added again
with the pull down menu everything was fine.

Suggestion:

Creating a common graphical interface for tabular databases would be helpful.
Uploading any array of data from the internet
could be made effortless. There must be a large number of databases
that exist out there but accessing them through
APIs is not user friendly. The UCSC tools did make accessing the data
much easier.

As a suggestion to make it even easier, what you might consider is
having a spreadsheet that had row and column headers
with drop down menus which could include the various fields (such as
genotypes, NA#s etc.). It could be organized so
that it would be easy to see exactly what information was available.
Perhaps such an interface could be used as a
common GUI across all the databases.

Here are details for the points numbered above.

1- Can I pull additional genome resources from other databases into
the UCSC Genome Browser?
From what I can see Genome Browser has about 100 human genome results,
though I am not sure
whether they are full genomes. I think that it would be very helpful
to have at least a thousand
phased full genomes accessible to the Browser.

Ensembl has phased genotypes for 4000 individuals (1000 Genomes Project).
http://useast.ensembl.org/Homo_sapiens/Variation/Sample?db=core;r=1:18369141-18370141;v=rs12125819;vdb=variation;vf=7258054;sample=NA18518

I would love to be able to upload some of these into Genome Browser
and then extract the genotypes I want. It would be very helpful, if
you might
include an example of how this is done. Is it as simple as creating a
custom track and then including
the url from Ensembl? Might you know the web addresses for phased
genomes (perhaps NCBI or Ensembl)?

2- The individual genotypes that I extracted from Genome Browser
appear to be unphased. For example, this is a genotype
from my Genome Browser result: G/A (the slash mark means that it is unphased).
Here is a genotype from Ensembl: G|G (the pipe mark means that it is phased).
I did not see any phased genotypes in my results.
Does Genome Browser have phased genotypes for individuals?

3- I had a problem with the rsnums on Genome Browser. Many of them
were not included
on the genechip that was used. For some reason these same rsnums for
the same individuals were
available on Ensembl. To get around this I used Broad's proxy finder.
It would be a great upgrade
of your Browser if you included an option that would allow for
genotypes to be automatically
estimated by proxies. Many of the genotypes that I was interested in
had a perfect proxy.
Having the option to set the minimum rsquare for proxies would also be helpful.

4-When a genotype is left blank does this mean that the genotype was
not obtained or that
a homozygous reference was found?

Thank you again for helping me.

Christopher Lee

Apr 2, 2018, 7:05:50 PM

to J Ir, UCSC Genome Browser Mailing List

Hello J Ir,

Thank you for your questions about phased genotypes for certain individuals. You can import other data into the UCSC Genome Browser in the form of custom tracks or track hubs. For more information about these two ideas please see the following help pages:
Custom Tracks: https://genome.ucsc.edu/goldenPath/help/customTrack.html
Track Hubs: https://genome.ucsc.edu/goldenPath/help/hgTrackHubHelp.html

In the case of getting the 1000 Genomes data into the Genome Browser, we already have the 1000 Genomes Phase 3 data available as a native track in the Variation Group. Here is a session (https://genome.ucsc.edu/goldenPath/help/hgSessionHelp.html) of some variants from this dataset on chromosome 22:
https://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=chmalee&hgS_otherUserSessionName=hg19_1000GenomesNativeVsCustomTrack

This session displays data from a VCF custom track of just the chr22 variants (labeled as ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz), and data from the native version (labeled as 1000 Genomes Phase 3 Integrated Variant Calls: SNVs, Indels, SVs). In either case the data is the same; I just wanted to show that you can load a VCF file as a custom track if we don't have the data available natively. If you click on any of the variants shown, you will be directed to a details page that displays a multitude of information about that variant (all from the corresponding VCF file). If you click the plus next to "Detailed genotypes" to expand the section, you can see the phasing information for all the individuals you are interested in.

You can extract the information from this VCF file using the Table Browser. For instance, say you are interested in finding the phased genotypes of NA21144 and NA20911 for the rs1799967 and rs57205909 variants. What you could do is get the positions of those two variants and then grab all 1000 Genomes data corresponding to those two individuals. Please note that this is effectively the same approach used previously with the Data Integrator except substituting the 1000 Genomes dataset for the Genome Variants data and the Table Browser for Data Integrator:

1. Navigate to the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
2. Make the following selections:
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Variation
track: Common SNPs(150), or any dbSNP track of interest
table: snp150Common
3. Next to "identifiers", click "paste list", enter rs1799967 and rs57205909, and click submit.
4. Next to output format, make sure "selected fields from primary and related tables" is selected, and click "get output".
5. On the resulting page check the boxes for chrom, chromStart, and chromEnd and click "get output".
6. Copy the resulting coordinates, and then head back to the Table Browser.
7. This time instead of selecting the Common SNPs track, select the 1000G Ph3 Vars track.
8. Click the "define regions" button next to the position search, and paste the coordinates from step 6.
9. Choose "all fields from selected table" from the output format dropdown and then click "get output".

On the resulting page you will see the VCF header, and then the VCF lines corresponding to your variants of interest. The sample columns at the end of each line (roughly 2,500 of them) contain the phased genotypes for the individuals named in the header line that begins with "#CHROM POS ID...". You can then find the column number of your individual of interest and read the genotype (for example "0|1") in that column. This way you can obtain phasing information for multiple individuals in one go.
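Counting columns by eye is error-prone with this many samples, so here is a small sketch of looking the column numbers up from the "#CHROM" header line instead. The file `example.vcf` is a tiny made-up stand-in for the saved Table Browser output; only the sample names NA20911 and NA21144 come from the example above.

```shell
# Tiny stand-in for the saved output (real files have far more sample columns).
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA20911\tHG00096\tNA21144\n'
  printf '22\t100\trsExample1\tG\tA\t100\tPASS\t.\tGT\t0|1\t0|0\t1|1\n'
} > example.vcf

# Print "column<TAB>sample" for each sample of interest in the #CHROM line.
# Sample columns start at column 10 in a VCF.
awk -F'\t' '/^#CHROM/ {
  for (i = 10; i <= NF; i++)
    if ($i == "NA20911" || $i == "NA21144") print i "\t" $i
  exit
}' example.vcf > sample_columns.txt

cat sample_columns.txt
```

The numbers in the first column of the output are what you would pass to cut -f.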

An easier way of obtaining this information is to download the VCF data that you are interested in, and then make a file of rsIDs, and use the following commands to extract the genotypes of particular individuals:

$ zgrep -Fwf rsIds.txt NameOfVcf.gz > rsIds.vcf
$ cut -f 3,2463,2513 rsIds.vcf > genotypes.txt

Where the 3, 2463 and 2513 in the cut command represent the columns of the VCF you might be interested in, such as the ID column (3), and the NA20911 (2463) and NA21144 (2513) columns.
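If you would rather not hard-code column numbers, a variation is to select the columns by sample name. The sketch below is an illustration rather than a tested recipe for the full 1000 Genomes files: `example.vcf` and the rsExample IDs are made up, and the input must still contain the "#CHROM" header line (which the zgrep output above does not, since that line matches no rsID).

```shell
# Tiny stand-in VCF; sample names match the example above, variant IDs are made up.
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA20911\tHG00096\tNA21144\n'
  printf '22\t100\trsExample1\tG\tA\t100\tPASS\t.\tGT\t0|1\t0|0\t1|1\n'
  printf '22\t200\trsExample2\tC\tT\t100\tPASS\t.\tGT\t0|0\t1|1\t1|0\n'
} > example.vcf

samples="NA20911,NA21144"

# Find the wanted sample columns from the #CHROM line, then print ID plus those columns.
awk -F'\t' -v OFS='\t' -v want="$samples" '
  BEGIN { n = split(want, w, ","); for (k = 1; k <= n; k++) keep[w[k]] = 1 }
  /^##/ { next }
  /^#CHROM/ {
    m = 0
    for (i = 10; i <= NF; i++) if ($i in keep) col[++m] = i
    line = "ID"
    for (j = 1; j <= m; j++) line = line OFS $(col[j])
    print line
    next
  }
  {
    line = $3
    for (j = 1; j <= m; j++) line = line OFS $(col[j])
    print line
  }
' example.vcf > genotypes.txt

cat genotypes.txt

# A "/" separator would mark an unphased call; "|" marks a phased one.
unphased=$(tail -n +2 genotypes.txt | grep -c '/' || true)
echo "rows with unphased genotypes: $unphased"
```

For a real download you could pipe the decompressed VCF into the same awk program instead of reading example.vcf, and filter the rows by rsID afterwards.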

Unfortunately there is currently no way to do genotype imputation with the Genome Browser, although we have noted this as a feature request and will be sure to let you know if the feature gets added. Could you provide an example of an rsID that was not available at UCSC that was available at Ensembl? I'm not sure I'm understanding what you mean by this, or what you were trying to accomplish.

As to the blank genotype results from the Data Integrator, that only means that there were no items in the secondary tables intersecting the first, or if there were, the fields you selected had no values.

I hope I answered all of your questions. Please let me know if there was anything I missed or if you need further clarification.

Thanks,

Christopher Lee
UCSC Genomics Institute

Want to share the Browser with colleagues?
Host a workshop: http://bit.ly/ucscTraining

