Annotation Related

101 views
Skip to first unread message

Shrinka Sen

unread,
Jun 26, 2020, 12:32:23 PM6/26/20
to gen...@soe.ucsc.edu
Hello

              I am working with WGBS Data. I am using MethyKit package.


MethyKit has provided a reference bed file for annotation in the extdata folder, which looks like that (PFA).  In the 4th column of the file is the feature NM. As per my understanding, MethyKit is functioning in such a way, that it needs that for annotating CpGs in Exonic or Intronic regions.

> gene.obj<-readTranscriptFeatures("refseq.hg18.bed.txt")

diffAnn=annotateWithGeneParts(as(myDiff25p,"Granges"),gene.obj)

> head(getAssociationWithTSS(diffAnn))

target.row dist.to.feature feature.name feature.strand

60             1          4856565     NM_199260               -

60.1           2         4787656     NM_199260               -

60.2           3          4708577     NM_199260               -

60.3           4          4671224     NM_199260               -

60.4           5          4660082     NM_199260               -

60.5           6          3884768     NM_199260               -


> getTargetAnnotationStats(diffAnn,percentage=TRUE,precedence=TRUE)

promoter exon intron intergenic

1.69        1.84       29.87       66.59


I downloaded reference file from UCSC


  1. Go to UCSC browser -> Select "Tools" -> Select "Table Browser".
  2. Enter "genome/assembly" you're interested in.
  3. Enter "region" "position" chr x (as I divided all of my files chromosome wise and then doing the analysis one by one)
  4. Click "get output".
The output is like that (PFA) (a small part only), which doesn't have any feature like NM

Now my question is how can I change the UCSC bed file, to MethylKit compatible bed file 

Regards
Shrinka Sen
Post Doctoral Fellow
BC Cancer Research
Vancouver, Canada

                                                                   
refseq.hg18.bed.txt-example file of MethyKit.ods
refseq file from UCSC-Chr X.odt

Matthew Speir

unread,
Jun 29, 2020, 6:33:20 PM6/29/20
to Shrinka Sen, gen...@soe.ucsc.edu
Hello, Shrinka.

Thank you for your question about downloading a BED file from the UCSC Genome Browser.

The UCSC Genome Browser offers a number of different gene annotation tracks in addition to RefSeq (which includes transcripts IDs with the NM prefix). The BED file that you downloaded from the Table Browser appears to have a different gene annotation track selected for output such as GENCODE v32 (which includes ENST prefixed IDs).

To get a file from the Table Browser that matches the format of the MethyKit example you provided, you can use the following steps:

1. Navigate to the Table Browser, .
2. Select the clade/genome/assembly you are interested in.
3. Make the following selections to get output for the "UCSC RefSeq" track (this is UCSC's re-alignment of RefSeq RNAs, more details about this track and the other RefSeq tracks we host can be found on the description page):
  • group: Genes and Gene Predictions 
  • track: NCBI RefSeq
  • table: UCSC RefSeq
  • region: chrX
  • output format: BED - browser extensible data
  • output file: enter a name or leave blank to view results in your web browser
4. Click "get output"
5. Under "Create one BED record per", select "Whole Gene".
6. Click "get BED"

Your output should look like so:
chrX 276355 303355 NM_001370370 0 + 284187 299335 0 8 39,203,148,137,129,156,184,4259, 0,5126,7811,12377,14292,15143,16679,22741,
chrX 276355 303355 NM_001370372 0 + 284187 299335 0 7 39,203,148,137,129,184,4259, 0,5126,7811,12377,14292,16679,22741,
chrX 276355 303355 NM_001370371 0 + 284187 299335 0 7 39,148,137,129,156,184,4259, 0,7811,12377,14292,15143,16679,22741,

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Training videos & resources: http://genome.ucsc.edu/training/index.html

Want to share the Browser with colleagues? Host a workshop: http://bit.ly/ucscTraining

---

Matthew Speir

UCSC Cell Browser, Quality Assurance and Data Wrangler

Human Cell Atlas, User Experience Researcher

UCSC Genome Browser, User Support

UC Santa Cruz Genomics Institute

Revealing life’s code.



--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAOmkqUCH7Nk91LrN63%3DEwTJyQFyAeZN_kcMB1LL_CPBUCFLAtA%40mail.gmail.com.

Shrinka Sen

unread,
Jun 30, 2020, 11:58:39 AM6/30/20
to gen...@soe.ucsc.edu
Thanks a lot for your reply. I just want to know how can I get the header of the output file, those I downloaded following your instructions.

chrX 276355 303355 NM_001370370 0 + 284187 299335 0 8 39,203,148,137,129,156,184,4259, 0,5126,7811,12377,14292,15143,16679,22741,
chrX 276355 303355 NM_001370372 0 + 284187 299335 0 7 39,203,148,137,129,184,4259, 0,5126,7811,12377,14292,16679,22741,
chrX 276355 303355 NM_001370371 0 + 284187 299335 0 7 39,148,137,129,156,184,4259, 0,7811,12377,14292,15143,16679,22741,

Luis Nassar

unread,
Jul 2, 2020, 1:47:38 PM7/2/20
to Shrinka Sen, UCSC Genome Browser Discussion List

Hello Shrinka,

When you select output format: BED, the columns represent the standard BED columns defined by the format: https://genome.ucsc.edu/FAQ/FAQformat.html#format1

In this case, it is a BED12 (3 required fields and all 9 optional fields). Since these are transcripts, the thickStart and thickEnd represent the start/stop codons, and the blockSizes and blockStarts are the exons.

I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.

Lou Nassar
UCSC Genomics Institute


--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.

Shrinka Sen

unread,
Jul 3, 2020, 10:48:12 AM7/3/20
to gen...@soe.ucsc.edu
Hello

                Whenever I am putting  Genes and Gene Prediction in group, I am getting these options 

image.png

How it is possible to get these options like

group- Genes and Gene Prediction Tracks" 
track- UCSC Genes
table- knownGene

Shrinka Sen

unread,
Jul 6, 2020, 10:38:59 AM7/6/20
to gen...@soe.ucsc.edu
Hello

I have a list of differentially methylated CpGs. I just want to upload or paste these coordinates to get gene name corresponding to these. I just have coordinates of these, not any identifier. 

Is there any direct method, so that I will get gene names as output when I put coordinate as input..

My file looks like this

chr start end strand pvalue qvalue meth.diff
chr3 17839 17839 + 1.30E-08 1.00E-07 32.7554577141809
chr3 19841 19841 + 2.30E-12 2.63E-11 -25.5849232744034
chr3 21413 21413 + 2.63E-17 4.46E-16 -35.064886911383
chr3 22096 22096 + 2.25E-18 4.11E-17 -29.4962402436703
chr3 22097 22097 + 1.86E-09 1.58E-08 -38.936764170409


Matthew Speir

unread,
Jul 8, 2020, 3:36:35 PM7/8/20
to Shrinka Sen, gen...@soe.ucsc.edu
Hello, Shrinka. 

Thank you for your question about obtaining gene symbols associated with a list of regions. 

If you trim your file to only include the first three columns (chrom, start, end) and upload this as a custom track, you could use the Data Integrator to get the gene symbols that overlap with those regions.

You can trim the columns using the UNIX command-line utility "awk". If these items are also supposed to represent single base items, you will also need to subtract one from the start coordinates (see our blog post: http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/). As written, your items span 0 bases. This command will trim your file to only the first three columns and subtract one from the start position:
awk '{print $1,$2-1,$3}' myMethylationSites.txt

Your output should look like:
chr3 17838 17839
chr3 19840 19841

If you don't need to subtract one from the start, you can use the command:
awk '{print $1,$2,$3}' myMethylationSites.txt

Again, your output should look like:
chr3 17839 17839
chr3 19841 19841

If you do not have access to a UNIX command line (e.g. Linux or Mac OSX) where you can run these commands, it should be possible to trim down the columns in Microsoft Excel, Google Sheets, or Mac OSX Numbers. 

Next, you will need to upload these as custom tracks, http://genome.ucsc.edu/cgi-bin/hgCustom. After uploading your custom tracks, navigate to the Data Integrator, http://genome.ucsc.edu/cgi-bin/hgIntegrator. From there use these steps to get the gene symbols that overlap with your regions:

1. Under "Select Genome Assembly and Region", set "region to annotate" to "genome" 
2. Under "Add Data Source", select your custom track:
  • track group: Custom Tracks
  • track: Your Track Name
3. Click "Add"
4. Under "Add Data Source", select GENCODE Genes v32:
  • track group: Genes and Gene Predictions
  • track: GENCODE v32 (knownGene)
5. Click "Add"
6. Click "Choose fields"
7. Under the "GENCODE v32 (knownGene)" section, select "hg38.kgXref" from the dropdown under "Related tables" and click "Add table"
8. Under "GENCODE v32 (knownGene)", click "Clear all"
9. Under "hg38.kgXref (Link together a Known Gene ID and a gene alias)", check the box next to "geneSymbol"
10. Click "Done"
11. Click "Get output"

Your output will look something like:
# hgIntegrator: database=hg38 region=genome Tue Jul  7 14:43:50 2020
#ct_UserTrack_3545.chrom ct_UserTrack_3545.chromStart ct_UserTrack_3545.chromEnd knownGene.kgXref_geneSymbol
chr3 17839 17839 LINC01986
chr3 17839 17839 LINC01986
chr3 17839 17839 LINC01986
chr3 17839 17839 LINC01986
chr3 17839 17839 LINC01986

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Training videos & resources: http://genome.ucsc.edu/training/index.html

Want to share the Browser with colleagues? Host a workshop: http://bit.ly/ucscTraining

---

Matthew Speir

UCSC Cell Browser, Quality Assurance and Data Wrangler

Human Cell Atlas, User Experience Researcher

UCSC Genome Browser, User Support

UC Santa Cruz Genomics Institute

Revealing life’s code.


--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.

Matthew Speir

unread,
Jul 8, 2020, 3:36:43 PM7/8/20
to Shrinka Sen, gen...@soe.ucsc.edu
Hello, Shrinka. 

Thank you for your question about the Table Browser. 

We do not produce a "UCSC Genes" track for hg38 anymore. We have retired the pipeline and the knownGene table is now based on gene annotations from the GENCODE group, https://www.gencodegenes.org/. You can access the "knownGene" table by selecting "GENCODE v32" from the dropdown menu next to "track".

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Training videos & resources: http://genome.ucsc.edu/training/index.html

Want to share the Browser with colleagues? Host a workshop: http://bit.ly/ucscTraining

---

Matthew Speir

UCSC Cell Browser, Quality Assurance and Data Wrangler

Human Cell Atlas, User Experience Researcher

UCSC Genome Browser, User Support

UC Santa Cruz Genomics Institute

Revealing life’s code.


--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.

Shrinka Sen

unread,
Jul 13, 2020, 12:12:36 PM7/13/20
to Matthew Speir, gen...@soe.ucsc.edu
Thanks for your reply

I have given input like this

chr3 17839 17839
chr3 19841 19841
chr3 21413 21413
chr3 22096 22096
chr3 22097 22097
chr3 24239 24239
chr3 24240 24240

and followed your instruction , but me output is like this PFA, though in input file is only from chromosome 3, why in output file  chr1  is coming? I am clueless where I am doing the mistake?

Please help

Regards

Shrinka Sen
Post Doctoral Fellow
BC Cancer Research
Vancouver, Canada
UCSC OUTPUT-3.txt

Daniel Schmelter

unread,
Jul 21, 2020, 8:00:35 PM7/21/20
to Shrinka Sen, Matthew Speir, UCSC Genome Browser Discussion List

Hello Shrinka,

Thank you for using the Genome Browser and for your patience with our reply.

I could not replicate your results with seeing chromosome 1 data returned after putting in chromosome 3 coordinates. The output file you sent has a variety of HTML tags and artifacts such that it appears to be the Data Integrator page itself and not the output file. This may not be the page that you meant to share. The results should be a simple text file like this:

# hgIntegrator: database=hg38 region=genome Tue Jul 21 16:08:13 2020
#ct_UserTrack_3545.chrom    ct_UserTrack_3545.chromStart    ct_UserTrack_3545.chromEnd    knownGene.kgXref_geneSymbol
chr3    17839    17839    LINC01986
chr3    19841    19841    LINC01986
...

You will need to provide more information such as the output file and the desired format if the previous instructions do not give the result you expected. For example, I could not understand the "PFA" in your statement "...my output is like this PFA...". If you want additional fields beyond gene symbol such as gene ID, you can select that among the Data Integrator options.

If you are trying to integrate this type of query into an automatic pipeline, you may be interested in our MySQL server or our RestAPI, which both allow programmatic access to gene datasets:

http://genome.ucsc.edu/goldenPath/help/mysql.html
http://genome.ucsc.edu/goldenPath/help/api.html#getData_examples

I hope this was helpful. If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.
All the best,

Daniel Schmelter
UCSC Genome Browser


Daniel Schmelter

unread,
Dec 22, 2020, 6:37:12 PM12/22/20
to Shrinka Sen, UCSC Genome Browser Discussion List

Hello Shrinka,

Thank you for your question about CpG islands. We appreciate your patience with this delayed reply.

First off, the difference in coordinate positions you mentioned is an unfortunate difference in data formats and base conventions. The data file you are using is almost certainly in BED format, which all follow the 0-base counting convention. Other file formats, including our Genome Browser visualization, use the 1-base format. This is something that often causes confusion but hopefully clears up once you're aware of the two conventions. We recommend you read our blog post on the topic:

http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/

For the second question, how many real CpG occurrences there are in that particular region, that answer is a bit more complicated. The algorithm that we used to generate that track might be dated and different than what you are using to get the 284 number, often based on context. That region you provided, chr21:5020208-5023177, certainly returns a CpG count of 284 based on the cpg_lh program we use. It is possible that there is a bug in our process somewhere, but that will take time to resolve. We have made a note of your experience.

Would you mind sharing what process or research you are doing with the CpG files? This may help us understand.

I hope this was helpful. If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.

All the best,

Daniel Schmelter
UCSC Genome Browser


On Tue, Dec 15, 2020 at 8:53 PM Shrinka Sen <shrinka....@gmail.com> wrote:
Hello Dr Schmelter

                    Could you please help me in one aspect. PFA the excel file. In the 1st sheet, CpG island information from UCSC is there (Chr21, Human Cell). In the 2nd sheet, information of CpG base position of my file is there (Chr21, (+) strand, Human Cell) is there

 

My question is

 

If you see the 1st line of the UCSC file, there are 261 CpGs starting from base position 5020207 to 5023177

 

chr21

5020207

5023177

CpG:_261

 

Now If I see my file

 

First of all 5020207 is not there. There is base position 5020208. In the same way 5023177 is not there. There is base position 5023176

Second and more important point is, in that region according to UCSC there are 261 CpGs, but in my file there are 284

 

Could you please comment on this? Am I clear with my query?


I really look  forward to your reply.


Thanks and Regards

Shrinka Sen

unread,
Dec 28, 2020, 11:35:06 AM12/28/20
to Daniel Schmelter, gen...@soe.ucsc.edu, genom...@soe.ucsc.edu
Thanks Daniel
                        Both of your answers are clear to me. I understand why there is a discrepancy of 1 bp and also understand that CpG nos are different.

                        Coming to your question  "Would you mind sharing what process or research you are doing with the CpG files?", I will be happy to do that, as it will help me to solve my problem. 

My goal is to calculate the median methylation values for each CpG island.

 

In the attached excel file, in the first sheet there is information from UCSC

 

Chr

Start

End

No

chr21

5020207

5023177

CpG:_261

Now if I go to second sheet for this particular CpG island from 5050207- 5023177, suppose I call it CpG island 1, it has 261 (n) no of CGs and each had a coverage value as well as a value of frequency of base C, that is methylation value. So I want to make a matrix like

 

# CpG Island     # No of CGs present there     Median values of Methylation

        1                                      261                                                X1

        2                                       21                                                 X2

        3                                       21                                                 X3

  …..100                                                                                 ……X100

 

This, I want to make for 100 CpG islands. For this same 100 CpG islands, I want to make a similar matrix for my sample set (suppose I have 10 samples).


Do you have any comments or suggestions for that? If yes, please enrich me.


Do you have information regarding Orphan CpG Islands?


I am eagerly waiting for your reply


Regards

Shrinka


Daniel Schmelter

unread,
Jan 6, 2021, 7:30:06 PM1/6/21
to Shrinka Sen, UCSC Genome Browser Discussion List, genome-www

Hello Shrinka,

Thank you for your question about CpG islands and finding the median methylation percentage.

As I understand, the data you want to produce will take some light scripting and for that, you can use Genome Browser utilities. Additional text formatting with "awk" or similar tools may be necessary to put it in your desired file format.

In order to calculate the median percentage of Cytosines in particular regions, you can use the two utilities "twoBitToFa" to obtain a file with each individual fasta sequence from a list of the regions you already have. Once you have that fasta file, you will need to count the number of C's and then divide by length to get a percentage. There are many options on how to do this, but our utility "faCount" will do the first step and return a CpG count as well. The following is an example:

faCount cpgRegions.fa 
#seq        len    A        C        G    T    N    cpg
CpG_1:    2970    358    1133    1031    448    0    284
...

These utilities can be downloaded in our utilities directory and need permissions to be granted with "chmod +x faCount twoBitToFa" to run properly. For usage instructions, run the commands without arguments.

http://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads

You will then have to calculate cytosine percentage and that can be done with the awk command, which has online resources to support division by columns.

As far as Orphan CpGs, I do not know much about them so I won't comment at this time.

Thanks for sharing about your work! I hope this was helpful.

All the best,

Daniel Schmelter
UCSC Genome Browser

If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.

Shrinka Sen

unread,
Jan 11, 2021, 11:59:14 AM1/11/21
to Daniel Schmelter, UCSC Genome Browser Discussion List, genome-www
Hello
             Thanks for your reply. I will try the method you mentioned. I have not used that method. 

I initially took 20 CpG island. I have 8 samples for which WGBS data is there. I calculated %of Methylation for each CpG position, from those files using codes in R. My files look like below

Chr     base strand coverage freqC freqT
1 chr1 10469 + 2 100 0
2 chr1 10470 - 124 72.5806451612903 27.4193548387097
3 chr1 10471 + 3 66.6666666666667 33.3333333333333
4 chr1 10472 - 125 80 20
5 chr1 10484 + 5 80 20
6 chr1 10485 - 135 89.6296296296296 10.3703703703704
7 chr1 10489 + 5 100 0
8 chr1 10490 - 131 95.4198473282443 4.58015267175572
9 chr1 10493 + 4 100 0
10 chr1 10494 - 134 79.1044776119403 20.8955223880597

I am attaching one excel where for each individual (8) and for each CpG island (20), I calculated median % methylation value

Do you know whether UCSC has any information about Orphan CpG, or you know somebody who can help me a little bit in this regard. It will be very helpful for me.

Thanks
Shrinka
CpG Island Related Analysis-01.01.2021.xls

Daniel Schmelter

unread,
Jan 13, 2021, 6:22:07 PM1/13/21
to Shrinka Sen, UCSC Genome Browser Discussion List, genome-www

Hello Shrinka,

It looks like you accomplished your objective regarding calculating the median Cytosine percentage for each CpG island region. You may want to include the start AND end position of your CpG island for repeatability.

As far as orphan CpG islands go, I do not have any specific resources or expertise to share. The Genome Browser has no datasets with that term. We are happy to answer questions about using our tools if you do find a dataset you are interested in. We wish you the best with your research into that topic!

If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.

All the best,

Daniel Schmelter
UCSC Genome Browser

Reply all
Reply to author
Forward
0 new messages