methylKit provides a reference BED file for annotation in its extdata folder, which looks like the attached file (PFA). The fourth column of the file holds the feature name, a RefSeq NM accession. As I understand it, methylKit needs this column to annotate CpGs in exonic or intronic regions.
> gene.obj<-readTranscriptFeatures("refseq.hg18.bed.txt")
> diffAnn=annotateWithGeneParts(as(myDiff25p,"GRanges"),gene.obj)
> head(getAssociationWithTSS(diffAnn))
     target.row dist.to.feature feature.name feature.strand
60            1         4856565    NM_199260              -
60.1          2         4787656    NM_199260              -
60.2          3         4708577    NM_199260              -
60.3          4         4671224    NM_199260              -
60.4          5         4660082    NM_199260              -
60.5          6         3884768    NM_199260              -
> getTargetAnnotationStats(diffAnn,percentage=TRUE,precedence=TRUE)
  promoter       exon     intron intergenic
      1.69       1.84      29.87      66.59
I downloaded the reference file from UCSC.
Matthew Speir
UCSC Cell Browser, Quality Assurance and Data Wrangler
Human Cell Atlas, User Experience Researcher
UCSC Genome Browser, User Support
UC Santa Cruz Genomics Institute
Revealing life’s code.
--
---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAOmkqUCH7Nk91LrN63%3DEwTJyQFyAeZN_kcMB1LL_CPBUCFLAtA%40mail.gmail.com.
Hello Shrinka,
When you select output format: BED, the columns represent the standard BED columns defined by the format: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
In this case, it is a BED12 (3 required fields and all 9 optional fields). Since these are transcripts, the thickStart and thickEnd represent the start/stop codons, and the blockSizes and blockStarts are the exons.
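As an illustration of the BED12 layout described above, here is a minimal Python sketch that splits one BED12 line into its fields and turns blockSizes and blockStarts into absolute exon intervals. The example line and the name `NM_EXAMPLE` are made up for illustration, and `parse_bed12` is a hypothetical helper, not part of any UCSC or methylKit tool.

```python
from typing import List, NamedTuple, Tuple


class Bed12(NamedTuple):
    chrom: str
    chrom_start: int  # 0-based start
    chrom_end: int    # end, not included
    name: str         # 4th column, e.g. a RefSeq NM accession
    strand: str
    thick_start: int  # start of the coding region
    thick_end: int    # end of the coding region
    exons: List[Tuple[int, int]]  # absolute (start, end) per block


def parse_bed12(line: str) -> Bed12:
    """Parse one tab-separated BED12 line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated BED fields")
    chrom_start = int(fields[1])
    # blockSizes and blockStarts are comma-separated, often with a trailing comma.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    # blockStarts are relative to chromStart, so add it back.
    exons = [(chrom_start + s, chrom_start + s + size)
             for s, size in zip(starts, sizes)]
    return Bed12(fields[0], chrom_start, int(fields[2]), fields[3],
                 fields[5], int(fields[6]), int(fields[7]), exons)


# Made-up transcript with three exons, for illustration only.
example = ("chr1\t1000\t5000\tNM_EXAMPLE\t0\t+\t1200\t4800\t0\t3\t"
           "100,200,300,\t0,1500,3700,")
record = parse_bed12(example)
print(record.name, record.exons)
```

The regions between consecutive exon intervals are the introns, which is why an annotation tool needs the block columns to tell exonic from intronic CpGs.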
I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.
Lou Nassar
UCSC Genomics Institute

| chr | start | end | strand | pvalue | qvalue | meth.diff |
| chr3 | 17839 | 17839 | + | 1.30E-08 | 1.00E-07 | 32.7554577141809 |
| chr3 | 19841 | 19841 | + | 2.30E-12 | 2.63E-11 | -25.5849232744034 |
| chr3 | 21413 | 21413 | + | 2.63E-17 | 4.46E-16 | -35.064886911383 |
| chr3 | 22096 | 22096 | + | 2.25E-18 | 4.11E-17 | -29.4962402436703 |
| chr3 | 22097 | 22097 | + | 1.86E-09 | 1.58E-08 | -38.936764170409 |
| chr3 | 17839 | 17839 |
| chr3 | 19841 | 19841 |
| chr3 | 21413 | 21413 |
| chr3 | 22096 | 22096 |
| chr3 | 22097 | 22097 |
| chr3 | 24239 | 24239 |
| chr3 | 24240 | 24240 |
Hello Shrinka,
Thank you for using the Genome Browser and for your patience with our reply.
I could not reproduce the result you described, where chromosome 1 data is returned after entering chromosome 3 coordinates. The output file you sent contains a variety of HTML tags and artifacts, so it appears to be the Data Integrator page itself and not the output file. This may not be the page that you meant to share. The results should be a simple text file like this:
You will need to provide more information such as the output file and the desired format if the previous instructions do not give the result you expected. For example, I could not understand the "PFA" in your statement "...my output is like this PFA...". If you want additional fields beyond gene symbol such as gene ID, you can select that among the Data Integrator options.
If you are trying to integrate this type of query into an automated pipeline, you may be interested in our MySQL server or our REST API, both of which allow programmatic access to gene datasets:
http://genome.ucsc.edu/goldenPath/help/mysql.html
http://genome.ucsc.edu/goldenPath/help/api.html#getData_examples
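For the REST API route, a minimal Python sketch of building a `getData/track` request might look like the following. The genome `hg38`, the track name `cpgIslandExt`, and the chr21 region are assumptions chosen for illustration (the thread does not state which assembly or track is wanted), and `track_data_url` and `fetch_track_data` are hypothetical helper names. Check the API help page linked above for the parameters that apply to your data.

```python
import json
from urllib.request import urlopen

API_BASE = "https://api.genome.ucsc.edu"


def track_data_url(genome: str, track: str, chrom: str,
                   start: int, end: int) -> str:
    """Build a getData/track URL; parameters are joined with ';'
    in the style of the examples on the API help page."""
    params = [("genome", genome), ("track", track), ("chrom", chrom),
              ("start", start), ("end", end)]
    query = ";".join(f"{key}={value}" for key, value in params)
    return f"{API_BASE}/getData/track?{query}"


def fetch_track_data(url: str) -> dict:
    """Download and decode the JSON response (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


# Assumed genome, track, and region, for illustration only.
url = track_data_url("hg38", "cpgIslandExt", "chr21", 5020207, 5023177)
print(url)
# data = fetch_track_data(url)  # uncomment to run the query
```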
I hope this was helpful. If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.
All the best,
Daniel Schmelter
UCSC Genome Browser
Hello Shrinka,
Thank you for your question about CpG islands. We appreciate your patience with this delayed reply.
First off, the difference in coordinate positions you mentioned comes from a difference in data formats and their base-counting conventions. The data file you are using is almost certainly in BED format, which uses 0-based, half-open coordinates: the start is counted from 0 and the end position is not included. Other file formats, as well as our Genome Browser display, use 1-based coordinates. This often causes confusion but hopefully clears up once you are aware of the two conventions. We recommend you read our blog post on the topic:
http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/
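The two conventions can be sketched in a few lines of Python. The function names are hypothetical; the region is the CpG island discussed in this thread, where the BED row 5020207-5023177 corresponds to the 1-based range 5020208-5023177.

```python
from typing import Tuple


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open BED interval to a 1-based, closed one.

    Only the start shifts by one; the end value stays the same.
    """
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, closed interval back to BED coordinates."""
    return start - 1, end


bed_start, bed_end = 5020207, 5023177
first_base, last_base = bed_to_one_based(bed_start, bed_end)
print(f"chr21:{first_base}-{last_base}")

# The number of bases is the same under either convention.
assert bed_end - bed_start == last_base - first_base + 1
```

This explains why a file of 1-based CpG positions has no entry at 5020207: the first base of the island is 5020208 in that numbering.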
For the second question, how many real CpG occurrences there are in that particular region, the answer is a bit more complicated. The algorithm that we used to generate that track may be dated and may differ from the one you used to get the 284 number; such counts often depend on context. That region you provided, chr21:5020208-5023177, certainly returns a CpG count of 284 based on the cpg_lh program we use. It is possible that there is a bug somewhere in our process, but that will take time to resolve. We have made a note of your experience.
Would you mind sharing what process or research you are doing with the CpG files? This may help us understand.
All the best,
Daniel Schmelter
UCSC Genome Browser
Hello Dr Schmelter,
Could you please help me with one issue? Please find attached (PFA) the Excel file. The first sheet contains CpG island information from UCSC (chr21, human cell). The second sheet contains the CpG base positions from my file (chr21, (+) strand, human cell).
My question is:
In the first line of the UCSC file, there are 261 CpGs from base position 5020207 to 5023177:
chr21	5020207	5023177	CpG:_261
Now, if I look at my file:
First, base position 5020207 is not there; there is base position 5020208. In the same way, 5023177 is not there; there is base position 5023176.
Second, and more important, according to UCSC there are 261 CpGs in that region, but in my file there are 284.
Could you please comment on this? Is my query clear?
I really look forward to your reply.
Thanks and Regards
My goal is to calculate the median methylation values for each CpG island.
In the attached excel file, in the first sheet there is information from UCSC
Now, if I go to the second sheet for this particular CpG island from 5020207 to 5023177, which I will call CpG island 1, it has 261 (n) CGs, and each has a coverage value as well as a frequency of base C, that is, a methylation value. So I want to make a matrix like this:
| CpG island # | No. of CGs present | Median methylation value |
| 1 | 261 | X1 |
| 2 | 21 | X2 |
| 3 | 21 | X3 |
| … | … | … |
| 100 | … | X100 |
I want to make this for 100 CpG islands. For the same 100 CpG islands, I want to make a similar matrix for my sample set (suppose I have 10 samples). Do you have any comments or suggestions for that? If yes, please share them. Do you have any information regarding orphan CpG islands? I am eagerly waiting for your reply.
Regards,
Shrinka
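One way to build the matrix described above, sketched in Python under stated assumptions: the islands come from a BED file (0-based start, end not included), the CpG positions are 1-based, and each CpG carries one methylation value. The function name and the toy numbers are made up for illustration and are not taken from the attached spreadsheet.

```python
from statistics import median
from typing import List, Optional, Tuple

Island = Tuple[str, int, int]   # (chrom, BED start, BED end)
CpG = Tuple[str, int, float]    # (chrom, 1-based position, methylation %)


def median_methylation_per_island(
    islands: List[Island], cpgs: List[CpG]
) -> List[Tuple[int, int, Optional[float]]]:
    """Return (island number, CpG count, median methylation) per island."""
    rows = []
    for number, (chrom, start, end) in enumerate(islands, start=1):
        # A 1-based position lies inside a BED interval when start < pos <= end.
        values = [meth for c, pos, meth in cpgs
                  if c == chrom and start < pos <= end]
        rows.append((number, len(values), median(values) if values else None))
    return rows


# Toy data, made up for illustration.
islands = [("chr21", 100, 110), ("chr21", 200, 205)]
cpgs = [
    ("chr21", 101, 80.0), ("chr21", 105, 60.0), ("chr21", 110, 90.0),
    ("chr21", 111, 10.0),  # just outside island 1
    ("chr21", 203, 50.0), ("chr21", 204, 70.0),
]
rows = median_methylation_per_island(islands, cpgs)
for number, count, med in rows:
    print(number, count, med)
```

For several samples, the same function can be called once per sample on that sample's CpG table, giving one median column per sample. The nested scan is fine for 100 islands; for genome-wide data an interval index would be faster.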
Hello Shrinka,
Thank you for your question about CpG islands and finding the median methylation percentage.
As I understand, the data you want to produce will take some light scripting and for that, you can use Genome Browser utilities. Additional text formatting with "awk" or similar tools may be necessary to put it in your desired file format.
In order to calculate the median percentage of cytosines in particular regions, you can use two utilities. First, "twoBitToFa" produces a file with an individual FASTA sequence for each region in the list you already have. Once you have that FASTA file, you will need to count the number of C's and then divide by the sequence length to get a percentage. There are many ways to do this, but our utility "faCount" will do the counting step and return a CpG count as well. The following is an example:
http://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads
You will then have to calculate the cytosine percentage, which can be done with awk; there are online resources that explain how to divide one column by another.
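The counting and division steps can also be sketched in plain Python. This is only an illustration of the arithmetic, not a replacement for or a description of the faCount utility; `parse_fasta` and `cytosine_stats` are hypothetical helpers and the sequence is made up.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Read FASTA text into a {record name: sequence} dictionary."""
    parts: Dict[str, list] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def cytosine_stats(sequence: str) -> Dict[str, float]:
    """Count C bases and CG dinucleotides, and the percentage of C."""
    seq = sequence.upper()
    length = len(seq)
    c_count = seq.count("C")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself
    percent_c = 100.0 * c_count / length if length else 0.0
    return {"length": length, "C": c_count, "CpG": cpg_count,
            "percent_C": percent_c}


# Made-up sequence split over two lines, as in a FASTA file.
fasta_text = ">island1\nACGCGT\nCCGG\n"
stats = {name: cytosine_stats(seq)
         for name, seq in parse_fasta(fasta_text).items()}
print(stats["island1"])
```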
As far as Orphan CpGs, I do not know much about them so I won't comment at this time.
Thanks for sharing about your work! I hope this was helpful.
All the best,
Daniel Schmelter
UCSC Genome Browser
Hello Shrinka,
It looks like you accomplished your objective regarding calculating the median Cytosine percentage for each CpG island region. You may want to include the start AND end position of your CpG island for repeatability.
As far as orphan CpG islands go, I do not have any specific resources or expertise to share. The Genome Browser has no datasets with that term. We are happy to answer questions about using our tools if you do find a dataset you are interested in. We wish you the best with your research into that topic!
All the best,
Daniel Schmelter
UCSC Genome Browser