CpG location in exons

Marcin Wojewodzic

Apr 26, 2017, 11:14:49 AM
to gen...@soe.ucsc.edu
Hello, 

Is there any way I could get from UCSC the positions of the individual CpG sites (human) that lie within CpG islands, per exon? I see that the CpG Islands track reports the total number of CpGs inside each island, but not their locations.

I would like to get the distribution of CpG sites from CpG islands per exon for the human genome, and then compare it with the distribution from my RRBS data set. I managed to do that on my own data, as I have the locations of the individual sites mapped to hg38 and I searched them against the ranges of the CpG islands.

I have tried many approaches, thinking that the intersection tool in the Table Browser might help, but with no success.

Thank you for any advice!

Marcin


__________________________
Marcin Wojewodzic, PhD

Epigenetic Group 
Department of Molecular Oncology
Institute for Cancer Research
Norwegian Radium Hospital
Oslo University Hospital (OUS)


Christopher Lee

May 1, 2017, 6:30:07 PM
to Marcin Wojewodzic, gen...@soe.ucsc.edu

Hi Marcin,

Thank you for your question about obtaining CpG Sites overlapping exons. Is it possible for you to expand upon your question a little further? Do you want the coordinates of all CG dinucleotides that overlap exons? Or do you just want the coordinates of the items in the CpG Islands track that have any overlap with exons? If you want the coordinates of the individual dinucleotides, then it will take some custom scripting on your part as we do not store these positions anywhere. However, grabbing the coordinates of where the CpG Islands track overlaps with exons from a gene track can be accomplished with our Table Browser tool: http://genome.ucsc.edu/cgi-bin/hgTables.

To obtain this information, follow the steps below:
1. Navigate to the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
2. Choose "Mammal", "Human", "Dec. 2013 (GRCh38/hg38)" from the "clade", "genome", and "assembly" dropdowns.
3. Now make the following selections:
group: Genes and Gene Predictions
track: Gene Track of Interest
table: Table of interest
region: genome
output format: custom track
4. Click "get output"
5. On the Output page, enter an informative name in the name and description fields, something like "geneTrack Exons", and then select "Exons plus" from the "Create one BED record per" section. Lastly, click "get custom track in table browser".
6. From the resulting Table Browser page, hover over the Tools section of the top blue menu bar, and click "Data Integrator".
7. From the Data Integrator page, make sure the hg38 assembly is selected, then change "region to annotate" to "genome".
8. Now, in the "Add Data Source" section, select the custom track of exons we just created by selecting "Custom Tracks" and "geneTrack Exons" from the "track group" and "track" dropdowns, then click "Add".
9. Select the CpG Islands track by selecting "Regulation" and "CpG Islands (cpgIslandExt)" from the "track group" and "track" dropdowns, then click "Add".
10. Now we are ready to obtain the overlapping items. In the "Output Options" section, click "choose fields". You will likely want to deselect all the fields from the custom track and only leave the fields from the CpG Islands track, but which fields you choose is up to you.
11. When you are done selecting fields, click "Done" and then click "Get output".
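
For readers who would rather script the overlap than click through the Data Integrator, the idea behind steps 6-11 can be sketched in a few lines of Python. This is only an illustration, not how the Data Integrator is implemented: the function name `islands_overlapping_exons` is hypothetical, the intervals are assumed to be BED-style (0-based, half-open), and the exon coordinates and the third island in the example are made up.

```python
from collections import defaultdict


def islands_overlapping_exons(islands, exons):
    """Return the islands that overlap at least one exon.

    Both arguments are iterables of (chrom, start, end) tuples in
    BED-style coordinates (0-based start, exclusive end).
    """
    # Group exon intervals by chromosome so each island is only
    # compared against exons on the same chromosome.
    exons_by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        exons_by_chrom[chrom].append((start, end))

    hits = []
    for chrom, i_start, i_end in islands:
        # Two half-open intervals overlap when each one starts
        # before the other one ends.
        if any(e_start < i_end and i_start < e_end
               for e_start, e_end in exons_by_chrom.get(chrom, [])):
            hits.append((chrom, i_start, i_end))
    return hits


# Illustrative input: the first two islands are taken from the chr9
# example in this thread; the third island and both exons are invented.
islands = [
    ("chr9", 133255614, 133256444),
    ("chr9", 133274615, 133275949),
    ("chr9", 133300000, 133300500),
]
exons = [
    ("chr9", 133255000, 133256356),
    ("chr9", 133275000, 133275100),
]

for chrom, start, end in islands_overlapping_exons(islands, exons):
    print(chrom, start, end, sep="\t")
```

The nested scan is quadratic in the worst case; for genome-wide input a sorted sweep or an interval tree would be the more usual choice, but the overlap test itself stays the same.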

After step 11, the Data Integrator returns output in the following format, depending on the fields you chose. Here I limited the query to a small region of chr9 that contains two islands that overlap exons and two that do not; your results will vary:

# hgIntegrator: database=hg38 region=chr9:133223138-133309723 Mon May  1 10:48:43 2017
#cpgIslandExt.chrom    cpgIslandExt.chromStart    cpgIslandExt.chromEnd    cpgIslandExt.name    cpgIslandExt.length    cpgIslandExt.cpgNum    cpgIslandExt.gcNum    cpgIslandExt.perCpg    cpgIslandExt.perGc    cpgIslandExt.obsExp
    chr9    133255614    133256444    CpG: 63    830    63    535    15.2    64.5    0.73
    chr9    133255614    133256444    CpG: 63    830    63    535    15.2    64.5    0.73
    chr9    133255614    133256444    CpG: 63    830    63    535    15.2    64.5    0.73
    chr9    133255614    133256444    CpG: 63    830    63    535    15.2    64.5    0.73

    chr9    133274615    133275949    CpG: 154    1334    154    967    23.1    72.5    0.89

The results will look a little odd. The Data Integrator outputs every row from the primary table, regardless of whether there is overlap in the secondary tables, so by limiting the output to fields from the secondary table we have effectively suppressed the primary table output. These data are all the CpG Islands that overlap exons from the knownGene track. Because the knownGene track contains multiple transcripts per gene, the same item from the CpG Islands track can be repeated multiple times. You can save these results to a file (cpgPerExon.txt below) and then use uniq or a similar command to filter out the identical lines (note that uniq only collapses adjacent duplicates):
$ perl -pe 's/^[ \t]*//' cpgPerExon.txt | uniq
chr9    133255614    133256444    CpG: 63    830    63    535    15.2    64.5    0.73

chr9    133274615    133275949    CpG: 154    1334    154    967    23.1    72.5    0.89
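
If repeated rows are not guaranteed to be consecutive, a small script can do the same filtering while remembering every row it has already seen. The following is a minimal sketch, not part of the UCSC tools; the function name `unique_rows` is hypothetical, and unlike the one-liner above it also drops blank lines and the `#` header lines.

```python
def unique_rows(lines):
    """Yield each data row once, in first-seen order.

    Leading and trailing whitespace is stripped; blank lines and
    '#' comment/header lines are skipped.
    """
    seen = set()
    for line in lines:
        row = line.strip()
        if not row or row.startswith("#"):
            continue
        if row not in seen:
            seen.add(row)
            yield row


# Rows copied from the example output in this thread (tab-separated here).
sample = [
    "#cpgIslandExt.chrom\tcpgIslandExt.chromStart\tcpgIslandExt.chromEnd",
    "    chr9\t133255614\t133256444\tCpG: 63",
    "    chr9\t133255614\t133256444\tCpG: 63",
    "",
    "    chr9\t133274615\t133275949\tCpG: 154",
    "    chr9\t133255614\t133256444\tCpG: 63",
]

for row in unique_rows(sample):
    print(row)

# To filter a saved file instead, pass an open file handle, e.g.:
#   with open("cpgPerExon.txt") as handle:
#       for row in unique_rows(handle):
#           print(row)
```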

We do not store the positions of the CG dinucleotides within each CpG Island anywhere, so you will have to download the sequences of the regions where these tracks overlap and then write a script to extract the positions from them. Such scripting is outside the scope of this mailing list, but you can use the Table Browser to extract the sequences of the overlapping positions:

1. Head to the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables (or Tools->Table Browser).
2. Make sure the hg38 assembly is selected.
3. Make the following selections:
group: Regulation
track: CpG Islands
table: cpgIslandExt
region: genome
output format: sequence
4. Now intersect the CpG Islands track with our exon-only custom track. We could not do this in the previous example because the Table Browser intersection discards fields of interest from our tables and outputs only position ranges, but since all we want now is sequence, that does not matter.
- Next to intersection click "create".
- Select your exons only custom track from the group, track, and table dropdowns, and then select the bullet for "Base-pair-wise intersection (AND) of CpG Islands and exonsOnlyCustomTrack"
- Click submit
5. Click "get output"
6. On the sequence retrieval page, select any formatting you would like and click "get sequence"

You will now have a FASTA format file like the following:

>hg38_cpgIslandExt_chr9.1 range=chr9:133255615-133256356 5'pad=0 3'pad=0 strand=+ repeatMasking=none
CGGGAGGGGGACGGGGCTGCCGGCAGCCCTCCCAGAGCCCCTGGCAGCCG
CTCACGGGTTCCGGACCGCCTGGTGGTTCTTGGGCACCGCAGTGAACCTC
AGCTTCCTCAGGACGGCGGGCCAGCCCAGCAGCTGCTGGTCCCACAAGTA
CTCGGGGGAGAGCACCTTGGTGGGTTTGTGGCGCAGCAGGTACTTGTTCA
GGTGGCTCTCGTCGTGCCACACGGCCTCGATGCCGTTGGCCTGGTCGACC
ATCATGGCCTGGTGGCAGGCCCTGGTGAGCCGCTGCACCTCTTGCACCGA
...
...

Each sequence is the genomic sequence of a region where the CpG Islands track overlaps exons of your chosen track. Unlike the previous option, this one gives only the sequence of the overlapping region and nothing else.
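
As a starting point for the custom scripting mentioned above, here is a rough Python sketch that reads such a FASTA file and reports the genomic position of the C in every CG dinucleotide. It rests on a few assumptions that you should check against your own download: the `range=` field is taken to be 1-based and inclusive (as the example header suggests when compared with the island's chromStart), the records are on the + strand with no padding, and the helper names `parse_fasta` and `cpg_positions` are hypothetical. A CG that straddles the edge of an overlapping region (C as the last base) will not be found, because the G is not part of the downloaded sequence.

```python
import re

# Matches the "range=chrom:start-end" field of a Table Browser FASTA header.
RANGE_RE = re.compile(r"range=([^:\s]+):(\d+)-(\d+)")


def parse_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def cpg_positions(header, sequence):
    """Return (chrom, positions) where each position is the 1-based
    genomic coordinate of the C in a CG dinucleotide."""
    match = RANGE_RE.search(header)
    if match is None:
        raise ValueError(f"no range= field in header: {header!r}")
    chrom, start = match.group(1), int(match.group(2))
    # Upper-case first so soft-masked (lower-case) bases are counted too.
    offsets = (m.start() for m in re.finditer("CG", sequence.upper()))
    return chrom, [start + offset for offset in offsets]


# Header and first sequence line copied from the example above; the real
# record is longer, so this covers only its first 50 bases.
example = [
    ">hg38_cpgIslandExt_chr9.1 range=chr9:133255615-133256356 "
    "5'pad=0 3'pad=0 strand=+ repeatMasking=none",
    "CGGGAGGGGGACGGGGCTGCCGGCAGCCCTCCCAGAGCCCCTGGCAGCCG",
]

for header, sequence in parse_fasta(example):
    chrom, positions = cpg_positions(header, sequence)
    for pos in positions:
        print(chrom, pos, sep="\t")
```

Joining the wrapped lines before searching matters: a C at the end of one FASTA line followed by a G at the start of the next is still a CG in the genome. The resulting positions can then be assigned to exons with an interval lookup and compared against the RRBS site positions.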

Please let us know if you have any further questions!

Thank you again for your inquiry and using the UCSC Genome Browser. If
you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a
publicly-accessible forum. If your question includes sensitive data,
you may send it instead to genom...@soe.ucsc.edu.

Christopher Lee
UCSC Genomics Institute

