Human genome basepairs as a list


Anjuska Kyllönen

Jan 22, 2018, 11:57:18 AM
to genome...@cse.ucsc.edu
Hello,

From the latest assembly of the human genome, I need a list of base pairs and their positions. Visually, this would be the most zoomed-in view of the browser rotated 90 degrees to form a two-column list. Is a direct source available? Can other tracks, such as intron/exon status, be added as columns?

Thank you in advance!

Best regards,
Anjuska Kyllönen

Jairo Navarro Gonzalez

Jan 25, 2018, 2:48:10 PM
to Anjuska Kyllönen, genome...@cse.ucsc.edu

Hello Anjuska,

Thank you for using the UCSC Genome Browser and for your inquiry.

Could you give an example of what you are trying to achieve? Do you want to create a unique data format such as the following?

#chrom start stop sequence feature

Unfortunately, there is no simple solution: formatting the data as you requested will require a fair amount of scripting, which is beyond the scope of this mailing list. You may find scripting help on other bioinformatics forums such as BioStars.

That being said, we can help you:
  1. Create a BED file for the hg38 assembly to describe each basepair's position in the genome
  2. Create a file which is a subset of the BED file for regions that overlap with exons
  3. Repeat Step #2 but filtering for introns

After you have these three files, you can then do a bit of scripting to merge them.

Step 1: Creating the BED file

To create a file that contains each base and its position, we can use the twoBitToFa utility, which you can download from the utilities directory for your operating system. The tool accepts a BED4 file that specifies the regions to extract from the hg38 2bit file. If you are not familiar with how we store coordinates in different formats (0-start, half-open BED vs. 1-start, fully-closed positional), you can learn more from the following blog post: The UCSC Genome Browser Coordinate Counting Systems. For example, to extract the sequence for the following regions:

chr1:89293-89295
chr1:92089-92092

We will have to convert these ranges into a BED4 file that describes one base per line:

chr1 89292 89293 .
chr1 89293 89294 .
chr1 89294 89295 .
chr1 92088 92089 .
chr1 92089 92090 .
chr1 92090 92091 .
chr1 92091 92092 .
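
For longer region lists, the conversion above could be scripted. The following is only a minimal sketch: the function name region_to_bed4 is made up for illustration, and it assumes the input uses the 1-start, fully-closed chrom:start-end notation shown above. It writes tab-separated fields, although whitespace-separated BED lines like the ones above work as well.

```python
def region_to_bed4(region, name="."):
    """Expand a 1-start, fully-closed region such as 'chr1:89293-89295'
    into 0-start, half-open BED4 lines, one base per line.

    Hypothetical helper, not part of any UCSC tool.
    """
    chrom, span = region.split(":")
    # Allow thousands separators as they appear in browser positions.
    first, last = (int(x.replace(",", "")) for x in span.split("-"))
    # A 1-start position p becomes BED start p-1 and BED end p.
    return [f"{chrom}\t{pos - 1}\t{pos}\t{name}"
            for pos in range(first, last + 1)]


if __name__ == "__main__":
    for region in ["chr1:89293-89295", "chr1:92089-92092"]:
        print("\n".join(region_to_bed4(region)))
```

Redirecting the printed lines to a file (for example input.bed) gives a BED4 file with one base per line.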

You can then run the twoBitToFa utility with the newly created BED4 file to limit the sequence output to your regions of interest:

twoBitToFa -bed=input.bed -bedPos -udcDir=. http://hgdownload.soe.ucsc.edu/gbdb/hg38/hg38.2bit hg38.chr1.fa

This should create a file that lists each base position and its nucleotide:

>chr1:89292-89293
a
>chr1:89293-89294
t
>chr1:89294-89295
c
>chr1:92088-92089
a
>chr1:92089-92090
c
>chr1:92090-92091
c
>chr1:92091-92092
t
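
To use this output in a later merge, it may help to turn the FASTA records back into rows. The sketch below assumes every header has the chrom:start-end form produced by the -bedPos option, as in the output above; parse_bedpos_fasta is a hypothetical name.

```python
def parse_bedpos_fasta(lines):
    """Yield (chrom, start, end, sequence) tuples from FASTA records whose
    headers look like '>chr1:89292-89293'.

    Hypothetical helper; assumes headers carry BED-style positions.
    """
    header = None
    seq = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Emit the previous record before starting a new one.
            if header is not None:
                yield (*header, "".join(seq))
            chrom, span = line[1:].rsplit(":", 1)
            start, end = span.split("-")
            header = (chrom, int(start), int(end))
            seq = []
        else:
            seq.append(line)
    if header is not None:
        yield (*header, "".join(seq))


if __name__ == "__main__":
    with open("hg38.chr1.fa") as fasta:
        for chrom, start, end, base in parse_bedpos_fasta(fasta):
            print(chrom, start, end, base, sep="\t")
```

The function accepts any iterable of lines, so it can be tried on a short list of strings before pointing it at a real file.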

Step 2: Filtering Exons and Introns

To learn whether each base overlaps with an exon or intron, we will use the Table Browser's intersection tool. We will create a custom track that contains regions for each exon/intron inside of your chosen gene annotation track. Here's a previously answered question that shows how to get exons-only and introns-only positions as custom tracks using the GENCODE V24 dataset. With these two custom tracks, you can then intersect them with the BED file from Step 1 to filter overlaps.

Exon and intron regions will depend on the gene track you use, so you should select whichever gene track best fits your research purpose. For this example, I will use the knownGene table. Here is a session with the exon-only custom track and the BED4 file from Step 1 loaded as a custom track:

https://genome.ucsc.edu/cgi-bin/hgTables?hgS_doOtherUser=submit&hgS_otherUserName=jnavarr5&hgS_otherUserSessionName=hg38.MLQ.20854

After loading this session, click create next to "intersection:". On the new page, select the custom track tb_knownGene, then select "All User Track records that have any overlap with tb_knownGene" and click submit. Back on the main Table Browser page, click get output. The output should list the bases that overlap exon regions.
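
If you prefer to check overlaps locally rather than in the Table Browser, the same "any overlap" test can be sketched in a few lines. This is a different route from the Table Browser intersection, not a description of how it is implemented. The function names and the exon interval in the example are made up for illustration, and all coordinates are assumed to be 0-start, half-open BED coordinates.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def bases_overlapping(bases, features):
    """Return the (chrom, start, end) records in `bases` that overlap any
    (chrom, start, end) record in `features`. Hypothetical helper."""
    by_chrom = {}
    for chrom, start, end in features:
        by_chrom.setdefault(chrom, []).append((start, end))
    return [
        (chrom, start, end)
        for chrom, start, end in bases
        if any(overlaps(start, end, f_start, f_end)
               for f_start, f_end in by_chrom.get(chrom, []))
    ]


if __name__ == "__main__":
    bases = [("chr1", 89292, 89293), ("chr1", 89293, 89294),
             ("chr1", 89294, 89295)]
    exons = [("chr1", 89294, 89300)]  # made-up interval, not real annotation
    print(bases_overlapping(bases, exons))
```

This compares every base against every feature on the same chromosome, so it only suits small inputs; for genome-wide data a dedicated tool such as bedtools intersect is likely a better fit.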

Step 3: Create a script to merge the files

At this point, you should have the following four files:
  1. The BED4 file with your regions of interest
  2. The FASTA file created from the twoBitToFa tool which contains the hg38 sequence
  3. A subset of regions in the BED4 that overlap with exons
  4. A subset of regions in the BED4 that overlap with introns

Now you just need to append the sequence and feature columns to your BED4 file, matching records on position as the key field. There is an example of this in this previously answered question:

https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/R8CstMtiJZM/TFeA7iIYAQAJ

The result should look something like the following five-field table; the fifth field may be blank where the base overlaps neither an exon nor an intron:

#chrom start stop sequence feature
chr1 89292 89293 A
chr1 89293 89294 T
chr1 89294 89295 C exon
chr1 92088 92089 A intron
chr1 92089 92090 C intron
chr1 92090 92091 C exon
chr1 92091 92092 T exon
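
A merge of this kind could be sketched as follows, assuming the four files have already been read into memory and keyed by (chrom, start, end). The function name merge_columns is hypothetical. Two choices here are assumptions rather than anything prescribed above: a base found in both subsets is labelled exon, and bases are upper-cased to match the table (drop .upper() to keep the lowercase soft-masking from the 2bit file).

```python
def merge_columns(bases, sequences, exon_bases, intron_bases):
    """Build tab-separated rows: chrom, start, stop, sequence, feature.

    bases        -- ordered list of (chrom, start, end) keys from the BED4 file
    sequences    -- dict mapping each key to its nucleotide
    exon_bases   -- set of keys that overlap exons
    intron_bases -- set of keys that overlap introns
    Hypothetical helper; exon takes precedence over intron by assumption.
    """
    rows = []
    for key in bases:
        chrom, start, end = key
        seq = sequences.get(key, "").upper()
        if key in exon_bases:
            feature = "exon"
        elif key in intron_bases:
            feature = "intron"
        else:
            feature = ""  # fifth field stays blank
        rows.append("\t".join([chrom, str(start), str(end), seq, feature]))
    return rows


if __name__ == "__main__":
    bases = [("chr1", 89293, 89294), ("chr1", 89294, 89295),
             ("chr1", 92088, 92089)]
    sequences = {bases[0]: "t", bases[1]: "c", bases[2]: "a"}
    print("#chrom\tstart\tstop\tsequence\tfeature")
    print("\n".join(merge_columns(bases, sequences,
                                  exon_bases={bases[1]},
                                  intron_bases={bases[2]})))
```

Keying on the full (chrom, start, end) tuple rather than the start alone keeps the join correct when the input covers more than one chromosome.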

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly-accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro 
UCSC Genomics Institute

Want to share the Browser with colleagues?
Host a workshop: http://bit.ly/ucscTraining

