a question about RefSeq RefFlat files


Bogdan Tanasa

Sep 15, 2016, 10:34:08 AM
to gen...@soe.ucsc.edu
Dear all,

I've just downloaded the BED file of all exons for RefSeq genes on hg38 (using the Table Browser);
the results look like this:

chr1    35276    35481    FAM138F_exon_1_0_chr1_35277_r    0    -
chr1    35720    36081    FAM138F_exon_2_0_chr1_35721_r    0    -
chr1    34610    35174    FAM138A_exon_0_0_chr1_34611_r    0    -
chr1    35276    35481    FAM138A_exon_1_0_chr1_35277_r    0    -
chr1    35720    36081    FAM138A_exon_2_0_chr1_35721_r    0    -
chr1    69090    70008    OR4F5_exon_0_0_chr1_69091_f    0    +

May I ask the following two questions:

-- What is the meaning of the "0" in name fields such as "FAM138F_exon_1_0_chr1_35277_r"
or "FAM138F_exon_2_0_chr1_35721_r"?

-- In the same fields, what is the meaning of "chr1_35277_r" or "35721_r"?

many thanks,

-- bogdan

Chris Villarreal

Sep 15, 2016, 2:20:36 PM
to Bogdan Tanasa, gen...@soe.ucsc.edu

Dear Bogdan Tanasa,

Thank you for your question about the UCSC Genome Browser.
You can find information on the BED format here: 

https://genome.ucsc.edu/FAQ/FAQdownloads.html#download35

Scroll down to the section titled “Name of fourth column in BED output”

The “0” in FAM138F_exon_1_0_chr1_35277_r indicates the number of bases added to the regions requested. For example, if you added 100 bases on each end, columns 2 and 3 would widen by 100 bases on each side and the name field (column 4) would read:

chr1 35176 35581 FAM138F_exon_1_100_chr1_35277_r 0 -

The “chr1_35277” in "chr1_35277_r" gives the chromosome and the position of the feature's first base. If you have specified bases added to the requested features (for example, exons plus 100 bases on each end), then columns 2 and 3 of the output are no longer the exact coordinates of the exon; they start 100 bases before and end 100 bases after it. This part of the name is therefore an easy way to see where the actual feature starts as displayed in the browser.

We say "as displayed in the browser" because the coordinates in our tables almost always have 0-based starts (as they do in columns 2 and 3 of this output) but are displayed as 1-based in the browser (for more info see the FAQ). The start position listed in the 4th column is 1-based, so it is the exact coordinate the feature starts on as displayed in the browser.

The "r" in "chr1_35277_r" indicates the strand of the item: "r" represents the reverse strand and "f" represents the forward strand.
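The naming scheme described above can be sketched in Python as a small parser. This is only an illustration based on the explanation in this thread, not code from the Genome Browser; the names ExonName, parse_exon_name and implied_bed_start are hypothetical, and the regular expression assumes the field always has the form name_exon_N_padding_chrom_start_strand.

```python
import re
from typing import NamedTuple


class ExonName(NamedTuple):
    """Pieces of a Table Browser exon name (hypothetical helper type)."""
    gene: str
    exon_index: int    # exon number as written in the name
    padding: int       # bases added on each end of the feature
    chrom: str
    start_1based: int  # feature start as displayed in the browser
    strand: str        # "+" for "f", "-" for "r"


# Assumed layout: <name>_exon_<N>_<padding>_<chrom>_<start>_<f|r>
_NAME_RE = re.compile(
    r"^(?P<gene>.+)_exon_(?P<idx>\d+)_(?P<pad>\d+)"
    r"_(?P<chrom>.+)_(?P<start>\d+)_(?P<strand>[fr])$"
)


def parse_exon_name(name: str) -> ExonName:
    """Split a name such as FAM138F_exon_1_0_chr1_35277_r into its parts."""
    m = _NAME_RE.match(name.strip())
    if m is None:
        raise ValueError(f"unrecognised exon name: {name!r}")
    strand = "+" if m["strand"] == "f" else "-"
    return ExonName(m["gene"], int(m["idx"]), int(m["pad"]),
                    m["chrom"], int(m["start"]), strand)


def implied_bed_start(parsed: ExonName) -> int:
    """0-based column-2 value implied by the name: 1-based start - 1 - padding."""
    return parsed.start_1based - 1 - parsed.padding


example = parse_exon_name("FAM138F_exon_1_0_chr1_35277_r")
print(example)
print(implied_bed_start(example))
```

For the second row of the original output, the implied column-2 value is 35276, which agrees with the BED line posted above.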

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

-Chris V
UCSC Genome Browser



Bogdan Tanasa

Sep 15, 2016, 7:06:38 PM
to Chris Villarreal, gen...@soe.ucsc.edu
Dear Chris,

Thank you again for your kind and very informative reply. A question though: in the BED file I have downloaded for hg38 RefSeq RefFlat files,
some exons appear on multiple lines. For example, the exon "ZNF41_exon_0_0_chrX_47445178_r" is listed on all of the lines below.

Is this because there are multiple splice isoforms that contain the exon, or perhaps for some other reason? Thank you!

chrX    47445177    47449366    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449366    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -
chrX    47445177    47449474    ZNF41_exon_0_0_chrX_47445178_r    0    -

Jairo Navarro Gonzalez

Sep 19, 2016, 1:25:48 PM
to Bogdan Tanasa, Chris Villarreal, gen...@soe.ucsc.edu
Dear Bogdan,

Thank you for using the UCSC Genome Browser and your question about RefFlat files.
Yes, you are correct: each row comes from a different transcript that contains the exon, and in this example there are 20 transcripts.
For example, if you go to the gateway page and search for this gene "ZNF41" in hg38, and look in the RefSeq track, you will see 20 transcripts for this gene. 
18 of your rows display this exon region:

chrX 47445177 47449474 ZNF41_exon_0_0_chrX_47445178_r 0 -

2 of your rows display this smaller exon region:

chrX 47445177 47449366 ZNF41_exon_0_0_chrX_47445178_r 0 -

You can visually see the two smaller exons in the session below:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=jnavarr5&hgS_otherUserSessionName=MLQ.18086.hg38.RefSeq

Here is more information about RefSeq:
http://www.ncbi.nlm.nih.gov/books/NBK21091/
For example, from the intro, "Be aware, however, that the RefSeq collection does include alternatively spliced transcripts encoding the same protein or distinct protein isoforms, in addition to orthologs, paralogs, and alternative haplotypes for some organisms, which will affect the outcome of a database query."
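As a rough way to check this kind of duplication in a downloaded file, the rows can be grouped by name and coordinates and counted. The sketch below is only an illustration; count_exon_regions is a hypothetical helper, and the input is rebuilt from the two distinct ZNF41 rows quoted in this thread (18 copies of the longer region and 2 of the shorter one).

```python
from collections import Counter


def count_exon_regions(bed_lines):
    """Count how many BED rows share the same name, chromosome, start and end."""
    counts = Counter()
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        counts[(name, chrom, start, end)] += 1
    return counts


# The two distinct rows from the ZNF41 example in this thread.
long_row = "chrX\t47445177\t47449474\tZNF41_exon_0_0_chrX_47445178_r\t0\t-"
short_row = "chrX\t47445177\t47449366\tZNF41_exon_0_0_chrX_47445178_r\t0\t-"
rows = [long_row] * 18 + [short_row] * 2

for (name, chrom, start, end), n in count_exon_regions(rows).most_common():
    print(f"{n:2d} rows  {chrom}:{start}-{end}  {name}")
```

Each distinct region then shows up once with the number of transcripts that share it, which makes it easier to spot exons that have the same name but different end coordinates.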


Jairo Navarro 
UCSC Genome Browser

Bogdan Tanasa

Sep 19, 2016, 1:56:05 PM
to Jairo Navarro Gonzalez, Chris Villarreal, gen...@soe.ucsc.edu
Thank you, Jairo, for all the information! It is very helpful!