A few more questions about RefSeq

雪儿

Apr 28, 2015, 12:29:54 PM

to genome

Hello,
Thank you so much for your reply. We are still confused about some details.

1. According to the filtering criteria you referred to in your last email, there should be more NM/NR transcripts in the RefSeq file from NCBI than in the refGene file from UCSC. However, we found 49800 NM/NR transcripts in UCSC refGene.txt (GRCh37, the newest version) and only 42280 NM/NR transcripts in the NCBI RefSeq file (ref_GRCh37.p13_top_level.gff3). Of these, 41796 transcripts are shared between the two files and 8004 are unique to UCSC refGene. We expected the RefSeq dataset to cover every transcript in UCSC refGene, since refGene is produced by filtering RefSeq RNAs. So why are these unique transcripts in UCSC refGene missing from NCBI RefSeq? Where do they come from? And did we use the wrong RefSeq file?

2. The description page of the RefSeq Genes track on your website states that "The RefSeq Genes track shows known human protein-coding and non-protein-coding genes taken from the NCBI RNA reference sequences collection (RefSeq)." We would like to know which NCBI RNA reference sequences collection file this refers to.
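
For anyone who wants to reproduce the comparison in point 1, here is a minimal sketch of how the two accession sets could be compared. It assumes that column 2 of refGene.txt holds the accession without a version suffix, and that the NCBI GFF3 records carry a `transcript_id=` attribute with a versioned accession; both are assumptions about the file layouts, and the function names are made up for illustration. Note that it reports refGene rows and unique accessions separately, because refGene.txt can list the same accession on more than one row when a transcript is placed at more than one location.

```python
import gzip
import re

PREFIXES = ("NM_", "NR_")
# Assumption: transcript records in the GFF3 carry a transcript_id attribute.
TRANSCRIPT_ID = re.compile(r"(?:^|;)transcript_id=([^;]+)")


def strip_version(accession):
    """Drop a trailing version, e.g. NM_000014.4 -> NM_000014."""
    return accession.split(".", 1)[0]


def refgene_accessions(lines):
    """Return one accession per NM_/NR_ row of refGene.txt (repeats kept)."""
    accessions = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: column 2 (index 1) is the transcript accession.
        if len(fields) > 1 and fields[1].startswith(PREFIXES):
            accessions.append(strip_version(fields[1]))
    return accessions


def gff3_accessions(lines):
    """Return the set of NM_/NR_ accessions named in GFF3 attributes."""
    accessions = set()
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match and match.group(1).startswith(PREFIXES):
            accessions.add(strip_version(match.group(1)))
    return accessions


def compare(refgene_path, gff3_path):
    """Summarize overlap between two gzipped files (paths are hypothetical)."""
    with gzip.open(refgene_path, "rt") as handle:
        ucsc_rows = refgene_accessions(handle)
    with gzip.open(gff3_path, "rt") as handle:
        ncbi = gff3_accessions(handle)
    ucsc = set(ucsc_rows)
    return {
        "ucsc_rows": len(ucsc_rows),
        "ucsc_unique": len(ucsc),
        "ncbi_unique": len(ncbi),
        "shared": len(ucsc & ncbi),
        "only_ucsc": len(ucsc - ncbi),
    }
```

Calling `compare("refGene.txt.gz", "ref_GRCh37.p13_top_level.gff3.gz")` on local copies of the two downloads would show whether the row count and the unique-accession count differ on the UCSC side.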

We look forward to your answer. Best wishes!

---------------------------------------------------------------------------------------------------------------------------------------------------

Hello,

Thanks for your question. In placing RefSeq alignments on the browser, we use several filtering criteria. These are outlined on the RefSeq Genes track description page: "RefSeq mRNAs were aligned against the human genome using blat; those with an alignment of less than 15% were discarded. When a single mRNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.1% of the best and at least 96% base identity with the genomic sequence were kept." You can read more about the RefSeq Genes track here: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=refGene. Also, a "reference" set of genes could refer to any set - you would have to determine what your criteria are for selecting a reference.
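
To make the quoted criteria concrete, they can be sketched roughly as below. This is only an illustration of the three rules as worded on the track description page, not the actual UCSC pipeline. The `Alignment` record, its field names, and the reading of "an alignment of less than 15%" as the aligned fraction of the mRNA are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one blat placement of one mRNA."""
    accession: str
    chrom: str
    aligned_fraction: float  # assumed: share of the mRNA that aligned (0..1)
    identity: float          # base identity with the genome (0..1)


def filter_alignments(alignments, min_aligned=0.15,
                      min_identity=0.96, near_best=0.001):
    """Apply the three quoted rules and return the surviving alignments."""
    by_accession = defaultdict(list)
    for aln in alignments:
        # Rule 1: discard alignments covering less than 15%.
        if aln.aligned_fraction >= min_aligned:
            by_accession[aln.accession].append(aln)

    kept = []
    for alns in by_accession.values():
        # Rule 2: find the highest base identity for this mRNA.
        best = max(a.identity for a in alns)
        for a in alns:
            # Rule 3: keep only placements within 0.1% of the best
            # that also reach at least 96% identity.
            if a.identity >= min_identity and best - a.identity <= near_best:
                kept.append(a)
    return kept
```

Under this reading, one mRNA can keep more than one placement when several alignments are almost equally good, and an mRNA whose best alignment falls under 96% identity is dropped entirely.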

If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

- - -
Luvina Guruvadoo
UCSC Genome Bioinformatics Group

On Tue, Apr 21, 2015 at 7:54 PM, 雪儿 <12042...@qq.com> wrote:

Dear UCSC team,
Hi!
We want to use the RefSeq transcripts as the reference set in our RNA-seq analysis, so I downloaded two RefSeq files. We found that refGene.txt.gz from UCSC (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/) differs from ref_GRCh37.p13_top_level.gff3.gz from NCBI (ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/GFF/) in the number of transcripts. What standard and method is refGene.txt.gz based on, and why do the RefSeq files at UCSC and NCBI differ so much? Is it right to use the refGene.txt.gz from your website as reference sequences?
We look forward to your answer. Thank you!
--

Jonathan Casper

Apr 28, 2015, 6:55:12 PM

to 雪儿, genome

Hello,

The answer to your question is that you used the wrong RefSeq file. If you open the containing directory at ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/GFF/ and look at the timestamp, you will see that the file ref_GRCh37.p13_top_level.gff3.gz was last updated in 2013. That file is close to two years old, and many more transcripts have been released since that time.

The directories that we use to obtain RefSeq transcripts are the release files at ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/ combined with the daily updates from ftp://ftp.ncbi.nih.gov/refseq/daily/. Unfortunately, there is no one file that we can point you to, as we are continually incorporating new data from the daily RefSeq updates. Please note that this process also involves data for more than just the human hg19 genome assembly, so you may need to apply some further filters to the files in those directories to find the data that you want.
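
As a rough illustration of the kind of further filtering mentioned above, the sketch below pulls human NM_/NR_ accessions out of FASTA header lines. It assumes the RNA FASTA headers name the organism in the description and carry the accession in the first token, either bare or inside a `gi|...|ref|...|` style identifier. That header layout is an assumption, as is the function name, so please check it against the files you actually download.

```python
import re

# Matches NM_/NR_ accessions with an optional version, e.g. NM_000014.4.
ACCESSION = re.compile(r"\b(N[MR]_\d+(?:\.\d+)?)")


def human_nm_nr_accessions(lines, organism="Homo sapiens"):
    """Yield NM_/NR_ accessions from FASTA headers that mention the organism."""
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        if organism not in line:
            continue  # record from another species
        tokens = line[1:].split()
        if not tokens:
            continue
        # Assumption: the accession sits in the first whitespace-delimited token.
        match = ACCESSION.search(tokens[0])
        if match:
            yield match.group(1)
```

Feeding it the lines of a decompressed RNA FASTA file, for example through `gzip.open(path, "rt")`, would give the accessions to keep; predicted XM_/XR_ records and non-human records are skipped.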

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group
