refGene.txt.gz file not being updated?


Jorge Amigo Lechuga

Aug 31, 2020, 11:26:38 AM
to gen...@soe.ucsc.edu
Hi,

Is there any reason why this file has not been updated since March 1st, 2020?
ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz

Is there any other place we should check for RefSeq updates?

Regards,
Jorge.
-- 

Jorge Amigo Lechuga

Fundación Pública Galega de Medicina Xenómica
Hospital Clínico Universitario de Santiago
Edificio de Consultas, planta -2
15706 - Santiago de Compostela (Spain)

tel: +34981955322
fax: +34981951473

Matthew Speir

Sep 1, 2020, 6:41:05 PM
to Jorge Amigo Lechuga, UCSC Genome Browser Discussion List
Hello, Jorge.

Thank you for bringing this issue to our attention. We plan to update this file, and others that depend on our processing of GenBank and RefSeq data, soon.

In case you're not aware, the refGene.txt.gz file consists of UCSC's realignment of NM/NR RNAs from RefSeq using BLAT. This means that the exons or gene positions reported in that file may differ from those officially provided by RefSeq. If you are looking for the official RefSeq transcript positions, you should use the files starting with "ncbiRefSeq", e.g. ncbiRefSeqCurated.txt.gz. You can read more about how the UCSC RefSeq (refGene) and NCBI RefSeq tracks (e.g. ncbiRefSeqCurated) are generated on the track description page: https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=refSeqComposite.
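To see where the BLAT-based refGene placements differ from the NCBI-provided ones, the two tables can be compared transcript by transcript. The following is a minimal Python sketch, not a UCSC tool. It assumes both files use the genePredExt layout with a leading bin column (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...), and it strips accession version suffixes before comparing, on the assumption that one table may carry versions while the other does not. The function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    exons: tuple  # tuple of (start, end) pairs as stored in the table


def parse_genepred_line(line):
    """Parse one tab-separated row; assumes a leading bin column (index 0)."""
    f = line.rstrip("\n").split("\t")
    # exonStarts and exonEnds are comma-separated lists with a trailing comma.
    starts = [int(x) for x in f[9].split(",") if x]
    ends = [int(x) for x in f[10].split(",") if x]
    return Transcript(f[1], f[2], f[3], int(f[4]), int(f[5]),
                      tuple(zip(starts, ends)))


def load_placements(lines):
    """Map unversioned accession -> set of (chrom, strand, exons) placements."""
    placements = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        t = parse_genepred_line(line)
        accession = t.name.split(".")[0]  # assumption: drop any ".N" version
        placements[accession].add((t.chrom, t.strand, t.exons))
    return placements


def differing_transcripts(lines_a, lines_b):
    """Accessions present in both tables whose placements are not identical."""
    a = load_placements(lines_a)
    b = load_placements(lines_b)
    return sorted(acc for acc in a.keys() & b.keys() if a[acc] != b[acc])
```

With local copies of the two files, the inputs can be opened with `gzip.open(path, "rt")` and passed directly to `differing_transcripts`.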

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Training videos & resources: http://genome.ucsc.edu/training/index.html

Want to share the Browser with colleagues? Host a workshop: http://bit.ly/ucscTraining

---

Matthew Speir

UCSC Cell Browser, Quality Assurance and Data Wrangler

Human Cell Atlas, User Experience Researcher

UCSC Genome Browser, User Support

UC Santa Cruz Genomics Institute

Revealing life’s code.




Jorge Amigo Lechuga

Sep 2, 2020, 11:55:57 AM
to Matthew Speir, UCSC Genome Browser Discussion List
Hi Matthew,

Thanks for your response, and thank you very much for the track description page. We've been using this refGene.txt.gz file for years as a reference for our NGS data without knowing this, and it's now clear to us that we should use the ncbiRefSeqCurated.txt.gz file instead.

Having said that, I have a few new questions about these ncbi files:
- ncbiRefSeqCurated.txt.gz and ncbiRefSeq.txt.gz are identical, but according to the track description page the former should be a subset of the latter. Am I missing anything?
- I see that the latest update for ncbiRefSeqCurated.txt.gz was on March 22nd. Has the original file not been updated since then?
- We started downloading these files from your FTP server because we were downloading many others from there, but would it be possible for you to point us to the original RefSeq file on the official FTP server? Would this file at ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gff.gz be the one we should parse if we want to be perfectly up to date?

Regards,
Jorge.


From: Matthew Speir <msp...@ucsc.edu>
Subject: [genome] refGene.txt.gz file not being updated?
Date: Wednesday, September 2, 2020, 0:41
To: Jorge Amigo Lechuga <jorge...@usc.es>
Cc: UCSC Genome Browser Discussion List <gen...@soe.ucsc.edu>

Jairo Navarro Gonzalez

Sep 7, 2020, 7:45:47 PM
to Jorge Amigo Lechuga, Matthew Speir, UCSC Genome Browser Discussion List

Hello Jorge,

Thank you for using the UCSC Genome Browser and sending your follow-up questions.

You are correct that the ncbiRefSeqCurated.txt.gz and ncbiRefSeq.txt.gz files are identical, even though ncbiRefSeqCurated is supposed to be a subset of ncbiRefSeq. The two files are identical because NCBI no longer updates GRCh37/hg19 gene predictions, nor does it provide predicted transcripts for GRCh37.
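This can be checked on a local copy by counting transcript accessions by type. The sketch below is illustrative only. It assumes the transcript accession sits in the second tab-separated column (after a leading bin column) and relies on the RefSeq naming convention, which is not stated in this thread, that NM_/NR_ accessions are curated records while XM_/XR_ accessions are predicted models. The function names are hypothetical.

```python
from collections import Counter

# Assumption: RefSeq accession prefix convention.
CURATED_PREFIXES = ("NM_", "NR_")    # curated mRNA / non-coding RNA
PREDICTED_PREFIXES = ("XM_", "XR_")  # predicted (model) mRNA / non-coding RNA


def classify_accession(accession):
    """Label an accession as 'curated', 'predicted', or 'other' by prefix."""
    if accession.startswith(CURATED_PREFIXES):
        return "curated"
    if accession.startswith(PREDICTED_PREFIXES):
        return "predicted"
    return "other"


def count_by_class(lines):
    """Count rows per class; the accession is read from column index 1."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        counts[classify_accession(fields[1])] += 1
    return counts
```

If the "predicted" count for ncbiRefSeq.txt.gz is zero, nothing distinguishes it from the curated subset, which is consistent with the two files being identical.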

The March 22 update time is misleading, since that update only added patch sequences to the hg19 assembly. The GFF file we used for hg19 patch 13 is from October 2019:

-r--r--r-- 2  25695105 Oct 24  2019 GCF_000001405.25_GRCh37.p13_genomic.gff.gz

and is available from this NCBI directory:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/

The file is identical to the file you found in the other path (ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gff.gz).

-rw-r--r-- 1 25695105 Sep  2 09:11 GRCh37_latest_genomic.gff.gz
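Both listings show the same size (25695105 bytes), which suggests the files match but does not prove it. For anyone who wants to confirm this on downloaded copies, a checksum comparison is a stronger test. This is a minimal sketch with hypothetical function names. It compares the files byte for byte, so two archives compressed separately from the same GFF could still differ here.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_content(path_a, path_b):
    """True when both files have identical bytes (identical digests)."""
    return file_digest(path_a) == file_digest(path_b)
```

For example, `same_content("GCF_000001405.25_GRCh37.p13_genomic.gff.gz", "GRCh37_latest_genomic.gff.gz")` would be expected to return True if the local copies match the two files described above.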

If you want more frequent hg19 annotation updates, you should contact RefSeq and let them know. To contact RefSeq directly, you can fill out the following form:

https://www.ncbi.nlm.nih.gov/projects/RefSeq/update.cgi


Jairo Navarro
UCSC Genome Browser


