Fw: [genome] Why UCSC cannot identify Ensembl chicken galGal4 reference genome?

賴勇志

Mar 23, 2015, 6:25:23 PM
to gen...@soe.ucsc.edu

Hi Steve,

Many thanks for your explanation. You are right. I have tried every inconsistent contig in this analysis, as listed below. Do you know of any tool that can convert .bed and .bam files between Ensembl and UCSC naming? I don’t think I can manually change the contig names for every item in my .bed or .bam files. In addition, do you know the difference between the Ensembl (UCSC) and NCBI chicken galGal4 reference genomes? I used TopHat2 to align my RNA-Seq data to both the Ensembl (UCSC) and NCBI chicken genomes, and the results differ significantly. I think there must be some differences between the Ensembl (Release 75) and NCBI chicken genomes. Many thanks.

Gary

Ensembl              UCSC
chrAADN03011039.1    chr27_AADN03011039_random
chrAADN03011267.1    chrLGE22C19W28_N03011267_random
chrAADN03013871.1    chrUn_AADN03013871
chrAADN03015172.1    chrUn_AADN03015172
chrAADN03015727.1    chrUn_AADN03015727
chrAADN03018789.1    chrUn_AADN03018789
chrAADN03021099.1    chrUn_AADN03021099
chrAADN03021535.1    chrUn_AADN03021535
chrAADN03021832.1    chrUn_AADN03021832
chrAADN03022685.1    chrUn_AADN03022685
chrAADN03022998.1    chrUn_AADN03022998
chrJH375162.1        chr2_JH375162_random
chrJH375454.1        chrUn_JH375454
chrJH375623.1        chrUn_JH375623
chrJH375692.1        chrUn_JH375692
chrJH375734.1        chrUn_JH375734
chrJH375752.1        chrUn_JH375752
chrJH376016.1        chrUn_JH376016
chrJH376272.1        chrUn_JH376272
chrJH376285.1        chrUn_JH376285
chrJH376326.1        chrUn_JH376326
chrJH376330.1        chrUn_JH376330
chrJH376331.1        chrUn_JH376331

Sent: Monday, March 23, 2015 10:41 AM
Subject: RE: [genome] Why UCSC cannot identify Ensembl chicken galGal4 reference genome?

Hello, Gary.

UCSC is actually using Ensembl release 78 on our galGal4 assembly now, but the reference assembly used by UCSC and Ensembl/Galaxy is the same.  The problem is that the scaffold naming scheme is slightly different at UCSC.  If you drop the “chr” portion of the scaffold names you referenced and search them at http://genome.ucsc.edu/cgi-bin/hgGateway?db=galGal4, you will find that they do all exist at UCSC under slightly different names.  For example, “chrAADN03011039.1” is actually “chr27_AADN03011039_random” and “JH375162.1” is actually “chr2_JH375162_random”.
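
The naming pattern described above can be checked mechanically. The following is only a sketch: it strips the "chr" prefix and the version suffix from an Ensembl/Galaxy-style name to get the bare accession, which is the part that also appears inside the UCSC name. The function name to_accession is made up for this illustration and is not a UCSC or Ensembl tool.

```shell
#!/bin/sh
# to_accession NAME
# Hypothetical helper: reduce an Ensembl/Galaxy-style scaffold name
# (e.g. chrAADN03011039.1) to its bare accession (AADN03011039).
to_accession() {
    name=${1#chr}               # drop a leading "chr" if present
    printf '%s\n' "${name%.*}"  # drop the trailing version suffix such as ".1"
}

to_accession chrAADN03011039.1   # prints AADN03011039
to_accession JH375162.1          # prints JH375162

# The accession is a substring of the UCSC name quoted in the message above.
ucsc=chr27_AADN03011039_random
acc=$(to_accession chrAADN03011039.1)
case $ucsc in
    *"$acc"*) echo "$acc is part of $ucsc" ;;
    *)        echo "$acc not found in $ucsc" ;;
esac
```

The chromosome or "Un" part of the UCSC name cannot be derived from the accession alone, so this only helps with searching, not with renaming.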

Please contact us again at gen...@soe.ucsc.edu if you have any further questions.
All messages sent to that address are archived on a publicly-accessible Google Groups forum.  If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.


---
Steve Heitner
UCSC Genome Bioinformatics Group

 

From: 賴勇志 [mailto:d92b...@ntu.edu.tw]
Sent: Saturday, March 21, 2015 4:53 PM
To: gen...@soe.ucsc.edu
Subject: [genome] Why UCSC cannot identify Ensembl chicken galGal4 reference genome?

 

Hi,

I found that several scaffolds or contigs in the Ensembl release 75 chicken galGal4 reference genome, listed below, cannot be identified by the UCSC Genome Browser. All of them can be identified by main Galaxy. Could you tell me why the chicken galGal4 reference genome differs between Ensembl (main Galaxy) and UCSC? Many thanks.

Gary

chrAADN03011039.1
chrAADN03011267.1
chrAADN03013871.1
chrAADN03015172.1
chrAADN03015727.1
chrAADN03018789.1
chrAADN03021099.1
chrAADN03021535.1
chrAADN03021832.1
chrAADN03022685.1
chrAADN03022998.1
chrJH375162.1
chrJH375454.1
chrJH375623.1
chrJH375692.1
chrJH375734.1
chrJH375752.1
chrJH376016.1
chrJH376272.1
chrJH376285.1
chrJH376326.1
chrJH376330.1
chrJH376331.1

--

Hiram Clawson

Mar 23, 2015, 10:52:22 PM
to 賴勇志, gen...@soe.ucsc.edu
Good Afternoon Gary:

For any genome assembly at UCSC that is equivalent to Ensembl, there is a database
table in the UCSC database which translates the names from UCSC to Ensembl.
Please note the text dump of this table:

http://hgdownload.soe.ucsc.edu/goldenPath/galGal4/database/ucscToEnsembl.txt.gz

You can make an 'sed' translation file from those two columns, for example
from UCSC to Ensembl, make one line for each translation:

s/chr27_AADN03011039_random/chrAADN03011039.1/g;

and so on for each line, all in one file ucscToEnsembl.galGal4.sed

Then use that translation on any file:
sed -f ucscToEnsembl.galGal4.sed fileWithUcscNames.txt > fileWithEnsemblNames.txt
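
The sed file described above does not have to be written by hand. The following is a minimal sketch, assuming the decompressed table is tab-separated with the UCSC name in column 1 and the Ensembl name in column 2; please check the layout of the real file first. The function name make_sed_file and the sample rows (taken from the names listed earlier in this thread) are only for illustration.

```shell
#!/bin/sh
# make_sed_file MAPPING OUT
# Hypothetical helper: write one "s/ucscName/ensemblName/g;" line per row.
# ASSUMPTION: MAPPING is tab-separated, column 1 = UCSC name,
# column 2 = Ensembl name.
make_sed_file() {
    # Prefix each row with the length of the UCSC name and sort longest
    # first, so that a short name such as chr2 is never applied before a
    # longer name that starts with it (chr2_JH375162_random).
    awk -F '\t' 'NF >= 2 { print length($1) "\t" $1 "\t" $2 }' "$1" |
        sort -k1,1nr |
        awk -F '\t' '{ printf "s/%s/%s/g;\n", $2, $3 }' > "$2"
}

# Demo on two sample rows instead of the real download.
tmp=$(mktemp -d)
printf 'chr27_AADN03011039_random\tchrAADN03011039.1\n'  > "$tmp/map.txt"
printf 'chr2_JH375162_random\tchrJH375162.1\n'          >> "$tmp/map.txt"

make_sed_file "$tmp/map.txt" "$tmp/ucscToEnsembl.galGal4.sed"

# Apply the translation to one BED-like line.
printf 'chr2_JH375162_random\t100\t200\n' |
    sed -f "$tmp/ucscToEnsembl.galGal4.sed"

rm -rf "$tmp"
```

For the real table, the input would be the decompressed download, for example the output of gunzip -c ucscToEnsembl.txt.gz saved to a file. Note that this is still plain substring replacement, so it can touch matching text in any column. For the reverse direction (Ensembl to UCSC), the "." in the Ensembl names would need to be escaped because it is a regular expression wildcard in the pattern part of the sed command.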

--Hiram

賴勇志

Mar 24, 2015, 12:10:13 PM
to Hiram Clawson, gen...@soe.ucsc.edu
Hi Hiram,

Thank you so much. The answer is just what I need. Thanks again.

Gary

-----Original Message-----
From: Hiram Clawson
Sent: Monday, March 23, 2015 7:52 PM
To: 賴勇志 ; gen...@soe.ucsc.edu
Subject: Re: Fw: [genome] Why UCSC cannot identify Ensembl chicken galGal4

賴勇志

Apr 1, 2015, 5:30:09 PM
to gen...@soe.ucsc.edu
Hi,

I downloaded the Ensembl Genes GTF file from the UCSC Table Browser (fig1).
It includes only 386,159 lines (fig2). However, the Ensembl Genes GTF file
that I downloaded from Ensembl (fig3,
ftp://ftp.ensembl.org/pub/release-79/gtf/gallus_gallus/Gallus_gallus.Galgal4.79.gtf.gz)
includes 445,501 lines (fig4). The file sizes are also very different:
about 156 MB for Ensembl and about 47 MB for UCSC. Do you know what the
difference is between the two GTF files? Many thanks.

Gary
fig1.JPG
fig2.JPG
fig3.JPG
fig4.JPG

賴勇志

Apr 3, 2015, 5:37:35 PM
to Hiram Clawson, gen...@soe.ucsc.edu
Hi Hiram,

Many thanks for your explanation. There are 15,932 chromosomes in the
ucscToEnsembl.txt table. However, the GTF file that I downloaded from
Ensembl Release 79 for the chicken genome
(ftp://ftp.ensembl.org/pub/release-79/gtf/gallus_gallus) contains only 934
chromosomes. Do you know why the number of chromosomes differs between
ucscToEnsembl.txt and the Ensembl GTF file? Many thanks.

Gary

-----Original Message-----
From: Hiram Clawson
Sent: Monday, March 23, 2015 7:52 PM
To: 賴勇志 ; gen...@soe.ucsc.edu
Subject: Re: Fw: [genome] Why UCSC cannot identify Ensembl chicken galGal4

Matthew Speir

Apr 3, 2015, 5:50:24 PM
to 賴勇志, Hiram Clawson, gen...@soe.ucsc.edu
Hi Gary,

This difference is due to the fact that some chromosomes in the galGal4
assembly do not have any data on them. There are a total of 14998
chromosomes in the galGal4 assembly that do not have any data from
Ensembl Genes. If we add that number to the number of different
chromosomes that you counted in the Ensembl GTF file, which was 934, we
get the total number of chromosomes in the galGal4 assembly, which is
15932.

I hope this is helpful. If you have any further questions, please reply
to gen...@soe.ucsc.edu. All messages sent to that address are archived
on a publicly-accessible Google Groups forum. If your question includes
sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

Hiram Clawson

Apr 3, 2015, 5:50:53 PM
to 賴勇志, gen...@soe.ucsc.edu
Good Afternoon Gary:

Because there are no Ensembl gene predictions on most of the tiny contigs:

hgsql -N -e 'select chrom from ensGene;' galGal4 | sort -u | wc -l
934
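
The same kind of count can be made on a downloaded GTF file without access to the UCSC database. The following is only a sketch, assuming the usual GTF layout in which column 1 is the sequence name and header lines start with "#". The function name count_seqnames and the sample records are made up for illustration.

```shell
#!/bin/sh
# count_seqnames
# Hypothetical helper: read GTF text on stdin and print the number of
# distinct sequence names found in column 1.
count_seqnames() {
    # Skip "#" header lines, keep column 1, count the unique values.
    # tr removes the padding that some wc implementations add.
    grep -v '^#' | cut -f 1 | sort -u | wc -l | tr -d ' '
}

# Tiny made-up sample (not real Ensembl records): two distinct names.
printf '#!genome-build Galgal4\n1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "g1";\n1\tsrc\texon\t20\t30\t.\t+\t.\tgene_id "g1";\nJH375162.1\tsrc\texon\t5\t9\t.\t-\t.\tgene_id "g2";\n' |
    count_seqnames               # prints 2

# Sequences with no Ensembl gene predictions, using the numbers in this thread.
echo $((15932 - 934))            # prints 14998
```

For the real file, the input would be something like gunzip -c Gallus_gallus.Galgal4.79.gtf.gz piped into count_seqnames.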

--Hiram

On 4/3/15 2:33 PM, 賴勇志 wrote:

賴勇志

Apr 3, 2015, 6:06:32 PM
to Hiram Clawson, gen...@soe.ucsc.edu
Hi Hiram,

Thanks a lot for your quick explanation.

Gary

-----Original Message-----
From: Hiram Clawson
Sent: Friday, April 03, 2015 2:50 PM
To: 賴勇志 ; gen...@soe.ucsc.edu
Subject: Re: [genome] Why UCSC cannot identify Ensembl chicken galGal4