liftOver chain file


Jinsoo Ahn

Apr 16, 2021, 1:32:20 PM
to gen...@soe.ucsc.edu
Hello, 

I am wondering whether a chain file can be provided for the pig genome assemblies below.
Sscrofa11.1/susScr11 to USMARCv1.0
So, the name could be susScr11ToUSMARCv1.0.over.chain.gz.

 
I could not find a GFF file for USMARCv1.0, so I would like to create one with the command below.
liftOver GCF_000003025.6_Sscrofa11.1_genomic.gff -gff susScr11ToUSMARCv1.0.over.chain.gz USMARCv1.0.gff unmapped

  
Please let me know. 

Thanks, 

Jinsoo

Matthew Speir

Apr 23, 2021, 11:52:13 AM
to Jinsoo Ahn, UCSC Genome Browser Discussion List
Hello, Jinsoo.

Thank you for your question about the pig USMARCv1.0 assembly.

We have made a hub available for this assembly: https://genome.ucsc.edu/h/GCA_002844635.1. The underlying files for this assembly hub, including a GTF file for the Ensembl Genes track, are available on our download server: http://hgdownload.soe.ucsc.edu/hubs/GCA/002/844/635/GCA_002844635.1/

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Training videos & resources: http://genome.ucsc.edu/training/index.html

Want to share the Browser with colleagues? Host a workshop: http://bit.ly/ucscTraining

---

Matthew Speir

UCSC Cell Browser, Quality Assurance and Data Wrangler

Human Cell Atlas, User Experience Researcher

UCSC Genome Browser, User Support

UC Santa Cruz Genomics Institute

Revealing life’s code.




Jinsoo Ahn

Apr 26, 2021, 11:40:20 AM
to gen...@soe.ucsc.edu, msp...@ucsc.edu
Hello, 

Thanks for your reply and your work on the hub and download server. 

It looks like the GTF file on the download server was generated from the Ensembl GTF.
As far as I can tell, the Ensembl GTF is incomplete, so the GTF file on the download server is incomplete as well.

Please find attached a paper regarding this issue. As I highlighted on page 9, it says:
"The sequence present at the telomeric
end of the long arm of the USMARCv1.0 chromosome 7 assembly
(after correcting the orientation of the USMARCv1.0 SSC7) is
missing from the Sscrofa11.1 SSC7 assembly, and currently located
on a 3.8-Mb unplaced scaffold (AEMK02000452.1). This unplaced
scaffold harbours several genes including DIO3, CKB, and
NUDT14 whose orthologues map to human chromosome 14 as
would be predicted from the pig-human comparative map [40]."

Let's say I focus on DIO3. 

In the NCBI genome assembly (Sscrofa11.1) and its annotation (GFF), DIO3 is on AEMK02000452.1 (an unplaced scaffold). This is the problem that the paper mentions.

Fortunately, in another NCBI genome assembly (USMARCv1.0), DIO3 is on "CM009092.1 (chromosome 7)", but a GFF is missing. So, we would like to liftOver the GFF for Sscrofa11.1 to make a GFF for USMARCv1.0, in order to place DIO3 on CM009092.1 (chromosome 7).


On the other hand, there is an Ensembl genome assembly (USMARCv1.0) with annotation (GTF), but DIO3 is missing from those files. Therefore, DIO3 is also missing from the GTF on your download server (GCA_002844635.1_USMARCv1.0.ensGene.v103.gtf.gz).


So, we need a chain file, which could be named "susScr11ToUSMARCv1.0.over.chain.gz", as I wrote previously.

Regarding the chromosome names, I will change the NCBI NC_ identifiers to UCSC chr names before running liftOver as below.
liftOver GCF_000003025.6_Sscrofa11.1_genomic.gff -gff susScr11ToUSMARCv1.0.over.chain.gz USMARCv1.0.gff unmapped
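
A minimal sketch of the renaming step described above, assuming a lookup table from RefSeq accessions to UCSC-style names. Only the NC_010443.5 to chr1 pair appears in this thread; the function name `rename_seqids` is illustrative, and a complete table would have to be built from the assembly report. Attribute values in column 9 are left untouched.

```python
def rename_seqids(lines, mapping):
    """Yield GFF3 lines with the sequence name in column 1 renamed.

    `mapping` is a dict such as {"NC_010443.5": "chr1"}; names that are
    not in the mapping are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##sequence-region"):
            # Directive form: ##sequence-region <seqid> <start> <end>
            parts = line.split()
            if len(parts) >= 2:
                parts[1] = mapping.get(parts[1], parts[1])
            line = " ".join(parts)
        elif line and not line.startswith("#"):
            # Feature line: nine tab-separated columns, seqid first.
            fields = line.split("\t")
            fields[0] = mapping.get(fields[0], fields[0])
            line = "\t".join(fields)
        yield line


# Illustrative input; only the NC_010443.5 -> chr1 pair comes from the thread.
mapping = {"NC_010443.5": "chr1"}
demo = [
    "##gff-version 3",
    "##sequence-region NC_010443.5 1 274330532",
    "NC_010443.5\tGnomon\tgene\t18\t3870\t.\t+\t.\tID=gene-LOC100125545",
]
renamed = list(rename_seqids(demo, mapping))
for row in renamed:
    print(row)
```

The sequence names in the renamed file need to match the names on the source side of the chain file for liftOver to find them.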


Please let me know if the chain file can be provided in https://hgdownload.soe.ucsc.edu/goldenPath/susScr11/liftOver/.

Thanks. 

Best, 

Jinsoo 


PS. 
As a test, I also checked another gene on chromosome 7 (DLK1).
DLK1 is present in the GTF on your download server, as shown below, but its gene_id is actually the transcript_id.
CM009092.1 ensGene.v103 transcript 4187047 4197167 . + . gene_id "ENSSSCT00070020262.1"; transcript_id "ENSSSCT00070020262.1"; 
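
A small sketch of how one might scan a GTF for records like the one above, where gene_id simply repeats transcript_id. The helper name `gene_id_equals_transcript_id` is hypothetical, and the second test record is made up for contrast.

```python
import re

# Matches the two GTF attributes of interest, e.g. gene_id "X";
_ATTR = re.compile(r'(gene_id|transcript_id) "([^"]*)"')


def gene_id_equals_transcript_id(line):
    """Return True if the line has both attributes and they are identical."""
    attrs = dict(_ATTR.findall(line))
    return (
        "gene_id" in attrs
        and "transcript_id" in attrs
        and attrs["gene_id"] == attrs["transcript_id"]
    )


# Record quoted in the message above.
dlk1 = (
    'CM009092.1 ensGene.v103 transcript 4187047 4197167 . + . '
    'gene_id "ENSSSCT00070020262.1"; transcript_id "ENSSSCT00070020262.1";'
)
# Hypothetical record with distinct identifiers, for comparison only.
other = 'chrN src transcript 1 10 . + . gene_id "GENE1"; transcript_id "TX1";'

print(gene_id_equals_transcript_id(dlk1))
print(gene_id_equals_transcript_id(other))
```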
 



pig_genome_Warr_2020.pdf

Matthew Speir

Apr 28, 2021, 5:02:39 PM
to Jinsoo Ahn, UCSC Genome Browser Discussion List
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.



Jinsoo Ahn

Apr 30, 2021, 11:36:34 AM
to Matthew Speir, gen...@soe.ucsc.edu
Hello Matthew,

Thank you so much. This will definitely help my research.

Best, 

Jinsoo 

Jinsoo Ahn

May 12, 2021, 10:11:22 AM
to gen...@soe.ucsc.edu
Hello, 

I ran liftOver with the command below, but the output file is much smaller than the input, and only the first chromosome appears to have been lifted over.

./liftOver GCF_000003025.6_Sscrofa11.1_genomic.chr.gff -gff susScr11ToGCA_002844635.1.over.chain.gz USMARCv1.0.gff unmapped

wc -l 
1932691 GCF_000003025.6_Sscrofa11.1_genomic.chr.gff (537 MB)
65586 USMARCv1.0.gff (18 MB)  

I am wondering whether I have missed something.
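
To check a symptom like this, one option is to tally feature lines per sequence name in the input and output files and list the names that disappeared. The sketch below works on in-memory lines; the function names and the chr2 record are illustrative, not taken from the actual files.

```python
from collections import Counter


def count_by_seqid(lines):
    """Count GFF/GTF feature lines per sequence name (column 1)."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        counts[line.split("\t", 1)[0]] += 1
    return counts


def missing_seqids(before, after):
    """Sequence names that have features in `before` but none in `after`."""
    return sorted(set(before) - set(after))


# Illustrative data; with real files, pass open file handles instead.
gff_in = [
    "##gff-version 3",
    "chr1\tGnomon\tgene\t18\t3870\t.\t+\t.\tID=gene-LOC100125545",
    "chr1\tGnomon\texon\t18\t961\t.\t+\t.\tID=exon-1",
    "chr2\tGnomon\tgene\t100\t200\t.\t+\t.\tID=gene-hypothetical",
]
gff_out = [
    "##gff-version 3",
    "CM009086.1\tGnomon\tgene\t281065323\t281069176\t.\t+\t.\tID=gene-LOC100125545",
]

in_counts = count_by_seqid(gff_in)
out_counts = count_by_seqid(gff_out)
print(in_counts)
print(out_counts)
```

Because liftOver renames sequences to the target assembly's names, `missing_seqids` is only meaningful when comparing the input against the `unmapped` file, or two files that use the same naming scheme.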

The chain file that I used is below.
            susScr11 to USMARCv1.0

The gff file was derived from NCBI, and I replaced NC_ identifiers with chr identifiers. 
  GCF_000003025.6_Sscrofa11.1_genomic.chr.gff   


Please let me know. 
Thanks, 

PS. Below are the first lines of each file. CM009086.1 is chr1.


$ head -n 30 GCF_000003025.6_Sscrofa11.1_genomic.chr.gff
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Sscrofa11.1
#!genome-build-accession NCBI_Assembly:GCF_000003025.6
#!annotation-source NCBI Sus scrofa Annotation Release 106
##sequence-region chr1 1 274330532
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9823
chr1    RefSeq  region  1       274330532       .       +       .       ID=NC_010443.5:1..274330532;Dbxref=taxon:9823;Name=1;breed=Duroc;chromosome=1;gbkey=Src;genome=chromosome;isolate=TJ Tabasco;mol_type=genomic DNA;sex=female
chr1    Gnomon  gene    18      3870    .       +       .       ID=gene-LOC100125545;Dbxref=GeneID:100125545;Name=LOC100125545;gbkey=Gene;gene=LOC100125545;gene_biotype=protein_coding
chr1    Gnomon  mRNA    18      3870    .       +       .       ID=rna-XM_021085497.1;Parent=gene-LOC100125545;Dbxref=GeneID:100125545,Genbank:XM_021085497.1;Name=XM_021085497.1;gbkey=mRNA;gene=LOC100125545;model_evidence=Supporting evidence includes similarity to: 2 mRNAs%2C 28 ESTs%2C 2 Proteins%2C 12 long SRA reads%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 182 samples with support for all annotated introns;product=TATA-box binding protein;transcript_id=XM_021085497.1
chr1    Gnomon  exon    18      961     .       +       .       ID=exon-XM_021085497.1-1;Parent=rna-XM_021085497.1;Dbxref=GeneID:100125545,Genbank:XM_021085497.1;gbkey=mRNA;gene=LOC100125545;product=TATA-box binding protein;transcript_id=XM_021085497.1
chr1    Gnomon  exon    2371    2465    .       +       .       ID=exon-XM_021085497.1-2;Parent=rna-XM_021085497.1;Dbxref=GeneID:100125545,Genbank:XM_021085497.1;gbkey=mRNA;gene=LOC100125545;product=TATA-box binding protein;transcript_id=XM_021085497.1
...


[jsahn25@pitzer-login04 usmarc]$ zcat susScr11ToGCA_002844635.1.over.chain.gz | head -n 10
chain 24209742767 chr1 274330532 + 5437 274327149 CM009086.1 281083304 - 17997 274562591 1
1401    0       1
2296    4       0
56      0       2
197     1       0
443     0       1
914     0       1
933     0       1
1018    0       14
66      0       1
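
For readers unfamiliar with the chain header shown above, here is a hedged sketch of how its fields can be read, following the UCSC chain format (`chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id`). The class and function names are illustrative. When the query strand is `-`, the query start and end are given in reverse-strand coordinates, so the sketch also converts them to forward-strand positions.

```python
from typing import NamedTuple


class ChainHeader(NamedTuple):
    score: int
    t_name: str
    t_size: int
    t_strand: str
    t_start: int
    t_end: int
    q_name: str
    q_size: int
    q_strand: str
    q_start: int
    q_end: int
    chain_id: str


def parse_chain_header(line):
    """Parse one 'chain ...' header line into a ChainHeader."""
    f = line.split()
    if len(f) != 13 or f[0] != "chain":
        raise ValueError("not a chain header line: %r" % line)
    return ChainHeader(
        int(f[1]), f[2], int(f[3]), f[4], int(f[5]), int(f[6]),
        f[7], int(f[8]), f[9], int(f[10]), int(f[11]), f[12],
    )


def query_forward_range(h):
    """Return the query interval (0-based, half-open) on the forward strand."""
    if h.q_strand == "-":
        return h.q_size - h.q_end, h.q_size - h.q_start
    return h.q_start, h.q_end


# Header line copied from the zcat output above.
header = parse_chain_header(
    "chain 24209742767 chr1 274330532 + 5437 274327149 "
    "CM009086.1 281083304 - 17997 274562591 1"
)
print(header.t_name, header.q_name, header.q_strand)
print(query_forward_range(header))
```

In this header the query strand is `-`, meaning that this chain aligns chr1 of susScr11 to the reverse strand of CM009086.1.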


$ head -n 30 USMARCv1.0.gff
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Sscrofa11.1
#!genome-build-accession NCBI_Assembly:GCF_000003025.6
#!annotation-source NCBI Sus scrofa Annotation Release 106
##sequence-region chr1 1 274330532
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9823
CM009086.1      Gnomon  gene    281065323       281069176       .       +       .       ID=gene-LOC100125545;Dbxref=GeneID:100125545;Name=LOC100125545;gbkey=Gene;gene=LOC100125545;gene_biotype=protein_coding
CM009086.1      Gnomon  mRNA    281065323       281069176       .       +       .       ID=rna-XM_021085497.1;Parent=gene-LOC100125545;Dbxref=GeneID:100125545,Genbank:XM_021085497.1;Name=XM_021085497.1;gbkey=mRNA;gene=LOC100125545;model_evidence=Supporting evidence includes similarity to: 2 mRNAs%2C 28 ESTs%2C 2 Proteins%2C 12 long SRA reads%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 182 samples with support for all annotated introns;product=TATA-box binding protein;transcript_id=XM_021085497.1
CM009086.1      Gnomon  exon    281065323       281066267       .       +       .       ID=exon-XM_021085497.1-1;Parent=rna-XM_021085497.1;Dbxref=GeneID:100125545,Genbank:XM_021085497.1;gbkey=mRNA;gene=LOC100125545;product=TATA-box binding protein;transcript_id=XM_021085497.1
CM009086.1      Gnomon  exon    281067677       281067771       .       +       .       ID=exon-XM_021085497.1-2;Parent=rna-XM_021085497.1;Dbxref=GeneID:100125545,Genbank:XM_021085497.1;gbkey=mRNA;gene=LOC100125545;product=TATA-box binding protein;transcript_id=XM_021085497.1
...

Jairo Navarro Gonzalez

May 14, 2021, 1:40:09 PM
to Jinsoo Ahn, UCSC Genome Browser Discussion List

Hello,

Thank you for using the UCSC Genome Browser and sending your follow-up.

One of our engineers has recommended using the GTF file we have on hgdownload for susScr11
ncbiRefSeq genes in the 'archive' directory. This file already has the chromosome names translated
to UCSC names and does represent the NCBI RefSeq genes we display in the track on susScr11.

https://hgdownload.soe.ucsc.edu/goldenPath/archive/susScr11/ncbiRefSeq/

Please note the warning that the liftOver command prints immediately when given the -gff argument:

WARNING: -gff is not recommended.
Use 'ldHgGene -out=<file.gp>' and then 'liftOver -genePred <file.gp>'

Even with that gff warning, the liftOver command does convert the GTF file
(susScr11.2021-02-11.ncbiRefSeq.gtf.gz):

liftOver -gff susScr11.2021-02-11.ncbiRefSeq.gtf.gz \
susScr11ToGCA_002844635.1.over.chain.gz USMARCv1.0.gff unmapped

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.

All messages sent to that address are archived on a publicly accessible Google Groups forum.


If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro
UCSC Genome Browser



Jinsoo Ahn

Jun 14, 2021, 1:06:53 PM
to gen...@soe.ucsc.edu, jnav...@ucsc.edu, Kichoon Lee

Hello, 

Thank you for your previous reply, and I am sorry for the late response. The UCSC GTF file (susScr11.2021-02-11.ncbiRefSeq.gtf.gz) worked with the chain file (susScr11ToGCA_002844635.1.over.chain.gz).


However, I think there is a problem with the USMARCv1.0 genome itself, as described below.
1. Chromosome 7 of USMARCv1.0 is "inverted" (p. 9 of the attached Warr paper, and Fig. S9 in its supplementary figures). Chromosomes 1, 5, 6-11, and 13-16 are also inverted (p. 4 of the Warr paper).

2. After liftOver, I also found that some parts of chromosome 7 are partially "re-inverted and inserted", as shown in Fig. 2 (page 3) of README_liftOver.pdf.


So I think USMARCv1.0 should first be put in the correct orientation, and then the inserted parts should be relocated. After that, the chain file may need to be generated again. What do you think?
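
One rough way to see which chromosomes a chain file treats as inverted is to sum, per susScr11 chromosome, the span covered by chains whose query strand is `+` versus `-`. The sketch below is an assumption-laden illustration: `strand_summary` is a hypothetical helper, the span ignores gaps inside each chain, and the second header in the demo is invented.

```python
from collections import defaultdict


def strand_summary(lines):
    """Sum the target span (tEnd - tStart) per target name and query strand."""
    totals = defaultdict(lambda: {"+": 0, "-": 0})
    for line in lines:
        if not line.startswith("chain "):
            continue  # alignment block lines start with a number
        f = line.split()
        t_name, t_start, t_end, q_strand = f[2], int(f[5]), int(f[6]), f[9]
        totals[t_name][q_strand] += t_end - t_start
    return {name: dict(by_strand) for name, by_strand in totals.items()}


demo_chain = [
    # Real header from the chain file quoted earlier in this thread.
    "chain 24209742767 chr1 274330532 + 5437 274327149 "
    "CM009086.1 281083304 - 17997 274562591 1",
    "1401    0       1",
    # Hypothetical header, added only to show a '+' strand chain.
    "chain 1000 chrHyp 5000 + 100 600 HYP001.1 6000 + 50 550 2",
]
summary = strand_summary(demo_chain)
print(summary)
```

A chromosome whose total is dominated by `-` chains is aligned mostly in reverse orientation between the two assemblies. That alone does not say which assembly has the correct orientation.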


Thanks, 

Jinsoo   
 



pig_genome_Warr_2020.pdf
pig_genome_Warr_2020_supple.pdf
README_liftOver.pdf

Daniel Schmelter

Jun 17, 2021, 7:57:12 PM
to Jinsoo Ahn, UCSC Genome Browser Discussion List, Jairo Navarro Gonzalez, Kichoon Lee

Hello Jinsoo,

Thanks for contacting the Genome Browser with your question about the Pig genome.

The UCSC Genome Browser does not serve as a curator of genome assemblies. We only display assemblies and annotations from external groups and consortiums. If you want to submit a correction request, please do so with GenBank and RefSeq.

One of our engineers also could not determine which assembly the Warr 2020 paper was referring to, noting that it could be comparing the more recent with the previous reference assemblies for the pig genome. Specifically, the newer USMARCv1.0 assembly could be correcting incorrect inversions in the susScr11 assembly.

For further communication, please reply-all to gen...@soe.ucsc.edu. Those emails are archived in a public help forum. For private questions, you may send emails instead to genom...@soe.ucsc.edu.

All the best,

Daniel Schmelter
UCSC Genome Browser


Jinsoo Ahn

Jun 18, 2021, 4:59:02 PM
to gen...@soe.ucsc.edu, dsch...@ucsc.edu
 
Hello Daniel,

Thanks for your response. I understand that this issue can be directed to GenBank and RefSeq, or I may contact the corresponding author of the Warr 2020 paper. 

The telomeric end of chromosome 7 is absent from the susScr11 assembly.

As mentioned in the Warr paper, the USMARCv1.0 assembly includes the telomeric end of chromosome 7. However, the whole of chromosome 7 is inverted. When I used the chain file, I also found that some chromosomal regions were not contiguous, which means that certain regions were inserted in unexpected places.

I can contact GenBank and RefSeq or the author. 

Best, 

Jinsoo  

Hiram Clawson

Jun 18, 2021, 5:10:50 PM
to Jinsoo Ahn, gen...@soe.ucsc.edu, dsch...@ucsc.edu
Good Afternoon Jinsoo:

For assembly issues, you should talk to the group that constructed
the assembly. GenBank and RefSeq do not construct the assemblies,
they merely pass them along. You need to find the assembly team
for these assemblies. Information about the assemblies is
in the assembly_report.txt file:

For susScr11:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/GCF_000003025.6_Sscrofa11.1_assembly_report.txt


# Assembly name: Sscrofa11.1
# Description: Sscrofa11 with Y sequences from WTSI_X_Y_pig V2
# Organism name: Sus scrofa (pig)
# Infraspecific name: breed=Duroc
# Taxid: 9823
# BioSample: SAMN02953785
# BioProject: PRJNA13421
# Submitter: The Swine Genome Sequencing Consortium (SGSC)
# Date: 2017-2-7
# Assembly type: haploid
# Release type: major
# Assembly level: Chromosome
# Genome representation: full
# WGS project: AEMK02
# Assembly method: Falcon v. OCT-2015
# Expected final version: Yes
# Genome coverage: 65.0x
# Sequencing technology: PacBio
# RefSeq category: Representative Genome
# GenBank assembly accession: GCA_000003025.6
# RefSeq assembly accession: GCF_000003025.6
# RefSeq assembly and GenBank assemblies identical: no

For the USMARCv1.0 assembly:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/844/635/GCA_002844635.1_USMARCv1.0/GCA_002844635.1_USMARCv1.0_assembly_report.txt

# Assembly name: USMARCv1.0
# Organism name: Sus scrofa (pig)
# Infraspecific name: breed=Cross-bred
# Isolate: 201423004
# Sex: male
# Taxid: 9823
# BioSample: SAMN07325927
# BioProject: PRJNA392765
# Submitter: USDA ARS
# Date: 2017-12-20
# Assembly type: haploid
# Release type: major
# Assembly level: Chromosome
# Genome representation: full
# WGS project: NPJO01
# Assembly method: Celera Assembler v. 8.3rc2
# Expected final version: yes
# Genome coverage: 65.0x
# Sequencing technology: PacBio; Illumina NextSeq 500
# GenBank assembly accession: GCA_002844635.1