Fwd: Dog genome GTF file


meharji arumilli

Jul 28, 2014, 12:22:15 PM
to gen...@soe.ucsc.edu
Hi,

I am trying to convert a bed file into GTF format for dog genome version canFam3 in the following way:

bigBedToBed canis_familiaris.protein_coding.bb canis_familiaris.protein_coding.bed

and the bed file is converted into GTF format:

bedToGenePred canis_familiaris.protein_coding.bed stdout | genePredToGtf -utr file stdin canis_familiaris.protein_coding.gtf
 
But the resulting GTF file has two problems:

1. The gene_id and transcript_id given for each annotation are the same.

2. Different isoforms of the same gene are generated.

I have reviewed a few posts in this forum to overcome these issues but still couldn't find a fix for this.

Could someone suggest how to generate a correct GTF file?



--
Best Regards
Mehar

Matthew Speir

Aug 6, 2014, 6:04:02 PM
to meharji arumilli, gen...@soe.ucsc.edu
Hi Meharji,

Thank you for your questions about creating a GTF file from a bigBed file. You can use our utility genePredSingleCover to get a genePred file that contains only one isoform per gene. You can download the utility for your system here: http://hgdownload.soe.ucsc.edu/admin/exe/. You can then run this utility on the command line to see the following usage message:

    genePredSingleCover - create single-coverage genePred files

    genePredSingleCover [options] inGenePred outGenePred

    Create a genePred file that have single CDS coverage of the genome.
    UTR is allowed to overlap.  The default is to keep the gene with the
    largest numberr of CDS bases.

    Options:
      -scores=file - read scores used in selecting genes from this file.
       It consists of tab seperated lines of
           name chrom txStart score
       where score is a real or integer number. Higher scoring genes will
       be choosen over lower scoring ones.  Equaly scoring genes are
       choosen by number of CDS bases.  If this option is supplied, all
       genes must be in the file


This will solve the problem of having multiple entries for a single gene in your GTF, but it won't solve the issue of the gene and transcript names being identical. The issue is that your genePred file does not contain a "name2" column, as described under the Gene Predictions (Extended) section here: http://genome.ucsc.edu/FAQ/FAQformat.html#format9. The genePredToGtf utility uses this "name2" column to create the gene name, and if no "name2" column is found it defaults to using the "name" column. This means that genePredToGtf is using the same "name" for both the gene name and the transcript name. I'm not entirely sure how you can overcome this issue. You may be able to write a script that adds a "name2" column that is slightly different from the "name" column, e.g. turning ENPP1 into ENPP1_1 and adding this as "name2". While the names in this "name2" column wouldn't be very informative, this strategy does make your gene and transcript names different.
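A rough sketch of the naming part of such a script, in Python. Note that it is a variation on the idea above rather than the exact suggestion: it keeps the original name as the gene identifier and appends a running suffix to make the transcript identifier, so that repeated names stay distinguishable. The function name assign_ids is made up for illustration, and writing the results back into the "name" and "name2" columns of an extended genePred file is deliberately left out.

```python
from collections import Counter


def assign_ids(names):
    """Return (transcript_id, gene_id) pairs for a list of genePred names.

    gene_id is the original name; transcript_id is the name plus a running
    per-gene suffix, so the two identifiers always differ.
    """
    seen = Counter()
    pairs = []
    for name in names:
        seen[name] += 1
        pairs.append((f"{name}_{seen[name]}", name))
    return pairs


if __name__ == "__main__":
    # Three rows that share one name, as can happen with several isoforms.
    for transcript_id, gene_id in assign_ids(["ENPP1", "ENPP1", "ENPP1"]):
        print(transcript_id, gene_id)
```

Suffixing the transcript side instead of the gene side keeps all rows of one gene grouped under a single gene identifier, which may or may not be what you want.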

Alternatively, you may want to use a different gene set altogether, such as Ensembl Genes. You can use the genePredToGtf utility directly on the ensGene table for canFam3, or you can download the pre-generated GTF file from the Ensembl website, ftp://ftp.ensembl.org/pub/release-75/gtf/canis_familiaris/Canis_familiaris.CanFam3.1.75.gtf.gz. The only issue is that this file will contain genes other than protein coding genes, such as non-coding RNAs and pseudogenes. If that file does not fit your needs, you should be able to get a genePred file from the Table Browser that contains only the protein coding genes from the Ensembl Genes track. Use the following Table Browser settings to get this file:

    1. Navigate to the Table Browser, http://genome.ucsc.edu/cgi-bin/hgTables.
    2. Select the following settings:

        Clade: Mammal
        Organism: Dog
        Assembly: Sep. 2011 (Broad CanFam3.1/canFam3)
        Group: Genes and Gene Predictions
        Track: Ensembl Genes
        Table: ensGene
        Region: genome
        Output Format: selected fields from primary and related tables
        Output File: enter a file name to save your results to a file, or leave blank to display results in the browser

    3. Click 'Create' next to Filter.
    4. Select ensemblSource from the 'Linked Tables' section.
    5. Click 'allow filtering using fields in checked tables'.
    6. Type 'protein_coding' in the 'source' field of the canFam3.ensemblSource table.
            The 'source' line should read: source does match protein_coding
    7. Click 'Submit'.
    8. After you return to the main Table Browser page, click 'get output'.
    9. In the 'Select Fields from canFam3.ensGene' section, check the following checkboxes: name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, score, name2, cdsStartStat, cdsEndStat, exonFrames
    10. Click 'get output'.

You may need to run genePredSingleCover to ensure that there are not multiple entries for a single gene in this file. After that, you should be able to run genePredToGtf to get a GTF file that meets your requirements:
    - One entry per gene
    - gene_id and transcript_id are different from each other
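Before converting, the two requirements can be checked on the tab-separated file itself. The following is only a sketch in Python: it assumes the file has the 15 extended genePred columns in the order given in the FAQ (name, chrom, strand, ... exonFrames), that any header line starts with "#", and the function name check_rows is invented for this example.

```python
GENEPRED_EXT_FIELDS = [
    "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "score", "name2",
    "cdsStartStat", "cdsEndStat", "exonFrames",
]


def check_rows(lines):
    """Return a list of problems found in extended-genePred-style lines."""
    problems = []
    genes_seen = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != len(GENEPRED_EXT_FIELDS):
            problems.append(
                f"line {number}: expected {len(GENEPRED_EXT_FIELDS)} columns, got {len(cols)}"
            )
            continue
        row = dict(zip(GENEPRED_EXT_FIELDS, cols))
        # Requirement: gene (name2) and transcript (name) must differ.
        if row["name"] == row["name2"]:
            problems.append(f"line {number}: name and name2 are the same ({row['name']})")
        # Requirement: one entry per gene.
        if row["name2"] in genes_seen:
            problems.append(f"line {number}: gene {row['name2']} appears more than once")
        genes_seen.add(row["name2"])
    return problems
```

Calling check_rows(open("ensGene.protein_coding.txt")) on the saved Table Browser output (the file name here is a placeholder) returns an empty list when both requirements hold.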

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group
--


mehar

Sep 18, 2015, 12:38:24 PM
to gen...@soe.ucsc.edu
Dear all,

I have downloaded the "canis_familiaris.lncRNA.bb" from the canFam3 trackHub. The bigBed file is converted to bed format. However, we could not find any description of the information in the columns i.e. no header lines. Below is the first line:

chr1    642563  643557  CFRNASEQ_IGNC_Spliced_00005261  1000    -       642563  643557  77,175,74       2       635,115 0,879   5261    <b>Annotation

Could you help me understand the content of the file, especially from column 5 onwards?

Br
Mehar



On 28/07/14 12:04, meharji arumilli wrote:
Hi,

I am trying to convert a bed file into GTF format for dog genome version canFam3. The annotations files which includes

are downloaded in .bb format from the below source:


Each of these files in .bb are turned into a bed file, for instance

bigBedToBed canis_familiaris.protein_coding.bb canis_familiaris.protein_coding.bed

and the bed file is converted into GTF format:

bedToGenePred canis_familiaris.protein_coding.bed stdout | genePredToGtf -utr file stdin canis_familiaris.protein_coding.gtf

Finally, the four GTF files for antisense, lncRNA, protein_coding and unclassified are concatenated into a single GTF file.

 
But the GTF format has the limitation that the gene_id and transcript_id given for each annotation are the same. Since that doesn't meet our needs, I referred to the posts in this group

which suggests to do the following to directly fetch gtf file:

add this four-line file ".hg.conf" to your home directory:
$ cat $HOME/.hg.conf
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral

And set the permissions:

$ chmod 600 .hg.conf

Now you can use the command to extract GTF files directly from the UCSC database. For example, fetch the UCSC gene track from hg19 into the local file knownGene.gtf: 

$ genePredToGtf hg19 knownGene knownGene.gtf



Since my genome version is canFam3, hg19 can be replaced with canFam3 in the above command. To retrieve Ensembl genes we use ensGene, for NCBI annotation we use refGene, and for UCSC annotations we use knownGene in the command. Since the annotations which I am trying to convert are released by the Broad Institute, I wonder which gene type should be used in the command line.

Could someone give any ideas or suggestions to get a refined file of annotations in GTF format which includes 

for dog genome.
--
Best Regards
Mehar

Brian Lee

Sep 18, 2015, 1:38:04 PM
to mehar, gen...@soe.ucsc.edu
Dear Mehar,

Thank you for using the UCSC Genome Browser and your question about interpreting the data in a canFam3 Track Hub in a binary bigBed file called "canis_familiaris.lncRNA.bb".

This binary indexed data includes a description at the very top that outlines what the columns represent. You can take a peek at this information in two ways. One is just to look at the top of the file:
 curl -s https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/canFam3/annotation/canis_familiaris.protein_coding.bb | head -n 21

"Canine annotation BED 12+5 file format" 
(
string chrom; "Reference sequence chromosome or scaffold" 
uint chromStart; "Transcription start position" 
uint chromEnd; "Transcription end position" 
string name; "Name of gene" 
uint score; "Score" 
char[1] strand; "+ or - for strand" 
uint thickStart; "Element start" 
uint thickEnd; "Element end" 
uint reserved; "Element based color" 
int blockCount; "Exon numbers" 
int[blockCount] blockSizes; "Exon sizes" 
int[blockCount] chromStarts; "Start of blocks" 
string ID; "ID of element"
string status; "Annotation status" 
string geneID;"External gene name (estimated)" 
lstring dsnExp; "Expression in DSN libraries (FPKM)" 
lstring polyAExp; "Expression in polyA libraries(FPKM)" 
)

The other is to enter this track (or the Track Hub) into the browser and then navigate to the Table Browser and select the custom track (or the Track Hub) when the canFam3 database is also selected and then click the "describe table schema" button.

To quickly replicate these steps, click this link, which loads the custom track in the Table Browser with canFam3 selected: http://genome.ucsc.edu/cgi-bin/hgTables?db=canFam3&hgct_customText=https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/canFam3/annotation/canis_familiaris.protein_coding.bb Then set the "group:" to "Custom Tracks" and click the "describe table schema" button, and you will see the same information as above along with some example rows from the file.
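Once the field names are known, they can be attached to the columns of the bigBedToBed output. Here is a small Python sketch of that, assuming the BED 12+5 field order listed above for the protein_coding file (other files in the hub may use a different schema). The function name label_columns and the sample row are invented for illustration.

```python
BED12_PLUS5_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "reserved", "blockCount", "blockSizes",
    "chromStarts", "ID", "status", "geneID", "dsnExp", "polyAExp",
]


def label_columns(line):
    """Map one tab-separated BED line to a {field name: value} dict.

    zip() stops at the shorter sequence, so a row with fewer columns
    than the schema simply yields fewer labelled fields.
    """
    cols = line.rstrip("\n").split("\t")
    return dict(zip(BED12_PLUS5_FIELDS, cols))


if __name__ == "__main__":
    # Placeholder row with only the first 10 columns filled in.
    row = label_columns("chr1\t100\t200\texample\t0\t+\t100\t200\t0\t1")
    for field, value in row.items():
        print(f"{field}: {value}")
```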

To read more about the bigBed (and BED) format read our FAQ:
http://genome.ucsc.edu/FAQ/FAQformat.html#format1.5
http://genome.ucsc.edu/goldenPath/help/bigBed.html
http://genome.ucsc.edu/FAQ/FAQformat.html#format1

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genome Bioinformatics Group


--


mehar

Sep 18, 2015, 2:31:49 PM
to Brian Lee, gen...@soe.ucsc.edu
Dear Brian,

Thank you for the detailed explanation. However, I still have two queries:

1. What do the "Element start" and "Element end" coordinates represent? Do they mean the start and end sites of the CDS sequence?

2. chromStart and chromEnd give the transcription start and end positions. Could you suggest a way to get the transcript ID?

And some of the rows are duplicates with same start and end positions. Below is an example which is present thrice with the first 10 columns shown:

chr1    247828  322180  ENPP1   1000    -       247828  322180  55,126,184      24 
chr1    247828  322180  ENPP1   1000    -       247828  322180  55,126,184      25     
chr1    247828  322180  ENPP1   1000    -       247828  322180  55,126,184      25

Could you help me to interpret this.

Br
Mehar

Brian Lee

Sep 19, 2015, 1:51:19 AM
to mehar, gen...@soe.ucsc.edu

Dear Mehar,

Thank you for using the UCSC Genome Browser and your question about interpreting data from a file originating in the Dog canFam3 Public Hub.

Data from Public Hubs originate outside of UCSC, so it is best to contact the hub provider to inquire more about the data.

Unfortunately, your last email was sent as a reply to a thread from over a year ago, so my response described the protein-coding file referenced there, when you were actually asking about a new file: canis_familiaris.lncRNA.bb.

You can repeat the steps mentioned before to see the underlying column names:

curl -s https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/canFam3/annotation/canis_familiaris.lncRNA.bb | head -n 20

Or load the file in the Table Browser by selecting "Custom Tracks" and click the "describe table schema" button: http://genome.ucsc.edu/cgi-bin/hgTables?db=canFam3&hgct_customText=https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/canFam3/annotation/canis_familiaris.lncRNA.bb

If you go to the Public Hubs page for this hub, http://genome.ucsc.edu/cgi-bin/hgHubConnect?hubSearchTerms=Broad+Improved+Canine+Annotation+v1, and click the "Hub Name" for the "Broad Improved Canine Annotation v1" hub, you will arrive at the source of the external hub.txt, https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/hub.txt, where you will find an email contact listed: email jjoh...@broadinstitute.org

Please take questions about hub data to the hub contact, for example your question about why some of the rows have duplicate start and end positions. You should also review the documentation that public hubs have already put together before contacting them. For example, there is an HTML page outlining the lincRNAs: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=canFam3&g=hub_16627_LNCRNA&hubUrl=https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/hub.txt

Please note that page suggests that "For questions or more information on the data in this Track Hub, please contact the Broad Institute Vertebrate Genome Biology group: vertebra...@broadinstitute.org"

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genome Bioinformatics Group

mehar

Sep 21, 2015, 12:36:17 PM
to Brian Lee, gen...@soe.ucsc.edu
Dear Brian,

Sorry for the confusion. I did manage to find the table schema for canis_familiaris.lncRNA.bb as you suggested. The query about duplicate start and end positions concerns the *protein_coding.bb file, and I will raise it with the hub contact.

Br
Mehar