--
Hi, I am trying to convert BED files into GTF format for the dog genome, version canFam3. The annotation files were
downloaded in .bb (bigBed) format from the source below:
- canis_familiaris.antisense.bb
- canis_familiaris.lncRNA.bb
- canis_familiaris.protein_coding.bb
- canis_familiaris.unclassified.bb
Each of these .bb files is converted into a BED file, for instance:
bigBedToBed canis_familiaris.protein_coding.bb canis_familiaris.protein_coding.bed
and the BED file is then converted into GTF format:
bedToGenePred canis_familiaris.protein_coding.bed stdout | genePredToGtf -utr file stdin canis_familiaris.protein_coding.gtf
Finally, the four GTF files for antisense, lncRNA, protein_coding and unclassified are concatenated into a single GTF file.
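The steps described so far could be wrapped in one small shell function. This is only a sketch: it assumes the four .bb files sit in the current directory and that bigBedToBed, bedToGenePred and genePredToGtf are on the PATH. The function name convert_all and the output file name are hypothetical.

```shell
# Convert each Broad canFam3 bigBed file to GTF, then concatenate the results.
# Assumes the UCSC tools bigBedToBed, bedToGenePred and genePredToGtf are installed.
convert_all() {
    out=$1                      # combined GTF path, chosen by the caller
    : > "$out"                  # start with an empty output file
    for kind in antisense lncRNA protein_coding unclassified; do
        base="canis_familiaris.$kind"
        bigBedToBed "$base.bb" "$base.bed" || return 1
        bedToGenePred "$base.bed" stdout |
            genePredToGtf -utr file stdin "$base.gtf" || return 1
        cat "$base.gtf" >> "$out"
    done
}
```

It would be called as, for example, convert_all canis_familiaris.all.gtf. The result has the same limitation as the manual steps: gene_id and transcript_id are identical.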
But the resulting GTF has the limitation that the gene_id and transcript_id given for each annotation are the same. Since that doesn't meet our needs, I referred to the posts in this group,
which suggest doing the following to fetch a GTF file directly:
Add this file, ".hg.conf", to your home directory:
$ cat $HOME/.hg.conf
db.host=genome-mysql.cse.ucsc.edu
db.user=genomep
db.password=password
central.db=hgcentral
And set the permissions:
$ chmod 600 .hg.conf
Now you can use the command to extract GTF files directly from the UCSC database. For example, fetch the UCSC gene track from hg19 into the local file knownGene.gtf:
$ genePredToGtf hg19 knownGene knownGene.gtf
Since my genome version is canFam3, hg19 can be replaced with canFam3 in the above command. To retrieve Ensembl genes we use ensGene, for NCBI annotation we use refGene, and for UCSC annotations we use knownGene in the command. Since the annotations I am trying to convert are released by the Broad Institute, I wonder which table name should be used on the command line. Could someone give any ideas or suggestions for getting a refined annotation file in GTF format for the dog genome?
Best Regards
Mehar
curl -s https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/canFam3/annotation/canis_familiaris.protein_coding.bb | head -n 21
"Canine annotation BED 12+5 file format"
(
    string chrom; "Reference sequence chromosome or scaffold"
    uint chromStart; "Transcription start position"
    uint chromEnd; "Transcription end position"
    string name; "Name of gene"
    uint score; "Score"
    char[1] strand; "+ or - for strand"
    uint thickStart; "Element start"
    uint thickEnd; "Element end"
    uint reserved; "Element based color"
    int blockCount; "Exon numbers"
    int[blockCount] blockSizes; "Exon sizes"
    int[blockCount] chromStarts; "Start of blocks"
    string ID; "ID of element"\
    string status; "Annotation status"
    string geneID;"External gene name (estimated)"
    lstring dsnExp; "Expression in DSN libraries (FPKM)"
    lstring polyAExp; "Expression in polyA libraries(FPKM)"
)
The other way is to enter this track (or the Track Hub) into the browser, navigate to the Table Browser, select the custom track (or the Track Hub) with the canFam3 database also selected, and then click the "describe table schema" button.
To quickly replicate these steps, click this link, which loads the custom track in the Table Browser with canFam3 selected:
http://genome.ucsc.edu/cgi-bin/hgTables?db=canFam3&hgct_customText=https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/canFam3/annotation/canis_familiaris.protein_coding.bb
Then set the "group:" to "Custom Tracks" and click the "describe table schema" button, and you will see the same information as above with some example information from the file.
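Given the column layout shown above, one possible way to get a GTF with distinct gene_id and transcript_id values is to build the GTF lines straight from the BED file with awk, picking each ID from its own column. The sketch below is not a UCSC tool. The function name bed12_to_gtf, the "broad" source label, and the default choice of column 4 (name) for gene_id and column 13 (ID) for transcript_id are all assumptions; which column really holds the gene or transcript identifier should be confirmed with the hub provider. It writes only transcript and exon lines, with no CDS or UTR features.

```shell
# Convert tab-separated BED12+ rows on stdin to GTF transcript/exon lines.
# $1 = column used for gene_id       (default 4,  "name": an assumption)
# $2 = column used for transcript_id (default 13, "ID":   an assumption)
# $3 = GTF source label              (default "broad": hypothetical)
bed12_to_gtf() {
    awk -F '\t' -v OFS='\t' -v gene_col="${1:-4}" -v tx_col="${2:-13}" \
        -v src="${3:-broad}" '
    NF >= 12 {
        attr = "gene_id \"" $gene_col "\"; transcript_id \"" $tx_col "\";"
        # BED starts are 0-based; GTF starts are 1-based. Ends are unchanged.
        print $1, src, "transcript", $2 + 1, $3, ".", $6, ".", attr
        n = split($11, size, ",")      # blockSizes
        split($12, offset, ",")        # chromStarts, relative to chromStart
        for (i = 1; i <= $10 && i <= n; i++) {
            exon_start = $2 + offset[i] + 1
            exon_end   = $2 + offset[i] + size[i]
            print $1, src, "exon", exon_start, exon_end, ".", $6, ".", attr
        }
    }'
}
```

For example, bed12_to_gtf 15 13 < canis_familiaris.protein_coding.bed > canis_familiaris.protein_coding.gtf would take gene_id from the geneID column instead of the name column.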
To read more about the bigBed (and BED) format read our FAQ:
http://genome.ucsc.edu/FAQ/FAQformat.html#format1.5
http://genome.ucsc.edu/goldenPath/help/bigBed.html
http://genome.ucsc.edu/FAQ/FAQformat.html#format1
Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
All the best,
Brian Lee
UCSC Genome Bioinformatics Group
--
Dear Mehar,
Thank you for using the UCSC Genome Browser and for your question about interpreting data from a file originating in the Dog canFam3 Public Hub.
Data from Public Hubs originate outside of UCSC, so it is best to contact the hub provider to inquire more about the data.
Unfortunately, your last email replied to a thread from over a year ago, so I responded about the protein-coding file referenced there, when you were actually asking about a new file: canis_familiaris.lncRNA.bb.
You can repeat the steps mentioned before to see the underlying column names:
curl -s https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/canFam3/annotation/canis_familiaris.lncRNA.bb | head -n 20
Or load the file in the Table Browser, select "Custom Tracks" and click the "describe table schema" button: http://genome.ucsc.edu/cgi-bin/hgTables?db=canFam3&hgct_customText=https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/canFam3/annotation/canis_familiaris.lncRNA.bb
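Once the column definitions are in hand, it can help to number them so each field name can be matched to a BED column position for cut or awk. The following is a minimal sketch, assuming the autoSql text has been saved as clean text with one field per line and the "(" and ")" on their own lines. The function name autosql_columns is hypothetical.

```shell
# Print "column-number<TAB>field-name" for each field of an autoSql
# definition read from stdin.
autosql_columns() {
    awk '
    /^[ \t]*\(/ { inside = 1; next }   # a line starting with "(" opens the field list
    /^[ \t]*\)/ { inside = 0 }         # a line starting with ")" closes it
    inside && /;/ {
        line = $0
        sub(/;.*/, "", line)           # drop the ";" and the quoted description
        n = split(line, words)         # whitespace split: type, then field name
        printf "%d\t%s\n", ++col, words[n]
    }'
}
```

Fed the protein_coding definition quoted earlier in this thread, it would report, for instance, that geneID is column 15.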
If you go to the Public Hubs page for this hub, http://genome.ucsc.edu/cgi-bin/hgHubConnect?hubSearchTerms=Broad+Improved+Canine+Annotation+v1, and click the "Hub Name" for the "Broad Improved Canine Annotation v1" hub, you will arrive at the hub's external hub.txt file, https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/hub.txt, where you will find an email contact listed: email jjoh...@broadinstitute.org
Please take questions about hub data, for example your question about why some of the rows have duplicate start and end positions, to the hub contact. You should also review the documentation that public hubs have already put together before contacting them. For example, there is an HTML page outlining the lincRNAs: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=canFam3&g=hub_16627_LNCRNA&hubUrl=https://www.broadinstitute.org/ftp/pub/vgb/dog/trackHub/hub.txt
Please note that page suggests that "For questions or more information on the data in this Track Hub, please contact the Broad Institute Vertebrate Genome Biology group:vertebra...@broadinstitute.org"
All the best,
Brian Lee
UCSC Genome Bioinformatics Group