gff3 to bed conversion

1,791 views
Skip to first unread message

alejandro...@gmail.com

unread,
Feb 28, 2014, 4:04:06 PM2/28/14
to methylkit_...@googlegroups.com
Hi, I want to annotate DMR and I'm working with Arabidopsis thaliana. I only found gff3 files and then I need to convert them to 12BED files.
I was looking in previous discussion but I could not find information about the final structure of the 12BED files that methylKit will accept.
To try to figure out this I starting downloading the "refseq.hg18.bed.txt file mentioned in the methylKit viggnette here:

http://code.google.com/p/methylkit/downloads/detail?name=refseq.hg18.bed.txt

the file do not have headers.

I also downloaded a bed file from hg18 from the UCSC genome browser. I found that the structure of the columns are not equivalent. Then I still can not make sense of the column. I'm particulary worry about the last two columns. then numbers does not look to be the beggining and end of exon or introns.

Can someone please describe each of the 12 columns in the refseq.hg18.bed.txt used as example in the methylKit vignnette?
Thank
Alejandro

Kalyan K Pasumarthy

unread,
Mar 1, 2014, 1:40:29 AM3/1/14
to methylkit_...@googlegroups.com
Dear Alejandro,

MethylKit and the UCSC uses the bed format with 12 columns. Detailed explanation is available from here. I will limit my explanation to the last two columns (11th and 12th).
  • 11th column is about the sizes of the exons. The order of the numbers correspond to the order of the exons. Column no. 10 indicates the total number of exons and one will find an equivalent count of sizes in the 11th column.
  • 12th column is about the distance from where each exon starts with relative to the 'chrstart' column(2nd column). Do not confuse it to the absolute genomic co-ordinate. It is relative to the 2nd column. (sometimes it is possible that 12th column is absent. I dont know why is it so. May be in cases like pseudo genes)
  • An eg for the last two co-ordinates is as follows:
chrom                                             :chr1
chromstart                                      : 52380633
chrromend                                      : 52502307   
No. of exons                                   : 5 (column 10)
Size of each exon            :30,106,106,2108,275,(column 11)   
Start of each exon (wrt chromstart): 0,73995,90797,95114,121399, (column 12)

I have two options you may consider for generating annotation objects for arabidopsis. I hope they may work in methylKit (not sure). Both these options utilise the bioconductor packages. So, I assume that you are familiar with them (Easy to work with). There may be some typos in the names of functions and packges listed below.

Option1
GenomicFeatures
package from bioconductor offers a function makeTranscriptDbFromGFF. You may use this to generate the transcript database from the gff3 file you have. Once you have generated you may
  • generate separate objects for introns, exons, cds, genes, intergenic. Generated objects are 'GRanges' objects
  • Use functions from 'GenomicRanges' package to tweak the cds object to generate TSS ranges
  • combine all the needed 'GRanges' object into a 'GenomicRangesList' object. Note that the objects generated above can not be combined into a 'GRangesList'. To my understanding, every annotation functions that works for 'GRangesList' (used in methylKit) will also work for 'GenomicRangesList' objects
  • A worked out example for gencode human annotation is available on this blog post
Option2
Bioconductor has a ready to use transcript database for arabidopsis. You may check its relevance for your work. It is available from here. There are also other versions available for arabidopsis (search with keyword arabidopsis and look for packages that start with TxDb.Athaliana). Installing one of this package will reduce the task of building your own annotation files. You may then
  • use functions like cds,exons,genes,transcripts from 'GenomicFeatures' package to generate the 'GRanges' objects. You may follow the functions in the above blog post to generate whatever you need.
  • sometimes functions like cdsBy, exonsBy, transcriptsBy may also be useful
  • In case you want to limit your annotation to certain genes or features related to certain genes you may try 'select' function from 'AnnotationDbi' package.

Hope you will be able to work out.


Regards,
Kalyan


--
You received this message because you are subscribed to the Google Groups "methylkit_discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discus...@googlegroups.com.
To post to this group, send email to methylkit_...@googlegroups.com.
Visit this group at http://groups.google.com/group/methylkit_discussion.
For more options, visit https://groups.google.com/groups/opt_out.

Johan Zicola

unread,
Jul 11, 2018, 10:21:09 AM7/11/18
to methylkit_discussion
This reply is probably too late for Alejandro but it will save time for other plant scientists:

Here the method for Arabidopsis thaliana. For other plant species, just look around in ftp://ftp.ensemblgenomes.org/pub/release-39/plants
N.B: I assume you work on a Linux or Mac OS X machine.
  1. Download the gtf file Arabidopsis_thaliana.TAIR10.39.gtf.gz at ftp://ftp.ensemblgenomes.org/pub/release-39/plants/gtf/arabidopsis_thaliana/
  2. Check if chromosome names match with the nomenclature you use in your methylKit objects (for instance replace in the gtf file 1 by Chr1, 2 by Chr2, ... in vim by typing :%s/^\([1-9]\)/Chr\1/g  meaning 'add to all digits at beginning of lines the prefix "Chr"')
  3. Download from http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ the two scripts gtfToGenePred and genePredToBed and put them in your ~/bin/ or wherever you keep your programs
  4. Uncompress your gtf file ($ gunzip Arabidopsis_thaliana.TAIR10.39.gtf.gz )
  5. Generate a genePred file ($ ~/bin/gtfToGenePred Arabidopsis_thaliana.TAIR10.39.gtf   Arabidopsis_thaliana.TAIR10.39.genePred )
  6. Generate a 12bed file ($ ~/bin/genePredToBed Arabidopsis_thaliana.TAIR10.39.genePred   Arabidopsis_thaliana.TAIR10.39.bed )
  7. Sort your bed file by chromosome and start position ($ sort -k 1,1 -k2,2n Arabidopsis_thaliana.TAIR10.39.bed -o Arabidopsis_thaliana.TAIR10.39.bed)
  8. Go in R and upload this bed file ( > readTranscriptFeatures("/path/to/Arabidopsis_thaliana.TAIR10.39.bed")
Your gene.obj is now ready to be used as "feature" argument in annotateWithGeneParts or regionCounts.

Best,

Johan

Alan Hudson

unread,
Aug 31, 2018, 4:40:51 AM8/31/18
to methylkit_discussion
Hi,

I have a similar problem and would appreciate some advice

I am trying to use a RefSeq gff3 file to annotate DMRs with gene parts in methylKit. I have followed both suggested solutions in this thread but with limited success

I have tried using gffToGenePred and genePredToBed, this produces a 12-column bed format file, however when I try to use readTranscriptFeatures() on the resulting bed file, I get:

> Reading the table...
> Calculating intron coordinates...
> Error in rep(1:nrow(ref), ref[, 10]) : invalid 'times' argument

I have also tried extracting the relevant gene feature coordinates using commands in the GenomicFeatures package. However, when the individual GRange objects for each individual gene part (TSS, intron, exon etc.) are combined they form a GenomicRangesList and when I use the command annotateWithGeneParts(), I get the following error message:

> Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘annotateWithGeneParts’ for signature ‘"GRanges", "SimpleGenomicRangesList"’

Any advice on how to overcome either of these obstacles?

Many thanks,

Alan

Alan Hudson

unread,
Aug 31, 2018, 10:13:43 PM8/31/18
to methylkit_discussion
I think I have an answer to at least get readTranscriptFeatures() to recognize my BED files.

I found that the chromosome names from the RefSeq gff3 file I am using, had an "_" as in "NC_XXXXX". By merely removing the underscore from the chromosome name and then converting the gff3 using gffToGenePred and genePredToBed, the resulting BED file now appears to work with readTranscriptFeatures()

reli...@gmail.com

unread,
Oct 10, 2018, 8:23:45 AM10/10/18
to methylkit_discussion
Hi Alan, I have a similar problem to the one you encountered with your bed file and also realized that some of my chromosome IDs have underscores while others do not. I would like to delete the underscores and try the readTranscriptFeatures() again to. Please could you help provide me with the command you used to remove the underscores from your chromosome names? I know very little about scripts.

Many thanks

Relindis

Alan Hudson

unread,
Oct 10, 2018, 4:06:56 PM10/10/18
to methylkit_...@googlegroups.com
Hi Relindis,

I didn't use a script. I just went for the very basic method of opening the files in a text editor (BBEdit) and the using the find and replace functions to replace "_" with "". I did this for both the GFF3 annotation file and the DMR files - saving with a different name to avoid confusing the edited files with the original files.

I guess this method could cause errors but it seemed to work fine for me (and was quick).

Best wishes,

Alan

--
You received this message because you are subscribed to a topic in the Google Groups "methylkit_discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/methylkit_discussion/WC6ZM8GY0kc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to methylkit_discus...@googlegroups.com.

To post to this group, send email to methylkit_...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages