Question about methylKit

1,035 views
Skip to first unread message

L Rutter

unread,
Oct 26, 2013, 8:16:53 PM10/26/13
to methylkit_...@googlegroups.com
Hello:

I am using methylKit to compare wasp (P.dominula) methylation data, with queens as the variable and workers as the control. I am following the online methylKit vignette. I have a few questions, and would be very grateful to hear any advice!

I input my data as such:

library(methylKit)
file.list=list("queenDataHasC.txt", "workerDataHasC.txt")
myobj=read(file.list, sample.id=list("queen", "worker"), assembly="refpdom.fasta", treatment=c(0,1), context="CpG")

I am able to output tables and histograms for the following commands:

getMethylationStats (it is not bimodal, though, it is right-skewed)
getCoverageStats (it is unimodal, and not too skewed either direction)

Then, I continue through the vignette:

library("graphics")
meth=unite(myobj, destrand=FALSE)
myDiff10p=get.methylDiff(myDiff,difference=10,qvalue=0.01) (I obtain five DMRs)

However, when I try to annotate the DMR:

gene.obj=read.transcript.features(system.file("extdata", "refpdom.fasta.bed.txt", package = "methylKit"))

I get the error:

Error in if (grepl("^track", f.line)) (skip = 1) : 
  argument is of length zero
In addition: Warning message:
In file(con, "r") :
  file("") only supports open = "w+" and open = "w+b": using the former

Which leads me to three questions:
1) What is the "extdata" used in the Vignette? I did not use it to create "file.list" (even though specified in the Vignette, because I would get an error), but I did use it to create "gene.obj".

2) How is the "refseq.hg18.bed.txt" different than "hg18" in the vignette? What is the .bed.txt file? I have the reference genome "refpdom.fasta", which I used to create "myobj", but I do not know whether or how to create the .bed.txt file (which is why I am not surprised about this error).

3) Is it problematic that my control and variable information is of different length? For instance, my two input files are "queenDataHasC.txt" and "workerDataHasC.txt". One file is 2,521,760 lines long, the other is 2,674,819 lines long. (They were originally 58,828,763 and 57,901,315 lines long. However, for each file, I removed lines with freqC = 0).

By the way, when I do a "head" on "queenHasC.txt" and "workerHasC.txt", they look like:

chrBase chr     base    strand  coverage        freqC   freqT
PdomScaf0001.221        PdomScaf0001    221     F       10      20      80
PdomScaf0001.288        PdomScaf0001    288     F       24      4.16667 95.8333
PdomScaf0001.290        PdomScaf0001    290     F       25      4       96
PdomScaf0001.298        PdomScaf0001    298     F       25      4       96
PdomScaf0001.300        PdomScaf0001    300     F       25      4       96
PdomScaf0001.303        PdomScaf0001    303     F       25      4       96
PdomScaf0001.322        PdomScaf0001    322     F       24      4.16667 95.8333
PdomScaf0001.328        PdomScaf0001    328     F       22      4.54545 95.4545
PdomScaf0001.331        PdomScaf0001    331     F       21      4.7619  95.238

chrBase chr     base    strand  coverage        freqC   freqT
PdomScaf0001.43 PdomScaf0001    43      F       7       14.2857 85.7143
PdomScaf0001.51 PdomScaf0001    51      R       8       12.5    87.5
PdomScaf0001.171        PdomScaf0001    171     R       13      7.69231 92.3077
PdomScaf0001.279        PdomScaf0001    279     R       31      3.22581 96.7742
PdomScaf0001.310        PdomScaf0001    310     R       31      3.22581 96.7742
PdomScaf0001.481        PdomScaf0001    481     R       26      3.84615 96.1538
PdomScaf0001.692        PdomScaf0001    692     F       27      3.7037  96.2963
PdomScaf0001.711        PdomScaf0001    711     F       31      3.22581 96.7742
PdomScaf0001.849        PdomScaf0001    849     R       35      5.71429 94.2857

And a "head" on refpdom.fasta looks like:

head refpdom.fasta 
>PdomScaf0001:1954..1009999
gccatttgttttcttaatttgtgataactaggatacatttcttctggtattcttttgata
gaacataactgcatatgatttcttcccaatgtaagtataataagaacagttcatttaaaa
atctttcatagttgcagtattgttgtattcttcttatcaatcttgataatagtgagtaga
cacttctagccaaattatttacctcacctgtggaatgaatccatttatcaatcacccata
gtgaatcgatgggtacagcttttgtttttgataaagatcgcagtgaaatcatgcataaag
gaactttgtttttcttttaatcaacggaatatcgtttcctcctaatctccgagtggagtt
tcctttcactaagatcgatagcgtattcaaacaggaaataaaataaagaagaaactacat
attaattttattgtctacgtattcatttttatttacatttaattcgattttattttatat
gattttggataatattctagataacttatcgttttgttctattttattttacagatataa

Thank you for any advice, and thank you for a handy software!
L Rutter

Altuna Akalin

unread,
Nov 8, 2013, 9:49:10 AM11/8/13
to
Which leads me to three questions:
1) What is the "extdata" used in the Vignette? I did not use it to create "file.list" (even though specified in the Vignette, because I would get an error), but I did use it to create "gene.obj".

   extdata directory is in the  installation folder and contains example data. You should not need that for your own data.  Just location of your files. Check the examples and and type "file.list" in R console, you will see it just contains paths to files.

 
2) How is the "refseq.hg18.bed.txt" different than "hg18" in the vignette? What is the .bed.txt file? I have the reference genome "refpdom.fasta", which I used to create "myobj", but I do not know whether or how to create the .bed.txt file (which is why I am not surprised about this error).

BED file is a tabular text file contains the locations of genomic features (chr,start,end,strand, etc.). You seem to just have a fasta file which contains the sequences of transcripts possibly. 
 See this post and click the links there to know more about BED format and how to download it. I don't know if you can download it for your species of interest.
 
3) Is it problematic that my control and variable information is of different length? For instance, my two input files are "queenDataHasC.txt" and "workerDataHasC.txt". One file is 2,521,760 lines long, the other is 2,674,819 lines long. (They were originally 58,828,763 and 57,901,315 lines long. However, for each file, I removed lines with freqC = 0).

It is not a problem if you are aware of the fact that you can not compare regions/bases that are not covered in one of the samples (if you have one control and one test sample). If you have removed the lines with freqC=0, then it also explains why you don't have bimodal distribution. You removed everything that has zero methylation (I guess that was intentional?). 


On Sunday, October 27, 2013 2:16:53 AM UTC+2, L Rutter wrote:
Hello:

I am using methylKit to compare wasp (P.dominula) methylation data, with queens as the variable and workers as the control. I am following the online methylKit vignette. I have a few questions, and would be very grateful to hear any advice!

I input my data as such:

library(methylKit)
file.list=list("queenDataHasC.txt", "workerDataHasC.txt")
myobj=read(file.list, sample.id=list("queen", "worker"), assembly="refpdom.fasta", treatment=c(0,1), context="CpG")

I am able to output tables and histograms for the following commands:

getMethylationStats (it is not bimodal, though, it is right-skewed)
getCoverageStats (it is unimodal, and not too skewed either direction)

Then, I continue through the vignette:

library("graphics")
meth=unite(myobj, destrand=FALSE)
myDiff10p=get.methylDiff(myDiff,difference=10,qvalue=0.01) (I obtain five DMRs)

However, when I try to annotate the DMR:

gene.obj=read.transcript.features(system.file("extdata", "refpdom.fasta.bed.txt", package = "methylKit"))

I get the error:

Error in if (grepl("^track", f.line)) (skip = 1) : 
  argument is of length zero
In addition: Warning message:
In file(con, "r") :
  file("") only supports open = "w+" and open = "w+b": using the former


lrutt...@gmail.com

unread,
Nov 13, 2013, 11:23:34 AM11/13/13
to methylkit_...@googlegroups.com, lrutt...@gmail.com
Hello:

Altuna and I discussed 6-bed format versus 12-bed format, and I am posting the solution I used to get the 12-bed format directly from .gff3 file.

I used the Galaxy link (https://galaxy.cbio.mskcc.org/tool_runner?tool_id=fml_gff2bed). This converts .gff3 to 12-bed format. You must first upload your .gff3 dataset to the history (https://usegalaxy.org/). Then when you go to the link (https://galaxy.cbio.mskcc.org/tool_runner?tool_id=fml_gff2bed), the .gff3 file should be available, and just click "Execute" to convert to 12-column .bed file.

You can expect the number of lines in the files to reduce from .gff3 to 12-bed format.

An explanation from Jen of Galaxy is as follows:

"A reduction in the number of lines is completely expected. You should have the same number of lines in the BED12 file as you have unique "ID"s in the GFF3 file, if you want to double check. This is the same as the number of unique transcripts in any of the files, if summed up correctly.

Why is this? A GFF3 file has one line per feature. Part of an exon, etc. with further (this can vary) annotation that tells you if it is 5' UTR, or CDS, etc. When converting GFF3 -> BED12, all the features for a particular transcript are rolled up into a single line of output. Transcripts will have a unique "ID" by GFF3 specification - and all features that belong to that transcript will have it in the GFF3 file's attributes field (it is usually the first attribute listed).

When you converted just GFF3 -> BED6, each feature was individually converted into a distinct line of BED formatted output, there was no summing up to group the information for the transcript as a whole. "

Thank you!

L. Rutter

shuklas...@gmail.com

unread,
Apr 23, 2018, 8:10:00 AM4/23/18
to methylkit_discussion
How to convert CSV file into bed file??
Reply all
Reply to author
Forward
0 new messages