Question about methylKit

L Rutter

unread,

Oct 26, 2013, 8:16:53 PM10/26/13

to methylkit_...@googlegroups.com

Hello:

I am using methylKit to compare wasp (P.dominula) methylation data, with queens as the variable and workers as the control. I am following the online methylKit vignette. I have a few questions, and would be very grateful to hear any advice!

I input my data as such:

library(methylKit)

file.list=list("queenDataHasC.txt", "workerDataHasC.txt")

myobj=read(file.list, sample.id=list("queen", "worker"), assembly="refpdom.fasta", treatment=c(0,1), context="CpG")

I am able to output tables and histograms for the following commands:

getMethylationStats (it is not bimodal, though, it is right-skewed)

getCoverageStats (it is unimodal, and not too skewed either direction)

Then, I continue through the vignette:

library("graphics")

meth=unite(myobj, destrand=FALSE)

myDiff10p=get.methylDiff(myDiff,difference=10,qvalue=0.01) (I obtain five DMRs)

However, when I try to annotate the DMR:

gene.obj=read.transcript.features(system.file("extdata", "refpdom.fasta.bed.txt", package = "methylKit"))

I get the error:

Error in if (grepl("^track", f.line)) (skip = 1) :

argument is of length zero

In addition: Warning message:

In file(con, "r") :

file("") only supports open = "w+" and open = "w+b": using the former

Which leads me to three questions:

1) What is the "extdata" used in the Vignette? I did not use it to create "file.list" (even though specified in the Vignette, because I would get an error), but I did use it to create "gene.obj".

2) How is the "refseq.hg18.bed.txt" different than "hg18" in the vignette? What is the .bed.txt file? I have the reference genome "refpdom.fasta", which I used to create "myobj", but I do not know whether or how to create the .bed.txt file (which is why I am not surprised about this error).

3) Is it problematic that my control and variable information is of different length? For instance, my two input files are "queenDataHasC.txt" and "workerDataHasC.txt". One file is 2,521,760 lines long, the other is 2,674,819 lines long. (They were originally 58,828,763 and 57,901,315 lines long. However, for each file, I removed lines with freqC = 0).

By the way, when I do a "head" on "queenHasC.txt" and "workerHasC.txt", they look like:

chrBase chr base strand coverage freqC freqT

PdomScaf0001.221 PdomScaf0001 221 F 10 20 80

PdomScaf0001.288 PdomScaf0001 288 F 24 4.16667 95.8333

PdomScaf0001.290 PdomScaf0001 290 F 25 4 96

PdomScaf0001.298 PdomScaf0001 298 F 25 4 96

PdomScaf0001.300 PdomScaf0001 300 F 25 4 96

PdomScaf0001.303 PdomScaf0001 303 F 25 4 96

PdomScaf0001.322 PdomScaf0001 322 F 24 4.16667 95.8333

PdomScaf0001.328 PdomScaf0001 328 F 22 4.54545 95.4545

PdomScaf0001.331 PdomScaf0001 331 F 21 4.7619 95.238

chrBase chr base strand coverage freqC freqT

PdomScaf0001.43 PdomScaf0001 43 F 7 14.2857 85.7143

PdomScaf0001.51 PdomScaf0001 51 R 8 12.5 87.5

PdomScaf0001.171 PdomScaf0001 171 R 13 7.69231 92.3077

PdomScaf0001.279 PdomScaf0001 279 R 31 3.22581 96.7742

PdomScaf0001.310 PdomScaf0001 310 R 31 3.22581 96.7742

PdomScaf0001.481 PdomScaf0001 481 R 26 3.84615 96.1538

PdomScaf0001.692 PdomScaf0001 692 F 27 3.7037 96.2963

PdomScaf0001.711 PdomScaf0001 711 F 31 3.22581 96.7742

PdomScaf0001.849 PdomScaf0001 849 R 35 5.71429 94.2857

And a "head" on refpdom.fasta looks like:

head refpdom.fasta

>PdomScaf0001:1954..1009999

gccatttgttttcttaatttgtgataactaggatacatttcttctggtattcttttgata

gaacataactgcatatgatttcttcccaatgtaagtataataagaacagttcatttaaaa

atctttcatagttgcagtattgttgtattcttcttatcaatcttgataatagtgagtaga

cacttctagccaaattatttacctcacctgtggaatgaatccatttatcaatcacccata

gtgaatcgatgggtacagcttttgtttttgataaagatcgcagtgaaatcatgcataaag

gaactttgtttttcttttaatcaacggaatatcgtttcctcctaatctccgagtggagtt

tcctttcactaagatcgatagcgtattcaaacaggaaataaaataaagaagaaactacat

attaattttattgtctacgtattcatttttatttacatttaattcgattttattttatat

gattttggataatattctagataacttatcgttttgttctattttattttacagatataa

Thank you for any advice, and thank you for a handy software!

L Rutter

Altuna Akalin

unread,

Nov 8, 2013, 9:49:10 AM11/8/13

to

Which leads me to three questions:
1) What is the "extdata" used in the Vignette? I did not use it to create "file.list" (even though specified in the Vignette, because I would get an error), but I did use it to create "gene.obj".

extdata directory is in the installation folder and contains example data. You should not need that for your own data. Just location of your files. Check the examples and and type "file.list" in R console, you will see it just contains paths to files.

2) How is the "refseq.hg18.bed.txt" different than "hg18" in the vignette? What is the .bed.txt file? I have the reference genome "refpdom.fasta", which I used to create "myobj", but I do not know whether or how to create the .bed.txt file (which is why I am not surprised about this error).

BED file is a tabular text file contains the locations of genomic features (chr,start,end,strand, etc.). You seem to just have a fasta file which contains the sequences of transcripts possibly.

See this post and click the links there to know more about BED format and how to download it. I don't know if you can download it for your species of interest.

http://zvfak.blogspot.ch/2012/09/downloading-annotation-files-for.html

3) Is it problematic that my control and variable information is of different length? For instance, my two input files are "queenDataHasC.txt" and "workerDataHasC.txt". One file is 2,521,760 lines long, the other is 2,674,819 lines long. (They were originally 58,828,763 and 57,901,315 lines long. However, for each file, I removed lines with freqC = 0).

It is not a problem if you are aware of the fact that you can not compare regions/bases that are not covered in one of the samples (if you have one control and one test sample). If you have removed the lines with freqC=0, then it also explains why you don't have bimodal distribution. You removed everything that has zero methylation (I guess that was intentional?).

On Sunday, October 27, 2013 2:16:53 AM UTC+2, L Rutter wrote:

Hello:

I am using methylKit to compare wasp (P.dominula) methylation data, with queens as the variable and workers as the control. I am following the online methylKit vignette. I have a few questions, and would be very grateful to hear any advice!

I input my data as such:

library(methylKit)
file.list=list("queenDataHasC.txt", "workerDataHasC.txt")
myobj=read(file.list, sample.id=list("queen", "worker"), assembly="refpdom.fasta", treatment=c(0,1), context="CpG")

I am able to output tables and histograms for the following commands:

getMethylationStats (it is not bimodal, though, it is right-skewed)
getCoverageStats (it is unimodal, and not too skewed either direction)

Then, I continue through the vignette:

library("graphics")
meth=unite(myobj, destrand=FALSE)
myDiff10p=get.methylDiff(myDiff,difference=10,qvalue=0.01) (I obtain five DMRs)

However, when I try to annotate the DMR:

gene.obj=read.transcript.features(system.file("extdata", "refpdom.fasta.bed.txt", package = "methylKit"))

I get the error:

Error in if (grepl("^track", f.line)) (skip = 1) :
argument is of length zero
In addition: Warning message:
In file(con, "r") :

file("") only supports open = "w+" and open = "w+b": using the former

lrutt...@gmail.com

unread,

Nov 13, 2013, 11:23:34 AM11/13/13

to methylkit_...@googlegroups.com, lrutt...@gmail.com

Hello:

Altuna and I discussed 6-bed format versus 12-bed format, and I am posting the solution I used to get the 12-bed format directly from .gff3 file.

I used the Galaxy link (https://galaxy.cbio.mskcc.org/tool_runner?tool_id=fml_gff2bed). This converts .gff3 to 12-bed format. You must first upload your .gff3 dataset to the history (https://usegalaxy.org/). Then when you go to the link (https://galaxy.cbio.mskcc.org/tool_runner?tool_id=fml_gff2bed), the .gff3 file should be available, and just click "Execute" to convert to 12-column .bed file.

You can expect the number of lines in the files to reduce from .gff3 to 12-bed format.

An explanation from Jen of Galaxy is as follows:

"A reduction in the number of lines is completely expected. You should have the same number of lines in the BED12 file as you have unique "ID"s in the GFF3 file, if you want to double check. This is the same as the number of unique transcripts in any of the files, if summed up correctly.

Why is this? A GFF3 file has one line per feature. Part of an exon, etc. with further (this can vary) annotation that tells you if it is 5' UTR, or CDS, etc. When converting GFF3 -> BED12, all the features for a particular transcript are rolled up into a single line of output. Transcripts will have a unique "ID" by GFF3 specification - and all features that belong to that transcript will have it in the GFF3 file's attributes field (it is usually the first attribute listed).

When you converted just GFF3 -> BED6, each feature was individually converted into a distinct line of BED formatted output, there was no summing up to group the information for the transcript as a whole. "

Thank you!

L. Rutter

shuklas...@gmail.com

unread,

Apr 23, 2018, 8:10:00 AM4/23/18

to methylkit_discussion

How to convert CSV file into bed file??

Reply all

Reply to author

Forward