MethylKit PCA Analysis

jinga...@gmail.com

unread,

Sep 8, 2014, 3:58:50 AM9/8/14

to methylkit_...@googlegroups.com

Hi,

I am using methylKit to do a methylation analysis on a MethylCAP-Bisulfite-treated sequencing data (that is, DNA fragments were pulled down with MethylCap and then, underwent bisulfite treatment followed by sequencing). I have a few questions about methylKit:

1) How does methylKit calculate PCA? Is it based on the methylation level at each base pair or does methylKit create clusters and calculates a cumulative methylation

score for a particular region or cluster and then use the methylation score for each region to do a PCA analysis? What is the information that the methylKit input to prcomp() function?

2) How can I download/capture this information (the methylation profile for each tissue sample) that methylKit submits to the prcomp() function?

3) Also, if I have certain regions of interest (for example, promoters or CpG islands) and I have their chromosomal positions in the genome, how can I get the methylation scores for these regions of interest? Using methylKit, is there a way to filter out everything so that I only have the methylation scores for these regions of interest?

4) I was told that I could get the loading matrix for the PCA from pca$rotation (see below), but there are only columns named PCA1, 2, 3, 4.......I am guessing that each number in these columns represent the "loading" (weight) for each variable for a principal component. So, shouldn't there also be a column for the variable, in this case, the variable would be the methylated regions / positions in the genome that contribute to, for example, PCA1? For example, in the PCA1 column below, I do not know what methylated region / position has the loading of "-0.0257456039"?

Loading matrix:

PCA1	PCA2	PCA3	PCA4
-0.0257456039	-0.045478216	-0.00483179	0.0210543866
-0.0442088488	-0.0081143345	-0.0009079279	0.0350200662
-0.02690392	-0.0285829765	-0.0242796867	0.0123196758
-0.0256274555	-0.0559394237	0.0523019875	0.041726609
-0.003918779	0.0343353494	0.054446566	-0.0098142739

Thank you.

Best regards,

Jing Xian

Altuna Akalin

unread,

Sep 8, 2014, 6:28:26 PM9/8/14

to methylkit_...@googlegroups.com

1) How does methylKit calculate PCA? Is it based on the methylation level at each base pair or does methylKit create clusters and calculates a cumulative methylation
score for a particular region or cluster and then use the methylation score for each region to do a PCA analysis? What is the information that the methylKit input to prcomp() function?

with transpose=TRUE transpose of percent methylation matrix goes into prcomp()

with transpose=FALSE percent methylation matrix as it is goes into prcomp()

2) How can I download/capture this information (the methylation profile for each tissue sample) that methylKit submits to the prcomp() function?

http://www.rdocumentation.org/packages/methylKit/functions/percMethylation-methods

3) Also, if I have certain regions of interest (for example, promoters or CpG islands) and I have their chromosomal positions in the genome, how can I get the methylation scores for these regions of interest? Using methylKit, is there a way to filter out everything so that I only have the methylation scores for these regions of interest?

you can use [] notation to subset methylkit objects. within [] there should be a logical vector could be obtained from GenomicRanges operations.

example:

# mbase is a methylBase object

# CpGi is a GRanges object containing CpG islands

mbase[ as(mbase,"GRanges") %over% CpGi, ]

search the forum, see the vignette, there are many examples.

if transpose=TRUE, variables would be regions/bases yes. They don't have a variable name. I suggest you take the percent methylation matrix and do your own PCA using prcomp() you will have better control. Also PCASamples removes rows of percent methylation matrix that has low variation.

Best,

Altuna

Thank you.

Best regards,
Jing Xian

--
You received this message because you are subscribed to the Google Groups "methylkit_discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discus...@googlegroups.com.
To post to this group, send email to methylkit_...@googlegroups.com.
Visit this group at http://groups.google.com/group/methylkit_discussion.
For more options, visit https://groups.google.com/d/optout.

chongall...@gmail.com

unread,

Sep 10, 2014, 7:35:39 AM9/10/14

to methylkit_...@googlegroups.com

Dear Altuna,

I was planning to ask the same thing but found your reply in this post. However, I am still a little confused.

1) When I did:

mbase[ as(mbase,"GRanges") %over% CpGi, ]

I got the following result:

"chr1" 845810 845810 "+" 12 12 0 11 11 0

"chr1" 845816 845816 "+" 12 12 0 11 11 0

"chr1" 854876 854876 "+" 17 17 0 22 5 17

"chr1" 854951 854951 "+" 14 3 11 10 10 0

"chr1" 854957 854957 "+" 14 12 2 15 13 2

"chr1" 854960 854960 "+" 14 14 0 15 15 0

"chr1" 854966 854966 "+" 11 11 0 15 8 7

"chr1" 854972 854972 "+" 11 11 0 15 3 12

From what I understand the columns denote the following:

1. chr

2. start position

3. end position

4. strands

5. coverage1

6. numCs1

7. numTs1

8. coverage2

9. numCs2

10. numTs2

Therefore, my questions are:

a) What is coverage1? numCs1? and numTs1? I know that they represent values for my first sample.

Is "coverage" the number of times that base is covered by the sequence reads?

Does "numCs" represent the number of methylated cytosines (since these have not been converted by the bisulfite treatment) identified by the reads? So, for

"chr1" 854966 854966 "+" 11 11 0 15 8 7

In the first sample, there were 11 reads that covered this base and in all 11 reads, the base identified at this position by all 11 reads is cytosine.

Does "numTs" represent the number of thymine? As in the above example,

"chr1" 854966 854966 "+" 11 11 0 15 8 7

does this mean that of the 15 reads that covered this base at this position (854966), 7 were converted to thymine by the bisulfite treatment?

b) When methylKit does a hierarchical clustering of the samples based on the methylation profile of the samples, what does it use for the to as the methylation level value? Does it use the "coverage" value or the "numCs" value or the "numTs" value or perhaps a ratio of numCs" to "numTs"? Is the hierarchical clustering based on the methylation level at EACH base pair?

2) Is it possible to produce a "heatmap" for the hierarchical clustering?

3) Based on the annotation of the UCSC bed file, I noticed that

"chr1" 845810 845810 "+" 12 12 0 11 11 0

"chr1" 845816 845816 "+" 12 12 0 11 11 0

belong to a single annotated CpG island. Therefore, I was wondering how can one calculate the cumulative methylation value for this CpG island?

4) Finally, when I do the following:

> getMethylationStats(myobj[[2]],plot=FALSE,both.strands=FALSE)

methylation statistics per base

summary:

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.000 0.000 5.263 36.020 86.050 100.000

percentiles:

0% 10% 20% 30% 40%

0.000000 0.000000 0.000000 0.000000 0.000000

50% 60% 70% 80% 90%

5.263158 40.000000 76.470588 91.666667 97.468354

95% 99% 99.5% 99.9% 100%

100.000000 100.000000 100.000000 100.000000 100.000000

At 50%, you can see the value is "5.263158". What does the value "5.263158" represent? Is this the percentage of bases that are 50% methylated? What does it mean? How do I interpret this table?

I'm sorry if my questions seem a little naive but I am new to this and could really do with some help.

Thanks in advance.

Best,

Allen

To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discussion+unsub...@googlegroups.com.

Altuna Akalin

unread,

Sep 10, 2014, 1:35:47 PM9/10/14

to methylkit_...@googlegroups.com

Therefore, my questions are:

a) What is coverage1? numCs1? and numTs1? I know that they represent values for my first sample.

You can infer this from vignette and other sources such as presentations. If you are interested in % methylation values, use the appropriate function, it is less confusing.

Is "coverage" the number of times that base is covered by the sequence reads?

yes

Does "numCs" represent the number of methylated cytosines (since these have not been converted by the bisulfite treatment) identified by the reads? So, for
"chr1" 854966 854966 "+" 11 11 0 15 8 7
In the first sample, there were 11 reads that covered this base and in all 11 reads, the base identified at this position by all 11 reads is cytosine.

yes

Does "numTs" represent the number of thymine? As in the above example,
"chr1" 854966 854966 "+" 11 11 0 15 8 7

Yes

does this mean that of the 15 reads that covered this base at this position (854966), 7 were converted to thymine by the bisulfite treatment?

Yes

b) When methylKit does a hierarchical clustering of the samples based on the methylation profile of the samples, what does it use for the to as the methylation level value? Does it use the "coverage" value or the "numCs" value or the "numTs" value or perhaps a ratio of numCs" to "numTs"? Is the hierarchical clustering based on the methylation level at EACH base pair?

Clustering is done using % methylation values, it uses the ratio of Cs to Cs+Ts as far as I remember

2) Is it possible to produce a "heatmap" for the hierarchical clustering?

it wouldn't be wise to produce a heatmap for millions of methylation values. You can try it on the matrix you get from percMethylation(), i don't recommend it.

3) Based on the annotation of the UCSC bed file, I noticed that

"chr1" 845810 845810 "+" 12 12 0 11 11 0
"chr1" 845816 845816 "+" 12 12 0 11 11 0

belong to a single annotated CpG island. Therefore, I was wondering how can one calculate the cumulative methylation value for this CpG island?

Cumulative methylation values are calculated by summing up numCs and numTs for bases that overlap with that region. See examples of such procedure at Meissner papers . You can think of that as weighted mean for the regional methylation.

4) Finally, when I do the following:

> getMethylationStats(myobj[[2]],plot=FALSE,both.strands=FALSE)
methylation statistics per base
summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.000 5.263 36.020 86.050 100.000
percentiles:
0% 10% 20% 30% 40%
0.000000 0.000000 0.000000 0.000000 0.000000
50% 60% 70% 80% 90%
5.263158 40.000000 76.470588 91.666667 97.468354
95% 99% 99.5% 99.9% 100%
100.000000 100.000000 100.000000 100.000000 100.000000

At 50%, you can see the value is "5.263158". What does the value "5.263158" represent? Is this the percentage of bases that are 50% methylated? What does it mean? How do I interpret this table?

I suggest you read up on percentiles and quantiles for a given set of values, in this case those set of values are % methylation values per base. 50th percentile is equivalent to median. Maybe percentage signs in the output are confusing you.

http://en.wikipedia.org/wiki/Percentile

Best,

Altuna

To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discus...@googlegroups.com.

Allen Chong

unread,

Sep 11, 2014, 6:47:40 AM9/11/14

to methylkit_...@googlegroups.com

Dear Altuna,

Thanks for your reply.

1) I was looking through the methylation profile file for one of my samples and I noticed that, in general, only bases that have been methylated are recorded for "coverage", "numCs", and "numTs". That is to say, that if a base was not methylated then there is generally no information about it. However, I found that on some occasions, it was reported. That is to say, I found, for example,

"chr1" 28739459 28739459 "-" 27 0 27

So, this is an unmethylated base that was reported. Why? Why does MethylKit sometimes report these unmethylated bases and sometimes throws them out. Please note that this is the methylation profile for only one sample.

2) I also noticed in https://code.google.com/p/methylkit/ that it says:

"# merge all samples to one table by using base-pair locations that are covered in all samples

# setting destrand=TRUE, will merge reads on both strans of a CpG dinucleotide. This provides better

# coverage, but only advised when looking at CpG methylation"

meth=unite(myobj,destrand=TRUE)

It says that it will use "base-pair locations that are covered in ALL samples". Therefore, am I right to understand that if I have 10 samples and for a particular location/base, 8 samples have values for "coverage", "numCs", and "numTs" but the remaining 2 samples do not have any information for this particular base, then MethylKit will not include the information for this base in the meth=unite(myobj,destrand=TRUE) function?

If this is true, isn't this criteria a little too stringent? Because of 2 samples, all information for this base would be thrown out? Can this be avoided?

3) it also says that this is "only advised when looking at CpG methylation". I am a little confused. What else can I be looking at with MethylKit if I am not looking at CpG methylation? When would I NOT use "destrand=TRUE"?

I look forward to your reply and once again, thank you for your time and patience.

Best,

Allen

--
You received this message because you are subscribed to a topic in the Google Groups "methylkit_discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/methylkit_discussion/q6qUN-pyvwQ/unsubscribe.
To unsubscribe from this group and all its topics, send an email to methylkit_discus...@googlegroups.com.

Altuna Akalin

unread,

Sep 11, 2014, 7:06:06 AM9/11/14

to methylkit_...@googlegroups.com

1) I was looking through the methylation profile file for one of my samples and I noticed that, in general, only bases that have been methylated are recorded for "coverage", "numCs", and "numTs". That is to say, that if a base was not methylated then there is generally no information about it. However, I found that on some occasions, it was reported. That is to say, I found, for example,

"chr1" 28739459 28739459 "-" 27 0 27

So, this is an unmethylated base that was reported. Why? Why does MethylKit sometimes report these unmethylated bases and sometimes throws them out. Please note that this is the methylation profile for only one sample.

How do you know most of the unmethylated bases are removed? There is no description how you achieved that control. There is coverage and quality thresholds if you are using read.bismark() function that's the only way a base would be discarded. see the doc for that function.

2) I also noticed in https://code.google.com/p/methylkit/ that it says:
"# merge all samples to one table by using base-pair locations that are covered in all samples
# setting destrand=TRUE, will merge reads on both strans of a CpG dinucleotide. This provides better 
# coverage, but only advised when looking at CpG methylation"
meth=unite(myobj,destrand=TRUE)
It says that it will use "base-pair locations that are covered in ALL samples". Therefore, am I right to understand that if I have 10 samples and for a particular location/base, 8 samples have values for "coverage", "numCs", and "numTs" but the remaining 2 samples do not have any information for this particular base, then MethylKit will not include the information for this base in the meth=unite(myobj,destrand=TRUE) function?

If this is true, isn't this criteria a little too stringent? Because of 2 samples, all information for this base would be thrown out? Can this be avoided?

Please read the vignette, See page 7

3) it also says that this is "only advised when looking at CpG methylation". I am a little confused. What else can I be looking at with MethylKit if I am not looking at CpG methylation? When would I NOT use "destrand=TRUE"?

I look forward to your reply and once again, thank you for your time and patience.

Why do you think methylation is specific to only CpG dinucleotides? Where did you read that?

"DNA methylation typically occurs in a CpG dinucleotide context; non-CpG methylation is prevalent in embryonic stem cells,^[5]^[6]^[7] and has also been indicated in neural development."

http://en.wikipedia.org/wiki/DNA_methylation

by using destrand=TRUE, you assume CpG methylation is mostly symmetric and you might gain some information on CpG dinucleotides that are covered in opposite strands in two different samples. In that case methylation levels will be on adjacent positions but opposite strands. and when using unite() with destrand=FALSE you will lose that CpG.

I think you should read up about methylation and some basics before you start using any software relating to methylation.

Best,

Altuna

chongall...@gmail.com

unread,

Sep 14, 2014, 1:41:08 PM9/14/14

to methylkit_...@googlegroups.com

Dear Altuna,

Thanks for your reply. I replied to you a few days ago but it seems like my post somehow never got through. Anyway, here goes again.

### How do you know most of the unmethylated bases are removed? There is no description how you achieved that control. There is coverage and quality thresholds if you are using read.bismark() function that's the only way a base would be discarded. see the doc for that function.###

I was looking at a specific gene promoter and I noticed that for a particular sample, there were no methylation data ("coverage", "numCs", "numTs") for that region. At first, I thought it may be because there were no reads for that region for this particular sample. However, when I loaded the SAM file for this sample into IGV, I noticed that there were indeed reads in this region. But I also noticed that for all the reads in this region for this sample, they were coincidentally all unmethylated (for bases where C is in the reference genome, the reads for that position were all Ts). So, I guessed that they made not have been reported because they were unmethylated. However, when I looked at methylation data for the same promoter region in another sample, I found that there were instances when the unmethylated regions were reported.

You said: "There is no description how you achieved that control."

This is only an observation. I am not sure what kind of control you are expecting.

You said: " There is coverage and quality thresholds if you are using read.bismark() function that's the only way a base would be discarded."

I used the default parameters for the read.bismark() function so I don't know if that would discard any bases. I have decided to redo this and set mincov to 10 and minqual to 0 to see if this changes anything. However, the reason I asked this question was because I thought you may know the reason for this or you may have made the same observation when you were testing your software.

###Why do you think methylation is specific to only CpG dinucleotides? Where did you read that?

"DNA methylation typically occurs in a CpG dinucleotide context; non-CpG methylation is prevalent in embryonic stem cells,[5][6][7] and has also been indicated in neural development."

http://en.wikipedia.org/wiki/DNA_methylation###

I think you've mis-interpreted the intent of my question. When I asked what else I could be looking at with MethylKit, I was actually thinking that perhaps your software might be able to estimate CNV. I have read that some other methylation analysis software are able to do this (see CopyNumber450K package). I could not find any mention of this in the MethylKit user guide so I decided to ask to be sure.

Best,

Allen

To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discussion+unsubscrib...@googlegroups.com.

To post to this group, send email to methylkit_...@googlegroups.com.
Visit this group at http://groups.google.com/group/methylkit_discussion.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "methylkit_discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discussion+unsub...@googlegroups.com.
To post to this group, send email to methylkit_...@googlegroups.com.
Visit this group at http://groups.google.com/group/methylkit_discussion.
For more options, visit https://groups.google.com/d/optout.

--

You received this message because you are subscribed to a topic in the Google Groups "methylkit_discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/methylkit_discussion/q6qUN-pyvwQ/unsubscribe.

To unsubscribe from this group and all its topics, send an email to methylkit_discussion+unsub...@googlegroups.com.

To post to this group, send email to methylkit_...@googlegroups.com.
Visit this group at http://groups.google.com/group/methylkit_discussion.
For more options, visit https://groups.google.com/d/optout.

Altuna Akalin

unread,

Sep 17, 2014, 4:57:14 AM9/17/14

to methylkit_...@googlegroups.com, chongall...@gmail.com

I was looking at a specific gene promoter and I noticed that for a particular sample, there were no methylation data ("coverage", "numCs", "numTs") for that region. At first, I thought it may be because there were no reads for that region for this particular sample. However, when I loaded the SAM file for this sample into IGV, I noticed that there were indeed reads in this region. But I also noticed that for all the reads in this region for this sample, they were coincidentally all unmethylated (for bases where C is in the reference genome, the reads for that position were all Ts). So, I guessed that they made not have been reported because they were unmethylated. However, when I looked at methylation data for the same promoter region in another sample, I found that there were instances when the unmethylated regions were reported.

Yes, only thing I could think of would be coverage and quality. I can't think of anything else that would discard the reads unless there is some format issue with your sam file. For example, sorting via Picard sometimes does weird things. I would use bismark's bedgraph function see if that also gets no methylation there. If you wan't more help you need to send that portion of the sam file. If I can't replicate the problem, there is not way to fix it or even verify if there is a problem.

You said: "There is no description how you achieved that control."
This is only an observation. I am not sure what kind of control you are expecting.

Great! It is good enough, I'm just on auto-reply sometimes, similar questions asked over and over again and most of the time without explanation of how observations are done, or not even a single clue on how to replicate their problem.

You said: " There is coverage and quality thresholds if you are using read.bismark() function that's the only way a base would be discarded."
I used the default parameters for the read.bismark() function so I don't know if that would discard any bases. I have decided to redo this and set mincov to 10 and minqual to 0 to see if this changes anything. However, the reason I asked this question was because I thought you may know the reason for this or you may have made the same observation when you were testing your software.

I don't have similar observation, see the suggestions above

###Why do you think methylation is specific to only CpG dinucleotides? Where did you read that?

"DNA methylation typically occurs in a CpG dinucleotide context; non-CpG methylation is prevalent in embryonic stem cells,[5][6][7] and has also been indicated in neural development."
http://en.wikipedia.org/wiki/DNA_methylation###

I think you've mis-interpreted the intent of my question. When I asked what else I could be looking at with MethylKit, I was actually thinking that perhaps your software might be able to estimate CNV. I have read that some other methylation analysis software are able to do this (see CopyNumber450K package). I could not find any mention of this in the MethylKit user guide so I decided to ask to be sure.

Nope we don't, I think such a nice feature would be worth putting on the vignette if we had. To be honest, your question was open to misunderstanding :) many people asked about different methylation contexts in the past (again auto-reply habit).

Best,

Altuna

...

Reply all

Reply to author

Forward