PyClone cluster DEFAULT values: Too many clusters

Shadrielle Melijah Espiritu

Mar 22, 2016, 5:26:03 PM

to Pyclone User Group

The config file I'm using is basically the default one from the PyClone Usage page on Bitbucket (10,000 MCMC iterations). After running PyClone analyse and using a burn-in of 1000, I run PyClone cluster, and it seems to put a lot of the SNVs into their own clusters. Is there a way to parameterize PyClone cluster to output "only 5 clusters" or "only 3 clusters"?

Data: ~5000 SNV mutations
Clusters: 448 total, with ~4500 mutations assigned to one cluster, ~100 to another cluster, and every remaining mutation in its own cluster

Any insights would be appreciated. Thank you!

Andrew

Mar 22, 2016, 5:35:15 PM

to Pyclone User Group

Hi Shadrielle,

There is no parameter to specify the number of clusters. If you are getting a large number of singleton clusters, that likely indicates a problem with the analysis. You appear to be using an extremely large number of SNVs; PyClone is typically used with hundreds, not thousands, of SNVs. That is not necessarily an issue, but you will likely need to run the MCMC analysis for much longer for the chain to converge. Based on my experience, I would guess 100,000-1,000,000 iterations would be required. You should also run multiple restarts to check that the results match.
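
One possible way to check whether two restarts agree is to compare their cluster assignments with the adjusted Rand index, which ignores how the clusters happen to be numbered. This is only a sketch: the label lists below are made up, and how you load the cluster labels from each run's output is left out.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings of the same mutations (1.0 = identical)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same mutations")
    total = comb(len(labels_a), 2)
    if total == 0:
        return 1.0  # fewer than two mutations, nothing to compare
    # Count pairs of mutations that are co-clustered in both, in A, and in B.
    sum_pairs = sum(comb(n, 2) for n in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(n, 2) for n in Counter(labels_a).values())
    sum_b = sum(comb(n, 2) for n in Counter(labels_b).values())
    expected = sum_a * sum_b / total
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both runs use a single cluster
    return (sum_pairs - expected) / (maximum - expected)


# Hypothetical cluster labels for six mutations from three restarts.
run_1 = [0, 0, 0, 1, 1, 2]
run_2 = [5, 5, 5, 3, 3, 7]  # same grouping as run_1, different label names
run_3 = [0, 1, 2, 3, 4, 5]  # every mutation in its own cluster

print(adjusted_rand_index(run_1, run_2))
print(adjusted_rand_index(run_1, run_3))
```

Values near 1 suggest the restarts found the same clustering; values near 0 suggest the chains have not converged to the same answer.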

Cheers,

Andy

Shadrielle Melijah Espiritu

Mar 22, 2016, 6:32:19 PM

to Pyclone User Group

I'll give that a try. Can both PyClone analyse and cluster be run in a parallel environment?

Thanks Andy

Andrew

Mar 22, 2016, 7:28:21 PM

to Pyclone User Group

Unfortunately, neither can run in parallel. You can however run multiple restarts in parallel.

Are you looking at whole genome sequencing data? 5000 SNVs seems like a lot to deep sequence. If the data is low depth you may want to use the binomial instead of the beta-binomial model.

Cheers,

Andy

Shadrielle Melijah Espiritu

Mar 22, 2016, 7:35:17 PM

to Pyclone User Group

It's not deeply sequenced. For our WGS data, tumours are only 50X and normals are 30X. Will the binomial model run faster (and give better results)?

Andrew

Mar 22, 2016, 7:39:37 PM

to Pyclone User Group

For the purposes of PyClone analysis, less than 1000x is considered low coverage. For the 50x genomes the binomial may work better, and it will be slightly faster. I have to warn you that using PyClone on WGS data has not really been tested. We typically focus on targeted resequencing data. However, if you have multiple tumours per patient you may get reasonable results. Single tumour sample analysis will probably not work well though.
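
A rough illustration of why depth matters so much here, assuming simple binomial sampling of reads (a simplification that ignores copy number, tumour content and overdispersion). The value of `true_vaf` is made up for the example.

```python
from math import sqrt


def vaf_standard_error(vaf, depth):
    """Binomial standard error of an observed variant allele frequency."""
    return sqrt(vaf * (1.0 - vaf) / depth)


true_vaf = 0.25  # hypothetical value chosen for illustration
for depth in (50, 1000):
    se = vaf_standard_error(true_vaf, depth)
    print(f"depth {depth:>4}x: VAF = {true_vaf} +/- {se:.3f}")
```

Under this assumption the noise on a single SNV's VAF is about 0.06 at 50x versus about 0.014 at 1000x, so individual SNVs at WGS depth carry much less information about which cluster they belong to.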

Cheers,

Andy

Shadrielle Melijah Espiritu

Mar 23, 2016, 3:28:04 PM

to Pyclone User Group

For multiple tumours per patient, should I combine them in the same config file as follows?

samples:
  Sample1:
    mutations_file: yaml1
    tumour_content:
      value: 0.9
    error_rate: 0.001
  Sample2:
    mutations_file: yaml2
    tumour_content:
      value: 0.8
    error_rate: 0.001

Will PyClone still report cellular frequencies for the SNP mutations associated with each tumour sample?

Thanks for the quick replies

Andrew

Mar 23, 2016, 5:34:01 PM

to Pyclone User Group

1) The samples file looks correct. Double check against the example here.

2) I am not sure I understand what you mean by the second point but here are a few things to note.

- PyClone will report the cellular prevalence of SNVs in each sample.

- I assume you mean somatic SNVs, not germline SNPs. PyClone doesn't handle the latter.

- Each sample should have the same set of SNVs. So even if a mutation is not predicted in a sample, you will need to include the count data from that sample in the file.

- Make sure the `mutation_id` is consistent for SNVs across samples.

  - sample_1:chr2:12345, sample_2:chr2:12345 would be the wrong way to name things

  - chr2:12345 is the correct way

  - You do not need to specify the chromosome and coordinates as above; just make sure the name is consistent across samples.
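
The last two points (same set of SNVs in every sample, same `mutation_id` for the same SNV) can be checked before running anything. A minimal sketch, assuming you have already read the `mutation_id` column of each sample's input file into a list; the sample names and IDs below are made up.

```python
def check_mutation_ids(ids_by_sample):
    """Return {sample: sorted mutation_ids that other samples have but this one lacks}."""
    all_ids = set().union(*ids_by_sample.values())
    return {
        sample: sorted(all_ids - set(ids))
        for sample, ids in ids_by_sample.items()
        if all_ids - set(ids)
    }


# Hypothetical mutation_id columns read from two per-sample input files.
ids_by_sample = {
    "Sample1": ["chr2:12345", "chr7:555"],
    "Sample2": ["chr2:12345", "sample_2:chr7:555"],  # inconsistent naming
}
print(check_mutation_ids(ids_by_sample))
```

An empty result means every sample lists exactly the same IDs. Anything reported is either an SNV whose counts are missing from that sample or an ID that was named differently.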

Cheers,

Andy

Shadrielle Melijah Espiritu

Mar 24, 2016, 11:58:28 AM

to Pyclone User Group

That's definitely helpful to know. Since PyClone can't run in parallel, I expect it would take a very long time given the parameters you've suggested above. My thought is to split the data into sets of 100 common SNPs and run PyClone separately on each set. Is that wise (this way it's a bit more parallel)?

Thanks

Andrew

Mar 24, 2016, 1:11:27 PM

to Pyclone User Group

Hi Shadrielle,

Splitting the dataset is not a great idea, as you will likely lose the ability to share statistical strength between samples. Furthermore, it will be difficult to figure out the correspondence between clusters in different chunks. A few solutions that may work:

1) If you have single-sample data, EXPANDS may work. It is designed to scale to exome/WGS data.

2) If you have multiple samples but the copy number is relatively stable, i.e. a large number of regions with no copy number change, then SciClone may work.

3) If neither of the above holds, you could run something like a Gaussian or binomial mixture model on the data to cluster the SNVs by variant allele frequency (VAF). Then you can randomly sample SNVs from each cluster as representatives of the cluster and use these for PyClone analysis.

The reason 3) may work is as follows. SNVs with similar VAFs likely have the same mutational genotype and are present in the same clones. Thus SNVs in clusters from the GMM or BMM would effectively be treated the same by PyClone. Hence, we can just look at a subset of them to get similar performance from PyClone.
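
Option 3) could look roughly like the sketch below: a small binomial mixture model fitted by EM on single-sample read counts, followed by random sampling of representatives from each cluster. It is written from scratch with the standard library only and is not part of PyClone. The simulated counts, the number of clusters and `per_cluster` are all assumptions for illustration; on real data you would need to choose them yourself.

```python
import math
import random


def log_binom_pmf(k, n, p):
    """Log of the binomial probability of k variant reads out of n."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1.0 - p)


def fit_binomial_mixture(var_counts, depths, n_clusters, n_iter=100):
    """Fit a binomial mixture by EM; return a hard cluster label per SNV."""
    # Spread the initial cluster VAFs evenly over (0, 1).
    probs = [(j + 1) / (n_clusters + 1) for j in range(n_clusters)]
    weights = [1.0 / n_clusters] * n_clusters
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each SNV.
        resp = []
        for k, n in zip(var_counts, depths):
            logs = [math.log(w) + log_binom_pmf(k, n, p) for w, p in zip(weights, probs)]
            top = max(logs)
            unnorm = [math.exp(v - top) for v in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: update mixing weights and cluster VAFs (clamped away from 0 and 1).
        for j in range(n_clusters):
            mass = sum(r[j] for r in resp)
            reads = sum(r[j] * n for r, n in zip(resp, depths))
            variants = sum(r[j] * k for r, k in zip(resp, var_counts))
            weights[j] = max(mass / len(var_counts), 1e-12)
            if reads > 0:
                probs[j] = min(max(variants / reads, 1e-6), 1.0 - 1e-6)
    return [max(range(n_clusters), key=lambda j: r[j]) for r in resp]


def sample_representatives(labels, per_cluster, rng):
    """Randomly pick up to per_cluster SNV indices from each cluster."""
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)
    chosen = []
    for label in sorted(members):
        pool = members[label]
        chosen.extend(rng.sample(pool, min(per_cluster, len(pool))))
    return sorted(chosen)


# Simulated 50x data: 200 SNVs at VAF 0.10 and 200 at VAF 0.45 (made-up values).
rng = random.Random(0)
depths = [50] * 400
true_vafs = [0.10] * 200 + [0.45] * 200
var_counts = [sum(rng.random() < p for _ in range(n)) for p, n in zip(true_vafs, depths)]

labels = fit_binomial_mixture(var_counts, depths, n_clusters=2)
subset = sample_representatives(labels, per_cluster=50, rng=rng)
print(len(subset), "SNVs kept out of", len(var_counts))
```

The indices in `subset` identify the SNVs you would carry forward into the PyClone input. With multiple samples the same idea applies, but the mixture would have to be fitted on the VAFs from all samples jointly, which this sketch does not do.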

Cheers,

Andy

Tommy Tang

Feb 20, 2017, 3:14:57 PM

to Pyclone User Group

Hi Andy, I have seen you mention "restart" in several posts. What do you mean by that?

Running the exact same command but specifying a different seed?

Thanks!

Tommy

Andrew

Feb 21, 2017, 8:55:32 AM

to Pyclone User Group

Exactly.
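
Since a restart is the same command with a different seed, and restarts can run side by side even though a single run cannot be parallelized, a small launcher along these lines may help. This is a generic sketch, not PyClone-specific code: `demo_command` is a placeholder that just runs Python so the example works anywhere. The exact seed option and the need for a separate output directory per restart are assumptions to verify against `PyClone analyse --help` and your config file.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_restarts(make_command, seeds, max_workers=4):
    """Run one independent process per seed; return {seed: exit code}."""
    def run_one(seed):
        return seed, subprocess.run(make_command(seed)).returncode

    # Threads are enough here: each one only waits on its own child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, seeds))


def demo_command(seed):
    # Placeholder command (hypothetical). For a real analysis, return the
    # PyClone command line for this restart instead, with its own seed and
    # its own output location so restarts do not overwrite each other.
    code = f"import random; random.seed({seed}); print({seed}, random.random())"
    return [sys.executable, "-c", code]


exit_codes = run_restarts(demo_command, seeds=[1, 2, 3])
print(exit_codes)
```

An exit code of 0 for every seed only means each process finished. Whether the restarts agree still has to be checked by comparing their results.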
