pmclust has zero variance


Audris Mockus

May 18, 2017, 2:44:47 PM
to RBigDataProgramming
Problem: .pmclustEnv$Z.colSums[i.k] is sometimes zero, which leads to NaNs in the likelihood
when running with multiple processes.

Reproduce:
# Generate 12 Gaussian clusters
X.spmd <- result
barrier()

K <- 12
barrier()
PARAM.org <- set.global(K = K)
PARAM.org <- initial.em(PARAM.org)

The issue shows up as the following error:
 Error in if (any(u < .pmclustEnv$CONTROL$U.min | u > .pmclustEnv$CONTROL$U.max)) { : 
  missing value where TRUE/FALSE needed

Instrumenting the code shows that
.pmclustEnv$Z.colSums[i.k]
is zero inside the following function in pmclust/R/pm_em_base.r:
m.step.spmd <- function(PARAM){
  if(exists("X.spmd", envir = .pmclustEnv)){
    X.spmd <- get("X.spmd", envir = .pmclustEnv)
  }

  ### MLE For ETA
  PARAM$ETA <- .pmclustEnv$Z.colSums / sum(.pmclustEnv$Z.colSums)
  PARAM$log.ETA <- log(PARAM$ETA)
...
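
For illustration, here is a minimal base-R sketch of how a single zero entry in the column sums can turn into the NA that `if` complains about. All numbers, the `weighted.sum` vector, and the bounds are made up for the example; this is not the actual pmclust code path, only the arithmetic I suspect is involved.

```r
# Hypothetical column sums of the posterior matrix Z for K = 3 components;
# the second component has received no weight at all.
Z.colSums <- c(40, 0, 60)

ETA <- Z.colSums / sum(Z.colSums)   # mixing proportions: 0.4 0.0 0.6
log.ETA <- log(ETA)                 # log(0) is -Inf for the empty component

# A weighted mean divides a zero weighted sum by a zero total weight.
weighted.sum <- c(12.5, 0, -3.0)    # hypothetical sums of Z[, k] * x
MU <- weighted.sum / Z.colSums      # 0 / 0 is NaN for the empty component

# Comparing NaN against bounds gives NA, and any() then returns NA
# because no element is TRUE. Passing that to if() raises
# "missing value where TRUE/FALSE needed".
has.na <- is.na(any(MU < -1e10 | MU > 1e10))
```

Here `has.na` is TRUE, which matches the form of the error above.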


Thank you, Audris

Wei-Chen Chen

May 20, 2017, 2:08:39 PM
to rbigdatap...@googlegroups.com
This can happen in several situations, but in general it implies that the i.k-th component is degenerating. Typical causes include bad initial values, an over-specified total number of clusters K, and so on.

There are some ways to reduce the chance of this, but I am not sure whether any of them are implemented. For example, skip or stop updating the covariance matrix once its largest eigenvalue falls below some threshold.
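
A rough sketch of that last idea in base R, to make it concrete. The helper name `guard.cov` and the threshold `min.eigen` are hypothetical and not part of pmclust; the threshold value would need tuning for real data.

```r
# Hypothetical guard, not a pmclust function: keep the previous covariance
# matrix when the proposed update looks degenerate.
guard.cov <- function(sigma.old, sigma.new, min.eigen = 1e-8) {
  # Only the eigenvalues are needed, so skip computing eigenvectors.
  ev <- eigen(sigma.new, symmetric = TRUE, only.values = TRUE)$values
  if (max(ev) < min.eigen) {
    sigma.old   # largest eigenvalue below threshold: skip the update
  } else {
    sigma.new   # update looks fine: accept it
  }
}

# Usage: a collapsed update is rejected, a healthy one is accepted.
kept    <- guard.cov(diag(2), diag(1e-12, 2))
updated <- guard.cov(diag(2), diag(3, 2))
```

Whether to freeze the component, drop it, or restart with a smaller K once this triggers is a separate design choice.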