categorical distribution in mixture model issue

1,364 views
Skip to first unread message

retzerjj

unread,
May 7, 2014, 2:30:41 PM5/7/14
to stan-...@googlegroups.com
Hello all,
We were trying to run a finite mixture model which puts Dirichlet priors on both rows and columns using the data set from the R "blockcluster" package. The translation / compiliation to C++ appears to be working however an error occurs when the "categorical" command is executed.

           Error : Error in function stan::prob::categorical_log(i): Number of categories is 0, but must be between (1, 2)
     error occurred during calling the sampler; sampling not done



I don't "think" this is related to a discrete parameter issue since we have no discrete _parameters_, just discrete (integer) vectors created within the model block. I'm not sure if our argument for categorical command is correct or if there is some other issue.

Any advice would be greatly appreciated,
Many thanks,
Joe Retzer 

-------------------------------------- R / Rstan Code ------------------------------------

library(rstan)    
library(blockcluster); data(binarydata); data <- binarydata # for example data set

set_cppo("fast")  # for best running speed
set_cppo('debug') # make debug easier

N = 1000; D = 100                      # data size
K = 2;    L = 3                        # row and column clusters

my_data <- list(x = data, K = K, L = L, N = N, D = D, alpha1 = rep(1,K), alpha2 = rep(1,L))

my_code <- '
data {
  int<lower=1> N;
  int<lower=1> D;
  int<lower=1,upper=N> K;                // Number of row clusters
  int<lower=1,upper=D> L;                // Number of col clusters
  int<lower=0,upper=1> x[N,D];
  vector<lower=0>[K] alpha1;
  vector<lower=0>[L] alpha2;
}

parameters {
   real<lower=0,upper=1> theta[K,L];
   simplex[K] pi[N];
   simplex[L] pj[D];
}
model {
  int rclass[N,D];
  int cclass[N,D];
  for(k in 1:K){                            # prior on theta - binomial prob for co-clusters
    for(l in 1:L){
      theta[k,l] ~ beta(1, 1);   
    }
  }

  for(i in 1:N) {                           # hierarchical prior on rows
    pi[i] ~ dirichlet(alpha1);              # the individual class prob pi
  }


  for(j in 1:D) {                           # hierarchical prior on columns  
    pj[j] ~ dirichlet(alpha2);              # the individual class prob pj
  }   

  for(i in 1:N){                            # likelihood
    for(j in 1:D){
      rclass[i,j] ~ categorical(pi[i]);
      cclass[i,j] ~ categorical(softmax(pj[j]));   
      x[i,j] ~ bernoulli(theta[rclass[i,j], cclass[i,j]]);
    }
  }
}
'

fit <- stan(model_code = my_code, data = my_data, iter = 1000, chains = 4)




Bob Carpenter

unread,
May 7, 2014, 6:33:22 PM5/7/14
to stan-...@googlegroups.com
I don't know what the R package blockcluster does, but
our categoricals are indexed from 1 through K.

Your problem is here:

> model {
> int rclass[N,D];
...
> rclass[i,j] ~ categorical(pi[i]);

You don't ever set rclass. The default value is 0, and then
that causes a problem because 0 is not an appropriate outcome
for a categorical.

It looks like what you're trying to do is sample from the
categorical(pi[i]) and assign the result to rclass[i,j], which
is NOT how Stan sampling statements work. They just increment
the log probability with the value of

log(Categorical(rclass[i,j] | pi[i]))


which presupposes that rclass[i,j] is set to a valid value.

Also, given that you have a parameter declaration

> simplex[L] pj[D];

you almost certainly don't want to be doing this:

> cclass[i,j] ~ categorical(softmax(pj[j]));

pj[j] is already declared to be a simplex, so it's already
the right kind of thing to be a parameter to categorical().
Now you're transforming it again by exponentiating the
values in pj[j] and then normalizing.

- Bob
> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Retzer Joe

unread,
May 7, 2014, 11:47:52 PM5/7/14
to stan-...@googlegroups.com
Hi Bob,
   Thanks so much for getting back to us. Just a couple of things:

On May 7, 2014, at 5:33 PM, Bob Carpenter wrote:

I don't know what the R package blockcluster does, but
our categoricals are indexed from 1 through K.  


I only mention the blockcluster package to specify the source of our data, its functionality has nothing to do with the stan program. Sorry for not being more clear on this.


Your problem is here:

model {
 int rclass[N,D];
...
     rclass[i,j] ~ categorical(pi[i]);

You don't ever set rclass.  The default value is 0, and then
that causes a problem because 0 is not an appropriate outcome
for a categorical.


Right, we saw that with some print statements we had but removed before posting. We did try to limit / set the range with:

     int<lower=1, upper=K> rclass[N,D];
  int<lower=1, upper=L> cclass[N,D];

in the model block, but got the error:

     Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
       failed to parse Stan model 'my_code' with error message:
     EXPECTATION FAILURE LOCATION: file=input; line=20, column=36

       int<lower=1,upper=K> rclass[N,D];
                                        ^-- here

     DIAGNOSTIC(S) FROM PARSER:
     require unconstrained. found range constraint.
     Parser expecting: <eps>

However we may be setting rclass inappropriately. Is there another way to limit rclass range?

It looks like what you're trying to do is sample from the
categorical(pi[i]) and assign the result to rclass[i,j], which
is NOT how Stan sampling statements work.  They just increment
the log probability with the value of

 log(Categorical(rclass[i,j] | pi[i]))


Yes, that is exactly what we are trying to do. I suspect our approach therefore is not correct. We would need to both set rclass (and cclass) appropriately and change our likelihood specification code, is that correct? 


which presupposes that rclass[i,j] is set to a valid value.

Also, given that you have a parameter declaration

  simplex[L] pj[D];

you almost certainly don't want to be doing this:

     cclass[i,j] ~ categorical(softmax(pj[j]));    

pj[j] is already declared to be a simplex, so it's already
the right kind of thing to be a parameter to categorical().
Now you're transforming it again by exponentiating the
values in pj[j] and then normalizing.


Yes, absolutely correct. The "softmax" code was left over after some re-coding attempts and should have been removed before posting, sorry about that.


Thanks again Bob for looking at our code and your advice, its very much appreciated,
Joe





You received this message because you are subscribed to a topic in the Google Groups "Stan users mailing list" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stan-users/LK0kGNdyGdE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stan-users+...@googlegroups.com.

Bob Carpenter

unread,
May 9, 2014, 5:20:11 AM5/9/14
to stan-...@googlegroups.com

On May 8, 2014, at 5:47 AM, Retzer Joe <retz...@gmail.com> wrote:
>
>> Your problem is here:
>>
>>> model {
>>> int rclass[N,D];
>> ...
>>> rclass[i,j] ~ categorical(pi[i]);
>>
>> You don't ever set rclass. The default value is 0, and then
>> that causes a problem because 0 is not an appropriate outcome
>> for a categorical.
>>
>
> Right, we saw that with some print statements we had but removed before posting. We did try to limit / set the range with:
>
> int<lower=1, upper=K> rclass[N,D];
> int<lower=1, upper=L> cclass[N,D];

The problem isn't not setting the bounds, the problem is that you're
trying to use "~" to do random number generation and assignment to
rclass, and that's not how Stan works. It requires rclass and pi to
already be defined (not just declared as local variables) when you call
it. The parameters are all defined by default externally by the sampler,
but any locals, transformed data, transformed parameters, and generated
quantities need to be set explicitly before being used.

>> It looks like what you're trying to do is sample from the
>> categorical(pi[i]) and assign the result to rclass[i,j], which
>> is NOT how Stan sampling statements work. They just increment
>> the log probability with the value of
>>
>> log(Categorical(rclass[i,j] | pi[i]))
>
>
> Yes, that is exactly what we are trying to do. I suspect our approach therefore is not correct. We would need to both set rclass (and cclass) appropriately and change our likelihood specification code, is that correct?
>

Correct.

What you really need to do is marginalize out any discrete
parameters.

- Bob

retzerjj

unread,
May 21, 2014, 1:48:26 PM5/21/14
to stan-...@googlegroups.com

Thanks Bob. I tried re-writing to address the likelihood updating and to marginalize out the discrete parameters (using somewhat similar examples found on the list and the manual) and came up with the code below.

my_code <- '
 data {
   int<lower=1> N;
   int<lower=1> D;
   int<lower=1> K;                // Number of row clusters
   int<lower=1> L;                // Number of col clusters

   int<lower=0,upper=1> x[N,D];
   vector<lower=0>[K] alpha1;
   vector<lower=0>[L] alpha2;
 }
 
 parameters {
    real<lower=0,upper=1> theta[K,L];
   //     real pi[N,K];
   //     real pj[D,L];

   simplex[K] pi[N];
   simplex[L] pj[D];
 }
 
 model {
   int rowSegVec[N];
   int colSegVec[D];
   real ps[N,D];

   for(k in 1:K){                            # prior on theta - binomial prob for co-clusters
     for(l in 1:L){theta[k,l] ~ beta(1, 1);}
   }
 
   for(i in 1:N) {                           # hierarchical prior on rows
     pi[i] ~ dirichlet(alpha1);              # the individual class prob pi
   }
 
   for(j in 1:D) {                           # hierarchical prior on columns  
     pj[j] ~ dirichlet(alpha2);    # the individual class prob pj
   }   
 
   for(i in 1:N){                             # likelihood
     rowSegVec[i] ~ categorical(pi[i]);
     for(j in 1:D){
       colSegVec[j] ~ categorical(pj[j]);
       for(l in 1:L){
         for (k in 1:K){
            if (which.max(rowSegVec[i])) == k &  which.max(colSegVec[j]) == l)  
               increment_log_prob(x[i,j]*log(Theta[k,l])+(1-x[i,j])*log1m(Theta[k,l]));
           }
         }
       }
     }
   }
 '

I'm now running into the error:

TRANSLATING MODEL 'my_code' FROM Stan CODE TO C++ CODE NOW.

Error in stanc(file = file, model_code = model_code, model_name = model_name,  :
  failed to parse Stan model 'my_code' with error message:
EXPECTATION FAILURE LOCATION: file=input; line=42, column=40

            if (which.max(rowSegVec[i])) == k &  which.max(colSegVec[j]) == l)  
                                       ^-- here


DIAGNOSTIC(S) FROM PARSER:
no matches for function name="which.max"
    arg 0 type=int
available function signatures for which.max:
expression is ill formed
Parser expecting: <eps>

It would seem the R "which.max" function is not recognized by stan, though there could be other problems. I was wondering if there was an equivalent function in R and/or if there might be any additional obvious problems with the code re-write.
Many thanks,
Joe

 

Soga Shota

unread,
May 21, 2014, 8:28:36 PM5/21/14
to stan-...@googlegroups.com
probably a missing parenthesis or an extra one? 

if (which.max(rowSegVec[i])) == k &  which.max(colSegVec[j]) == l)  

to
   
if ((which.max(rowSegVec[i])) == k &  which.max(colSegVec[j]) == l)

or

if (which.max(rowSegVec[i]) == k &  which.max(colSegVec[j]) == l)

Bob Carpenter

unread,
May 21, 2014, 8:31:25 PM5/21/14
to stan-...@googlegroups.com
Stan doesn't have support for any functions not in its manual.
This includes most of the R functions.

I have no idea what which.max does in R, but it's not in Stan.
Maybe someone who knows R can help you translate or you could
explain what you're trying to do.

You'll also find that Stan uses the "&&" notation for conjunction from
C++ (where "&" is bitwise and).

- Bob

Avraham Adler

unread,
May 22, 2014, 9:57:38 AM5/22/14
to stan-...@googlegroups.com
which.max in R cycles through a vector and returns the index with the largest value. To do that in Stan, you'll probably have to write an extra loop and store the value. One observation, though. RowSegVec is a vector, so RowSegVec[i] is a single observation, which.max shouldn't apply, unless I missed something.

--Avi
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+unsubscribe@googlegroups.com.

Retzer Joe

unread,
May 22, 2014, 10:14:20 AM5/22/14
to stan-...@googlegroups.com
Thanks Avi, you are right, I need to assign and refer to rowSegVec and colSegVec without an index. 
Joe


You received this message because you are subscribed to a topic in the Google Groups "Stan users mailing list" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stan-users/LK0kGNdyGdE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stan-users+...@googlegroups.com.

Retzer Joe

unread,
May 22, 2014, 11:20:44 AM5/22/14
to stan-...@googlegroups.com
Hi all,

I re-defined my rowSegVec and colSegVec to be vectors (to be re-used within the relevant loop). I thought that an alternative to "which.max" could be as follows:

  - given the dimension of the vector is say K, the max will be identified as having rank(rowSegVec) equal to K - 1. 

The remainder of the code (listed below) did not change. I now get a parsing error:

TRANSLATING MODEL 'my_code' FROM Stan CODE TO C++ CODE NOW.
Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model 'my_code' with error message:
EXPECTATION FAILURE LOCATION: file=input; line=35, column=6

     rowSegVec ~ categorical(pi[i]);
     ^-- here


DIAGNOSTIC(S) FROM PARSER:
no matches for function name="categorical_log"
    arg 0 type=vector
    arg 1 type=vector
available function signatures for categorical_log:
0.  categorical_log(int, vector) : real
unknown distribution=categorical
Parser expecting: "}"


Any advice would be appreciated,
Joe


my_code <- '
 data {
   int<lower=1> N;
   int<lower=1> D;
   int<lower=1> K;                // Number of row clusters
   int<lower=1> L;                // Number of col clusters
   int<lower=0,upper=1> x[N,D];
   vector<lower=0>[K] alpha1;
   vector<lower=0>[L] alpha2;
 }
 
 parameters {
   real<lower=0,upper=1> theta[K,L];
   simplex[K] pi[N];
   simplex[L] pj[D];
 }
 
 model {
   vector[K] rowSegVec;
   vector[L] colSegVec;
   real ps[N,D];
   for(k in 1:K){                            # prior on theta - binomial prob for co-clusters
     for(l in 1:L){theta[k,l] ~ beta(1, 1);}
   }
 
   for(i in 1:N) {                           # hierarchical prior on rows
     pi[i] ~ dirichlet(alpha1);              # the individual class prob pi
   }
 
   for(j in 1:D) {                           # hierarchical prior on columns   
     pj[j] ~ dirichlet(alpha2);              # the individual class prob pj
   }    
 
   for(i in 1:N){                             # likelihood
     rowSegVec ~ categorical(pi[i]);
     for(j in 1:D){
       colSegVec ~ categorical(pj[j]);
       for(l in 1:L){
         for (k in 1:K){
      //    if (which.max(rowSegVec) == k   &&  which.max(colSegVec) == l)   
            if (rank(rowSegVec[k])   == K-1 &&  rank(colSegVec[l])   == L-1)   
               increment_log_prob(x[i,j]*log(theta[k,l])+(1-x[i,j])*log1m(theta[k,l]));
           }
         }
       }
     }
   }
 '

On May 22, 2014, at 8:57 AM, Avraham Adler wrote:

You received this message because you are subscribed to a topic in the Google Groups "Stan users mailing list" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/stan-users/LK0kGNdyGdE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to stan-users+...@googlegroups.com.

Soga Shota

unread,
May 23, 2014, 4:48:32 AM5/23/14
to
I fixed your code a little bit for you. The error said all. your return type should be int instead of vector.

Also, your use of rank is not correct (check input arguments in the manual). I do not know what is the purpose of rank function in your code, but I do not think which.max is equivalent to rank anyway.

Assuming you use rank as alternative to which.max, here is your code with my fixes(at least compiled, but not sure this is what you want). 

data {
   int<lower=1> N;
   int<lower=1> D;
   int<lower=1> K;                // Number of row clusters
   int<lower=1> L;                // Number of col clusters
   int<lower=0,upper=1> x[N,D];
   vector<lower=0>[K] alpha1;
   vector<lower=0>[L] alpha2;
 }
 
 parameters {
   real<lower=0,upper=1> theta[K,L];
   simplex[K] pi[N];
   simplex[L] pj[D];
 }
 
 model {
   //vector[K] rowSegVec;
   int rowSegVec[K];
   //vector[L] colSegVec;
   int colSegVec[L];
   real ps[N,D];
   for(k in 1:K){                            # prior on theta - binomial prob for co-clusters
     for(l in 1:L){theta[k,l] ~ beta(1, 1);}
   }
 
   for(i in 1:N) {                           # hierarchical prior on rows
     pi[i] ~ dirichlet(alpha1);              # the individual class prob pi
   }
 
   for(j in 1:D) {                           # hierarchical prior on columns   
     pj[j] ~ dirichlet(alpha2);              # the individual class prob pj
   }    
 
   for(i in 1:N){                             # likelihood
     int row_max;
     int col_max;

     rowSegVec ~ categorical(pi[i]);
     for(j in 1:D){
       colSegVec ~ categorical(pj[j]);
       for(l in 1:L){
         for (k in 1:K){
            row_max <- 0; // first candidate
            col_max <- 0; // first candidate

            for(kk in 1:K)
            {
               // not sure what to do if where are two max or more.
               if(rowSegVec[kk] > row_max)
               {
                  row_max <- kk;
               }
            }
            for(ll in 1:L)
            {
               if(colSegVec[ll] > row_max)
               {
                  col_max <- ll;
               }
            }
            if (row_max   == K-1 &&  col_max   == L-1)   
               increment_log_prob(x[i,j]*log(theta[k,l])+(1-x[i,j])*log1m(theta[k,l]));
           }
         }
       }
     }
   }

Enjoy!

Retzer Joe

unread,
May 24, 2014, 2:30:34 PM5/24/14
to stan-...@googlegroups.com
Hi Soga,
   Thanks so much for looking at this. When I run the code you provided I still get the error

TRANSLATING MODEL 'my_code' FROM Stan CODE TO C++ CODE NOW.
Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model 'my_code' with error message:
EXPECTATION FAILURE LOCATION: file=input; line=41, column=6

     rowSegVec ~ categorical(pi[i]);
     ^-- here


DIAGNOSTIC(S) FROM PARSER:
no matches for function name="categorical_log"
    arg 0 type=int[1]
    arg 1 type=vector
available function signatures for categorical_log:
0.  categorical_log(int, vector) : real
unknown distribution=categorical
Parser expecting: "}"

What I'd like to do is generate a categorical RV using the simplex pi (or pj). I'm assuming I'd get a vector of 0's and one "1". The position of the one in the vector would identify the row (or column) of my theta matrix from which to draw a bernoulli RV parameter. It seems I cannot get past the rowSegVec ~ categorical(pi[i]); statement.

I've tried many iterations one being to create rowSegVec and colSegVec in the generated quantities block using categorical_rng.(note: categorical_rng will not work in the model block) However the function uses a simplex vector created in the model block. Also, if I put generated quantities code before the model block I again get an error. 

I'm also not able to create rowSegVec / colSegVec in the transformed parameter block as they need to be integer.

Joe



On May 22, 2014, at 11:24 PM, Soga Shota wrote:

I fixed a little bit for you. The error said all. your return type should be int instead of vector.

Also, your use of rank is not correct (check input arguments in the manual). I do not know what is the purpose of rank, but I do not think which.max is equivalent to rank anyway.

Assuming you use rank as alternative to which.max, here is your code with my fixes(at least compiled, but not sure this is what you want). 

data {
   int<lower=1> N;
   int<lower=1> D;
   int<lower=1> K;                // Number of row clusters
   int<lower=1> L;                // Number of col clusters
   int<lower=0,upper=1> x[N,D];
   vector<lower=0>[K] alpha1;
   vector<lower=0>[L] alpha2;
 }
 
 parameters {
   real<lower=0,upper=1> theta[K,L];
   simplex[K] pi[N];
   simplex[L] pj[D];
 }
 
 model {
   //vector[K] rowSegVec;
   int rowSegVec[K];
   //vector[L] colSegVec;
   int colSegVec[L];
   real ps[N,D];
   for(k in 1:K){                            # prior on theta - binomial prob for co-clusters
     for(l in 1:L){theta[k,l] ~ beta(1, 1);}
   }
 
   for(i in 1:N) {                           # hierarchical prior on rows
     pi[i] ~ dirichlet(alpha1);              # the individual class prob pi
   }
 
   for(j in 1:D) {                           # hierarchical prior on columns   
     pj[j] ~ dirichlet(alpha2);              # the individual class prob pj
   }    
 
   for(i in 1:N){                             # likelihood
     int row_max;
     int col_max;

     rowSegVec ~ categorical(pi[i]);
     for(j in 1:D){
       colSegVec ~ categorical(pj[j]);
       for(l in 1:L){
         for (k in 1:K){
            row_max <- 0; // first candidate
            col_max <- 0; // first candidate

            for(kk in 1:K)
            {
               // not sure what to do if where are two max or more.
               if(rowSegVec[kk] > row_max)
               {
                  row_max <- kk;
               }
            }
            for(ll in 1:L)
            {
               if(colSegVec[ll] > row_max)
               {
                  col_max <- ll;
               }
            }
            if (row_max   == K-1 &&  col_max   == L-1)   
               increment_log_prob(x[i,j]*log(theta[k,l])+(1-x[i,j])*log1m(theta[k,l]));
           }
         }
       }
     }
   }

Enjoy!

Bob Carpenter

unread,
May 24, 2014, 7:46:47 PM5/24/14
to stan-...@googlegroups.com
1. The problem is that you're trying to use an int array as
the outcome for a categorical, which Stan doesn't allow.

Stan distinguishes the categorical from the multinomial.
If theta is a K-simplex, then

y ~ categorical(theta);

assumes y is a single outcome coded as an integer in 1:K,
whereas

y ~ multinomial(theta);

assumes that y is a K-array of integer counts for each outcome.

2. The _rng functions are specified to only work in the generated
quantities block --- they're not part of the modeling and
can't be used as such.

- Bob
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.

Soga Shota

unread,
May 25, 2014, 7:59:03 PM5/25/14
to stan-...@googlegroups.com
> Thanks so much for looking at this. When I run the code you provided I still get the error

I rechecked I do not get an error when I compiled my code I posted.
I used cmdstan-2.2.0 with clang++. Did you get an error at a run-time? I did not check it because I dont have data...

Anyway, see comments from Bob. 
Reply all
Reply to author
Forward
0 new messages