Hi, Danny,
Thanks for the quick reply.
I didn't know that one-class problems aren't solved well with matrix
factorization; I am relatively new to collaborative filtering
methods. I'm glad you told me. What classes of methods are more
successful for one-class problems?
I have seen your blog posts about gensgd, and it looks promising. I
do have additional features I could include. I'd like to understand
more about how the method works. Could you point me to any
additional info on it?
The reason I hoped to use the sparse MF methods is that they
facilitate extracting meaning from the model itself, rather than it
being only a black box that gives good predictions. As you know,
with sparse methods a limited number of samples and/or features is
associated with each latent factor, and so one can get further in
associating the latent factors with some known entity/process/etc. I
was looking at doing something similar to the cancer example toward
the end of the paper on sparse latent semantic analysis that is
referenced in connection with GraphChi's sparse ALS implementation.
I don't want to take up your time with too much detail about my
particular problem, but here is some background in case it is relevant:
I'm working with mutation data. My prediction question is, given a
mutation profile from a tumor, what other mutations would such a
tumor be most likely to acquire? Or, perhaps more important, what
mutations would it be least likely to acquire? (Some of these may be
potential therapeutic targets.) On the
model-interpretation side, the latent factors could be seen as
cellular pathways or processes; one could use the sparse list of
genes associated with each to try to relate it to known pathways,
and use the sparse list of samples to see in which subsets of
patients each factor is operative.
My reason for wanting a weighted method: the data is binary (0=not
mutated, 1=mutated). Not all mutations are of equal functional
significance. Some are "drivers" to which the model should be fit,
and some are "passengers" to which it shouldn't, with a whole
spectrum in between. And conversely, non-mutations carry varying
degrees of information; not all of them should be treated simply as
missing values, but not all should be fit to as explicit zero values
either. There are decent methods for assigning a likelihood of a
mutation being a driver, and I thought perhaps I could use these as
weights a la weighted ALS. Perhaps the features that are used to
identify drivers could be fed into gensgd directly, rather than
performing a separate weighting step and a factorization step... I'm
not sure.
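To make the weighting idea concrete, here is a minimal sketch of what
I have in mind, in Python with NumPy. It is only an illustration under
my own assumptions, not GraphChi's implementation: the function names
weighted_als and weighted_loss, the weight matrix W (for example,
driver likelihoods for mutated entries and a small constant for
non-mutated ones), and the values of rank and lam are all
placeholders. It also leaves out the sparsity penalty entirely.

```python
import numpy as np

def weighted_loss(X, W, U, V, lam):
    """Weighted squared error plus L2 regularization on both factors."""
    resid = X - U @ V.T
    return np.sum(W * resid ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2))

def weighted_als(X, W, rank=2, lam=0.1, n_iters=20, seed=0):
    """Alternating least squares where entry (i, j) is fit with weight W[i, j].

    X : (samples x genes) binary mutation matrix, 0 = not mutated, 1 = mutated
    W : same shape as X, non-negative confidence weights (hypothetical input)
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    U = rng.normal(scale=0.1, size=(n_samples, rank))  # sample factors
    V = rng.normal(scale=0.1, size=(n_genes, rank))    # gene factors
    reg = lam * np.eye(rank)
    for _ in range(n_iters):
        # Fix V; solve a small weighted ridge regression for each sample.
        for i in range(n_samples):
            Vw = V * W[i][:, None]            # scale each gene's row by its weight
            U[i] = np.linalg.solve(Vw.T @ V + reg, Vw.T @ X[i])
        # Fix U; solve the same kind of problem for each gene.
        for j in range(n_genes):
            Uw = U * W[:, j][:, None]         # scale each sample's row by its weight
            V[j] = np.linalg.solve(Uw.T @ U + reg, Uw.T @ X[:, j])
    return U, V
```

With W set high for likely drivers and low for likely passengers and
for uninformative non-mutations, the fit concentrates on the entries
I trust most, which is the behavior I am after.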
Thanks again for your help,
James
P.S. Here is the weighted sparse NMF paper I mentioned, though on
second reading I am less confident that I should bring it to your
attention.