Using latent information associated with items in Mahout's SimilarityAnalysis.cooccurrences

Marius Rabenarivo

Jun 2, 2017, 2:26:12 AM
to actionml-user, us...@predictionio.incubator.apache.org
Hello everyone!

Do you have any ideas on how to use latent information associated with items, such as tags or word-vector embeddings, in Mahout's SimilarityAnalysis.cooccurrences?

Regards,

Marius

Pat Ferrel

Jun 2, 2017, 11:47:17 AM
to Marius Rabenarivo, actionml-user, us...@predictionio.incubator.apache.org
When a user expresses a preference for a tag, word, or term, as in search or even in content such as descriptions, these can be considered secondary events. In our experience the most useful are tags and search terms. Content can be used, but each term/token needs to be sent as a separate preference. Search phrases can be used as they are, though again turning them into tokens may be better.

Please look through the docs here: http://actionml.com/docs/ur or the slide deck here: https://www.slideshare.net/pferrel/unified-recommender-39986309

The major innovation of CCO, the algorithm behind the UR, is the use of these cross-domain indicators. They are not guaranteed to predict conversions, but the CCO algorithm tests them and weights them low if they do not. So we tend to test the predictive strength of an entire category of indicator and drop it if it is weak, or set a minLLR threshold to filter out weak individual indicators.

Technically these are not called latent; that term has another meaning in machine learning, having to do with latent factor analysis.
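
For readers unfamiliar with the LLR test mentioned above, here is a minimal Python sketch of the log-likelihood ratio for a 2x2 table of cooccurrence counts, followed by a threshold filter. The function names, the example counts, and the `MIN_LLR` value are illustrative assumptions and are not taken from the UR or Mahout source.

```python
import math

def x_log_x(x):
    # Convention: 0 * log(0) is treated as 0.
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: users with both the conversion and the indicator
    k12: users with the conversion but not the indicator
    k21: users with the indicator but not the conversion
    k22: users with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

# Hypothetical threshold and counts, for illustration only.
MIN_LLR = 5.0
candidates = {
    "tag-a": (100, 5, 5, 1000),  # indicator and conversion usually occur together
    "tag-b": (10, 10, 10, 10),   # indicator is independent of the conversion
}
kept = {name: llr(*k) for name, k in candidates.items() if llr(*k) >= MIN_LLR}
print(sorted(kept))
```

An indicator whose counts look independent of the conversion scores near zero and is dropped; one that cooccurs far more often than chance survives the filter.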


--
You received this message because you are subscribed to the Google Groups "actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to actionml-use...@googlegroups.com.
To post to this group, send email to action...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/actionml-user/CAC-ATVEO_YON-5E95iPJjBR-FUgEv8TQsOA0rtD-xg0u-tNA_g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Marius Rabenarivo

Jun 2, 2017, 11:56:12 AM
to Pat Ferrel, actionml-user, us...@predictionio.incubator.apache.org
So I have to send an event like category-preference for each tag associated with an item, right?

entityId: user-id
event: category-preference
targetEntityId: tag/token


Pat Ferrel

Jun 2, 2017, 12:07:59 PM
to Marius Rabenarivo, actionml-user, us...@predictionio.incubator.apache.org
Yes, each is analyzed separately as a separate event. If you are using REST you can send up to 50 events in a single array. Some SDKs may support this too.
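
As a sketch of the batching described above, the following Python builds one event per tag and splits the list into arrays of at most 50. The IDs, the `entityType`/`targetEntityType` values, and the helper names are made up for illustration. How each batch is posted (endpoint path, access key) is deliberately left out; check the event server docs for that.

```python
import json

MAX_BATCH = 50  # the per-request limit stated above

def make_tag_events(user_id, tags, event_name="category-preference"):
    # One event per tag/token; field values here are hypothetical.
    return [
        {
            "event": event_name,
            "entityType": "user",
            "entityId": user_id,
            "targetEntityType": "tag",
            "targetEntityId": tag,
        }
        for tag in tags
    ]

def batches(events, size=MAX_BATCH):
    # Yield consecutive slices of at most `size` events.
    for start in range(0, len(events), size):
        yield events[start:start + size]

events = make_tag_events("user-1", [f"tag-{i}" for i in range(120)])
payloads = [json.dumps(batch) for batch in batches(events)]
print([len(batch) for batch in batches(events)])
```

Each string in `payloads` is a JSON array that could be sent as the body of a single request.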

Marius Rabenarivo

Jun 2, 2017, 12:19:45 PM
to Pat Ferrel, actionml-user, us...@predictionio.incubator.apache.org
So the event field should be the token and the targetEntityId the item ID, right?

Pat Ferrel

Jun 2, 2017, 1:35:49 PM
to Marius Rabenarivo, actionml-user, us...@predictionio.incubator.apache.org
Please refer to the documents. The “event” is the name of the type of event, or indicator of preference; it implies the type of the targetEntityId. So a “tag-pref” event would be accompanied by targetEntityId = tag-id. This is separate from attaching “tag” properties to items with the $set event for use with filter and boost rules. One looks at the data as a possible preference indicator and the other is used to restrict results. This is why we usually name events so they sound like a user preference of some type, whereas item property values are simply item attributes, intrinsic to the items and independent of any individual user.

The event can have any name that makes sense to you.
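
To make the distinction above concrete, here is a small Python sketch of the two payload shapes: a user preference event whose targetEntityId is a tag, and a $set event that attaches tags to an item as properties. The helper names, IDs, and the `tags` property name are assumptions for illustration, not values the UR requires.

```python
import json

def tag_pref_event(user_id, tag_id):
    # A user indicator: "this user showed a preference for this tag".
    return {
        "event": "tag-pref",
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "tag",
        "targetEntityId": tag_id,
    }

def set_item_tags(item_id, tags):
    # An item attribute: intrinsic to the item, no user involved.
    return {
        "event": "$set",
        "entityType": "item",
        "entityId": item_id,
        "properties": {"tags": list(tags)},
    }

preference = tag_pref_event("user-1", "wool")
attributes = set_item_tags("item-42", ["wool", "red"])
print(json.dumps(preference, indent=2))
print(json.dumps(attributes, indent=2))
```

The first payload feeds the model as a possible indicator of preference; the second only makes the tags available to filter and boost rules at query time.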



Marius Rabenarivo

Jun 2, 2017, 10:14:54 PM
to Pat Ferrel, actionml-user, us...@predictionio.incubator.apache.org
What will be the size of the matrix if we send an event like tag-pref?
I think we will get a |U| x |T| matrix (where T is the set of all tags).

So [AtA] will be a |T| x |T| matrix, and we will take a dot product with the user history hT to get recommendations, right?

I was assuming that A should be of side |U| x |I| where I is the set of all items as it should be added to other terms of the whole enchilada formula afterwards.

Thank you for your guidance Pat.

Marius Rabenarivo

Jun 2, 2017, 10:22:06 PM
to Pat Ferrel, actionml-user, us...@predictionio.incubator.apache.org
Please correct side to size in my previous e-mail

Pat Ferrel

Jun 3, 2017, 1:14:52 PM
to Marius Rabenarivo, actionml-user, us...@predictionio.incubator.apache.org
A = history of all purchases (in the e-com case)
B = history of all tag preferences

r = [A’A]h_a + [A’B]h_b

The part in the slides about content-based recs is not needed here because you have captured them as user preferences.
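
A toy Python sketch of the equation above may help with the matrix sizes. Here A is users x items and B is users x tags, so A’A is items x items and A’B is items x tags, and both terms produce a score per item. The data is invented, and the sketch uses raw cooccurrence counts; the LLR filtering that CCO applies to each cooccurrence matrix is omitted for brevity.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    # Plain matrix product of two lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Rows are users. A: purchases of 3 items. B: preferences for 2 tags.
A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
B = [[1, 0],
     [0, 1],
     [1, 1],
     [0, 1]]

AtA = matmul(transpose(A), A)  # items x items
AtB = matmul(transpose(A), B)  # items x tags

h_a = [0, 0, 1]  # this user's purchase history, one entry per item
h_b = [1, 0]     # this user's tag-preference history, one entry per tag

# r = [A'A]h_a + [A'B]h_b, one score per item
r = [x + y for x, y in zip(matvec(AtA, h_a), matvec(AtB, h_b))]
print(r)
```

Because the primary action A appears on the left of both products, every term is indexed by item, which is why the tag term can be added to the purchase term.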



Marius Rabenarivo

Jun 3, 2017, 3:36:23 PM
to Pat Ferrel, actionml-user, us...@predictionio.incubator.apache.org
So the tag here is assumed to be a tag given by the user to an item?

I was thinking that it was some kind of tag we give to the item by some means (classification, LDA, etc.).

Pat Ferrel

Jun 4, 2017, 12:11:29 AM
to Marius Rabenarivo, actionml-user, us...@predictionio.incubator.apache.org
By purchasing an item with a tag that you have given it, the user is displaying a preference for that tag.



Marius Rabenarivo

Jun 4, 2017, 12:14:42 AM
to Pat Ferrel, actionml-user, us...@predictionio.incubator.apache.org
And what is the T in the slides for?

How can we implement it if it is not implemented yet?


Marius Rabenarivo

Jun 4, 2017, 7:09:33 AM
to Pat Ferrel, actionml-user, us...@predictionio.incubator.apache.org
IMHO, T represents tags in an anonymous tag (or property) labeling task,
and what you propose is personalized tag (or property) labeling,
as described in https://arxiv.org/pdf/1203.4487.pdf (Section 1.4.5 Emerging new classification) p. 40

Pat Ferrel

Jun 4, 2017, 2:06:51 PM
to Marius Rabenarivo, actionml-user, us...@predictionio.incubator.apache.org
No offense Marius but I wrote the slides and the equation so I do indeed know what they are saying. Whether a user writes a tag or you are detecting the user preference for a tag you wrote, they are user indicators of preference. The LLR filtering of these secondary indicators is what CCO is all about and leaves you with a model that can be compared to a user’s history and contains only indicators that correlate to some conversion behavior.

T in the "whole enchilada" is used to personalize content-based recommendations. Each row of T represents an item and its content as tokens. Tokens are stemmed, tokenized text terms, or can be entities in the item’s text (using some form of NLP), or tags, etc. TT’ then gives you, for each item, the items that are most similar in terms of whatever content you were using in T. Now you take the user’s history of content item preference (which articles did they read, for instance) and find the most similar items in TT’. These will be personalized content-based recommendations.

This is not implemented in the UR but is in the CCO tools in Mahout. The reason it is not implemented is that it still requires user history, and content-based recs are worse predictors than collaborative filtering with user history. In CF you treat the terms or tags as indicators of preference; you do not find items that are similar by content.

The personalized content-based recs may serve for edge conditions where recommending items with no usage behavior is the most common case, like news articles, where you have new items all the time with no usage events. In this case extracting something better than “bag-of-words” for content is quite important, so highly detailed user tagging or NLP techniques can greatly increase the quality of results.
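
The TT’ idea described above can be sketched in a few lines of Python. Each row of T is an item and each column a token, TT’ holds item-to-item content overlap, and the user's reading history picks out the new items closest to what they read. The data and the `recommend_new` helper are invented for illustration, and raw overlap counts stand in for whatever similarity measure a real implementation would use.

```python
def content_similarity(T):
    # TT': entry [i][j] counts the tokens shared by item i and item j.
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in T] for row_i in T]

def recommend_new(T, history, new_items):
    """Rank new items by content similarity to the items in the user's history.

    history: 0/1 vector over items the user has read.
    new_items: indices of items with no usage events yet.
    """
    sim = content_similarity(T)
    scores = [sum(s * h for s, h in zip(row, history)) for row in sim]
    ranked = sorted(new_items, key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] > 0]

# Rows: items 0-1 are old articles, items 2-3 are new. Columns: tokens.
T = [[1, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 1, 1, 0],
     [0, 0, 0, 1]]

history = [1, 0, 0, 0]  # the user read item 0 only
print(recommend_new(T, history, new_items=[2, 3]))
```

Note that the sketch still needs `history`, which is the point made above: the result is personalized only because the user already has item preferences.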

Marius Rabenarivo

Jun 4, 2017, 2:35:43 PM
to Pat Ferrel, actionml-user, us...@predictionio.incubator.apache.org
I didn't mean to tell you what it means, but I just wanted to make it clear for my part.

As I understand, the T part is a personalization that we should make if we want
to use content based information when doing recommendation.

For my use case, I want to use it to overcome the cold start problem.

I was thinking that it was already implemented as you documented it in the slides
but I didn't find tag use in the code.

Is it SimilarityAnalysis.rowSimilarity() in Mahout that implements TT'? (just to confirm)

Pat Ferrel

Jun 4, 2017, 3:05:53 PM
to Marius Rabenarivo, actionml-user, us...@predictionio.incubator.apache.org
TT’ does not solve cold start because you need user history for personalization. There are several other techniques that I’ve mentioned many times on the list that help with cold start, but TT’ is for a slightly different thing. Its use is when you have a user’s history of item preferences but the items are too old to recommend and you only want to recommend new ones with no history. If you think about news, it is close to being like this, as are patent applications, legal opinions, or judgments. To be helpful there needs to be a lot of content for each item, and you only want new things recommended.

Which cold start do you need to “solve”: new anonymous users with no history, or items with no conversions? Search the PIO list and AML group for past posts on this.

Tag use is implemented as both CF and content similarity (not TT’). If you ask for item-based recommendations and the item has no conversions, you will get popular items by default. If you boost items with the same tags as the item the user is looking at, you get mostly popular items with similar tags. If you disable the popularity part, you get items with similar tags. This requires that you attach tags to the items with $set, and your query should contain the tags (or any other properties) of the example item. There are many ways of mixing this. You could also just get recs and mix in new inventory by some small random amount. You can use different placements for these so you aren’t ruining recs with too many randomized cold items.

Anyway, the best way to do this depends on your GUI and data.
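
Two of the options above can be sketched in Python: an item-based query that boosts items sharing the example item's tags, and a helper that mixes a small random amount of new inventory into a list of recs. The query field names (`item`, `fields`, `name`, `values`, `bias`) follow my reading of the UR query docs and should be checked against them; the IDs, the bias value, and the `mix_in_new_items` helper are assumptions for illustration.

```python
import json
import random

def tag_boost_query(item_id, tags, bias=2.0):
    # Item-based query carrying the example item's tags as a boost.
    # A bias greater than 1 is assumed here to boost rather than filter.
    return {
        "item": item_id,
        "fields": [{"name": "tags", "values": list(tags), "bias": bias}],
    }

def mix_in_new_items(recs, new_items, fraction=0.1, seed=None):
    # Insert a small random sample of new items at random positions,
    # leaving the relative order of the original recs unchanged.
    rng = random.Random(seed)
    n_new = min(len(new_items), max(1, int(len(recs) * fraction)))
    mixed = list(recs)
    for item in rng.sample(list(new_items), n_new):
        mixed.insert(rng.randrange(len(mixed) + 1), item)
    return mixed

query = tag_boost_query("item-42", ["wool", "red"])
print(json.dumps(query))

recs = [f"item-{i}" for i in range(10)]
new_inventory = ["new-1", "new-2", "new-3"]
mixed = mix_in_new_items(recs, new_inventory, fraction=0.1, seed=1)
print(mixed)
```

Keeping `fraction` small, or showing the mixed-in items in a separate placement, limits how much the cold items dilute the real recs.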



Marius Rabenarivo

Jun 4, 2017, 3:14:08 PM
to Pat Ferrel, actionml-user, us...@predictionio.incubator.apache.org
Thank you very much for all these clarifications.

Yes, I have items with no conversions.
I read in the literature that content-based recs are less sensitive to the cold-start problem,
so I headed in that direction.

In a previous post you suggested using Word2Vec for items with little content attached to them.

I have already computed Word2Vec vectors for my items using a simple sum and want to use them to
do some smoothing in the sparse user-item matrix.

I was thinking that a kind of tensor operation might be used with CF and the Word2Vec vectors
attached to items.


Marius Rabenarivo

Jun 4, 2017, 4:28:08 PM
to Pat Ferrel, actionml-user, us...@predictionio.incubator.apache.org
You previously said that the combo of w2v + LDA can be combined with the existing UR but
would be a separate template add-on to create enriching events for the UR.

Can you give some guidance about how it should be implemented?

Marius Rabenarivo

Jun 4, 2017, 4:38:12 PM
to Pat Ferrel, actionml-user, us...@predictionio.incubator.apache.org