Recommendation based on 'user-view' events


mr

unread,
Aug 16, 2017, 3:29:47 AM8/16/17
to actionml-user
Hi All,
I want to ask how to build recommendations based on user views rather than user ratings. Most of the samples I have studied depend on ratings.
In my case, I have tons of data on users viewing pages (in my case, articles). A user can view an article more than once, and everything is recorded in log files.

Here's the illustration:

Article
---------
Art-A
Art-B
Art-C

User
-------
u1
u2
u3

Article-User
------------------
u1  Art-A   8/16/2017 05:00:01
u1  Art-B   8/14/2017 05:00:01
u2  Art-A   8/16/2017 05:00:01
u1  Art-A   7/3/2017 05:00:01
u3  Art-C   7/16/2017 05:00:01

What I want to do is recommend what a user should read based on this data. My questions are:
1. Is it possible to use the Universal Recommender for view-based recommendations?
2. If yes, should I create a separate event for the same article read by the same user?
3. As the data grows every day, what is the strategy for training?

Thank you!



Pat Ferrel

unread,
Aug 16, 2017, 12:18:48 PM8/16/17
to mr, actionml-user
Short answer: yes, article views, or better yet reads, are how you should use the UR.

1. Is it possible to use the Universal Recommender for view-based recommendations?

Yes, but if you can tell the difference between "view" and "read", that would be better. A read might be detected by watching whether the user scrolls to the bottom, how long they stay on an article page, or some other signal.

Views are a good substitute and can be used together with reads and shares to get the best quality recommendations from your data.

2. If yes, should I create a separate event for the same article read by the same user?

Multiple reads are not used; I'm not sure what they would mean (maybe the user only read half and came back). So if you send multiple events for the same user and article, we will use only one. You may be able to limit unneeded data by sending only one, but it has no other effect.

3. As the data grows every day, what is the strategy for training?

The answer to this depends on the lifetime of an article. If you have newsy articles that are usually only interesting for a few days or weeks, you might want to limit the age of a recommendation, but the fact that the same people read the same articles has importance even if you do not recommend old items. You will also need to limit how long you keep events; no one wants to keep them forever, so we have a separate template called the db-cleaner that can be scheduled to trim events to some duration every so often. This is done by duration, so if new users become active the data may continue to grow, but that is a good problem and you can then scale to handle the new users.

Longer rant about “ratings”...

Ratings are seldom used anymore for recommendations. They are all but useless for several technical reasons. The main reason they were used is the Netflix Prize, which still influences people, but even Netflix doesn't use ratings anymore.

The reason ratings are terrible is that everyone rates things differently, and predicting ratings does not relate very well to what people would actually like to read or watch. For instance, I would seldom rate a comedy act a 5 but I often like to watch comedy acts; a rating-based recommender would never give me what I want. This is a fundamental problem with ratings: a recommender is trying to predict a very specific type of human preference, so do you want to predict ratings, or what people would watch? To predict what they would watch, record watches, or something they have watched 90% of the way through (discarding the credits at the end). Likewise with reading articles: you want to record exactly what you want to predict or encourage, and I think that is reading and liking the article. You could argue that reading is all you care about.

That said, the UR can also take other behavior and correlate it with "read", like "view", which is a weaker indicator than "read". Things like "share" are also good, since the user read the article and liked it well enough to recommend it to someone else.
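For illustration, sending these indicator events might look like the sketch below. It uses the PredictionIO Python SDK and assumes an Event Server at localhost:7070 with a placeholder access key; it is just one way to wire this up, not code from this thread.

import predictionio

# Assumed setup: Event Server on localhost:7070 and your app's access key.
client = predictionio.EventClient(
    access_key="YOUR_ACCESS_KEY",
    url="http://localhost:7070",
)

# Primary indicator: the user opened/viewed an article.
client.create_event(
    event="view",
    entity_type="user",
    entity_id="u1",
    target_entity_type="item",
    target_entity_id="Art-A",
)

# Optional stronger indicators, if you can detect them (e.g. scrolled to the
# bottom, or shared the article); each event name would also be listed in
# eventNames in engine.json.
client.create_event(
    event="read",
    entity_type="user",
    entity_id="u1",
    target_entity_type="item",
    target_entity_id="Art-A",
)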



mr

unread,
Aug 17, 2017, 4:07:02 AM8/17/17
to actionml-user
Thank you Pat for the answer. Please see my comments and questions below:

1. Views are a good substitute and can be used together with reads and shares to get the best quality recommendations from your data.
Right now we only have "opened" or "viewed" data in the log files; we haven't tracked reads (e.g. using analytics). So based on your recommendation, this would be something like
create_event(user-id,"view",article-id)

2.
Multiple reads are not used; I'm not sure what they would mean (maybe the user only read half and came back). So if you send multiple events for the same user and article, we will use only one. You may be able to limit unneeded data by sending only one, but it has no other effect.
It is hard for me to filter which data has already been sent to the event server if I have to check whether the same user read the same article multiple times (e.g. on different days). So I think I will just bulk-load all of my data into the event server.

3. The answer to this depends on the lifetime of an article. If you have newsy articles that are usually only interesting for a few days or weeks, you might want to limit the age of a recommendation, but the fact that the same people read the same articles has importance even if you do not recommend old items. You will also need to limit how long you keep events; no one wants to keep them forever, so we have a separate template called the db-cleaner that can be scheduled to trim events to some duration every so often. This is done by duration, so if new users become active the data may continue to grow, but that is a good problem and you can then scale to handle the new users.
Our articles are not news; they are forum-style posts. So we want to give recommendations even for old articles.
Will this scenario work:
1. load daily data to the event server,
2. train every night,
3. run the db-cleaner every week to sanitize the data.

The data to be kept in the event server are
event($set,user,user-id)
event($set,article,article-id)
event(view,user,user-id,article,article-id)
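About the bulk load mentioned in point 2: one way to prepare the data is to write one JSON event per line (the same format the event server stores, shown later in this thread) and load it with pio import. A rough sketch in Python, assuming a hypothetical tab-separated log of user-id, article-id, timestamp:

import json
from datetime import datetime

# Assumed input: article_views.tsv with lines like "u1<TAB>Art-A<TAB>8/16/2017 05:00:01".
# Output: events.json, loadable with: pio import --appid <your-app-id> --input events.json
with open("article_views.tsv") as src, open("events.json", "w") as dst:
    for line in src:
        user_id, article_id, ts = line.rstrip("\n").split("\t")
        event = {
            "event": "view",
            "entityType": "user",
            "entityId": user_id,
            "targetEntityType": "item",
            "targetEntityId": article_id,
            # ISO-8601 event time; "+07:00" assumes the log times are UTC+7.
            "eventTime": datetime.strptime(ts, "%m/%d/%Y %H:%M:%S").isoformat() + "+07:00",
        }
        dst.write(json.dumps(event) + "\n")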

thank you


Pat Ferrel

unread,
Aug 17, 2017, 10:59:03 AM8/17/17
to mr, actionml-user
Answers below

On Aug 17, 2017, at 1:07 AM, mr <ridw...@gmail.com> wrote:

Thank you Pat for the answer. Please see my comments and questions below:

1. Views are a good substitute and can be used together with reads and shares to get the best quality recommendations from your data.
Right now we only have "opened" or "viewed" data in the log files; we haven't tracked reads (e.g. using analytics). So based on your recommendation, this would be something like
create_event(user-id,"view",article-id)

Yes.

2.
Multiple reads are not used; I'm not sure what they would mean (maybe the user only read half and came back). So if you send multiple events for the same user and article, we will use only one. You may be able to limit unneeded data by sending only one, but it has no other effect.
It is hard for me to filter which data has already been sent to the event server if I have to check whether the same user read the same article multiple times (e.g. on different days). So I think I will just bulk-load all of my data into the event server.

OK, this should work fine; the de-duplication is done in the algorithm.

3. The answer to this depends on the lifetime of an article. If you have newsy articles that are usually only interesting for a few days or weeks, you might want to limit the age of a recommendation, but the fact that the same people read the same articles has importance even if you do not recommend old items. You will also need to limit how long you keep events; no one wants to keep them forever, so we have a separate template called the db-cleaner that can be scheduled to trim events to some duration every so often. This is done by duration, so if new users become active the data may continue to grow, but that is a good problem and you can then scale to handle the new users.
Our articles are not news; they are forum-style posts. So we want to give recommendations even for old articles.
Will this scenario work:
1. load daily data to the event server, It is best to have real-time input about user events. Train daily, but if we know what a user is looking at in real time, we can make recommendations based on the user's clicks in real time. Typically events come in from an app server in real time, not from daily log processing. Logs will work, just not as well.

2. train every night,
3. run the db-cleaner every week to sanitize the data.

The data to be kept in the event server are
event($set,user,user-id) no need
event($set,article,article-id) no need
event(view,user,user-id,article,article-id) the UR expects "user" and "item"; the id can be an article-id, but when forming the event it gets the "item" label. See the docs here:
http://actionml.com/docs/ur_input#events and read them carefully.

No, all you need is create_event(user-id,"view",article-id). There is no need to $set users or items; when the algorithm sees a user in a "view" event it will know the user exists.

thank you

Things to consider:
  • Do you have categories of articles or any tags for content type? This may be useful.
  • You did not answer my question about the "lifetime" of an article. Are they long-lived like educational material, or of only short-lived value like news? This information will help tune the algorithm.

mr

unread,
Aug 17, 2017, 12:51:54 PM8/17/17
to actionml-user
  • Do you have categories of articles or any tags for content type? This may be useful.

Yes I do; we have categories and tags for each article. And we have other event data, like the categories a user subscribed to (or unsubscribed from).

How will this affect the data to be sent to the event server?

  • You did not answer my question about the "lifetime" of an article. Are they long-lived like educational material, or of only short-lived value like news? This information will help tune the algorithm.
In my understanding, long-lived means they are accessible anytime: after 1 year, 2 years, or more. But judging from your examples, it is more about the content: news content can become obsolete within days or hours, while educational content lasts longer. So to answer your question, it is mixed; some articles are long-lived (content-wise), some are shorter. Our audience posts anything they like, depending on their interests.

thank you!

mr

unread,
Aug 26, 2017, 1:49:20 AM8/26/17
to actionml-user
Hi Pat,
I have tried the UR with my data. 

1. My data, after I imported it, looks like this:

[{"eventId":"c5028577586d4b71a7c8872d56a5898d","event":"view","entityType":"user","entityId":"ZwZ1A1mTPKijBHEMKGh6Ag==","targetEntityType":"item","targetEntityId":"a1e8642eb682338b456c","properties":{},"eventTime":"2017-08-16T03:25:31.000+07:07","creationTime":"2017-08-25T06:38:06.189Z"},

 {"eventId":"976a81c1d4c34916827a3daec1b1dff1","event":"view","entityType":"user","entityId":"m9FbdWFIDd3BAg==","targetEntityType":"item","targetEntityId":"00000000000015118551","properties":{},"eventTime":"2017-08-16T03:25:34.000+07:07","creationTime":"2017-08-25T06:38:06.642Z"},

 {"eventId":"a2d2ce99021945aabab92d1998375a8c","event":"view","entityType":"user","entityId":"aeyiL3BkA5z+Ag==","targetEntityType":"item","targetEntityId":"1de71a9975a5668b457b","properties":{},"eventTime":"2017-08-16T03:27:13.000+07:07","creationTime":"2017-08-25T06:38:07.092Z"},

 {"eventId":"749637af485043a0826cd3f842d1c028","event":"view","entityType":"user","entityId":"WRyhiXDJKHrRAg==","targetEntityType":"item","targetEntityId":"60461a9975506a8b4568","properties":{},"eventTime":"2017-08-16T03:27:20.000+07:07","creationTime":"2017-08-25T06:38:07.544Z"},

 {"eventId":"a3d248c9c7e74ce79812d1cb5a053eef","event":"view","entityType":"user","entityId":"kf+Rc30EF/KtAg==","targetEntityType":"item","targetEntityId":"ce95529a456f208b4567","properties":{},"eventTime":"2017-08-16T03:27:26.000+07:07","creationTime":"2017-08-25T06:38:07.995Z"},

 {"eventId":"935e15f086ab42b888b7988e1f1fe30b","event":"view","entityType":"user","entityId":"yqW3DxavBRbZAg==","targetEntityType":"item","targetEntityId":"d1ae3c118ea32a000001","properties":{},"eventTime":"2017-08-16T03:28:18.000+07:07","creationTime":"2017-08-25T06:38:08.225Z"},

 {"eventId":"3718fc0d4d624d76ac14fb17d104f07e","event":"view","entityType":"user","entityId":"ZwZ1A1mTV2WiL3BkKB4+Ag==","targetEntityType":"item","targetEntityId":"64631ed7194248000002","properties":{},"eventTime":"2017-08-16T03:28:20.000+07:07","creationTime":"2017-08-25T06:38:08.456Z"},

 {"eventId":"8c757a4886944b0ca1b536d0f2f72b29","event":"view","entityType":"user","entityId":"hUZPPVKHFNw0Ag==","targetEntityType":"item","targetEntityId":"41f432e2e62b4a8b456b","properties":{},

 ...

 ...

 ... and so on, up to 10,000 events



2. My engine.json is like this (I just followed the defaults from the documentation here: http://actionml.com/docs/ur_config):

{
  "comment": " This config file uses default settings for all but the required values see README.md for docs",
  "id": "default",
  "description": "Default settings",
  "engineFactory": "com.actionml.RecommendationEngine",
  "datasource": {
    "params": {
      "name": "kfur",
      "appName": "kfur",
      "eventNames": ["view"]
    }
  },
  "sparkConf": {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.executor.memory": "4g",
    "es.index.auto.create": "true"
  },
  "algorithms": [
    {
      "comment": "simplest setup where all values are default, popularity based backfill, must add eventsNames",
      "name": "ur",
      "params": {
        "appName": "kfur",
        "indexName": "urindex",
        "typeName": "items",
        "comment": "must have data for the first event or the model will not build, other events are optional",
        "eventNames": ["view"]
      }
    }
  ]
}


3. After that I ran pio train and pio deploy.

4. Then I tried to run queries, but every query returns the same result list with 0.0 scores.

existing user
 curl -H "Content-Type: application/json" -d '{"user": "PKijBHEMKGh6Ag=="}' http://localhost:8000/queries.json 

 return


{"itemScores":[{"item":"/myforum/subscribe","score":0.0},{"item":"38221a9975a9668b4592","score":0.0},{"item":"b6dca2c06efe4d8b4567","score":0.0},{"item":"69901ee5dfba048b4569","score":0.0},{"item":"9c759e7404a6528b456c","score":0.0},{"item":"00000000000012361728","score":0.0},{"item":"73921854f7d45b8b4567","score":0.0},{"item":"340492523303578b4567","score":0.0},{"item":"3e421cbfaa1a0b8b458d","score":0.0},{"item":"ae809a0951d1388b4567","score":0.0},{"item":"d67c9a095184578b4569","score":0.0},{"item":"6f31162ec253288b456c","score":0.0},{"item":"25871854f793488b4570","score":0.0},{"item":"62aeddd770f7088b4578","score":0.0},{"item":"6942c0cb1775678b4568","score":0.0},{"item":"de0da2c06e76728b456d","score":0.0},{"item":"5f25c0d77046038b456c","score":0.0},{"item":"7af256e6afa7378b4567","score":0.0},{"item":"72b49a0951f1378b456d","score":0.0},{"item":"c2b6ddd77055578b4581","score":0.0}]}


non-existing user

curl -H "Content-Type: application/json" -d '{    "user": "anyuser-not-in-the-data"}' http://localhost:8000/queries.json

return

{"itemScores":[{"item":"/myforum/subscribe","score":0.0},{"item":"38221a9975a9668b4592","score":0.0},{"item":"b6dca2c06efe4d8b4567","score":0.0},{"item":"69901ee5dfba048b4569","score":0.0},{"item":"9c759e7404a6528b456c","score":0.0},{"item":"00000000000012361728","score":0.0},{"item":"73921854f7d45b8b4567","score":0.0},{"item":"340492523303578b4567","score":0.0},{"item":"3e421cbfaa1a0b8b458d","score":0.0},{"item":"ae809a0951d1388b4567","score":0.0},{"item":"d67c9a095184578b4569","score":0.0},{"item":"6f31162ec253288b456c","score":0.0},{"item":"25871854f793488b4570","score":0.0},{"item":"62aeddd770f7088b4578","score":0.0},{"item":"6942c0cb1775678b4568","score":0.0},{"item":"de0da2c06e76728b456d","score":0.0},{"item":"5f25c0d77046038b456c","score":0.0},{"item":"7af256e6afa7378b4567","score":0.0},{"item":"72b49a0951f1378b456d","score":0.0},{"item":"c2b6ddd77055578b4581","score":0.0}]}


Did I do anything wrong?
(By the way, what value should go in datasource->params->name in engine.json?)

thank you!

Pat Ferrel

unread,
Aug 26, 2017, 11:44:22 AM8/26/17
to mr, actionml-user
How many events are there for PKijBHEMKGh6Ag? If they have viewed only one item, it is not recommended by default because they have already viewed it.

You are getting recommendations from the popularity model, both for the user with no data and for PKijBHEMKGh6Ag. When there are no events for a user, the UR falls back to popularity (items with more views, in this case) and the score is 0. If the score is non-zero, the recs are being personalized using the user's behavior (events). The same thing occurs if there isn't enough data for a user to return items they haven't viewed: if they got 2 recs from personalization but those items were already viewed by the user, they would not be returned. You can turn this off, but often this is what you want.

To test for recs, find a user with 10 events and make a query for that user. If you get all 0 scores, then I'd be suspicious of something being off in your data or config.

datasource->params->name is not used, as far as I know. It's like a comment that reminds you of what data you are using and doesn't relate to any other name.


mr

unread,
Aug 27, 2017, 11:51:48 PM8/27/17
to actionml-user
Hi Pat,
You're right. I tried a user who has 12 events; here's the result.


user 4wWdcG+XKa0+Ag== has 12 events

 userid                         | itemid
------------------------------------------------------------------------
 4wWdcG+XKa0+Ag== | 84ea1cbfaaae128b4583
 4wWdcG+XKa0+Ag== | 4e5c5a516312258b4567
 4wWdcG+XKa0+Ag== | ee2b56e6afc3788b456c
 4wWdcG+XKa0+Ag== | 84ea1cbfaaae128b4583
 4wWdcG+XKa0+Ag== | de0512e257363f8b456b
 4wWdcG+XKa0+Ag== | ee2b56e6afc3788b456c
 4wWdcG+XKa0+Ag== | ef46582b2e9a5b8b456b
 4wWdcG+XKa0+Ag== | de0512e257363f8b456b
 4wWdcG+XKa0+Ag== | cb9e9a0951e5408b4569
 4wWdcG+XKa0+Ag== | ef46582b2e9a5b8b456b
 4wWdcG+XKa0+Ag== | c201902cfed51b8b4568
 4wWdcG+XKa0+Ag== | ae95c1d770bd1e8b4567


Recommendation
--------------------------
{"itemScores":[
{"item":"9f6d31e2e62d508b4569","score":0.5877036452293396},
{"item":"20af56e6af962e8b4567","score":0.5514767169952393},
{"item":"d8812e04c8437c8b456d","score":0.31430208683013916},
{"item":"c743d675d4ba498b456c","score":0.3030799329280853},
{"item":"dc83dac13ec6118b456b","score":0.3030799329280853},
{"item":"21231cbfaaa7208b4582","score":0.29070526361465454},
{"item":"34e52e04c89a3e8b4567","score":0.2778777480125427},
{"item":"dff2a2c06e71308b456d","score":0.27573835849761963},
{"item":"cd8c98e31b41728b456c","score":0.26849600672721863},
{"item":"4a489252338e578b4569","score":0.26849600672721863},
{"item":"cef431e2e652468b456f","score":0.030123908072710037},
{"item":"079060e24bcf028b4568","score":0.030123908072710037},
{"item":"f34dded770860d8b456c","score":0.030123908072710037},
{"item":"edf6c1cb17ee318b4568","score":0.030123908072710037},
{"item":"ee9aa2c06ed5308b456c","score":0.0298556350171566},
{"item":"09bda09a394f5b8b4568","score":0.0298556350171566},
{"item":"ca8e12e25706068b456b","score":0.028945349156856537},
{"item":"05eadad770577a8b4567","score":0.028945349156856537},
{"item":"d52e1a9975ae358b456b","score":0.028945349156856537},
{"item":"0542642eb63e2a8b456e","score":0.028945349156856537}]}

For a user who has fewer than 8 events, the UR shows no personalized results (the user gets the default recommendations).

My questions:
1. How do I set the bar lower? Instead of a minimum of 8 events, I want to be able to predict with 3 events, for instance.
2. Based on your earlier recommendation, events should be stored to the event server in real time; what should the training strategy be?

Here's my data distribution for this sample

Number of events | Occurrences
-----------------|------------
   1             |  1784
   2             |   291
   3             |    89
   4             |    26
   5             |    23
   6             |    25
   7             |     7
   8             |     7
   9             |     3
  10             |     2
 >10             |    12

This is for 10,000 events. There is one userid that has 6,608 events.


Thank you!

mr

unread,
Aug 28, 2017, 12:14:24 AM8/28/17
to actionml-user
Sorry, let me rephrase this one:

1. How do I set the bar lower? Instead of a minimum of 8 events, I want to be able to predict with 3 events, for instance.

I want to be able to predict for a user who has a minimum of 3 events.
 

Pat Ferrel

unread,
Aug 28, 2017, 1:51:44 PM8/28/17
to mr, actionml-user
There is no "bar" you can set so easily. The issue is how many co-occurrences and cross-occurrences there are with other people's events and how strongly they correlate according to a correlation metric (LLR). It could be as low as 1 event or much higher. Don't try to filter the events; let the UR go through its math and make the decision, that is what the algorithm does.

To test how many users are getting personalized recs, get a list of all users and, after training, query for all of them to see what percentage gets personalized (non-zero scored) recs.
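For instance, a quick sketch of that check, assuming the engine is deployed at localhost:8000 as in the curl queries above and that user_ids is a list pulled from your own data:

import requests

def personalized_fraction(user_ids, url="http://localhost:8000/queries.json"):
    """Return the fraction of users whose recs contain at least one non-zero score."""
    personalized = 0
    for uid in user_ids:
        resp = requests.post(url, json={"user": uid, "num": 10})
        scores = [r["score"] for r in resp.json().get("itemScores", [])]
        if any(s > 0.0 for s in scores):
            personalized += 1
    return personalized / len(user_ids) if user_ids else 0.0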

BTW, item-based recs, meaning "people who viewed this item also viewed these other items", will work before individual recs do, because they are not personalized and need less data. So if you can use them, do. The data you have may not support personalized recs with only one indicator event, so also look for more. For instance, if you have categories, you can trigger category-preference events when a user views something. This gives you more events for the same action and may improve both item-based and user-based recommendations.
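A sketch of that idea, assuming the PredictionIO Python SDK client from the earlier sketch and a hypothetical secondary event name "category-pref" (which you would also add to eventNames in engine.json, after the primary "view"):

import predictionio  # assumed: same SDK and Event Server setup as in the earlier sketch

def record_view(client, user_id, article_id, categories):
    # Primary indicator: the view itself.
    client.create_event(
        event="view",
        entity_type="user",
        entity_id=user_id,
        target_entity_type="item",
        target_entity_id=article_id,
    )
    # Secondary indicator: a preference for each of the article's categories.
    # Note the "item" id here is the category name, not an article id.
    for category in categories:
        client.create_event(
            event="category-pref",
            entity_type="user",
            entity_id=user_id,
            target_entity_type="item",
            target_entity_id=category,
        )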


mr

unread,
Aug 29, 2017, 9:34:31 PM8/29/17
to actionml-user
Hi Pat,
What is the range of "score" in the results? I assumed it is between 0 and 1; however, I received scores greater than 1. Is that OK?

$ curl -H "Content-Type: application/json" -d '{"user":"Jr8X5DyGCji1Ag==", "num": 10}' http://localhost:8000/queries.json
{"itemScores":[
{"item":"cbc01ee5df9e608b456c","score":1.775160551071167},
{"item":"b12f162ec2d3188b456c","score":1.7025551795959473},
{"item":"81d1162ec2e24b8b4570","score":1.409639835357666},
{"item":"f8a98907e74c278b45a1","score":1.39519202709198},
{"item":"00000000000016227992","score":1.3620442152023315},
{"item":"a97956e6afb9238b456d","score":1.3620442152023315},
{"item":"dbff582b2ef9708b4567","score":1.3594446182250977},
{"item":"ac2c0f8b4669658b477c","score":1.3594446182250977},
{"item":"00000000000004321262","score":1.3539255857467651},
{"item":"00000000000016332320","score":1.3288136720657349}

Pat Ferrel

unread,
Aug 30, 2017, 8:04:35 AM8/30/17
to mr, actionml-user
The score is really only useful for ranking items. As you add events and business rules, the score will have different ranges. The only way to get an idea of how well it will perform is to do a cross-validation test, or better yet an A/B test.

The score is the sum of dot products of the user history vector (boolean values) with the user behavior vectors of the items. The ones returned are the top-k.

More explanation here: http://actionml.com/blog/cco
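As a toy illustration of that ranking idea only (not the actual UR/CCO implementation, which builds the item vectors from LLR-tested co- and cross-occurrences):

import numpy as np

# Boolean user history over 5 items (1 = the user viewed that item).
user_history = np.array([1, 0, 1, 0, 0])

# One row per candidate item: which of the 5 items co-occur with it in the model.
item_vectors = np.array([
    [0, 1, 1, 0, 0],   # item A
    [1, 0, 1, 1, 0],   # item B
    [0, 0, 0, 1, 1],   # item C
])

scores = item_vectors @ user_history     # one dot product per candidate item -> [1 2 0]
top_k = np.argsort(scores)[::-1][:2]     # rank by score and keep the top-k (B, then A)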
