k-NN algorithms implementation

Miguel Silva

unread,

Aug 10, 2020, 2:41:37 PM8/10/20

to LensKit Recommender Toolkit Development and Support

Hello,

I implemented my own item-based k-nn algorithm using adjusted-cosine as similarity measure and weighted average as aggregation function, but its performance was considerably worse than the ItemItem available in the library. I am using nnbr = 10, min_nbr = 1 and min_sim = 0.0001 in both approaches and these were the average results I obtained for five train/test user-partitions with five test predictions for each user on the ML100K dataset:

To understand if my implementation is correct, I wanted to ask you if your version differs from the baseline Item-based k-nn and if it has any other feature that might be improving the accuracy results. Do you think these results are within expected? I would really appreciate your opinion since I was expecting that the adjustment of the cosine would have a greater impacted than what it did.

Thanks in advance.

Kind Regards,

Miguel

Michael Ekstrand

unread,

Aug 10, 2020, 2:53:16 PM8/10/20

to Miguel Silva, LensKit Recommender Toolkit Development and Support

Thank you for your question, Miguel!

When centering is turned on, LensKit's item-item CF computes cosines between centered item rating vectors: we subtract the item's mean rating. This is similar to the adjusted cosine, but we subtract the item mean rather than the user mean described in Sarwar's adjusted cosine. Adjusted cosine is surprisingly ineffective compared to centered cosine; I have written more about this here: https://md.ekstrandom.net/blog/2015/06/item-similarity

I suspect there is something else going on as well, though, because when centering is turned off it's a plain cosine without any adjustment, so ItemItem and MyIBKNN should have comparable RMSE. What are you doing with users who have rated one item but not both? Are you including their ratings in the denominator of the cosine or not? We include them (more specifically, we treat any missing centered rating as a 0), which both improves the accuracy of the recommender and enables our implementation to be incredibly fast (building the similarity matrix for MovieLens 20M in 15-30 seconds).

- Michael

--
You received this message because you are subscribed to the Google Groups "LensKit Recommender Toolkit Development and Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lenskit-recsy...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lenskit-recsys/24e35c29-a15c-4d23-a565-86a200c22035o%40googlegroups.com.

--

Michael D. Ekstrand — michael...@boisestate.edu — https://md.ekstrandom.net
Assistant Professor, Dept. of Computer Science, Boise State University

People and Information Research Team (PIReT) — http://coen.boisestate.edu/piret/

I may send mail outside of working hours; I do not expect you to. He/him.

Miguel Silva

unread,

Aug 11, 2020, 10:59:52 AM8/11/20

to LensKit Recommender Toolkit Development and Support

Thank you very much for your help!

I was adjusting the cosine using the users' mean and I didn´t know that the items' mean was more effective. I read your blog post and it was very clear and informative.

Regarding the cosine implementation, I believe that I am calculating it the same way as you do. If I understand correctly, imagining that we have two item vectors A= [2,0,1,0] and B = [0,0,3,0], the cosine computation would consider it as if A would be [2,1] and B=[0,3], instead of considering the user who rated both items, [1] and [3]. Is my understanding correct?

I suppose that the difference in the implementations may be in how It behaves when the prediction is considered impossible. I read in your docs that the algorithm should return np.nan in those cases. In my case, I return np.nan when: 1) the user doesn't exist in the system; 2) the item doesn't exist in the system; 3) the algorithm did not find at least min_nbrs for that item. But how is the model supposed to behave when the active user did not rate any of the neighbourhood items?

Once again, I really appreciate your help. I am learning a lot.

Miguel

Michael Ekstrand

unread,

Aug 11, 2020, 2:52:45 PM8/11/20

to Miguel Silva, LensKit Recommender Toolkit Development and Support

Miguel,

Your description of the cosine does sound right.

For neighborhood computations, where are you starting for finding the neighbors and how are you limiting the neighborhood size? We limit _after_ matching with the user's ratings, so a neighborhood size of 20 means that it will use up to 20 neighboring items that the user has rated. If you first limit the neighborhood size, and then look for rated neighbors, then you will skip predictions for many more items.

- Michael

--
You received this message because you are subscribed to the Google Groups "LensKit Recommender Toolkit Development and Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lenskit-recsy...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/lenskit-recsys/54d09d7b-d1c4-47a7-a355-59edea284304o%40googlegroups.com.

Miguel Silva

unread,

Aug 11, 2020, 5:42:44 PM8/11/20

to LensKit Recommender Toolkit Development and Support

Hello Michael,

I reimplemented my IBKNN version following your idea and now I am achieving exactly the same accuracy results that I get with the ItemItem from your library! I was limiting the neighborhood size and then returning the user's rating average if he had not interacted with any of the neighbor items. My prior version did not need to store as many neighbors and similarities but yours clearly is the best option as a predictor.