Noelia Osés Fernández, PhD
Senior Researcher
no...@vicomtech.org | +34 943 30 92 30
Data Intelligence for Energy and Industrial Processes
Mahout builds the model by doing the matrix multiplication (PtP) and then calculating the LLR score for every non-zero value. We then keep the top K, or use a threshold, to decide whether to keep a value or not (both are supported in the UR). LLR is a metric for how likely it is that two events in a large group are correlated, so LLR is only used to remove weak data from the model.

Mahout builds the model, which is then put into Elasticsearch, used as a KNN (k-nearest neighbors) engine. The LLR score itself is not put into the model, only an indicator that the item survived the LLR test. The KNN is applied using the user's history as the query, finding the items that most closely match it. Since PtP has items as rows, and each row holds that item's correlating items, this "search" method works quite well: it finds items whose correlated purchases closely match the items in the user's history.

=============================== that is the simple explanation ========================================

Item-based recs take a model item (its correlated items, per the LLR test) as the query, and the results are the most similar items, meaning the items with the most similar correlating items.

If you are only using one event, the model is items in rows and items in columns: PtP. If you think it through, every purchased item is a row key, and the row contains the other items purchased along with it. LLR filters out the weakly correlating non-zero values (0 means no evidence of correlation anyway). If we didn't do this it would be a pure "cooccurrence" recommender, one of the first useful ones. But filtering on raw cooccurrence strength (PtP values without LLR applied) produces much worse results than using LLR to keep only the most highly correlated cooccurrences.
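The LLR test described here can be sketched as a small function over the 2x2 contingency table of event counts, in the style of Mahout's LogLikelihood.logLikelihoodRatio; this is a hedged sketch for intuition, not the Mahout code itself:

```python
from math import log

def xlogx(x):
    # x * log(x), with 0 * log(0) defined as 0
    return 0.0 if x == 0 else x * log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table:
    k11 = times both events occurred together
    k12 = times event B occurred without A
    k21 = times event A occurred without B
    k22 = times neither occurred
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    # Clamp guards against tiny negative values from floating-point rounding
    return max(0.0, 2.0 * (row_entropy + col_entropy - mat_entropy))
```

A perfectly uncorrelated table like `llr(5, 5, 5, 5)` scores 0, while strong cooccurrence such as `llr(10, 0, 0, 10)` scores high; the model-building step keeps only the top-K (or above-threshold) scorers per row.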
You get a similar effect with Matrix Factorization, but for various reasons you can only use one type of event.

Since LLR is a probabilistic metric that looks only at counts, it can be applied equally well to PtV (purchase, view), PtS (purchase, search terms), or PtC (purchase, category-preferences). We ran an experiment measuring Mean Average Precision for the UR on video "Likes" alone vs. "Likes" plus "Dislikes" (LtL vs. LtL and LtD), scraped from rottentomatoes.com reviews, and got a 20% lift in the MAP@k score by including the "Dislikes" data: https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/

So the benefit of LLR is that it filters weak data from the model and lets us see whether dislikes, and other events, correlate with likes. Adding this type of data, which is usually thrown away, is one of the most powerful reasons to use the algorithm. BTW, the algorithm is called Correlated Cross-Occurrence (CCO).

The benefit of using Lucene (at the heart of Elasticsearch) for the KNN query is that it is fast and takes the user's realtime events into the query, but also that it is trivial to add all sorts of business rules: give me recs based on user events, but only from a certain category; or give me recs, but only for items tagged as "in-stock". In fact the business rules can have inclusion rules and exclusion rules, and can be mixed with ANDs and ORs.

BTW there is a version ready for testing with PIO 0.12.0 and ES5 here: https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT Instructions are in the readme; notice it is in the 0.7.0-SNAPSHOT branch.
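The kind of Lucene/Elasticsearch bool query described above, with the user's history as the similarity part and business rules as filters, might be assembled roughly like this; the field names ("purchase", "category", "tags") and the helper itself are illustrative, not the UR's actual schema:

```python
def build_rec_query(user_history, category=None, require_tags=()):
    """Sketch of a recommendations query: the user's recent events become
    "should" clauses against the model's correlated-items field, and
    business rules become filters ANDed onto the result set."""
    query = {
        "bool": {
            # KNN-style similarity: items whose correlated-item lists
            # best match the user's history score highest
            "should": [{"term": {"purchase": item}} for item in user_history],
            "filter": [],
        }
    }
    if category is not None:
        # Inclusion rule: only items from this category
        query["bool"]["filter"].append({"term": {"category": category}})
    for tag in require_tags:
        # Rules can be stacked; filter clauses are effectively ANDed
        query["bool"]["filter"].append({"term": {"tags": tag}})
    return {"query": query, "size": 20}

q = build_rec_query(["iphone", "ipad"], category="electronics",
                    require_tags=("in-stock",))
```

Exclusion rules would go in a `must_not` clause the same way; the point is that Lucene evaluates the similarity scoring and the rules in a single query.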
On Nov 17, 2017, at 7:59 AM, Andrew Troemner <atro...@salesforce.com> wrote:
I'll echo Dan here. He and I went through the raw Mahout libraries called by the Universal Recommender, and while Noelia's description is accurate for an intermediate step, the indexing via Elasticsearch generates separate relevancy scores based on its Lucene indexing scheme. The raw LLR scores are used in building this process, but the final scores served up by the APIs are post-processed, and cannot be used to reconstruct the raw LLRs (to my understanding).

There are also some additional steps, including down-sampling, which scrubs out very rare combinations (which would otherwise have very high LLRs from a single observation) and partially corrects for the statistical problem of multiple detection. But the underlying logic is per Ted Dunning's research, as summarized by Noelia, and is a solid way to approach interaction effects for tens of thousands of items while including secondary indicators (like demographics or implicit preferences).
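The down-sampling step mentioned here can be approximated, for intuition only, as a minimum-count guard on cooccurrence pairs before the LLR test; Mahout's actual down-sampling also caps interactions per user and per item, and the threshold below is illustrative:

```python
def drop_rare_pairs(cooccurrence_counts, min_count=2):
    """Scrub out very rare combinations before scoring.

    A pair observed only once can score a deceptively high LLR purely by
    chance (the multiple-detection problem across millions of pairs), so
    pairs below min_count are dropped before the LLR test is applied.
    cooccurrence_counts maps (item_a, item_b) -> cooccurrence count.
    """
    return {pair: n for pair, n in cooccurrence_counts.items()
            if n >= min_count}

counts = {("a", "b"): 7, ("a", "c"): 1, ("b", "c"): 3}
kept = drop_rare_pairs(counts)  # ("a", "c") is scrubbed out
```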
ANDREW TROEMNER
Associate Principal Data Scientist | salesforce.com
Office: 317.832.4404
Mobile: 317.531.0216
To unsubscribe from this group and stop receiving emails from it, send an email to actionml-user+unsubscribe@googlegroups.com.
To post to this group, send email to actionml-user@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/actionml-user/CAA2BRS%2Boj%2BNYDmsNNd2mYM1ZC5CgWwC71W3%3DEhrO9qeOiKyWXA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
This thread is very enlightening, thank you very much! Is there a way I can see what the P, PtP, and PtL matrices of an app are? In the handmade case, for example? Are there any pio calls I can use to get these?
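For intuition about what these matrices contain, PtP can be computed by hand from a toy purchase matrix P (users in rows, items in columns); the users, items, and plain-Python matrix helpers below are made up for illustration:

```python
# Toy purchase matrix P: rows = users, columns = items.
users = ["u1", "u2", "u3"]
items = ["apple", "banana", "cherry"]
P = [
    [1, 1, 0],  # u1 bought apple and banana
    [1, 1, 1],  # u2 bought all three
    [0, 1, 1],  # u3 bought banana and cherry
]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]

# PtP[i][j] = number of users who bought both item i and item j.
# Each row is an item's cooccurrence vector; LLR then decides which
# of these non-zero entries are strong enough to keep in the model.
PtP = matmul(transpose(P), P)
```

For a cross-occurrence matrix like PtL, the second factor would be a likes matrix L with the same user rows instead of P itself.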
On 17 November 2017 at 19:52, Pat Ferrel <p...@occamsmachete.com> wrote:
To view this discussion on the web visit https://groups.google.com/d/msgid/actionml-user/CAMyseftsnWTn3UqrS5k3SgBJFgftqss6DbjLjo07FUR92HCKoA%40mail.gmail.com.