New functionality: User-KNN and Item-KNN

Danny Bickson

Oct 26, 2011, 12:57:31 PM
to GraphLab Users, Carlos Guestrin, graph...@googlegroups.com, Mohit Singh, jeff...@gmail.com
Hi!
Several of our users asked me to add User-KNN and Item-KNN (KNN = K-nearest neighbors) to
our clustering library. I have added both algorithms.
The current implementation is brute force: it compares all pairs of users/items in the
validation/training data. For moderate dataset sizes this approach makes perfect sense, and
it is what we need for the KDD CUP data. The algorithm is explained in our KDD CUP
workshop paper here: http://kddcup.yahoo.com/workshop.php# (look for the LeBuShiShu team).
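
To make the brute-force idea concrete, here is a minimal Python sketch. It is not the GraphLab code: the function names, the dict-based sparse row layout, and the choice of Euclidean distance are all assumptions made for illustration, as is the reading that each validation row is compared against every training row.

```python
import heapq
import math


def sparse_euclidean(a, b):
    """Euclidean distance between two sparse rows stored as {index: rating} dicts.

    Treating missing ratings as 0 is an assumption for this sketch.
    """
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in keys))


def brute_force_knn(training, validation, k):
    """For each validation row, return its k closest training rows.

    Every (validation, training) pair is compared, so the cost grows with
    len(validation) * len(training).
    """
    neighbors = {}
    for vid, vrow in validation.items():
        dists = ((sparse_euclidean(vrow, trow), tid) for tid, trow in training.items())
        neighbors[vid] = heapq.nsmallest(k, dists)  # list of (distance, training_id)
    return neighbors


# Toy data (hypothetical): item id -> {user index: rating}
training = {
    "item_a": {0: 5.0, 1: 3.0},
    "item_b": {0: 1.0, 2: 4.0},
    "item_c": {1: 3.0, 2: 1.0},
}
validation = {"item_x": {0: 5.0, 1: 2.0}}

result = brute_force_knn(training, validation, k=2)
```

For Item-KNN the rows would be items and the indices users; for User-KNN the roles swap. The quadratic number of comparisons is why this only suits moderate dataset sizes.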

Additionally, sample-based KNN is supported, where only x% of all the pairs are computed.
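
One way to picture the sampled variant is sketched below in Python. This is an assumption about how sampling could work, not a description of the library's internals: each pair is kept independently with a given probability, and the `sample_fraction` and `seed` parameters are hypothetical names introduced for this example.

```python
import heapq
import math
import random


def sparse_euclidean(a, b):
    """Euclidean distance between two sparse rows stored as {index: rating} dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in keys))


def sampled_knn(training, validation, k, sample_fraction, seed=0):
    """Approximate KNN that evaluates only a random subset of all pairs.

    sample_fraction is a probability in [0, 1] (hypothetical parameter).
    Returns the neighbor lists and the number of pairs actually compared.
    """
    rng = random.Random(seed)
    neighbors = {}
    compared = 0
    for vid, vrow in validation.items():
        candidates = []
        for tid, trow in training.items():
            if rng.random() >= sample_fraction:
                continue  # skip this pair; it is never compared
            compared += 1
            candidates.append((sparse_euclidean(vrow, trow), tid))
        neighbors[vid] = heapq.nsmallest(k, candidates)
    return neighbors, compared


# Toy data (hypothetical): item id -> {user index: rating}
training = {
    "item_a": {0: 5.0, 1: 3.0},
    "item_b": {0: 1.0, 2: 4.0},
    "item_c": {1: 3.0, 2: 1.0},
}
validation = {"item_x": {0: 5.0, 1: 2.0}}

approx, n_compared = sampled_knn(training, validation, k=2, sample_fraction=0.5)
```

The trade-off is that a true nearest neighbor may be missed whenever its pair is not sampled, in exchange for a roughly proportional cut in running time.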

I am thinking about implementing more sophisticated algorithms, such as KD-trees, soon.

The input to the algorithm is two files: training and validation. In the example
below they are netflix (training) and netflixe (validation).
The output is a list of the K closest neighbors for each user or item.

Here is an example run of Item-KNN on Netflix data:
glcluster netflix 5 2 0 --pmfformat=true --float=false --ncpus=8 --knn_sample_percent=0.8
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffc9bfd000
[Thread debugging using libthread_db enabled]
[New Thread 47046612228080 (LWP 27405)]
INFO:     clustering.cpp(do_main:460): Clustering Code (K-Means/Fuzzy K-means/K-Means++/LDA) written By Danny Bickson, CMU
Send bug reports and comments to danny....@gmail.com
Setting run mode Item-KNN
INFO:     advanced_config.cpp(verify_setup:57): Setting cluster initialization mode to: RANDOM
INFO:     clustering.cpp(start:305): Item-KNN starting

loading data file netflix
Loading netflix
Creating 3298163 edges (observed ratings)...
.................Loading netflixe
Creating 545177 edges (observed ratings)...
...Loading netflixt
INFO:     asynchronous_engine.hpp(run:111): Worker 0 started.
INFO:     asynchronous_engine.hpp(run:111): Worker 1 started.
INFO:     asynchronous_engine.hpp(run:111): Worker 2 started.
INFO:     asynchronous_engine.hpp(run:111): Worker 3 started.
INFO:     asynchronous_engine.hpp(run:111): Worker 4 started.
INFO:     asynchronous_engine.hpp(run:111): Worker 5 started.
INFO:     asynchronous_engine.hpp(run:111): Worker 6 started.
INFO:     asynchronous_engine.hpp(run:111): Worker 7 started.

handling validation row 95600
handling validation row 95700
handling validation row 95800
handling validation row 95900
handling validation row 96000
handling validation row 96100
handling validation row 96200
handling validation row 96300
handling validation row 96400
handling validation row 96500
handling validation row 96600
handling validation row 96700
handling validation row 96800
handling validation row 96900
handling validation row 97000
handling validation row 97100
handling validation row 97200
handling validation row 97300
handling validation row 97400
handling validation row 97500
handling validation row 97600
handling validation row 97700
handling validation row 97800
handling validation row 97900
handling validation row 98000
handling validation row 98100
handling validation row 98200
handling validation row 98300
handling validation row 98400
handling validation row 98500
handling validation row 98600
handling validation row 98700
handling validation row 98800
handling validation row 98900
handling validation row 99000
INFO:     asynchronous_engine.hpp(run:119): Worker 0 finished.
INFO:     asynchronous_engine.hpp(run:119): Worker 6 finished.
INFO:     asynchronous_engine.hpp(run:119): Worker 1 finished.
INFO:     asynchronous_engine.hpp(run:119): Worker 3 finished.
INFO:     asynchronous_engine.hpp(run:119): Worker 2 finished.
INFO:     asynchronous_engine.hpp(run:119): Worker 4 finished.
INFO:     asynchronous_engine.hpp(run:119): Worker 5 finished.
INFO:     asynchronous_engine.hpp(run:119): Worker 7 finished.

Distance statistics: min 0 max 369.21 avg 27.7443

 === REPORT FOR core() ===
[Numeric]
ncpus: 8
[Other]
affinities: false
compile_flags:
engine: async
scheduler: fifo
schedyield: true
scope: edge

 === REPORT FOR engine() ===
[Numeric]
num_edges: 0
num_syncs: 0
num_vertices: 99087
updatecount: 3561
[Timings]
runtime: 321.4 s
[Other]
termination_reason: task depletion (natural)
[Numeric]
updatecount_vector: 3561 (count: 8, min: 432, max: 450, avg: 445.125)
updatecount_vector.values: 448,432,447,445,450,444,449,446,

