Gensim and Spark


Artyom Topchyan

Sep 28, 2015, 8:22:37 AM
to gensim
Hi,

So I have kind of played around with implementing some of the distributed models of gensim in Spark. It seems to work OK, so I'm wondering: is there any interest in functionality like this? Or is it not practical, given that MLlib supplies comparable algos?


Artyom

Radim Řehůřek

Sep 28, 2015, 9:43:31 AM
to gensim
On Monday, September 28, 2015 at 9:22:37 PM UTC+9, Artyom Topchyan wrote:
> Hi,
>
> So I have kind of played around with implementing some of the distributed models of gensim in Spark. It seems to work OK, so I'm wondering: is there any interest in functionality like this?

Yes, definitely. Do you have the code / experiments somewhere online?

 
> Or is it not practical, given that MLlib supplies comparable algos?

MLlib mostly supplies hype :) We've evaluated the algos available both from gensim and MLlib (LSI, LDA, word2vec) on large datasets and found critical issues in MLlib in each case, rendering the implementation unusable in practice.

-rr

 



Artyom Topchyan

Sep 28, 2015, 11:15:57 AM
to gensim
Hi,

Well, I have an implementation of LDA on Spark using gensim, but I have to iron it out before I share it. I was just wondering whether there was any interest, so that I can devote time to making it usable by other people :D. I'll keep you posted.

Radim Řehůřek

Sep 28, 2015, 12:36:16 PM
to gensim
Great :)

The sooner you share your progress the better IMO -- I'm sure other people are interested in this too, and could even lend a hand!

Radim

Artyom Topchyan

Sep 29, 2015, 6:42:31 AM
to gensim
So I have a fairly naive implementation that I did a while ago here: https://github.com/Arttii/gensim/blob/spark/gensim/models/lda_spark.py. It works, but it contains a lot of stuff that should not be there, and I have not yet tested it on, say, the Wiki dataset. This might be useless, but it's a starting point. I'm currently thinking about refactoring it into a more obvious wrapper around a normal gensim model.

Artyom