Word2Vec-Model to Dataframe: Building Machine Learning Applications with Sparkling Water

oben...@gmail.com

Jun 27, 2016, 7:52:12 PM

to H2O Open Source Scalable Machine Learning - h2ostream

Hello,

I am currently working on a Sparkling Water application, and I am a total beginner in Spark and H2O.

I am following the H2O university lesson 3, "CraigslistJobTitlesApp", and I understand the approach.

In my work I want to build a word2vec model on a dataset (a text file such as .txt) and then get a DataFrame with the words and their vector representations, to use as input for the H2O cloud.

Now I am stuck on the transformation from the word2vec model to the DataFrame.
What is the best approach to define a DataFrame with word + vector?
Do you have other solutions or suggestions?
I tried to find the answer in the source code (it is written in Scala), but had no success.

Environment:
Ubuntu 14.04 virtual machine
Spark 1.6.1 (build for Hadoop 2.6)
Sparkling Water 1.6.5
Python 2.7

Code:

from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pysparkling import *
import h2o

from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row

# Starting h2o application on spark cluster
hc = H2OContext(sc).start()

# Loading input file
inp = sc.textFile("examples/custom/text8.txt").map(lambda row: row.split(" "))

# building the word2vec model with a vector size of 10
word2vec = Word2Vec()
model = word2vec.setVectorSize(10).fit(inp)

# Sanity check
model.findSynonyms("property",5)

# assign vector representation to variable (a word -> vector map, not a DataFrame)
mVec = model.getVectors()

# Transform words (the keys of the map) into an RDD
inp_data = sc.parallelize(list(mVec))
inp_data = inp_data.map(lambda row: row.split(" "))

# edit column name to word
data = inp_data.map(lambda p: Row(word = p[0]))
df = sqlContext.createDataFrame(data)
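
A minimal sketch of the missing step (word + vector in one row), assuming `model.getVectors()` behaves like a mapping from each word to a sequence of floats. The helper names `vectors_to_rows` and `column_names` are hypothetical, and the `sample` dictionary is made-up data that stands in for the real model output so the snippet runs without Spark. The Spark and H2O calls appear only as comments, on the assumption that `sqlContext` and `hc` exist as in the code above.

```python
def vectors_to_rows(vectors):
    """Flatten a word -> vector mapping into (word, v0, v1, ...) tuples."""
    rows = []
    for word in vectors:
        # Cast every component to a plain Python float so the schema
        # can be inferred as one string column plus numeric columns.
        rows.append((word,) + tuple(float(x) for x in vectors[word]))
    return rows


def column_names(vector_size):
    """Column names: 'word' followed by one name per vector component."""
    return ["word"] + ["v%d" % i for i in range(vector_size)]


# Stand-in for model.getVectors(); the values are invented for illustration.
sample = {"property": [0.1, 0.2, 0.3], "house": [0.4, 0.5, 0.6]}

rows = vectors_to_rows(sample)
names = column_names(3)

# In the pyspark shell this could then become (assumption, not run here):
#   df = sqlContext.createDataFrame(rows, names)
#   h2o_frame = hc.as_h2o_frame(df)
```

One numeric column per vector component is chosen here instead of a single array column, on the assumption that flat numeric columns are easier to hand over to H2O.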