I am currently working on a Sparkling Water application and I am a total beginner in Spark and H2O.
I am following the H2O university lesson 3 ("CraigslistJobTitlesApp") and I understand the approach.
In my work, I want to build a word2vec model on a dataset (a plain text file such as .txt) and then get a DataFrame containing the words and their vector representations, to use as input for the H2O cloud.
Now I am stuck on the transformation from the word2vec model to the DataFrame.
What is the best approach to build a DataFrame with a word column and a vector column?
Do you have any other solutions or suggestions?
I tried to find the answer in the source code (which is written in Scala), but without success.
Environment:
Ubuntu 14.04 VirtualMachine
Spark 1.6.1 (build for Hadoop 2.6)
Sparkling Water 1.6.5
Python 2.7
Code:
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec
from pysparkling import *
import h2o
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import Vectors
from pyspark.sql import Row
# Start the H2O application on the Spark cluster
# (sc and sqlContext are assumed to be the ones predefined by the pyspark shell)
hc = H2OContext(sc).start()
# Loading input file
inp = sc.textFile("examples/custom/text8.txt").map(lambda row: row.split(" "))
# building the word2vec model with a vector size of 10
word2vec = Word2Vec()
model = word2vec.setVectorSize(10).fit(inp)
# Sanity check
model.findSynonyms("property",5)
# Assign the vector representation to a variable
# (getVectors() returns a map of word -> vector, not a DataFrame)
wordVectorDF = model.getVectors()
# Transform the words (the keys of the map) into a DataFrame
inp_data = sc.parallelize(wordVectorDF)
inp_data = inp_data.map(lambda row: row.split(" "))
# Name the column "word"
data = inp_data.map(lambda p: Row(word=p[0]))
df = sqlContext.createDataFrame(data)
# I only get the words in a column, but I also want each word's vector from the model
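To make the goal concrete, here is a minimal plain-Python sketch of the result I am after. The dict `word_vectors` and the helper `to_rows` are hypothetical stand-ins for illustration, not part of Spark or H2O. I am assuming that the output of model.getVectors() can be read as a word -> vector mapping; the real object comes from the JVM and may need an explicit conversion first.

```python
# Hypothetical stand-in for model.getVectors(): word -> list of floats.
# Assumption: the real return value can be iterated like a dict.
word_vectors = {
    "property": [0.1, -0.2, 0.3],
    "land": [0.0, 0.5, -0.1],
}


def to_rows(vectors):
    """Flatten a word -> vector mapping into (word, vector) tuples, sorted by word."""
    return [(word, [float(x) for x in vec])
            for word, vec in sorted(vectors.items())]


rows = to_rows(word_vectors)
for word, vec in rows:
    print("%s %s" % (word, vec))

# With Spark available, rows of this shape could presumably be passed to
# sqlContext.createDataFrame(rows, ["word", "vector"]) (not run here).
```

So each row should carry the word together with its vector, instead of the word alone as in my current DataFrame.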