Error in running analysis tools


ra...@getamplify.com

Oct 7, 2016, 4:54:17 AM
to actionml-user
Hi 

I am getting the following error -

Traceback (most recent call last):
  File "./analysis-tools/map_test.py", line 649, in <module>
    root()
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 716, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 696, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 1060, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 889, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 534, in invoke
    return callback(*args, **kwargs)
  File "./analysis-tools/map_test.py", line 124, in split
    train_df, test_df = split_data(df)
  File "./analysis-tools/map_test.py", line 65, in split_data
    split_date = get_split_date(df, cfg.splitting.split_event, cfg.splitting.train_ratio)
  File "./analysis-tools/map_test.py", line 56, in get_split_date
    .filter(lambda x: x[1] > total_primary_events * train_ratio)
  File "/home/hduser/PredictionIO/vendors/spark-1.6.0/python/pyspark/rdd.py", line 1318, in first
    raise ValueError("RDD is empty")
ValueError: RDD is empty


Why is the RDD empty?


Thanks




Pat Ferrel

Oct 7, 2016, 11:45:38 AM
to ra...@getamplify.com, actionml-user
You have no data, perhaps a wrong path?



rasna...@gmail.com

Oct 13, 2016, 5:34:22 AM
to actionml-user, ra...@getamplify.com

Hi


Below is config.json -


{
  "engine_config": "./engine.json",

  "splitting": {
    "version": "1",
    "source_file": "hdfs://x.x.x.x:9000/cv-exp001/data",
    "train_file": "hdfs://x.x.x.x:9000/cv-exp001/train",
    "test_file": "hdfs://x.x.x.x:9000/cv-exp001/test",
    "type": "date",
    "train_ratio": 0.8,
    "random_seed": 29750,
    "split_event": "eventTime"
  },

  "reporting": {
    "file": "./report.xlsx"
  },

  "testing": {
    "map_k": 10,
    "non_zero_users_file": "./non_zero_users.dat",
    "consider_non_zero_scores_only": true,
    "custom_combos": {
      "event_groups": [["purchased", "viewed"]]
    }
  },

  "spark": {
    "master": "spark://x.x.x.x:7077"
  }
}


I have the following setup -

1) All analysis tool files, including config.json, are present in the universal recommender folder.

2) I have a remote Spark cluster, remote HBase, and a separate VM where PIO is installed.

3) I am running this command from the PIO VM -

    SPARK_HOME=/usr/local/spark PYTHONPATH=/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.9-src.zip ./map_test.py split

4) The data is also present in HDFS.

Any help? I am still getting this error.

ra...@getamplify.com

Oct 14, 2016, 3:05:51 AM
to actionml-user, ra...@getamplify.com, rasna...@gmail.com
Hi 

I am getting the error in the get_split_date(df, split_event, train_ratio=0.8) function of map_test.py.

df is populated with data, but after applying filter("event = '%s'" % split_event), an empty RDD is returned.

A few rows from df are as follows -

[
Row(creationTime='2016-10-04T10:38:09.018Z', entityId='Nexus', entityType='item', event='$set', eventId='KpjNMVrQzY2s0TZhYB3vsAAAAVYo18onkB8R6AbUL98', eventTime='2016-07-26T20:14:05.863Z', properties=Row(available=None, categories=['Tablets', 'Electronics', 'Google'], countries=None, date=None, expires=None), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 7, 26, 16, 14, 5, 863000)), 

Row(creationTime='2016-10-04T10:38:08.131Z', entityId='Nexus', entityType='item', event='$set', eventId='KpjNMVrQzY2s0TZhYB3vsAAAAVbFZeonhf7uh0Xfu0k', eventTime='2016-08-26T05:50:05.863Z', properties=Row(available=None, categories=None, countries=['Cuba'], date=None, expires=None), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 8, 26, 1, 50, 5, 863000)), 

Row(creationTime='2016-10-04T10:38:08.011Z', entityId='Nexus', entityType='item', event='$set', eventId='KpjNMVrQzY2s0TZhYB3vsAAAAVbZ_1onvHTjv5LInqI', eventTime='2016-08-30T05:50:05.863Z', properties=Row(available=None, categories=['Computers'], countries=None, date=None, expires=None), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 8, 30, 1, 50, 5, 863000)), 

Row(creationTime='2016-10-04T10:38:07.835Z', entityId='Nexus', entityType='item', event='$set', eventId='KpjNMVrQzY2s0TZhYB3vsAAAAVb21ionkejcwuSobII', eventTime='2016-09-04T20:14:05.863Z', properties=Row(available=None, categories=None, countries=['United States', 'Canada'], date=None, expires=None), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 9, 4, 16, 14, 5, 863000)), 

Row(creationTime='2016-10-04T10:38:07.701Z', entityId='Nexus', entityType='item', event='$set', eventId='KpjNMVrQzY2s0TZhYB3vsAAAAVcLb5onpiQzzGE59oM', eventTime='2016-09-08T20:14:05.863Z', properties=Row(available=None, categories=['Tablets', 'Electronics', 'Google'], countries=None, date=None, expires=None), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 9, 8, 16, 14, 5, 863000)), 

Row(creationTime='2016-10-04T10:38:09.233Z', entityId='Nexus', entityType='item', event='$set', eventId='KpjNMVrQzY2s0TZhYB3vsAAAAVePRadRkjjTYGsvboY', eventTime='2016-10-04T10:38:09.233Z', properties=Row(available='2016-10-01T15:26:05.863386+00:00', categories=['Tablets', 'Electronics', 'Google'], countries=['United States', 'Canada'], date='2016-10-03T15:26:05.863386+00:00', expires='2016-10-05T15:26:05.863386+00:00'), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 10, 4, 6, 38, 9, 233000)), 

Row(creationTime='2016-10-04T10:38:09.170Z', entityId='Surface', entityType='item', event='$set', eventId='MdgNfySNSsz0WVh1q6f3_gAAAVYMAPonmCNNl3LeJDA', eventTime='2016-07-21T05:50:05.863Z', properties=Row(available=None, categories=None, countries=['United States', 'Canada'], date=None, expires=None), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 7, 21, 1, 50, 5, 863000)), 

Row(creationTime='2016-10-04T10:38:09.065Z', entityId='Surface', entityType='item', event='$set', eventId='MdgNfySNSsz0WVh1q6f3_gAAAVYgmmons9VaP743Pv8', eventTime='2016-07-25T05:50:05.863Z', properties=Row(available=None, categories=['Tablets', 'Electronics', 'Microsoft'], countries=None, date=None, expires=None), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 7, 25, 1, 50, 5, 863000)), 

Row(creationTime='2016-10-04T10:38:08.178Z', entityId='Surface', entityType='item', event='$set', eventId='MdgNfySNSsz0WVh1q6f3_gAAAVa9KIoniI0HwgFyVNg', eventTime='2016-08-24T15:26:05.863Z', properties=Row(available=None, categories=None, countries=['Cuba'], date=None, expires=None), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 8, 24, 11, 26, 5, 863000)), Row(creationTime='2016-10-04T10:38:08.053Z', entityId='Surface', entityType='item', event='$set', eventId='MdgNfySNSsz0WVh1q6f3_gAAAVbRwfonrZtDaIOMBjU', eventTime='2016-08-28T15:26:05.863Z', properties=Row(available=None, categories=['Computers'], countries=None, date=None, expires=None), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 8, 28, 11, 26, 5, 863000)), 

Row(creationTime='2016-10-04T10:38:07.884Z', entityId='Surface', entityType='item', event='$set', eventId='MdgNfySNSsz0WVh1q6f3_gAAAVbumMonkCzyNsBeTlc', eventTime='2016-09-03T05:50:05.863Z', properties=Row(available=None, categories=None, countries=['United States', 'Canada'], date=None, expires=None), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 9, 3, 1, 50, 5, 863000)), 

Row(creationTime='2016-10-04T10:38:07.756Z', entityId='Surface', entityType='item', event='$set', eventId='MdgNfySNSsz0WVh1q6f3_gAAAVcDMjonlxlsA8Xk4c8', eventTime='2016-09-07T05:50:05.863Z', properties=Row(available=None, categories=['Tablets', 'Electronics', 'Microsoft'], countries=None, date=None, expires=None), targetEntityId=None, targetEntityType=None, Date=datetime.datetime(2016, 9, 7, 1, 50, 5, 863000))

]



In config.json, split_event = "eventTime".
For this split_event, I am getting an empty RDD.

Any help?

Thanks

ra...@getamplify.com

Oct 14, 2016, 7:11:28 AM
to actionml-user, ra...@getamplify.com, rasna...@gmail.com
Hi 

It is working now; I changed the type in the splitting section of config.json to "random".

A few doubts regarding this -

1) Why does type = "date" produce an empty RDD?
2) If type = "random", does this mean the data will be split randomly?
3) If type = "date", how does the split happen?

Alexey Pan'kov

Oct 14, 2016, 2:22:34 PM
to ra...@getamplify.com, actionml-user, rasna...@gmail.com
Hi,

The problem was the splitting event. In your case all events are «$set». You tried to split by «eventTime», but there is no event with that name; that is a field name.
The idea is that different events may be distributed unevenly in time, so you use split_event to «tell» the system which kind of events should be used to calculate the split. In most cases you should just use the primary event (the first one in the eventNames list in engine.json).
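
To see it concretely in PySpark terms (with "purchased" standing in as a placeholder for whatever primary event is listed first in eventNames in engine.json):

# Every row in the sample above has event == '$set', so filtering on the
# field name "eventTime" as if it were an event name matches nothing, and
# first() on the resulting RDD raises "RDD is empty":
df.filter("event = 'eventTime'").count()    # 0

# Filtering on a real event name (the primary event) returns rows:
df.filter("event = 'purchased'").count()    # > 0, assuming such events exist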



rasna...@gmail.com

Oct 16, 2016, 3:44:04 AM
to actionml-user, ra...@getamplify.com, rasna...@gmail.com
Hi

I just changed the splitting type to "random"; split_event is still "eventTime" and it is working.
So how does the split happen in the case of type = "random" and split_event = "eventTime"?

Alexey Pan'kov

Oct 16, 2016, 10:57:18 AM
to rasna...@gmail.com, actionml-user, ra...@getamplify.com
Hi
split_event = "eventTime" is simply wrong (eventTime is not an event name but a field name), but if type = "random" then split_event is ignored, so it does not matter what value it has.
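
For what it's worth, a random split roughly amounts to something like the following (a sketch only, assuming DataFrame.randomSplit with the train_ratio and random_seed from config.json; map_test.py's exact implementation may differ):

# 80/20 random split of the whole DataFrame; the seed makes it reproducible.
train_df, test_df = df.randomSplit([0.8, 0.2], seed=29750)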

Pat Ferrel

Oct 16, 2016, 1:12:19 PM
to rasna...@gmail.com, actionml-user, ra...@getamplify.com
The split_event should be the first/primary/conversion event in eventNames in engine.json. There are two types of splits, time-based and random. Use split_event with time-based splits; this mimics the way new events are used in a live recommender.

Random splits are for times when you do not have a reliable eventTime.
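
For the time-based case, the traceback earlier in the thread hints at the idea: keep only the rows whose event equals split_event, sort them by date, and take the first date past the train_ratio quantile as the boundary. A rough PySpark sketch (assumed logic, with "purchased" standing in for the primary event name; not map_test.py verbatim):

primary = df.filter("event = 'purchased'")      # rows of the split_event only
total_primary_events = primary.count()
indexed_dates = primary.select("Date").sort("Date").rdd.zipWithIndex()
# The first date past the train_ratio quantile becomes the split boundary.
split_date = indexed_dates.filter(lambda x: x[1] > total_primary_events * 0.8).first()[0].Date
train_df = df.filter(df.Date < split_date)      # everything before the boundary
test_df = df.filter(df.Date >= split_date)      # everything from the boundary on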

This tool is provided as-is and is used all the time by us, but it is not really designed to be a turnkey application. The data you get out of it will be hard to interpret without a lot of recommender, machine learning, and analytics experience. We don't really have the resources to walk you through that on this mailing list.

Good luck.

rasna...@gmail.com

Oct 18, 2016, 2:53:07 AM
to actionml-user, rasna...@gmail.com, ra...@getamplify.com
Hi

Is there any other mailing list for this?

ra...@getamplify.com

Nov 1, 2016, 12:02:06 PM
to actionml-user, rasna...@gmail.com, ra...@getamplify.com
Hi

The events stats sheet is not created in report.xlsx after performing the cross-validation test.
The row count in HBase is 345020.

Thanks

ra...@getamplify.com

Nov 8, 2016, 7:02:18 AM
to actionml-user, rasna...@gmail.com, ra...@getamplify.com
I ran this in a Python notebook -
date_rdd = (df
                .filter("event = '%s'" % (PRIMARY_EVENT_NAME))
                .select("Date")
                .sort("Date", ascending=True)
                .rdd)

date_rdd.collect()

gives the following data -

[Row(Date=datetime.datetime(2016, 8, 24, 20, 25, 7, 572000)),
 Row(Date=datetime.datetime(2016, 8, 25, 15, 37, 7, 572000)),
 Row(Date=datetime.datetime(2016, 8, 26, 10, 49, 7, 572000)),
 Row(Date=datetime.datetime(2016, 8, 27, 6, 1, 7, 572000)),
 Row(Date=datetime.datetime(2016, 8, 28, 1, 13, 7, 572000)),
 Row(Date=datetime.datetime(2016, 8, 28, 20, 25, 7, 572000)),
 Row(Date=datetime.datetime(2016, 9, 6, 15, 37, 7, 572000)),
 Row(Date=datetime.datetime(2016, 9, 7, 10, 49, 7, 572000)),
 Row(Date=datetime.datetime(2016, 9, 8, 6, 1, 7, 572000))]


But I am getting an error while running this -
total_primary_events = date_rdd.count()


Stack trace:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-18-84dbd0a5d19b> in <module>()
      5             .rdd)
      6 
----> 7 total_primary_events = date_rdd.count()

/home/rasna/PredictionIO/vendors/spark-1.6.2/python/pyspark/rdd.py in count(self)
   1002         3
   1003         """
-> 1004         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1005 
   1006     def stats(self):

/home/rasna/PredictionIO/vendors/spark-1.6.2/python/pyspark/rdd.py in sum(self)
    993         6.0
    994         """
--> 995         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
    996 
    997     def count(self):

/home/rasna/PredictionIO/vendors/spark-1.6.2/python/pyspark/rdd.py in fold(self, zeroValue, op)
    867         # zeroValue provided to each partition is unique from the one provided
    868         # to the final reduce call
--> 869         vals = self.mapPartitions(func).collect()
    870         return reduce(op, vals, zeroValue)
    871 

/home/rasna/PredictionIO/vendors/spark-1.6.2/python/pyspark/rdd.py in collect(self)
    769         """
    770         with SCCallSiteSync(self.context) as css:
--> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    772         return list(_load_from_socket(port, self._jrdd_deserializer))
    773 

/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/home/rasna/PredictionIO/vendors/spark-1.6.2/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/usr/local/lib/python3.5/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 37.0 failed 1 times, most recent failure: Lost task 0.0 in stage 37.0 (TID 952, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/rasna/PredictionIO/vendors/spark-1.6.2/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)



Alexey Pan'kov

Nov 8, 2016, 7:44:28 AM
to ra...@getamplify.com, actionml-user, rasna...@gmail.com
Looks like the error is self-explanatory: «Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions»
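
If it helps, one common workaround is to point the Spark workers at the same interpreter version as the notebook driver before the SparkContext is created (the path below is an assumption; it must be a Python 3.5 interpreter that actually exists on every worker node):

import os
# Must be set before the SparkContext is constructed.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"   # assumed worker interpreter path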
