Train causes java.util.NoSuchElementException: head of empty list

David Palmer

Aug 27, 2020, 9:00:57 PM
to actionml-user

Firstly, I did see the other thread from June with this exact same exception, but the cause in my case seems to be different: the event data I have is less than a year old, and I can easily see it in the events collection in MongoDB.

My full stack trace is:

20:50:10.643 ERROR URAlgorithm       - Spark computation failed for engine test-test-org with params {{"engineId":"test-test-org","engineFactory":"com.actionml.engines.ur.UREngine","sparkConf":{"master":"local","spark.serializer":"org.apache.spark.serializer.KryoSerializer","spark.kryo.registrator":"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator","spark.kryo.referenceTracking":"false","spark.kryoserializer.buffer":"300m","spark.executor.memory":"3g","spark.driver.memory":"3g","spark.es.index.auto.create":"true","spark.es.nodes":"http://localhost:9200","spark.es.nodes.wan.only":"true"},"algorithm":{"indicators":[{"name":"purchase"},{"name":"email_open"},{"name":"email_click"},{"name":"content_click"},{"name":"like"},{"name":"rsvp"},{"name":"watch"}]}}}
java.util.NoSuchElementException: head of empty list
    at scala.collection.immutable.Nil$.head(List.scala:431)
    at scala.collection.immutable.Nil$.head(List.scala:428)
    at com.actionml.engines.ur.URAlgorithm.calcAll(URAlgorithm.scala:336)
    at com.actionml.engines.ur.URAlgorithm$$anonfun$train$1.apply(URAlgorithm.scala:267)
    at com.actionml.engines.ur.URAlgorithm$$anonfun$train$1.apply(URAlgorithm.scala:254)
    at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
    at scala.util.Try$.apply(Try.scala:192)
    at scala.util.Success.map(Try.scala:237)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
    at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

My workflow is:
  • Create engine
  • Add event data (that is less than a year old)
  • Train data (causes the above stack trace)
A snapshot of an event from MongoDB:

{
    "_id" : ObjectId("5f4854aa96669037d264bed1"),
    "eventId" : "5de346c52fc0b2eab8f722dd",
    "event" : "content_click",
    "entityType" : "user",
    "entityId" : "5c1c0ae4a1c02b0c3842042a",
    "targetEntityId" : "5de345932fc0b2eab8f722d3",
    "dateProps" : {
    },
    "categoricalProps" : {
    },
    "floatProps" : {
    },
    "booleanProps" : {
    },
    "eventTime" : ISODate("2019-12-01T04:51:17.802+0000")
}
db.getCollection("events").find({}).count()
results in: 5487 documents

Not sure what to try next.

Pat Ferrel

Aug 28, 2020, 11:19:51 AM
to David Palmer, actionml-user
There is no data for one of your indicators.
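
A minimal sketch of how to check for this, assuming the events have already been loaded as plain dictionaries with an `event` field like the MongoDB document above. The helper name `indicators_without_events` and the sample data are hypothetical; the indicator names are copied from the engine params in the error log.

```python
from collections import Counter

# Indicator names copied from the engine params in the error log above.
CONFIGURED_INDICATORS = [
    "purchase", "email_open", "email_click", "content_click",
    "like", "rsvp", "watch",
]


def indicators_without_events(indicator_names, events):
    """Return the configured indicators that have no events at all."""
    # Count events per event name; documents without an "event" field are skipped.
    counts = Counter(e["event"] for e in events if "event" in e)
    return [name for name in indicator_names if counts[name] == 0]


# Stand-in data shaped like the MongoDB document above (hypothetical ids).
sample_events = [
    {"event": "content_click", "entityId": "u1", "targetEntityId": "i1"},
    {"event": "content_click", "entityId": "u2", "targetEntityId": "i1"},
]

missing = indicators_without_events(CONFIGURED_INDICATORS, sample_events)
print(missing)
```

Any name in `missing` is an indicator that training would find no data for.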

David Palmer

Aug 29, 2020, 3:42:36 PM
to actionml-user
Thanks for the response!

So yes, my test data only has data for one indicator ("content_click"), but my real data payloads will have other indicators populated at some point. Does this mean that I can't pre-define the engine with all of the possible indicators I'll be working with?

Pat Ferrel

Aug 29, 2020, 6:10:53 PM
to David Palmer, actionml-user
The only way the engine can train a model for queries is by defining a set of events as “indicators”. Due to the math of the algorithm, all of these indicators must be defined. So you can collect any data, but you can’t train on that data unless all indicators have at least some data.

If you have no data for an indicator, why should you define it now? Another way to ask this: if you have only occasional data for an indicator, do you think it will be useful?

If you have some reason to collect data with missing indicators, can you explain it? Maybe you have not hooked up all of your indicator-generating actions yet? In that case, put a dummy event for every indicator in the test data, and training will basically ignore the indicators with minimal data.
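
The dummy-event idea can be sketched as follows. This only builds one placeholder payload per indicator, using the field names visible in the MongoDB document earlier in the thread. The function name `build_seed_events` and the ids `seed-user` and `seed-item` are hypothetical, and whether the engine needs additional fields is an assumption to check against the event format your engine expects. How the events are sent to the engine is left out.

```python
from datetime import datetime, timezone


def build_seed_events(indicator_names, user_id="seed-user", item_id="seed-item"):
    """Build one placeholder event per indicator name."""
    # A single timestamp is shared by all seed events.
    event_time = datetime.now(timezone.utc).isoformat()
    return [
        {
            "event": name,
            "entityType": "user",
            "entityId": user_id,         # hypothetical placeholder user
            "targetEntityId": item_id,   # hypothetical placeholder item
            "eventTime": event_time,
        }
        for name in indicator_names
    ]


seed_events = build_seed_events(["purchase", "email_open", "watch"])
for seed in seed_events:
    print(seed["event"], seed["entityId"], seed["targetEntityId"])
```

Using a clearly named placeholder user and item makes the seed data easy to recognize, and to filter out, later on.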

David Palmer

Aug 30, 2020, 11:17:08 AM
to actionml-user
Interesting. OK, your explanation is quite clear and I understand it now. Just to expand on my use case: the UR I'm implementing is essentially a "centralized" system, where disparate systems (unrelated to each other) will be feeding it events. Each system does slightly different things, so my thinking was that I would define a set of indicators that each of these disparate systems could "understand".

I can never guarantee when, or even if, one of these systems will provide an event for a given indicator, so this tells me I should either:

1. not provide specific indicators and just have one general indicator (I don't think I like this approach, because I feel I'd lose some of the tuning capabilities of the harness)
2. just seed the engine with some "test data" for each indicator and let the system figure things out as it should. This seems more feasible to me; it will just require some documentation on my end to explain the decision, because I could see another developer who is unfamiliar with UR wondering why test data is always added when an engine is created. Not the end of the world.

Pat Ferrel

Aug 30, 2020, 2:48:48 PM
to David Palmer, actionml-user
Sounds like you should have one engine instance for each “disparate” system. At its heart, a recommender tries to predict a user’s behavior based on the behavior of similar users. If the systems are disparate, behavior will be disparate, so finding similar users will be difficult, since each user may behave similarly to a different set of users in each disparate data set.

I’m not sure I see the logic of combining the data. I guess it might be useful to ASK whether behavior in one domain will predict behavior in another. This might be true but is in no way guaranteed. However, if you have behavior in only one domain, it is very likely we can predict unobserved (i.e. future) behavior.

On the other hand, what you do depends on how “disparate” the behavior is. For instance, you may be defining things as different that really are not, like a video view = “watch” and a product purchase = “buy”, when these are both conversions. If the conversions are analogous, treating them as the same indicator is probably better than giving them different indicators.
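
As a rough illustration of treating analogous conversions as one indicator, the sketch below renames source-specific event names before the events are stored. The shared name `conversion`, the mapping and the helper `normalize_event` are all hypothetical; only “watch” and “buy” come from the example above.

```python
# Hypothetical mapping from source-specific event names to one shared indicator.
EVENT_TO_INDICATOR = {
    "watch": "conversion",  # video view
    "buy": "conversion",    # product purchase
}


def normalize_event(event, mapping=EVENT_TO_INDICATOR):
    """Return a copy of the event with its name mapped to the shared indicator."""
    normalized = dict(event)
    # Event names that are not in the mapping are kept unchanged.
    normalized["event"] = mapping.get(event["event"], event["event"])
    return normalized


video_view = {"event": "watch", "entityId": "u1", "targetEntityId": "video-1"}
purchase = {"event": "buy", "entityId": "u2", "targetEntityId": "product-9"}
click = {"event": "content_click", "entityId": "u1", "targetEntityId": "page-3"}

normalized = [normalize_event(e) for e in (video_view, purchase, click)]
print([e["event"] for e in normalized])
```

With a mapping like this, each source system can keep its own event vocabulary while the engine is configured with the shared indicator names only.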

Can you describe more specifically what you mean by “disparate”?

