EsHadoopIllegalArgumentException - Cannot detect ES version


shish...@gmail.com

Apr 27, 2020, 1:03:37 AM
to actionml-user
Hi 

I am running Harness from the harness-docker-compose repository.


I create an engine with the following configuration:

{
  "engineId": "test_ur",
  "engineFactory": "com.actionml.engines.ur.UREngine",
  "sparkConf": {
    "master": "local",
    "spark.driver.memory": "3g",
    "spark.executor.memory": "1g",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator",
    "spark.kryo.referenceTracking": "false",
    "spark.kryoserializer.buffer": "300m",
    "spark.es.index.auto.create": "true",
    "spark.es.nodes": "localhost",
    "es.nodes": "localhost",
    "spark.es.nodes.wan.only": "true",
    "es.nodes.wan.only": "true"
  },
  "algorithm": {
    "indicators": [
      { "name": "purchase" },
      { "name": "view" },
      { "name": "category-pref" }
    ],
    "num": 4
  }
}
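
I create the engine by POSTing this JSON to the Harness REST API, with something like the following (the default port 9090 and the /engines endpoint are assumed here):

curl -X POST http://localhost:9090/engines \
  -H "Content-Type: application/json" \
  -d @test_ur.json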


I add nearly 250 events to the database and then run a training job. I get the following error in the Harness logs:

04:45:19.004 ERROR NetworkClient     - Node [localhost:9200] failed (java.net.ConnectException: Connection refused (Connection refused)); no other nodes left - aborting...
04:45:19.011 ERROR URAlgorithm       - Spark computation failed for engine test_ur with params {{"engineId":"test_ur","engineFactory":"com.actionml.engines.ur.UREngine","sparkConf":{"master":"local","spark.driver.memory":"3g","spark.executor.memory":"1g","spark.serializer":"org.apache.spark.serializer.KryoSerializer","spark.kryo.registrator":"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator","spark.kryo.referenceTracking":"false","spark.kryoserializer.buffer":"300m","spark.es.index.auto.create":"true","spark.es.nodes":"localhost","es.nodes":"localhost","spark.es.nodes.wan.only":"true","es.nodes.wan.only":"true"},"algorithm":{"indicators":[{"name":"purchase"},{"name":"view"},{"name":"category-pref"}],"num":4}}}
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'


Elasticsearch is working and available on port 9200; I have even connected to it with a Python client just to make sure it works.
The Spark config for ES matches what is suggested in some of the other answers:

"spark.es.nodes": "localhost",
"es.nodes":"localhost",
"spark.es.nodes.wan.only": "true",
"es.nodes.wan.only":"true"




The Spark job is able to create indices, although their sizes are only a few hundred bytes. Listing them (via the _cat/indices API) produces the following:

yellow open test_ur_1587957559524  x0FasDlKQoiPzcsH7Y1lbw 1 1 0 0  283b  283b
yellow open test_ur_1587962717198  4GwAdFlrQLOr9IMqQJg-pA 1 1 0 0  283b  283b
yellow open test_ur_1587954743979  HI4adyyyRQm25gVBIf1wRw 1 1 0 0  283b  283b
yellow open test_ur_1587945057452  gWGq94QlRBm948zar3eTEg 1 1 0 0  283b  283b

After this it produces the error shown above. How is it able to create these indices but not write any data to them?

Thanks.




Pat Ferrel

Apr 27, 2020, 11:58:54 AM
to shish...@gmail.com, actionml-user
Look in the Harness logs to see why Spark jobs fail. 99% of the time the cause is not enough memory, but the ERRORs in the logs can be any of a variety of things. Spark uses in-memory computation to speed up training, and when it runs out of memory it throws exceptions.

As I have said many times, docker-compose is limited in scaling and so is not recommended for production deployments. Ideally each service should run in an independent deployment so they can be scaled independently. We see most scaling needs in Spark, but performance can also be boosted by scaling Mongo and ES.
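
For example, memory is raised through the engine's sparkConf; the values below are illustrative, not a sizing recommendation:

"sparkConf": {
  "master": "local",
  "spark.driver.memory": "6g",
  "spark.executor.memory": "6g"
}

(With "master": "local" the driver does the work, so spark.driver.memory matters most.)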

shish...@gmail.com

Apr 27, 2020, 12:29:41 PM
to actionml-user
The Spark job does not fail.

{
  "jobId": "f28cd967-90cd-4e4d-b800-9ac1d6e3fec6",
  "status": {
    "name": "successful"
  },
  "comment": "Spark job",
  "createdAt": "2020-04-27T04:44:56.985Z",
  "completedAt": "2020-04-27T04:45:19.030Z"
}


In fact, I see the following popRank values in the logs too:

04:45:15.830 INFO  URAlgorithm       - RankRDDs[1]
popRank
(4,5.0)
(14,8.0)
(19,8.0)
(7,7.0)
(42,8.0)
(36,5.0)
(6,4.0)
(37,4.0)
(45,4.0)
(2,13.0)
(16,14.0)
(41,7.0)


It is the write to ES that fails for some reason. ES is working fine: I can access it from the browser and also from Python, which can write to it as well. Only Harness is unable to write to it, except for creating those byte-sized indices.

I have also added all the configuration needed to write to ES, but I still can't figure out what is wrong. This is what is in the Harness logs:

04:45:17.191 INFO  DAGScheduler      - Job 19 finished: collect at URModel.scala:81, took 1.103642 s
04:45:17.198 INFO  ElasticSearchClient$$anon$1 - Create new index: test_ur_1587962717198, List(popRank, category, purchase, id), Map(id -> (keyword,true), category-pref -> (keyword,true), category -> (keyword,true), purchase -> (keyword,true), popRank -> (float,false), view -> (keyword,true))
04:45:18.465 WARN  RestClient        - request [PUT http://elasticsearch:9200/test_ur_1587962717198?include_type_name=true] returned 1 warnings: [299 Elasticsearch-7.6.0-7f634e9f44834fbc12724506cc1da681b0c3b1e3 "[types removal] Using include_type_name in create index requests is deprecated. The parameter will be removed in the next major version."]
04:45:18.468 INFO  ElasticSearchClient$$anon$1 - Number of ES connections for saveToEs: 1
04:45:18.995 INFO  HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
04:45:18.996 INFO  HttpMethodDirector - Retrying request
04:45:18.997 INFO  HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
04:45:18.999 INFO  HttpMethodDirector - Retrying request
04:45:19.000 INFO  HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
04:45:19.002 INFO  HttpMethodDirector - Retrying request
04:45:19.004 ERROR NetworkClient     - Node [localhost:9200] failed (java.net.ConnectException: Connection refused (Connection refused)); no other nodes left - aborting...
04:45:19.011 ERROR URAlgorithm       - Spark computation failed for engine test_ur with params {{"engineId":"test_ur","engineFactory":"com.actionml.engines.ur.UREngine","sparkConf":{"master":"local","spark.driver.memory":"3g","spark.executor.memory":"1g","spark.serializer":"org.apache.spark.serializer.KryoSerializer","spark.kryo.registrator":"org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator","spark.kryo.referenceTracking":"false","spark.kryoserializer.buffer":"300m","spark.es.index.auto.create":"true","spark.es.nodes":"localhost","es.nodes":"localhost","spark.es.nodes.wan.only":"true","es.nodes.wan.only":"true"},"algorithm":{"indicators":[{"name":"purchase"},{"name":"view"},{"name":"category-pref"}],"num":4}}}
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:79)
at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:76)
at org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:56)
at com.actionml.core.search.elasticsearch.ElasticSearchClient.hotSwap(ElasticSearchSupport.scala:379)
at com.actionml.engines.ur.URModel.save(URModel.scala:83)
at com.actionml.engines.ur.URAlgorithm$$anonfun$train$1.apply(URAlgorithm.scala:296)
at com.actionml.engines.ur.URAlgorithm$$anonfun$train$1.apply(URAlgorithm.scala:255)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Success.map(Try.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[localhost:9200]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:160)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:432)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:428)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:388)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:392)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:168)
at org.elasticsearch.hadoop.rest.RestClient.mainInfo(RestClient.java:745)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:330)
... 20 common frames omitted






Pat Ferrel

Apr 27, 2020, 2:08:00 PM
to shish...@gmail.com, actionml-user
The log is saying that the Spark job failed to launch because the ES version could not be detected. Are you using the latest image for harness:develop?

Pat Ferrel

Apr 27, 2020, 2:18:45 PM
to shish...@gmail.com, actionml-user
Try `docker-compose pull` to get the latest images

shish...@gmail.com

Apr 27, 2020, 2:41:31 PM
to actionml-user
Hi 

I tried `docker-compose pull`; it still gives the same issue with the latest image.

But then I tried running each service separately and was able to get training and serving to work. I guess there is some issue with the docker-compose image.
Also, I would suggest that if a job actually fails, it should not be marked `successful`; the job status should report it as `failed`.

I can create an issue on GitHub describing the problem.

Thanks.


Pat Ferrel

Apr 27, 2020, 2:49:05 PM
to shish...@gmail.com, actionml-user
Ok, please do create an issue.

It is very difficult to check for all the conditions that can result in a failed job. The job runs asynchronously, and until the job is created (which relies on Spark accepting it) it cannot be tagged as "failed".

Make sure the job you see as “successful” has the same job-id as the one that failed. We would expect that there is no job-id until the job is accepted by Spark.

If it does have the same job-id, then we have a problem in the status reporting that we should be able to fix, so add that information to the issue report.

Thanks

Pat Ferrel

Apr 27, 2020, 7:43:58 PM
to shish...@gmail.com, actionml-user
The error: 

ERROR NetworkClient     - Node [localhost:9200] failed (java.net.ConnectException: Connection refused (Connection refused)); no other nodes left - aborting...

indicates that the CONTAINER cannot connect to ES at its "localhost:9200" address. The container's "localhost" is not the same as the host's localhost; containers work somewhat like VMs. For a container-to-container connection use the container name, in this case "elasticsearch:9200", not "localhost:9200". These container names are maintained by the docker-compose network almost as if they were DNS names.
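
A quick way to see the difference (the Harness service name "harness", and curl being available inside the image, are assumptions here):

# From the host, the published port works:
curl http://localhost:9200

# Inside a container, "localhost" is the container itself, so this fails:
docker-compose exec harness curl http://localhost:9200

# Container to container, use the compose service name:
docker-compose exec harness curl http://elasticsearch:9200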

TL;DR: set "spark.es.nodes": "elasticsearch" when using docker-compose. You may want to brush up on how docker-compose maintains its pseudo-network.

es.nodes is not used anymore; only keys that start with "spark." are needed.
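
Applied to the config above, the relevant part of sparkConf becomes (other keys unchanged, the plain es.* keys dropped):

"sparkConf": {
  "spark.es.index.auto.create": "true",
  "spark.es.nodes": "elasticsearch",
  "spark.es.nodes.wan.only": "true"
}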

Also be aware that the docker-compose.yml references a tag for each image. The published image can change even though the tag stays the same; docker-compose pull tells the system to refresh all images that are out of date. Another docker/docker-compose detail you will want to understand is how containers are instantiated. We include the watchtower container to refresh images automatically in some cases.
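
A typical refresh sequence looks like this:

docker-compose pull     # fetch images whose published content changed under the same tag
docker-compose up -d    # recreate any containers whose image was updated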

Also be aware that, by default, our docker-compose project launches images from our develop branch, which does not contain released code. This is meant for people who need to run pre-release code.

If you run docker-compose in production, you should fork the project and pin versions to known image tags like release numbers. This is left for you to do as you wish.
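
For example, in a forked docker-compose.yml (the image name and tag below are illustrative, not real release coordinates):

services:
  harness:
    image: actionml/harness:0.5.1   # a pinned release tag instead of a floating develop tag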

shish...@gmail.com

Apr 27, 2020, 9:10:09 PM
to actionml-user
Thanks a lot. 

I changed it to "elasticsearch" and it works now. I had tried a few different combinations of IP addresses, but none of them worked. Since it was able to create those byte-sized indices, I assumed ES was accessible from the containers but for some reason could not be written to. (In hindsight, the logs show the difference: the index-creation PUT went to elasticsearch:9200 via Harness's own REST client, while saveToEs tried localhost:9200.)

Thanks for all the help.



Pat Ferrel

Apr 28, 2020, 12:20:15 PM
to shish...@gmail.com, actionml-user
Docker, and its more complex cousin Kubernetes, both deliver many benefits over old-style native host installations, but they come with a new set of concepts and tools.

Docker-compose makes installation super easy compared with native installs, but this ease can be deceptive; there is a cost.

For instance (concrete equivalents are sketched after this list):
 - you won't ssh to an IP address; you'll use docker or k8s to log in to a container or run a command in it
 - you won't tail -f logs; you'll use docker-compose or k8s to view them
 - you will need to explicitly attach a service to localhost with configuration, which forwards traffic back and forth between the host and containers
 - docker-compose runs on a single machine, so all containers share the host's resources
 - containers must communicate via network interfaces that involve routing even on a single machine; in fact k8s comes with a built-in DNS resolver
 - all config for a container is passed in via environment variables set when the container is launched; these are specified in different ways for docker-compose and k8s
 - in docker-compose the same storage can be attached to more than one container and will appear in different places in the filesystems of the host and each container. This mapping of filesystems can be hard to visualize until you are familiar with it, and because the mapping is controlled by configuration, it changes with the config.
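
Concrete equivalents for the first few points (the service name "harness" is assumed):

# instead of ssh-ing into a host:
docker-compose exec harness bash

# instead of tail -f on a host log file:
docker-compose logs -f harness

# attaching a service to the host's localhost, in docker-compose.yml
# (port numbers illustrative):
#   ports:
#     - "9090:9090"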

There are many other differences — too many to note here.

We encourage anyone using docker-compose to not underestimate these differences in operation. Do some reading if you are unfamiliar with container technology.



shish...@gmail.com

Apr 28, 2020, 2:42:30 PM
to actionml-user
Thanks a lot for this quick primer on containers. I will definitely read up more on the technology.


