Re: Issue on running Gerrit Analytics ETL job in docker


Fabio Ponciroli

Oct 29, 2018, 7:33:48 PM
to sher...@gmail.com, repo-d...@googlegroups.com, Luca Milanesio, syntonyze, galar...@gmail.com
Hi Shiping,
the ETL job runs inside a Docker container, so the address you are passing in ES_HOST (127.0.0.1) refers to localhost inside the Docker container itself. In your case Elasticsearch is running on your host machine, so you need to set ES_HOST to your host's IP address.

If you are using a Mac you can do it with docker.for.mac.localhost; otherwise just specify the IP of your host (I am not sure whether there is an equivalent for Windows/Unix).
I also suggest you remove the ETL image you currently have, to make sure you pull the latest one.

Try to run the following:

docker rmi gerritforge/spark-gerrit-analytics-etl:latest  # Remove docker image
docker run -ti --rm -e ES_HOST=docker.for.mac.localhost -e GERRIT_URL="http://xdb-dev.alibaba.net:8080" -e ANALYTICS_ARGS="--since 2000-06-01 --aggregate email_hour -e gerrit/analytics" gerritforge/spark-gerrit-analytics-etl:latest  # Use ES_HOST=<your_host_ip> if you are not running on MacOS 
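
For Linux hosts there is no built-in alias in this Docker version, but containers on the default bridge can usually reach the host through the docker0 gateway. A rough sketch (the 172.17.0.1 gateway address is the common default but an assumption to verify, and Elasticsearch has to listen on that interface, not only on 127.0.0.1):

ip addr show docker0  # note the gateway IP, typically 172.17.0.1
docker run -ti --rm -e ES_HOST=172.17.0.1 -e GERRIT_URL="http://xdb-dev.alibaba.net:8080" -e ANALYTICS_ARGS="--since 2000-06-01 --aggregate email_hour -e gerrit/analytics" gerritforge/spark-gerrit-analytics-etl:latest

Another option is docker run --network host, which makes the container share the host network namespace, so 127.0.0.1 inside the container then refers to the host itself.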

Let us know if it works.

Thanks,
Fabio


On Mon, 29 Oct 2018 at 19:59, shipingc <sher...@gmail.com> wrote:
Hi,

When I tried to run the Gerrit Analytics ETL job in Docker, I got an Elasticsearch connection issue:

[shiping.chen@localhost /]$ sudo docker run -ti --rm -e ES_HOST=127.0.0.1:9200 -e GERRIT_URL="http://xdb-dev.alibaba.net:8080" -e ANALYTICS_ARGS="--since 2000-06-01 --aggregate email_hour -e gerrit/analytics" gerritforge/spark-gerrit-analytics-etl:latest
* Elastic Search Host: localhost:9200
* Analytics arguments: --since 2000-06-01 --aggregate email_hour -e gerrit/analytics
* Spark jar class: com.gerritforge.analytics.job.Main
* Spark jar path: /usr/local/spark/jars
* Waiting for Elasticsearch at http://localhost:9200 (1/30)
[... retries 2/30 through 29/30 elided ...]
* Waiting for Elasticsearch at http://localhost:9200 (30/30)
Operation timed out

Elasticsearch itself is running:

[shiping.chen@localhost /]$ curl -XGET 127.0.0.1:9200
{
  "name" : "Q6EjhhY",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "syys2GZBSuKb8_HTMpnMkw",
  "version" : {
    "number" : "6.4.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "04711c2",
    "build_date" : "2018-09-26T13:34:09.098244Z",
    "build_snapshot" : false,
    "lucene_version" : "7.4.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

I built the dockerized ETL using the "sbt docker" command in the git repo.

Could anybody kindly shed some light on what's wrong with my setup?

Best regards,
Shiping

Fabio Ponciroli

Oct 30, 2018, 3:56:53 AM
to shipingc, repo-d...@googlegroups.com, Luca Milanesio, Antonio Barone, Stefano Galarraga
Hi Shiping,
Can you share your Elasticsearch configuration? It would be useful for understanding and debugging the issue.
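
In the meantime, one thing worth checking is the network binding: by default Elasticsearch 6.x binds to localhost only, which is enough for a local curl but not for connections coming from a Docker container. A sketch of the relevant elasticsearch.yml lines (0.0.0.0 is just an example value, pick whatever interface suits your setup):

network.host: 0.0.0.0  # bind to all interfaces so the container can reach Elasticsearch
http.port: 9200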

Thanks,
Fabio

On Tue, 30 Oct 2018 at 04:47, shipingc <sher...@gmail.com> wrote:
Hi Fabio,

Thanks for the hint. Since there is no good way to reach the host's localhost from Docker on Linux, I switched to Windows. With host.docker.internal I no longer saw the "Waiting for Elasticsearch at http://localhost:9200 (1/30)" messages.
The execution got much further, but eventually it failed at:

PS C:\Users\shiping.chen\elasticsearch-6.4.1> docker run -ti --rm -e ES_HOST=host.docker.internal -e GERRIT_URL="http://xdb-dev.alibaba.net:8080" -e ANALYTICS_ARGS="--since 2018-08-03 --aggregate email_hour -e gerrit/analytics" gerritforge/spark-gerrit-analytics-etl:latest

.........
2018-10-30 03:40:28 INFO  ContextCleaner:54 - Cleaned accumulator 224
2018-10-30 03:40:29 INFO  SparkContext:54 - Starting job: runJob at EsSparkSQL.scala:101
2018-10-30 03:40:29 INFO  DAGScheduler:54 - Got job 6 (runJob at EsSparkSQL.scala:101) with 2 output partitions
2018-10-30 03:40:29 INFO  DAGScheduler:54 - Final stage: ResultStage 26 (runJob at EsSparkSQL.scala:101)
2018-10-30 03:40:29 INFO  DAGScheduler:54 - Parents of final stage: List()
2018-10-30 03:40:29 INFO  DAGScheduler:54 - Missing parents: List()
2018-10-30 03:40:29 INFO  DAGScheduler:54 - Submitting ResultStage 26 (MapPartitionsRDD[48] at rdd at EsSparkSQL.scala:101), which has no missing parents
2018-10-30 03:40:29 INFO  MemoryStore:54 - Block broadcast_9 stored as values in memory (estimated size 34.8 KB, free 366.2 MB)
2018-10-30 03:40:29 INFO  MemoryStore:54 - Block broadcast_9_piece0 stored as bytes in memory (estimated size 15.1 KB, free 366.2 MB)
2018-10-30 03:40:29 INFO  BlockManagerInfo:54 - Added broadcast_9_piece0 in memory on 9775ea57fd23:38395 (size: 15.1 KB, free: 366.3 MB)
2018-10-30 03:40:29 INFO  SparkContext:54 - Created broadcast 9 from broadcast at DAGScheduler.scala:1039
2018-10-30 03:40:29 INFO  DAGScheduler:54 - Submitting 2 missing tasks from ResultStage 26 (MapPartitionsRDD[48] at rdd at EsSparkSQL.scala:101) (first 15 tasks are for partitions Vector(0, 1))
2018-10-30 03:40:29 INFO  TaskSchedulerImpl:54 - Adding task set 26.0 with 2 tasks
2018-10-30 03:40:29 INFO  TaskSetManager:54 - Starting task 0.0 in stage 26.0 (TID 604, localhost, executor driver, partition 0, PROCESS_LOCAL, 7884 bytes)
2018-10-30 03:40:29 INFO  TaskSetManager:54 - Starting task 1.0 in stage 26.0 (TID 605, localhost, executor driver, partition 1, PROCESS_LOCAL, 7884 bytes)
2018-10-30 03:40:29 INFO  Executor:54 - Running task 0.0 in stage 26.0 (TID 604)
2018-10-30 03:40:29 INFO  Executor:54 - Running task 1.0 in stage 26.0 (TID 605)
2018-10-30 03:40:29 INFO  BlockManager:54 - Found block rdd_39_0 locally
2018-10-30 03:40:29 INFO  BlockManager:54 - Found block rdd_39_1 locally
2018-10-30 03:40:29 INFO  CodeGenerator:54 - Code generated in 39.2802 ms
2018-10-30 03:40:29 INFO  HttpMethodDirector:439 - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
2018-10-30 03:40:29 INFO  HttpMethodDirector:445 - Retrying request
2018-10-30 03:40:29 INFO  HttpMethodDirector:439 - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
2018-10-30 03:40:29 INFO  HttpMethodDirector:445 - Retrying request
2018-10-30 03:40:29 INFO  HttpMethodDirector:439 - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
2018-10-30 03:40:29 INFO  HttpMethodDirector:445 - Retrying request
2018-10-30 03:40:29 ERROR NetworkClient:144 - Node [127.0.0.1:9200] failed (Connection refused (Connection refused)); selected next node [192.168.65.2:9200]
2018-10-30 03:40:30 INFO  EsDataFrameWriter:594 - Writing to [gerrit/analytics]
2018-10-30 03:40:30 INFO  EsDataFrameWriter:594 - Writing to [gerrit/analytics]
2018-10-30 03:40:30 INFO  HttpMethodDirector:439 - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
2018-10-30 03:40:30 INFO  HttpMethodDirector:445 - Retrying request
2018-10-30 03:40:30 INFO  HttpMethodDirector:439 - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
2018-10-30 03:40:30 INFO  HttpMethodDirector:445 - Retrying request
2018-10-30 03:40:30 INFO  HttpMethodDirector:439 - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
2018-10-30 03:40:30 INFO  HttpMethodDirector:445 - Retrying request
2018-10-30 03:40:30 INFO  HttpMethodDirector:439 - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
2018-10-30 03:40:30 INFO  HttpMethodDirector:445 - Retrying request
2018-10-30 03:40:30 INFO  HttpMethodDirector:439 - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
2018-10-30 03:40:30 INFO  HttpMethodDirector:445 - Retrying request
2018-10-30 03:40:30 ERROR NetworkClient:144 - Node [127.0.0.1:9200] failed (Connection refused (Connection refused)); no other nodes left - aborting...
2018-10-30 03:40:30 INFO  HttpMethodDirector:439 - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
2018-10-30 03:40:30 INFO  HttpMethodDirector:445 - Retrying request
2018-10-30 03:40:30 ERROR NetworkClient:144 - Node [127.0.0.1:9200] failed (Connection refused (Connection refused)); no other nodes left - aborting...
2018-10-30 03:40:30 ERROR Executor:91 - Exception in task 1.0 in stage 26.0 (TID 605)
org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]]
        at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
        at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:380)
        at org.elasticsearch.hadoop.rest.RestClient.executeNotFoundAllowed(RestClient.java:388)
        at org.elasticsearch.hadoop.rest.RestClient.exists(RestClient.java:484)
        at org.elasticsearch.hadoop.rest.RestClient.indexExists(RestClient.java:479)
        at org.elasticsearch.hadoop.rest.RestClient.touch(RestClient.java:490)
        at org.elasticsearch.hadoop.rest.RestRepository.touch(RestRepository.java:352)
        at org.elasticsearch.hadoop.rest.RestService.initSingleIndex(RestService.java:612)
        at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:600)
        at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:58)
        at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101)
        at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2018-10-30 03:40:30 ERROR Executor:91 - Exception in task 0.0 in stage 26.0 (TID 604)
org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]]
        [... same stack trace as above ...]
2018-10-30 03:40:30 WARN  TaskSetManager:66 - Lost task 1.0 in stage 26.0 (TID 605, localhost, executor driver): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]]
        [... same stack trace as above ...]

2018-10-30 03:40:30 ERROR TaskSetManager:70 - Task 1 in stage 26.0 failed 1 times; aborting job
2018-10-30 03:40:30 INFO  TaskSetManager:54 - Lost task 0.0 in stage 26.0 (TID 604) on localhost, executor driver: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException (Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]] ) [duplicate 1]
2018-10-30 03:40:30 INFO  TaskSchedulerImpl:54 - Removed TaskSet 26.0, whose tasks have all completed, from pool
2018-10-30 03:40:30 INFO  TaskSchedulerImpl:54 - Cancelling stage 26
2018-10-30 03:40:30 INFO  DAGScheduler:54 - ResultStage 26 (runJob at EsSparkSQL.scala:101) failed in 0.860 s due to Job aborted due to stage failure: Task 1 in stage 26.0 failed 1 times, most recent failure: Lost task 1.0 in stage 26.0 (TID 605, localhost, executor driver): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]]
        [... same stack trace as above ...]

Driver stacktrace:
2018-10-30 03:40:30 INFO  DAGScheduler:54 - Job 6 failed: runJob at EsSparkSQL.scala:101, took 0.874579 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 26.0 failed 1 times, most recent failure: Lost task 1.0 in stage 26.0 (TID 605, localhost, executor driver): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]]
        [... same stack trace as above ...]

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
        at org.elasticsearch.spark.sql.EsSparkSQL$.saveToEs(EsSparkSQL.scala:101)
        at org.elasticsearch.spark.sql.EsSparkSQL$.saveToEs(EsSparkSQL.scala:80)
        at org.elasticsearch.spark.sql.package$SparkDataFrameFunctions.saveToEs(package.scala:48)
        at com.gerritforge.analytics.job.Job$$anonfun$saveES$1.apply(Main.scala:210)
        at com.gerritforge.analytics.job.Job$$anonfun$saveES$1.apply(Main.scala:207)
        at scala.Option.foreach(Option.scala:257)
        at com.gerritforge.analytics.job.Job$class.saveES(Main.scala:207)
        at com.gerritforge.analytics.job.Main$.saveES(Main.scala:35)
        at com.gerritforge.analytics.job.Main$.delayedEndpoint$com$gerritforge$analytics$job$Main$1(Main.scala:115)
        at com.gerritforge.analytics.job.Main$delayedInit$body.apply(Main.scala:35)
        at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
        at scala.App$class.main(App.scala:76)
        at com.gerritforge.analytics.job.Main$.main(Main.scala:35)
        at com.gerritforge.analytics.job.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]]
        [... same stack trace as above ...]
2018-10-30 03:40:30 INFO  SparkContext:54 - Invoking stop() from shutdown hook
2018-10-30 03:40:30 INFO  AbstractConnector:318 - Stopped Spark@745aef8d{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-10-30 03:40:30 INFO  SparkUI:54 - Stopped Spark web UI at http://9775ea57fd23:4040
2018-10-30 03:40:30 INFO  MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-10-30 03:40:30 INFO  MemoryStore:54 - MemoryStore cleared
2018-10-30 03:40:30 INFO  BlockManager:54 - BlockManager stopped
2018-10-30 03:40:30 INFO  BlockManagerMaster:54 - BlockManagerMaster stopped
2018-10-30 03:40:30 INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-10-30 03:40:31 INFO  SparkContext:54 - Successfully stopped SparkContext
2018-10-30 03:40:31 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-10-30 03:40:31 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-a348143a-875d-45ed-a502-9484e16859cb
2018-10-30 03:40:31 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-9c2e9a6a-5bcf-42ed-bca9-7cb55e932fd1


It still seems to be a network issue.

Any idea?

Best regards,
Shiping

Fabio Ponciroli

Nov 2, 2018, 2:36:00 PM
to shipingc, repo-d...@googlegroups.com
Hi Shiping,
Good to hear it is working! Let us know if you need any more help with it.

By the way, to simplify the setup of the whole infrastructure we have been working on this plugin: https://gerrit.googlesource.com/plugins/analytics-wizard/

Have a look and see if it is of any help.

Thanks,
Fabio

On Wed, 31 Oct 2018 at 03:14, shipingc <sher...@gmail.com> wrote:
Hi Fabio,

I eventually used a workaround: I installed Elasticsearch and Kibana on one machine and ran the Docker job on another.

The feature is very nice! Thank you very much for the nice work!

Shiping 

shipingc

Nov 5, 2018, 3:36:03 AM
to Repo and Gerrit Discussion
Hi Fabio,

I just use the default configuration. BTW, I use version 6.4.1 for both Elasticsearch and Kibana; 6.4.2 has some other startup issues.

Thanks,
Shiping 

elasticsearch.yml
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
#cluster.name: my-application
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
#node.name: node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
#path.data: /path/to/data
#
# Path to log files:
#
#path.logs: /path/to/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
#network.host: 192.168.0.1
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
#discovery.zen.minimum_master_nodes: 
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true

shipingc

Dec 20, 2018, 7:24:37 PM
to Repo and Gerrit Discussion
Hi Fabio,

It has been a while since I got this nice tool running successfully. Recently I tried to rerun the ETL in Docker to repopulate the latest data; however, the command hangs:

PS C:\Users\shiping.chen> docker run -ti --rm -e ES_HOST=30.57.186.97    -e GERRIT_URL="http://x-dev.alibaba.net:8080"  -e ANALYTICS_ARGS="--since 2018-08-03 --aggregate email_hour  -e gerrit" gerritforge/gerrit-analytics-etl-gitcommits:latest
Unable to find image 'gerritforge/gerrit-analytics-etl-gitcommits:latest' locally
latest: Pulling from gerritforge/gerrit-analytics-etl-gitcommits
4fe2ade4980c: Already exists
6fc58a8d4ae4: Already exists
ef87ded15917: Pull complete
28f8e02fea6a: Pull complete
6f3c2b9d6b74: Pull complete
8b3a5087354d: Pull complete
16fc39044a9d: Pull complete
f309e443c9d2: Pull complete
1b92c11b208f: Pull complete
Digest: sha256:bcf38217d1cd189af79fec022b3a8a5874f4825b453d7b28ec04154121073ac7
Status: Downloaded newer image for gerritforge/gerrit-analytics-etl-gitcommits:latest
* Elastic Search Host: 30.57.186.97:9200
* Analytics arguments: --since 2018-08-03 --aggregate email_hour  -e gerrit
* Spark jar class: com.gerritforge.analytics.gitcommits.job.Main
* Spark jar path: /app/analytics-etl-gitcommits-assembly.jar
Elasticsearch is up, now running spark job...
2018-12-21 00:12:05 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-12-21 00:12:06 INFO  Main$:103 - Starting analytics app with config GerritEndpointConfig(Some(http://x-dev.alibaba.net:8080),None,file:///tmp/analytics-54550104399800,Some(gerrit),Some(2018-08-03),None,Some(email_hour),None,None,None,None,None,None,None)
2018-12-21 00:12:06 INFO  SparkContext:54 - Running Spark version 2.3.2
2018-12-21 00:12:06 INFO  SparkContext:54 - Submitted application: Gerrit GitCommits Analytics ETL
2018-12-21 00:12:06 INFO  SecurityManager:54 - Changing view acls to: root
2018-12-21 00:12:06 INFO  SecurityManager:54 - Changing modify acls to: root
2018-12-21 00:12:06 INFO  SecurityManager:54 - Changing view acls groups to:
2018-12-21 00:12:06 INFO  SecurityManager:54 - Changing modify acls groups to:
2018-12-21 00:12:06 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2018-12-21 00:12:06 INFO  Utils:54 - Successfully started service 'sparkDriver' on port 36815.
2018-12-21 00:12:06 INFO  SparkEnv:54 - Registering MapOutputTracker
2018-12-21 00:12:06 INFO  SparkEnv:54 - Registering BlockManagerMaster
2018-12-21 00:12:06 INFO  BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-12-21 00:12:06 INFO  BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-12-21 00:12:06 INFO  DiskBlockManager:54 - Created local directory at /tmp/blockmgr-59e490ee-76f1-4b76-b8df-357857eb846c
2018-12-21 00:12:06 INFO  MemoryStore:54 - MemoryStore started with capacity 366.3 MB
2018-12-21 00:12:06 INFO  SparkEnv:54 - Registering OutputCommitCoordinator
2018-12-21 00:12:06 INFO  log:192 - Logging initialized @3119ms
2018-12-21 00:12:07 INFO  Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2018-12-21 00:12:07 INFO  Server:419 - Started @3276ms
2018-12-21 00:12:07 INFO  AbstractConnector:278 - Started ServerConnector@5e663be5{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-12-21 00:12:07 INFO  Utils:54 - Successfully started service 'SparkUI' on port 4040.
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6c44052e{/jobs,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4fdf8f12{/jobs/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4a8b5227{/jobs/job,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6979efad{/jobs/job/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5a6d5a8f{/stages,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4a67318f{/stages/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@315ba14a{/stages/stage,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@54e81b21{/stages/stage/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@38d5b107{/stages/pool,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6650813a{/stages/pool/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@44ea608c{/storage,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@50cf5a23{/storage/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@450794b4{/storage/rdd,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@273c947f{/storage/rdd/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@30457e14{/environment,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@632aa1a3{/executors,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@20765ed5{/executors/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3b582111{/executors/threadDump,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2899a8db{/executors/threadDump/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1e8823d2{/static,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@251ebf23{/,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@29b732a2{/api,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1162410a{/jobs/job/kill,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@b09fac1{/stages/stage/kill,null,AVAILABLE,@Spark}
2018-12-21 00:12:07 INFO  SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://96c763388076:4040
2018-12-21 00:12:07 INFO  SparkContext:54 - Added JAR file:/app/analytics-etl-gitcommits-assembly.jar at spark://96c763388076:36815/jars/analytics-etl-gitcommits-assembly.jar with timestamp 1545351127637
2018-12-21 00:12:07 INFO  Executor:54 - Starting executor ID driver on host localhost
2018-12-21 00:12:07 INFO  Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37905.
2018-12-21 00:12:07 INFO  NettyBlockTransferService:54 - Server created on 96c763388076:37905
2018-12-21 00:12:07 INFO  BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2018-12-21 00:12:07 INFO  BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, 96c763388076, 37905, None)
2018-12-21 00:12:07 INFO  BlockManagerMasterEndpoint:54 - Registering block manager 96c763388076:37905 with 366.3 MB RAM, BlockManagerId(driver, 96c763388076, 37905, None)
2018-12-21 00:12:07 INFO  BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, 96c763388076, 37905, None)
2018-12-21 00:12:07 INFO  BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, 96c763388076, 37905, None)
2018-12-21 00:12:08 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@444548a0{/metrics/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:08 INFO  GerritConnectivity:61 - Connecting to API http://x-dev.alibaba.net:8080/projects/
2018-12-21 00:12:08 INFO  Main$:142 - Loaded a list of 10 projects [GerritProject(All-Projects,All-Projects),GerritProject(test-project,test-project),GerritProject(X-DB,X-DB),GerritProject(persistent_cache,persistent_cache),GerritProject(histore,histore),GerritProject(AliSQL-8.0,AliSQL-8.0),GerritProject(All-Users,All-Users),GerritProject(X-Factory,X-Factory),GerritProject(X-DB5,X-DB5),GerritProject(newengine,newengine)]
2018-12-21 00:12:13 INFO  SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/spark-warehouse').
2018-12-21 00:12:13 INFO  SharedState:54 - Warehouse path is 'file:/spark-warehouse'.
2018-12-21 00:12:13 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@30517a57{/SQL,null,AVAILABLE,@Spark}
2018-12-21 00:12:13 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3dde5f38{/SQL/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:13 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@721fc2e3{/SQL/execution,null,AVAILABLE,@Spark}
2018-12-21 00:12:13 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@63187d63{/SQL/execution/json,null,AVAILABLE,@Spark}
2018-12-21 00:12:13 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@44864536{/static/sql,null,AVAILABLE,@Spark}
2018-12-21 00:12:14 INFO  StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2018-12-21 00:12:16 INFO  HashAggregateExec:54 - spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current version of codegened fast hashmap does not support this aggregate.
2018-12-21 00:12:17 INFO  CodeGenerator:54 - Code generated in 387.9738 ms
2018-12-21 00:12:17 INFO  HashAggregateExec:54 - spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current version of codegened fast hashmap does not support this aggregate.
2018-12-21 00:12:17 INFO  CodeGenerator:54 - Code generated in 61.008 ms
2018-12-21 00:12:17 INFO  CodeGenerator:54 - Code generated in 37.2962 ms
2018-12-21 00:12:18 INFO  ContextCleaner:54 - Cleaned accumulator 0
2018-12-21 00:12:18 INFO  CodeGenerator:54 - Code generated in 115.547 ms
2018-12-21 00:12:18 INFO  CodeGenerator:54 - Code generated in 60.2091 ms
2018-12-21 00:12:18 INFO  SparkContext:54 - Starting job: head at Main.scala:191
2018-12-21 00:12:18 INFO  DAGScheduler:54 - Registering RDD 15 (rdd at Main.scala:188)
2018-12-21 00:12:18 INFO  DAGScheduler:54 - Registering RDD 21 (keyBy at GerritEventsTransformations.scala:58)
2018-12-21 00:12:18 INFO  DAGScheduler:54 - Registering RDD 20 (keyBy at GerritEventsTransformations.scala:57)
2018-12-21 00:12:18 INFO  DAGScheduler:54 - Registering RDD 28 (groupBy at GerritEventsTransformations.scala:69)
2018-12-21 00:12:18 INFO  DAGScheduler:54 - Got job 0 (head at Main.scala:191) with 1 output partitions
2018-12-21 00:12:18 INFO  DAGScheduler:54 - Final stage: ResultStage 4 (head at Main.scala:191)
2018-12-21 00:12:18 INFO  DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 3)
2018-12-21 00:12:18 INFO  DAGScheduler:54 - Missing parents: List(ShuffleMapStage 3)
2018-12-21 00:12:18 INFO  DAGScheduler:54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[15] at rdd at Main.scala:188), which has no missing parents
2018-12-21 00:12:18 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 27.7 KB, free 366.3 MB)
2018-12-21 00:12:18 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 11.5 KB, free 366.3 MB)
2018-12-21 00:12:18 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 96c763388076:37905 (size: 11.5 KB, free: 366.3 MB)
2018-12-21 00:12:18 INFO  SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1039
2018-12-21 00:12:18 INFO  DAGScheduler:54 - Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[15] at rdd at Main.scala:188) (first 15 tasks are for partitions Vector(0, 1))
2018-12-21 00:12:18 INFO  TaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2018-12-21 00:12:18 INFO  TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 8135 bytes)
2018-12-21 00:12:18 INFO  TaskSetManager:54 - Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 8117 bytes)
2018-12-21 00:12:18 INFO  Executor:54 - Running task 0.0 in stage 0.0 (TID 0)
2018-12-21 00:12:18 INFO  Executor:54 - Running task 1.0 in stage 0.0 (TID 1)
2018-12-21 00:12:18 INFO  Executor:54 - Fetching spark://96c763388076:36815/jars/analytics-etl-gitcommits-assembly.jar with timestamp 1545351127637
2018-12-21 00:12:19 INFO  TransportClientFactory:267 - Successfully created connection to 96c763388076/172.17.0.2:36815 after 56 ms (0 ms spent in bootstraps)
2018-12-21 00:12:19 INFO  Utils:54 - Fetching spark://96c763388076:36815/jars/analytics-etl-gitcommits-assembly.jar to /tmp/spark-fbfa42a9-9a8b-4b95-b1c9-b51e0a14268f/userFiles-c6853d5a-a1e2-4ced-b56e-d22e0ee40429/fetchFileTemp7646913260148081577.tmp
2018-12-21 00:12:19 INFO  Executor:54 - Adding file:/tmp/spark-fbfa42a9-9a8b-4b95-b1c9-b51e0a14268f/userFiles-c6853d5a-a1e2-4ced-b56e-d22e0ee40429/analytics-etl-gitcommits-assembly.jar to class loader
2018-12-21 00:12:19 INFO  CodeGenerator:54 - Code generated in 15.3798 ms
2018-12-21 00:12:19 INFO  CodeGenerator:54 - Code generated in 16.9873 ms
2018-12-21 00:12:19 INFO  CodeGenerator:54 - Code generated in 9.7323 ms
2018-12-21 00:12:19 INFO  CodeGenerator:54 - Code generated in 32.58 ms
2018-12-21 00:12:54 INFO  Executor:54 - Finished task 1.0 in stage 0.0 (TID 1). 2265 bytes result sent to driver
2018-12-21 00:12:54 INFO  TaskSetManager:54 - Finished task 1.0 in stage 0.0 (TID 1) in 35756 ms on localhost (executor driver) (1/2)
// hanging here forever

I also tried the old ETL command; the result is the same:

 docker run -ti --rm -e ES_HOST=xxx.xxx.xxx.xx -e GERRIT_URL="gerrit url" -e ANALYTICS_ARGS="--since 2018-08-03 --extract-branches true --aggregate email_hour  -e gerrit/analytics" gerritforge/spark-gerrit-analytics-etl:latest



Any idea what's wrong?

Best regards,
Shiping

Fabio Ponciroli

Dec 22, 2018, 10:26:58 AM
to shipingc, Repo and Gerrit Discussion
Hi Shiping,
I'm not sure why it is hanging; I have never experienced that before. Is it happening every time?

I can't see anything strange in the logs you sent. What you could try is raising the Spark log level to see if you can capture more useful information. You can do it this way:

1) Enter the docker container: docker run -ti --rm --entrypoint /bin/bash gerritforge/gerrit-analytics-etl-gitcommits:latest 

2) Go to the /app directory: cd /app
3) Create a file called log4j.properties with the following content:

# Route every log event (level ALL) to a console appender on stderr
log4j.rootCategory=ALL, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

4) Submit the Spark job from inside Docker as follows:

spark-submit \
  --conf spark.es.nodes="30.57.186.97" \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  --class com.gerritforge.analytics.gitcommits.job.Main /app/analytics-etl-gitcommits-assembly.jar \
  --url="http://x-dev.alibaba.net:8080" --since 2018-08-03 --aggregate email_hour -e gerrit


This should run your job with the most verbose log level. Let me know how it goes.
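
Since the ALL level is extremely verbose, you may also want to redirect the output to a file (for example by appending 2> spark-debug.log to the command, since the console appender above writes to stderr) so it is easier to search afterwards.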

Thanks,
Fabio



shipingc

Jan 3, 2019, 7:45:54 PM
to Repo and Gerrit Discussion
Hi Fabio,

Thank you for the suggestions. I figured out that it simply takes time, about half an hour, but the job eventually finishes. I'm not sure why it has become so much slower than two months ago.

BTW, what is the correct procedure for updating the data in Elasticsearch? That is, to get the latest analytics data from Gerrit by running the Docker ETL while keeping the visualizations and dashboards untouched, do I need to "DELETE gerrit" before rerunning the ETL?

Best,
Shiping
...

Fabio Ponciroli

Jan 4, 2019, 5:10:31 AM
to shipingc, Repo and Gerrit Discussion
Hi Shiping,
I'm glad you have managed to get it working. 

On Fri, 4 Jan 2019 at 01:45, shipingc <sher...@gmail.com> wrote:
Hi Fabio,

Thank you for the suggestions. I figured out that it simply takes time, about half an hour, but the job eventually finishes. I'm not sure why it has become so much slower than two months ago.

How many projects and commits are you processing?
Can you confirm you have the latest version of the analytics plugin? We made some performance improvements in the latest version.
 

BTW, what is the correct procedure for updating the data in Elasticsearch? That is, to get the latest analytics data from Gerrit by running the Docker ETL while keeping the visualizations and dashboards untouched, do I need to "DELETE gerrit" before rerunning the ETL?

The easiest way is to delete the Elasticsearch index every time and recreate it from scratch. If this operation takes too long, you can play with the since/until parameters and do incremental imports of the data.
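
As an example, a full refresh could look like this (just a sketch based on the commands earlier in this thread, with placeholders for your values). The Kibana index pattern, visualizations and dashboards are stored in the .kibana index, so deleting the data index should leave them untouched:

curl -XDELETE http://<your_es_host>:9200/gerrit  # drop the old analytics data
docker run -ti --rm -e ES_HOST=<your_es_host> -e GERRIT_URL="<your_gerrit_url>" -e ANALYTICS_ARGS="--since 2018-08-03 --aggregate email_hour -e gerrit" gerritforge/gerrit-analytics-etl-gitcommits:latest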

Hope this helps.

Thanks,
Fabio
 

shipingc

Jan 4, 2019, 5:26:57 PM
to Repo and Gerrit Discussion
Hi Fabio,

We have 10 projects, about 1000 commits in total. The whole execution takes more than an hour.

My current analytics plugin is the original one that came with the Gerrit 2.14.10 installation package. To get the latest one, should I build from the stable-2.14 branch (https://gerrit.googlesource.com/plugins/analytics/)? Does that branch include your latest performance improvement changes?

I also found that data from some projects is missing, and the number of commits is sometimes incorrect.

Best,
Shiping 

Fabio Ponciroli

Jan 7, 2019, 3:13:13 PM
to shipingc, Repo and Gerrit Discussion
Hi Shiping,

On Fri, 4 Jan 2019 at 23:27, shipingc <sher...@gmail.com> wrote:
Hi Fabio,

We have 10 projects, about 1000 commits in total. The whole execution takes more than an hour.

My current analytics plugin is the original one that came with the Gerrit 2.14.10 installation package. To get the latest one, should I build from the stable-2.14 branch (https://gerrit.googlesource.com/plugins/analytics/)? Does that branch include your latest performance improvement changes?


Unfortunately, the plugin in 2.14 is not up to date. We only maintain it for versions 2.15 and 2.16 :(

I don't know the specs of the machine you are running the ETL and plugin on but, to give you an example, processing the whole Gerrit project (~48K commits) on my laptop (Mac, 2.3 GHz Intel Core i5, 16 GB RAM) takes a few minutes.

 
I also found that data from some projects is missing, and the number of commits is sometimes incorrect.

Can you provide an example of this issue and a way of reproducing it?