| To reproduce: 1. Create a gcs to gcs pipeline 2. Customize the dataproc profile to use 0 worker nodes and 1 cpu, 4gb for master node 3. Run the pipeline Running the pipeline seems to put the dataproc cluster in a bad state. If you try to SSH to the VM, it hangs forever. This also breaks the program stop mechanism and the RuntimeMonitor. The result is that the pipeline is stuck in the running state until app-fabric is restarted. A stack dump of app-fabric one thread per stuck pipeline at:
"constraint-checker-task-14" #78 prio=5 os_prio=0 tid=0x00007f7794023800 nid=0xb5 waiting on condition [0x00007f777d57f000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at java.lang.Thread.sleep(Thread.java:340)
at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.net.SocketInputStream.read(SocketInputStream.java:224)
at com.jcraft.jsch.IO.getByte(IO.java:73)
at com.jcraft.jsch.Session.connect(Session.java:263)
at com.jcraft.jsch.Session.connect(Session.java:183)
at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:85)
at io.cdap.cdap.internal.app.runtime.distributed.remote.SSHRemoteProcessController.isRunning(SSHRemoteProcessController.java:63)
at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitor.shouldRetry(RuntimeMonitor.java:175)
at io.cdap.cdap.common.service.AbstractRetryableScheduledService.runOneIteration(AbstractRetryableScheduledService.java:160)
at com.google.common.util.concurrent.AbstractScheduledService$1$1.run(AbstractScheduledService.java:170)
at com.google.common.util.concurrent.AbstractScheduledService$CustomScheduler$ReschedulableCallable.call(AbstractScheduledService.java:355)
at com.google.common.util.concurrent.AbstractScheduledService$CustomScheduler$ReschedulableCallable.call(AbstractScheduledService.java:321)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
If a program stop is attempted, an error like the following occurs:
2020-09-11 18:00:12,908 - ERROR [appfabric-executor-4963:i.c.c.c.HttpExceptionHandler@76] - Unexpected error: request=POST /v3/namespaces/default/apps/1split/workflows/DataPipelineWorkflow/runs/d9a2f91f-f457-11ea-81d3-da2e85f5d1fa/stop user=<null>:
java.util.concurrent.ExecutionException: 1 out of 1 runs of the program program:default.1split.-SNAPSHOT.workflow.DataPipelineWorkflow failed to stop
at io.cdap.cdap.internal.app.services.ProgramLifecycleService.stop(ProgramLifecycleService.java:665) ~[na:na]
at io.cdap.cdap.gateway.handlers.ProgramLifecycleHttpHandler.performRunLevelStop(ProgramLifecycleHttpHandler.java:304) ~[na:na]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_265]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_265]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_265]
at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_265]
at io.cdap.http.internal.HttpMethodInfo.invoke(HttpMethodInfo.java:87) ~[io.cdap.http.netty-http-1.5.0.jar:na]
at io.cdap.http.internal.HttpDispatcher.channelRead(HttpDispatcher.java:45) [io.cdap.http.netty-http-1.5.0.jar:na]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
at io.cdap.http.internal.NonStickyEventExecutorGroup$NonStickyOrderedEventExecutor.run(NonStickyEventExecutorGroup.java:254) [io.cdap.http.netty-http-1.5.0.jar:na]
at io.netty.util.concurrent.UnorderedThreadPoolEventExecutor$NonNotifyRunnable.run(UnorderedThreadPoolEventExecutor.java:277) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_265]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_265]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_265]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_265]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_265]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_265]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_265]
Caused by: java.lang.RuntimeException: java.net.SocketTimeoutException: Read timed out
at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[com.google.guava.guava-13.0.1.jar:na]
at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillController.terminate(RemoteExecutionTwillController.java:190) ~[na:na]
at io.cdap.cdap.internal.app.runtime.distributed.AbstractTwillProgramController.doStop(AbstractTwillProgramController.java:147) ~[na:na]
at io.cdap.cdap.internal.app.runtime.AbstractProgramController.lambda$stop$2(AbstractProgramController.java:151) ~[na:na]
... 3 common frames omitted
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.8.0_265]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[na:1.8.0_265]
at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[na:1.8.0_265]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[na:1.8.0_265]
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) ~[na:1.8.0_265]
at sun.security.ssl.InputRecord.read(InputRecord.java:503) ~[na:1.8.0_265]
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:990) ~[na:1.8.0_265]
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1388) ~[na:1.8.0_265]
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1416) ~[na:1.8.0_265]
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1400) ~[na:1.8.0_265]
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559) ~[na:1.8.0_265]
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185) ~[na:1.8.0_265]
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570) ~[na:1.8.0_265]
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) ~[na:1.8.0_265]
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) ~[na:1.8.0_265]
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352) ~[na:1.8.0_265]
at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.postCall(RuntimeMonitorClient.java:142) ~[na:na]
at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.requestStop(RuntimeMonitorClient.java:132) ~[na:na]
at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitor.requestStop(RuntimeMonitor.java:164) ~[na:na]
at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillController.terminate(RemoteExecutionTwillController.java:188) ~[na:na]
... 5 common frames omitted
I don't think this would affect 6.2.x and onward, as the Runtime system was changed to use the Dataproc jobs API instead of relying on direct SSH access. |