[ISSUE] (CDAP-17263) Pipeline state stuck when Dataproc cluster is too small

16 views
Skip to first unread message

Albert Shau (Jira)

unread,
Sep 11, 2020, 4:52:05 PM9/11/20
to cdap-...@googlegroups.com
Albert Shau created an issue
 
CDAP / Bug CDAP-17263
Pipeline state stuck when Dataproc cluster is too small
Issue Type: Bug Bug
Affects Versions: 6.1.4
Assignee: Trishka
Components: Cloud Provisioner
Created: 11/Sep/20 1:51 PM
Priority: Major Major
Reporter: Albert Shau

To reproduce:

1. Create a gcs to gcs pipeline
2. Customize the dataproc profile to use 0 worker nodes and 1 cpu, 4gb for master node
3. Run the pipeline

Running the pipeline seems to put the dataproc cluster in a bad state. If you try to SSH to the VM, it hangs forever. This also breaks the program stop mechanism and the RuntimeMonitor. The result is that the pipeline is stuck in the running state until app-fabric is restarted.

A stack dump of app-fabric one thread per stuck pipeline at:

"constraint-checker-task-14" #78 prio=5 os_prio=0 tid=0x00007f7794023800 nid=0xb5 waiting on condition [0x00007f777d57f000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at java.lang.Thread.sleep(Thread.java:340)
        at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.net.SocketInputStream.read(SocketInputStream.java:224)
        at com.jcraft.jsch.IO.getByte(IO.java:73)
        at com.jcraft.jsch.Session.connect(Session.java:263)
        at com.jcraft.jsch.Session.connect(Session.java:183)
        at io.cdap.cdap.common.ssh.DefaultSSHSession.<init>(DefaultSSHSession.java:85)
        at io.cdap.cdap.internal.app.runtime.distributed.remote.SSHRemoteProcessController.isRunning(SSHRemoteProcessController.java:63)
        at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitor.shouldRetry(RuntimeMonitor.java:175)
        at io.cdap.cdap.common.service.AbstractRetryableScheduledService.runOneIteration(AbstractRetryableScheduledService.java:160)
        at com.google.common.util.concurrent.AbstractScheduledService$1$1.run(AbstractScheduledService.java:170)
        at com.google.common.util.concurrent.AbstractScheduledService$CustomScheduler$ReschedulableCallable.call(AbstractScheduledService.java:355)
        at com.google.common.util.concurrent.AbstractScheduledService$CustomScheduler$ReschedulableCallable.call(AbstractScheduledService.java:321)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

If a program stop is attempted, an error like the following occurs:

2020-09-11 18:00:12,908 - ERROR [appfabric-executor-4963:i.c.c.c.HttpExceptionHandler@76] - Unexpected error: request=POST /v3/namespaces/default/apps/1split/workflows/DataPipelineWorkflow/runs/d9a2f91f-f457-11ea-81d3-da2e85f5d1fa/stop user=&lt;null&gt;:
java.util.concurrent.ExecutionException: 1 out of 1 runs of the program program:default.1split.-SNAPSHOT.workflow.DataPipelineWorkflow failed to stop
	at io.cdap.cdap.internal.app.services.ProgramLifecycleService.stop(ProgramLifecycleService.java:665) ~[na:na]
	at io.cdap.cdap.gateway.handlers.ProgramLifecycleHttpHandler.performRunLevelStop(ProgramLifecycleHttpHandler.java:304) ~[na:na]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_265]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_265]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_265]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_265]
	at io.cdap.http.internal.HttpMethodInfo.invoke(HttpMethodInfo.java:87) ~[io.cdap.http.netty-http-1.5.0.jar:na]
	at io.cdap.http.internal.HttpDispatcher.channelRead(HttpDispatcher.java:45) [io.cdap.http.netty-http-1.5.0.jar:na]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:38) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:353) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
	at io.cdap.http.internal.NonStickyEventExecutorGroup$NonStickyOrderedEventExecutor.run(NonStickyEventExecutorGroup.java:254) [io.cdap.http.netty-http-1.5.0.jar:na]
	at io.netty.util.concurrent.UnorderedThreadPoolEventExecutor$NonNotifyRunnable.run(UnorderedThreadPoolEventExecutor.java:277) [io.netty.netty-all-4.1.16.Final.jar:4.1.16.Final]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_265]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_265]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_265]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_265]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_265]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_265]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_265]
Caused by: java.lang.RuntimeException: java.net.SocketTimeoutException: Read timed out
	at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[com.google.guava.guava-13.0.1.jar:na]
	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillController.terminate(RemoteExecutionTwillController.java:190) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.distributed.AbstractTwillProgramController.doStop(AbstractTwillProgramController.java:147) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.AbstractProgramController.lambda$stop$2(AbstractProgramController.java:151) ~[na:na]
	... 3 common frames omitted
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.8.0_265]
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[na:1.8.0_265]
	at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[na:1.8.0_265]
	at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[na:1.8.0_265]
	at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) ~[na:1.8.0_265]
	at sun.security.ssl.InputRecord.read(InputRecord.java:503) ~[na:1.8.0_265]
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:990) ~[na:1.8.0_265]
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1388) ~[na:1.8.0_265]
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1416) ~[na:1.8.0_265]
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1400) ~[na:1.8.0_265]
	at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559) ~[na:1.8.0_265]
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185) ~[na:1.8.0_265]
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570) ~[na:1.8.0_265]
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498) ~[na:1.8.0_265]
	at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480) ~[na:1.8.0_265]
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352) ~[na:1.8.0_265]
	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.postCall(RuntimeMonitorClient.java:142) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitorClient.requestStop(RuntimeMonitorClient.java:132) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.monitor.RuntimeMonitor.requestStop(RuntimeMonitor.java:164) ~[na:na]
	at io.cdap.cdap.internal.app.runtime.distributed.remote.RemoteExecutionTwillController.terminate(RemoteExecutionTwillController.java:188) ~[na:na]
	... 5 common frames omitted

I don't think this would affect 6.2.x and onward, as the Runtime system was changed to use the Dataproc jobs API instead of relying on direct SSH access.

Add Comment Add Comment
 
This message was sent by Atlassian Jira (v7.13.9#713009-sha1:2a63c02)
Atlassian logo

Albert Shau (Jira)

unread,
Sep 11, 2020, 5:03:04 PM9/11/20
to cdap-...@googlegroups.com
Albert Shau commented on Bug CDAP-17263
 
Re: Pipeline state stuck when Dataproc cluster is too small

SSHRemoteProcessController.isRunning() should have a timeout for SSH, if it times out it should assume the cluster is not accessible and return that it is not running.

Similarly, if the RuntimeMonitor requests a stop but it fails because it cannot connect, it should assume there is something wrong and trigger a deprovision and mark the run as failed.

Terence Yim (Jira)

unread,
Mar 4, 2021, 2:58:37 PM3/4/21
to cdap-...@googlegroups.com
Get Jira notifications on your phone! Download the Jira Cloud app for Android or iOS
This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100154-sha1:e56b95a)
Atlassian logo

Terence Yim (Jira)

unread,
Mar 4, 2021, 2:58:46 PM3/4/21
to cdap-...@googlegroups.com
Terence Yim resolved as Fixed
 
Change By: Terence Yim
Fix versions: 6.4.0
Resolution: Fixed
Status: Open Resolved
Get Jira notifications on your phone! Download the Jira Cloud app for Android or iOS
This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100154-sha1:e56b95a)
Atlassian logo

Amit Virmani (Jira)

unread,
Mar 22, 2021, 5:58:17 PM3/22/21
to cdap-...@googlegroups.com
Amit Virmani closed an issue as Fixed
Change By: Amit Virmani
Status: Resolved Closed
Get Jira notifications on your phone! Download the Jira Cloud app for Android or iOS
This message was sent by Atlassian Jira (v1001.0.0-SNAPSHOT#100156-sha1:5ecadcd)
Atlassian logo
Reply all
Reply to author
Forward
0 new messages