EMR ETL Runner - cluster "stuck" for 19+ hours


Peter Vandenberk

Jun 24, 2014, 9:54:58 AM
to snowpl...@googlegroups.com
Hey Snowplow peeps,

Not sure if this is a Snowplow issue, or rather an AWS one, but you may be able to help me either way  :-)

Yesterday, our hourly scheduled EMR ETL Runner cron job didn't finish, apparently because the EMR cluster was "stuck" (see attached screenshot).

The AWS "Cluster Details" page isn't terribly helpful, as you can see in the screenshot, and even the log files don't really help much... I can grab and post all logs for all steps, but only the "stderr" log for the "Elasticity Setup Hadoop Debugging" step is mildly interesting:

+ /etc/init.d/hadoop-state-pusher-control stop
+ PID_FILE=/mnt/var/run/hadoop-state-pusher/hadoop-state-pusher.pid
+ LOG_FILE=/mnt/var/log/hadoop-state-pusher/hadoop-state-pusher.out
+ SVC_FILE=/mnt/var/lib/hadoop-state-pusher/run-hadoop-state-pusher
+ case $1 in
+ stop
+ echo 0
/etc/init.d/hadoop-state-pusher-control: line 35: /mnt/var/lib/hadoop-state-pusher/run-hadoop-state-pusher: No such file or directory
+ /etc/init.d/hadoop-state-pusher-control start
+ PID_FILE=/mnt/var/run/hadoop-state-pusher/hadoop-state-pusher.pid
+ LOG_FILE=/mnt/var/log/hadoop-state-pusher/hadoop-state-pusher.out
+ SVC_FILE=/mnt/var/lib/hadoop-state-pusher/run-hadoop-state-pusher
+ case $1 in
+ start
++ dirname /mnt/var/lib/hadoop-state-pusher/run-hadoop-state-pusher
+ sudo -u hadoop mkdir -p /mnt/var/lib/hadoop-state-pusher
+ echo 1
++ dirname /mnt/var/run/hadoop-state-pusher/hadoop-state-pusher.pid
+ sudo -u hadoop mkdir -p /mnt/var/run/hadoop-state-pusher

HOWEVER, the "stderr" log contains that exact same error message for all our EMR clusters, even the ones that terminate and finish successfully, so I think that's a red herring.

A couple of questions:

  1. have you seen this before?
  2. what's the best course of action if/when this happens?
  3. is there a way to automatically detect when this happens?

ad (2) - for now, I terminated the EMR cluster in the AWS dashboard, and the next time the job ran things appear to have recovered; I'm wondering about potential data loss if this happens, though?

ad (3) - in this instance I only found out "by accident", when I was looking at the EMR panel in our AWS account, but I assume there are ways for the EMR cluster to "alert" when a run takes longer than a certain threshold?

Best regards,

Peter Vandenberk
Principal Developer
Simply Business

emr_cluster_stuck.png

Alex Dean

Jun 24, 2014, 12:37:27 PM
to snowpl...@googlegroups.com
Hey Peter,

TBH we more often see the cluster stuck in PENDING - we have an unscheduled ticket to fix that: https://github.com/snowplow/snowplow/issues/387

In either scenario, if you kill EmrEtlRunner and restart with --skip staging there won't be any risk of data loss: you are killing the EMR job and then restarting it, so it will just process the raw event data in the processing bucket that it couldn't get to on the hung run.
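For anyone scripting that recovery, a minimal sketch of building the resume invocation (the runner path and config location below are assumptions about a typical install, not canonical):

```python
# Sketch: build the EmrEtlRunner resume command after killing a hung run.
# Restarting with --skip staging means the raw files already moved to the
# processing bucket get re-processed rather than re-staged, so no data is lost.
def resume_command(runner="bundle exec bin/snowplow-emr-etl-runner",
                   config="config/config.yml"):
    # Hypothetical paths; adjust for your deployment.
    return f"{runner} --config {config} --skip staging"

print(resume_command())
```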

It's an interesting question how we would detect that a step like setting up debug is taking too long (how long is too long)...
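One crude answer to "how long is too long" is a watchdog that compares elapsed run time against a fixed threshold. A sketch (the two-hour threshold is arbitrary, and in practice the start time would come from the EMR API or the runner's logs rather than being passed in directly):

```python
from datetime import datetime, timedelta, timezone

def is_stuck(started_at, now, threshold=timedelta(hours=2)):
    """Return True once a run has exceeded the allowed duration."""
    return now - started_at > threshold

# A 19-hour run, like the one in this thread, trips the alarm:
now = datetime(2014, 6, 24, 9, 0, tzinfo=timezone.utc)
print(is_stuck(now - timedelta(hours=19), now))    # → True
print(is_stuck(now - timedelta(minutes=30), now))  # → False
```

Wired up to a cron job, the True case would page someone or terminate the cluster.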

A



--
Co-founder
Snowplow Analytics
The Roma Building, 32-38 Scrutton Street, London EC2A 4RQ, United Kingdom
+44 (0)203 589 6116
+44 7881 622 925
@alexcrdean

dani...@simplybusiness.co.uk

Nov 27, 2014, 7:51:30 AM
to snowpl...@googlegroups.com
Hi all,
I've got some more information about this issue. The last time a job got stuck, I went onto the machine and browsed the JobTracker web UI. The problem seems to be that the running task has an unassigned status and is never scheduled to run. I'm attaching a couple of screenshots.

Dani
running_job.png
unassigned_task.png

Alex Dean

Nov 27, 2014, 9:12:28 AM
to snowpl...@googlegroups.com
Hi Dani,

Thanks, yes - it would be cool to get this ticket implemented:

https://github.com/snowplow/snowplow/issues/387

Cheers,

Alex

Yuval Herziger

Dec 19, 2014, 6:31:02 AM
to snowpl...@googlegroups.com
Hi Alex,

I, too, experienced that for the first time today.
The cluster was in the "Running" state for nearly 11 hours. I don't have debugging enabled, so the logs aren't very detailed.

I terminated the cluster, manually moved the contents of the processing bucket back to the in bucket, and it all worked fine.

Controller: 
2014-12-19T00:16:50.619Z INFO Executing /usr/java/latest/bin/java -cp /home/hadoop/conf:/usr/java/latest/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-tools.jar:/home/hadoop/hadoop-tools-1.0.3.jar:/home/hadoop/hadoop-core-1.0.3.jar:/home/hadoop/hadoop-core.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/3 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/3/tmp -Djava.library.path=/home/hadoop/native/Linux-amd64-64 org.apache.hadoop.util.RunJar /home/hadoop/lib/emr-s3distcp-1.0.jar --src hdfs:///local/snowplow/enriched-events/ --dest s3://<my processing bucket>/enriched/good/run=2014-12-19-00-01-06/ --srcPattern .*part-.* --s3Endpoint s3-eu-west-1.amazonaws.com
2014-12-19T01:16:50.733Z INFO Process still running
2014-12-19T02:16:50.908Z INFO Process still running
2014-12-19T03:16:50.918Z INFO Process still running
2014-12-19T04:16:50.934Z INFO Process still running
2014-12-19T05:16:51.346Z INFO Process still running
2014-12-19T06:16:51.711Z INFO Process still running
2014-12-19T07:16:52.171Z INFO Process still running
2014-12-19T08:16:52.544Z INFO Process still running
2014-12-19T09:16:52.696Z INFO Process still running
2014-12-19T10:16:53.112Z INFO Process still running
2014-12-19T10:56:26.766Z WARN Step failed with execution exception java.lang.InterruptedException: sleep interrupted # that's me

Syslog:
2014-12-19 00:16:57,090 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Created 1 files to copy 5 files 
2014-12-19 00:18:00,276 INFO org.apache.hadoop.hdfs.DFSClient (Thread-6): Exception in createBlockOutputStream 172.31.2.26:9200 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.1.150:41746 remote=/172.31.2.26:9200]

Alex Dean

Dec 19, 2014, 8:47:38 AM
to snowpl...@googlegroups.com
Thanks for sharing Yuval!

A

dani...@simplybusiness.co.uk

Jan 27, 2015, 12:33:53 PM
to snowpl...@googlegroups.com
Hi Alex,
At Simply Business we've implemented a fix that:
  • Terminates the EMR cluster after a configurable timeout
  • Retries the EMR job a configurable number of times

The second part handles the failure that occurs when the timeout terminates the cluster, as well as any other kind of transient EMR error.
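The retry half of that fix can be sketched in a few lines (a sketch only, with illustrative names; in the real fix the timeout would also terminate the hung EMR cluster via the EMR API before retrying):

```python
import subprocess

def run_with_retries(cmd, attempts=3, timeout_secs=4 * 3600):
    """Run cmd, retrying on failure or timeout up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            if subprocess.run(cmd, timeout=timeout_secs).returncode == 0:
                return attempt  # number of attempts the job took
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child on timeout; a real version
            # would also terminate the stuck EMR cluster here
            pass
    raise RuntimeError(f"job failed after {attempts} attempts")
```

Transient EMR errors surface as a non-zero exit from the runner, so they fall out of the same loop as the timeout case.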

Would you be interested in it? We can issue a PR.

Best regards,

Dani

Alex Dean

Jan 27, 2015, 12:38:39 PM
to snowpl...@googlegroups.com
Hi Dani,

Sounds great - PR most welcome!

Alex

Ryan Doherty

Jan 27, 2015, 7:54:31 PM
to snowpl...@googlegroups.com
OMG YES
-Ryan

Martin Klosi

Jun 2, 2015, 8:18:18 PM
to snowpl...@googlegroups.com
