Issues with EMR after upgrade to Snowplow 63


dani...@simplybusiness.co.uk

Apr 15, 2015, 11:11:33 AM
to snowpl...@googlegroups.com
Hello all,
Today we upgraded our Snowplow installation from v60 to v63 (Red-Cheeked Cordon-Bleu) and tried to reprocess all our data. We started seeing failed task attempts in the job tracker, and eventually the job (the enrich step) failed. We also noticed that the job was running much more slowly: it usually finishes in 20 minutes, but this time it died after 50 minutes with only about 50% of the mappers completed. We repeated the process 3 times in case it was an Amazon issue, with the same result each time. We then downgraded to v60 and the reprocessing finished normally.

The stack traces of the failed attempts were all the same:

java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 137.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

Searching online for this stack trace suggests that Linux is killing the JVM because it is using too much memory. Have you seen this issue too? What could have caused this increase in memory usage? Any insights would be highly appreciated.
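
As a rough illustration of how that status can be read: by Unix convention, an exit status above 128 means the process was terminated by a signal numbered status minus 128, so 137 corresponds to signal 9 (SIGKILL). The sketch below decodes a status this way. `ExitStatus` and `describe` are hypothetical names, not part of Hadoop or Snowplow, and a SIGKILL on its own does not prove that the kernel's out-of-memory killer sent it.

```scala
// Decode a Unix process exit status using the common convention that
// values above 128 mean "terminated by signal (status - 128)".
// ExitStatus and describe are hypothetical names used for illustration.
object ExitStatus {
  // A few common Linux signal numbers; deliberately not exhaustive.
  private val signalNames: Map[Int, String] =
    Map(1 -> "SIGHUP", 2 -> "SIGINT", 9 -> "SIGKILL", 15 -> "SIGTERM")

  def describe(status: Int): String =
    if (status == 0) "exited normally"
    else if (status > 128) {
      val signal = status - 128
      val name = signalNames.getOrElse(signal, "unknown signal")
      s"killed by signal $signal ($name)"
    } else s"exited with error code $status"
}
```

With this sketch, `ExitStatus.describe(137)` returns `killed by signal 9 (SIGKILL)`.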

Thanks,
Dani

dani...@simplybusiness.co.uk

Apr 15, 2015, 11:19:41 AM
to snowpl...@googlegroups.com
Forgot to mention that the instance type we're using is c3.xlarge.

Alex Dean

Apr 15, 2015, 11:26:41 AM
to snowpl...@googlegroups.com
Hi Dani,

Did you upgrade from Scala Hadoop Enrich 0.12.0 to 0.14.0? I suspect the problem is somehow related to 0.14.0 - we have large customers on 0.13.0 and haven't seen any issues.

There's a lot of new functionality in Scala Hadoop Enrich 0.14.0! Let's try and narrow things down:
  1. Have you enabled any of the new enrichments?
  2. Have you changed anything else (e.g. sending new event types or cross-domain-linking)?

Thanks,

Alex





dani...@simplybusiness.co.uk

Apr 15, 2015, 11:38:57 AM
to snowpl...@googlegroups.com
Hi Alex,
Yes, we upgraded from 0.12.0 to 0.14.0. The only enrichment we have enabled is the new user agent parsing. We'll try to rerun the job (in a few days' time) with this enrichment disabled to see if that makes a difference. Now that you mention it, it might well be that all those new regexes use a lot of memory. Also, I was checking its implementation, and creating a new parser per event (which parses the YAML file and loads all the regexes) might also be adding pressure on the GC?
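
A minimal sketch of the concern about per-event construction, assuming a parser whose constructor is expensive. `ExpensiveParser`, `PerEvent` and `Shared` are hypothetical stand-ins, not the actual Snowplow enrichment code or the user agent parsing library's API. The counter exists only to make the number of constructions visible.

```scala
// Hypothetical stand-in for a parser that is costly to build, for example
// one that reads a YAML file and compiles many regexes in its constructor.
final class ExpensiveParser {
  ExpensiveParser.constructions += 1 // record every construction
  private val pattern = "(?i)(firefox|chrome|safari)".r
  def parse(userAgent: String): Option[String] =
    pattern.findFirstIn(userAgent).map(_.toLowerCase)
}

object ExpensiveParser {
  // Demonstration-only counter; it is not thread-safe.
  var constructions: Int = 0
}

object PerEvent {
  // Anti-pattern: a new parser is built (and later collected) for every event.
  def enrich(userAgent: String): Option[String] =
    new ExpensiveParser().parse(userAgent)
}

object Shared {
  // Built once per JVM on first use, then reused for every event.
  private lazy val parser = new ExpensiveParser
  def enrich(userAgent: String): Option[String] = parser.parse(userAgent)
}
```

Enriching two events through `PerEvent.enrich` constructs two parsers, while sending the same two events through `Shared.enrich` constructs only one. Sharing an instance like this assumes the parser is safe to reuse across events.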

Thanks for the fast answer!
Dani

Alex Dean

Apr 15, 2015, 11:49:55 AM
to snowpl...@googlegroups.com
Ouch - that will be it. New Parser per event is not sane - shame we didn't pick that up in code review. We will get that fixed in the next release!

https://github.com/snowplow/snowplow/issues/1616

A

dani...@simplybusiness.co.uk

Apr 15, 2015, 11:54:18 AM
to snowpl...@googlegroups.com
Thanks Alex, I'll let you know how it goes when we try to reprocess again with this enrichment disabled.

Dani

Alex Dean

Apr 15, 2015, 11:59:17 AM
to snowpl...@googlegroups.com
Cheers Dani,

The next release should be out this week so hopefully you will be able to complete the upgrade shortly...

A

Robert Kingston

Apr 15, 2015, 9:48:10 PM
to snowpl...@googlegroups.com
Hi guys,

We're encountering similar problems on r62 (Scala Hadoop Enrich 0.13.0). After nearly 2 hours the job died at Enrich Raw Events (2x m1.small and 1x m3.xlarge).

Enrichment worked fine until enabling campaign attribution. Stderr output is:

Exception in thread "main" cascading.flow.FlowException: step failed: (1/2) .../snowplow/enriched-events, with job id: job_201504152123_0002, please see cluster logs for failure messages
	at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:221)
	at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
	at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
	at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)


Not sure if this helps in any way. Would be more than happy to share additional details.

Cheers,
Rob

Alex Dean

Apr 16, 2015, 3:27:30 AM
to snowpl...@googlegroups.com
Hi Rob,

Hmm - we are running 0.13.0 with the campaign_attribution enrichment enabled for many customers and haven't seen any issues. So I'm not sure it's the same problem...

Could you start a new thread with some diagnostics?

Cheers,

Alex


Robert Kingston

Apr 16, 2015, 11:51:42 PM
to snowpl...@googlegroups.com
Based on further tests, I believe it is different. Posting details in another thread.

dani...@simplybusiness.co.uk

Apr 20, 2015, 10:17:12 AM
to snowpl...@googlegroups.com
Hi again,
I've tried the reprocessing again with the new user agent parsing disabled and it worked as it used to! Incidentally, I've found another issue in JsonUtils.stripInstanceEtc(): in some cases the message was null and this method was throwing an NPE. Just replacing `message` with `Option(message).getOrElse("")` fixed it. If you're interested, I can try to find out which exception caused the message to be null.
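
A small sketch of why that change avoids the NPE: in Scala, `Option(null)` evaluates to `None`, so `getOrElse("")` substitutes an empty string before any method is called on the value. `NullSafe`, `tidy`, `unsafeStrip` and `safeStrip` are hypothetical names, and `tidy` is only a placeholder for whatever the real method does to the message.

```scala
// Hypothetical illustration of the null-guard; not the real JsonUtils code.
object NullSafe {
  // Placeholder cleanup step standing in for the real method body.
  private def tidy(s: String): String = s.replaceAll("\\s+", " ").trim

  // Unsafe: invoking a method on a null String throws NullPointerException.
  def unsafeStrip(message: String): String = tidy(message)

  // Safe: Option(null) is None, so getOrElse supplies "" instead of null.
  def safeStrip(message: String): String =
    tidy(Option(message).getOrElse(""))
}
```

Here `NullSafe.safeStrip(null)` returns an empty string, whereas `NullSafe.unsafeStrip(null)` throws a `NullPointerException`.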

Regards,
Dani

Alex Dean

Apr 20, 2015, 10:19:17 AM
to snowpl...@googlegroups.com
Hey Dani,

Ah good to hear about the fix. Yep we spotted that same NPE through Rob's thread late last week:

https://groups.google.com/forum/#!topic/snowplow-user/JuNn4BIz_EM

Cheers,

Alex

dani...@simplybusiness.co.uk

Apr 20, 2015, 10:21:55 AM
to snowpl...@googlegroups.com
Oh, I missed that one. Good to know that you're already on it :)

Thanks!
Dani