Restarting EmrEtlRunner after error


Peter Kim

Jan 4, 2016, 6:32:22 PM
to Snowplow
Hey all,

I've been running Snowplow for ~2 weeks now. It's been going great, but every so often the EMR job fails at the "Elasticity Scalding Step: Enrich Raw Events" step with errors such as:

Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at com.twitter.scalding.Job$.apply(Job.scala:47)
        at com.twitter.scalding.Tool.getJob(Tool.scala:48)
        at com.twitter.scalding.Tool.run(Tool.scala:68)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at com.snowplowanalytics.snowplow.enrich.hadoop.JobRunner$.main(JobRunner.scala:33)
        at com.snowplowanalytics.snowplow.enrich.hadoop.JobRunner.main(JobRunner.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/snowplowanalytics/snowplow/enrich/common/utils/ConversionUtils$.decodeBase64Url(Ljava/lang/String;Ljava/lang/String;)Lscalaz/Validation; @5: invokevirtual
  Reason:
    Type 'org/apache/commons/codec/binary/Base64' (current frame, stack[0]) is not assignable to 'org/apache/commons/codec/binary/BaseNCodec'
  Current Frame:
    bci: @5
    flags: { }
    locals: { 'com/snowplowanalytics/snowplow/enrich/common/utils/ConversionUtils$', 'java/lang/String', 'java/lang/String' }
    stack: { 'org/apache/commons/codec/binary/Base64', 'java/lang/String' }
  Bytecode:
    0000000: 2ab7 008a 2cb6 0090 3a08 bb00 5459 1908
    0000010: b200 96b7 0099 3a09 b200 9e19 09b9 00a4
    0000020: 0200 b900 aa01 00a7 0064 4e2d 3a04 b200
    0000030: af19 04b6 00b3 3a05 1905 b600 b799 0005
    0000040: 2dbf 1905 b600 bbc0 00bd 3a06 b200 9ebb
    0000050: 00bf 59b2 0041 12c1 b600 c4b7 00c7 b200
    0000060: 4106 bd00 0459 032b 5359 042c 5359 0519
    0000070: 06b6 00ca 53b6 00d0 b900 d602 00b9 00a4
    0000080: 0200 b900 d901 003a 0719 07b0
  Exception Handler Table:
    bci [0, 42] => handler: 42
  Stackmap Table:
    same_locals_1_stack_item_frame(@42,Object[#189])
    append_frame(@66,Object[#189],Object[#189],Object[#82])
    full_frame(@139,{Object[#2],Object[#84],Object[#84]},{Object[#225]})

        at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJobConfig$.com$snowplowanalytics$snowplow$enrich$hadoop$EtlJobConfig$$base64ToJsonNode(EtlJobConfig.scala:224)
        at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJobConfig$.loadConfigAndFilesToCache(EtlJobConfig.scala:126)
        at com.snowplowanalytics.snowplow.enrich.hadoop.EtlJob.<init>(EtlJob.scala:139)
        ... 15 more

or

Exception in thread "main" cascading.flow.FlowException: step failed: (2/4), with job id: job_1451491604590_0007, please see cluster logs for failure messages
        at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:261)
        at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:162)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
        at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I'm not 100% sure what the underlying cause of these errors is. However, when I move the raw files "stuck" in the S3 processing bucket back into the logs bucket, the job runs without issue. Is there a fix for these errors? If not, is there a way for me to automatically catch these failures, move the files back to the appropriate bucket, and restart the EMR process?
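
To make that last part concrete, here's roughly the kind of wrapper I have in mind (just a sketch; the runner invocation, retry count, and bucket names are placeholders):

#!/bin/bash
# Rough retry wrapper for EmrEtlRunner -- a sketch, assuming the runner is
# invoked as below and that the XXXXXX bucket names are placeholders.
RUNNER="./snowplow-emr-etl-runner --config config/config.yml"

for attempt in 1 2 3; do
  if $RUNNER; then
    exit 0  # run succeeded
  fi
  echo "Run ${attempt} failed; moving raw files back and retrying..." >&2
  # Move anything stranded in the processing bucket back to the in bucket
  aws s3 mv "s3://XXXXXX/processing/" "s3://XXXXXX/" --recursive
done

exit 1  # all attempts failed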

Thanks!
Peter Kim


Fred Blundun

Jan 5, 2016, 4:58:03 AM
to snowpl...@googlegroups.com
Hi Peter,

That's interesting. We have encountered that VerifyError before, when upgrading to EMR AMI 3.x. It was caused by the version of the commons-codec jar available on the cluster at runtime not being the same as the version against which the enrichment jar was compiled. We fixed this by adding a bootstrap step (snowplow-ami3-bootstrap.sh) which deleted all copies of the commons-codec jar from the cluster before the enrichment step.
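
In rough shell terms, that workaround amounts to something like the following (a sketch of the idea, not the verbatim contents of snowplow-ami3-bootstrap.sh; the search paths are assumptions):

#!/bin/bash
# Sketch of the AMI 3.x workaround: delete every copy of commons-codec
# from the cluster before the enrichment step runs, so the version bundled
# in the enrichment jar is the only one loaded.
sudo find /home/hadoop /usr/lib -name "commons-codec-*.jar" 2>/dev/null \
  | while read -r jar; do
      sudo rm -f "$jar"
    done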

More recently, we have been working on upgrading to AMI 4.x and saw the error return. This seemed to be because the new AMI adds jars to the cluster, including the problematic jar, after bootstrap steps but before the actual job steps. We fixed this by changing the bootstrap script to download the correct version of the commons-codec jar (1.5) and put it at the beginning of the classpath, so that it takes precedence over other versions of the jar.
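
A sketch of that second approach, with the caveat that the download URL, the paths, and the overwrite-the-copies mechanism (one way to get the same precedence effect) are illustrative assumptions, not the real script:

#!/bin/bash
# Sketch of the AMI 4.x approach: fetch commons-codec 1.5 (the version the
# enrichment jar was compiled against) and make it the version that wins.
wget -O /tmp/commons-codec-1.5.jar \
  "https://repo1.maven.org/maven2/commons-codec/commons-codec/1.5/commons-codec-1.5.jar"
sudo find /usr/lib -name "commons-codec-*.jar" 2>/dev/null \
  | while read -r jar; do
      sudo cp /tmp/commons-codec-1.5.jar "$jar"
    done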

My questions are:

* Which AMI version are you using?
* Which version of the enrich jar are you using?
* In the EMR UI, for jobs which failed with a VerifyError, can you see whether the snowplow-ami3-bootstrap.sh script executed?
* Can you paste in your configuration file, with the AWS credentials anonymized?

Regards,
Fred


Peter Kim

Jan 5, 2016, 12:27:44 PM
to Snowplow
Hey Fred,

Really appreciate you taking the time to respond. To answer your questions:

* Which AMI version are you using?
3.7.0

* Which version of the enrich jar are you using?
1.3.0

* In the EMR UI, for jobs which failed with a VerifyError, can you see whether the snowplow-ami3-bootstrap.sh script executed?
I do see the following shell script run as a bootstrap action:

Elasticity Bootstrap Action
s3://snowplow-hosted-assets/common/emr/snowplow-ami3-bootstrap-0.1.0.sh


* Can you paste in your configuration file, with the AWS credentials anonymized?
aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: XXXXXX
  secret_access_key: XXXXX
  s3:
    region: us-east-1
    buckets:
      assets: s3://XXXXXX # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3n://XXXXXX/logs
      raw:
        in:
          - "s3n://XXXXXX/"
        processing: s3n://XXXXXX/processing
        archive: s3://XXXXXX/raw
      enriched:
        good: s3://XXXXXX/enriched/good
        bad: s3://XXXXXX/enriched/bad
        errors: # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://XXXXXX/enriched/archive
      shredded:
        good: s3://XXXXXX/shredded/good
        bad: s3://XXXXXX/shredded/bad
        errors: s3://XXXXXX/shredded/errors
        archive: s3://XXXXXX/shredded/archive # Not required for Postgres currently
  emr:
    ami_version: 3.7.0 # Don't change this
    region: us-east-1 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement: us-east-1b # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: XXXXXX
    bootstrap: [] # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: # To launch on cluster, provide version, "0.92.0", keep quotes
      lingual: # To launch on cluster, provide version, "1.1", keep quotes
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      task_instance_count: 2 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
collectors:
  format: cloudfront # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for CloudFront access logs
enrich:
  job_name: XXXXXX # Give your job a name
  versions:
    hadoop_enrich: 1.3.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.6.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
    - name: "XXXXXX"
      type: redshift
      host: XXXXXX.redshift.amazonaws.com # The endpoint as shown in the Redshift console
      database: snowplow # Name of database
      port: 5439 # Default Redshift port
      table: atomic.events
      username: XXXXXX
      password: XXXXXX
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
      ssl_mode: disable
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  #snowplow:
  #  method: get
  #  app_id: ADD HERE # e.g. snowplow
  #  collector: ADD HERE # e.g. d3rkrsqld9gmqf.cloudfront.net



Let me know if you need anything else.

Thanks!
Peter Kim

Fred Blundun

Jan 6, 2016, 12:46:38 PM
to snowpl...@googlegroups.com
Thanks Peter,

It looks like you have the right AMI and jar versions configured, and you're using the right bootstrap action. One other idea for understanding the VerifyError is to check in the console that the AMI version which is running is indeed 3.7.0, to make sure that you haven't somehow been involuntarily upgraded to the 4.x series.
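
If you'd rather check from the CLI, something like this should confirm it (the cluster ID is a placeholder, and I'm assuming you have the AWS CLI configured):

# Check which AMI version the cluster actually booted with
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.RunningAmiVersion' --output text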

For the other error, have you tried downloading and looking at the cluster logs? Often they don't contain useful information but they are sometimes helpful.
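
Assuming the logs land in the log bucket from your config, something like this will pull them down for inspection (the bucket name and jobflow ID are placeholders):

# Download the cluster logs for a failed jobflow
aws s3 sync s3://XXXXXX/logs/j-XXXXXXXXXXXXX/ ./emr-logs/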

It will be interesting to get to the bottom of this.

Fred

Peter Kim

Jan 6, 2016, 3:00:13 PM
to snowpl...@googlegroups.com
Checked the console and the AMI version is 3.7.0. I'll try digging around more -- will keep you updated on what I find.

Thanks,
Peter Kim
