config.yml snowplow-emr-etl-runner


vlad...@devopsatwork.com

Apr 4, 2016, 11:00:46 AM
to Snowplow
Guys,

I've read most of the discussions about emr-etl-runner but am still not able to prepare a proper config.yml, so I would really appreciate your help in solving this issue.

This is the message I received after trying to run the application:

   Value guarded in: Snowplow::EmrEtlRunner::Cli::load_config
    With Contract: Maybe, String => Hash
    At: /root/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/cli.rb:134 ):
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:69:in `Contract'
    org/jruby/RubyProc.java:271:in `call'
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:147:in `failure_callback'
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:164:in `common_method_added'
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    file:/root/snowplow-emr-etl-runner!/emr-etl-runner/bin/snowplow-emr-etl-runner:37:in `(root)'
    org/jruby/RubyKernel.java:1091:in `load'
    file:/root/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    org/jruby/RubyKernel.java:1072:in `require'
    file:/root/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    /tmp/jruby4467930534781662455extract/jruby-stdlib-1.7.20.1.jar!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'



Below is my config.yml:

aws:

  access_key_id:
  secret_access_key:
  s3:
    region: eu-west-1
    buckets:
      assets: s3://vd-snowplow-etl-assets/
      log: s3://vd-snowplow-etl/logs/
      raw:
        in:
            - s3://vd-snowplow-etl-logfiles/
        processing: s3://vd-snowplow-etl/processing/
        archive: s3://vd-snowplow-etl-archive/
      enriched:
        good: s3://vd-snowplow-etl/enriched/good/
        bad: s3://vd-snowplow-etl/enriched/bad/
        errors: s3://vd-snowplow-etl/enriched/errors/
      shredded:
        good: s3://vd-snowplow-etl/shredded/good/
        bad: s3://vd-snowplow-etl/shredded/bad/
        errors: s3://vd-snowplow-etl/shredded/errors/
  emr:
    ami_version: 4.3.0      # Choose as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-ami.html
    region: eu-west-1       # Always set this
    placement:      # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-7083921b  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: vd-com-aws-test-key
    # Adjust your Hadoop cluster below
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: "0.92.0"               # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: "1.1"
    jobflow:
      master_instance_type: m1.small
      core_instance_count: 2
      core_instance_type: m1.small
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.small
      task_instance_bid: 0.015
  etl:
    job_name: Snowplow ETL # Give your job a name
    hadoop_etl_version: 0.5.0 # Version of the Hadoop Enrichment process
    collector_format: clj-tomcat # 'clj-tomcat' for the Clojure Collector
    continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL




Ihor Tomilenko

Apr 4, 2016, 11:51:14 AM
to Snowplow
Hi Vladimir,

Sorry to hear about your difficulties.

Your config file appears too short, and it looks like you might be using a very old version that is not compatible with the current EmrEtlRunner. Please review the current sample configuration file: https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/config/config.yml.sample. For example, the etl: section was dropped in release r70.
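
In current releases the old etl: settings are split across top-level collectors: and enrich: sections, roughly like this (a minimal sketch based on the sample file, not a complete config):

collectors:
  format: cloudfront            # or 'clj-tomcat' for the Clojure Collector
enrich:
  job_name: Snowplow ETL
  versions:
    hadoop_enrich: 1.6.0        # replaces the old hadoop_etl_version
  continue_on_unexpected_error: false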

What Snowplow version are you trying to run/install?

Regards,
Ihor

vlad...@devopsatwork.com

Apr 4, 2016, 12:05:52 PM
to Snowplow
Hi Ihor,

Thanks for your fast reply.

Actually, I tried with this template as well. Here is the config I was using; I got the same error.

I would really appreciate some clues on how to run ./snowplow-emr-etl-runner.


Thanks in advance.

aws:
  # Credentials can be hardcoded or set in environment variables

  access_key_id:
  secret_access_key:
  s3:
    region: eu-west-1
    buckets:
      assets: s3://vd-snowplow-etl-assets
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://vd-snowplow-etl/logs/
      raw:
        in:                  # Multiple in buckets are permitted

          - s3://vd-snowplow-etl-logfiles/
        processing: s3://vd-snowplow-etl/processing/
        archive: s3://vd-snowplow-etl-archive/    # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://vd-snowplow-etl/enriched/good/       # e.g. s3://my-out-bucket/enriched/good
        bad: s3://vd-snowplow-etl/enriched/bad/        # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://vd-snowplow-etl/enriched/errors/    # Leave blank unless :continue_on_unexpected_error: set to true below
      shredded:
        good: s3://vd-snowplow-etl/shredded/good      # e.g. s3://my-out-bucket/shredded/good
        bad: s3://vd-snowplow-etl/shredded/bad       # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://vd-snowplow-etl/shredded/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://vd-snowplow-etl/shredded/archive    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 4.3.0

    region: eu-west-1        # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: eu-west-1a     # Set this if not running in VPC. Leave blank otherwise

    ec2_subnet_id: subnet-7083921b # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: vd-com-aws-test-key
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: "0.92.0"               # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: "1.1"             # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.

    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium

      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: clovdfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.clovdfront/wd_access_log' for Clovdfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:

  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.6.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.8.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: snowplow # e.g. snowplow
    collector: xxxxxxxx.clovdfront.net # e.g. d3rkrsqld9gmqf.clovdfront.net

vlad...@devopsatwork.com

Apr 4, 2016, 12:10:56 PM
to Snowplow
Sorry, there were typos in the previous example:



aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id:
  secret_access_key:
  s3:
    region: eu-west-1
    buckets:
      assets: s3://vd-snowplow-etl-assets
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://vd-snowplow-etl/logs/
      raw:

        in:                  # Multiple in buckets are permitted
          - s3://vd-snowplow-etl-logfiles/
        processing: s3://vd-snowplow-etl/processing/
        archive: s3://vd-snowplow-etl-archive/    # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://vd-snowplow-etl/enriched/good/       # e.g. s3://my-out-bucket/enriched/good
        bad: s3://vd-snowplow-etl/enriched/bad/        # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://vd-snowplow-etl/enriched/errors/    # Leave blank unless :continue_on_unexpected_error: set to true below
      shredded:
        good: s3://vd-snowplow-etl/shredded/good      # e.g. s3://my-out-bucket/shredded/good
        bad: s3://vd-snowplow-etl/shredded/bad       # e.g. s3://my-out-bucket/shredded/bad
        errors: s3://vd-snowplow-etl/shredded/errors     # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://vd-snowplow-etl/shredded/archive    # Where to archive shredded events to, e.g. s3://my-archive-bucket/shredded
  emr:
    ami_version: 4.3.0
    region: eu-west-1        # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement: eu-west-1a     # Set this if not running in VPC. Leave blank otherwise

    ec2_subnet_id: subnet-7083921b # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: vd-com-aws-test-key
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase: "0.92.0"               # Optional. To launch on cluster, provide version, "0.92.0", keep quotes. Leave empty otherwise.
      lingual: "1.1"             # Optional. To launch on cluster, provide version, "1.1", keep quotes. Leave empty otherwise.

    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
    additional_info:        # Optional JSON string for selecting additional features
collectors:
  format: cloudfront # For example: 'clj-tomcat' for the Clojure Collector, 'thrift' for Thrift records, 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs or 'ndjson/urbanairship.connect/v1' for UrbanAirship Connect events
enrich:

  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.6.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.8.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
  snowplow:
    method: get
    app_id: snowplow # e.g. snowplow

Ihor Tomilenko

Apr 4, 2016, 12:20:24 PM
to Snowplow
A few tips for a start:
  1. I assume you do provide the access keys in your config file (well, you have to).
  2. Your enriched:archive bucket is missing.
  3. You cannot use placement and ec2_subnet_id at the same time. Only one of them should be specified (see the sketch below).
  4. software section: can you remove the values for both hbase and lingual (unless you intend to use them)?
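
For point 3, that part of the emr section should look like one of these two sketches (values taken from your config), never both filled in:

    # EC2 classic (not running in a VPC):
    placement: eu-west-1a
    ec2_subnet_id:

    # ...or running in a VPC:
    placement:
    ec2_subnet_id: subnet-7083921b
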
What does your gray-coloured section represent?

--Ihor

vlad...@devopsatwork.com

Apr 4, 2016, 1:52:01 PM
to Snowplow
Hi Ihor,

Thanks for your suggestions.

I followed your points:

1. I did provide the access keys; I just removed them for obvious reasons.
2. I added the missing archive bucket in the enriched section.
3. Removed the ec2_subnet_id field.
4. Removed the hbase and lingual values.

The grey-coloured section is, I assume, an artifact of the copy-paste process; please ignore it.

Unfortunately, after these corrections I still get the same error.

I'm executing the following command:
./snowplow-emr-etl-runner -d --config config/config.yml

I'm using version r77.


Thanks in advance,
Regards,
Vladimir






Ihor Tomilenko

Apr 4, 2016, 2:24:01 PM
to Snowplow
Vladimir,

You need to run EmrEtlRunner with the --resolver option. The option should point to the resolver.json file (or a name of your choice) containing at least this. That is, the command line will look like the following:

./snowplow-emr-etl-runner -d --config config/config.yml --resolver resolver.json
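
For reference, a minimal resolver file along those lines, with Iglu Central as the only repository, looks like this (adjust cacheSize and the repository list as needed):

{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": [ "com.snowplowanalytics" ],
        "connection": {
          "http": {
            "uri": "http://iglucentral.com"
          }
        }
      }
    ]
  }
}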

In case you are curious, the following function is where loading the config file fails: https://github.com/snowplow/snowplow/blob/master/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/cli.rb#L134

If you still have the problem, please provide the whole error message and your updated config.yml file.

--Ihor

Vladimir Dimov

Apr 7, 2016, 9:18:37 AM
to Snowplow
Hi Ihor, 


I'm still unable to get the app running. 

Here is the whole error message:

    Value guarded in: Snowplow::EmrEtlRunner::Runner::initialize
    With Contract: Hash, Hash, ArrayOf, String => Snowplow::EmrEtlRunner::Runner
    At: /root/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:32 ):
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:69:in `Contract'
    org/jruby/RubyProc.java:271:in `call'
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:147:in `failure_callback'
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:164:in `common_method_added'
    /root/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    file:/root/snowplow-emr-etl-runner!/emr-etl-runner/bin/snowplow-emr-etl-runner:38:in `(root)'
    org/jruby/RubyKernel.java:1091:in `load'
    file:/root/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    org/jruby/RubyKernel.java:1072:in `require'
    file:/root/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    /tmp/jruby595330812059347835extract/jruby-stdlib-1.7.20.1.jar!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'




Do you think there is something missing, like a dependency? I've installed the JRE and the binary. Do I need anything else?



Regards, 
Vladimir

Ihor Tomilenko

Apr 7, 2016, 1:14:14 PM
to Snowplow
Vladimir,

This is a slightly different error. It is indicative of wrong parameters being passed to the runner.


What is your command line now? What are you passing to EmrEtlRunner?
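
For comparison, the full command from my previous message (plus the -d debug flag you were already using) would be:

./snowplow-emr-etl-runner -d --config config/config.yml --resolver resolver.json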

--Ihor

Vladimir Dimov

Apr 7, 2016, 1:25:33 PM
to Snowplow
I've wiped the whole instance and started from the beginning.

I've opened this issue on Discourse: http://discourse.snowplowanalytics.com/

I'd really appreciate it if I could get some support.

Do you have a working configuration that I can use for testing?

Thanks


Regards, 
Vladimir
