Adding bad rows to Elasticsearch not working


Jimit Modi

Mar 2, 2016, 4:20:38 AM
to Snowplow
Hello All,

I was trying to load bad rows into Elasticsearch, following the steps here.

I wanted to run only the Elasticsearch step, so I used the command below:
ubuntu@ip-x-x-x-x:~/emr-tools$ ./snowplow-emr-etl-runner  -c config/config-es.yml -r config/resolver.json -n enrichments --skip staging,s3distcp,enrich,shred,archive_raw
D, [2016-03-02T09:08:57.436000 #7138] DEBUG -- : Initializing EMR jobflow
D, [2016-03-02T09:09:00.622000 #7138] DEBUG -- : EMR jobflow j-1I55K53DYNLFJ started, waiting for jobflow to complete...
D, [2016-03-02T09:09:01.086000 #7138] DEBUG -- : EMR jobflow j-1I55K53DYNLFJ completed successfully.
I, [2016-03-02T09:09:01.087000 #7138]  INFO -- : Completed successfully
ubuntu@ip-x-x-x-x:~/emr-tools$ 

This started an EMR jobflow, as shown in the output above, but the cluster was created with no steps (screenshot attached).

Any ideas?


Thanks
Jimit Modi



Fred Blundun

Mar 2, 2016, 4:43:18 AM
to Snowplow
Hi Jimit,

Could you share your configuration YAML (with AWS credentials removed)?

Thanks,
Fred

Jimit Modi

Mar 2, 2016, 4:54:25 AM
to snowpl...@googlegroups.com
aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: XXXXXXXXXX
  secret_access_key: XXXXXXXXXX
  s3:
    region: us-east-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: s3://snowplow-assets/custom-events-jsonpaths/ # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://snowplow-emr-logs/
      raw:
        in:                  # Multiple in buckets are permitted
          - s3://elasticbeanstalk-us-east-1-238537778401/resources/environments/logs/publish/          # e.g. s3://my-archive-bucket/raw
        processing: s3://snowplow-etl-emr-runner/processing
        archive: s3://snowplow-etl-emr-runner/archive    # e.g. s3://my-archive-bucket/raw
      enriched:
        good: s3://snowplow-etl-emr-runner/enriched/good       # e.g. s3://my-out-bucket/enriched/good
        bad: s3://snowplow-etl-emr-runner/enriched/bad        # e.g. s3://my-out-bucket/enriched/bad
        errors:      # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://snowplow-etl-emr-runner/enriched/archive    # Where to archive enriched events to, e.g. s3://my-out-bucket/enriched/archive
      shredded:
        good: s3://snowplow-etl-emr-runner/shredded/good       # e.g. s3://my-out-bucket/shredded/good
        bad: s3://snowplow-etl-emr-runner/shredded/bad        # e.g. s3://my-out-bucket/shredded/bad
        errors:      # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://snowplow-etl-emr-runner/shredded/archive    # Not required for Postgres currently
  emr:
    ami_version: 4.3.0 # Was 3.7.0      # Don't change this
    region: us-east-1        # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole     # Created using $ aws emr create-default-roles
    placement:    # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id:  # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: xxxx
    bootstrap: []           # Set this to specify custom bootstrap actions. Leave empty otherwise
    software:
      hbase:                # To launch on cluster, provide version, "0.92.0", keep quotes
      lingual:              # To launch on cluster, provide version, "1.1", keep quotes
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
collectors:
  format: clj-tomcat # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.6.0 # Was 1.5.1 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.8.0 # Was 0.7.0 # Version of the Hadoop Shredding process
    hadoop_elasticsearch: 0.1.0 # Version of the Hadoop to Elasticsearch copying process
  continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
    - name: " Redshift database"
      type: redshift
      host: xxxxxxxxx.us-east-1.redshift.amazonaws.com # The endpoint as shown in the Redshift console
      database: database # Name of database
      port: 5439 # Default Redshift port
      ssl_mode: disable # One of disable (default), require, verify-ca or verify-full
      table: atomic.events
      username: snowplow
      password: xxxxxxxxxx
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified

    - name: "ELK Elasticsearch cluster" # Name for the target - used to label the corresponding jobflow step
      type: elasticsearch # Marks the database type as Elasticsearch
      host: "xxxxxxxxxx" # Elasticsearch host
      database: snowplow # The Elasticsearch index
      port: 9200 # Port used to connect to Elasticsearch
      table: bad_rows # The Elasticsearch type
      es_nodes_wan_only: false # Set to true if using Amazon Elasticsearch Service
      username: # Not required for Elasticsearch
      password: # Not required for Elasticsearch
      sources: # Leave blank or specify: ["s3://out/enriched/bad/run=xxx", "s3://out/shred/bad/run=yyy"]
      maxerror:  # Not required for Elasticsearch
      comprows: # Not required for Elasticsearch

monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production


--
Jim(y || it)




Fred Blundun

Mar 2, 2016, 4:59:25 AM
to snowpl...@googlegroups.com
The problem is that you haven't specified any sources for the Elasticsearch target. If you run a job with enrichment or shred steps, then the bad rows generated by those steps will automatically be loaded into Elasticsearch. But since you are only running the Elasticsearch step, the app doesn't know where to load bad rows from. You need something like this:

sources: ["s3://out/enriched/bad/run=xxx", "s3://out/shred/bad/run=yyy"]
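In context, that goes on the Elasticsearch target in your storage section. A rough sketch using the bad rows buckets from the config you posted (the run= IDs are placeholders - use the actual run directories you want to load):

    - name: "ELK Elasticsearch cluster"
      type: elasticsearch
      host: "xxxxxxxxxx"
      database: snowplow
      port: 9200
      table: bad_rows
      es_nodes_wan_only: false
      sources: ["s3://snowplow-etl-emr-runner/enriched/bad/run=xxx", "s3://snowplow-etl-emr-runner/shredded/bad/run=yyy"] # run IDs are placeholders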

Hope that helps,
Fred

Jimit Modi

Mar 2, 2016, 5:04:22 AM
to snowpl...@googlegroups.com
Thanks, Fred, for the quick reply.

So is it OK if I pass it this way: sources: ["s3://out/enriched/bad/", "s3://out/shred/bad/"] (removing the run subdirectory, as I want to process many previous runs)?


Thanks 

--
Jim(y || it)


Fred Blundun

Mar 2, 2016, 5:14:03 AM
to snowpl...@googlegroups.com
I'm afraid we don't currently support configuring higher-level directories in this way - you will need to specify each run by its ID. I have created a ticket to improve this here.
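If you have a lot of runs, one way to enumerate them is to list the run= directories first and paste each one (with its full s3:// path) into the sources array. A sketch, assuming you have the AWS CLI set up and your buckets match the config you posted:

# list the run= directories under each bad rows location
aws s3 ls s3://snowplow-etl-emr-runner/enriched/bad/
aws s3 ls s3://snowplow-etl-emr-runner/shredded/bad/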

Regards,
Fred

Jimit Modi

Mar 2, 2016, 5:21:37 AM
to snowpl...@googlegroups.com
So does that mean that for every run directory I will have to run a separate EMR job?

--
Jim(y || it)


Fred Blundun

Mar 2, 2016, 5:31:01 AM
to snowpl...@googlegroups.com
The elements of the array won't correspond to different EMR jobs - just different steps within a single EMR job.
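For example, a sources array like the sketch below (the run IDs are placeholders) would launch a single EMR cluster with three Elasticsearch load steps, one per element:

sources: ["s3://snowplow-etl-emr-runner/enriched/bad/run=aaa", "s3://snowplow-etl-emr-runner/enriched/bad/run=bbb", "s3://snowplow-etl-emr-runner/shredded/bad/run=ccc"]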

Regards,
Fred

Jimit Modi

Mar 2, 2016, 8:07:04 AM
to snowpl...@googlegroups.com
Hey Fred,

It failed with the error below. Any clue?

ubuntu@ip-:~/emr-tools$ ./snowplow-emr-etl-runner  -c config/config-es.yml -r config/resolver.json -n enrichments --skip staging,s3distcp,enrich,shred,archive_raw
D, [2016-03-02T12:03:07.308000 #7246] DEBUG -- : Initializing EMR jobflow
D, [2016-03-02T12:03:11.145000 #7246] DEBUG -- : EMR jobflow j-1V0UXPYB5YEMY started, waiting for jobflow to complete...
F, [2016-03-02T12:59:16.698000 #7246] FATAL -- : 

Snowplow::EmrEtlRunner::EmrExecutionError (EMR jobflow j-1V0UXPYB5YEMY failed, check Amazon EMR console and Hadoop logs for details (help: https://github.com/snowplow/snowplow/wiki/Troubleshooting-jobs-on-Elastic-MapReduce). Data files not archived.
Snowplow ETL: TERMINATING [STEP_FAILURE] ~ elapsed time n/a [2016-03-02 12:10:52 +0000 - ]
 - 1. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-15-13-14-11/ -> Elasticsearch: ELK Elasticsearch cluster: FAILED ~ 00:45:16 [2016-03-02 12:10:56 +0000 - 2016-03-02 12:56:13 +0000]
 - 2. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-03-01-12-18-42/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 3. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-29-11-48-48/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 4. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-29-06-28-22/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 5. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-26-07-56-53/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 6. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-25-13-15-53/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 7. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-23-12-26-19/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 8. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-23-09-59-22/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 9. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-23-08-22-39/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 10. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-23-07-22-39/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 11. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-22-13-55-08/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 12. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-22-12-40-18/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 13. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-22-07-25-58/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 14. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-02-15-13-14-11/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 15. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-03-02-06-14-09/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 16. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-03-01-12-18-42/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 17. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-29-11-48-48/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 18. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-29-06-28-22/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 19. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-26-07-56-53/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 20. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-25-13-15-53/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 21. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-23-12-26-19/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 22. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-23-09-59-22/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 23. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-23-08-22-39/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 24. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-23-07-22-39/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 25. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-22-13-55-08/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 26. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-22-12-40-18/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 27. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/enriched/bad/run=2016-02-22-07-25-58/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]
 - 28. Elasticity Scalding Step: Errors in s3://snowplow-etl-emr-runner/shredded/bad/run=2016-03-02-06-14-09/ -> Elasticsearch: ELK Elasticsearch cluster: CANCELLED ~ elapsed time n/a [ - ]):
    /home/ubuntu/emr-tools/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:471:in `run'
    /home/ubuntu/emr-tools/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
    /home/ubuntu/emr-tools/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
    /home/ubuntu/emr-tools/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    /home/ubuntu/emr-tools/snowplow-emr-etl-runner!/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:68:in `run'
    /home/ubuntu/emr-tools/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/method_reference.rb:46:in `send_to'
    /home/ubuntu/emr-tools/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts.rb:305:in `call_with'
    /home/ubuntu/emr-tools/snowplow-emr-etl-runner!/gems/contracts-0.7/lib/contracts/decorators.rb:159:in `common_method_added'
    file:/home/ubuntu/emr-tools/snowplow-emr-etl-runner!/emr-etl-runner/bin/snowplow-emr-etl-runner:39:in `(root)'
    org/jruby/RubyKernel.java:1091:in `load'
    file:/home/ubuntu/emr-tools/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    org/jruby/RubyKernel.java:1072:in `require'
    file:/home/ubuntu/emr-tools/snowplow-emr-etl-runner!/META-INF/main.rb:1:in `(root)'
    /tmp/jruby6137640058805975763extract/jruby-stdlib-1.7.20.1.jar!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'



--
Jim(y || it)


Fred Blundun

Mar 2, 2016, 9:03:09 AM
to snowpl...@googlegroups.com
Hello Jimit,

The EMR Console may have more information about exactly what went wrong - have a look at this wiki page for more information on how to debug failing jobs.

Regards,
Fred

Jimit Modi

Mar 2, 2016, 9:11:06 AM
to snowpl...@googlegroups.com
OK, will check and let you know.

Also, in case it gives you a head start, below is the stderr of the failed step in EMR:

Exception in thread "main" cascading.flow.FlowException: step failed: (1/1) snowplow/bad_rows, with job id: job_1456920518439_0001, please see cluster logs for failure messages
	at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:221)
	at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
	at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
	at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

--
Jim(y || it)


Jimit Modi

Mar 4, 2016, 7:08:53 AM
to Snowplow
Hey Fred,

Are you able to point me somewhere to fix this error?

Ihor Tomilenko

Mar 4, 2016, 11:03:58 AM
to Snowplow
Hi Jimit,

You might need to dig deeper to get a more descriptive error.

If you go to the EMR service on AWS, you should be able to see the cluster list with the name you gave it in the configuration file, accompanied by the job ID. Click on the one that failed and expand the Steps section. It should list all the attempted steps and their statuses, and from there you should be able to access the corresponding logs too. Mind you, it might not be sufficient to examine only the stderr - you might need to check the other logs too (syslog, controller).
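If you prefer the command line, you can pull the same information with the AWS CLI - a sketch, assuming the CLI is configured and j-XXXXXXXXXXXXX is the ID of the failed cluster:

# list all steps of the cluster together with their states
aws emr list-steps --cluster-id j-XXXXXXXXXXXXX

# show the cluster details, including the S3 LogUri where the full logs are written
aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX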

Regards,
Ihor

Jimit Modi

Mar 7, 2016, 1:25:38 AM
to snowpl...@googlegroups.com
Hey Ihor,

Here I am attaching the other logs.








--
Jim(y || it)



syslog.txt
controllerlog.txt
stderr.txt

Fred Blundun

Mar 8, 2016, 8:16:38 AM
to Snowplow
Hi Jimit,

Those files don't look very informative, but they aren't the full logs. The EMR console page for the cluster should tell you the bucket in which the full logs for the job are stored. I suggest downloading them and grepping for errors. (More information on the location of the full logs is available here.)
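Something along these lines usually does it - a sketch, assuming the log bucket from your config and the cluster ID shown in the EMR console (the exact prefix under the log bucket is shown on the cluster details page):

# pull down everything EMR wrote for this jobflow, then search it for errors
aws s3 sync s3://snowplow-emr-logs/j-XXXXXXXXXXXXX/ ./emr-logs/
gunzip -r ./emr-logs/    # EMR logs on S3 are usually gzipped
grep -riE "exception|error" ./emr-logs/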

Regards,
Fred