Hi,
I've been setting up the Snowplow pipeline to collect telemetry data for
this and have run into difficulties with the shredding part of the ETL runner. Any pointers would be really appreciated.
Everything runs smoothly through the Python Tracker, Scala Kinesis Collector, and snowplow-kinesis-s3 sink, right up to the "Elasticity Scalding Step: Shred Enriched Events" step of the EMR job.
At that point every event ends up in 'bad', and each record looks like this:
{"line":"cooltura-0.0.1-001\tmob\t2015-09-16 15:42:44.671\t2015-09-16 15:22:11.229\t1970-01-17 16:40:16.455\tstruct\td29e4daf-66ef-4939-8326-03d21d0bab2d\t\tdefault\tpy-0.7.2\tssc-0.5.0-kinesis\thadoop-1.0.0-common-0.14.0\t28\t84.45.69.x\t\t\t\t65590e33-2e3f-4a6c-8604-9a85747120d1\tGB\t\t\t\t51.5\t-0.13000488\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t3\t2\tlala\t72x35\t1\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tpython-requests/2.2.1 CPython/2.7.6 Linux/3.13.0-53-generic\tUnknown\tUnknown\t\tunknown\tOTHER\ten\t\t\t\t\t\t\t\t\t\t\t\t72\t35\tLinux\tLinux\tOther\tEurope/London\tComputer\t0\t\t\t\t\t\t\t\t\t\t\t\tUSD\tEurope/London\t\t\t\t\t\t\t{\"schema\":\"iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1\",\"data\":[{\"schema\":\"iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0\",\"data\":{\"useragentFamily\":\"Python Requests\",\"useragentMajor\":\"2\",\"useragentMinor\":\"2\",\"useragentPatch\":null,\"useragentVersion\":\"Python Requests 2.2\",\"osFamily\":\"Linux\",\"osMajor\":\"3\",\"osMinor\":\"13\",\"osPatch\":null,\"osPatchMinor\":null,\"osVersion\":\"Linux 3.13\",\"deviceFamily\":\"Other\"}}]}\t\t","errors":[{"level":"error","message":"Could not find schema with key iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1 in any repository, tried:","repositories":["Iglu Client Embedded [embedded]","Tagcloud Analytics [HTTP]","Iglu Central [HTTP]"]}]}
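For anyone triaging similar bad rows, this is a minimal sketch (plain Python, stdlib only) of how I tallied the distinct error messages across a bad-rows file; the sample record below is an abbreviated copy of the row above:

```python
import json
from collections import Counter

def tally_bad_rows(lines):
    """Count distinct error messages across Snowplow bad-row JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for err in record.get("errors", []):
            counts[err.get("message", "")] += 1
    return counts

# One record shaped like the bad row above, with the payload trimmed:
sample = json.dumps({
    "line": "cooltura-0.0.1-001\tmob\t...",
    "errors": [{
        "level": "error",
        "message": "Could not find schema with key "
                   "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1 "
                   "in any repository, tried:",
        "repositories": ["Iglu Client Embedded [embedded]",
                         "Tagcloud Analytics [HTTP]",
                         "Iglu Central [HTTP]"]
    }]
})

for message, n in tally_bad_rows([sample]).items():
    print(n, message)
```

In my case every row carries the same "Could not find schema" message, which is why I'm focusing on the resolver.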
After first skipping the shred and archive steps so that enrichment completes successfully, I invoke the shredding process with
bundle exec bin/snowplow-emr-etl-runner -S s3://tagcloud-analytics-etl-out/good/run=2015-09-16-15-42-44 --config config/config.yml --resolver config/resolver.json --enrichments ../config/enrichments
using the following config.yml:
aws:
  # Credentials can be hardcoded or set in environment variables
  access_key_id: XXXXXXXXXXXXXXXXXXXXX
  secret_access_key: XXXXXXXXXXXXXXXXXXXX
  s3:
    region: eu-west-1
    buckets:
      assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
      jsonpath_assets: # If you have defined your own JSON Schemas, add the s3:// path to your own JSON Path files in your own bucket here
      log: s3://tagcloud-analytics-etl-log
      raw:
        in: # Multiple in buckets are permitted
          - s3://tagcloud-analytics-etl-in # e.g. s3://my-archive-bucket/raw
        processing: s3://tagcloud-analytics-etl-processing
        archive: s3://tagcloud-analytics-etl-archive
      enriched:
        good: s3://tagcloud-analytics-etl-out/good # e.g. s3://my-out-bucket/en
        bad: s3://tagcloud-analytics-etl-out/bad # e.g. s3://my-out-bucket/enriched/bad
        errors: s3://tagcloud-analytics-etl-out/error # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://tagcloud-analytics-etl-out/archive ## Where to archive enriched events to, e.g. s3://my-out-bucket/enriched/archive
      shredded:
        good: s3://tagcloud-analytics-etl-shredded/good # e.g. s3://my-shredded-bucket/en
        bad: s3://tagcloud-analytics-etl-shredded/bad # e.g. s3://my-shredded-bucket/enriched/bad
        errors: s3://tagcloud-analytics-etl-shredded/error # Leave blank unless :continue_on_unexpected_error: set to true below
        archive: s3://tagcloud-analytics-etl-shredded/archive ## Where to archive enriched events to, e.g. s3://my-shredded-bucket/enriched/archive
  emr:
    ami_version: 3.9.0 # Don't change this
    region: eu-west-1 # Always set this
    jobflow_role: EMR_EC2_DefaultRole # Created using $ aws emr create-default-roles
    service_role: EMR_DefaultRole # Created using $ aws emr create-default-roles
    placement: # Set this if not running in VPC. Leave blank otherwise
    ec2_subnet_id: subnet-79f7271c # Set this if running in VPC. Leave blank otherwise
    ec2_key_name: tagcloud-emr
    bootstrap: [] # Set this to specify custom boostrap actions. Leave empty otherwise
    software:
      hbase: "0.94.18" # To launch on cluster, provide version, "0.92.0", keep quotes
      lingual: "1.1" # To launch on cluster, provide version, "1.1", keep quotes
    # Adjust your Hadoop cluster below
    jobflow:
      master_instance_type: m1.medium
      core_instance_count: 2
      core_instance_type: m1.medium
      task_instance_count: 0 # Increase to use spot instances
      task_instance_type: m1.medium
      task_instance_bid: 0.015 # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
    bootstrap_failure_tries: 3 # Number of times to attempt the job in the event of bootstrap failures
collectors:
  format: thrift # Or 'clj-tomcat' for the Clojure Collector, or 'thrift' for Thrift records, or 'tsv/com.amazon.aws.cloudfront/wd_access_log' for Cloudfront access logs
enrich:
  job_name: Snowplow ETL # Give your job a name
  versions:
    hadoop_enrich: 1.0.0 # Version of the Hadoop Enrichment process
    hadoop_shred: 0.4.0 # Version of the Hadoop Shredding process
  continue_on_unexpected_error: true # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
  output_compression: NONE # Compression only supported with Redshift, set to NONE if you have Postgres targets. Allowed formats: NONE, GZIP
storage:
  download:
    folder: # Postgres-only config option. Where to store the downloaded files. Leave blank for Redshift
  targets:
    - name: "tagcloud-snowplow-storage"
      type: redshift
      database: snowplow # Name of database
      port: 5439 # Default Redshift port
      table: atomic.events
      username: storageloader
      password: Kg1CYvW5Rp
      maxerror: 1 # Stop loading on first error, or increase to permit more load errors
      comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
and resolver.json:
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {
        "name": "Iglu Central",
        "priority": 0,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": {
          "http": {
          }
        }
      },
      {
        "name": "Tagcloud Analytics",
        "priority": 5,
        "vendorPrefixes": ["com.snowplowanalytics"],
        "connection": {
          "http": {
          }
        }
      }
    ]
  }
}
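This is a sketch of the structural sanity check I ran on the resolver by hand, not the Iglu client's own validation. The embedded config copies the file above, with the http URIs left out exactly as in this post; note that resolver-config 1-0-0 expects a "uri" field inside each "http" connection:

```python
import json

# Resolver config as posted above (http URIs omitted, as in the post).
resolver_text = """
{
  "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0",
  "data": {
    "cacheSize": 500,
    "repositories": [
      {"name": "Iglu Central", "priority": 0,
       "vendorPrefixes": ["com.snowplowanalytics"],
       "connection": {"http": {}}},
      {"name": "Tagcloud Analytics", "priority": 5,
       "vendorPrefixes": ["com.snowplowanalytics"],
       "connection": {"http": {}}}
    ]
  }
}
"""

def check_resolver(text):
    """Return a list of structural problems found in a resolver config."""
    cfg = json.loads(text)
    problems = []
    if cfg.get("schema") != ("iglu:com.snowplowanalytics.iglu/"
                             "resolver-config/jsonschema/1-0-0"):
        problems.append("unexpected top-level schema key")
    for repo in cfg.get("data", {}).get("repositories", []):
        name = repo.get("name", "<unnamed>")
        for field in ("name", "priority", "vendorPrefixes", "connection"):
            if field not in repo:
                problems.append("%s: missing '%s'" % (name, field))
        http = repo.get("connection", {}).get("http")
        if http is not None and "uri" not in http:
            problems.append("%s: 'http' connection has no 'uri'" % name)
    return problems

for problem in check_resolver(resolver_text):
    print("WARN:", problem)
```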
I can obtain 'com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1' from both the Iglu Central and Tagcloud Analytics repository URLs using wget on the core instance, so the subnets and security groups don't look to be the issue.
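For reference, the wget check just maps the Iglu schema key onto a static repository path; a minimal sketch (Iglu Central's public host shown, the private repository's host left out as above):

```python
def iglu_key_to_url(key, repo_root):
    """Map an Iglu schema key (iglu:vendor/name/format/version) onto the
    HTTP path a static Iglu repository serves it at."""
    prefix = "iglu:"
    if not key.startswith(prefix):
        raise ValueError("not an Iglu schema key: %r" % key)
    return "%s/schemas/%s" % (repo_root.rstrip("/"), key[len(prefix):])

url = iglu_key_to_url(
    "iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1",
    "http://iglucentral.com",
)
print(url)
# http://iglucentral.com/schemas/com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1
```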
Everything in the pipeline is built from commit 64a789273047d33c186cf00e2a0d8aabe52339f3 (feature/r70 merge) and deployed on Ubuntu 14.04 instances on AWS.
Thanks,
Dan