problem running Enrich EmrEtlRunner


Luis Henrique Gonçalves

Jul 16, 2014, 7:23:54 PM
to snowpl...@googlegroups.com
Hi guys,


I am pretty new to Snowplow and I am trying to run the Enrich process, but I get this error:

ContractError (Contract violation for argument 1 of 3:
    Expected: String,
    Actual: nil
    Value guarded in: Snowplow::EmrEtlRunner::EmrJob::partition_by_run
    With Contract: String, String, Bool => Maybe
    At: /home/ec2-user/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:321 ):
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:132:in `failure_callback'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:214:in `block in call_with'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:209:in `times'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:209:in `call_with'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/decorators.rb:157:in `partition_by_run'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:159:in `initialize'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call_with'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/decorators.rb:157:in `initialize'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:57:in `new'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:57:in `run'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call_with'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/decorators.rb:157:in `run'
    bin/snowplow-emr-etl-runner:39:in `<main>'

Does anyone have an idea how I can fix it?


Regards.

Alex Dean

Jul 16, 2014, 7:41:27 PM
to snowpl...@googlegroups.com
Hey Luis,

That's actually a bug in 0.9.5 we just spotted. It's fixed in 0.9.6 which we're testing at the moment, but in the meantime the workaround is simple:

Just set dummy values for both of the :errors: buckets in your EmrEtlRunner's config.yml. It doesn't matter what the values are, as long as they're not null.
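
For example, something along these lines would do (the bucket names here are just placeholders - any non-null string works):

:s3:
  :buckets:
    :enriched:
      :errors: s3n://my-dummy-bucket/enriched/errors
    :shredded:
      :errors: s3n://my-dummy-bucket/shredded/errors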

Thanks,

Alex





--
Co-founder
Snowplow Analytics
The Roma Building, 32-38 Scrutton Street, London EC2A 4RQ, United Kingdom
+44 (0)203 589 6116
+44 7881 622 925
@alexcrdean

Luis Henrique Gonçalves

Jul 16, 2014, 9:13:54 PM
to snowpl...@googlegroups.com
Hi Alex,


Thanks for that mate, I think it's working now, but I'm still getting an error about the zone.
Everything is created in us-west-2.

D, [2014-07-17T01:12:01.871869 #340] DEBUG -- : Initializing EMR jobflow
F, [2014-07-17T01:12:01.956226 #340] FATAL -- :

ArgumentError (Specified Availability Zone is not supported):
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/elasticity-3.0.4/lib/elasticity/aws_request.rb:29:in `rescue in submit'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/elasticity-3.0.4/lib/elasticity/aws_request.rb:26:in `submit'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/elasticity-3.0.4/lib/elasticity/emr.rb:191:in `run_job_flow'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/elasticity-3.0.4/lib/elasticity/job_flow.rb:131:in `run'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/emr_job.rb:231:in `run'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call_with'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/decorators.rb:157:in `run'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/lib/snowplow-emr-etl-runner/runner.rb:58:in `run'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/contracts.rb:230:in `call_with'
    /home/ec2-user/snowplow/3-enrich/emr-etl-runner/vendor/bundle/ruby/1.9.1/gems/contracts-0.4/lib/decorators.rb:157:in `run'
    bin/snowplow-emr-etl-runner:39:in `<main>'

Regards

Alex Dean

Jul 17, 2014, 2:33:31 AM
to snowpl...@googlegroups.com
Can you paste in the relevant lines from your config.yml - the middle part that deals with regions, placement, EC2 etc.?

A



Luis Henrique Gonçalves

Jul 17, 2014, 9:20:57 AM
to snowpl...@googlegroups.com
Hi Alex,


I managed to fix the error - the problem was that I hadn't specified the placement, which in my case should be us-west-2a. Now it's working.
But unfortunately I am getting an error on Step: Shredded HDFS -> S3 / Status: Failed.
I am sending the error log we get, in case you can give us some advice.

Controller LOG: 
2014-07-17T02:31:34.524Z INFO Fetching jar file.
2014-07-17T02:31:40.586Z INFO Working dir /mnt/var/lib/hadoop/steps/4
2014-07-17T02:31:40.586Z INFO Executing /usr/java/latest/bin/java -cp /home/hadoop/conf:/usr/java/latest/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-tools.jar:/home/hadoop/hadoop-tools-1.0.3.jar:/home/hadoop/hadoop-core-1.0.3.jar:/home/hadoop/hadoop-core.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/4 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/4/tmp -Djava.library.path=/home/hadoop/native/Linux-amd64-64 org.apache.hadoop.util.RunJar /home/hadoop/lib/emr-s3distcp-1.0.jar --src hdfs:///local/snowplow/shredded-events/ --dest s3n://spenriched/shredded/good/run=2014-07-17-02-11-23/ --srcPattern .*part-.* --s3Endpoint s3-us-west-2.amazonaws.com
2014-07-17T02:31:46.665Z INFO Execution ended with ret val 1
2014-07-17T02:31:46.666Z WARN Step failed with bad retval
2014-07-17T02:31:52.736Z INFO Step created jobs: 

SysLog: 
2014-07-17 02:31:42,198 INFO com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Running with args: [Ljava.lang.String;@471719b6
2014-07-17 02:31:45,975 FATAL com.amazon.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system
java.io.FileNotFoundException: File does not exist: hdfs:/local/snowplow/shredded-events
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:564)
	at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:549)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
	at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:13)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:187)


Regards.

Alex Dean

Jul 17, 2014, 9:25:18 AM
to snowpl...@googlegroups.com
Hi Luis,

Ah - yes, due to an unfortunate attribute of S3DistCp, it will fail if no files were output for shredding. This will happen if you have no custom contexts + no unstructured events + don't enable link click tracking.

For future runs: run with --skip shred on EmrEtlRunner.

To finalize this run: just re-run EmrEtlRunner with --skip staging,emr and you should be good to go.
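
For example, with the usual invocation (adjust the config path to match your setup):

bundle exec bin/snowplow-emr-etl-runner --config=config/config.yml --skip staging,emr

and for subsequent runs:

bundle exec bin/snowplow-emr-etl-runner --config=config/config.yml --skip shred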

Cheers,

Alex

Luis Henrique Gonçalves

Jul 17, 2014, 10:00:29 AM
to snowpl...@googlegroups.com
Hi Alex,


Thank you so much.

It didn't run OK the first time, because the config.yml file had no mention of the raw archive bucket, so I added the following line to it:
:archive: s3n://myarchivebucket/archive

I don't know if that is right, but it might be something to fix in the next release.


Cheers

Alex Dean

Jul 17, 2014, 10:34:59 AM
to snowpl...@googlegroups.com
Hi Luis,

Where did you add an :archive: line?

Cheers,

Alex

Luis Henrique Gonçalves

Jul 17, 2014, 10:44:11 AM
to snowpl...@googlegroups.com
Hi Alex,


Here is my config.yml file:

:logging:
  :level: DEBUG # You can optionally switch to INFO for production
:aws:
  :access_key_id: xxxxx
  :secret_access_key: xxxxx
:s3:
  :region: us-west-2
  :buckets:
    :assets: s3://snowplow-hosted-assets # DO NOT CHANGE unless you are hosting the jarfiles etc yourself in your own bucket
    :log: s3n://elasticbeanstalk-us-west-2-bucket/etl-log
    :raw:
      :in: s3n://elasticbeanstalk-us-west-2-bucket/resources/environments/logs/publish/e-xxxxxxxxx/i-xc2xxxxx
      :processing: s3n://myprocessingbucket/processing
      :archive: s3n://myarchivebucket/archive
    :enriched:
      :good: s3n://myenrichedbucket/enriched/good
      :bad: s3n://myenrichedbucket/enriched/bad
      :errors: dummy # Leave blank unless :continue_on_unexpected_error: set to true below
    :shredded:
      :good: s3n://myshreddedbucket/shredded/good
      :bad: s3n://myshreddedbucket/shredded/bad
      :errors: dummy # Leave blank unless :continue_on_unexpected_error: set to true below
:emr:
  :region: us-west-2 # Always set this
  :placement: us-west-2a # Set this if not running in VPC. Leave blank otherwise
  :ec2_subnet_id: subnet-xxxxxbf3 # Set this if running in VPC. Leave blank otherwise
  :ec2_key_name: xxxxxx_us-west-2
  :software:
    :hbase:                # To launch on cluster, provide version, "0.92.0", keep quotes
    :lingual:              # To launch on cluster, provide version, "1.1", keep quotes
  # Adjust your Hadoop cluster below
  :jobflow:
    :master_instance_type: m1.small
    :core_instance_count: 1
    :core_instance_type: m1.small
    :task_instance_count: 0 # Increase to use spot instances
    :task_instance_type: m1.small
    :task_instance_bid: # In USD. Adjust bid, or leave blank for non-spot-priced (i.e. on-demand) task instances
:etl:
  :job_name: Snowplow ETL # Give your job a name
  :versions:
    :hadoop_enrich: 0.5.0 # Version of the Hadoop Enrichment process
    :hadoop_shred: 0.1.0 # Version of the Hadoop Shredding process
  :collector_format: clj-tomcat # Or 'clj-tomcat' for the Clojure Collector
  :continue_on_unexpected_error: false # Set to 'true' (and set :out_errors: above) if you don't want any exceptions thrown from ETL
:iglu:
  :schema: iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0
  :data:
    :cache_size: 500
    :repositories:
      - :name: "Iglu Central"
        :priority: 0
        :vendor_prefixes:
          - com.snowplowanalytics
        :connection:
          :http:
            :uri: http://iglucentral.com
:enrichments:
  :anon_ip:
    :enabled: false
    :anon_octets: 1 # Or 2, 3 or 4. 0 is same as enabled: false

Alex Dean

Jul 17, 2014, 10:46:55 AM
to snowpl...@googlegroups.com

Luis Henrique Gonçalves

Jul 17, 2014, 10:51:45 AM
to snowpl...@googlegroups.com
Hi Alex,


Sorry, we used the sample file that was already in the package, not the one on the page.

Congrats on the great job with Snowplow - we are impressed.


Cheers.

Eugene Klimov

Aug 10, 2014, 1:54:23 PM
to snowpl...@googlegroups.com
Hello Alex,

> Ah - yes, due to an unfortunate attribute of S3DistCp, it will fail if no files were output for shredding. This will happen if you have no custom contexts + no unstructured events + don't enable link click tracking.

I have the same problem after deploying Snowplow in my AWS account.

When I run

bundle exec bin/snowplow-emr-etl-runner --config=config/config.yml --debug --enrichments=config/enrichments

the job fails on the step
Elasticity S3DistCp Step: Shredded HDFS -> S3

and in stderr I have the following error:

Exception in thread "main" java.lang.RuntimeException: Failed to get source file system
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:567)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:549)
...
Caused by: java.io.FileNotFoundException: File does not exist: hdfs:/local/snowplow/shredded-events
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:564)
... 9 more


But when I call trackPageView in my JavaScript code, I include a custom context.

Here is my JavaScript code {{ it's a Twig template }}; I took the syntax from https://github.com/snowplow/snowplow/wiki/2-Specific-event-tracking-with-the-Javascript-tracker#custom-contexts

window.snowplow('newTracker', 'cf', '//dh5l21u5robsy.cloudfront.net', {
    appId: 'xxx',
    platform: 'web',
    respectDoNotTrack: false,
    cookieDomain: 'xxx'
});
window.snowplow('setUserId', XXX );
window.snowplow('trackPageView',null, [
   {
       schema: "iglu:com.mycompany.contexts/global/jsonschema/1-0-0",
       data: {
           user_lang_local: '{{ native_language }}',
           user_lang_target: "en",
           user_premium_type: '{{ premium_class }}',
           user_langlevel: '{{ user_language_level }}',
           user_xp_level: '{{ user_xp_level }}',
           abtest: '{{ user_ab_test_matches }}'
       }
   }
]);

In the browser I see the cx field in the query string - please see the following screenshot.

In my S3 bucket for the CloudFront logs I also see the cx field.

I configured the iglu section of my config/config.yml as:

:iglu:
  :schema: iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-0
  :data:
    :cache_size: 1024
    :repositories:
      - :name: "Iglu Central"
        :priority: 0
        :vendor_prefixes:
          - com.snowplowanalytics
        :connection:
          :http:
            :uri: http://iglucentral.com
      - :name: "MyCompany JSON events"
        :priority: 1
        :vendor_prefixes:
          - com.mycompany
        :connection:
          :http:


and configured the S3 buckets in config.yml as:

:s3:
  :region: eu-west-1
  :buckets:
...
    :shredded:
      :good: s3://mycompany-analytics-out/snowplow-shredded/good
      :bad: s3://mycompany-analytics-out/snowplow-shredded/bad
      :errors: s3://mycompany-analytics-out/snowplow-shredded/errors
:etl:
  :continue_on_unexpected_error: false

full config.yml here


I created all the directories (i.e. snowplow-shredded) in my S3 buckets
and uploaded the following JSON schema to the file: http://pastebin.com/b4zW7pXH

How can I debug where my custom contexts are getting lost?

Alex Dean

Aug 10, 2014, 2:38:45 PM
to snowpl...@googlegroups.com
Hi Eugene,

Thanks for the very detailed bug report - it should make it much easier to troubleshoot what is going wrong.

From everything you've shared, it looks like you've set everything up completely correctly, meaning that you should have custom contexts flowing through to their own table in Redshift. But your Hadoop job is failing at the point of shredding, which means that it's not successfully extracting the context JSONs which are embedded in your enriched events.

Can you look in this bucket:

s3://mycompany-analytics-out/snowplow-shredded/bad

And let me know what kinds of files/error messages it contains?
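
If it's easier from the command line, any S3 client will do for pulling those files down - e.g. with s3cmd, something like:

s3cmd ls --recursive s3://mycompany-analytics-out/snowplow-shredded/bad/
s3cmd get --recursive s3://mycompany-analytics-out/snowplow-shredded/bad/ ./shredded-bad/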

A






Eugene Klimov

Aug 10, 2014, 3:31:09 PM
to snowpl...@googlegroups.com

Hello Alex,


> Can you look in this bucket:
> s3://mycompany-analytics-out/snowplow-shredded/bad
> And let me know what kinds of files/error messages it contains?
Thanks Alex - yep, currently I see some errors:
http://take.ms/daaXK

     "message" : "Could not find schema with key iglu:com.lingualeo.contexts/global/jsonschema/1-0-0 in any repository, tried:",
        "repositories" : [ "Iglu Central [HTTP]",
            "Iglu Client Embedded [embedded]",
            "mycompany JSON events [HTTP]"

Hm... but I checked the S3 ACLs before running snowplow-emr-etl-runner.

I tried re-uploading the 1-0-0 file with
s3cmd --acl-public sync

tried
wget http://s3-eu-west-1.amazonaws.com/lingualeo-analytics/lingualeo-json-schemas/com.lingualeo.contexts/global/jsonschema/1-0-0

and ran
bundle exec bin/snowplow-emr-etl-runner --config=config/config.yml --debug --enrichments=config/enrichments --skip=staging

and it failed again with the same error...

Alex Dean

Aug 10, 2014, 3:39:08 PM
to snowpl...@googlegroups.com
Hi Eugene,

Ah - you are missing a /schemas/ from your folder structure. Your config.yml is perfect - you just need to create an additional folder in S3 such that:

http://s3-eu-west-1.amazonaws.com/lingualeo-analytics/lingualeo-json-schemas/schemas/com.lingualeo.contexts/global/jsonschema/1-0-0
                                                                             ^^^^^^^

That should fix it.
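
For reference, a static Iglu repository on S3 follows this layout (your own bucket/prefix, then the standard structure):

<repository root>/schemas/<vendor>/<schema name>/jsonschema/<version>

e.g.

.../lingualeo-json-schemas/schemas/com.lingualeo.contexts/global/jsonschema/1-0-0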

A


Eugene Klimov

Aug 11, 2014, 10:12:29 AM
to snowpl...@googlegroups.com
Hello Alex,

> Ah - you are missing a /schemas/ from your folder structure. Your config.yml is perfect - you just need to create an additional folder in S3 such that:
> That should fix it.
Thanks for your helpful reply.
After I added the /schemas/ folder and fixed some bugs in my 1-0-0 file,
EmrEtlRunner works fine...

But when I run
bundle exec bin/snowplow-storage-loader --config=config/redshift.yml
it fails with the following error:

Loading Snowplow events and shredded types into MyCompany Redshift
database (Redshift cluster)...

Cannot find JSON Paths file to load
s3://mycompany-analytics-out/snowplow-shredded/good/run=2014-08-11-13-09-18/com.lingualeo.contexts/global/jsonschema/1-
into atomic.com_lingualeo_contexts_global_1

I uploaded my global_1.json to
s3://mycompany-analytics/lingualeo-json-schemas/jsonpaths/com.lingualeo.contexts/

My redshift.yml is here:
http://pastebin.com/6gLVBbNq

My global_1.json:

{
  "jsonpaths": [

    "$.schema.vendor",
    "$.schema.name",
    "$.schema.format",
    "$.schema.version",

    "$.hierarchy.rootId",
    "$.hierarchy.rootTstamp",
    "$.hierarchy.refRoot",
    "$.hierarchy.refTree",
    "$.hierarchy.refParent",

    "$.data.user_lang_local",
    "$.data.user_lang_target",
    "$.data.user_premium_type",
    "$.data.user_langlevel",
    "$.data.user_xp_level",
    "$.data.abtest"

  ]
}

Eugene Klimov

Aug 11, 2014, 11:08:00 PM
to snowpl...@googlegroups.com
;) I changed the /jsonpaths setting to
:jjson_assets: s3://mycompany-analytics/mycompany-json-schemas/jsonpaths/
and it worked.

Thanks Alex

Alex Dean

Aug 12, 2014, 4:17:58 AM
to snowpl...@googlegroups.com
Glad you fixed it Eugene! Yes you need to include /jsonpaths/, as Snowplow does for its own assets:

https://github.com/snowplow/snowplow/blob/master/4-storage/storage-loader/lib/snowplow-storage-loader/shredded_type.rb#L31
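
In other words, the JSONPaths file for each shredded type ends up being looked for at roughly:

<your JSONPaths bucket>/jsonpaths/<vendor>/<name>_<model>.json

e.g.

s3://mycompany-analytics/mycompany-json-schemas/jsonpaths/com.lingualeo.contexts/global_1.json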

A



Doron Pearl

Aug 12, 2014, 11:56:30 AM
to snowpl...@googlegroups.com
Hi Alex - we're getting the same error as mentioned above in the last step of the ETL:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs:/local/snowplow/shredded-events

We're sending an unstructured event after defining a self-describing JSON (tried both via Python and JS) and made sure that the cx parameter is set when the call is made to the collector.

As you suggested, I checked the /snowplow-shredded/bad folder, but all I see is these 0-byte files:
_SUCCESS, part-00000 ... part-00007

After the storage-loader runs, the events do make it to the DB (PostgreSQL); however, the custom context isn't there (I should be looking for additional fields on the events table, right?).
Also, the user_id is null in the events table even though it was set on the call to the collector.

Any idea what's wrong / how to troubleshoot this?

Thanks in advance,
Doron

Yali Sassoon

Aug 12, 2014, 5:17:15 PM
to snowpl...@googlegroups.com
Hi Doron,

One thing to note immediately - we do not currently support loading shredded data into Postgres, only Redshift. So if you're using Postgres at the moment, you may as well skip the shredding process (using --skip shred). You should still be able to view (and query) the context data using Postgres's JSON parsing functions: you should find your unstructured event JSON in the atomic.events."unstruct_event" field, and the custom context in the atomic.events."contexts" field. If possible, though, we do recommend switching to Redshift, as you'll be able to query the unstructured events and custom contexts much more efficiently once that data has been shredded.
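
As a rough sketch (assuming Postgres 9.3+, and that the contexts column holds the self-describing context JSON as text), something like this lets you peek at the raw context data:

-- illustrative only; column names per the canonical event model
SELECT event_id, contexts::json -> 'data' AS context_data
FROM atomic.events
WHERE contexts IS NOT NULL AND contexts <> ''
LIMIT 10;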

That still leaves the error message that you're getting when you run the shred step. That error is caused by there being no unstructured events or contexts successfully shredded. There are a number of possibilities:
  1. You aren't successfully collecting the unstructured events or contexts. It should be possible to check whether this is the case by doing a text search on your collector logs and seeing if you can spot the unstructured events / contexts there, as per the tracker protocol. Specifically, you're looking for query string parameters named 'ue_pr', 'ue_px', 'co' or 'cx'. It sounds like you've already checked this when you sent the data to the collector - can you also check that the data is in the collector logs? (There's a quick grep sketch after this list.)
  2. The rows are failing validation for some reason. In this case you should be able to find the data in your bad rows bucket, along with the reason for the validation failure. It appears that you've also checked this from your mail - can you confirm that you've checked both bad rows buckets, i.e. the one specified for enriched events and the one specified for shredded events?
  3. Because the job is failing, the bad shredded rows aren't being written to S3. I'll discuss with the team tomorrow whether this is a possibility.
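A quick way to do the text search in point 1, once you have the (unzipped) collector log files locally (the path here is illustrative):

grep -E '&(ue_pr|ue_px|co|cx)=' path/to/collector-logs/*
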
Can you check if the data you expect is in the atomic.events table, in the "contexts" or "unstruct_event" fields as unshredded JSON? Note that if you were running Redshift, the shredded data wouldn't be loaded into atomic.events; it would be loaded into dedicated additional tables you would need to set up in Redshift.

Hope that helps and let me know what you find...


Yali



Doron Pearl

Aug 12, 2014, 7:53:47 PM
to snowpl...@googlegroups.com
Hi Yali,

Thanks for the quick reply.

1. Yes, I can see the context values (base64-encoded) in the collector logs, under ue_px in one request and cx in another.
2. I've checked both locations, and both shredded/bad and enriched/bad contain only zero-byte files... so I'm unable to tell what went wrong.
I checked the atomic.events table and both the unstruct_event and contexts fields are empty.

Some more info to help with the troubleshooting:
a. The collector log: http://pastebin.com/jvzvFxwD

b. ETL enrichment process config:

c. Here's the content of PART-00000 under enriched/good:

It seems like a lot of data that was available in the collector log is missing from the enrichment output, so I think the problem might be there.
Looking forward to your insights.

Thanks again!
Doron

Yali

Aug 13, 2014, 6:00:53 AM
to snowpl...@googlegroups.com
Hi Doron,

Your enrichment process config looks good. However, the collector log you sent doesn't have any custom context or unstructured event data in it. Can you find a fragment of a processed collector log file with '&ue_pr' or '&ue_px' or '&co' or '&cx' in it, and paste those lines in please?

Doron Pearl

Aug 13, 2014, 8:25:33 AM
to snowpl...@googlegroups.com
Hi Yali - my apologies, I sent out the wrong collector log file (it was an old one from July).
Here's a new log that includes the 'cx' and 'ue_px' parameters: http://pastebin.com/nt35umNE

Thanks,
Doron

Yali Sassoon

Aug 13, 2014, 9:30:54 AM
to snowpl...@googlegroups.com
That's really odd Doron:
  • I decoded the cx value in your collector log and validated it against your JSON schema. It passed.
  • So if you feed that collector log into your enrichment process, as configured, that data point should end up in Redshift.

This makes me wonder if the problem is that, for some reason, the collector log isn't being processed (e.g. you have the incorrect bucket specified as the 'in' bucket).

When I look at the configuration you shared, I can see that you have the in bucket set to the same bucket as the log bucket. This is a bad idea, because it means that each time you run the data pipeline, it'll try and process the EMR logs generated from the last time the process was run.

This is doubly strange in your case, because you said you had no bad rows at all. Any EMR log that is fed into the Snowplow data pipeline will generate a bad row for every line in those EMR logs, because those logs are not in the correct format for Snowplow.

Things to try:

  1. Double check that s3://prod-snowplow-logs is the bucket where your collector logs are stored. Update your log bucket so that EMR logs are not saved into the same bucket as the collector logs.
  2. Try running EmrEtlRunner *just* for the single collector log you shared with me. (You can copy it into its own bucket, create a temporary config file that specifies that new bucket as the processing bucket, and run EmrEtlRunner with a --skip staging,archive argument. Then check the output from that run to see if that particular line was successfully processed. See the example command after this list.)
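
For step 2, the invocation would look something like this (the config filename is illustrative):

bundle exec bin/snowplow-emr-etl-runner --config=config/single-log-config.yml --skip staging,archive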

Doron Pearl

Aug 13, 2014, 11:52:35 AM
to snowpl...@googlegroups.com
Yali - the shared log bucket does indeed seem to have been the cause of the problem; moving the s3 :log: setting to its own bucket fixed it.
I can now see this file under enriched/good with the following content (so that's good progress!)
I'm now running into issues with the shredding process, which is unable to find our JSON schema. Hoping to figure this out soon.

Thanks,
Doron