snowplow-storage-loader fails

164 views
Skip to first unread message

Seneschaux

unread,
Aug 2, 2015, 1:00:12 PM8/2/15
to Snowplow
Hi team,

Had to add an additional inbound rule on Security Group so EC2 instance can access Redshift. (did I miss this in docs?)
Stuck here now:
$ bundle exec bin/snowplow-storage-loader --config config/redshift.yml --skip analyze
Loading Snowplow events and shredded types into My Redshift database (Redshift cluster)...
PG::Error error executing COPY and ANALYZE statements: BEGIN;
COPY atomic.events FROM 's3://snowplow-hklx-analysis/out/enriched/' CREDENTIALS
'aws_access_key_id=A**********A;aws_secret_access_key=H***************************U' REGION AS 'us-east-1' DELIMITER '\t' MAXERROR 1 EMPTYASNULL FILLRECORD TRUNCATECOLUMNS  TIMEFORMAT 'auto' ACCEPTINVCHARS;
COMMIT;: ERROR:  Load into table 'events' failed.  Check 'stl_load_errors' system table for details.

userid slice tbl starttime session query filename line_number colname type col_length position raw_line raw_field_value err_code err_reason
101 0 100067 2015-08-02 16:32:32.014119 1012 581 s3://snowplow-haka-analysis/out/enriched/run=2015-07-31-03-30-37/ 1 collector_tstamp timestamp 0 0 0 1213 Missing data for not-null field

I had 4 part* files here but removed 2 empty files (in hopes this was problem -- no luck)  Can I assume 0-byte files are OK?
Perhaps problem in data files but not sure "files" are read (vs folders).  

Can you provide some steps to investigate further?

redshift.yml file:
:aws:
  :access_key_id: A**********A
  :secret_access_key: H***************************U
:s3:
  :region: us-east-1 # S3 bucket region
  :buckets:
    :jsonpath_assets: # If you have defined your own JSON Schemas, add the s3://path to your own JSON Path files in your own bucket here
    :enriched:
      :good: s3://snowplow-haka-analysis/out/enriched # Must be s3:// not s3n:// for Redshift. This is the same as the :enriched:good: bucket specified for EmrEtlRunner
      :archive: s3://snowplow-haka-analysis/archive/StorageLoader/enriched # Where to archive enriched events to
    :shredded:
      :good: s3://snowplow-haka-analysis/out/shredded # Must be s3:// not s3n:// for Redshift. This is the same as the :shredded:good: bucket specified for EmrEtlRunner
      :archive: s3://snowplow-haka-analysis/archive/StorageLoader/shredded # Where to archive shredded types to
:download:
  :folder: # Not required for Redshift
:targets:
  - :name: "My Redshift database"
    :type: redshift
    :host: snowplow-dw-inst1.c*****i.us-east-1.redshift.amazonaws.com # The endpoint as shown in the Redshift console
    :database: dev1 # Name of database
    :port: 5439 # Default Redshift port
    :table: atomic.events
    :username: storageloader
    :password: 7********e
    :maxerror: 1 # Stop loading on first error, or increase to permit more load errors
    :comprows: 200000 # Default for a 1 XL node cluster. Not used unless --include compupdate specified

Alex Dean

unread,
Aug 2, 2015, 5:21:33 PM8/2/15
to Snowplow
Hi Seneschaux,
  • Empty files are fine for the load - they are happily ignored
  • There is a pretty rare bug around collector_tstamp being null (https://github.com/snowplow/snowplow/issues/1611) but the fact you are getting this on line 1 with a very small volume of data makes me think something else is going on
  • You can try bumping maxerror to see if any events can be loaded
  • What do the non-empty enriched event files contain when you download them and cat them?

Thanks,

Alex


--
You received this message because you are subscribed to the Google Groups "Snowplow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to snowplow-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Co-founder
Snowplow Analytics
The Roma Building, 32-38 Scrutton Street, London EC2A 4RQ, United Kingdom
+44 (0)203 589 6116
+44 7881 622 925
@alexcrdean

Seneschaux

unread,
Aug 2, 2015, 5:35:07 PM8/2/15
to Snowplow
Hi Alex,

Added a header row (for display here) but here's the first few lines of part-00000:
app_id platform etl_tstamp collector_tstamp dvce_tstamp event event_id txn_id name_tracker v_tracker v_collector v_etl user_id user_ipaddress user_fingerprint domain_userid domain_sessionidx network_userid geo_country geo_region geo_city geo_zipcode geo_latitude geo_longitude geo_region_name ip_isp ip_organization ip_domain ip_netspeed page_url page_title page_referrer page_urlscheme page_urlhost page_urlport page_urlpath page_urlquery page_urlfragment refr_urlscheme refr_urlhost refr_urlport refr_urlpath refr_urlquery refr_urlfragment refr_medium refr_source refr_term mkt_medium mkt_source mkt_term mkt_content mkt_campaign contexts se_category se_action se_label se_property se_value unstruct_event tr_orderid tr_affiliation tr_total tr_tax tr_shipping tr_city tr_state tr_country ti_orderid ti_sku ti_name ti_category ti_price ti_quantity pp_xoffset_min pp_xoffset_max pp_yoffset_min pp_yoffset_max useragent br_name br_family br_version br_type br_renderengine br_lang br_features_pdf br_features_flash br_features_java br_features_director br_features_quicktime br_features_realplayer br_features_windowsmedia br_features_gears br_features_silverlight br_cookies br_colordepth br_viewwidth br_viewheight os_name os_family os_manufacturer os_timezone dvce_type dvce_ismobile dvce_screenwidth dvce_screenheight doc_charset doc_width doc_height tr_currency tr_total_base tr_tax_base tr_shipping_base ti_currency ti_price_base base_currency geo_timezone mkt_clickid mkt_network etl_tags dvce_sent_tstamp refr_domain_userid refr_dvce_tstamp derived_contexts domain_sessionid derived_tstamp

web 07/31/15 03:30 AM 07/26/15 05:43 AM 07/26/15 05:43 AM page_view 003c061c-80ec-4f3c-8013-377700773d3c
cf js-2.4.2 cloudfront hadoop-1.0.0-common-0.14.0
66.102.6.206 545783021 56e63229e8ede4b8 1











https://gtm-msr.appspot.com/render2?id=GTM-KZPLB9 gtm-msr
https gtm-msr.appspot.com 80 /render2 id=GTM-KZPLB9







































Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko




en-US 0 1 1 1 0 0 0 0 1 1 24 1024 674


America/New_York

1024 768 utf-8 1024 674

















web 07/31/15 03:30 AM 07/26/15 05:43 AM 07/26/15 05:43 AM struct 00ea4547-0002-41a2-803a-dabb00129755
cf js-2.4.2 cloudfront hadoop-1.0.0-common-0.14.0
66.102.6.206 545783021 56e63229e8ede4b8 1











https://gtm-msr.appspot.com/render2?id=GTM-KZPLB9

https gtm-msr.appspot.com 80 /render2 id=GTM-KZPLB9














{"schema":"iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0","data":[{"account":"Haka Account"}]} assignment clickAssignPlay Assign Play
0


















Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko




en-US 0 1 1 1 0 0 0 0 1 1 24 1024 674


America/New_York

1024 768 utf-8 1024 674

















web 07/31/15 03:30 AM 07/26/15 05:43 AM 07/26/15 05:43 AM struct 00003022-8060-402e-8077-4026800bb5f7
cf js-2.4.2 cloudfront hadoop-1.0.0-common-0.14.0
66.102.6.201 545783021 56e63229e8ede4b8 1











https://gtm-msr.appspot.com/render2?id=GTM-KZPLB9

https gtm-msr.appspot.com 80 /render2 id=GTM-KZPLB9















performance clickPerformance Perfomance Metrics
0


















Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko




en-US 0 1 1 1 0 0 0 0 1 1 24 1024 674


America/New_York

1024 768 utf-8 1024 674

















web 07/31/15 03:30 AM 07/26/15 05:43 AM 07/26/15 05:43 AM struct 00d8506e-0058-4caa-80d8-81ee0058192d
cf js-2.4.2 cloudfront hadoop-1.0.0-common-0.14.0
66.102.6.206 545783021 56e63229e8ede4b8 1











https://gtm-msr.appspot.com/render2?id=GTM-KZPLB9

https gtm-msr.appspot.com 80 /render2 id=GTM-KZPLB9















video shareVideo

0


















Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko




en-US 0 1 1 1 0 0 0 0 1 1 24 1024 674


America/New_York

1024 768 utf-8 1024 674
















Will try increasing maxerror and let you know.

On Sunday, August 2, 2015 at 4:21:33 PM UTC-5, Alex Dean wrote:
Hi Seneschaux,
  • Empty files are fine for the load - they are happily ignored
  • There is a pretty rare bug around collector_tstamp being null (https://github.com/snowplow/snowplow/issues/1611) but the fact you are getting this on line 1 with a very small volume of data makes me think something else is going on
  • You can try bumping maxerror to see if any events can be loaded
  • What do the non-empty enriched event files contain when you download them and cat them?

Thanks,

Alex

On Sun, Aug 2, 2015 at 6:00 PM, Seneschaux <senes...@gmail.com> wrote:
Hi team,

Had to add an additional inbound rule on Security Group so EC2 instance can access Redshift. (did I miss this in docs?)
Stuck here now:
$ bundle exec bin/snowplow-storage-loader --config config/redshift.yml --skip analyze
Loading Snowplow events and shredded types into My Redshift database (Redshift cluster)...
PG::Error error executing COPY and ANALYZE statements: BEGIN;
COPY atomic.events FROM 's3://snowplow-haka-analysis/out/enriched/' CREDENTIALS

Seneschaux

unread,
Aug 2, 2015, 5:57:43 PM8/2/15
to Snowplow
Hi Alex,

That helped (used maxerror=5) -- processed 2 of two files.  Here's the stl_load_errors -- trips on line 1, it seems.
userid slice tbl starttime session query filename line_number colname type col_length position raw_line raw_field_value err_code err_reason
101 1 100067 2015-08-02 21:46:03.502507 28971 539 s3://snowplow-haka-analysis/out/enriched/ 1 collector_tstamp timestamp 0 0 0 1213 Missing data for not-null field
101 0 100067 2015-08-02 21:46:03.502507 28971 539 s3://snowplow-haka-analysis/out/enriched/run=2015-07-31-03-30-37/ 1 collector_tstamp timestamp 0 0 0 1213 Missing data for not-null field

Alex Dean

unread,
Aug 2, 2015, 6:00:02 PM8/2/15
to Snowplow
What is line 1 in each of those files? (The raw lines, not post-processing in Excel or similar)

A

Mark Ferris

unread,
Aug 2, 2015, 6:41:51 PM8/2/15
to snowpl...@googlegroups.com
From storage loader's archive bucket:
part-00000/line1
web 2015-07-31 03:30:37.868 2015-07-26 05:43:32.000 2015-07-26 05:43:30.566 page_view 003c061c-80ec-4f3c-8013-377700773d3c cf js-2.4.2 cloudfront hadoop-1.0.0-common-0.14.0 66.102.6.206 545783021 56e63229e8ede4b8 1 https://gtm-msr.appspot.com/render2?id=GTM-KZPLB9 gtm-msr https gtm-msr.appspot.com 80 /render2 id=GTM-KZPLB9 Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko en-US 0 1 1 1 0 0 0 0 1 1 24 1024 674 America/New_York 1024 768 utf-8 1024 674

part-00001/line1
web 2015-07-31 03:30:37.868 2015-07-27 06:42:24.000 2015-07-27 06:42:09.571 struct 80f0dfc7-00c6-40d8-80ae-c85c0076763a cf js-2.4.2 cloudfront hadoop-1.0.0-common-0.14.0 66.249.88.165 545783021 2f3b563a8a2bc359 1 https://gtm-msr.appspot.com/render2?id=GTM-KZPLB9 https gtm-msr.appspot.com 80 /render2 id=GTM-KZPLB9 library clickLibrary Operation Center 0.0 Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko en-US 0 1 1 1 0 0 0 0 1 1 24 1024 674 America/New_York 1024 768 utf-8 1024 674
Will try to reprocess from cloudfront and check the source/target buckets/files.  It's almost like the first row is a header but I don't see it.

My action for now.  Thank you again!.  Will update later.

--
You received this message because you are subscribed to a topic in the Google Groups "Snowplow" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/snowplow-user/0fjNSkRBSak/unsubscribe.
To unsubscribe from this group and all its topics, send an email to snowplow-use...@googlegroups.com.

Alex Dean

unread,
Aug 2, 2015, 7:46:56 PM8/2/15
to Snowplow
Yes that's odd - the first lines of your files clearly have collector_tstamps.

A
Reply all
Reply to author
Forward
0 new messages