Storage Loader error, mismatch in number of columns

64 views
Skip to first unread message

Joao Correia

unread,
Mar 28, 2016, 5:59:47 PM3/28/16
to Snowplow
Hi Snowplowers,

I haven't had any errors in any of my EMR stages when using additional contexts until today.

Data is on snowplow-shredded/good ready to be loaded but when I run the storage-loader I get:

------------

COPY atomic.net_company_product_catalog_1 FROM 's3://xxxxxxxx/snowplow-shredded/good/run=2016-03-28-13-23-41/net.company/product_catalog/jsonschema/1-' CREDENTIALS 'aws_access_key_id=xxxxxxxxxxxxxxxxxxxx;aws_secret_access_key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx' JSON AS 's3://xxxxxxxx/snowplow-json-paths/net.company/product_catalog_1.json' REGION AS 'us-west-2' MAXERROR 1 TRUNCATECOLUMNS TIMEFORMAT 'auto' ACCEPTINVCHARS ;
COMMIT;: ERROR: Number of jsonpaths and the number of columns should match. JSONPath size: 14, Number of columns in table or column list: 15
  Detail: 
  -----------------------------------------------
  error:  Number of jsonpaths and the number of columns should match. JSONPath size: 14, Number of columns in table or column list: 15
  code:      8001
  context:   
  query:     2417144
  location:  s3_utility.cpp:653
  process:   padbmaster [pid=29446]
  -----------------------------------------------
/snowplow/snowplow-storage-loader!/storage-loader/lib/snowplow-storage-loader/redshift_loader.rb:88:in `load_events_and_shredded_types'
file:/snowplow/snowplow-storage-loader!/storage-loader/bin/snowplow-storage-loader:54:in `(root)'
org/jruby/RubyArray.java:1613:in `each'
file:/snowplow/snowplow-storage-loader!/storage-loader/bin/snowplow-storage-loader:51:in `(root)'
org/jruby/RubyKernel.java:1091:in `load'
file:/snowplow/snowplow-storage-loader!/META-INF/main.rb:1:in `(root)'
org/jruby/RubyKernel.java:1072:in `require'
file:/snowplow/snowplow-storage-loader!/META-INF/main.rb:1:in `(root)'
/tmp/jruby1410521194231941383extract/jruby-stdlib-1.7.20.1.jar!/META-INF/jruby.home/lib/ruby/shared/rubygems/core_ext/kernel_require.rb:1:in `(root)'

------------

Any ideas what could be causing this? the error says it's a mismatch in the number of columns, how could this be if tracking didn't change?

Joao 

Ihor Tomilenko

unread,
Mar 28, 2016, 6:40:18 PM3/28/16
to Snowplow
Hi Joao,

Sorry to hear about the problem you experience.

Indeed, the error message indicates a mismatch between the actual structure of your table in Redshift and the JSONPaths file describing that table.

As EmrEtlRunner did not complain, the JSON describing your custom contexts conforms to your JSON schema.

As I understand, you have a dedicated table to load the custom contexts into, net_company_product_catalog_1. You also should have the corresponding JSONPaths file named product_catalog_1.json in your assets bucket. This file lists the columns in net_company_product_catalog_1 table. The number of parameters listed there should match the actual number of the columns in the Redshift table.

Could you, please, compare both (table and JSONPaths file) and find the difference? Have you by any chance had either of those modified recently? Have you made an update to snowplow?

Regards,
Ihor

Joao Correia

unread,
Mar 29, 2016, 12:26:58 PM3/29/16
to Snowplow
Hi Ihor,

This is my product catalog json schema.

{
    "jsonpaths": [
        "$.schema.vendor",
        "$.schema.name",
        "$.schema.format",
        "$.schema.version",
        "$.hierarchy.rootId",
        "$.hierarchy.rootTstamp",
        "$.hierarchy.refRoot",
        "$.hierarchy.refTree",
        "$.hierarchy.refParent",
        "$.data.productCategory",
        "$.data.productName",
        "$.data.productType"
    ]
}

this is a sample data row under /shredded/good/run=xxxxxx/net.company/product_catalog/jsonschema/1-0-0

 
   "schema": 
      "vendor":"net.company",
      "name":"product_catalog",
      "format":"jsonschema",
      "version":"1-0-0"
   },
   "data": 
      "productCategory":"xxxxxx",
      "productType":"xxxxxxx",
      "productName":"xxxxxxx"
   },
   "hierarchy": 
      "rootId":"0010628d-5157-4b21-a1b7-dd962aba49f8",
      "rootTstamp":"2016-03-28 12:30:06.000",
      "refRoot":"events",
      "refTree": 
         "events",
         "product_catalog"
      ],
      "refParent":"events"
   }
}

I can't figure out why it worked before and it doesn't now. I'm running Snowplow r77.

Thanks
Joao

Ihor Tomilenko

unread,
Mar 29, 2016, 1:44:19 PM3/29/16
to Snowplow
Hi Joao,

I'm confused right now. The error you provided earlier indicates that your Redshift table net_company_product_catalog_1 contains 15 columns but your JSONPaths file lists only 14. You didn't provide the structure of the table to review. However, the JSONPaths you are showing indicates only 12 columns.

It appears to me you are checking the wrong files. Does your JSONPaths file comes from the bucket s3://xxxxxxxx/snowplow-json-paths/net.company/product_catalog_1.json (indicated in the error)? What are the columns in net_company_product_catalog_1 table?

Regards,
Ihor

Joao Correia

unread,
Mar 29, 2016, 1:44:41 PM3/29/16
to Snowplow
I've got an update,

The problem is not from my custom contexts but from com.snowplowanalytics.snowplow/client_session. I have been testing the Objc-C tracker.

 I pinpoint the error to the client_session_1. Seems there is a mismatch columns.

Joao Correia

unread,
Mar 29, 2016, 2:12:38 PM3/29/16
to Snowplow
The issue is with client_session schema, please see the image below. I tried the COPY command by hand and I got the error.


Ihor Tomilenko

unread,
Mar 29, 2016, 2:52:23 PM3/29/16
to Snowplow
Joao,

I can see that you are using JSON schema 1-0-1. Have you upgraded from 1-0-0? Did you use Schema Guru?

If you take a look at the versions the schema here you will see that the version 1-0-0 is describing 14 parameters whereas version 1-0-1 has an extra property named "firstEventId".

You might need to adjust the corresponding JSONPaths file. Your Redshift table seems to match the 15 parameters of JSON schema 1-0-1.

Please, note that current version of Schema Guru is not version aware. So when table definitions and jsonpath files are generated for schema version 1-0-1, they are exactly the same as if they were generated for schema version 1-0-0. We will be releasing the new version addressing this bug soon.

Regards,
Ihor

Joao Correia

unread,
Mar 29, 2016, 3:17:00 PM3/29/16
to Snowplow
Hi Ihor,

Since I'm testing the ObjectiveC tracker this was the first time I got data into this table. I don't control which schema the ObjC tracker uses do I? 

How do I adjust the JSON Paths for the client_session? Isn't it built into Snowplow?

Ihor Tomilenko

unread,
Mar 29, 2016, 4:08:24 PM3/29/16
to Snowplow
Yes, Joao,

You are right. I will pass this to the developers to see if there's any issue on our side.

Regards,
Ihor

Alex Dean

unread,
Mar 29, 2016, 9:44:09 PM3/29/16
to Snowplow
Hi Joao,

The client_session table available in the snowplow/snowplow GitHub repo, master branch, has 14 columns:

https://github.com/snowplow/snowplow/blob/master/4-storage/redshift-storage/sql/com.snowplowanalytics.snowplow/client_session_1.sql

If you deploy that, the load should go through fine.

Cheers,

Alex

--
You received this message because you are subscribed to the Google Groups "Snowplow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to snowplow-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Co-founder
Snowplow Analytics
The Roma Building, 32-38 Scrutton Street, London EC2A 4RQ, United Kingdom
+44 (0)203 589 6116
+44 7881 622 925
@alexcrdean
Reply all
Reply to author
Forward
0 new messages