PSQLException error code 1004 out of memory on Redshift COPY


Sean Halliburton

Mar 19, 2016, 8:45:45 PM
to Snowplow
Here's a new one, hit when I try to run StorageLoader:

./snowplow-storage-loader --config config.yml --skip analyze
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   930  100   930    0     0  1028k      0 --:--:-- --:--:-- --:--:--  908k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   930  100   930    0     0  1117k      0 --:--:-- --:--:-- --:--:--  908k
Loading Snowplow events and shredded types into sp-dev-redshift (Redshift cluster)...
Unexpected error: Java::OrgPostgresqlUtil::PSQLException error executing COPY statements: BEGIN;
COPY atomic.events FROM 's3://team/snowplow-dev/shredded/good/run=2016-03-19-04-10-50/atomic-events' CREDENTIALS 'aws_access_key_id=XXXXXXXXXXXX;aws_secret_access_key=XXXXXXXXXXXXXXXXXXXXXXXXXXX;token=XXXXXXXXXXXX' REGION AS 'us-west-2' DELIMITER '\t' MAXERROR 1 EMPTYASNULL FILLRECORD TRUNCATECOLUMNS  TIMEFORMAT 'auto' ACCEPTINVCHARS ;
COMMIT;: ERROR: Out of memory
  Detail: 
  -----------------------------------------------
  error:  Out of memory
  code:      1004
  context:   allocation failed, maximum supported size exceeded, requested size: 4294967197
  query:     4881
  location:  alloc.cpp:1864
  process:   query0_321 [pid=30699]
  -----------------------------------------------


We modified redshift_loader.rb to include the session token:

def get_credentials(config)
  if config[:aws][:access_key_id] == 'iam' && config[:aws][:secret_access_key] == 'iam'
    # Fetch temporary credentials from the EC2 instance profile. This will
    # definitely factor into the direct Redshift query requests too, once we
    # can get past the S3 session token issue(s)
    credentials_from_role = Aws::InstanceProfileCredentials.new.credentials
    "aws_access_key_id=#{credentials_from_role.access_key_id};aws_secret_access_key=#{credentials_from_role.secret_access_key};token=#{credentials_from_role.session_token}"
  else
    "aws_access_key_id=#{config[:aws][:access_key_id]};aws_secret_access_key=#{config[:aws][:secret_access_key]}"
  end
end
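With that change, the CREDENTIALS string handed to Redshift carries the temporary session token, so the generated COPY looks roughly like this (a sketch with placeholder credentials and an abbreviated run path, mirroring the statement in the error above):

```sql
-- Sketch of the generated COPY; <KEY>, <SECRET>, <SESSION_TOKEN> are placeholders.
COPY atomic.events
FROM 's3://team/snowplow-dev/shredded/good/run=.../atomic-events'
CREDENTIALS 'aws_access_key_id=<KEY>;aws_secret_access_key=<SECRET>;token=<SESSION_TOKEN>'
REGION AS 'us-west-2'
DELIMITER '\t' MAXERROR 1 EMPTYASNULL FILLRECORD TRUNCATECOLUMNS
TIMEFORMAT 'auto' ACCEPTINVCHARS;
```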

and our first 1-2 DB commits were successful.  But now we get this error every time.  I've tried researching 'postgres error code 1004', 'redshift error code 1004', '1004 allocation failed', 'redshift/postgres out of memory', etc., and found no valuable hints, even in the AWS forums.  Has anyone run into this before?  

S

Sean Halliburton

Mar 19, 2016, 8:52:42 PM
to Snowplow
Here is the config:

aws:
  access_key_id: XXXXXXXXXXXX
  secret_access_key: XXXXXXXXXXXX
  s3:
    region: us-west-2
    buckets:
      assets: s3://snowplow-hosted-assets
      jsonpath_assets: s3://team/snowplow-dev/jsonpaths
      log: s3n://team/snowplow-dev/etl/logs
      raw:
        in: ["s3n://team/snowplow-dev/raw"]
        processing: s3://team/snowplow-dev/etl/processing
        archive: s3://team/snowplow-dev/archive/raw
      enriched:
        good: s3://team/snowplow-dev/enriched/good
        bad: s3://team/snowplow-dev/enriched/bad
        errors: 
        archive: s3://team/snowplow-dev/enriched/archive
      shredded:
        good: s3://team/snowplow-dev/shredded/good
        bad: s3://team/snowplow-dev/shredded/bad
        errors: 
        archive: s3://team/snowplow-dev/shredded/archive
  emr:
    ami_version: 4.3.0
    region: us-west-2
    jobflow_role: instance-profile
    service_role: role
    placement:
    ec2_subnet_id: subnet-a6374fc3
    ec2_key_name: sp-dev-batchprocessor
    bootstrap: ["s3://team/proxy.sh"]
    software:
      hbase:
      lingual:
    jobflow:
      master_instance_type: m4.large
      core_instance_count: 2
      core_instance_type: m4.2xlarge
      task_instance_count: 2
      task_instance_type: m4.2xlarge
      task_instance_bid:
    bootstrap_failure_tries: 3
    additional_info:
collectors:
  format: thrift
enrich:
  job_name: sp-dev-enrich
  versions:
    hadoop_enrich: 1.6.0
    hadoop_shred: 0.8.0
    hadoop_elasticsearch: 0.1.0
  continue_on_unexpected_error: false
  output_compression: NONE
storage:
  download:
    folder:
  targets:
    - name: "sp-dev-redshift"
      type: redshift
      database: snowplow
      port: 5439
      ssl_mode: disable
      table: atomic.events
      username: "snowplowdata"
      password: "XXXXXXXXXX"
      es_nodes_wan_only:
      maxerror: 1
      comprows: 200000
monitoring:
  tags: {} # Name-value pairs describing this job
  logging:
    level: DEBUG # You can optionally switch to INFO for production
iglu:
  schema: iglu:com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0
  data:
    cache_size: 1000
    repositories:
      - name: "Iglu Central"
        priority: 0
        vendor_prefixes:
          - com.snowplowanalytics
          - com.google
        connection:
          http:
            uri: http://iglucentral.com
      - name: "Our Iglu repository"
        priority: 5
        vendor_prefixes:
          - com.company
        connection:
          http:

Fred Blundun

Mar 22, 2016, 7:05:49 AM
to snowpl...@googlegroups.com
Hi Sean,

It could be that you have exceeded the capacity of your Redshift cluster. If this is the case you will probably want to follow the AWS guide on resizing a Redshift cluster. How many nodes are you running and what type are they?

Regards,
Fred

--
You received this message because you are subscribed to the Google Groups "Snowplow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to snowplow-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sean Halliburton

Mar 22, 2016, 8:35:33 PM
to Snowplow
We solved it.  It was actually unrelated to cluster size, WLM, etc.  It was a combination of a missed step in Redshift setup and some horrible error messaging on Redshift's (or maybe Postgres's, not sure) part.  Our storageloader user lost usage rights on either the atomic schema or atomic.events.  We recreated them and it works fine now.  /shrug 
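For anyone who hits the same thing, restoring the rights amounted to something like this (a sketch; 'snowplowdata' is the loader user from our config, swap in your own):

```sql
-- Re-grant the StorageLoader user's rights on the atomic schema and table.
GRANT USAGE ON SCHEMA atomic TO snowplowdata;
GRANT SELECT, INSERT ON atomic.events TO snowplowdata;
```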

Thanks!

Fred Blundun

Mar 23, 2016, 5:03:04 AM
to snowpl...@googlegroups.com
Fair enough - glad you got it working!

Fred



Sean Halliburton

Mar 23, 2016, 6:55:51 PM
to Snowplow
I spoke too soon.  The error popped up again this morning.  Cluster size does not seem to matter: we have one install running on a single-node dc1.large importing the same data successfully, while this newer one runs on an 8-node dc1.large cluster and chokes.  I've tried resizing it several ways; it seems to work once right after a resize and then fail on subsequent loads.  It's completely intermittent.  I've adjusted the WLM concurrency, memory allocation, and slots, and I observe no noticeable effect or pattern.  Again, we've had a separate install running on the default Redshift parameter group for nearly 6 months without this issue.  We thought it might be permissions-related because we were able to copy to the top-level * schema but not atomic.  Do schemas occupy dedicated space in memory?

I have noticed that the DB user we created for storage loader registers multiple open Redshift sessions simultaneously: https://www.dropbox.com/s/puehcw7r4kanr7c/Screenshot%202016-03-23%2008.15.21.png?dl=0  I've seen up to 10 running concurrently.  Is this normal?
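For reference, we counted the sessions with a query along these lines (a sketch; STV_SESSIONS is Redshift's system table of open sessions, and 'snowplowdata' is our loader user):

```sql
-- Open sessions for the StorageLoader DB user
SELECT starttime, process, user_name, db_name
FROM stv_sessions
WHERE user_name = 'snowplowdata'
ORDER BY starttime;
```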

S

tobias...@justwatch.com

Mar 24, 2016, 5:24:29 AM
to Snowplow
Hey, we have the exact same error after we resized our Redshift cluster yesterday from a single node to a multi-node (4 nodes, 1 leader) dc1.large.  We only have to load around 280MB, so it shouldn't run out of memory on a node with 15GB.  I've already posted in the AWS forum and created a support case with AWS.  If we get any further information or can solve it, I'll let you know; I don't know if we have the same error, though.  We are on Snowplow r77, and this is the AWS thread: https://forums.aws.amazon.com/thread.jspa?threadID=228118

Fred Blundun

Mar 24, 2016, 6:54:26 AM
to snowpl...@googlegroups.com
Thanks for sharing that thread Tobias - it will be interesting to see what AWS have to say about it.

Sean - it isn't normal to have that many concurrent open sessions. Do you have any other software interacting with Redshift that could be responsible?

Regards,
Fred


Sean Halliburton

Mar 24, 2016, 12:33:57 PM
to Snowplow
No, just this single install from a single EC2 instance.  The really weird thing is that we're only trying to load a few hundred to a few thousand records for testing purposes.  We have a separate install in a separate AWS account running r72, loading the exact same data via a second tracker/endpoint.  That one's been running fine for months without incident.  Did the StorageLoader change drastically between r72 and r77?  Did the Redshift COPY statement change significantly?

S

Fred Blundun

Mar 24, 2016, 1:35:24 PM
to snowpl...@googlegroups.com
You can see a list of all changes between r72 and r77 here. The changes to the StorageLoader are labelled. There was only one change to the SQL run by the StorageLoader: ticket 1361. But I would be surprised if that caused such a big difference to memory consumption...

Regards,
Fred


Sean Halliburton

Mar 24, 2016, 2:26:39 PM
to Snowplow
We are including vacuum, but not analyze.  I don't think the issue is specific to Snowplow, or to the data we're trying to load, but at least wanted to check around here.  @tobias, our onsite AWS support rep just checked their internal issue tracker for '1004' errors and said there are now 18+ open tickets reporting this issue in us-west-2, which we're using.  I don't know if other regions are affected or reporting the same issue.  What region is your cluster in?

S

tobias...@justwatch.com

Mar 24, 2016, 3:31:20 PM
to Snowplow
Hey, OK, that's good to know, so there seems to be a bigger issue on the AWS side.  We are hosting in eu-west-1.  We had a maintenance event on 03/23, so maybe they updated the Redshift version to this one: http://aws.amazon.com/releasenotes/Amazon-Redshift/6547015632191386

tobias...@justwatch.com

Mar 25, 2016, 6:24:01 AM
to Snowplow
Hey,

AWS support answered and there is a solution: if you need it ASAP you can do a deep copy, or you can just change your maintenance window.  Beware that the maintenance window timing ("after noon tomorrow") was for eu-west-1; I don't know if this works for every region.

Unfortunately, the Out of Memory error that you are experiencing was introduced in our latest patch. The Redshift team has already identified the issue and the fix will be deployed in the upcoming patch. In order to receive this patch as soon as possible, you can modify your cluster's maintenance window to a time after noon tomorrow UTC. However, it is possible that, in order to continue inserting data into the affected table(s), you will have to perform a DEEP COPY of them. Instructions for this are outlined below:
—— How to perform a DEEP COPY ——
CREATE TABLE <new_table_name> (LIKE <original_table_name>);
INSERT INTO <new_table_name> SELECT * FROM <original_table_name>;
DROP TABLE <original_table_name>;
ALTER TABLE <new_table_name> RENAME TO <original_table_name>;
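Applied to the Snowplow events table, that deep copy would look something like this (my sketch, not part of the AWS reply; wrapping it in a transaction keeps the swap atomic):

```sql
BEGIN;
CREATE TABLE atomic.events_new (LIKE atomic.events);
INSERT INTO atomic.events_new SELECT * FROM atomic.events;
DROP TABLE atomic.events;
ALTER TABLE atomic.events_new RENAME TO events;
COMMIT;
```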

Sean Halliburton

Mar 25, 2016, 12:11:41 PM
to Snowplow
Yep, we received the same instructions and picked up the patch.  Testing now, thanks!

Sean

kath...@eero.com

Mar 25, 2016, 1:18:35 PM
to Snowplow
I use a staging table that I truncate before each load into Redshift.  We were also impacted by the bug.  To resolve the issue without waiting on the fix from Amazon, I had to create a new version of the table, not just as a deep copy but without column compression.  Hope that helps someone!
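In other words, instead of CREATE TABLE (LIKE ...), which carries over the column encodings, the columns get declared with ENCODE RAW.  A sketch with a hypothetical staging table (the column names are illustrative, not our real schema):

```sql
-- Hypothetical staging table with compression disabled on every column.
CREATE TABLE staging_events (
  event_id         CHAR(36)     ENCODE RAW,
  collector_tstamp TIMESTAMP    ENCODE RAW,
  event            VARCHAR(128) ENCODE RAW
);
```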

Katharine

tobias...@justwatch.com

Mar 26, 2016, 8:26:26 AM
to Snowplow
We had a maintenance window, but nothing changed.  Could someone tell me the cluster version you have after the update (select version())?  Mine is 1.0.1043; I don't know if this is the patched or unpatched version.

thx

Sean Halliburton

Mar 26, 2016, 11:18:50 AM
to snowpl...@googlegroups.com
1041 was the bug; 1043 is the patch, which is working for us so far, though we did have to perform a deep copy to port the existing table to it.

S


--
Sean Halliburton

tobias...@justwatch.com

Mar 26, 2016, 1:01:13 PM
to Snowplow
OK, thanks, I will try that deep copy too.  Wow, that's really a huge and scary AWS fuckup.