GPLOAD Control File limitations

196 views
Skip to first unread message

Zayne du Bery

unread,
Jan 13, 2017, 12:49:45 AM1/13/17
to Greenplum Users
Hello Greenplum!

I recently installed and deployed greenplum. First up, I really like it a lot. My 1 master and 4 segments are operational on a virtualized Xen hypervisor ontop of a monster Supermicro Server!

But I am faced with a problem which is hardly documented anywhere. Even though gpload is mighty quick (15000000 records in 140.27 seconds), I cannot find a way to convert dates without manually creating an external table nor ignore trailing null columns.

The problem is as follows: 

  1. I have a date in the csv file: Format is YYYYMMDDHHMMSS but my database table is defined as timestamp in the form: YYYY-MM-DD HH:MM:SS - I need to convert the CSV input format of 20170117091011 to 2017-01-17 09:10:11 within gpload...
  2. How do I ignore 200 trailing NULLCOLS with GPLOAD? There are 200 blank columns at the end of the csv which I wish to ignore using a YML Control File.
There is no documentation regarding this. We wish to improve on our existing Oracle performance by migrating to Greenplum and using the exact same files.

Thanks! Keep up the great work.

Zayne

Jon Roberts

unread,
Jan 13, 2017, 8:22:58 AM1/13/17
to Zayne du Bery, Greenplum Users
Both of these can be handled by the MAPPING section of the YAML file.  I'll ignore columns col3, col4, and col5 and transform your date using to_timestamp in the YAML file to the correct timestamp format.

[gpadmin@gpdbsne ~]$ cat testfile.txt 
"col1"|"col2"|"col3"|"col4"|"col5"|20170117091011
"col1"|"col2"|"col3"|"col4"|"col5"|20170117091011
"col1"|"col2"|"col3"|"col4"|"col5"|20170117091011
"col1"|"col2"|"col3"|"col4"|"col5"|20170117091011
"col1"|"col2"|"col3"|"col4"|"col5"|20170117091011
"col1"|"col2"|"col3"|"col4"|"col5"|20170117091011
"col1"|"col2"|"col3"|"col4"|"col5"|20170117091011
"col1"|"col2"|"col3"|"col4"|"col5"|20170117091011
"col1"|"col2"|"col3"|"col4"|"col5"|20170117091011
"col1"|"col2"|"col3"|"col4"|"col5"|20170117091011
[gpadmin@gpdbsne ~]$ cat test.yml 
---
VERSION: 1.0.0.1 
DATABASE: gpadmin
USER: gpadmin
HOST: gpdbsne
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - gpdbsne
        PORT: 8999
        FILE:
          - /home/gpadmin/testfile.txt
    - FORMAT: text 
    - DELIMITER: '|'
    - QUOTE: '"'
    - COLUMNS:
           - col1: text
           - col2: text
           - col3: text
           - col4: text
           - col5: text
           - col6: text
   OUTPUT:
     - TABLE: public.test
     - MODE: insert 
     - MAPPING:
               col1: col1
               col2: col2
               col6: to_timestamp(col6, 'YYYYMMDDHH24MISS')
[gpadmin@gpdbsne ~]$ psql 
SET
Timing is on.
psql (8.2.15)
Type "help" for help.

gpadmin=# create table public.test (col1 text, col2 text, col6 timestamp) distributed randomly;
CREATE TABLE
Time: 14.006 ms
gpadmin=# \q
[gpadmin@gpdbsne ~]$ gpload -f test.yml 
2017-01-13 08:20:58|INFO|gpload session started 2017-01-13 08:20:58
2017-01-13 08:20:58|INFO|started gpfdist -p 8999 -P 9000 -f "/home/gpadmin/testfile.txt" -t 30
2017-01-13 08:20:58|INFO|running time: 0.12 seconds
2017-01-13 08:20:58|INFO|rows Inserted          = 10
2017-01-13 08:20:58|INFO|rows Updated           = 0
2017-01-13 08:20:58|INFO|data formatting errors = 0
2017-01-13 08:20:58|INFO|gpload succeeded
[gpadmin@gpdbsne ~]$ psql -c "select * from public.test"
  col1  |  col2  |        col6         
--------+--------+---------------------
 "col1" | "col2" | 2017-01-17 09:10:11
 "col1" | "col2" | 2017-01-17 09:10:11
 "col1" | "col2" | 2017-01-17 09:10:11
 "col1" | "col2" | 2017-01-17 09:10:11
 "col1" | "col2" | 2017-01-17 09:10:11
 "col1" | "col2" | 2017-01-17 09:10:11
 "col1" | "col2" | 2017-01-17 09:10:11
 "col1" | "col2" | 2017-01-17 09:10:11
 "col1" | "col2" | 2017-01-17 09:10:11
 "col1" | "col2" | 2017-01-17 09:10:11
(10 rows)





Jon Roberts
Principal Engineer | jrob...@pivotal.io | 615-426-8661

--
You received this message because you are subscribed to the Google Groups "Greenplum Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gpdb-users+unsubscribe@greenplum.org.
To post to this group, send email to gpdb-...@greenplum.org.
Visit this group at https://groups.google.com/a/greenplum.org/group/gpdb-users/.
For more options, visit https://groups.google.com/a/greenplum.org/d/optout.

Reply all
Reply to author
Forward
0 new messages