Not able to upload data into Redshift from HDFS -- no error, no message


Shekhar Sharma

May 8, 2015, 2:22:22 AM
to lingua...@googlegroups.com
Hello,
I am running Hadoop 1 in pseudo-distributed mode, and I am trying to migrate data from my box to Amazon Redshift. The MR job ran successfully, the console tells me the exact number of rows, and the task tracker output also shows that it is in the process of committing. I am not sure why it is not populating the data in Redshift.

Please help me if you have any ideas on how to resolve this issue. I have tried all the options I could think of.

Regards,
Som

Shekhar Sharma

May 8, 2015, 7:39:51 AM
to lingua...@googlegroups.com
Some more information:
(1) Lingual version: 1.2
(2) Provider version: 3.0.0
(3) Hadoop 1.0.3

The script is as follows (I have removed the database name, username, and password):
hadoop fs -rmr .lingual

# tell lingual which platform to run on ("local" or "hadoop")
export LINGUAL_PLATFORM=hadoop
## location of Lingual catalog
CATALOG_PATH=/user/$USER/.lingual


#-------------------------------------------------------------------------------------------------------------------------------------

# initialize lingual catalog
lingual catalog --init --showstacktrace
echo ${CASCADING_JDBC_PATH}
# create cascading-jdbc-redshift provider
lingual catalog --provider --add /home/campus/cascading-jdbc/cascading-jdbc-redshift/build/libs/cascading-jdbc-redshift-3.0.0-wip-dev-provider.jar --showstacktrace

#-------------------------------------------------------------------------------------------------------------------------------------

# create IN schema to read from
lingual catalog --schema IN --add
# create stereotype and define columns and types
lingual catalog --schema IN --stereotype IN --add --columns anonymized_id,make,model,year,vehicle_type,vin,serial,date,country_code,state_code,software_version,bench_mode,driver_type,application_name,application_category_name,total_launches,total_duration,softkey_press_count,job_run_id  --types string,string,string,string,string,string,string,string,string,string,string,boolean,string,string,string,long,long,int,int
# add format to schema
delim=`echo 'a' | tr 'a' '\001'`
lingual catalog --schema IN --format ssv --add --provider text  --properties "delimiter=${delim}"
# add table to schema
lingual catalog --schema IN --table IN --stereotype IN --add <some file on hdfs> --format ssv
# show table
lingual catalog --schema IN --table IN --show

# Putting data into Redshift
# create OUT5 schema to write to
lingual catalog --schema OUT5 --add --showstacktrace
# create stereotype and define columns and types
lingual catalog --schema OUT5 --stereotype OUT5 --add --columns anonymized_id --types string --showstacktrace
# add format to schema
lingual catalog --schema OUT5 --format postgresql --add --provider redshift --showstacktrace
# add protocol and table definition to schema
lingual catalog --schema OUT5 --protocol jdbc --add --properties="tabledesc.tablename=sample,tabledesc.columnnames=anonymized_id,tabledesc.columndefs=CHARACTER VARYING NOT NULL DISTKEY ENCODE LZO,jdbcuser=###l,jdbcpassword=##,usedirectinsert=true,sinkmode=UPDATE" --provider redshift --showstacktrace
# add protocol format - postgresql
lingual catalog --schema OUT5 --protocol jdbc --format postgresql --add --provider redshift --showstacktrace
# add table to schema
lingual catalog --schema OUT5 --table OUT5 --stereotype OUT5 --add <URL of redshift database> --protocol jdbc  --format postgresql --provider redshift --showstacktrace
# show table
lingual catalog --schema OUT5 --table OUT5 --show

echo "INSERT INTO \"OUT5\".\"OUT5\" SELECT \"anonymized_id\"  FROM \"IN\".\"IN\";" | lingual shell --showstacktrace --verbose

Joe Posner

May 8, 2015, 12:53:13 PM
to lingua...@googlegroups.com
Redshift's bulk loading defaults to allowing a certain number of lines to be in error without considering the load itself to have failed. If all your records are bad but the total record count is below Redshift's threshold, you would see exactly this behavior. The docs at http://docs.aws.amazon.com/redshift/latest/dg/r_STL_LOAD_ERRORS.html and http://docs.aws.amazon.com/redshift/latest/dg/r_Load_Error_Reference.html have information about how to get Redshift's error reports.

You can also override the default by setting, e.g., redshift.copy.errorcount=10
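For reference, a minimal sketch of that check, run from any SQL client connected to the cluster (column names as documented in the STL_LOAD_ERRORS reference above):

```sql
-- List the most recent load errors; err_reason explains why each row was rejected.
SELECT starttime, filename, line_number, colname, err_code, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 20;
```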

Other considerations:
- The RedshiftTap itself handles setting delimiter characters as needed for the insert and tells Redshift which delimiter to use, so there is no need for the custom non-ASCII delimiter you are setting. If you do want to set a delimiter manually, I'd suggest picking an ASCII character to avoid possible charset issues across systems.
- Try running a more recent version of Hadoop.


--
You received this message because you are subscribed to the Google Groups "Lingual User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lingual-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shekhar Sharma

May 8, 2015, 9:48:57 PM
to lingua...@googlegroups.com
Thanks, Joe, for the reply.

Well, the delimiter (\u0001) is for the data present on HDFS, not for Redshift. I have taken this script directly from the tutorial; the only difference is the delimiter for the data present on HDFS, nothing more.
Using a recent version of Hadoop is out of the question, as this is running in production.

I will check the Redshift error tables for possible errors and will update you.

Andre Kelpe

May 10, 2015, 9:28:44 AM
to lingua...@googlegroups.com
You have to use provider version 2.6. The providers for 3.0.0 are compiled against Cascading 3, while Lingual 1.2 is compiled against Cascading 2.6.

http://conjars.org/repo/cascading/cascading-jdbc-redshift/2.6.0/


Also, is there a reason you are using such an old version of Hadoop? The current production release is 2.6.

- André


Shekhar Sharma

May 11, 2015, 1:18:55 AM
to lingua...@googlegroups.com
Thanks for the info. I will first check the error tables in Redshift, and then try with 2.6.

Shekhar Sharma

May 11, 2015, 3:49:03 AM
to lingua...@googlegroups.com
Well, I have tried with 2.6 and it worked. I was able to upload the data.

The problem I am facing now is with the boolean type. If you look at the schema, there is one column of type boolean, and it is throwing the following error:

ERROR [main] provider.ProviderProxy (ProviderProxy.java:createTap(178)) - factory cascading.jdbc.RedshiftFactory_$$_javassist_2 failed to create tap: cannot map type class java.lang.Boolean to a sql type


I tried giving both boolean and Boolean, but it throws the same error about not being able to map it to a SQL type.

Please advise me.

Regards,
Som 

Andre Kelpe

May 11, 2015, 7:38:07 AM
to lingua...@googlegroups.com
Could you run lingual with --showstacktrace and paste the output here?

- André

Shekhar Sharma

May 11, 2015, 10:09:00 AM
to lingua...@googlegroups.com
How should the boolean datatype be represented? It is not able to map boolean to a SQL type.
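[One possible workaround, a sketch only, not verified against Lingual 1.2: sidestep the boolean-to-SQL mapping by declaring the column as string in the OUT5 stereotype and table definitions, and casting it in the SELECT. The CASE expression and the "bench_mode" column are illustrative assumptions here.]

```sql
-- Hypothetical workaround: cast the boolean to a string in the SELECT so the
-- Redshift sink only has to map a VARCHAR column; assumes the OUT5 stereotype
-- was widened to include "bench_mode" declared as string rather than boolean.
INSERT INTO "OUT5"."OUT5"
SELECT "anonymized_id",
       CASE WHEN "bench_mode" THEN 'true' ELSE 'false' END AS "bench_mode"
FROM "IN"."IN";
```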