Connecting Pentaho Data Integration to Hive / Hadoop

Feris Thia

Nov 21, 2010, 6:08:18 PM
to hu...@googlegroups.com, User Group Pentaho Indonesia, Feris Thia
Hi All,

For anyone playing with Hive / Hadoop.

Hopefully this is useful.

Regards,

Feris

Sent to you by Feris Thia via Google Reader:
via johndz by John Dzilvelis on 8/21/10

My latest data integration challenge has been with a new node in my data landscape: a Hadoop/Hive installation. Since PDI has become my favorite hammer for many different tasks, I thought it would be handy to connect to the Hive database via JDBC. With that ability, I can enhance Hive output by including lookups and joins with operational (MySQL) databases.

Unfortunately, I didn't have much luck using standard connections with JDBC and Table Input steps. I suppose this is because the Hive JDBC driver is still in an embryonic stage.

The turning point for my effort was the discovery of the new User Defined Java Class step in Pentaho 4.0 GA. I struggled a bit before getting it to work, but I now have a simple working example that returns the result of a Hive query to the stream. There was quite a bit of late-night thrashing, so excuse the unrefined code.

In summary, the keys to getting the UDJC step to work were:

  • Copy the ${HIVE_HOME}/lib/*.jar and ${HADOOP_HOME}/hadoop-*-core.jar files into the data-integration/libext directory.
  • The test case .ktr file posted at the bottom of this thread on the Pentaho forum was also helpful.
  • Matt Casters' blog post on the UDJC step, for the intro to using this step.
Here is the unrefined code for the UDJC step:



import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
    Object[] inrow = getRow();
    if (inrow == null) {
        setOutputDone();
        return false;
    }

    // Load the Hive JDBC driver; fail the step if the Hive jars are not on the classpath.
    // (Throwing KettleException instead of calling System.exit keeps PDI itself alive.)
    try {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    } catch (Exception e) {
        throw new KettleException("Hive JDBC driver not found", e);
    }

    try {
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Run the aggregate query and push each result row onto the output stream.
        String sql = "select sum(id) from jd_test";
        ResultSet res = stmt.executeQuery(sql);
        Object[] outrow = createOutputRow(inrow, data.outputRowMeta.size());

        while (res.next()) {
            outrow = RowDataUtil.createResizedCopy(outrow, data.outputRowMeta.size());
            long idval = res.getLong(1);
            logBasic("result id: " + idval);
            get(Fields.Out, "id").setValue(outrow, idval);
            putRow(data.outputRowMeta, outrow);
        }
        con.close();
    } catch (SQLException se) {
        throw new KettleException("Hive query failed", se);
    }
    return true;
}
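Before embedding the query in the UDJC step, it can be worth confirming the driver and connection settings with a standalone class, run with the same Hive and Hadoop jars on the classpath. This is only a minimal sketch reusing the driver class, URL, and query from the step above; the host, port, database, and the jd_test table are assumptions carried over from that example and may differ in your environment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSmokeTest {

    // The Hive JDBC driver class used in the UDJC step.
    static final String DRIVER = "org.apache.hadoop.hive.jdbc.HiveDriver";

    // Build a Hive JDBC URL, e.g. jdbc:hive://localhost:10000/default
    static String hiveUrl(String host, int port, String db) {
        return "jdbc:hive://" + host + ":" + port + "/" + db;
    }

    public static void main(String[] args) throws Exception {
        Class.forName(DRIVER); // fails fast if the Hive jars are missing
        Connection con = DriverManager.getConnection(
                hiveUrl("localhost", 10000, "default"), "", "");
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("select sum(id) from jd_test");
        while (res.next()) {
            System.out.println("result id: " + res.getLong(1));
        }
        con.close();
    }
}
```

If this prints the same sum as the Hive CLI, the jars and the Thrift service are wired up correctly and the UDJC step should connect as well.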


