Reading a file from HDFS, applying a filter, and iterating over the data


santlal gupta

Apr 6, 2016, 11:49:22 AM
to Lingual User
Hi,

We are exploring Cascading Lingual for the following use case, and it looks like the best fit.

   1. Read Avro data from HDFS, apply a filter to it, and return a handle for viewing the data.

For this I have explored Lingual with JDBC and Lingual in Java with Cascading, and I have built the proofs of concept below.

1. Lingual with JDBC:

  For now I have tried this with a .txt file, because we are still looking for a data provider that supplies the .avro format.

   I found that when reading a file from HDFS I have to complete three steps before I can run the JDBC code.

       a. Create a schema, say LOGS.
   hduser@UbuntuD2:~$ lingual catalog --schema LOGS --add

       b. Create a stereotype, say LOGFILE, that describes the file structure.
   hduser@UbuntuD2:~$ lingual catalog --schema LOGS --stereotype LOGFILE --add --columns id,name --types int,string

       c. Register a table, say LOGS, in the LOGS schema using the LOGFILE stereotype.
   hduser@UbuntuD2:~$ lingual catalog --schema LOGS --table LOGS --stereotype LOGFILE --add /home/hduser/myEmp/emp_data

Only then can I write the following:
    Connection connection = DriverManager.getConnection("jdbc:lingual:hadoop2-mr1;catalog=/user/hduser/;schema=/home/hduser/myEmp");

This approach gives me a ResultSet, which acts as a handle through which I can read the data.
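
For reference, here is a minimal sketch of how the filter and iteration might look on the JDBC side. It reuses the connection URL from above and the driver class cascading.lingual.jdbc.Driver. The class name LingualJdbcSketch, the helper methods, the filter "id" > 10, and the assumption that the table resolves as "LOGS"."LOGS" under this connection are illustrative only and not taken from the attached test code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class LingualJdbcSketch {

    // Hypothetical helper: builds a Lingual JDBC URL in the same shape as the one used above.
    public static String jdbcUrl(String platform, String catalog, String schema) {
        return "jdbc:lingual:" + platform + ";catalog=" + catalog + ";schema=" + schema;
    }

    // Hypothetical helper: builds the filter query. Identifiers are double-quoted
    // on the assumption that unquoted names would not match the lower-case
    // column names declared in the stereotype.
    public static String filterQuery(String schema, String table, int minId) {
        return "select \"id\", \"name\" from \"" + schema + "\".\"" + table
                + "\" where \"id\" > " + minId;
    }

    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        // Load the Lingual JDBC driver (must be on the classpath).
        Class.forName("cascading.lingual.jdbc.Driver");

        String url = jdbcUrl("hadoop2-mr1", "/user/hduser/", "/home/hduser/myEmp");
        // Assumption: the schema and table registered in steps a to c are visible here.
        String sql = filterQuery("LOGS", "LOGS", 10);

        // try-with-resources closes the ResultSet, Statement, and Connection in order.
        try (Connection connection = DriverManager.getConnection(url);
             Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(sql)) {
            while (rows.next()) {
                // Columns are read by position: 1 = id (int), 2 = name (string).
                System.out.println(rows.getInt(1) + "\t" + rows.getString(2));
            }
        }
    }
}
```

The ResultSet here is the handle mentioned above: rows are pulled one at a time with next(), so the caller does not need to open the underlying file itself.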

Can you suggest how to reduce these steps, or is there an alternative to this approach?

2. Lingual in Java with Cascading:
    
   In this case I think the steps above (creating the schema, stereotype, and table) are not required.
   The problem is that after the Cascading flow executes, it writes its output to a file on HDFS, so we have to open that file and read it.
   This takes time. We can open the sink Tap and get a handle on it, but by then the data has already been written to the file.

Can you suggest how to reduce the time taken here?

Thanks
Santlal Gupta

ClusteCascadingTest.java.txt
ClusterJDBCTest.java.txt
emp_data