Hi,
We are exploring Cascading Lingual for the following use case, and it looks like the best fit:
1. Read Avro data from HDFS, filter it, and return a handle through which the data can be viewed.
For this I have explored two approaches: Lingual over JDBC, and Lingual in Java with Cascading. I have done the POCs below.
1. Lingual with JDBC:
For now I have tried this with a .txt file, because we are still looking for a data provider that supplies data in .avro format.
I found that, to read a file from HDFS, I have to complete three steps before I can run the JDBC code:
a. Create a schema, say LOGS.
hduser@UbuntuD2:~$ lingual catalog --schema LOGS --add
b. Create a stereotype that describes the file structure, say LOGFILE.
hduser@UbuntuD2:~$ lingual catalog --schema LOGS --stereotype LOGFILE --add --columns id,name --types int,string
c. Register a table, say LOGS, in the LOGS schema using the LOGFILE stereotype.
hduser@UbuntuD2:~$ lingual catalog --schema LOGS --table LOGS --stereotype LOGFILE --add /home/hduser/myEmp/emp_data
Only then can I open the connection as follows:
Connection connection = DriverManager.getConnection("jdbc:lingual:hadoop2-mr1;catalog=/user/hduser/;schema=/home/hduser/myEmp");
This approach gives a ResultSet, which serves as the handle through which I can read the data.
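For reference, here is a minimal sketch of how the ResultSet is read on the JDBC side. It uses only standard java.sql calls and assumes the Lingual JDBC driver jar is on the classpath. The class name, the helper names, the query text (including the quoting of "LOGS"."LOGS" and "id") and the filter value 100 are assumptions made for illustration; they are not taken from a working setup.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper class, for illustration only.
class LingualJdbcSketch {

    // Assembles a Lingual JDBC URL in the same shape as the one used above.
    static String buildUrl(String platform, String catalog, String schema) {
        return "jdbc:lingual:" + platform + ";catalog=" + catalog + ";schema=" + schema;
    }

    // Runs the query and prints the two registered columns (id, name).
    // Returns the number of rows read.
    static int printRows(String url, String sql) throws SQLException {
        int count = 0;
        // try-with-resources closes the ResultSet, Statement and Connection in that order.
        try (Connection connection = DriverManager.getConnection(url);
             Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(sql)) {
            while (rows.next()) {
                System.out.println(rows.getInt(1) + "\t" + rows.getString(2));
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String url = buildUrl("hadoop2-mr1", "/user/hduser/", "/home/hduser/myEmp");
        // Assumed query: identifier quoting and the filter value are illustrative.
        String sql = "select * from \"LOGS\".\"LOGS\" where \"id\" > 100";
        try {
            System.out.println("rows read: " + printRows(url, sql));
        } catch (SQLException e) {
            // Reached when the driver is not on the classpath or the query fails.
            System.err.println("query failed: " + e.getMessage());
        }
    }
}
```

The filter is pushed into the SQL text here, so only matching rows come back through the ResultSet.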
Can you suggest how to reduce these steps, or is there an alternative?
2. Lingual in Java with Cascading:
In this case I think the steps above (creating the schema, stereotype and table) are not required.
The problem is that, once the Cascading flow has executed, the output is written to a file on HDFS, so we have to open that file and read it.
This takes time. We can open the sinkTap and get a handle on it, but by then the data has already been written to the file.
Can you suggest how to reduce the time taken here?
Thanks
Santlal Gupta