Hi,
We have the Hortonworks Hadoop distribution and Greenplum.
Data is processed and stored in Hadoop in Avro format.
Folder structure on HDFS:
/year/day/<15-minute window>/datafile
The datafile folder holds data in multiple formats, e.g.:
Account.avro
customer.avro
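To make the layout concrete, the 15-minute partition path can be derived from a timestamp. A minimal sketch in Python, assuming a hypothetical base path, a day-of-year folder, and an HHMM window-start folder name (adjust to the real naming scheme, which is not spelled out above):

```python
from datetime import datetime

def partition_path(ts, base="/data"):
    """Map a timestamp to its 15-minute HDFS partition folder.
    The base path and folder naming (day-of-year, HHMM window start)
    are assumptions; adapt to the actual /year/day/<15 min>/datafile layout.
    """
    window_start = ts.minute - (ts.minute % 15)  # floor to the 15-minute boundary
    return "{base}/{y}/{d:03d}/{h:02d}{m:02d}/datafile".format(
        base=base, y=ts.year, d=ts.timetuple().tm_yday,
        h=ts.hour, m=window_start)

print(partition_path(datetime(2016, 3, 7, 10, 22)))
# -> /data/2016/067/1015/datafile
```

A scheduler (cron, Oozie, etc.) could call such a helper every 15 minutes to resolve the folder for the window it is about to load.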
Greenplum pulls the data from the datafile folder using the gphdfs protocol.
The external table points to the year/* folder...
The requirement is to pull the data every 15 minutes. The data volume in each 15-minute window is also very small (~2000 records).
We had decided to point the external table at the top folder, to avoid re-creating the external table every cycle and to avoid bloating the catalog tables.
We pull the data into a Greenplum stage table and then delete the rows we don't need.
Now we have a performance issue: sometimes loading a small table of ~1000 rows takes around 2 minutes.
- Reason:
- We are reading from the top folder instead of only the folder that holds the data we need, so we pull unnecessary data every 15 minutes.
- If we point the external table at the specific 15-minute folder, the load is very fast.
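If re-creating the external table each cycle turns out to be acceptable after all, the load job can generate the DDL for the current window. A minimal sketch of what "dynamically pointing" could look like, with hypothetical table name, columns, and namenode address (assumption: gphdfs external tables cannot change LOCATION via ALTER, so the table is dropped and re-created per cycle, and the catalog bloat concern above still applies):

```python
def stage_table_ddl(location, table="ext_account_stage"):
    """Build DDL that re-points a readable external table at a single
    15-minute HDFS folder. All names (table, columns, namenode host)
    are placeholders, not the real schema. The table is dropped and
    re-created each cycle; routine catalog maintenance (VACUUM on the
    system catalogs) would be needed to counter the resulting bloat.
    """
    return (
        "DROP EXTERNAL TABLE IF EXISTS {t};\n"
        "CREATE EXTERNAL TABLE {t} (acct_id int, acct_name text)\n"
        "LOCATION ('gphdfs://namenode:8020{loc}')\n"
        "FORMAT 'AVRO';"
    ).format(t=table, loc=location)

print(stage_table_ddl("/data/2016/067/1015/datafile"))
```

The generated statements would then be run through psql or the driver of your choice at the start of each 15-minute load.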
Questions:
1. What would be the best solution in this scenario?
2. How can we dynamically point the external table to the right folder?
Thanks
Suraj