Greenplum - HDFS - Best solution needed


suraj birla

Apr 28, 2017, 6:42:12 PM
to gpdb-...@greenplum.org
Hi,

We have a Hortonworks Hadoop distribution and Greenplum.
Data is processed and stored in Hadoop in Avro format.
The folder structure on HDFS is:

/year/day/15-minutes/datafile

The datafile folder contains data files for multiple tables, e.g.:
Account.avro
customer.avro

Greenplum pulls the data from the datafile folder using the gphdfs protocol.

The external table points to the year/* folder.
The requirement is to pull the data every 15 minutes. The volume of data in any 15-minute window is also very small (~2,000 records).

We decided to point the external table at the top folder to avoid re-creating the external table each cycle and to avoid bloating the catalog tables.

We pull the data into a Greenplum stage table and then delete the data that is not needed.

Now we have a performance issue. Sometimes a small table of 1,000 rows takes around 2 minutes to load.
 - Reason: we are reading from the top folder instead of the specific folder we need data from, which causes us to pull unnecessary data every 15 minutes.

If we point to the 15-minute folder, the load is very fast.

Questions:
1. What would be the best solution in this scenario?
2. How can we dynamically point to the right folder?


Thanks
Suraj






Ivan Novick

Apr 28, 2017, 9:44:59 PM
to suraj birla, Greenplum Users
Can you recreate the external table before each load, pointing at the optimized folder? Just a thought.
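As a minimal sketch of that idea: each load cycle, compute the current 15-minute folder and regenerate the external table DDL to point at it. The path layout follows the one described in this thread; the ext_/stage_ naming convention and the namenode address are assumptions.

```python
from datetime import datetime

def quarter_hour_folder(ts: datetime) -> str:
    """Round a timestamp down to its 15-minute bucket and build the
    /year/day/15-minute HDFS path from the thread (layout is assumed)."""
    minute = (ts.minute // 15) * 15
    return f"/{ts.year}/{ts.strftime('%m-%d')}/{ts.hour:02d}{minute:02d}"

def external_table_ddl(table: str, folder: str,
                       namenode: str = "namenode:8020") -> str:
    """Emit DROP/CREATE statements for a gphdfs external table aimed at
    a single 15-minute folder instead of year/*."""
    return (
        f"DROP EXTERNAL TABLE IF EXISTS ext_{table};\n"
        f"CREATE EXTERNAL TABLE ext_{table} (LIKE stage_{table})\n"
        f"LOCATION ('gphdfs://{namenode}{folder}/{table}.avro')\n"
        f"FORMAT 'AVRO';"
    )

# Example: generate the DDL for the 18:42 cycle (bucketed to 18:30).
print(external_table_ddl("customer",
                         quarter_hour_folder(datetime(2017, 4, 28, 18, 42))))
```

The generated statements could then be fed to psql from whatever scheduler drives the 15-minute loads.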

--
You received this message because you are subscribed to the Google Groups "Greenplum Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gpdb-users+unsubscribe@greenplum.org.
To post to this group, send email to gpdb-...@greenplum.org.
Visit this group at https://groups.google.com/a/greenplum.org/group/gpdb-users/.
For more options, visit https://groups.google.com/a/greenplum.org/d/optout.



--
Ivan Novick
Product Manager Pivotal Greenplum

Jon Roberts

May 1, 2017, 11:28:55 AM
to Ivan Novick, suraj birla, Greenplum Users
I've seen many customers create and drop hundreds, if not 1,000+, tables every single day during their ETL processing and be fine with catalog bloat. You do need to vacuum the catalog tables regularly, and I also recommend running the reindexdb utility with the -s option, because those indexes can also get bloated.
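A sketch of how that routine maintenance might be scripted, e.g. from a nightly cron job. The specific catalog tables vacuumed here are illustrative; reindexdb -s restricts the reindex to system catalogs.

```python
def catalog_maintenance_cmds(dbname: str) -> list[str]:
    """Build the shell commands implied by the advice above: VACUUM
    ANALYZE on (a sample of) catalog tables, then reindex the system
    catalogs. The table list is illustrative, not exhaustive."""
    vacuum_targets = [
        "pg_catalog.pg_class",
        "pg_catalog.pg_attribute",
        "pg_catalog.pg_type",
    ]
    cmds = [f'psql -d {dbname} -c "VACUUM ANALYZE {t};"'
            for t in vacuum_targets]
    cmds.append(f"reindexdb -s -d {dbname}")
    return cmds

for cmd in catalog_maintenance_cmds("mydb"):
    print(cmd)
```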

Jon Roberts

Scott Kahler

May 1, 2017, 11:59:03 AM
to Jon Roberts, Ivan Novick, suraj birla, Greenplum Users
If you are talking about pulling 2,000 records, which is a relatively small amount of data, an option would be to write a web external table that executes a program containing the logic you want to pull the data.
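A rough sketch of what such a table might look like, generated the same way as the gphdfs DDL. Here 'dump_avro.sh' is a hypothetical wrapper script that reads the Avro file from HDFS and writes pipe-delimited text to stdout; the ext_web_/stage_ naming is likewise illustrative.

```python
def web_external_ddl(table: str, folder: str) -> str:
    """Build a CREATE EXTERNAL WEB TABLE that EXECUTEs a program on a
    single segment. dump_avro.sh is a hypothetical Avro-to-text
    converter, not a real tool."""
    return (
        f"CREATE EXTERNAL WEB TABLE ext_web_{table} (LIKE stage_{table})\n"
        f"EXECUTE '/usr/local/bin/dump_avro.sh {folder}/{table}.avro' ON 1\n"
        f"FORMAT 'TEXT' (DELIMITER '|');"
    )

print(web_external_ddl("Account", "/2017/04-28/1830"))
```

For ~2,000 records the single-segment EXECUTE avoids spinning up a JVM per segment the way gphdfs does.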

Scott Kahler | Pivotal, Greenplum Product Management  | ska...@pivotal.io | 816.237.0610

suraj birla

May 1, 2017, 1:00:05 PM
to Scott Kahler, Jon Roberts, Ivan Novick, Greenplum Users
Scott, we are using the gphdfs protocol to pull the data from Hadoop and taking advantage of parallelism. For a web external table we would have to write custom code, and it runs on the master only.

We have 60 tables that need to be refreshed every 15 minutes, but not all of them actually get updated each cycle.

Based on the solutions provided by Ivan and Jon, our next challenges would be:

- How to find the optimized folder? I don't have a way to know which folders exist on Hadoop or which folder GP last pulled data from.

- Not all folders will have files for all 62 tables, since not all tables are updated every 15 minutes in the source system. When we try to read a file from Hadoop and the file is not present, we get an error. How do we trap the error and ignore it?
Exception in thread "main" java.io.IOException: there is no file in the path you given, PATH : /KafkaPipeline/gpExport/tableData/2017*/*/Pricebook2_KafkaBatch_*.avro

- Lastly, when we pull data for 5 tables at a time, we get:
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007ff775750000, 65536, 1) failed; error='Cannot allocate memory' (errno=12)


Scott Kahler

May 2, 2017, 10:43:03 AM
to Greenplum Users, ska...@pivotal.io, jrob...@pivotal.io, ino...@pivotal.io
Okay, just offering up that in cases where you are pulling in a relatively small number of records, and may only be going after a block or two in HDFS, using gphdfs may be overkill. Going after that data with a method that has less cluster overhead may yield an advantage.

You will need to take a look at how you are allocating memory to GPDB. I believe the JVMs you spin up exist outside the memory footprint allocated to the gpdb/postgres processes themselves. So if your settings statically allocate all of the available memory to the GPDB processes, you don't have much left over for the JVMs.