Platform Extension Framework configuration

401 views
Skip to first unread message

muhammad ali

unread,
Apr 10, 2023, 10:06:18 PM4/10/23
to Greenplum Users
Hello Experts, 

let's say I have a 6 node cluster
1 master, 1 standby master, 4 segment nodes with 8 segments(total 32)

My external table resides in mysql which has a dataset of around 10GB with index on date column. I want to ask the following questions:

Following variables are mostly changed for PXF

PXF_JVM_OPTS="-Xmx3g -Xms3g"
pxf.max.threads=100


1: As per documentation:

"The PXF Service process running on a segment host allocates a worker thread for each segment instance on the host that participates in a query against an external table."

when accessing above external table, how it is going to work? I want to understand high level steps if possible.

2: let's say 10 concurrent queries are running which are accessing above external table. Are they going be completed in a sequential manner? First come first basis? How memory is assigned to a single external table query?

3: What is the recommended PXF_JVM_OPTS value for accessing above dataset?

4: Does every query will take PXF_JVM_OPTS value or is it a global constant value?

5: How pxf.max.threads comes into play while accessing external tables?


Thanks and Regards,
Muhammad Ali

Alexander Denissov

unread,
Apr 11, 2023, 4:20:37 PM4/11/23
to muhammad ali, Greenplum Users

Hi Muhammad, here’re answers to your questions:

  1. when accessing above external table, how it is going to work? I want to understand high level steps if possible.

----------------
Assuming you have 32 primary segments (8 per host), each PXF query will use 8 threads in each JVM (we have 1 PXF JVM per host). This is the maximum concurrency you can achieve with the query; it greatly depends on how parallelizable your data set is.

 

With PXF JDBC profile that you need to use to access a MySQL table, if you do NOT use PARTITION_BY custom option in your external table LOCATION clause (see

https://docs.vmware.com/en/VMware-Tanzu-Greenplum-Platform-Extension-Framework/6.6/tanzu-greenplum-platform-extension-framework/GUID-jdbc_pxf.html#jdbc-custom-options-4  ), then PXF will consider your dataset non-parallelizable and the whole data set will be retrieved by only 1 GP segment with only 1 thread (out of 32 potential) over 1 JDBC connection in 1 JVM. This will be the least performant way to get your data.

 

If you manage to identify a partition column (such as date, for example) and provide the appropriate options in the LOCATION clause, you might achieve higher concurrency, where each PXF thread dedicated to the query will retrieve one or more partitions (up to 32 total concurrent / 8 concurrent per JVM). Please note these are “virtual partitions” for your dataset and have nothing to do with database table partitions. If you end up with more that 32 partitions, some of the threads will get assigned more that 1 partition to process. A thread would process these partitions sequentially (but still concurrently with other running threads).

 

2: let's say 10 concurrent queries are running which are accessing above external table. Are they going be completed in a sequential manner? First come first basis? How memory is assigned to a single external table query?

----------------

All queries are running concurrently, but you need to think about other bottlenecks, such as JDBC connections. PXF uses JDBC connection pool, if you achieve full parallelism (see above) with 8 threads per JVM per query, then 10 concurrent queries would consume 80 threads and would require 80 JDBC connections from a single JVM. If your pool has less number of connections configured (such as the default of 15) (see https://docs.vmware.com/en/VMware-Tanzu-Greenplum-Platform-Extension-Framework/6.6/tanzu-greenplum-platform-extension-framework/GUID-jdbc_cfg.html#about-jdbc-connection-pooling-8 ), then threads will not run at full concurrency since they will be waiting for available JDBC connections to complete their requests.

 

There is no special memory management other than the standard JVM heap allocation / garbage collection on “as needed” basis. You should make sure the memory you give to your JVM is sufficient to support the level of concurrency your workload exhibits.

3: What is the recommended PXF_JVM_OPTS value for accessing above dataset?

----------------

As mentioned above, this will depend on level of concurrency of your workload, it is not possible to estimate this generically. PXF is streaming data, so the rough overall formula is like this:

Total JVM Memory = (Memory for streaming 1 data partition) * (segments per host) * (num queries)

 

Assuming you’re only running 1 type of query which requires 100 MB per data partition (an arbitrary number for this example only) and you have 8 segments per host and 10 concurrent queries :

 

MEM = 100 MB * 8 * 10 = 8 GB

 

The biggest challenge is to determine “Memory for streaming 1 data partition” value that can only be done empirically for your workload.

 

4: Does every query will take PXF_JVM_OPTS value or is it a global constant value?

----------------

This is a global value for the whole PXF JVM that serves primary GPDB segments on a given host.

 

5: How pxf.max.threads comes into play while accessing external tables?

----------------

I’d recommend not changing the default parameter, unless necessary. It comes to action in 2 ways:

 

  1. it defines how many Tomcat threads are created in PXF to process requests from the primary GPDB segments on a given host. In the example above 10 concurrent queries with 8 segments per host would require 80 threads. But 100 concurrent queries with 8 segments per host would require 800 threads, so the parameter will need to be changed to support that higher workload.

 

  1. It also defines (by default) the maximum size of the worker thread pool used to process read requests (reading data from an external system into Greenplum). This parameter can be changed individually using “pxf.task.pool.max-size” property in “pxf-application.properties” file, but by default it takes on the same value as “pxf.max.threads”. When PXF serves read requests, it uses a Tomcat thread to pick a worker thread that is then used for asynchronous data streaming back to Greenplum. The Tomcat thread then becomes available for processing other requests while the worker thread serves data back to Greenplum.

 

For the write case, PXF uses Tomcat threads themselves for data processing and *GENERALLY* does not utilize worker threads. Tomcat threads do not become available until after the data is written to the external system. However, for JDBC profile, PXF can create a writer thread pool per query per segment if the external table defines POOL_SIZE option. This provides for more concurrency when writing data at the expense of extra threads and potentially more JDBC connections required to support it.

 

FINALLY, for MySQL SPECIFICALLY, you need to pay attention to the JDBC fetch size. See https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-implementation-notes.html for more details. PXF ships with the setting that allows streaming of data row-by-row, to use the minimal amount of PXF JVM memory. We try avoiding fetching the whole ResultSet into PXF memory (default behavior of MySQL JDBC driver). But this means performance optimization of row pre-fetching is not utilized. You can set the value to another number (typically a 1000), but you will need to explicitly specify in your JDBC URL to use MySQL cursor fetch feature.

 

Please let us know if you have more questions.

--

Thanks,

Alexander Denissov

 

 

From: muhammad ali <engmuha...@gmail.com>
Date: Monday, April 10, 2023 at 7:06 PM
To: Greenplum Users <gpdb-...@greenplum.org>
Subject: [gpdb-users] Platform Extension Framework configuration

!! External Email

--
You received this message because you are subscribed to the Google Groups "Greenplum Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gpdb-users+...@greenplum.org.
To view this discussion on the web visit https://groups.google.com/a/greenplum.org/d/msgid/gpdb-users/dd656c72-90cc-4f88-ae9f-af50abe748a7n%40greenplum.org.

 

!! External Email: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender.

 

Reply all
Reply to author
Forward
0 new messages