Hi Muhammad, here’re answers to your questions:
----------------
Assuming you have 32 primary segments (8 per host), each PXF query will use 8 threads in each JVM (we have 1 PXF JVM per host). This is the maximum concurrency you can achieve with the query; it greatly depends on how parallelizable your data set is.
With PXF JDBC profile that you need to use to access a MySQL table, if you do NOT use PARTITION_BY custom option in your external table LOCATION clause (see
https://docs.vmware.com/en/VMware-Tanzu-Greenplum-Platform-Extension-Framework/6.6/tanzu-greenplum-platform-extension-framework/GUID-jdbc_pxf.html#jdbc-custom-options-4 ), then PXF will consider your dataset non-parallelizable and the whole data set will be retrieved by only 1 GP segment with only 1 thread (out of 32 potential) over 1 JDBC connection in 1 JVM. This will be the least performant way to get your data.
If you manage to identify a partition column (such as date, for example) and provide the appropriate options in the LOCATION clause, you might achieve higher concurrency, where each PXF thread dedicated to the query will retrieve one or more partitions (up to 32 total concurrent / 8 concurrent per JVM). Please note these are “virtual partitions” for your dataset and have nothing to do with database table partitions. If you end up with more that 32 partitions, some of the threads will get assigned more that 1 partition to process. A thread would process these partitions sequentially (but still concurrently with other running threads).
2: let's say 10 concurrent queries are running which are accessing above external table. Are they going be completed in a sequential manner? First come first basis? How memory is assigned to a single external table query?
----------------
All queries are running concurrently, but you need to think about other bottlenecks, such as JDBC connections. PXF uses JDBC connection pool, if you achieve full parallelism (see above) with 8 threads per JVM per query, then 10 concurrent queries would consume 80 threads and would require 80 JDBC connections from a single JVM. If your pool has less number of connections configured (such as the default of 15) (see https://docs.vmware.com/en/VMware-Tanzu-Greenplum-Platform-Extension-Framework/6.6/tanzu-greenplum-platform-extension-framework/GUID-jdbc_cfg.html#about-jdbc-connection-pooling-8 ), then threads will not run at full concurrency since they will be waiting for available JDBC connections to complete their requests.
There is no special memory management other than the standard JVM heap allocation / garbage collection on “as needed” basis. You should make sure the memory you give to your JVM is sufficient to support the level of concurrency your workload
exhibits.
3: What is the recommended PXF_JVM_OPTS value for accessing above dataset?
----------------
As mentioned above, this will depend on level of concurrency of your workload, it is not possible to estimate this generically. PXF is streaming data, so the rough overall formula is like this:
Total JVM Memory = (Memory for streaming 1 data partition) * (segments per host) * (num queries)
Assuming you’re only running 1 type of query which requires 100 MB per data partition (an arbitrary number for this example only) and you have 8 segments per host and 10 concurrent queries :
MEM = 100 MB * 8 * 10 = 8 GB
The biggest challenge is to determine “Memory for streaming 1 data partition” value that can only be done empirically for your workload.
4: Does every query will take PXF_JVM_OPTS value or is it a global constant value?
----------------
This is a global value for the whole PXF JVM that serves primary GPDB segments on a given host.
5: How pxf.max.threads comes into play while accessing external tables?
----------------
I’d recommend not changing the default parameter, unless necessary. It comes to action in 2 ways:
For the write case, PXF uses Tomcat threads themselves for data processing and *GENERALLY* does not utilize worker threads. Tomcat threads do not become available until after the data is written to the external system. However, for JDBC profile, PXF can create a writer thread pool per query per segment if the external table defines POOL_SIZE option. This provides for more concurrency when writing data at the expense of extra threads and potentially more JDBC connections required to support it.
FINALLY, for MySQL SPECIFICALLY, you need to pay attention to the JDBC fetch size. See https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-implementation-notes.html for more details. PXF ships with the setting that allows streaming of data row-by-row, to use the minimal amount of PXF JVM memory. We try avoiding fetching the whole ResultSet into PXF memory (default behavior of MySQL JDBC driver). But this means performance optimization of row pre-fetching is not utilized. You can set the value to another number (typically a 1000), but you will need to explicitly specify in your JDBC URL to use MySQL cursor fetch feature.
Please let us know if you have more questions.
--
Thanks,
Alexander Denissov
From:
muhammad ali <engmuha...@gmail.com>
Date: Monday, April 10, 2023 at 7:06 PM
To: Greenplum Users <gpdb-...@greenplum.org>
Subject: [gpdb-users] Platform Extension Framework configuration
!! External Email |
--
You received this message because you are subscribed to the Google Groups "Greenplum Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
gpdb-users+...@greenplum.org.
To view this discussion on the web visit
https://groups.google.com/a/greenplum.org/d/msgid/gpdb-users/dd656c72-90cc-4f88-ae9f-af50abe748a7n%40greenplum.org.
!! External Email: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender. |