Limit on number of files for external table definition

Daniel Wang

Jan 16, 2018, 5:40:33 PM
to Greenplum Users
Hello,

I'm trying to understand the impact of the limit on the number of file-protocol external files that can be specified for a particular external table.

This seems to be the best reference:
http://engineering.pivotal.io/post/the_file_protocol_for_external_tables_in_greenplum_database/

We can view the number of external files with pg_max_external_files ( https://gpdb.docs.pivotal.io/500/ref_guide/system_catalogs/pg_max_external_files.html ).

-- Is this limit configurable (i.e., can we choose 100 files per segment server instead of just 1)?
-- It seems to be a limit enforced during external table creation. Is it per-external-table, global, or per-query (i.e., if I run several select * from <exttable> where ... statements, will they execute concurrently)?

Happy to browse through the code if no one knows the answers, but would really like a starting point in this relatively large code base.

Cheers,
-Daniel

Ivan Novick

Jan 16, 2018, 6:00:04 PM
to Daniel Wang, Greenplum Users
Most users use the gpfdist, s3, or gphdfs protocol.

Are you using the 'file' protocol?

What problem are you trying to solve?


--
Ivan Novick, Product Manager Pivotal Greenplum

Jasper Li

Jan 16, 2018, 7:35:30 PM
to Ivan Novick, Daniel Wang, Greenplum Users
Also, if you do need a file on each segment, would 'COPY ... ON SEGMENT' satisfy your requirement?
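For reference, a rough sketch of what that could look like (the table name and paths are made up; <SEGID> is replaced with each segment's ID at run time):

-- Hypothetical example: each segment reads its own local file in parallel.
COPY my_table FROM '/data/staging/part<SEGID>.dat' ON SEGMENT;
-- And the unload direction, each segment writing its own file:
COPY my_table TO '/data/staging/part<SEGID>.dat' ON SEGMENT;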

Best wishes
Jasper
--

Ivan Novick

Jan 16, 2018, 8:15:56 PM
to Jasper Li, Daniel Wang, Greenplum Users
I think we can help more if we know more about what you are trying to do, Daniel.

Daniel Wang

Jan 16, 2018, 9:29:11 PM
to Greenplum Users, ja...@pivotal.io, dan...@awakenetworks.com
I'd like to do concurrent bulk-loads into gpdb, each using all segment servers; right now we only do one at a time. We push rows into named pipes on the segment servers' filesystems and run an INSERT INTO ... SELECT FROM on an external table defined on those pipes (we don't write to real files because we want to avoid disk contention).
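For concreteness, a minimal sketch of the kind of setup I mean (the host names, pipe path, and columns below are all made up):

-- On each segment host (shell), the loader first creates the pipe:
--   mkfifo /data/staging/feed.pipe
CREATE EXTERNAL TABLE staging_ext (id int, payload text)
LOCATION (
    'file://seghost1/data/staging/feed.pipe',
    'file://seghost2/data/staging/feed.pipe'   -- one URI per segment host
)
FORMAT 'TEXT' (DELIMITER '|');

-- While external writers push rows into the pipes:
INSERT INTO target_table SELECT * FROM staging_ext;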

I also want to explore the possibility of just writing those to disk instead of flowing them into a gpdb-native table, and then having multiple queries select from them through external tables. If the limit is global, those queries would block on each other (hopefully not deadlock), but if it's per-query or per-table, then perhaps we're fine.

COPY on segment might work for the bulk-loading case, but I haven't explored it, as this code pre-dates the on-segment implementation. Would an N-way copy-on-segment perform comparably to a single INSERT INTO ... SELECT FROM an external table with N files? My instinct is that the copy would be slower due to N transactions instead of 1.

gpfdist isn't available in open-source Greenplum, is it?

Cheers,
-Daniel


Jimmy Yih

Jan 16, 2018, 9:38:08 PM
to Daniel Wang, Greenplum Users, Jasper Li
Open-source Greenplum should have gpfdist available. It's enabled to build by default in the configure script.

Jasper Li

Jan 16, 2018, 9:38:19 PM
to Daniel Wang, Greenplum Users
Hi Daniel,

gpfdist is available in the open-source version, and it is your best choice for bulk-loading data.

Are you having trouble compiling it?

Best wishes
Jasper
--

Ming Li

Jan 16, 2018, 9:55:38 PM
to Daniel Wang, Greenplum Users
pg_max_external_files is a view, which is defined as:

CREATE VIEW pg_max_external_files AS
    SELECT   address::name as hostname, count(*) as maxfiles
    FROM     gp_segment_configuration
    WHERE    content >= 0
    AND      role='p'
    GROUP BY address;

The maximum number of external table files allowed per segment host when using the file protocol is the same as the number of primary segments on each segment host. It is global and cannot be changed by a GUC.
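For example, on a hypothetical cluster with two segment hosts and four primary segments per host, the view would return something like:

SELECT * FROM pg_max_external_files;
--  hostname | maxfiles
-- ----------+----------
--  sdw1     |        4
--  sdw2     |        4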



Daniel Wang

Jan 16, 2018, 10:24:13 PM
to Greenplum Users, dan...@awakenetworks.com
Hi,


On Tuesday, January 16, 2018 at 6:55:38 PM UTC-8, Ming Li wrote:
The maximum number of external table files allowed per segment host when using the file protocol is the same as the number of primary segments on each segment host. It is global and cannot be changed by a GUC.

Great. As long as this limit only applies to the definition of the external table, and not across the queries that use those tables, I'm fine with that. That is, if I have 5 segment servers, I can define an external table using 5 files (file://...) and then select from that table (or from several external tables defined on different sets of 5 files) with an arbitrary number of concurrent queries. Is that the case?

Cheers,
-Daniel

Ming Li

Jan 16, 2018, 10:33:40 PM
to Daniel Wang, Greenplum Users
Yes, exactly.

BTW, regarding your comment:
Would an N-way copy-on-segment perform comparably to a single INSERT INTO ... SELECT FROM an external table with N files? My instinct is that the copy would be slower due to N transactions instead of 1.

"COPY ... ON SEGMENT" also runs in one global transaction, and its performance should be better than an INSERT INTO ... SELECT FROM an external table, provided the data is spread evenly across all segment nodes.


--

Daniel Wang

Jan 16, 2018, 10:47:16 PM
to Greenplum Users, dan...@awakenetworks.com
Hi,


On Tuesday, January 16, 2018 at 6:38:19 PM UTC-8, Jasper Li wrote:
Hi Daniel,

gpfdist is available in the open-source version, and it is your best choice for bulk-loading data.

A co-worker mentioned that it wasn't in our build, but I just checked and it seems to be there. I wasn't thrilled about having another running service that needs startup/shutdown and crash-recovery management, but if it works....

gpfdist looks like it wants to stream in 32kB chunks. How does it handle big rows (e.g., >500kB)?
Also, it looks like the gpfdist protocol wants to stream sequentially, so non-sequential reading is not supported, right? (Obviously, this is fine for row-oriented text or CSV inputs, but it would die on a less linear file format.)

Anyway, thanks to everyone in the thread for trying to help!
-Daniel

Jasper Li

Jan 16, 2018, 11:54:26 PM
to Daniel Wang, Greenplum Users
Hi Daniel,

gpfdist has a -m parameter to change the maximum line (row) length; its maximum value is 256MB.
gpfdist must read a single file sequentially from beginning to end. If you have many files and want to serve them in parallel, you can start several gpfdist instances and list multiple gpfdist addresses in the LOCATION clause.
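A rough sketch of that pattern (the hosts, ports, directory, and file names below are made up; -m takes bytes, so 268435456 = 256MB):

-- On the ETL host (shell), serve the same directory from two ports:
--   gpfdist -d /data/load -p 8081 -m 268435456 &
--   gpfdist -d /data/load -p 8082 -m 268435456 &
CREATE EXTERNAL TABLE lines_ext (id int, payload text)
LOCATION (
    'gpfdist://etlhost:8081/*.csv',
    'gpfdist://etlhost:8082/*.csv'
)
FORMAT 'CSV';

INSERT INTO target_table SELECT * FROM lines_ext;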

Of course, you can choose the solution that best fits your requirements. gpfdist is intended to be the official ETL solution for Greenplum. If there is anything gpfdist can't solve for you, we are very happy to hear your feedback.

Best wishes
Jasper
--

Keaton Adams

Jan 17, 2018, 8:02:37 AM
to Greenplum Users
Jasper posted a technical article on the Greenplum Blog that has more details on the inner workings of gpfdist, with suggestions for future improvements to the utility:

http://greenplum.org/introduction-writable-gpfdist/#more-2985

There are many good articles posted there as a resource for Greenplum Database:


Here are links to two example scripts showing how to use gpfdist. If your data is broken up into multiple files across multiple directories, for example, it is possible to create an external table that points to several gpfdist ETL processes (each running on its own port), so that a SELECT FROM the external table uses all of the gpfdist instances in parallel to read (or write) the data in MPP fashion.


This is the utility that many third-party ETL tools supporting Greenplum as a source/target have integrated for fast data loading/unloading with GPDB. The gpfdist docs can be found here: http://greenplum.org/docs/530/admin_guide/external/g-using-the-greenplum-parallel-file-server--gpfdist-.html

Thanks,

Keaton 