Dear all,
We are working with a Greenplum 6.19.0 cluster running on RHEL7. The interconnect is 40G.
We have 1 master, 1 standby, and 4 datanodes with 20 primary segments (5 per datanode) and 20 mirror segments. The master and standby each have 256GB RAM and 80 cores; each datanode has 512GB RAM and 80 cores.
From time to time our queries hang and we find errors of this type in the log:
2023-02-08 14:20:50.697116 CET,"postgres","pgeuclid_ops",p220860,th876542080,"192.168.118.46","29252",2023-02-07 20:55:09 CET,0,con66,cmd1412,seg-1,,,,sx1,"ERROR","58M01","failed to acquire resources on one or more segments","could not connect to server: Connection timed out
Is the server running on host ""192.168.118.21"" and accepting
TCP/IP connections on port 45000?
When we run gpcheckperf on the master and datanodes, it reports:
/home/gpadmin/software/bin/gpcheckperf -f hostfile_all -r N -d /tmp
-------------------
-- NETPERF TEST
-------------------
====================
== RESULT 2023-02-09T12:21:42.182903
====================
Netperf bisection bandwidth test
easgpopsmaster01.scieuc.lan -> easgpopsnode01.scieuc.lan = 2767.440000
easgpopsnode02.scieuc.lan -> easgpopsnode03.scieuc.lan = 2194.770000
easgpopsnode04.scieuc.lan -> easgpopsmaster01.scieuc.lan = 2401.130000
easgpopsnode01.scieuc.lan -> easgpopsmaster01.scieuc.lan = 2196.900000
easgpopsnode03.scieuc.lan -> easgpopsnode02.scieuc.lan = 2200.570000
easgpopsmaster01.scieuc.lan -> easgpopsnode04.scieuc.lan = 2796.790000
Summary:
sum = 14557.60 MB/sec
min = 2194.77 MB/sec
max = 2796.79 MB/sec
avg = 2426.27 MB/sec
median = 2401.13 MB/sec
[Warning] connection between easgpopsnode02.scieuc.lan and easgpopsnode03.scieuc.lan is no good
[Warning] connection between easgpopsnode04.scieuc.lan and easgpopsmaster01.scieuc.lan is no good
[Warning] connection between easgpopsnode01.scieuc.lan and easgpopsmaster01.scieuc.lan is no good
[Warning] connection between easgpopsnode03.scieuc.lan and easgpopsnode02.scieuc.lan is no good
What exactly is not good?
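To put the numbers in context, the summary can be recomputed from the six reported rates. The Python sketch below does that and expresses each link as a percentage of the fastest link and of a nominal line rate. The line rate is an assumption: it takes "40G" to mean 40 Gbit/s, that is 5000 MB/sec in decimal megabytes with protocol overhead ignored. The helper name `percent_of` is illustrative.

```python
from statistics import mean, median_high

# Rates in MB/sec, copied from the gpcheckperf output above (domain suffix dropped).
rates = {
    ("easgpopsmaster01", "easgpopsnode01"): 2767.44,
    ("easgpopsnode02", "easgpopsnode03"): 2194.77,
    ("easgpopsnode04", "easgpopsmaster01"): 2401.13,
    ("easgpopsnode01", "easgpopsmaster01"): 2196.90,
    ("easgpopsnode03", "easgpopsnode02"): 2200.57,
    ("easgpopsmaster01", "easgpopsnode04"): 2796.79,
}

values = list(rates.values())
fastest = max(values)

# Assumption: 40 Gbit/s nominal link speed, 8 bits per byte, decimal megabytes.
LINE_RATE_MB = 40_000 / 8


def percent_of(value, reference):
    """Return value as a percentage of reference, rounded to one decimal."""
    return round(100 * value / reference, 1)


print(f"sum    = {sum(values):.2f} MB/sec")
print(f"min    = {min(values):.2f} MB/sec")
print(f"max    = {fastest:.2f} MB/sec")
print(f"avg    = {mean(values):.2f} MB/sec")
# The reported median matches the upper of the two middle values.
print(f"median = {median_high(values):.2f} MB/sec")

for (src, dst), rate in rates.items():
    print(
        f"{src} -> {dst}: {percent_of(rate, fastest)}% of fastest, "
        f"{percent_of(rate, LINE_RATE_MB)}% of nominal line rate"
    )
```

Under these assumptions, the four warned links measure roughly 78% to 86% of the fastest link, while the two links without a warning are within about 1% of each other. All six fall between roughly 44% and 56% of the nominal line rate. The output does not state the rule gpcheckperf uses to decide that a connection is "no good", so this grouping is an observation rather than its documented threshold.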
We are using default parameters in the configuration except for:
in master:
log_statement=all
gp_contentid=-1
gp_vmem_protect_limit=84760
gp_interconnect_type=tcp
gp_interconnect_tcp_listener_backlog=4096
#max_connections=500
max_connections=500
max_prepared_transactions=500
#gp_fts_probe_timeout=500
gp_fts_probe_timeout=500
shared_buffers=64GB
effective_cache_size=192GB
maintenance_work_mem=2GB
checkpoint_completion_target=0.9
wal_buffers=16MB
random_page_cost=1.1
effective_io_concurrency=200
max_worker_processes=80
in datanodes:
gp_contentid=0
gp_vmem_protect_limit=84760
gp_interconnect_type=tcp
gp_interconnect_tcp_listener_backlog=4096
max_connections=2500
max_prepared_transactions=2500
log_statement=all
gp_fts_probe_timeout=500
shared_buffers=8GB
effective_cache_size=24GB
maintenance_work_mem=256MB
checkpoint_completion_target=0.9
wal_buffers=16MB
random_page_cost=1.1
effective_io_concurrency=25
max_worker_processes=10
Our application connects to the GP cluster using JDBC 42.2.
How can we know what is really happening? What should we check on the network?
Thank you very much for any suggestions,
Pilar