Estimating n_distinct for extremely large tables


Chris Hajas

Nov 8, 2023, 5:08:18 PM
to gpdb...@greenplum.org
Hi Greenplum Developers,

I don't have a concrete ask in this post; it's more of an observation about GPDB.

While doing some internal testing, I saw that the analyze estimates for n_distinct (the number of distinct values, NDV) are inaccurate for extremely large tables when the sample size is small. This is further magnified in GPDB by partitioned tables and distributed data.

As an example, consider the customer column of the orders table of TPC-H, which for a 3TB data set has 4.5B rows. Of these, 300M are distinct customers (around 6% of the orders). In the sample we take, 96% of the values are distinct. Using the Haas and Stokes formula in GPDB/Postgres, we estimate around 45M distinct values (~1%), roughly a 6x difference from the actual number--which causes us to underestimate the cost of a hash aggregate. If the data set is a bit smaller, we get a fairly accurate estimate. However, this formula becomes less and less accurate as the sampled rows become a smaller and smaller fraction of the actual rows. It's the same issue seen in https://www.postgresql.org/message-id/200504231639.11897.josh%40agliodbs.com and also discussed in https://www.postgresql.org/message-id/flat/1136401829.21025.151.camel%40localhost.localdomain.
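For reference, the estimator in analyze.c (the Haas and Stokes form) is:

    n_distinct = (n * d) / ((n - f1) + f1 * (n / N))

where n is the number of sampled rows, N the total number of rows, d the number of distinct values in the sample, and f1 the number of values that appeared exactly once in the sample. When n is a tiny fraction of N, the f1 * (n / N) term is negligible, and when nearly every sampled value is unique (f1 close to d close to n), the denominator collapses to n - f1: a small, noisy count of repeated values. A few sampled duplicates more or fewer can then swing the estimate by orders of magnitude, which is exactly the regime we're in here.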

This is inherently a hard problem, and it will be more prevalent in GPDB, where we apply this same formula to the root partition over far more tuples than a typical Postgres database. With this merging, we can even end up with an NDV for the root partition that is less than the max NDV of a leaf partition!

Since NDV is used so heavily by the optimizer for joins/group bys, I'm hesitant to make any changes to the formula beyond ensuring the root NDV is at least the max NDV of its leaves. We do have a HyperLogLog fullscan capability within analyze (analyze fullscan <table>), but it isn't commonly used since it requires scanning the entire table/partition during analyze, though it gives much more accurate estimates. Additionally, we could ask users to increase the default statistics target for large tables.

It seems like this is a known limitation of sampling. Does the default statistics target of 100 still make sense for large tables, or is it something we should increase? On the more extreme end, there's also `analyze fullscan <table>`, which scans the entire table to get exact NDVs, but this isn't something we advertise or use often.

- Chris

Soumyadeep Chakraborty

Nov 8, 2023, 6:08:57 PM
to Chris Hajas, gpdb...@greenplum.org
On Wed, Nov 8, 2023 at 2:08 PM 'Chris Hajas' via Greenplum Developers
<gpdb...@greenplum.org> wrote:

> On the more extreme end, there's also using `analyze fullscan <table>` which does a scan of the entire table to get perfect NDVs, but this isn't something we advertise/use often.

Wow we have that? :D
If Analyze = scanning the full table, then it's not really sampling! I
think that is a deal breaker.

> Additionally, we could ask users to also increase the default statistics target for large tables

I think this is the way to go. There are a few ways:
(0) Setting default_statistics_target cluster-wide with gpconfig.
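Concretely, something like the following (a sketch; 1000 is just an example value), followed by a config reload:

gpconfig -c default_statistics_target -v 1000
gpstop -u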

(1) Setting default_statistics_target database-wide with:
alter database postgres set default_statistics_target=1000;

(2) Setting default_statistics_target session-wide, maybe for the
session running analyze (a minimal sketch follows below).
PS: Unfortunately, there is no way to hook up a target for autoanalyze :(
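For example (assuming a table named orders):

set default_statistics_target = 1000;
analyze orders;
reset default_statistics_target;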

(3) Running ALTER TABLE ... SET STATISTICS on the columns of the table to
hike the stats target for just that table.
This only acquires a ShareUpdateExclusiveLock, which is not that
restrictive, and it is a very quick operation.
For partitioned tables this command recurses appropriately.

-- hikes stats target for the count column across the partition hierarchy
alter table rank alter count set statistics 1000;

select attrelid::regclass, attstattarget
from pg_attribute
where attrelid in ('rank'::regclass, 'rank_1_prt_boys'::regclass)
  and attname = 'count';

    attrelid     | attstattarget
-----------------+---------------
 rank            |          1000
 rank_1_prt_boys |          1000
(2 rows)

Regards,
Soumyadeep (VMware)

Zhenghua Lyu

Nov 8, 2023, 7:52:30 PM
to gpdb...@greenplum.org, Chris Hajas
Hi Chris Hajas,

       How about this idea: compute n_distinct using a full scan only.
       Things can be different in Greenplum than in Postgres, because Greenplum is an MPP database and
       can process in parallel.

       Analyze a table:
       begin;
               1. declare a cursor that selects the number of distinct values for each column;
               2. run a normal analyze;
               3. fetch the cursor's result and update the stats info's NDV;
       commit;

      Cursors are executed asynchronously in Greenplum and their dispatch state is external, so the full-scan
      NDV computation runs in the background at the same time as step 2 (the normal analyze).

      Thus, we will have an accurate NDV, and because it executes in parallel it should be fast.
      The only cost is one extra reader gang doing the full-scan NDV computation.
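
      A rough sketch of what the transaction could look like (the table and column names are
      hypothetical, and the pg_statistic update in step 3 is hand-waved here):

      begin;
      -- step 1: the cursor's full scan is dispatched to its own reader gang
      declare ndv_cur cursor for
          select count(distinct o_custkey) from orders;
      -- step 2: the normal sample-based analyze runs concurrently
      analyze orders;
      -- step 3: fetch the exact NDV, then overwrite stadistinct in pg_statistic
      fetch all from ndv_cur;
      commit;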


Thoughts?
       

From: 'Chris Hajas' via Greenplum Developers <gpdb...@greenplum.org>
Sent: Thursday, November 9, 2023 6:08 AM
To: gpdb...@greenplum.org <gpdb...@greenplum.org>
Subject: Estimating n_distinct for extremely large tables
 