I'm reposting this question on behalf of an advanced CitusDB + pg_shard user.
"We have a formula to map a column value to a particular shard in pg_shard. When we apply this formula, we're now seeing a slightly uneven distribution for hash token ranges.
Here's the breakdown of the shard ranges that led me to having to derive a formula. The number of shards is 256.
select range_size, count(*) as shard_count from (select shardid, (shardmaxvalue::bigint - shardminvalue::bigint) + 1 as range_size from pg_dist_shard where logicalrelid = 94514) as t group by 1 order by 1;
range_size | shard_count
------------+-------------
16777215 | 255
16777471 | 1
(2 rows)
This is for sharding keys based on BIGINT. I can show that this is consistent for all four tables on each cluster that are sharded on the same key.
The formula in SQL for column "x":
CASE WHEN hashint8(x) >= 2130706177 THEN 255 ELSE (hashint8(x) + 2147483648) / 16777215 END AS shardid
(A quick note on why this formula incorporates additional IF / ELSE checks. When PostgreSQL applies hashint8 on a Datum vs human readable text, you may get different reasonable. This formula therefore applies some transformations.)
It looks like the width of each shard other than the last one is (2^32 / 2^8) - 1. Is this expected?"