I tried increasing the number of segments in my test Docker instance of gpdb.
I invoked gpexpand in interactive mode and made a mistake when entering the directories for the new segments: I specified directories that didn't exist yet, assuming gpexpand would create them. I hadn't realized that it wants the parent directory and creates the segment directories itself.
Anyhow, gpexpand failed because the parent directory didn't exist. I tried to roll back with gpexpand --rollback and got this:
20170511:09:16:50:018177 gpexpand:9b2d6636c155:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.7.1 build 1'
20170511:09:16:50:018177 gpexpand:9b2d6636c155:gpadmin-[ERROR]:-gpexpand failed: could not connect to server: Connection refused
Is the server running on host "localhost" (127.0.0.1) and accepting
TCP/IP connections on port 5432?
So it appears the server isn't running. I tried to start it with gpstart (via /usr/local/bin/run.sh in dbbaskette's Dockerfile):
20170511:09:25:58:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Starting gpstart with args: -a
20170511:09:25:58:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Gathering information and validating the environment...
20170511:09:25:58:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.7.1 build 1'
20170511:09:25:58:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Greenplum Catalog Version: '201310150'
20170511:09:25:58:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Starting Master instance in admin mode
20170511:09:26:00:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20170511:09:26:00:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Obtaining Segment details from master...
20170511:09:26:00:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Setting new master era
20170511:09:26:00:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Master Started...
20170511:09:26:00:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Shutting down master
20170511:09:26:01:018437 gpstart:9b2d6636c155:gpadmin-[ERROR]:-gpstart error: Found a System Expansion Setup in progress. Please run 'gpexpand --rollback'
This seems like a catch-22: I can't roll back the failed expansion without a running server, but I can't start the server without rolling back the failed expansion.
Obviously I can blow away the Docker container and redo the expansion properly, but this is exactly the kind of scenario that could occur in production, and I want to know how to fix it.
I was stumped on this one until I ran across a Chinese article,
http://blog.csdn.net/wxc20062006/article/details/53126076, and it turns out the fix is to use:
gpstart -m
to start only the master node, whereupon gpexpand --rollback can be run; after that, a normal gpstart works again.
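For anyone who wants the whole sequence in one place, here is a minimal sketch of the steps above as a shell function. It assumes you are logged in as gpadmin with the Greenplum environment already sourced. The function name, the run helper, and the DRY_RUN variable are my own inventions for illustration, not part of Greenplum; the -a flag on the final gpstart is the one that appears in the log output above.

```shell
#!/bin/sh
# Sketch of the recovery sequence for a failed gpexpand setup.
# DRY_RUN, run and recover_failed_expansion are hypothetical names
# used only in this example.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

recover_failed_expansion() {
    # 1. Start only the master, in maintenance mode.
    run gpstart -m || return 1
    # 2. The master now accepts connections, so the rollback can proceed.
    run gpexpand --rollback || return 1
    # Assumption: if the master is still up in maintenance mode at this
    # point, it may need to be stopped (gpstop -m) before the next step.
    # 3. Start the whole cluster as usual.
    run gpstart -a
}
```

Running it with DRY_RUN=1 set first only prints the three commands, which is a cheap way to check the order before touching a real cluster; calling recover_failed_expansion without it actually executes them.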
I know this is a bit out of the ordinary, because nobody would be expected to run a single-node Docker instance in production, so if there's a technical reason why this kind of failure can't happen in production, that would be just as good an answer.
I write this to increase the probability that future me, or someone else seeing the same issue, may find this post and discover the fix.