Experience: recovering from a failed gpexpand requires a running server, but the server won't start (catch-22)


Barry

May 11, 2017, 5:38:00 AM

to Greenplum Users

I tried increasing the number of segments in my test Docker instance of gpdb.

I ran gpexpand in interactive mode and made a mistake when entering the directories for the new segments: I specified directories that didn't exist yet, thinking gpexpand would create them. I didn't realize that it wants the parent directory, which must already exist, and creates the segment directories inside it.
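A pre-flight check along the following lines might catch this mistake before gpexpand runs. This is only a sketch: the function name `check_expand_parent_dirs` is made up for illustration, it is not part of Greenplum, and it only inspects the local host, so on a multi-host cluster the same check would have to run on every segment host.

```shell
# Hypothetical helper (not a Greenplum utility): verify that each parent
# data directory passed as an argument exists and is writable locally.
check_expand_parent_dirs() {
    status=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir"
            status=1
        elif [ ! -w "$dir" ]; then
            echo "not writable: $dir"
            status=1
        else
            echo "ok: $dir"
        fi
    done
    # Non-zero if any directory failed the check.
    return "$status"
}
```

It would be called with the parent directories you intend to give gpexpand, for example `check_expand_parent_dirs /path/to/parent1 /path/to/parent2` (placeholder paths).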

Anyhow, gpexpand failed because the parent directory didn't exist. I tried to roll back with gpexpand --rollback and got this:

20170511:09:16:50:018177 gpexpand:9b2d6636c155:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 4.3.7.1 build 1'
20170511:09:16:50:018177 gpexpand:9b2d6636c155:gpadmin-[ERROR]:-gpexpand failed: could not connect to server: Connection refused
    Is the server running on host "localhost" (127.0.0.1) and accepting
    TCP/IP connections on port 5432?

So it appears the server isn't running. I tried to start it with gpstart (via /usr/local/bin/run.sh in dbbaskette's Dockerfile):

20170511:09:25:58:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Starting gpstart with args: -a
20170511:09:25:58:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Gathering information and validating the environment...
20170511:09:25:58:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 4.3.7.1 build 1'
20170511:09:25:58:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Greenplum Catalog Version: '201310150'
20170511:09:25:58:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Starting Master instance in admin mode
20170511:09:26:00:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Obtaining Greenplum Master catalog information
20170511:09:26:00:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Obtaining Segment details from master...
20170511:09:26:00:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Setting new master era
20170511:09:26:00:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Master Started...
20170511:09:26:00:018437 gpstart:9b2d6636c155:gpadmin-[INFO]:-Shutting down master
20170511:09:26:01:018437 gpstart:9b2d6636c155:gpadmin-[ERROR]:-gpstart error: Found a System Expansion Setup in progress. Please run 'gpexpand --rollback'

This seems like a catch-22: I can't roll back the failed expansion without a running server, but I can't start the server without rolling back the failed expansion.

Obviously I can blow away the Docker container and start the expansion properly, but this is exactly the kind of scenario that could occur in production, and I wanted to know how to fix it.

I was stumped on this one until I ran across a Chinese article, http://blog.csdn.net/wxc20062006/article/details/53126076, and it turns out the fix is to use:

   gpstart -m

to start only the master node, after which gpexpand --rollback can be run; a normal gpstart then works again.
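Put together, the sequence looks roughly like the sketch below. The wrapper function `recover_failed_expansion` is hypothetical and exists only to show the order of the three commands. Any extra arguments are forwarded to gpexpand, on the assumption that some releases want additional options for a rollback (for example a `-D` database name); check `gpexpand --help` for your version. This thread does not say whether the maintenance-mode master has to be stopped (`gpstop -m`) before the final gpstart, so that step is left out here, and gpstart -m may prompt for confirmation.

```shell
# Hypothetical wrapper around the recovery steps described in this post.
# Arguments, if any, are passed through to gpexpand (assumption: your
# release may need extra options for --rollback).
recover_failed_expansion() {
    # 1. Start only the master, in maintenance mode.
    gpstart -m || return 1
    # 2. Roll back the failed expansion while the master is reachable.
    gpexpand --rollback "$@" || return 1
    # 3. Start the whole system normally.
    gpstart
}
```

Run it as gpadmin with the Greenplum environment sourced, e.g. `recover_failed_expansion`, and stop at the first step that reports an error rather than retrying blindly.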

I know this is a bit out of the ordinary, since nobody would be expected to run a single-node Docker instance in production, so if there's a technical reason why this kind of failure can't happen in production, that would be just as good to know.

I write this to increase the probability that future me, or someone else seeing the same issue, may find this post and discover the fix.

Asim Praveen

May 12, 2017, 12:07:26 AM

to Barry, Greenplum Users

On Thu, May 11, 2017 at 2:38 AM, Barry <barry...@du.co> wrote:
>
> This seems like a catch-22: I can't roll back the failed expansion without a running server, but I can't start the server without rolling back the failed expansion.
>

Using `gpstart -m` in this situation is documented in the Admin Guide, at the bottom of this page: http://greenplum.org/docs/admin_guide/expand/expand-initialize.html

>
> I know this is a bit out of the ordinary, since nobody would be expected to run a single-node Docker instance in production, so if there's a technical reason why this kind of failure can't happen in production, that would be just as good to know.
>

Yes, this situation can occur in production and there is an escape hatch.

Asim
