Disk no space node failure - Client hangs

Ramnatthan Ala

Jul 17, 2016, 4:41:38 PM
to CockroachDB
Hi,

I am trying to understand some details regarding replication, raft, and leader election. I start a 3-node cluster using a small script like this:

cockroach start --store=data0 --log-dir=log0 &
cockroach start --store=data1 --log-dir=log1 --port=26258 --http-port=8081 --join=localhost:26257 --join=localhost:26259 &
cockroach start --store=data2 --log-dir=log2 --port=26259 --http-port=8082 --join=localhost:26257 --join=localhost:26258 &
sleep 5

At this point, I see that the cluster is set up correctly and I can start inserting and reading data. So far so good.

Now, during this startup, node2 (the node with data2 as its store) hits a disk-full error (-ENOSPC) when trying to append to an SSTable file and fails to start. At this point, my expectation is that the cluster continues to accept client connections and make progress with transactions (as a majority of the nodes is available). This is my client code:

import time
import psycopg2

server_ports = ["26257", "26258", "26259"]
server_id = 0
for port in server_ports:
    try:
        # connect_timeout only bounds connection setup, not query execution
        conn = psycopg2.connect(host="localhost", port=port, database="mydb", user="xxx", connect_timeout=10)
        conn.set_session(autocommit=True)
        cur = conn.cursor()
        cur.execute("SELECT * FROM mytable;")
        result = cur.fetchall()
        print(result)
        cur.close()
        conn.close()
    except Exception as e:
        print('Exception: ' + str(e))
    time.sleep(3)

I see that the client successfully connects to node0 but then hangs (in the execute statement) for at least 30 seconds, after which I abort the thread running the above code.
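
One way to keep a hung statement from blocking the whole loop is to bound each attempt with a deadline. The following is only a minimal sketch, not part of the original script: run_with_deadline and first_responsive are hypothetical helper names, and query_fn stands in for the connect-and-query body of the client code. It assumes that abandoning a stuck attempt in a daemon thread is acceptable for a test client.

```python
import threading


def run_with_deadline(fn, timeout_s, *args):
    """Run fn(*args) in a daemon thread and wait at most timeout_s seconds.

    Returns ("ok", value), ("error", exception) or ("timeout", None).
    """
    outcome = {}

    def target():
        try:
            outcome["value"] = fn(*args)
        except Exception as exc:
            outcome["error"] = exc

    worker = threading.Thread(target=target)
    worker.daemon = True  # a stuck attempt must not keep the process alive
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # The attempt is abandoned, not cancelled: the thread (and any
        # connection it holds) lingers until the process exits.
        return "timeout", None
    if "error" in outcome:
        return "error", outcome["error"]
    return "ok", outcome["value"]


def first_responsive(ports, query_fn, timeout_s):
    """Try query_fn(port) on each port in order; return the first success.

    Returns (port, result), or (None, None) if no port answered in time.
    """
    for port in ports:
        status, payload = run_with_deadline(query_fn, timeout_s, port)
        if status == "ok":
            return port, payload
        print("port %s: %s" % (port, status))
    return None, None
```

With the body of the try block wrapped in a function that takes the port and returns the fetched rows, first_responsive(server_ports, that_function, 30) would report a timeout for a node that accepts the connection but never answers, and then move on to the next port instead of hanging.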

I am not sure what is going wrong here. Shouldn't the client be able to talk to node0 or node1 irrespective of node2's failure? I am not sure who was the leader for this table before node2 crashed. Irrespective of who was the old leader, shouldn't the cluster automatically elect a new leader (if node2 were the old leader) and continue to make progress in any case? 

Any help would be much appreciated.

Thanks
Ram   

Peter Mattis

Jul 18, 2016, 10:50:26 AM
to Ramnatthan Ala, CockroachDB
Hi Ramnatthan,

The cluster should be able to continue when one node runs out of space. That it isn't sounds like a bug. Can you file an issue?

Ramnatthan Ala

Jul 18, 2016, 12:04:52 PM
to Peter Mattis, CockroachDB
Thanks Peter.


Thanks
Ram