Lots of "Cannot write to ostream" / "Cannot read from istream" errors

Rohit Agarwal

Feb 11, 2017, 1:38:45 PM
to ClickHouse
Hi,

We run a 4-node ClickHouse cluster (100 GB RAM, 66 TB RAID0 on each node).
There are 4 shards, each replicated 3x.

Recently one node lost all its data and we started the procedure to recover it. We set the force_restore_data flag in ZooKeeper for all the tables we wanted to restore. The data transfer started as soon as we brought the server back up, but there were a lot of errors in the logs.

On the server that was recovering the data, there were a lot of errors like:
2017.02.11 18:22:59.902288 [ 12 ] <Error> DB::StorageReplicatedMergeTree::queueTask()::<lambda(DB::StorageReplicatedMergeTree::LogEntryPtr&)>: Code: 23, e.displayText() = DB::Exception: Cannot read from istream, e.what() = DB::Exception, Stack trace:
0. /usr/bin/clickhouse-server(StackTrace::StackTrace()+0x16) [0x1217286]
1. /usr/bin/clickhouse-server(DB::Exception::Exception(std::string const&, int)+0x1f) [0xf7c44f]
2. /usr/bin/clickhouse-server(DB::ReadBufferFromIStream::nextImpl()+0x97) [0xf85357]
3. /usr/bin/clickhouse-server(DB::ReadBufferFromHTTP::nextImpl()+0x26) [0x1e35236]
4. /usr/bin/clickhouse-server() [0x1e386f1]
5. /usr/bin/clickhouse-server(DB::DataPartsExchange::Fetcher::fetchPartImpl(std::string const&, std::string const&, std::string const&, int, std::string const&, bool)+0xc6e) [0x1395c8e]
6. /usr/bin/clickhouse-server(DB::DataPartsExchange::Fetcher::fetchPart(std::string const&, std::string const&, std::string const&, int, bool)+0x61) [0x1397451]
7. /usr/bin/clickhouse-server(DB::StorageReplicatedMergeTree::fetchPart(std::string const&, std::string const&, bool, unsigned long)+0x1f7) [0x12d3a07]
8. /usr/bin/clickhouse-server(DB::StorageReplicatedMergeTree::executeLogEntry(DB::ReplicatedMergeTreeLogEntry const&)+0x7d0) [0x12d4d60]
9. /usr/bin/clickhouse-server() [0x12d80ce]
10. /usr/bin/clickhouse-server(DB::ReplicatedMergeTreeQueue::processEntry(std::function<std::shared_ptr<zkutil::ZooKeeper> ()>, std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&, std::function<bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>)+0x3b) [0x137514b]
11. /usr/bin/clickhouse-server(DB::StorageReplicatedMergeTree::queueTask()+0x148) [0x12b71e8]
12. /usr/bin/clickhouse-server(DB::BackgroundProcessingPool::threadFunction()+0x3cc) [0x1305c1c]
13. /usr/bin/clickhouse-server() [0x31b48ef]
14. /lib/x86_64-linux-gnu/libpthread.so.0(+0x8064) [0x7ff82cd33064]
15. /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ff82c35b62d]

On the servers that were sending the data, there were a lot of errors like:
2017.02.11 18:22:59.904507 [ 12 ] <Error> InterserverIOHTTPHandler: Code: 24, e.displayText() = DB::Exception: Cannot write to ostream, e.what() = DB::Exception, Stack trace:
0. /usr/bin/clickhouse-server(StackTrace::StackTrace()+0x16) [0x1217286]
1. /usr/bin/clickhouse-server(DB::Exception::Exception(std::string const&, int)+0x1f) [0xf7c44f]
2. /usr/bin/clickhouse-server(DB::WriteBufferFromOStream::nextImpl()+0x7e) [0x105a0ae]
3. /usr/bin/clickhouse-server(DB::WriteBufferFromHTTPServerResponse::nextImpl()+0xda) [0x122e83a]
4. /usr/bin/clickhouse-server(DB::WriteBuffer::next()+0x26) [0xf7c496]
5. /usr/bin/clickhouse-server(DB::HashingWriteBuffer::nextImpl()+0x2d) [0x133be5d]
6. /usr/bin/clickhouse-server() [0x1e3861c]
7. /usr/bin/clickhouse-server(DB::DataPartsExchange::Service::processQuery(Poco::Net::HTMLForm const&, DB::ReadBuffer&, DB::WriteBuffer&)+0xf29) [0x1393e29]
8. /usr/bin/clickhouse-server(DB::InterserverIOHTTPHandler::processQuery(Poco::Net::HTTPServerRequest&, Poco::Net::HTTPServerResponse&)+0xd3f) [0xf9800f]
9. /usr/bin/clickhouse-server(DB::InterserverIOHTTPHandler::handleRequest(Poco::Net::HTTPServerRequest&, Poco::Net::HTTPServerResponse&)+0x44) [0xf98c04]
10. /usr/bin/clickhouse-server(Poco::Net::HTTPServerConnection::run()+0x27b) [0x3038f1b]
11. /usr/bin/clickhouse-server(Poco::Net::TCPServerConnection::start()+0xf) [0x301d62f]
12. /usr/bin/clickhouse-server(Poco::Net::TCPServerDispatcher::run()+0x10b) [0x303b1bb]
13. /usr/bin/clickhouse-server(Poco::PooledThread::run()+0x87) [0x30b4087]
14. /usr/bin/clickhouse-server(Poco::ThreadImpl::runnableEntry(void*)+0x96) [0x3077a06]
15. /lib/x86_64-linux-gnu/libpthread.so.0(+0x8064) [0x7ff6b5257064]
16. /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ff6b487f62d]

Are these expected? How can we avoid them?

--
Rohit

tatiana....@revjet.com

Mar 2, 2017, 2:29:26 PM
to ClickHouse
Hi,

We have the same problem.
We added one more replica to a shard, and it is trying to copy 8 TB to another DC.
On the new replica there are a lot of "Cannot read from istream" errors and timeout errors.
Are there any settings we can use to increase these timeouts?
Or should we instead copy the data manually and then attach all the tables on the new replica, rather than creating them?
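For reference, a sketch of the HTTP timeout settings that can be raised in the default profile of users.xml. The setting names (http_connection_timeout, http_send_timeout, http_receive_timeout) are the general-purpose HTTP timeouts; the values shown are illustrative, and whether the inter-server part-fetch path honors them in a given version is worth verifying:

```xml
<profiles>
    <default>
        <!-- All values in seconds; illustrative, not authoritative defaults. -->
        <http_connection_timeout>5</http_connection_timeout>
        <http_send_timeout>600</http_send_timeout>
        <http_receive_timeout>600</http_receive_timeout>
    </default>
</profiles>
```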


2017.03.02 18:24:32.035899 [ 37 ] <Error> DB::StorageReplicatedMergeTree::queueTask()::<lambda(DB::StorageReplicatedMergeTree::LogEntryPtr&)>: Poco::Exception. Code: 1000, e.code() = 11, e.displayText() = Timeout, e.what() = Timeout
2017.03.02 18:24:32.351434 [ 40 ] <Error> DB::StorageReplicatedMergeTree::queueTask()::<lambda(DB::StorageReplicatedMergeTree::LogEntryPtr&)>: Poco::Exception. Code: 1000, e.code() = 11, e.displayText() = Timeout, e.what() = Timeout
2017.03.02 18:24:33.091490 [ 38 ] <Error> DB::StorageReplicatedMergeTree::queueTask()::<lambda(DB::StorageReplicatedMergeTree::LogEntryPtr&)>: Poco::Exception. Code: 1000, e.code() = 11, e.displayText() = Timeout, e.what() = Timeout
2017.03.02 18:24:33.711482 [ 40 ] <Error> DB::StorageReplicatedMergeTree::queueTask()::<lambda(DB::StorageReplicatedMergeTree::LogEntryPtr&)>: Poco::Exception. Code: 1000, e.code() = 11, e.displayText() = Timeout, e.what() = Timeout
2017.03.02 18:24:34.243378 [ 38 ] <Error> DB::StorageReplicatedMergeTree::queueTask()::<lambda(DB::StorageReplicatedMergeTree::LogEntryPtr&)>: Poco::Exception. Code: 1000, e.code() = 11, e.displayText() = Timeout, e.what() = Timeout
2017.03.02 18:24:36.565250 [ 46 ] <Error> DB::StorageReplicatedMergeTree::queueTask()::<lambda(DB::StorageReplicatedMergeTree::LogEntryPtr&)>: Code: 23, e.displayText() = DB::Exception: Cannot read from istream, e.what() = DB::Exception, Stack trace:

0. clickhouse-server(StackTrace::StackTrace()+0x16) [0x1217286]
1. clickhouse-server(DB::Exception::Exception(std::string const&, int)+0x1f) [0xf7c44f]
2. clickhouse-server(DB::ReadBufferFromIStream::nextImpl()+0x97) [0xf85357]
3. clickhouse-server(DB::ReadBufferFromHTTP::nextImpl()+0x26) [0x1e35236]
4. clickhouse-server(DB::DataPartsExchange::Fetcher::fetchPartImpl(std::string const&, std::string const&, std::string const&, int, std::string const&, bool)+0x196e) [0x139698e]
5. clickhouse-server(DB::DataPartsExchange::Fetcher::fetchPart(std::string const&, std::string const&, std::string const&, int, bool)+0x61) [0x1397451]
6. clickhouse-server(DB::StorageReplicatedMergeTree::fetchPart(std::string const&, std::string const&, bool, unsigned long)+0x1f7) [0x12d3a07]
7. clickhouse-server(DB::StorageReplicatedMergeTree::executeLogEntry(DB::ReplicatedMergeTreeLogEntry const&)+0x7d0) [0x12d4d60]
8. clickhouse-server() [0x12d80ce]
9. clickhouse-server(DB::ReplicatedMergeTreeQueue::processEntry(std::function<std::shared_ptr<zkutil::ZooKeeper> ()>, std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&, std::function<bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>)+0x3b) [0x137514b]
10. clickhouse-server(DB::StorageReplicatedMergeTree::queueTask()+0x148) [0x12b71e8]
11. clickhouse-server(DB::BackgroundProcessingPool::threadFunction()+0x3cc) [0x1305c1c]
12. clickhouse-server() [0x31b48ef]
13. /lib/x86_64-linux-gnu/libpthread.so.0(+0x8064) [0x7fba121d2064]
14. /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fba117fa62d]

man...@gmail.com

Mar 2, 2017, 2:32:40 PM
to ClickHouse
We have a similar issue on some of our servers and are currently investigating it.

mvav...@cloudflare.com

Mar 2, 2017, 11:41:41 PM
to ClickHouse
This is the ticket with follow-up findings: https://github.com/yandex/ClickHouse/issues/520
It's not fully solved yet, though.

man...@gmail.com

Mar 13, 2017, 5:32:25 PM
to ClickHouse
We are currently developing a solution for this issue.
Details are here: https://github.com/yandex/ClickHouse/issues/520

To avoid this issue, you can temporarily lower 'background_pool_size' on the repaired replica during the repair.
Set <background_pool_size>4</background_pool_size> under /profiles/default in users.xml and restart the server.

Don't forget to raise it back after the repair (either remove this parameter or set it back to the default value of 16).
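A minimal sketch of that users.xml change, assuming the stock <yandex> config root and the default profile:

```xml
<yandex>
    <profiles>
        <default>
            <!-- Temporarily throttle background merges/fetches while the
                 replica repairs; restore to the default of 16 afterwards. -->
            <background_pool_size>4</background_pool_size>
        </default>
    </profiles>
</yandex>
```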