Failover test takes more than eight hours and frequently reports 'request timeout' on version 0.9.7.5

David

May 22, 2013, 11:41:23 PM
to hyperta...@googlegroups.com
My cluster has six RangeServers (rs1-rs6). The hypertable.cfg file contains the following settings:
Hypertable.RangeServer.MemoryLimit=24G
Hypertable.RangeServer.MemoryLimit.Percentage=50
Hypertable.Failover.Quorum.Percentage=80
Hypertable.RangeServer.CommitLog.PruneThreshold.Max=12G
Hypertable.RangeServer.Maintenance.MaxAppQueuePause=120000
Hypertable.RangeServer.CommitLog.FragmentRemoval.RangeReferenceRequired=false

When the cluster had accumulated more than 2T of data, the commit log size and the memory usage of every RangeServer were within the configured limits. I manually interrupted rs2 to start the first failover test. It took about 30 minutes and succeeded. I then brought the interrupted RangeServer back, and its proxy name became rs7. At that point, both the commit log and the memory of every RangeServer were normal.

A day later, the cluster had accumulated more than 3T of data. The commit log size was normal, but the memory usage of several RangeServers had reached 26-27G. I manually interrupted rs5 to start the second failover test at 12:00. Everything seemed normal for the first hour, but from 13:00 the master sent a mail every 5 minutes reporting the following ERROR:

player rs1: HYPERTABLE request timeout - Problem connecting to rs7
player rs3: HYPERTABLE request timeout - Problem connecting to rs7
player rs4: HYPERTABLE request timeout - Problem connecting to rs7
player rs6: HYPERTABLE request timeout - Problem connecting to rs7

It seemed that the other RangeServers could not connect to rs7, yet running 'ht_rsclient rs7ip' on the other RangeServers worked. At 14:30 the problem still existed. I noticed that some RangeServers had higher memory usage (28-30G) and suspected that this might be causing the problem, so I tried restarting the RangeServer service on some of them. Unfortunately, the master service went down and threw the following exception:
1369204575 INFO Hypertable.Master : (/root/src/hypertable/src/cc/Hypertable/Master/Context.cc:103) replay_complete(id=46257, rs5, plan_generation=2) = RANGE SERVER phantom range map not found
terminate called after throwing an instance of 'std::logic_error'
  what():  basic_string::_S_construct NULL not valid

After I restarted the master service, the exception disappeared. Once a restarted RangeServer had come up completely, I still received the mail from the master, but the mail body no longer listed the restarted machine, so it seemed that this RangeServer could now connect to rs7. I therefore restarted the RangeServer service on the other RangeServers as well. The master still threw the same exception, but it was fine after its service was restarted.

While the RangeServers were starting, the logs showed many messages indicating that the commit log and ranges of rs5 were being replayed or compacted. This continued until 17:56, when I learned from the mail notices that a RangeServer (rs6) had suddenly gone down and started recovery. Meanwhile, the four live RangeServers were under fairly high load, and I felt hopeless. To my surprise, the recovery of rs6 succeeded at 18:15. I brought it back at once and noticed that the load on the other RangeServers dropped slightly.

At 20:29, I received the mail reporting that the recovery of rs5 had succeeded.

Although the recovery of rs5 eventually succeeded, the whole process was worrying. I have some questions:

1. Why the frequent timeouts to rs7? Were the other RangeServers really unable to connect to rs7?

2. Why did it take so long?

3. If I had not restarted the RangeServer service, would the recovery of rs5 have succeeded?

Perhaps these questions boil down to one.

Any ideas would be appreciated.

Doug Judd

May 23, 2013, 8:38:08 AM
to hypertable-user
Hi David,

Can you send me the log files (Hypertable.Master.log and all Hypertable.RangeServer.log files) from when you manually killed rs2 to the time rs5 succeeded?  From the logs, I should be able to figure out what's happening.

- Doug



--
You received this message because you are subscribed to the Google Groups "Hypertable User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hypertable-us...@googlegroups.com.
To post to this group, send email to hyperta...@googlegroups.com.
Visit this group at http://groups.google.com/group/hypertable-user?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
Doug Judd
CEO, Hypertable Inc.

David

May 23, 2013, 11:22:10 PM
to hyperta...@googlegroups.com, do...@hypertable.com
I have sent all the logs to your mailbox. Please take a look.

David

Jun 4, 2013, 10:30:27 AM
to hyperta...@googlegroups.com, do...@hypertable.com
Hi Doug,
I have upgraded from 0.9.7.5 to 0.9.7.6, and again ran into a failover test that took more than an hour on version 0.9.7.6.
The cluster has had almost no write workload for several days, but when I interrupted a RangeServer to test automatic failover, after about thirty minutes I started to receive a mail every five minutes, always reporting the same ERROR:
player rs1: HYPERTABLE request timeout - Problem connecting to rs12
player rs8: HYPERTABLE request timeout - Problem connecting to rs12
player rs9: HYPERTABLE request timeout - Problem connecting to rs12
rs12 is a RangeServer that was brought back after the previous failover test. It seems that the three RangeServers cannot connect to rs12.
I inspected the load on the four RangeServers (rs1, rs8, rs9, rs12). All were lightly loaded, but after an hour the ERROR was still appearing.
So I restarted the cluster. During the restart, rs1 also reported 'hypertable no response' and started recovery automatically. After twenty-five minutes, the recovery of the previously interrupted RangeServer succeeded, and about fifty minutes after that, the recovery of rs1 succeeded.

Doug Judd

Jun 4, 2013, 12:24:13 PM
to hypertable-user
Can you try with 0.9.7.7?  There was a bug fix in the communication subsystem that may resolve the issue.

- Doug




David

Jun 4, 2013, 9:11:19 PM
to hyperta...@googlegroups.com, do...@hypertable.com
OK, I will try 0.9.7.7 and report the result.

David

Jun 7, 2013, 9:59:56 PM
to hyperta...@googlegroups.com, do...@hypertable.com
It seems that the 'request timeout' problem still appears on version 0.9.7.7.
I am now running the third failover test; the previous two went fine. This time, however, I frequently get the following ERROR:
Failure encountered during REPLAY FRAGMENTS step of recovery
of range server rs4
player rs2: HYPERTABLE request timeout - Problem connecting to rs8
player rs5: HYPERTABLE request timeout - Problem connecting to rs8
player rs6: HYPERTABLE request timeout - Problem connecting to rs8
player rs7: HYPERTABLE request timeout - Problem connecting to rs8
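
When these mails arrive every five minutes, it can help to tally which target the players fail to reach. The following is a minimal sketch that assumes the mail body has the same shape as the excerpts quoted in this thread; the function name summarize_timeouts and the regular expression are illustrative and not part of Hypertable.

```python
import re
from collections import Counter

# One "player" entry, in the form seen in the notification mails quoted in this thread.
PLAYER_RE = re.compile(
    r"player\s+(?P<player>\S+):\s+HYPERTABLE request timeout - "
    r"Problem connecting to\s+(?P<target>\S+)"
)

def summarize_timeouts(mail_body):
    """Return (targets, players): a count of timeouts per unreachable target,
    and the list of players that reported each target."""
    targets = Counter()
    players = {}
    for match in PLAYER_RE.finditer(mail_body):
        target = match.group("target")
        targets[target] += 1
        players.setdefault(target, []).append(match.group("player"))
    return targets, players

sample = """\
Failure encountered during REPLAY FRAGMENTS step of recovery
of range server rs4
player rs2: HYPERTABLE request timeout - Problem connecting to rs8
player rs5: HYPERTABLE request timeout - Problem connecting to rs8
player rs6: HYPERTABLE request timeout - Problem connecting to rs8
player rs7: HYPERTABLE request timeout - Problem connecting to rs8
"""

targets, players = summarize_timeouts(sample)
print(targets.most_common())  # unreachable targets, most frequent first
print(players)                # which players reported each target
```

The same pattern also matches mails in which several player entries appear on a single line.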

rs8 is the machine that was deliberately interrupted in the second test and then brought back into the cluster after recovery succeeded.
From many failover tests, I have found that the 'request timeout' problem occurs easily whenever other RangeServers attempt to connect to a machine that was brought back.

 


Doug Judd

Jun 9, 2013, 2:11:01 AM
to hypertable-user
Hi David,

Thanks for reporting back and providing details.  I think there is enough information to help isolate the problem.  It would be good if you could provide some of the log files, however.  Ideally all of the log files from the time when the server was first interrupted to when the errors started appearing.  If it's too much effort, then at least capture the Hypertable.Master.log file and one of the RangeServer logs that is experiencing the error.  You can either post them here, or directly to issue 1088.  Thanks!

- Doug




David

Jun 10, 2013, 7:23:47 AM
to hyperta...@googlegroups.com, do...@hypertable.com
Hi Doug,
When the problem first appeared, I sent you the logs on May 24. I have just re-sent that email to you, so please check your mailbox.
If those logs are not useful, I will send you new ones.

Doug Judd

Jun 10, 2013, 12:31:17 PM
to hypertable-user
Ok, we can use those logs that you sent me earlier.  One other thing, please describe your test in as much detail as possible.  I'd like to try to recreate the problem on our development cluster.  Things I'd like to know are:  What does the workload look like?  Are you reading and writing during the test?  Does the load stop and then you kill the RangeServer?  Or do you continue to send load to the cluster while you kill the RangeServer?

- Doug




David

Jun 11, 2013, 3:22:30 AM
to hyperta...@googlegroups.com, do...@hypertable.com
The cluster sees writes and only rare reads. Writing continues throughout the test, but I find that little data is written into the database during the test, so after recovery succeeds I must restart the writing application to ensure normal inserting.
The cluster workload should not be heavy; no RangeServer goes beyond 10G, which is the value of "CommitLog.PruneThreshold.Max".
There are seven RangeServers in the cluster. The first time, I kill rs2. The failover recovery takes about ten minutes and finally succeeds. I then bring it back, and its proxy name changes to rs7. The second time, I kill rs3. Its failover recovery is similar to that of rs2, and after it is brought back its proxy name is rs8.
The third time, I kill rs4. Its failover recovery struggles and frequently reports "HYPERTABLE request timeout - Problem connecting to rs8".
I have run this test several times, and the problem above always appears the third time.

Doug Judd

Jun 11, 2013, 1:53:08 PM
to hypertable-user
Hi David,

It looks like this may be a problem with how the proxy-to-address map is maintained in the Comm layer.  I've implemented a fix (and added some additional asserts) which you can find here:

SHA1:  56fec716b420b27a112b2cb78fcec95fe3146c7d

SHA1:  251bb64e1bdc9f2674282b22c0963ab4d056ed7c

SHA1:  0afc7cc84059eb8273a5c6c2f3c0139b9fee89f2

Can you give this version a try and let us know if it resolves the problem?  Thanks.
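
For readers unfamiliar with the idea, the proxy-to-address map mentioned above can be pictured with a toy sketch. This is purely illustrative and is not Hypertable's Comm layer code: the class ProxyMap, its methods, and the address used are all hypothetical. It shows one way such a table can avoid stale entries when a server rejoins from the same address under a new proxy name, as rs2 did when it came back as rs7.

```python
class ProxyMap:
    """Toy proxy-name <-> address table (hypothetical; not Hypertable code)."""

    def __init__(self):
        self._by_proxy = {}  # proxy name -> (host, port)
        self._by_addr = {}   # (host, port) -> proxy name

    def register(self, proxy, addr):
        """Bind proxy to addr, dropping any stale binding on either side."""
        # An older proxy name may still be bound to this address.
        old_proxy = self._by_addr.get(addr)
        if old_proxy is not None and old_proxy != proxy:
            del self._by_proxy[old_proxy]
        # This proxy name may still be bound to an older address.
        old_addr = self._by_proxy.get(proxy)
        if old_addr is not None and old_addr != addr:
            del self._by_addr[old_addr]
        self._by_proxy[proxy] = addr
        self._by_addr[addr] = proxy

    def resolve(self, proxy):
        """Return the address for a proxy name, or None if unknown."""
        return self._by_proxy.get(proxy)

    def proxy_for(self, addr):
        """Return the proxy name bound to an address, or None if unknown."""
        return self._by_addr.get(addr)


# A server at a made-up address registers as rs2, then rejoins as rs7.
pm = ProxyMap()
addr = ("10.0.0.2", 38060)
pm.register("rs2", addr)
pm.register("rs7", addr)
print(pm.resolve("rs2"), pm.resolve("rs7"), pm.proxy_for(addr))
```

Keeping both directions of the mapping in step is the design point of the sketch: if only one side were updated on re-registration, a lookup could still return the old name or the old address.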

- Doug



David

Jun 12, 2013, 7:11:55 AM
to hyperta...@googlegroups.com, do...@hypertable.com
OK, I will try it out and report back.

David

Jun 14, 2013, 5:03:33 AM
to hyperta...@googlegroups.com, do...@hypertable.com
I tried version 0.9.7.7.990f07c, but the problem still exists.
I upgraded from 0.9.7.7 to 0.9.7.7.990f07c yesterday. I killed rs2 to start the first failover test yesterday afternoon, and it went fine. I then brought rs2 back, and its proxy name changed to rs7. I killed rs3 to start the second failover test this morning, and it also went fine. I then brought rs3 back, and its proxy name changed to rs8.
I killed rs4 to start the third failover test this afternoon, and the same problem appeared as in the previous tests. The following ERROR was reported frequently in the mail:
  player rs1: HYPERTABLE request timeout - Problem connecting to rs8
  player rs5: HYPERTABLE request timeout - Problem connecting to rs8
  player rs6: HYPERTABLE request timeout - Problem connecting to rs8
I had no choice but to restart the cluster. After the restart, the third test finally succeeded.
All the logs have been mailed to you.

Doug Judd

Jun 15, 2013, 12:00:00 PM
to hypertable-user
Hi David,

Thanks for reporting back.  I'm going to be on vacation this upcoming week, but I'll look into it when I get back.

- Doug




David

Jun 17, 2013, 8:18:43 AM
to hyperta...@googlegroups.com, do...@hypertable.com
OK, have a good vacation.

Doug Judd

Jun 26, 2013, 2:58:25 AM
to hypertable-user
Hi David,

I was able to reproduce this problem and have come up with a fix for it.  Try out the patched version 0.9.7.7.eb6ce84.  With this version, you shouldn't have to restart any of your loading clients during failover and you should be able to re-add any failed RangeServer machines without any problems.  Please run your test again and report back.  As soon as I hear word from you, I'll cut the 0.9.7.8 release.  Thanks again for all of your help!

- Doug




David

Jun 26, 2013, 9:03:15 PM
to hyperta...@googlegroups.com, do...@hypertable.com
OK, I will try it out at once.

David

Jun 29, 2013, 11:28:53 PM
to hyperta...@googlegroups.com, do...@hypertable.com
I have tried it out.
As you said, everything is fine except that the writing application, which uses the native API, has to be restarted. After failover recovery succeeds, the writing application reports the following INFO message:
1372560955 INFO Pre_CDR_Pro : (/root/src/hypertable/src/cc/AsyncComm/ConnectionManager.cc:218) Connection attempt to RangeServer at rs8 failed - COMM invalid proxy.  Will retry again in 3000 milliseconds...

'Pre_CDR_Pro' is the name of my writing application, and rs8 is the machine that was killed for the failover test. The INFO message disappears only after I restart the writing application.

Doug Judd

Jun 30, 2013, 10:14:31 PM
to hypertable-user
Hi David,


Awesome, thanks for running your test and reporting back.  Are you sure you re-built Pre_CDR_Pro with the new version?  When I look at line 218 of ConnectionManager.cc in version 0.9.7.7.eb6ce84, I see no logging statement.  Can you double-check that you built Pre_CDR_Pro against 0.9.7.7.eb6ce84?  If not, try the test again and let us know if the problem is fully resolved.  Thanks.

- Doug




David

Jul 2, 2013, 8:57:48 PM
to hyperta...@googlegroups.com, do...@hypertable.com
Doug, you are right.
The writing application had indeed been built against the previous version. When I build it against version 0.9.7.7.eb6ce84, everything is fine.
Congratulations, the failover function of version 0.9.7.7.eb6ce84 is fairly robust.

Doug Judd

Jul 4, 2013, 12:51:59 AM
to hypertable-user
Thanks David, I appreciate your reporting these problems and helping us to verify the fixes.  I give you a lot of credit for helping us to stabilize Hypertable.  If you encounter any more stability problems, please let us know.  Thanks!

- Doug


