Hi Doug,
I encountered the same problem again and found some new hints.
When I upgraded the cluster from 0.9.7.8 to 0.9.7.10 and restarted it yesterday, the same problem appeared on most of the RangeServers. That is, 'cap start' succeeded, but several minutes later the load on many RangeServers became very high according to 'top' (even reaching 80+), and then the RangeServers terminated, reporting the same exception in the log.
I increased the HDFS read timeout (the default is 60s) and decreased HdfsBroker.Workers and Hypertable.RangeServer.Workers. This time the cluster restarted successfully, and the exception above also disappeared from the log. By the way, the writing application kept running during the whole process, so I am not sure whether that could have affected RangeServer startup.
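In case it helps, here is a sketch of the kind of settings I changed. The values are illustrative, not my exact ones, and the HDFS timeout property name depends on the Hadoop version (dfs.socket.timeout in older releases, dfs.client.socket-timeout in newer ones):

```
<!-- hdfs-site.xml: raise the HDFS client read timeout (default 60000 ms) -->
<property>
  <name>dfs.socket.timeout</name>
  <value>180000</value>
</property>
```

```
# hypertable.cfg: reduce worker thread counts (illustrative values)
HdfsBroker.Workers=20
Hypertable.RangeServer.Workers=20
```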
A few hours later, I found the write speed too slow to tolerate, and as of today it is still too slow. So I stopped all the writing applications, restored HdfsBroker.Workers and Hypertable.RangeServer.Workers to their original values, and restarted the cluster again. Although this restart was very successful, the write speed is still slow: a little faster than yesterday's, but only about half of the previous speed. The writing application is based on version 0.9.7.8; I don't know whether that could be slowing down the writes.
In short, I find that the 'All datanodes *.*.*.*:50010 are bad. Aborting' problem is related to the high load. I hope you can offer guidance on the two questions described above.
Thanks a lot.