I'll answer the specific questions below. If you have a general recommendation on how the docs could be improved, that would be great feedback.
- What properties should I set, and to what values, if I wanted circuits to open only if 100% of the requests fail over any 60s window?
circuitBreaker.errorThresholdPercentage controls the threshold at which an individual circuit trips. Setting this value to 100 would ensure that only a window with all errors would turn a circuit from CLOSED -> OPEN.
metrics.rollingStats.timeInMilliseconds controls the overall window of data that Hystrix examines when evaluating a circuit's health. In your case, you would want 60000.
Please note that circuitBreaker.requestVolumeThreshold is also relevant. If your circuit has not seen this volume of requests in the time window, the circuit will NOT open, regardless of the observed error percentage.
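To make the interaction between these three properties concrete, here is a minimal sketch of the trip decision as described above. It is a simplified model, not Hystrix source code: the class and method names are hypothetical, and the volume threshold of 20 is only an assumed value (it matches the documented default).

```java
// Simplified model of the CLOSED -> OPEN decision; NOT Hystrix source code.
// Class and method names are hypothetical, chosen for illustration only.
final class CircuitTripSketch {

    /**
     * Returns true if a CLOSED circuit would open, given the request counts
     * observed within one metrics.rollingStats.timeInMilliseconds window.
     */
    static boolean shouldTrip(long totalRequests, long failedRequests,
                              int requestVolumeThreshold, int errorThresholdPercentage) {
        // Too little traffic in the window: never trip, whatever the error rate.
        if (totalRequests < requestVolumeThreshold) {
            return false;
        }
        // Guard against division by zero when the volume threshold is 0.
        if (totalRequests == 0) {
            return false;
        }
        int errorPercentage = (int) (100.0 * failedRequests / totalRequests);
        return errorPercentage >= errorThresholdPercentage;
    }

    public static void main(String[] args) {
        int volumeThreshold = 20; // assumed circuitBreaker.requestVolumeThreshold
        int errorThreshold = 100; // circuitBreaker.errorThresholdPercentage
        System.out.println(shouldTrip(50, 50, volumeThreshold, errorThreshold)); // true: every request failed
        System.out.println(shouldTrip(50, 49, volumeThreshold, errorThreshold)); // false: one success, 98% errors
        System.out.println(shouldTrip(5, 5, volumeThreshold, errorThreshold));   // false: below the volume threshold
    }
}
```

The last case is the one that tends to surprise people: 100% errors on a low-traffic circuit will not open it.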
- Do the rollingStats/rollingPercentile timeInMilliseconds and numBuckets have any impact on how circuit health is calculated and when a circuit is tripped? Or are they just used for recording metrics and calculating the percentages shown on the dashboard?
- Are there any guidelines/best practices around choosing values for these properties based on anticipated throughput or other criteria?
The considerations when choosing these values are trading off computation against sensitivity, and then tuning how sensitive you want to be. Given a 10 second window of stats, consider 2 cases: 10 buckets of 1000ms each, and 2 buckets of 5000ms each. The former case triggers data structure updates 5 times as often, but sees old data for longer. If problems develop quickly, having that old data hurts the ability to react quickly. So that's the tradeoff on bucket sizing. For the size of the overall window, you have to consider how much data you want to look at. A 60s window implies more memory usage, computation, and allocations than a 10s window. It also weights older data more heavily, so it reacts more slowly but is not as easily swayed by the most recent data. That could be good or bad, depending on your application.
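The arithmetic behind the two cases can be sketched as follows. This is only an illustration with hypothetical names, not a Hystrix API. The divisibility check mirrors the documented requirement that timeInMilliseconds be evenly divisible by numBuckets.

```java
// Bucket arithmetic for a rolling window; names are hypothetical, not a Hystrix API.
final class BucketSizingSketch {

    /** Width of one bucket in milliseconds for the given window and bucket count. */
    static int bucketSizeMs(int windowMs, int numBuckets) {
        // Hystrix documents that the window must divide evenly into buckets.
        if (numBuckets <= 0 || windowMs % numBuckets != 0) {
            throw new IllegalArgumentException("windowMs must be a positive multiple of numBuckets");
        }
        return windowMs / numBuckets;
    }

    public static void main(String[] args) {
        int windowMs = 10_000;
        int fine = bucketSizeMs(windowMs, 10);   // 1000ms buckets
        int coarse = bucketSizeMs(windowMs, 2);  // 5000ms buckets
        // The bucket rolls once per bucket width, so the update-rate ratio is coarse / fine.
        System.out.println(fine + "ms vs " + coarse + "ms, update ratio " + (coarse / fine));
    }
}
```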
- What's the purpose of the healthSnapshot.intervalInMilliseconds property? Is that the window of time over which the errorThresholdPercentage is checked to determine circuit status, or is it just a throttle of some kind? If it's the latter, what specifically is it protecting against?
Yes, it's a throttle. It keeps every command invocation from recalculating the entire circuit health: the health snapshot is recomputed at most once per interval and reused in between.
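As a rough illustration of that kind of throttle, here is a minimal sketch that caches an expensive calculation and recomputes it at most once per interval. It is not Hystrix's implementation: `ThrottledSnapshot`, the injected clock, and the 500ms interval in the usage example are all assumptions made for the example.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper illustrating the throttling idea; NOT Hystrix source code.
final class ThrottledSnapshot<T> {
    private final long intervalMs;
    private final LongSupplier clockMs;   // injected clock, so the sketch is testable
    private final Supplier<T> recompute;  // the expensive health calculation
    private final AtomicLong lastSnapshotMs;
    private volatile T cached;

    ThrottledSnapshot(long intervalMs, LongSupplier clockMs, Supplier<T> recompute) {
        this.intervalMs = intervalMs;
        this.clockMs = clockMs;
        this.recompute = recompute;
        this.lastSnapshotMs = new AtomicLong(clockMs.getAsLong());
        this.cached = recompute.get(); // take one snapshot up front
    }

    /** Returns the cached value, recomputing only if the interval has elapsed. */
    T get() {
        long last = lastSnapshotMs.get();
        long now = clockMs.getAsLong();
        // compareAndSet lets exactly one caller win the right to recompute.
        if (now - last >= intervalMs && lastSnapshotMs.compareAndSet(last, now)) {
            cached = recompute.get();
        }
        return cached;
    }
}
```

With an interval of 500ms, for example, any number of `get()` calls inside the same half second would share one calculation, which is the protection described above for high-volume circuits.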