sFlow for queue length monitoring

Sonia Panchen

Jul 4, 2011, 5:27:17 AM
to sFlow
An important aspect of managing network performance is understanding
congestion in a forwarding device. Monitoring queue depth is a
practical way to understand whether buffer or queue resources are
over- or under-utilised, and to infer queuing delay.

Here are some suggestions for using sFlow to monitor queue length. Any
comments?

Exporting the queue depth experienced by a sampled packet, as an
extended flow structure, is ideal and enables scalable analysis of
queuing delay experienced by classes of traffic.

In addition, in order to monitor device congestion and resource usage,
a device can maintain a number of counters for each queue. Each
counter represents a range of queue lengths. When a packet is enqueued
the counter corresponding to the queue length experienced by the
packet is incremented. These counters are then exported as an sFlow
extended counter sample structure.

/* Extended queue length data
   Used to indicate the queue length experienced by the sampled packet.
   If the extended_queue_length record is exported, queue_length counter
   records must also be exported with the if_counter record. */

/* opaque = flow_data; enterprise = 0; format = 1019 */

struct extended_queue_length {
   unsigned int queueIndex;  /* persistent index within port of the queue
                                used to enqueue the sampled packet.
                                The ifIndex of the port can be inferred
                                from the data source. */
   unsigned int queueLength; /* length of queue, in segments, experienced
                                by the packet (ie queue length immediately
                                before the sampled packet is enqueued). */
}

/* Queue length counters
   Histogram of queue lengths experienced by packets when they are
   enqueued (ie queue length immediately before the packet is enqueued).
   Queue length is measured in segments occupied by the enqueued packets.
   Queue length counter records for each of the queues on a port must be
   exported with the generic interface counters record, if_counters, for
   the port. */

/* Queue length histogram counters
   opaque = counter_data; enterprise = 0; format = 1003 */

struct queue_length {
   unsigned int queueIndex;      /* persistent index of queue within port */
   unsigned int segmentSize;     /* size of queue segment in bytes */
   unsigned int queueSegments;   /* total number of segments allocated
                                    (ie available) to this queue */
   unsigned int queueLength0;    /* queue is empty when a packet is
                                    enqueued */
   unsigned int queueLength1;    /* queue length == 1 segment when a
                                    packet is enqueued */
   unsigned int queueLength2;    /* queue length == 2 segments when a
                                    packet is enqueued */
   unsigned int queueLength4;    /* 2 segments < queue length <= 4 segments
                                    when a packet is enqueued */
   unsigned int queueLength8;    /* 4 segments < queue length <= 8 segments
                                    when a packet is enqueued */
   unsigned int queueLength32;   /* 8 segments < queue length <= 32
                                    segments when a packet is enqueued */
   unsigned int queueLength128;  /* 32 segments < queue length <= 128
                                    segments when a packet is enqueued */
   unsigned int queueLength1024; /* 128 segments < queue length <= 1024
                                    segments when a packet is enqueued */
   unsigned int queueLengthMore; /* queue length > 1024 segments when a
                                    packet is enqueued */
   unsigned int dropped;         /* count of packets intended for this
                                    queue that are dropped on enqueuing */
}
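
To make the bucket boundaries concrete, here is a minimal sketch
(illustrative only, not part of the proposed structures) of how an
agent might update these counters on enqueue:

#include <stdint.h>

/* Mirrors the queue_length histogram above: buckets for 0, 1 and 2
   segments, then (2,4], (4,8], (8,32], (32,128], (128,1024] and
   >1024 segments. */
struct queue_stats {
    uint32_t buckets[9]; /* queueLength0 .. queueLengthMore */
    uint32_t dropped;
};

/* Called on every enqueue with the queue length (in segments)
   immediately before the packet is added. */
static void record_enqueue(struct queue_stats *qs, uint32_t len)
{
    int i;
    if      (len == 0)    i = 0;
    else if (len == 1)    i = 1;
    else if (len == 2)    i = 2;
    else if (len <= 4)    i = 3;
    else if (len <= 8)    i = 4;
    else if (len <= 32)   i = 5;
    else if (len <= 128)  i = 6;
    else if (len <= 1024) i = 7;
    else                  i = 8;
    qs->buckets[i]++;
}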

rick jones

Jul 4, 2011, 12:39:27 PM
to sf...@googlegroups.com

On Jul 4, 2011, at 2:27 AM, Sonia Panchen wrote:

> An important aspect of managing network performance is understanding
> congestion in a forwarding device. Monitoring queue depth is a
> practical way to understand whether buffers or queue resources are
> over or under utilised and inferring queuing delay.

I like using an outbound queue length, and during parts of my "day job" suggest just that. Alas, for some reason the SNMP folks decided to deprecate outQueueLen from the interface MIB - something about "by the time the NMS sees it, it is already old" - of course, that holds true for just about anything queried via SNMP, and presumably anything defined as a GAUGE.

> Here are some suggestions for using sFlow to monitor queue length. Any
> comments?
>
> Exporting the queue depth experienced by a sampled packet, as an
> extended flow structure, is ideal and enables scalable analysis of
> queuing delay experienced by classes of traffic.

I suppose that depends on just where the packet is sampled. If it is sampled on "inbound" it will not have yet experienced any queuing the sampler can discern. If it is sampled on "outbound" past the queue (say on transmit completion) the information is already toast unless the packet descriptor was tagged with a length upon enqueuing. Only if it is sampled at the time of queuing for outbound can the frame be tagged with a queue depth without having to "remember" it.

Do the sFlow specs mandate or even suggest "where" the sample point should be?

> /* Queue length counters
> Histogram of queue lengths experienced by packets when they are
> enqueued (ie queue length immediately before packet is enqueued)
> thus giving the queue lengths experienced by each packet.
> Queue length is measured in segments occupied by the enqueued
> packets.
> Queue length counter records for each of the queues for a
> port must be exported with the generic interface counters
> record, if_counters, for the port.*/
>
> /* Queue length histogram counters
> opaque = counter_data; enterprise = 0; format = 1003 */
>

If one is including a queue length with counters, is there really a need to transmit an entire histogram? If there is indeed some randomization of the counter sampling interval (perhaps even if not), then presumably the collector can keep a histogram of the individual queue length values it has seen. Further, ostensibly the dropped stat is redundant with ifOutDiscards already present in the generic counters, no?

While all the world is not IP (more's the pity ?-), if an IP datagram encounters congestion, isn't an ECN bit supposed to be set these days? There is still something of a race condition between the setting of ECN and the sample point, but it doesn't require any further sFlow enhancements - just for the collector to check the sampled headers for ECN bits.

rick jones

Peter Phaal

Jul 4, 2011, 10:45:18 PM
to sf...@googlegroups.com
> I suppose that depends on just where the packet is sampled.  If it is sampled on "inbound" it will not have yet
> experienced any queuing the sampler can discern.  If it is sampled on "outbound" past the queue (say on transmit
> completion) the information is already toast unless the packet descriptor was tagged with a length upon enqueuing.
> Only if it is sampled at the time of queuing for outbound can the frame be tagged with a queue depth without having to
> "remember" it.

The queue length needs to be captured at the point the packet is
enqueued, but it is possible that the packet could have been marked
for sampling on ingress. The only way that you can accurately
associate the queue length with the sampled packet is if the hardware
provides explicit support for the feature, but there are many possible
implementations.

>
> Do the sFlow specs mandate or even suggest "where" the sample point should be?
>

sFlow permits ingress, egress or bidirectional sampling. Switch
designers are free to choose the location for sampling that best fits
their forwarding architecture.

> If one is including a queue length with counters, is there really a need to transmit an entire histogram?  If there is
> indeed some randomization of the counter sampling interval (perhaps even if not), then presumably the collector can
> keep a histogram of the individual queue length values it has seen.

The set of counters needs to be maintained on the switch since longer
queue sizes should be rare and you don't want to miss them. Exporting
the full set of counters in the histogram allows the sFlow analyzer to
see if any packets experienced queueing delays, irrespective of the
polling intervals and packet rates.

>Further, ostensibly the dropped stat is redundant with ifOutDiscards already present in the generic counters no?

Not necessarily, there may be 4 or 8 queues per interface and you want
to know discards by queue. The sum of all the discards across all the
queues on the interface should add up to ifOutDiscards.

Kenneth Duda

Jul 15, 2011, 12:19:51 PM
to sFlow
There are two mechanisms proposed:
(a) attach queue length snapshots to sampled packets;
(b) build a histogram of queue length at time of enqueue for all
packets.

This does not add up for me.

The problem with (a) is that it is not very useful. Congestion can be
transient, particularly in the high-frequency trading market segment,
which is a segment that values visibility into queuing delay very
highly. In that market segment, congestion events that last 10's of
microseconds are interesting. You would need a sample rate of 100,000
frames per second to get a 50% chance of seeing a 10-usec congestion
event. This is totally impractical. Even in a more traditional data
center or even enterprise context, you would need a lot of data to
assess how close you are to losing packets during bursts, because
switch buffers are small and the probability of seeing congestion due
to a burst is small. This is the reason to drop outQueueLength from
SNMP. It is useless to sample the output queue length at SNMP
timeframes (100's of milliseconds) because the probability of seeing
congestion is basically zero in almost any data center context.
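
As a rough check on that figure, assuming the sampling instants form a
Poisson process (a modeling assumption made here for illustration):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double r = 100000.0; /* samples per second */
    double d = 10e-6;    /* 10-microsecond congestion event */
    /* P(at least one sample lands inside the event) = 1 - exp(-r*d) */
    printf("P = %.2f\n", 1.0 - exp(-r * d)); /* prints ~0.63 */
    return 0;
}

which is the same order of magnitude as the 50% figure.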

(b) seems very useful, but the problem with (b) is that no ASIC that I
know of supports this. (This is actually a problem with (a) also, but
if you accept that (a) is not very useful, it doesn't really matter
whether ASICs support it.)

What the ASICs I'm aware of actually support is:
(1) thresholds: the ability to notify software when queue lengths
exceed thresholds
(2) tracking worst-case queue length: recording the maximum queue
length over software-compatible timeframes (milliseconds)
So what I would recommend here is:

(1) adding a configurable threshold and a counter of over-threshold
events; and,

(2) adding to the "queue length histogram counters" the notion that
the counter might not count every packet, but instead count the worst
case experienced by any packet over a fixed time window (sampling
period). For example, if the sampling period is one millisecond, then
the histogram counters would increment one counter each millisecond
based on the maximum length of the queue over that millisecond.
Hardware maintains the maximum queue length; software polls and
maintains the histogram.
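
A sketch of what that polling scheme might look like in software.
hw_read_and_clear_max() is a hypothetical ASIC hook, bucket_of() is
the same bucket mapping as the queue_length structure above, and the
histogram here counts sampling periods rather than packets:

#include <stdint.h>

/* Hypothetical hardware hook: returns the maximum queue length (in
   segments) seen since the last call, and resets the maximum. */
extern uint32_t hw_read_and_clear_max(int queue);

/* Same bucket mapping as the queue_length structure: 0, 1, 2,
   (2,4], (4,8], (8,32], (32,128], (128,1024], >1024 segments. */
extern int bucket_of(uint32_t len);

#define NQUEUES  8
#define NBUCKETS 9
static uint32_t hist[NQUEUES][NBUCKETS];

/* Called once per sampling period (e.g. every millisecond): increment
   one bucket per queue based on the worst case seen in that period. */
static void poll_queues(void)
{
    int q;
    for (q = 0; q < NQUEUES; q++)
        hist[q][bucket_of(hw_read_and_clear_max(q))]++;
}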

Thanks for listening,

-Ken

Kenneth Duda
VP Software Engineering
Arista Networks, Inc.
kd...@aristanetworks.com

Peter Phaal

Jul 15, 2011, 3:48:02 PM
to sf...@googlegroups.com
Ken,

Thanks for the comments. Answers in-line.

On Fri, Jul 15, 2011 at 9:19 AM, Kenneth Duda <kd...@aristanetworks.com> wrote:
> There are two mechanisms proposed:
>   (a) attach queue length snapshots to sampled packets;
>   (b) build a histogram of queue length at time of enqueue for all
> packets.
>
> This does not add up for me.
>
> The problem with (a) is that it is not very useful.  Congestion can be
> transient, particularly in the high-frequency trading market segment,
> which is a segment that values visibility into queuing delay very
> highly.  In that market segment, congestion events that last 10's of
> microseconds are interesting.  You would need a sample rate of 100,000
> frames per second to get a 50% chance of seeing a 10-usec congestion
> event.  This is totally impractical.  Even in a more traditional data
> center or even enterprise context, you would need a lot of data to
> assess how close you are to losing packets during bursts, because
> switch buffers are small and the probability of seeing congestion due
> to a burst is small.  This is the reason to drop outQueueLength from
> SNMP.  It is useless to sample the output queue length at SNMP
> timeframes (100's of milliseconds) because the probability of seeing
> congestion is basically zero in almost any data center context.

Sonia's original message might not have been entirely clear, but the extended_queue_length structure is not exported as a periodic measurement - perhaps you misunderstood? This structure is attached to an sFlow packet sample record, describing a randomly sampled packet (sampled 1-in-N, not based on time) and the queue depth experienced by that packet.

I agree that it is useless to poll a queue depth counter (as is done with SNMP's outQueueLength). However, with the appropriate hardware support, capturing the egress queue length associated with randomly selected packet samples will correctly estimate the queue length distribution (as seen by arriving packets). What packet sampling does is take time out of the equation.

When thinking about packet sampling accuracy in this context, it's better not to think about microbursts and time. The key to modeling queueing systems is to look at the state of the queue at the instant of an arrival. So it's better to think about the problem this way: what is the probability that a packet will see a queue depth of greater than X? If you have 100 samples and none of them have seen a queue depth greater than X, then the probability that a packet will see a queue depth of greater than X is less than 1%.
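
This reasoning can be made precise with a standard bound (assuming
independent samples; the exact threshold depends on the confidence
level you choose):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double n = 100.0;    /* samples, none exceeding depth X */
    double alpha = 0.05; /* 1 - confidence level */
    /* Largest p consistent with zero exceedances: (1-p)^n >= alpha,
       so p = 1 - alpha^(1/n) (the "rule of three": p ~ 3/n at 95%). */
    printf("upper bound on p = %.4f\n", 1.0 - pow(alpha, 1.0 / n));
    return 0;
}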

Queues tend to be empty most of the time, so polling tends to miss the congestion events, as you pointed out. However, if you only make measurements when packets arrive, then you skip over the idle periods and measure the queue as seen by an arriving packet - which is a much more useful measure. Linking the queue depth to the sampled packet header lets you see which applications see deeper queues. For example, an application generating bursts of traffic is likely to see deeper queues than an application that sends periodic messages.

The packet sampling measurement takes a while to converge on the queue depth distribution, but it will converge accurately over time, even at the sampling rates typically used on 10G networks (1 in 10,000). The accuracy of the estimated queue size distribution is a function of the number of samples, so even at these low sampling rates you will still be getting enough samples to produce useful results. Packet sampling is meant to complement the counters maintained in (b). If you detect that there are queuing problems on an interface using the counters, you can look at the sampled data to see what class of traffic is causing the congestion and what other applications are being affected.

In fact, if an interface is seeing congestion, it is likely that there will be a lot of packets, generating many samples, so the packet samples from that interface will quickly lead you to the root cause of the problem.

>
> (b) seems very useful, but the problem with (b) is that no ASIC that I
> know of supports this.  (This is actually a problem with (a) also, but
> if you accept that (a) is not very useful, it doesn't really matter
> whether ASICs support it.)
>
> What ASICs I'm aware of actually support is:
>  (1) thresholds: the ability to notify software when queue lengths
> exceed thresholds
>  (2) tracking worst case queue length: recording maximum queue length
> over software-compatible timeframes (milliseconds)

I think that (1) is all you need to accurately calculate the queue depth distribution.  If you set the threshold to zero (i.e. generate a notification whenever a packet is sent to a non-empty queue) then you can calculate the queue length distribution (provided that the notification tells you the actual depth of the queue - which I believe some ASICs will). In a well provisioned system, most packets don't see any queueing delays, even if you are running at high link utilization. The primary reasons for a queue to build up are a change in link speed, or traffic streams converging on a port. The fraction of packets that will trigger these notifications is going to be very small. If a queue is building up, it is likely only on one port on the switch. The management software on the switch can take the queue events and maintain the histogram counters.

This scheme won't maintain the queueLength0 value; however, there is typically a total-packets counter that can be used to calculate it:
queueLength0 = total_packets - sum(queueLengthN)
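
As a sketch of that derivation (the types and names here are
hypothetical, for illustration):

#include <stdint.h>

struct queue_hist {
    uint64_t total_packets; /* from the interface packet counter */
    uint64_t nonzero[8];    /* queueLength1 .. queueLengthMore, built
                               from the threshold-zero notifications */
};

/* queueLength0 is derived rather than counted directly. */
static uint64_t queue_length0(const struct queue_hist *h)
{
    uint64_t seen = 0;
    int i;
    for (i = 0; i < 8; i++)
        seen += h->nonzero[i];
    return h->total_packets - seen;
}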

I don't think (2) is terribly useful - I guess you could use it to calculate a kind of load average by resetting the high water mark every N seconds and then running the results through an exponential smoothing function.


>
> So what I would recommend here is:
>
>  (1) adding here is a configurable threshold and a counter of over-
> threshold events; and,

The problem with this measurement is that it is hard to see how you would compare different interfaces with different threshold settings. It's a non-linear metric that can't easily be rolled up or aggregated, and it's not entirely clear how you would use it to model performance. As I mentioned above, I believe this hardware capability could be used to calculate the full distribution, which would be a much more interesting measurement. It also has the advantage that it doesn't require any configuration and produces data that can easily be compared.


>
>  (2) adding to the "queue length histogram counters" the notion that
> the counter might not count every packet, but instead count the worst
> case experienced by any packet over a fixed time window (sampling
> period).  For example, if the sampling period is one millisecond, then
> the histogram counters would increment one counter each millisecond
> based on the maximum length of the queue over that millisecond.
> Hardware maintains the maximum queue length; software polls and
> maintains the histogram.

I don't believe that this yields a reasonable approximation to the correct histogram. It's worth thinking about what the best approximation might be. If you kept track of the number of packets and the high water mark in each interval, you could ignore intervals in which no packets were sent - this would improve the accuracy. You need to weight the contribution to the histogram using the packet count in the interval. For example, suppose 100 packets arrived in the millisecond and the maximum queue depth was 50. How do you apportion those 100 packets to the different bins? I don't see a good way to make the update without having to make assumptions about the traffic.

Part of the goal of defining these measurements is to provide guidance to hardware vendors. For example, when the sFlow standard was first published 10 years ago, many ASIC vendors didn't have support for hardware packet sampling. One of the strengths of the sFlow standard has been its focus on defining mathematically rigorous measurements that accurately characterize performance, and on ensuring that all implementations faithfully implement the standard, maximizing interoperability. Today, virtually all ASICs provide hardware packet sampling, giving accurate and consistent multi-vendor measurements.

So, in looking at the queueing metrics, I don't think we should be overly concerned about what current hardware can do. It is worth considering how difficult it would be for an ASIC designer to add the measurement, and making sure that the measurements are easily implementable, but the focus should be on ensuring that the right measurements are made. Today, adding the queue depth to packet sampling would only be possible on software platforms (such as the Open vSwitch), but I think there are at least some hardware platforms that could implement the histogram counters today. Both measurements seem relatively simple to implement in future hardware, and demand for them will increase if they are well defined and easy to interpret.

Peter