ALTQ/CBQ Performance

Test Date: 97/08/15

Test System Configuration

The following measurements were taken with three Pentium Pro machines (all 200MHz, 440FX chipset) running FreeBSD-2.2.2 with altq-0.3.2.

Host A is the source, host B is the router, and host C is the sink. CBQ is enabled only on the interface of host B connected to host C. The link between host A and host B is 155Mbps ATM. The link between host B and host C is either 155Mbps ATM, 10baseT, or 100baseT. When 10baseT is used, a dumb hub is inserted. When 100baseT is used, the hosts are connected directly by a crossover cable and the interfaces are set to full-duplex mode. Efficient Networks, Inc. ENI-155p cards are used for ATM, and Intel EtherExpress Pro/100B cards are used for 10baseT and 100baseT.

Most of the tests use the TCP test mode of the Netperf benchmark program.


Figure 1


Throughput Overhead

Table 1 shows TCP throughput with CBQ on and off over different link types.

CBQ is configured with three classes, and there is no background traffic.

No throughput overhead from CBQ is observed, because the CBQ packet processing for the next packet can overlap with the transmission of the previous packet.

Table 1. Throughput over Different Data Links

  device        cbq off (Mbps)   cbq on (Mbps)
  ATM                 133.20          133.31
  10baseT               6.39            6.46
  100baseT             93.04           92.89
  local loop          326.58          302.30


Latency Overhead

Tables 2 and 3 show the CBQ latency overhead for ATM and 10baseT. In this test, request/reply style transactions are performed over UDP, and the test measures how many transactions can be completed per second. The rightmost column shows the average round-trip time. From the tables, the overhead per packet is about 10 microseconds.
Table 2. Latency over ATM

  CBQ   request (bytes)   reply (bytes)   trans. per sec   RTT (usec)
  off           1               1              2821.89          354
  on            1               1              2744.03          364
  off          64              64              2301.06          435
  on           64              64              2243.14          446
  off        1024              64              1476.31          677
  on         1024              64              1454.39          688
  off        8192              64               349.59         2534
  on         8192              64               392.76         2546


Table 3. Latency over 10baseT

  CBQ   request (bytes)   reply (bytes)   trans. per sec   RTT (usec)
  off           1               1              2277.37          439
  on            1               1              2234.13          448
  off          64              64              1800.75          555
  on           64              64              1768.02          566
  off        1024              64               681.05         1468
  on         1024              64               676.45         1478
  off        8192              64               116.64         8573
  on         8192              64               116.67         8571
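
The RTT column is simply the inverse of the transaction rate, and the roughly 10 microsecond per-packet overhead quoted above is the difference between the "on" and "off" round-trip times. The small C program below is a sketch of that arithmetic (not part of ALTQ); the transaction rates are taken from the 1-byte rows of Tables 2 and 3.

    /*
     * Sketch: reproduce the RTT column of Tables 2 and 3 from the measured
     * transaction rates, and estimate the CBQ overhead as the difference
     * between the "on" and "off" round-trip times.
     */
    #include <stdio.h>

    int main(void)
    {
        /* transactions per second, CBQ off / CBQ on */
        double atm_off = 2821.89, atm_on = 2744.03;   /* Table 2, 1-byte rows */
        double eth_off = 2277.37, eth_on = 2234.13;   /* Table 3, 1-byte rows */

        /* average round-trip time in microseconds = 1e6 / (trans. per sec) */
        double atm_rtt_off = 1e6 / atm_off;           /* ~354 usec */
        double atm_rtt_on  = 1e6 / atm_on;            /* ~364 usec */
        double eth_rtt_off = 1e6 / eth_off;           /* ~439 usec */
        double eth_rtt_on  = 1e6 / eth_on;            /* ~448 usec */

        printf("ATM:     off %.0f usec, on %.0f usec, overhead %.1f usec\n",
               atm_rtt_off, atm_rtt_on, atm_rtt_on - atm_rtt_off);
        printf("10baseT: off %.0f usec, on %.0f usec, overhead %.1f usec\n",
               eth_rtt_off, eth_rtt_on, eth_rtt_on - eth_rtt_off);
        return 0;
    }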

Bandwidth Allocation

Figure 2 shows the bandwidth allocation performance. TCP throughputs are measured when a class is allocated from 5% to 95% of the link bandwidth.

The plot for the 10baseT case is scaled by a factor of 10 so that it fits on the same graph.

The plots labeled "-FIFO" show the throughput of the original FIFO queueing (CBQ disabled).


Figure 2


As can be seen from the graph, the allocated bandwidth changes linearly for ATM and 10baseT, but not so well for 100baseT. The problem in the 100baseT case is caused by the timer granularity. Most Unix systems use a 10 msec timer by default, and CBQ uses 20 msec (2 ticks) as its minimum timer interval, since a single tick can be arbitrarily short. In CBQ, a class can send at most "maxburst" back-to-back packets. If a class sends "maxburst" back-to-back packets at the beginning of a 20 msec cycle, the class is suspended and will not be resumed until the next timer event, unless some other event triggers CBQ. If this situation continues, the transfer rate becomes

rate = packetsize * maxburst * 8 / 0.02    (bits per second)

Now assume maxburst is 16 (the default) and the packet size equals the MTU. For Ethernet, whose MTU is 1500 bytes, the calculated rate is 9.6Mbps. For ATM, whose MTU is 9180 bytes, the calculated rate is 58.8Mbps. This makes it difficult to handle a 100Mbps 100baseT link, whose MTU (1500 bytes) is small compared with its bandwidth.
To back up this theory, I also tested a kernel whose timer granularity was changed to 1kHz, plotted as "100baseT-1kHz". With this kernel, the calculated rate becomes 96Mbps.
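
The numbers above follow directly from the formula. The short C sketch below (not ALTQ code) computes the timer-limited rate for the three cases; the 2 msec interval for the 1kHz kernel is an assumption that CBQ's minimum timer remains two clock ticks.

    /*
     * Sketch: worst-case transfer rate when a class sends "maxburst"
     * back-to-back MTU-sized packets per timer interval and then stays
     * suspended until the next timer event:
     *
     *     rate (bits/sec) = packetsize * maxburst * 8 / interval
     *
     * The 0.002 sec interval assumes CBQ's minimum timer is two clock
     * ticks, i.e. 2 msec with a 1kHz kernel timer.
     */
    #include <stdio.h>

    static double timer_limited_rate(double mtu_bytes, double maxburst,
                                     double interval_sec)
    {
        return mtu_bytes * maxburst * 8.0 / interval_sec;   /* bits per second */
    }

    int main(void)
    {
        double maxburst = 16;    /* CBQ default */

        printf("Ethernet, 20 msec timer: %.1f Mbps\n",
               timer_limited_rate(1500, maxburst, 0.020) / 1e6);   /*  9.6 */
        printf("ATM,      20 msec timer: %.1f Mbps\n",
               timer_limited_rate(9180, maxburst, 0.020) / 1e6);   /* 58.8 */
        printf("Ethernet,  2 msec timer: %.1f Mbps\n",
               timer_limited_rate(1500, maxburst, 0.002) / 1e6);   /* 96.0 */
        return 0;
    }
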
Depending solely on the kernel timer is, however, the worst case. In more realistic settings, there are other flows whose packets can trigger CBQ to recalibrate sending rates.

Also, TCP ACKs can be a good trigger, since TCP receives an ACK for every two packets in the steady state. This is why the ATM case scales beyond 58.8Mbps.


Bandwidth Guarantee

Figure 3 shows the bandwidth guarantee performance. Four classes, allocated 10Mbps, 20Mbps, 30Mbps and 40Mbps, are defined. A TCP flow matching the default class is sent throughout the test period. Four flows, each corresponding to one of the defined classes, start with a 5-second delay between them.

To avoid oscillation caused by process scheduling, the class-0 and class-2 flows are sent from host B, and the other three flows are sent from host A.

The cbqprobe tool included in the altq release is used to collect CBQ statistics every 400 msec, and the cbqmonitor tool, also included in the release, is used to generate the graph.

As can be seen from the graph, each class gets its allocated share and is not affected by the other traffic. Also note that the background flow gets the remaining bandwidth.
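
For example, if this test runs over the ATM link, which carries about 133Mbps of TCP traffic (Table 1), the four guaranteed classes together account for 10 + 20 + 30 + 40 = 100Mbps, leaving roughly 33Mbps for the default-class flow.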


Figure 3



Figure 4 shows the trace of the same scenario run with CBQ disabled.


Figure 4



Link Sharing by Borrowing

Link Sharing Configuration

The setting is similar to the two-agency configuration used by Sally Floyd in her papers. The class hierarchy is defined as shown in Figure 5.


Figure 5

Four TCP flows are generated as shown in Figure 6. Agency X is emulated by host B and agency Y by host A. Each TCP flow tries to send at its maximum rate and has some idle periods.


Figure 6

Traffic Trace

Figure 7 is generated in the same way as described for Figure 3.

Both agencies get their shares most of the time, but the high-priority class-4 gets more than its share in some situations.


Figure 7


Figure 8 shows the trace where all the classes are set to the same priority. Now the class-4 problem is improved. The combination of priority and borrowing seems to need some refinement.


Figure 8


Figure 9 shows the trace of the same scenario run with CBQ disabled.


Figure 9
