Chapter 21

p. 300
The calculation of RTO is re-described in RFC 2988. This states that the RTO should always be at least one second (older implementations with a 500ms timer essentially did this anyway, but it is now a requirement).
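A sketch of the RFC 2988 rules (variable names follow the RFC; G is the clock granularity and K=4; the function and defaults are mine, not from any particular stack):

    def update_rto(srtt, rttvar, sample, g=0.1, k=4):
        if srtt is None:                    # first RTT measurement
            srtt, rttvar = sample, sample / 2
        else:                               # gains 1/4 and 1/8 (RFC 2988)
            rttvar = 0.75 * rttvar + 0.25 * abs(srtt - sample)
            srtt = 0.875 * srtt + 0.125 * sample
        # RFC 2988: the RTO must be at least one second.
        return srtt, rttvar, max(srtt + max(g, k * rttvar), 1.0)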
p. 301
RFC 2988 makes it a requirement to use Karn's algorithm for deciding which samples to take. The only exception is when TCP timestamps (see section 24.5) are used, in which case there is no ambiguity.
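A sketch of the sampling rule (the segment and ACK fields here are illustrative, not from any particular stack):

    # Karn's algorithm: never take an RTT sample from a segment that
    # was retransmitted, since the ACK cannot be matched to one
    # particular transmission.  With timestamps (RFC 1323) the echoed
    # value identifies the transmission, so the ambiguity never arises.
    def rtt_sample(seg, ack, now, timestamps_enabled):
        if timestamps_enabled:
            return now - ack.ts_ecr
        if seg.retransmitted:
            return None                     # discard ambiguous sample
        return now - seg.send_time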
Sections 21.6-8
These became RFC 2001, which was later re-written into RFC 2581, which can be viewed as an upgrade of RFC 1122 (Host Requirements).
p. 310
Congestion avoidance assumes that all packet losses are caused by congestion. If this is not the case, then, as pointed out in RFC 3155,
TCP connections experiencing high error rates on their paths interact badly with Slow Start and with Congestion Avoidance, because high error rates make the interpretation of losses ambiguous - the sender cannot know whether detected losses are due to congestion or to data corruption. TCP makes the "safe" choice and assumes that the losses are due to congestion.
p. 310
RFC 2414 suggests experimental initial values of cwnd between two and four segments (but never more than $4380=3*1460$ bytes). This is not yet standard, but RFC 2581 permits two segments, rather than the one in the book/RFC 2001. [8] shows that, as of May 2001, over 90% of Web servers sampled used 2MSS as the initial value.
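RFC 2414's bound can be written as min(4*MSS, max(2*MSS, 4380 bytes)):

    def initial_cwnd(mss):
        # Experimental initial window from RFC 2414; plain RFC 2581
        # allows at most two segments.
        return min(4 * mss, max(2 * mss, 4380))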

TCP Vegas (see [3]) makes a slight change by only deciding that congestion has occurred (point 3) if the segment referred to was transmitted after the last time congestion was detected. Otherwise it is possible for one over-estimate of the sending rate to generate multiple decreases.
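A sketch of the Vegas guard (the connection fields are illustrative, and the decrease shown is the generic halving):

    # Treat a loss as a fresh congestion signal only if the lost
    # segment was sent after the last window decrease; otherwise one
    # burst of losses would halve cwnd several times over.
    def on_congestion_detected(seg, conn, now):
        if seg.send_time > conn.last_decrease:
            conn.cwnd = max(conn.cwnd // 2, 2 * conn.segsize)
            conn.last_decrease = now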

p. 311
At the top of the page, it is stated that cwnd be updated by $1/cwnd$ segments (i.e. segsize/cwnd bytes) every time an ACK is received. RFC 2581 clarifies this to say that the updating happens for every non-duplicate ACK received. This formula may appear mysterious, but the aim is to increase cwnd by one segment per round-trip time, and we estimate that there are cwnd/segsize packets in transit.
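In bytes this is the well-known increment segsize*segsize/cwnd; a minimal sketch (contrasted with slow start; the function is mine):

    def on_new_ack(cwnd, ssthresh, segsize):
        if cwnd < ssthresh:
            return cwnd + segsize                   # slow start
        # Congestion avoidance: about cwnd/segsize ACKs arrive per
        # RTT, so this adds roughly one segment per round-trip time.
        return cwnd + segsize * segsize // cwnd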
p. 312
One theoretical problem with the algorithm here is that it assumes that at least the lost packet, plus three more to generate duplicate ACKs, can be in transit simultaneously. Because of ``slow start'', this will often not be the case at the start of a session, or for a small session such as HTTP often generates. A further practical problem with the standard BSD (including BSD Reno) algorithm is that its definition of when to do a standard re-transmit is based on the RTO estimate calculated from the 500ms timer. [3] notes that it took 1100ms to spot a time-out, on average, whereas the figure calculated from a perfect clock would be less than 300ms (but note that RFC 2988 mandates a minimum of 1 second for the RTO). They therefore propose a variant, nick-named Vegas, which does the following: when a duplicate ACK arrives, a fine-grained clock is consulted, and if more than a (fine-grained) time-out has elapsed since the segment was sent, it is re-transmitted at once, without waiting for the third duplicate ACK; the same check is applied to the first one or two non-duplicate ACKs after a re-transmission, to catch further losses.
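A sketch of the duplicate-ACK part of this (per [3]; the fine-grained clock and the connection fields are illustrative):

    # Vegas: on each duplicate ACK, compare a fine-grained clock
    # against the send time of the first unacknowledged segment,
    # instead of waiting for the third duplicate ACK or the coarse
    # 500ms timer.
    def vegas_on_dup_ack(conn, now):
        seg = conn.first_unacked()
        if now - seg.send_time > conn.fine_grained_rto:
            conn.retransmit(seg)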

The ``fast recovery'' algorithm mentioned in the book, and the variant mentioned above, are fine if only one packet is ever lost at a time (meaning during an RTT). However, we note that a full RTT (plus the arrival times of the packets to generate the duplicate ACKs) is required to get an ACK back acknowledging data beyond the lost segment. If there is more than one lost segment per RTT, the process has to start again immediately (the receiver knows that there is another gap, but can't signal it until all the data up to the gap has been received, since an ACK acknowledges all data up to that point). In practice, what happens is that the original re-transmit algorithm kicks in, and we re-transmit a lot of unnecessary data. RFC 2582 proposes what to do in these circumstances (see steps 1A and 6 in its procedures). RFC 2582 implementations are, according to [8], the commonest kind of webserver, at 40%, or more if one includes Windows implementations with a bug in the handling of small pages.
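A sketch of the RFC 2582 (``NewReno'') idea, with the bookkeeping (window inflation, the step numbering of the RFC) omitted and names illustrative:

    # Remember the highest sequence number sent when fast retransmit
    # starts ('recover').  An ACK that advances the ACK point but not
    # past 'recover' (a ``partial ACK'') reveals another hole in the
    # same window, so retransmit it at once rather than waiting for
    # three more duplicate ACKs or a time-out.
    def on_ack_in_fast_recovery(ack_seq, conn):
        if ack_seq >= conn.recover:
            conn.exit_fast_recovery()
        else:
            conn.retransmit(conn.first_unacked())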

``Selective acknowledgements'', originally proposed in RFC 1072 and revised to a proposed standard in RFC 2018, solve this problem by adding a new TCP option, which says ``in addition to all the bytes up to the ACK value, I also acknowledge that I have the following blocks of bytes''. Once this has been received, the sender knows precisely which blocks to re-transmit. Since this is a TCP option, it has to fit in the space allowed by the TCP header length, whose maximum value is 15 words (60 bytes). This leaves room in principle for selective acknowledgement of four blocks. However, if we are using this option, it is because we have a long fat network, so we probably also have the timestamp option (pp. 349-351) enabled, which cuts down the available space to leave room for three blocks. A block may be more than one segment, of course, and it is likely on a long fat network that losses are bursty: a router has a momentary overload and drops several packets in quick succession. According to [8], 40% of webservers claim to support selective acknowledgements, but in fact only 40% of these actually make use of the information. RFC 2757 recommends selective acknowledgements in the wireless setting.
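The option-space arithmetic, as a sketch: the option carries two bytes of kind/length overhead, and each block is a pair of 32-bit sequence numbers:

    def max_sack_blocks(timestamps_enabled=True):
        space = 60 - 20          # at most 40 bytes of TCP options
        if timestamps_enabled:
            space -= 12          # timestamp option, padded (RFC 1323)
        return (space - 2) // 8  # 2 bytes kind/length, 8 bytes a block
    # max_sack_blocks(False) == 4; max_sack_blocks(True) == 3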

RFC 2883 extends the proposal for selective acknowledgements in an upwards-compatible way, to allow the receiver to specify that duplicate segments have been received. This may help performance in the presence of re-ordering (which otherwise is hard to distinguish from duplication). The extension works by reporting, in the first SACK block, the duplicate just received, which may lie at or below the main TCP ACK field, whereas SACK as originally specified only reports blocks beyond the main TCP ACK field (i.e. holes in the received data).
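A sketch of the receiver's side (names illustrative):

    # DSACK (RFC 2883): the first block reports the duplicate just
    # received, even though it lies at or below the ACK point; the
    # remaining blocks are ordinary SACK blocks.  An old sender sees
    # an already-acknowledged block and simply ignores it, which is
    # what makes the extension upwards-compatible.
    def build_sack_blocks(duplicate_block, ordinary_blocks):
        blocks = [duplicate_block] if duplicate_block else []
        return blocks + ordinary_blocks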

p. 312
RFC 2581 slightly revises step 1. Rather than cwnd/2, it mandates half the number of unacknowledged bytes (which may be different from cwnd, especially if two fast-retransmits happen in succession).
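As a one-liner (RFC 2581's equation, with the floor of two segments it also mandates):

    def new_ssthresh(flight_size, segsize):
        # Halve the amount of outstanding data, not cwnd, and never
        # go below two segments.
        return max(flight_size // 2, 2 * segsize)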
p. 312
A further problem with ``fast retransmit'' is that there have to be three duplicate ACKs, which means that three more segments have to arrive at the receiver to cause these duplicate ACKs. This means that the data must exist, and the congestion window must be large enough to permit it to be sent. In the case of a web server, with lots of small transactions, this may well not be the case, and indeed [1] observes that 56% of retransmissions were caused by a time-out, and only 44% by ``fast retransmit''. RFC 3042 (``Limited Transmit'') proposes some modifications to deal with this problem, notably by responding to the first two duplicate ACKs by sending new data even if this breaches the cwnd rules. This would have converted 25% of the RTO-based retransmissions into ``fast retransmit'' retransmissions.
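A sketch of the Limited Transmit response (the connection methods are illustrative):

    def on_dup_ack(conn, dup_count):
        if dup_count <= 2:
            # RFC 3042: send one new segment per duplicate ACK if the
            # receiver's advertised window allows it, even though cwnd
            # would normally forbid sending.
            if conn.has_new_data() and conn.receiver_window_allows():
                conn.send_new_segment()
        elif dup_count == 3:
            conn.fast_retransmit()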

Emil Naepflein (Emil.Naepflein@philosys.de) also observes the following.

In the early days the windows were often so small that only about 4 packets were in transit most of the time. If you go much higher [than three duplicate ACKs before invoking fast retransmit] with the fast-retransmit trigger you will probably never get any benefit from it. I very often noticed that even 3 duplicate ACKs was too high in case of transmitting packets over a lossy link with high delay and large MTUs.

The fast-retransmit can only achieve good performance if there are a lot of packets in transit so that you get enough duplicate ACKs before you are running out of the congestion window. Otherwise you will get some stop-and-go behaviour with bad performance.
p. 314
RFC 2525 shows that the extra additive term segsize/8 (256/8 in this case) in the congestion-avoidance increase of cwnd is bad, and RFC 2581 forbids it.
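In code terms (a sketch; the flag exists only to show the contrast):

    def increase(cwnd, segsize, with_extra_term=False):
        # The book's BSD code adds an extra segsize/8 on every update,
        # which makes the growth super-linear; RFC 2581 allows only
        # the segsize*segsize/cwnd term.
        extra = segsize // 8 if with_extra_term else 0
        return cwnd + segsize * segsize // cwnd + extra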
p. 316
Storing per-route metrics is very valuable for some higher-level protocols such as FTP (where multiple files can be fetched, each with a different TCP connection -- see pp. 419-439) and HTTP, especially to a Web cache.

James Davenport 2004-03-09