During an off-line discussion with Florian - one of the main developers of TCP SYN cookies - I was a little skeptical about the mechanism and its interplay with the TCP window scaling option.
First I will describe these two mechanisms; later on I will discuss their relationship and interplay. At the end I will discuss the regression and possible solutions. I moved this off-line discussion to the kernel mailing list because it is not as trivial as it sounds.
TCP Window Scaling
This TCP extension was introduced by RFC 1323 (TCP Extensions for High Performance) and expands the definition of the 16 bit window to 32 bits (the shift count is capped at 14, so the largest usable window is just under 1 GiB). The option specifies a logarithmic scale factor which is applied to the received and transmitted window. The receive and send window scale factors are established separately for each direction. The factor is fixed during the three-way handshake (in the SYN and SYN/ACK packets) and cannot be changed during the TCP session. The scale factor, and therefore the maximum receive window, is determined by the maximum receive buffer space. Linux, for example, checks the maximum possible receive memory in bytes and chooses the window scale factor based on this value (sysctl_rmem_max and sysctl_tcp_rmem).
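The selection of the scale factor can be sketched as follows. This is a simplified illustration, not the kernel code: the function names are made up, and max_rcv_space stands for the largest receive buffer the socket could ever get (assumed here to be derived from sysctl_rmem_max and sysctl_tcp_rmem).

```python
MAX_WSCALE = 14        # RFC 1323 limits the shift count to 14
MAX_UNSCALED = 65535   # largest value the 16 bit window field can carry


def select_rcv_wscale(max_rcv_space: int) -> int:
    """Return the smallest shift that lets max_rcv_space fit into 16 bits."""
    space = max_rcv_space
    wscale = 0
    while space > MAX_UNSCALED and wscale < MAX_WSCALE:
        space >>= 1
        wscale += 1
    return wscale


def window_field(free_space: int, wscale: int) -> int:
    """Value written into the 16 bit window field for free_space bytes."""
    return min(free_space >> wscale, MAX_UNSCALED)


if __name__ == "__main__":
    for space in (65535, 65536, 4 * 1024 * 1024):
        print(space, "->", select_rcv_wscale(space))
```

With a 4 MiB maximum receive buffer this sketch yields a scale factor of 7. The factor depends only on the configured maximum, not on the memory currently in use.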
The actual window size is calculated in @tcp_output.c:tcp_select_window()@ each time a TCP packet is transmitted; it advertises the amount of free space in the receive buffer, with RFC 1323 scaling applied. The algorithm never shrinks the offered window, in conformance with RFC 793. This buffer is tied to exactly one socket. Expanding the window is more complicated; RFC 1122 says:
the suggested [SWS] avoidance algorithm for the receiver is to keep RCV.NXT + RCV.WND fixed until: RCV.BUFF - RCV.USER - RCV.WND >= min(1/2 RCV.BUFF, MSS)
This means that the right edge of the window is never advanced until enough memory is available to grow the window by at least MSS bytes (or half the receive buffer, whichever is smaller). This RFC statement is a little unfortunate because it breaks the header prediction algorithm, but that is another topic. ;)
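The RFC 1122 condition quoted above translates almost directly into code. A minimal sketch, with a hypothetical function name and parameters that mirror the RFC variables (total buffer, received but not yet consumed data, currently offered window):

```python
def may_advance_right_edge(rcv_buff: int, rcv_user: int, rcv_wnd: int,
                           mss: int, fr: float = 0.5) -> bool:
    """Receiver-side SWS avoidance test from RFC 1122.

    rcv_buff: total receive buffer space in bytes
    rcv_user: bytes received but not yet consumed by the application
    rcv_wnd:  window currently offered to the peer, in bytes
    fr:       fraction of the buffer, recommended value 1/2
    """
    reduction = rcv_buff - rcv_user - rcv_wnd
    return reduction >= min(fr * rcv_buff, mss)


if __name__ == "__main__":
    # Only 536 bytes became free: keep the right edge where it is.
    print(may_advance_right_edge(65536, 1000, 64000, 1460))
    # 1536 bytes became free, more than one MSS: the window may grow.
    print(may_advance_right_edge(65536, 0, 64000, 1460))
```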
Last but not least, the actual window calculation is not purely based on the memory that is currently available and allocated in advance; it is also based on window clamping. The clamp is roughly 3/4 of the size of the receive buffer minus the size of the application buffer minus the maximum segment size. In the presence of dynamic window sizing the window clamping is a little more complicated, because the memory control is more dynamic and so on ...
Network Stack Regression
The regression detected in the current network stack arises from a race between the SYN/ACK, where we initially commit to a particular window scale, and the later point where we recalculate the window via tcp_select_initial_window().
If the user changes net.core.rmem_max or net.ipv4.tcp_rmem in between, the recalculated window scale (rcv_wscale) can be smaller. But the peer that received the SYN/ACK still operates with the initial window scale and can overshoot the granted window - and bang.
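A small numeric illustration of the mismatch, with made-up helper names and made-up numbers: one side writes the window field using the recalculated scale, while the side that received the SYN/ACK still expands it with the scale announced there.

```python
MAX_UNSCALED = 65535   # largest value the 16 bit window field can carry


def window_field(window_bytes: int, wscale: int) -> int:
    """What the advertising side puts into the 16 bit window field."""
    return min(window_bytes >> wscale, MAX_UNSCALED)


def window_seen_by_peer(window_bytes: int, synack_wscale: int,
                        recalculated_wscale: int) -> int:
    """Window the peer believes it was granted, in bytes."""
    field = window_field(window_bytes, recalculated_wscale)
    return field << synack_wscale


if __name__ == "__main__":
    granted = 100_000
    # Scales agree: the peer sees at most what was granted.
    print(window_seen_by_peer(granted, 7, 7))
    # SYN/ACK announced 7, recalculation yields 2: the peer sees far more.
    print(window_seen_by_peer(granted, 7, 2))
```

In this example a grant of 100000 bytes is interpreted as 3200000 bytes, so the peer may send far more data than there is room for.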
There are several solutions:
- encode rcv_wscale into the SYN cookie and don't recalculate the scaling factor via tcp_select_initial_window() or
- disable window scaling and don't transmit any scaling option when SYN cookies are active. The latter option is not as bad as it sounds: if the server is short on memory anyway, window scaling becomes insignificant.
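The first option could look roughly like the sketch below. The layout is purely hypothetical and is not what the kernel does: it assumes some 32 bit value that the client echoes back unchanged, and reserves its low 4 bits for rcv_wscale so that the value announced in the SYN/ACK can be recovered instead of recalculated.

```python
WSCALE_BITS = 4                        # shift counts 0..14 fit into 4 bits
WSCALE_MASK = (1 << WSCALE_BITS) - 1


def pack_wscale(echoed_value: int, rcv_wscale: int) -> int:
    """Store rcv_wscale in the low bits of a 32 bit echoed value."""
    if not 0 <= rcv_wscale <= 14:
        raise ValueError("window scale must be in the range 0..14")
    return ((echoed_value & ~WSCALE_MASK) | rcv_wscale) & 0xFFFFFFFF


def unpack_wscale(echoed_value: int) -> int:
    """Recover the window scale that was announced in the SYN/ACK."""
    return echoed_value & WSCALE_MASK


if __name__ == "__main__":
    cookie = pack_wscale(0x12345678, 7)
    print(hex(cookie), unpack_wscale(cookie))
```

The price is that the bits used for the scale factor are no longer available for anything else in that value.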