The CRC-16-CCITT is a cyclic redundancy check - one among many - defined by the generator polynomial x^16^ + x^12^ + x^5^ + 1. It is used by X.25, HDLC, XMODEM, Bluetooth, PHY header verification and many others. During some analysis I stumbled across the following implementation:

unsigned char ser_data;
static unsigned int crc;

crc  = (unsigned char)(crc >> 8) | (crc << 8);
crc ^= ser_data;
crc ^= (unsigned char)(crc & 0xff) >> 4;
crc ^= (crc << 8) << 4;
crc ^= ((crc & 0xff) << 4) << 1;

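For reference, here is the same per-byte update wrapped into a small self-contained function. This is only a sketch of my reading of the fragment: the name @crc16_ccitt_update()@, the @uint16_t@ types, the @0xFFFF@ start value and the test @main()@ are my additions; the body is the fragment above, with comments mapping the shifts to the x^12^ and x^5^ terms of the polynomial.

#include <stdint.h>
#include <stdio.h>

/* Per-byte CRC-16-CCITT update, polynomial x^16 + x^12 + x^5 + 1 (0x1021). */
static uint16_t crc16_ccitt_update(uint16_t crc, uint8_t ser_data)
{
    crc  = (uint8_t)(crc >> 8) | (crc << 8);  /* swap high and low byte      */
    crc ^= ser_data;                          /* fold in the new data byte   */
    crc ^= (uint8_t)(crc & 0xff) >> 4;        /* t ^= t >> 4 on the low byte */
    crc ^= (crc << 8) << 4;                   /* << 12: the x^12 term        */
    crc ^= ((crc & 0xff) << 4) << 1;          /* << 5:  the x^5 term         */
    return crc;
}

int main(void)
{
    const uint8_t msg[] = "123456789";
    uint16_t crc = 0xFFFF;                    /* common start value (assumed) */

    for (size_t i = 0; i < sizeof(msg) - 1; i++)
        crc = crc16_ccitt_update(crc, msg[i]);

    printf("0x%04X\n", crc);                  /* expect 0x29B1 for this variant */
    return 0;
}

With the @0xFFFF@ start value this is the CRC-16/CCITT-FALSE variant; starting from @0x0000@ it matches the XMODEM flavour.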
The expression @(crc << 8) << 4@ puzzled me a little, because @crc << 12@ seems more efficient and is essentially equivalent. The advice on the "site": is to implement the above algorithm with the two shift operations, because the single 12-bit shift generates much more code and executes slower! I compiled the above code fragment with all optimizations disabled, and voilà:

shrb  $4, %al
movzbl  %al, %eax
xorw  %ax, -2(%rbp)
movzwl  -2(%rbp), %eax
sall  $12, %eax
movl  %eax, %edx
movzwl  -2(%rbp), %eax
xorl  %edx, %eax
movw  %ax, -2(%rbp)
movzwl  -2(%rbp), %eax
andl  $255, %eax
sall  $5, %eax

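For comparison, the single-shift variant would look like this (again only a sketch; only the commented line differs from the fragment above):

static uint16_t crc16_ccitt_update_1shift(uint16_t crc, uint8_t ser_data)
{
    crc  = (uint8_t)(crc >> 8) | (crc << 8);
    crc ^= ser_data;
    crc ^= (uint8_t)(crc & 0xff) >> 4;
    crc ^= crc << 12;                         /* instead of (crc << 8) << 4 */
    crc ^= ((crc & 0xff) << 4) << 1;
    return crc;
}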
Independent of the compiler optimization level (@-O0 .. -O7@), the compiler always generates a 12-bit shift. It is not even possible to instruct the compiler not to do so. But why doesn't the compiler optimize the construct more aggressively? It does, if the user tells it to do so. ;-) But these optimizations are mostly cosmetic and limited to reducing the number of instructions.

shrb  $4, %dl
movzbl  %dl, %edx
xorl  %eax, %edx
movl  %edx, %eax
sall  $12, %eax
xorl  %edx, %eax
movzbl  %al, %edx
sall  $5, %edx

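A per-byte figure can be obtained by running the update over a larger buffer and dividing the elapsed time by the buffer size. A minimal micro-benchmark sketch, assuming the @crc16_ccitt_update()@ helper from above and POSIX @clock_gettime()@ (not necessarily the exact harness used for the number below), could look like this:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

uint16_t crc16_ccitt_update(uint16_t crc, uint8_t ser_data);  /* from the sketch above */

int main(void)
{
    enum { N = 1 << 20 };                     /* 1 MiB of pseudo-random input */
    static uint8_t buf[N];
    struct timespec t0, t1;
    uint16_t crc = 0xFFFF;

    for (size_t i = 0; i < N; i++)
        buf[i] = (uint8_t)rand();

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        crc = crc16_ccitt_update(crc, buf[i]);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("crc=0x%04X, %.3f ns/byte\n", crc, ns / N);
    return 0;
}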
On average, the processing time on my @x86_64@ machine is @2.06076e-05 us@ per byte: