Points of Interest
While running some additional performance measurements I noticed some interesting plateaus. Two points are of interest: the first at ~200 bytes and the second at 600 bytes (note that the x-axis is given in DWORDs, i.e. units of uint32_t).

Also quite interesting: the long time it takes to "warm" the cache, TLB, etc. (BTW: we are talking about microseconds here).
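
One way to keep the warm-up phase out of the steady-state numbers is to run a few untimed passes before the timed ones. The following is only a minimal sketch, not the benchmark used for the measurements above: it assumes a POSIX system (for clock_gettime), and sum_dwords, now_us, and measure_us are hypothetical names, with sum_dwords standing in for whatever code is actually being measured.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime; assumes a POSIX system */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical workload: sum n DWORDs. The volatile qualifier forces
 * every element to be read on every pass, so the compiler cannot
 * hoist the work out of the timing loop. */
static uint32_t sum_dwords(const volatile uint32_t *buf, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += buf[i];
    return acc;
}

/* Monotonic clock reading in microseconds. */
static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e6 + (double)ts.tv_nsec / 1e3;
}

/* Run `warmup` untimed passes (to warm cache, TLB, etc.), then
 * return the average duration of `runs` timed passes in microseconds. */
static double measure_us(const uint32_t *buf, size_t n, int warmup, int runs)
{
    volatile uint32_t sink = 0;  /* keeps the results observable */

    assert(runs > 0);

    for (int i = 0; i < warmup; i++)
        sink += sum_dwords(buf, n);

    double start = now_us();
    for (int i = 0; i < runs; i++)
        sink += sum_dwords(buf, n);
    double end = now_us();

    (void)sink;
    return (end - start) / runs;
}
```

Calling measure_us(buf, 50, 0, 1) and measure_us(buf, 50, 1000, 1000) on the same 50-DWORD (200-byte) buffer should give a rough idea of how much of a single cold run is warm-up; the warm-up count of 1000 is an arbitrary choice here.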