
HalfSipHash Acceptable Usage


Jason A. Donenfeld
Dec 19, 2016, 12:40:06 PM

Hi JP,

With the threads getting confusing, I've been urged to try and keep
the topics and threads more closely constrained. Here's where we're
at, and here's the current pressing security concern. It'd be helpful
to have a definitive statement on what you think is best, so we can
just build on top of that, instead of getting lost in the chorus of
opinions.

1) Anything that requires actual long-term security will use
SipHash2-4, with the 64-bit output and the 128-bit key. This includes
things like TCP sequence numbers. This seems pretty uncontroversial to
me. Seem okay to you?

2) People seem to want something competitive, performance-wise, with
jhash if it's going to replace jhash. The kernel community
instinctively pushes back on anything that could harm performance,
especially in networking and in critical data structures, so there
have been some calls for something faster than SipHash. So, questions
regarding this:

2a) George thinks that HalfSipHash on 32-bit systems will have roughly
comparable speed as SipHash on 64-bit systems, so the idea would be to
use HalfSipHash on 32-bit systems' hash tables and SipHash on 64-bit
systems' hash tables. The big obvious question is: does HalfSipHash
have a sufficient security margin for hashtable usage and hashtable
attacks? I'm not wondering about the security margin for other usages,
but just of the hashtable usage. In your opinion, does HalfSipHash cut
it?

2b) While I certainly wouldn't consider making the use case in
question (1) employ a weaker function, for this question (2), there
has been some discussion about using HalfSipHash1-3 (or SipHash1-3 on
64-bit) instead of 2-4. So, the same question is therefore posed:
would using HalfSipHash1-3 give a sufficient security margin for
hashtable usage and hashtable attacks?

My plan is essentially to implement things according to your security
recommendation. The thread started with me pushing a heavy duty
security solution -- SipHash2-4 -- for _everything_. I've received
understandable push back on that notion for certain use cases. So now
I'm trying to discover what the most acceptable compromise is. Your
answers on (2a) and (2b) will direct that compromise.

Thanks again,
Jason

Jason A. Donenfeld
Dec 19, 2016, 4:10:05 PM

Hi JP,

On Mon, Dec 19, 2016 at 9:49 PM, Jean-Philippe Aumasson
<jeanphilip...@gmail.com> wrote:
>
> On Mon, Dec 19, 2016 at 6:32 PM Jason A. Donenfeld <Ja...@zx2c4.com> wrote:
>>
>> Hi JP,
>>
>> With the threads getting confusing, I've been urged to try and keep
>> the topics and threads more closely constrained. Here's where we're
>> at, and here's the current pressing security concern. It'd be helpful
>> to have a definitive statement on what you think is best, so we can
>> just build on top of that, instead of getting lost in the chorus of
>> opinions.
>>
>> 1) Anything that requires actual long-term security will use
>> SipHash2-4, with the 64-bit output and the 128-bit key. This includes
>> things like TCP sequence numbers. This seems pretty uncontroversial to
>> me. Seem okay to you?
>
>
>
> Right, since 2012 when we published SipHash many cryptanalysts attempted to
> break SipHash-2-4 with a 128-bit key, for various notions of "break", and
> nothing worth worrying about was ever found. I'm totally confident that
> SipHash-2-4 will live up to its security promises.
>
> Don't use something weaker for things like TCP sequence numbers or RNGs. Use
> SipHash2-4 for those. That is the correct choice.
>
>>
>>
>> 2) People seem to want something competitive, performance-wise, with
>> jhash if it's going to replace jhash. The kernel community
>> instinctively pushes back on anything that could harm performance,
>> especially in networking and in critical data structures, so there
>> have been some calls for something faster than SipHash. So, questions
>> regarding this:
>>
>
> No free lunch I guess: either go with a cryptographically secure,
> time-proved keyed hash such as SipHash, or go with some simpler hash deemed
> secure cos its designer can't break it :) #DontRollYourOwnCrypto
>
>> 2a) George thinks that HalfSipHash on 32-bit systems will have roughly
>> comparable speed as SipHash on 64-bit systems, so the idea would be to
>> use HalfSipHash on 32-bit systems' hash tables and SipHash on 64-bit
>> systems' hash tables. The big obvious question is: does HalfSipHash
>> have a sufficient security margin for hashtable usage and hashtable
>> attacks? I'm not wondering about the security margin for other usages,
>> but just of the hashtable usage. In your opinion, does HalfSipHash cut
>> it?
>
>
> HalfSipHash takes its core function from Chaskey and uses the same
> construction as SipHash, so it *should* be secure. Nonetheless it hasn't
> received the same amount of attention as 64-bit SipHash did. So I'm less
> confident about its security than about SipHash's, but it obviously inspires
> a lot more confidence than non-crypto hashes.
>
> Also, HalfSipHash only has a 64-bit key, not a 128-bit key like SipHash, so
> only use this as a mitigation for hash-flooding attacks, where the output of
> the hash function is never directly shown to the caller. Do not use
> HalfSipHash for TCP sequence numbers or RNGs.
>
>
>>
>>
>> 2b) While I certainly wouldn't consider making the use case in
>> question (1) employ a weaker function, for this question (2), there
>> has been some discussion about using HalfSipHash1-3 (or SipHash1-3 on
>> 64-bit) instead of 2-4. So, the same question is therefore posed:
>> would using HalfSipHash1-3 give a sufficient security margin for
>> hashtable usage and hashtable attacks?
>
>
> My educated guess is that yes, it will, but that it may not withstand
> cryptanalysis as a pseudorandom function (PRF). For example I wouldn't be
> surprised if there were a "distinguishing attack" that detects non-random
> patterns in HalfSipHash-1-3's output. But most of the non-crypto hashes I've
> seen have obvious distinguishing attacks. So the upshot is that HSH will get
> you better security than AnyWeakHash, even with 1 & 3 rounds.
>
> So, if you're willing to compromise on security, but still want something
> not completely unreasonable, you might be able to get away with using
> HalfSipHash1-3 as a replacement for jhash—in circumstances where the output
> of the hash function is kept secret—in order to mitigate hash-flooding
> attacks.
>

Thanks for the detailed response. I will continue exactly how you've specified.

1. SipHash2-4 for TCP sequence numbers, syncookies, and RNG. IOW, the
things that MD5 is used for now.

2. HalfSipHash1-3 for hash tables where the output is not revealed,
for jhash replacements. On 64-bit this will alias to SipHash1-3.

3. I will write Documentation/siphash.txt detailing this.

4. I'll continue to discourage other kernel developers from rolling
their own crypto or departing from the tried-and-true in substantial ways.

Thanks again,
Jason

Theodore Ts'o
Dec 20, 2016, 4:40:06 PM

On Mon, Dec 19, 2016 at 06:32:44PM +0100, Jason A. Donenfeld wrote:
> 1) Anything that requires actual long-term security will use
> SipHash2-4, with the 64-bit output and the 128-bit key. This includes
> things like TCP sequence numbers. This seems pretty uncontroversial to
> me. Seem okay to you?

Um, why do TCP sequence numbers need long-term security? So long as
you rekey every 5 minutes or so, TCP sequence numbers don't need any
more security than that, since even if you break the key used to
generate initial sequence numbers even a minute or two later, any
pending TCP connections will have timed out long before.

See the security analysis done in RFC 6528[1], where among other
things, it points out why MD5 is acceptable with periodic rekeying,
although there is the concern that this could break certain heuristics
used when establishing new connections during the TIME-WAIT state.

[1] https://tools.ietf.org/html/rfc6528

- Ted

George Spelvin
Dec 20, 2016, 6:10:05 PM

Because we don't rekey TCP sequence numbers, ever. See commit
6e5714eaf77d79ae1c8b47e3e040ff5411b717ec

To rekey them requires dividing the sequence number base into a "random"
part and some "generation" msbits. While we can do better than the
previous 8+24 split (I'd suggest 4+28 or 3+29), only 2 is tricky, and
1 generation bit isn't enough.

So while it helps in the long term, it reduces the security offered by
the random part in the short term. (If I know 4 bits of your ISN,
I only need to send 256 MB to hit your TCP window.)

At the time, I objected, and suggested doing two hashes, with a fixed
32-bit base plus a split rekeyed portion, but that was vetoed on the
grounds of performance.

On further consideration, the fixed base doesn't help much.
(Details below for anyone that cares.)



Suppose we let the TCP initial sequence number be:

(Hash(<srcIP,dstIP,srcPort,dstPort>, fixed_key) & 0xffffffff) +
(i << 28) + (Hash(<srcIP,dstIP,srcPort,dstPort>, key[i]) & 0x0fffffff) +
(current_time_in_nanoseconds / 64)

It's not hugely difficult to mount an effective attack against a
64-bit fixed_key.

As an attacker, I can ask the target to send me these numbers for dstPort
values I control and other values I know. I can (with high probability)
detect the large jumps when the generation changes, so I can make a
significant number of queries with the same generation. After 23-ish
queries, I have enough information to identify a 64-bit fixed_key.

I don't know the current generation counter "i", but I know it's the
same for all my queries, so for any two queries, the maximum difference
between the 28-bit hash values is 29 bits. (We can also add a small
margin to allow for timing uncertainty, but that's even less.)

So if I guess a fixed key, hash my known plaintexts with that guess,
subtract the ciphertexts from the observed sequence numbers, and the
difference between the remaining (unknown) 28-bit hash values plus
timestamps exceeds what's possible, my guess is wrong.

I can then repeat with additional known plaintexts, reducing the space
of admissible keys by about 3 bits each time.

Assuming I can rent GPU horsepower from a bitcoin miner to do this in a
reasonable period of time, after 22 known plaintext differences, I have
uniquely identified the key.
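
The key-filtering step described above can be sketched concretely. This is a toy Python simulation, not an attack tool: prf32 is a stand-in PRF built from BLAKE2 rather than SipHash, the timestamp term is dropped, and wraparound of the 32-bit residuals is ignored for brevity.

```python
import hashlib

M32 = 0xffffffff

def prf32(data: bytes, key: bytes) -> int:
    # Stand-in 32-bit keyed PRF; the real discussion is about SipHash.
    return int.from_bytes(
        hashlib.blake2b(data, key=key, digest_size=4).digest(), "little")

def observe(pt: bytes, fixed_key: bytes, gen_i: int, rekeyed_key: bytes) -> int:
    # One ISN sample from the target, per the construction quoted above
    # (timestamp term omitted to keep the sketch short).
    return (prf32(pt, fixed_key)
            + ((gen_i << 28) & M32)
            + (prf32(pt, rekeyed_key) & 0x0fffffff)) & M32

def admissible(guess: bytes, observations) -> bool:
    # Strip the guessed fixed-key contribution from each observation.
    # For the right guess, the residuals differ only by 28-bit hash
    # values, so their spread must stay under 2**29; each additional
    # observation prunes the surviving keyspace by roughly 3 bits.
    residuals = [(obs - prf32(pt, guess)) & M32 for pt, obs in observations]
    return max(residuals) - min(residuals) < 2**29
```

A wrong key guess produces residuals scattered across the full 32-bit range, so with a couple dozen observations it is rejected with overwhelming probability, while the true key always passes.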

Of course, in practice what I'd do is a first pass with maybe 6 plaintexts
on the GPU, and then deal with the candidates found in a second pass.
But either way, it's about 2.3 SipHash evaluations per key tested.
As I noted earlier, a bitcoin blockchain block, worth 25 bitcoins,
currently costs 2^71 evaluations of SHA-2 (2^70 evaluations of double
SHA-2), and that's accomplished every 10 minutes. So this is definitely
practical.

Eric Dumazet
Dec 20, 2016, 7:00:04 PM

We do not use rekeying for TCP ISN, not anymore after commit
6e5714eaf77d79ae1 (where we switched from MD4 to MD5).

It might hurt some common cases and I do not believe it is mandated by a
current (ie not obsolete) RFC.

Our clock has a 64 ns resolution and a 274 second period (commit
9b42c336d0641), compared to the 4 usec resolution in RFC 6528.

I do not see why SipHash, if faster than MD5 and more secure, would be a
problem.

Same for syncookies.

BTW, we probably should add a ratelimit on SYNACK retransmits,
because it seems that attackers understood that Linux kernels resist
synfloods, and they (the bad guys) use reflection attacks.

George Spelvin
Dec 20, 2016, 10:30:05 PM

> I do not see why SipHash, if faster than MD5 and more secure, would be a
> problem.

Because on 32-bit x86, it's slower.

Cycles per byte on 1024 bytes of data:

                 Pentium 4  Core 2 Duo  Ivy Bridge
SipHash-2-4         38.9        8.3        5.8
HalfSipHash-2-4     12.7        4.5        3.2
MD5                  8.3        5.7        4.7

SipHash is more parallelizable and runs faster on superscalar processors,
but MD5 is optimized for 2000-era processors, and on them it is faster
than even HalfSipHash.

Now, in the applications we care about, we're hashing short blocks, and
SipHash has the advantage that it can hash less than 64 bytes. But it
also pays a penalty on short blocks for the finalization, equivalent to
two words (16 bytes) of input.

It turns out that on both Ivy Bridge and Core 2 Duo, the crossover happens
between 23 (SipHash is faster) and 24 (MD5 is faster) bytes of input.

This is assuming you're adding the 1 byte of length padding to SipHash's
input, so 24 bytes pads to 4 64-bit words, which makes 2*4+4 = 12 rounds,
vs. one block for MD5. (MD5 takes a similar jump between 55 and 56 bytes.)

On a P4, SipHash is *never* faster; it takes 2.5x longer than MD5 on a
12-byte block (an IPv4 address/port pair).

This is why there was discussion of using HalfSipHash on these machines.
(On a P4, the HalfSipHash/MD5 crossover is somewhere between 24 and 31
bytes; I haven't benchmarked every possible size.)

Eric Dumazet
Dec 21, 2016, 12:30:05 AM

On Tue, 2016-12-20 at 22:28 -0500, George Spelvin wrote:
> > I do not see why SipHash, if faster than MD5 and more secure, would be a
> > problem.
>
> Because on 32-bit x86, it's slower.
>
> Cycles per byte on 1024 bytes of data:
>
>                  Pentium 4  Core 2 Duo  Ivy Bridge
> SipHash-2-4         38.9        8.3        5.8
> HalfSipHash-2-4     12.7        4.5        3.2
> MD5                  8.3        5.7        4.7

So definitely not faster.

38 cycles per byte is a problem, considering IPv6 is ramping up.

But TCP session establishment on P4 is probably not a big deal.
Nobody would expect a P4 to handle gazillions of TCP flows (using a
32-bit kernel).

What about SHA performance (syncookies) on a P4?

Synfloods are probably the only case we need to care about for 2000-era
CPUs.

George Spelvin
Dec 21, 2016, 1:40:05 AM

Eric Dumazet wrote:
> On Tue, 2016-12-20 at 22:28 -0500, George Spelvin wrote:
>> Cycles per byte on 1024 bytes of data:
>>
>>                  Pentium 4  Core 2 Duo  Ivy Bridge
>> SipHash-2-4         38.9        8.3        5.8
>> HalfSipHash-2-4     12.7        4.5        3.2
>> MD5                  8.3        5.7        4.7
>
> So definitely not faster.
>
> 38 cycles per byte is a problem, considering IPV6 is ramping up.

As I said earlier, SipHash performance on 32-bit x86 really sucks,
because it wants an absolute minimum of 9 32-bit registers (8 for the
state plus one temporary for the rotates), and x86 has 7.

> What about SHA performance (syncookies) on a P4?

I recompiled with -mtune=pentium4 and re-ran. MD5 time went *up* by
0.3 cycles/byte, HalfSipHash went down by 1 cycle, and SipHash didn't
change:

Cycles per byte on 1024 bytes of data:

                 Pentium 4  Core 2 Duo  Ivy Bridge
SipHash-2-4         38.9        8.3        5.8
HalfSipHash-2-4     11.5        4.5        3.2
MD5                  8.6        5.7        4.7
SHA-1               19.0        8.0        6.8

(This is with a verbatim copy of the lib/sha1.c code; I might be
able to optimize it with some asm hackery.)

Anyway, you see why we were looking longingly at HalfSipHash.


In fact, I have an idea. Allow me to make the following concrete
suggestion for using HalfSipHash with 128 bits of key material:

- 64 bits are used as the key.
- The other 64 bits are used as an IV which is prepended to
  the message to be hashed.

As a matter of practical implementation, we precompute the effect
of hashing the IV and store the 128-bit HalfSipHash state, which
is used just like a 128-bit key.
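
A toy model shows why the precomputed state behaves exactly like a 128-bit key. This is a Python sketch with an ARX round shaped like SipHash's but carrying no security claim of its own (the rotation constants and initialization are arbitrary, NOT HalfSipHash); only the absorb-then-save-state identity is the point.

```python
import struct

MASK = 0xffffffff

def rotl(x, b):
    # 32-bit rotate left.
    return ((x << b) | (x >> (32 - b))) & MASK

def mix(v):
    # One SipHash-shaped ARX round on a 4-word state (demonstration only).
    v = list(v)
    v[0] = (v[0] + v[1]) & MASK; v[1] = rotl(v[1], 5) ^ v[0]
    v[2] = (v[2] + v[3]) & MASK; v[3] = rotl(v[3], 8) ^ v[2]
    v[0] = (v[0] + v[3]) & MASK; v[3] = rotl(v[3], 7) ^ v[0]
    v[2] = (v[2] + v[1]) & MASK; v[1] = rotl(v[1], 13) ^ v[2]
    return v

def init_state(key8):
    # Expand a 64-bit key into a 128-bit (4 x 32-bit) state.
    k0, k1 = struct.unpack("<II", key8)
    return [k0, k1 ^ 0x6c796765, k0 ^ 0x74656465, k1]

def absorb(state, msg):
    # Absorb 32-bit words sequentially (msg length must be a multiple of 4).
    v = list(state)
    for (w,) in struct.iter_unpack("<I", msg):
        v[3] ^= w
        v = mix(mix(v))
        v[0] ^= w
    return v

def digest(state):
    v = mix(mix(mix(list(state))))
    return v[0] ^ v[1] ^ v[2] ^ v[3]

def toy_hash(key8, msg):
    return digest(absorb(init_state(key8), msg))
```

Because absorption is strictly sequential, hashing IV || message from scratch equals resuming from the saved state after absorbing the IV: `absorb(init_state(key), iv)` can be stored once and then used "just like a 128-bit key".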

Because of the way it is constructed, it is obviously no weaker than
standard HalfSipHash's 64-bit security claim.

I don't know the security of this, and it's almost certainly weaker than
128 bits, but I *hope* it's at least a few bits stronger than 64 bits.
80 would be enough to dissuade any attacker without a six-figure budget
(that's per attack, not a one-time capital investment). 96 would be
ample for our purposes.

What I do know is that it makes a brute-force attack impossible without
significant cryptanalytic effort.

To match the spec exactly, we'd need to add the 8-byte IV length to
the length byte which pads the final block, but from a security point
of view, it does not matter. As long as we are consistent within any
single key, any unique mapping between padding byte and message length
(mod 256) is equally good.

We may choose based on implementation convenience.

(Also note my earlier comments about when it is okay to omit the padding
length byte entirely: any time all the data to be hashed with a given
key is fixed in format or self-delimiting (e.g. null-terminated).
This applies to many of the networking uses.)

Jason A. Donenfeld
Dec 21, 2016, 9:30:05 AM

Hi George,

On Wed, Dec 21, 2016 at 7:34 AM, George Spelvin
<li...@sciencehorizons.net> wrote:
> In fact, I have an idea. Allow me to make the following concrete
> suggestion for using HalfSipHash with 128 bits of key material:
>
> - 64 bits are used as the key.
> - The other 64 bits are used as an IV which is prepended to
> the message to be hashed.
>
> As a matter of practical implementation, we precompute the effect
> of hashing the IV and store the 128-bit HalfSipHash state, which
> is used just like a 128-bit key.
>
> Because of the way it is constructed, it is obviously no weaker than
> standard HalfSipHash's 64-bit security claim.
>
> I don't know the security of this, and it's almost certainly weaker than
> 128 bits, but I *hope* it's at least a few bits stronger than 64 bits.
> 80 would be enough to dissuade any attacker without a six-figure budget
> (that's per attack, not a one-time capital investment). 96 would be
> ample for our purposes.
>
> What I do know is that it makes a brute-force attack without
> significant cryptanalytic effort impossible.

Depends who's doing the cryptanalytic effort I guess.

Please don't roll your own crypto. It's a dangerous road. Putting
homebrew crypto into the kernel would be an error. Let's stick with
the constructions and security margins that the cryptographers give
us. JP made that fairly clear, I thought.

There are already people working on this problem who undergo peer
review and a career devoted to solving these problems. One result for
small systems that need 128-bit security is Chaskey, which you can go
read about if you're curious.

Jason

Jason A. Donenfeld
Dec 21, 2016, 9:50:07 AM

Hi Eric,

I computed performance numbers for both 32-bit and 64-bit using the
actual functions in which we're talking about replacing MD5 with SipHash.
The basic harness is here [1] if you're curious. SipHash was a pretty
clear winner for both cases.

x86_64:
[ 1.714302] secure_tcpv6_sequence_number_md5# cycles: 102373398
[ 1.747685] secure_tcp_sequence_number_md5# cycles: 92042258
[ 1.773522] secure_tcpv6_sequence_number_siphash# cycles: 70786533
[ 1.798701] secure_tcp_sequence_number_siphash# cycles: 68941043

x86:
[ 1.635749] secure_tcpv6_sequence_number_md5# cycles: 106016335
[ 1.670259] secure_tcp_sequence_number_md5# cycles: 95670512
[ 1.708387] secure_tcpv6_sequence_number_siphash# cycles: 105988635
[ 1.740264] secure_tcp_sequence_number_siphash# cycles: 88225395

>>> 102373398 > 70786533
True
>>> 92042258 > 68941043
True
>>> 106016335 > 105988635
True
>>> 95670512 > 88225395
True

While MD5 is probably faster for some kind of large-data
cycles-per-byte, due to its 64-byte internal state, SipHash -- the
"Sip" part standing for "Short Input PRF" -- is fast for shorter inputs.
In practice with the functions we're talking about replacing, there's
no need to hash 64-bytes. So, SipHash comes out faster and more
secure.

I also haven't begun to look closely at the assembly my SipHash
implementation is generating, which means there's still room for even
more performance improvements.

Jason


[1] https://git.zx2c4.com/linux-dev/tree/net/core/secure_seq.c?h=siphash-bench#n194

Eric Dumazet
Dec 21, 2016, 11:00:05 AM

Now I am quite confused.

George said:

> Cycles per byte on 1024 bytes of data:
>
>                  Pentium 4  Core 2 Duo  Ivy Bridge
> SipHash-2-4         38.9        8.3        5.8
> HalfSipHash-2-4     12.7        4.5        3.2
> MD5                  8.3        5.7        4.7


That really was for 1024-byte blocks, so pretty much useless for our
discussion?

Reading your numbers last week, I thought SipHash was faster, but George's
numbers are giving the opposite impression.

I do not have a P4 to make tests, so I can only trust you or George.

Thanks.

George Spelvin
Dec 21, 2016, 11:00:06 AM

Actually, DJB just made a very relevant suggestion.

As I've mentioned, the 32-bit performance problems are an x86-specific
problem. ARM does very well, and other processors aren't bad at all.

SipHash fits very nicely (and runs very fast) in the MMX registers.

They're 64 bits, and there are 8 of them, so the integer registers can
be reserved for pointers and loop counters and all that. And there's
reference code available.

How much does kernel_fpu_begin()/kernel_fpu_end() cost?

Although there are a lot of pre-MMX x86es in embedded control applications,
I don't think anyone is worried about their networking performance.
(Specifically, all of this affects only connection setup, not throughput
on established connections.)

Jason A. Donenfeld
Dec 21, 2016, 11:40:05 AM

Hi Eric,

On Wed, Dec 21, 2016 at 4:56 PM, Eric Dumazet <eric.d...@gmail.com> wrote:
> That really was for 1024-byte blocks, so pretty much useless for our
> discussion?
>
> Reading your numbers last week, I thought SipHash was faster, but George's
> numbers are giving the opposite impression.
>
> I do not have a P4 to make tests, so I can only trust you or George.

I'm not sure how George came up with those numbers, but the ones I
sent are output from that benchmark function in the last email. I'd be
interested in learning this too.

As mentioned in the last email, it looks like potential 32-bit issues
are really just specific to old Intel chips. Other 32-bit
architectures do fine. So, for new kernels, even if somehow there is a
tiny performance regression (though I couldn't see one) on old
architectures, I really doubt it will affect anybody in practice.

Jason

Rik van Riel
Dec 21, 2016, 11:40:05 AM

Does anybody still have a P4?

If they do, they're probably better off replacing
it with an Atom. The reduced power bills will pay
for replacing that P4 within a year or two.

In short, I am not sure how important the P4
performance numbers are, especially if we can
improve security for everybody else...

--
All Rights Reversed.


Jason A. Donenfeld
Dec 21, 2016, 11:40:06 AM

Hi George,

On Wed, Dec 21, 2016 at 4:55 PM, George Spelvin
<li...@sciencehorizons.net> wrote:
> Actually, DJB just made a very relevant suggestion.
>
> As I've mentioned, the 32-bit performance problems are an x86-specific
> problem. ARM does very well, and other processors aren't bad at all.
>
> SipHash fits very nicely (and runs very fast) in the MMX registers.
>
> They're 64 bits, and there are 8 of them, so the integer registers can
> be reserved for pointers and loop counters and all that. And there's
> reference code available.
>
> How much does kernel_fpu_begin()/kernel_fpu_end() cost?

In my experience, these functions are only worth calling when
processing more significant amounts of data. I don't have any
benchmarks, but when I _remove_ all of these calls in a kernel,
accelerated crypto gets noticeably faster (until the system crashes).
We can measure it, though.

By the way, if somehow SipHash becomes acceptably fast on x86, would
you consider HalfSipHash for hash tables to be no longer needed? Or do
you suspect that HalfSipHash will always be faster even on, say,
32-bit ARM?

Jason

Rik van Riel
Dec 21, 2016, 11:50:06 AM

On Wed, 2016-12-21 at 10:55 -0500, George Spelvin wrote:
> Actually, DJB just made a very relevant suggestion.
>
> As I've mentioned, the 32-bit performance problems are an x86-specific
> problem.  ARM does very well, and other processors aren't bad at all.
>
> SipHash fits very nicely (and runs very fast) in the MMX registers.
>
> They're 64 bits, and there are 8 of them, so the integer registers can
> be reserved for pointers and loop counters and all that.  And there's
> reference code available.
>
> How much does kernel_fpu_begin()/kernel_fpu_end() cost?

Those can be very expensive. Almost certainly not
worth it for small amounts of data.

--
All Rights Reversed.

Eric Dumazet
Dec 21, 2016, 12:10:06 PM

On Wed, 2016-12-21 at 11:39 -0500, Rik van Riel wrote:

> Does anybody still have a P4?
>
> If they do, they're probably better off replacing
> it with an Atom. The reduced power bills will pay
> for replacing that P4 within a year or two.

Well, maybe they have millions of units to replace.

>
> In short, I am not sure how important the P4
> performance numbers are, especially if we can
> improve security for everybody else...

Worth adding that ISN or syncookie generation is less than 10% of
the actual cost of handling a problematic (having to generate ISN or
syncookie) TCP packet anyway.

So we are talking about a minor potential impact for '2000-era' CPUs.

Definitely I vote for using SipHash in TCP ASAP.

Linus Torvalds
Dec 21, 2016, 12:30:06 PM

On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin
<li...@sciencehorizons.net> wrote:
>
> How much does kernel_fpu_begin()/kernel_fpu_end() cost?

It's now better than it used to be, but it's absolutely disastrous
still. We're talking easily many hundreds of cycles. Under some loads,
thousands.

And I warn you already: it will _benchmark_ a hell of a lot better
than it will work in reality. In benchmarks, you'll hit all the
optimizations ("oh, I've already saved away all the FP registers, no
need to do it again").

In contrast, in reality, especially with things like "do it once or
twice per incoming packet", you'll easily hit the absolute worst
cases, where not only does it take a few hundred cycles to save the FP
state, you'll then return to user space in between packets, which
triggers the slow-path return code and reloads the FP state, which is
another few hundred cycles plus.

Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX
unit and keep it powered up for a while afterwards", while in real
life you would quite easily hit the "oh, AVX is powered down because
we were idle, now it powers up at half speed which is another latency
hit _and_ the AVX unit won't run full out anyway".

Don't do it. There are basically no real situations where the AVX
state optimizations help for the kernel. We just don't have the loop
counts to make up for the problems it causes.

The one exception is likely if you're doing things like
high-throughput disk IO encryption, and then you'd be much better off
using SHA256 instead (which often has hw encryption on modern CPU's -
both x86 and ARM).

(I'm sure that you could see it on some high-throughput network
benchmark too when the benchmark entirely saturates the CPU. And then
in real life it would suck horribly for all the reasons above).

Linus

George Spelvin
Dec 21, 2016, 1:10:06 PM

Linus wrote:
>> How much does kernel_fpu_begin()/kernel_fpu_end() cost?
>
> It's now better than it used to be, but it's absolutely disastrous
> still. We're talking easily many hundreds of cycles. Under some loads,
> thousands.

I think I've been thoroughly dissuaded, but just to clarify one thing
that resembles a misunderstanding:

> In contrast, in reality, especially with things like "do it once or
> twice per incoming packet", you'll easily hit the absolute worst
> cases, where not only does it take a few hundred cycles to save the FP
> state, you'll then return to user space in between packets, which
> triggers the slow-path return code and reloads the FP state, which is
> another few hundred cycles plus.

Everything being discussed is per-TCP-connection overhead, *not* per
packet. (Twice for outgoing connections, because one is to generate
the ephemeral port number.)

I know you know this, but I don't want anyone spectating to be confused
about it.

George Spelvin
Dec 21, 2016, 1:40:05 PM

Eric Dumazet wrote:
> Now I am quite confused.
>
> George said:
>> Cycles per byte on 1024 bytes of data:
>>
>>                  Pentium 4  Core 2 Duo  Ivy Bridge
>> SipHash-2-4         38.9        8.3        5.8
>> HalfSipHash-2-4     12.7        4.5        3.2
>> MD5                  8.3        5.7        4.7
>
> That really was for 1024-byte blocks, so pretty much useless for our
> discussion?

No, they're actually quite relevant, but you have to interpret them
correctly. I thought I explained in the text following that table,
but let me make it clearer:

To find the time to compute the SipHash of N bytes, round (N+17) up to
the next multiple of 8 bytes and multiply by the numbers above.

To find the time to compute the HalfSipHash of N bytes, round (N+9) up to
the next multiple of 4 bytes and multiply by the numbers above.

To find the time to compute the MD5 of N bytes, round (N+9) up to the
next multiple of 64 bytes and multiply by the numbers above.

It's the different rounding rules that make all the difference. For small
input blocks, SipHash can be slower per byte yet still faster because
it hashes fewer bytes.
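
Those rounding rules can be turned into a quick cost estimator. This is a Python sketch using the Ivy Bridge cycles-per-byte figures from the table earlier in the thread as defaults; the figures are rough benchmarks, not authoritative costs.

```python
def roundup(n, m):
    # Round n up to the next multiple of m.
    return -(-n // m) * m

def siphash_cycles(n, cpb=5.8):
    # 1 length-padding byte + 16-byte finalization penalty, 8-byte blocks.
    return roundup(n + 17, 8) * cpb

def halfsiphash_cycles(n, cpb=3.2):
    # 1 length-padding byte + 8-byte finalization penalty, 4-byte blocks.
    return roundup(n + 9, 4) * cpb

def md5_cycles(n, cpb=4.7):
    # Padding byte + 8-byte length field, 64-byte blocks.
    return roundup(n + 9, 64) * cpb
```

For a 12-byte IPv4 address/port pair this gives roughly 186 cycles for SipHash against roughly 301 for MD5 on Ivy Bridge, which is how SipHash can lose on cycles per byte at 1024 bytes yet still win on the short inputs that matter here.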

> Reading your numbers last week, I thought SipHash was faster, but George's
> numbers are giving the opposite impression.

SipHash annihilates the competition on 64-bit superscalar hardware.
SipHash dominates the field on 64-bit in-order hardware.
SipHash wins easily on 32-bit hardware *with enough registers*.
On register-starved 32-bit machines, it really struggles.

As I explained, in that last case, SipHash barely wins at all.
(On a P4, it actually *loses* to MD5, not that anyone cares. Running
on a P4 and caring about performance are mutually exclusive.)

Jason A. Donenfeld
Dec 21, 2016, 1:50:05 PM

On Wed, Dec 21, 2016 at 7:37 PM, George Spelvin
<li...@sciencehorizons.net> wrote:
> SipHash annihilates the competition on 64-bit superscalar hardware.
> SipHash dominates the field on 64-bit in-order hardware.
> SipHash wins easily on 32-bit hardware *with enough registers*.
> On register-starved 32-bit machines, it really struggles.
>
> As I explained, in that last case, SipHash barely wins at all.
> (On a P4, it actually *loses* to MD5, not that anyone cares. Running
> on a P4 and caring about performance are mutually exclusive.)

From the discussion off list which examined your benchmark code, it
looks like we're going to move ahead with SipHash.

Jason A. Donenfeld
Dec 21, 2016, 5:30:05 PM

On Wed, Dec 21, 2016 at 11:27 PM, Theodore Ts'o <ty...@mit.edu> wrote:
> And "with enough registers" includes ARM and MIPS, right? So the only
> real problem is 32-bit x86, and you're right, at that point, only
> people who might care are people who are using a space-radiation
> hardened 386 --- and they're not likely to be doing high throughput
> TCP connections. :-)

Plus the benchmark was bogus anyway, and when I built a more specific
harness -- actually comparing the TCP sequence number functions --
SipHash was faster than MD5, even on register starved x86. So I think
we're fine and this chapter of the discussion can come to a close, in
order to move on to more interesting things.

Theodore Ts'o
Dec 21, 2016, 5:30:05 PM

On Wed, Dec 21, 2016 at 01:37:51PM -0500, George Spelvin wrote:
> SipHash annihilates the competition on 64-bit superscalar hardware.
> SipHash dominates the field on 64-bit in-order hardware.
> SipHash wins easily on 32-bit hardware *with enough registers*.
> On register-starved 32-bit machines, it really struggles.

And "with enough registers" includes ARM and MIPS, right? So the only
real problem is 32-bit x86, and you're right, at that point, only
people who might care are people who are using a space-radiation
hardened 386 --- and they're not likely to be doing high throughput
TCP connections. :-)

- Ted

George Spelvin
Dec 21, 2016, 7:20:05 PM

Theodore Ts'o wrote:
> On Wed, Dec 21, 2016 at 01:37:51PM -0500, George Spelvin wrote:
>> SipHash annihilates the competition on 64-bit superscalar hardware.
>> SipHash dominates the field on 64-bit in-order hardware.
>> SipHash wins easily on 32-bit hardware *with enough registers*.
>> On register-starved 32-bit machines, it really struggles.

> And "with enough registers" includes ARM and MIPS, right?

Yes. As a matter of fact, 32-bit ARM does particularly well
on 64-bit SipHash due to its shift+op instructions.

There is a noticeable performance drop, but nothing catastrophic.

The main thing I've been worried about is all the flow tracking
and NAT done by small home routers, and that's addressed by using
HalfSipHash for the hash tables. They don't *initiate* a lot of
TCP sessions.

> So the only
> real problem is 32-bit x86, and you're right, at that point, only
> people who might care are people who are using a space-radiation
> hardened 386 --- and they're not likely to be doing high throughput
> TCP connections. :-)

The only requirement on performance is "don't make DaveM angry." :-)

I was just trying to answer the question of why we *worried* about the
performance, not specifically argue that we *should* use HalfSipHash.
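For readers following along, HalfSipHash is simply SipHash's construction transposed to 32-bit words, with a 64-bit key and 32-bit output. Here's a self-contained user-space sketch of HalfSipHash-2-4, written from the published reference algorithm (this is an illustration, not the kernel's lib/siphash.h API; little-endian input loading is assumed):

```c
#include <stdint.h>
#include <string.h>

static uint32_t rol32(uint32_t x, int r)
{
	return (x << r) | (x >> (32 - r));
}

/* One HalfSipHash round: same dataflow as SipRound, 32-bit rotates. */
static void hsip_round(uint32_t v[4])
{
	v[0] += v[1]; v[1] = rol32(v[1], 5);  v[1] ^= v[0]; v[0] = rol32(v[0], 16);
	v[2] += v[3]; v[3] = rol32(v[3], 8);  v[3] ^= v[2];
	v[0] += v[3]; v[3] = rol32(v[3], 7);  v[3] ^= v[0];
	v[2] += v[1]; v[1] = rol32(v[1], 13); v[1] ^= v[2]; v[2] = rol32(v[2], 16);
}

/* HalfSipHash-2-4: 64-bit key, 32-bit output. */
uint32_t hsiphash_2_4(const uint8_t *data, size_t len, const uint32_t key[2])
{
	uint32_t v[4] = {
		key[0], key[1],
		key[0] ^ 0x6c796765, key[1] ^ 0x74656462,
	};
	size_t left = len;
	uint32_t m;
	int i;

	while (left >= 4) {
		memcpy(&m, data, 4);	/* assumes a little-endian host */
		v[3] ^= m;
		for (i = 0; i < 2; i++)	/* c = 2 compression rounds */
			hsip_round(v);
		v[0] ^= m;
		data += 4;
		left -= 4;
	}

	/* Last block: remaining bytes plus the length in the top byte. */
	m = (uint32_t)(len & 0xff) << 24;
	while (left--)
		m |= (uint32_t)data[left] << (8 * left);
	v[3] ^= m;
	for (i = 0; i < 2; i++)
		hsip_round(v);
	v[0] ^= m;

	v[2] ^= 0xff;
	for (i = 0; i < 4; i++)	/* d = 4 finalization rounds */
		hsip_round(v);
	return v[1] ^ v[3];
}
```

The HalfSipHash1-3 variant under discussion is the same function with c = 1 and d = 3, which drops the per-message work roughly in half at the cost of security margin.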

George Spelvin

Dec 21, 2016, 8:20:04 PM
As a separate message, to disentangle the threads, I'd like to
talk about get_random_long().

After some thinking, I still like the "state-preserving" construct
that's equivalent to the current MD5 code. Yes, we could just do
siphash(current_cpu || per_cpu_counter, global_key), but it's nice to
preserve a bit more.

It requires library support from the SipHash code to return the full
SipHash state, but I hope that's a fair thing to ask for.

Here's my current straw man design for comment. It's very similar to
the current MD5-based design, but feeds all the seed material in the
"correct" way, as opposed to XORing directly into the MD5 state.

* Each CPU has a (Half)SipHash state vector,
"unsigned long get_random_int_hash[4]". Unlike the current
MD5 code, we take care to initialize it to an asymmetric state.

* There's a global 256-bit random_int_secret (which we could
reseed periodically).

To generate a random number:
* If get_random_int_hash is all-zero, seed it with a fresh half-sized
SipHash key and the appropriate XOR constants.
* Generate three words of random_get_entropy(), jiffies, and current->pid.
(This is arbitrary seed material, copied from the current code.)
* Crank through that with (Half)SipHash-1-0.
* Crank through the random_int_secret with (Half)SipHash-1-0.
* Return v1 ^ v3.

Here are the reasons:
* The first step is just paranoia, but SipHash's security promise depends
on starting with an asymmetric state, we want unique per-CPU states,
and it's a one-time cost.
* When the input words are themselves secret, there's no security
advantage, and almost no speed advantage, to doing two rounds for one
input word versus two words with one round each. Thus, SipHash-1.
* The above is not exactly true, due to the before+after XOR pattern
that SipHash uses, but I think it's true anyway.
* Likewise, there's no benefit to unkeyed finalization rounds over keyed
ones. That's why I just enlarged the global secret.
* The per-call seed material is hashed first on general principles,
because that's the novel part that might have fresh entropy.
* To the extent the initial state is secret, the rounds processing the
global secret are 4 finalization rounds for the initial state and
the per-call entropy.
* The final word(s) of the global secret might be vulnerable to analysis,
due to incomplete mixing, but since the global secret is always hashed
in the same order, and is larger than the desired security level, the
initial words should be secure.
* By carrying forward the full internal state, we ensure that repeated
calls return different results, and to the extent that the per-call
seed material has entropy, it's preserved.
* The final return is all that's needed, since the last steps in the
SipRound are "v1 ^= v2" and "v3 ^= v0". It's no security loss,
and a very minor speedup.
* Also, this avoids directly "exposing" the final XOR with the last
word of the global secret (which is made to v0).

If I'm allowed to use full SipHash, some shortcuts can be taken,
but I believe the above would be secure with HalfSipHash.

If additional performance is required, I'd consider shrinking the
global secret to 192 bits on 32-bit machines, but I want more than
128 bits of key material, and enough rounds to be equivalent to 4
finalization rounds.
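To make the straw man concrete, here is a rough user-space sketch of the state-preserving generator described above. The names, the placeholder secret, and the exact round scheduling are my illustration of the proposal, not kernel code; in the real design the state would be per-CPU and the secret reseeded from the entropy pool:

```c
#include <stdint.h>

static uint32_t rol32(uint32_t x, int r)
{
	return (x << r) | (x >> (32 - r));
}

/* One HalfSipHash round. */
static void hsip_round(uint32_t v[4])
{
	v[0] += v[1]; v[1] = rol32(v[1], 5);  v[1] ^= v[0]; v[0] = rol32(v[0], 16);
	v[2] += v[3]; v[3] = rol32(v[3], 8);  v[3] ^= v[2];
	v[0] += v[3]; v[3] = rol32(v[3], 7);  v[3] ^= v[0];
	v[2] += v[1]; v[1] = rol32(v[1], 13); v[1] ^= v[2]; v[2] = rol32(v[2], 16);
}

/* Global 256-bit secret; values here are placeholders. */
static uint32_t random_int_secret[8] = {
	0xdeadbeef, 0x8badf00d, 0xa5a5a5a5, 0x12345678,
	0x0f0f0f0f, 0xcafebabe, 0x5a5a5a5a, 0x87654321,
};

/* Per-CPU in the real design; one copy here for illustration. */
static uint32_t get_random_int_hash[4];

uint32_t get_random_int_sketch(uint32_t tstamp, uint32_t jiffies, uint32_t pid)
{
	uint32_t *v = get_random_int_hash;
	uint32_t seed[3] = { tstamp, jiffies, pid };
	int i;

	/* Lazy init to an asymmetric state, as the first step requires. */
	if (!(v[0] | v[1] | v[2] | v[3])) {
		v[0] = random_int_secret[0];
		v[1] = random_int_secret[1];
		v[2] = random_int_secret[0] ^ 0x6c796765;
		v[3] = random_int_secret[1] ^ 0x74656462;
	}

	/* Per-call seed material first: HalfSipHash-1, one round per word. */
	for (i = 0; i < 3; i++) {
		v[3] ^= seed[i];
		hsip_round(v);
		v[0] ^= seed[i];
	}

	/* Then the global secret; these double as finalization rounds. */
	for (i = 0; i < 8; i++) {
		v[3] ^= random_int_secret[i];
		hsip_round(v);
		v[0] ^= random_int_secret[i];
	}

	/* State is carried forward; only v1 ^ v3 is revealed. */
	return v[1] ^ v[3];
}
```

Because the state persists across calls, two calls with identical seed material still return different values, which is the property the carry-forward is meant to buy.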

Andy Lutomirski

Dec 21, 2016, 9:10:04 PM
On Wed, Dec 21, 2016 at 9:25 AM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin
> <li...@sciencehorizons.net> wrote:
>>
>> How much does kernel_fpu_begin()/kernel_fpu_end() cost?
>
> It's now better than it used to be, but it's absolutely disastrous
> still. We're talking easily many hundreds of cycles. Under some loads,
> thousands.
>
> And I warn you already: it will _benchmark_ a hell of a lot better
> than it will work in reality. In benchmarks, you'll hit all the
> optimizations ("oh, I've already saved away all the FP registers, no
> need to do it again").
>
> In contrast, in reality, especially with things like "do it once or
> twice per incoming packet", you'll easily hit the absolute worst
> cases, where not only does it take a few hundred cycles to save the FP
> state, you'll then return to user space in between packets, which
> triggers the slow-path return code and reloads the FP state, which is
> another few hundred cycles plus.

Hah, you're thinking that the x86 code works the way that Rik and I
want it to work, and you just made my day. :) What actually happens
is that the state is saved in kernel_fpu_begin() and restored in
kernel_fpu_end(), and it'll take a few hundred cycles best case. If
you do it a bunch of times in a loop, you *might* trigger a CPU
optimization that notices that the state being saved is the same state
that was just restored, but you're still going to pay the full restore
code each round trip no matter what.

The code is much clearer in 4.10 kernels now that I deleted the unused
"lazy" branches.

>
> Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX
> unit and keep it powered up for a while afterwards", while in real
> life you would quite easily hit the "oh, AVX is powered down because
> we were idle, now it powers up at half speed which is another latency
> hit _and_ the AVX unit won't run full out anyway".

I *think* that was mostly fixed in Broadwell or thereabouts (in terms
of latency -- throughput and power consumption still suffers).

George Spelvin

Dec 21, 2016, 11:00:04 PM
> Plus the benchmark was bogus anyway, and when I built a more specific
> harness -- actually comparing the TCP sequence number functions --
> SipHash was faster than MD5, even on register starved x86. So I think
> we're fine and this chapter of the discussion can come to a close, in
> order to move on to more interesting things.

Do we have to go through this? No, the benchmark was *not* bogus.

Here are my results from *your* benchmark. I can't reboot some of my test
machines, so I took net/core/secure_seq.c, lib/siphash.c, lib/md5.c and
include/linux/siphash.h straight out of your test tree.

Then I replaced the kernel #includes with the necessary typedefs
and #defines to make it compile in user-space. (Voluminous but
straightforward.) E.g.

#define __aligned(x) __attribute__((__aligned__(x)))
#define ____cacheline_aligned __aligned(64)
#define CONFIG_INET 1
#define IS_ENABLED(x) 1
#define ktime_get_real_ns() 0
#define sysctl_tcp_timestamps 0

... etc.

Then I modified your benchmark code into the appended code. The
differences are:
* I didn't iterate 100K times, I timed the functions *once*.
* I saved the times in a buffer and printed them all at the end
so printf() wouldn't pollute the caches.
* Before every even-numbered iteration, I flushed the I-cache
of everything from _init to _fini (i.e. all the non-library code).
This cold-cache case is what is going to happen in the kernel.

In the results below, note that I did *not* re-flush between phases
of the test. The effect of caching is clearly apparent in the tcpv4
results, where the tcpv6 code loaded the cache.

You can also see that the SipHash code benefits more from caching when
entered with a cold cache, as it iterates over the input words, while
the MD5 code is one big unrolled blob.

Order of computation is down the columns first, across second.

The P4 results were:
tcpv6 md5 cold: 4084 3488 3584 3584 3568
tcpv4 md5 cold: 1052 996 996 1060 996
tcpv6 siphash cold: 4080 3296 3312 3296 3312
tcpv4 siphash cold: 2968 2748 2972 2716 2716
tcpv6 md5 hot: 900 712 712 712 712
tcpv4 md5 hot: 632 672 672 672 672
tcpv6 siphash hot: 2484 2292 2340 2340 2340
tcpv4 siphash hot: 1660 1560 1564 2340 1564

SipHash actually wins slightly in the cold-cache case, because
it iterates more. In the hot-cache case, it loses horribly.

Core 2 duo:
tcpv6 md5 cold: 3396 2868 2964 3012 2832
tcpv4 md5 cold: 1368 1044 1320 1332 1308
tcpv6 siphash cold: 2940 2952 2916 2448 2604
tcpv4 siphash cold: 3192 2988 3576 3504 3624
tcpv6 md5 hot: 1116 1032 996 1008 1008
tcpv4 md5 hot: 936 936 936 936 936
tcpv6 siphash hot: 1200 1236 1236 1188 1188
tcpv4 siphash hot: 936 804 804 804 804

Pretty much a tie, honestly.

Ivy Bridge:
tcpv6 md5 cold: 6086 6136 6962 6358 6060
tcpv4 md5 cold: 816 732 1046 1054 1012
tcpv6 siphash cold: 3756 1886 2152 2390 2566
tcpv4 siphash cold: 3264 2108 3026 3120 3526
tcpv6 md5 hot: 1062 808 824 824 832
tcpv4 md5 hot: 730 730 740 748 748
tcpv6 siphash hot: 960 952 936 1112 926
tcpv4 siphash hot: 638 544 562 552 560

Modern processors *hate* cold caches. But notice how md5 is *faster*
than SipHash on hot-cache IPv6.

Ivy Bridge, -m64:
tcpv6 md5 cold: 4680 3672 3956 3616 3525
tcpv4 md5 cold: 1066 1416 1179 1179 1134
tcpv6 siphash cold: 940 1258 1995 1609 2255
tcpv4 siphash cold: 1440 1269 1292 1870 1621
tcpv6 md5 hot: 1372 1111 1122 1088 1088
tcpv4 md5 hot: 997 997 997 997 998
tcpv6 siphash hot: 340 340 340 352 340
tcpv4 siphash hot: 227 238 238 238 238

Of course, when you compile -m64, SipHash is unbeatable.


Here's the modified benchmark() code. The entire package is
a bit voluminous for the mailing list, but anyone is welcome to it.

static void clflush(void)
{
extern char const _init, _fini;
char const *p = &_init;

while (p < &_fini) {
asm("clflush %0" : : "m" (*p));
p += 64;
}
}

typedef uint32_t cycles_t;
static cycles_t get_cycles(void)
{
uint32_t eax, edx;
asm volatile("rdtsc" : "=a" (eax), "=d" (edx));
return eax;
}

static int benchmark(void)
{
cycles_t start, finish;
int i;
u32 seq_number = 0;
__be32 saddr6[4] = { 1, 4, 182, 393 }, daddr6[4] = { 9192, 18288, 2222222, 0xffffff10 };
__be32 saddr4 = 28888, daddr4 = 182112;
__be16 sport = 22, dport = 41992;
u32 tsoff;
cycles_t result[4];

printf("seq num benchmark\n");

for (i = 0; i < 10; i++) {
if ((i & 1) == 0)
clflush();

start = get_cycles();
seq_number += secure_tcpv6_sequence_number_md5(saddr6, daddr6, sport, dport, &tsoff);
finish = get_cycles();
result[0] = finish - start;

start = get_cycles();
seq_number += secure_tcp_sequence_number_md5(saddr4, daddr4, sport, dport, &tsoff);
finish = get_cycles();
result[1] = finish - start;

start = get_cycles();
seq_number += secure_tcpv6_sequence_number(saddr6, daddr6, sport, dport, &tsoff);
finish = get_cycles();
result[2] = finish - start;

start = get_cycles();
seq_number += secure_tcp_sequence_number(saddr4, daddr4, sport, dport, &tsoff);
finish = get_cycles();
result[3] = finish - start;

printf("* Iteration %d results:\n", i);
printf("secure_tcpv6_sequence_number_md5# cycles: %u\n", result[0]);
printf("secure_tcp_sequence_number_md5# cycles: %u\n", result[1]);
printf("secure_tcpv6_sequence_number_siphash# cycles: %u\n", result[2]);
printf("secure_tcp_sequence_number_siphash# cycles: %u\n", result[3]);
printf("benchmark result: %u\n", seq_number);
}

printf("benchmark result: %u\n", seq_number);
return 0;
}
//device_initcall(benchmark);

int
main(void)
{
memset(net_secret, 0xff, sizeof net_secret);
memset(net_secret_md5, 0xff, sizeof net_secret_md5);
return benchmark();
}

Jason A. Donenfeld

Dec 21, 2016, 11:50:06 PM
Hi George,

On Thu, Dec 22, 2016 at 4:55 AM, George Spelvin
<li...@sciencehorizons.net> wrote:
> Do we have to go through this? No, the benchmark was *not* bogus.
> Then I replaced the kernel #includes with the necessary typedefs
> and #defines to make it compile in user-space.
> * I didn't iterate 100K times, I timed the functions *once*.
> * I saved the times in a buffer and printed them all at the end
> so printf() wouldn't pollute the caches.
> * Before every even-numbered iteration, I flushed the I-cache
> of everything from _init to _fini (i.e. all the non-library code).
> This cold-cache case is what is going to happen in the kernel.

Wow! Great. Thanks for the pointers on the right way to do this. Very
helpful, and enlightening results indeed. Think you could send me the
whole .c of what you finally came up with? I'd like to run this on
some other architectures; I've got a few odd boxes laying around here.

> The P4 results were:
> SipHash actually wins slightly in the cold-cache case, because
> it iterates more. In the hot-cache case, it loses
> Core 2 duo:
> Pretty much a tie, honestly.
> Ivy Bridge:
> Modern processors *hate* cold caches. But notice how md5 is *faster*
> than SipHash on hot-cache IPv6.
> Ivy Bridge, -m64:
> Of course, when you compile -m64, SipHash is unbeatable.

Okay, so I think these results are consistent with some of the
assessments from before -- that SipHash is really just fine as a
replacement for MD5. Not great on older 32-bit x86, but not too
horrible, and the performance improvements on every other architecture
and the security improvements everywhere are a net good.

> Here's the modified benchmark() code. The entire package is
> a bit voluminous for the mailing list, but anyone is welcome to it.

Please do send! I'm sure I'll learn from reading it. Thanks again for
doing the hard work of putting something proper together.

Thanks,
Jason