If we assume that ECDHE keys are ephemeral in the sense that they are rotated once in a while, but may be reused for multiple connections, then a significant speedup is possible - not the obvious improvement of avoiding key generation per connection, but for the key compute operation, which must be done per connection anyway.
This can be achieved by pre-computation of the multiplications of the ECDH private key.
In a PoC I've written, I've seen an increase of x4 on a specific platform.
In general, the performance of ECDH_compute_key can be made to be very similar to that of ECDSA_sign, which means a x2-x5 speedup, depending on HW and existing optimizations used.
On a busy server doing hundreds of ECDH ops/sec, even a ECDH key changed every 10 seconds would benefit from what I'm suggesting.
Also, as far as I can see, that's also the default behavior in the openssl's server code, if I read the code correctly.
The code in s_server.c MAIN() calls SSL_CTX_set_tmp_ecdh, and doesn't set the SSL_OP_SINGLE_ECDH_USE flag.
If we look at what ECDH_compute_key does, the main CPU-intensive operation is a EC multiplication of a point (the peer's public key) with a scalar (the private key).
This is identical to what ECDSA_sign does, and what EC_KEY_generate_key does, except that for them the point is the group's generator.
Yet, ECDH_compute_key is x2-x5 times slower than ECDSA_sign (depending on HW, exact curve, exact algorithm, etc).
This is because the ECDSA_sign operation can use the pre-computed multiplications of the group's generator, created when the EC_KEY_precompute_mult is called. The name of EC_KEY_precompute_mult is a bit misleading - it actually has nothing to do with the EC_KEY, only the EC_GROUP associated with it.
If the ECDH key is reused, then it's possible to pre-compute values for the private key (d), in exactly the same manner as we do for the generator of the group (G). Once we have that, operations on d will be as fast as operations on G.
My PoC code (which is a complete hack) does the following:
1) moved most of the code in ec_wNAF_precompute_mult to ec_wNAF_precompute_mult_internal which pre-computes an arbitrary point, not specifically the generator
2) moved most of the code from ec_wNAF_mul to ec_wNAF_mul_internal which multiplies the scalar parameter by an explicit base_point, instead of the implicit generator of the group. as is done today. In addition it gets an optional pre-copmuted parameter, which matches the base_point
3) The ec_wNAF_precompute_mult code now just calls ec_wNAF_precompute_mult_internal with the group's generator, and saves the returned pre-compute object on the group (as before)
4) The ec_wNAF_mul function now calls the ec_wNAF_precompute_mult_internal with:
a) the generator as the base_point and the pre-comp data of the generator if the scalar parameter is not NULL (as is the case in ECDSA_sign / EC_KEY_generate_key)
b) in the specific case of NULL scalar and a single point (as is the case in ECDH_compute_key), passes the single point (which is actually the private key d) as the base point, with a pre-comp of d (which as a hack I saved on a global variable).
This showed that indeed now the ECDH_compute_key is as fast as the other operations mentioned above. I've also made sure that the ECDH shared secret computed with the optimization and without the optimization is the same.
I'm not posting the code because it's a complete hack, and based on an ancient version of openssl.
But the idea in general is good (as far as I can tell) for all EC multiplication code (wNAF, nistp256, etc) and for newer versions of openssl.
If someone is interested, I can provide similar code for a specific mul code, over a recent openssl version.
If we would want to make this concept part of the openssl code, then the general lines for the solution are as I've described above, but we'll need to find a proper place to save the pre-comp data for the private key, probably in somewhere in EC_key->method_data struct, like is done for generator EC_GROUP->method data, and expose a new API to pre-compute a general EC_KEY.
I would appreciate any comments on the ideas above from the community.