In the "simple, 4-bit tables method", the following algorithm is used (see
page 11 of http://www.cryptobarn.com/papers/gcm-spec.pdf for notation) to
compute X * H, with x_i being a 4-bit piece of X, and x_i * H * P^4i being
done with a table lookup using x_i as the index. It needs 32 tables since
there are 32 possibilities for P^4i.
Z = 0
for i = 0 to 31
Z = Z + x_i * H * P^4i
end for
return Z
My algorithm is this, with x_i the same as above.
Z_0 = Z_1 = Z_2 = Z_3 = 0
for i = 0 to 3
for j = 0 to 3
Z_j = Z_j + x_(32i+8j) * H * P^32i
Z_j = Z_j + x_(32i+8j+4) * H * P^(32i+4)
end for
end for
return ((((Z_3 * P^8) + Z_2) * P^8) + Z_1) * P^8 + Z_0
There are four P^32i and four P^(32i+4), so 8 tables are needed per key. The
last step is done using a shared table of 512 bytes, similar to Shoup's
method.