Happy to explain that design decision; it was an irritating one to make.
That encoding is one of my own invention, and the docstring for ab64_encode() is the specification. That's why the docstring uses the phrasing "identical ... except ...". It's not informally saying the encoding is similar in some way; it's laying out exactly how the encoding is formed: take base64 as defined by RFC4648, replace the index-62 character "+" with ".", and omit the padding "=" and any whitespace. The output should in all other ways be identical to the base64 spec, or it's a bug in the code. I don't feel this was treading new ground, since there's already a wide range of base64 variants
which make modifications like this, and the RFC acknowledges such modifications may be needed for various
As to why I even bothered? Sadly the answer is backwards compatibility.
My apologies for the digression, but a little background... Originally most Unix systems used the des_crypt hash to store passwords. It encoded its data using a (never formalized) scheme that was similar to base64, but used the charset "A-Za-z0-9./", no padding character, big-endian octet packing, and a character map that's completely different from base64. For some reason, md5_crypt (des_crypt's primary successor) kept the custom character map, but swapped to little-endian octet packing (a la base64), and added a crazy byte-transposition step as well. Bcrypt couldn't leave well enough alone either: it went back to des_crypt's encoding scheme and uses the same charset, but completely rearranged the character map yet again. Gaaah! It was maddening getting all those implemented in passlib, and to work up test vectors -- there are no standards or good references for any of those encodings.
So when I sat down to design a PBKDF2-based hash format, I decided I needed to retain the same character set that had been in use since des_crypt, in order to minimize the chances of storage incompatibilities. But I wanted this format to be portable and easy to reimplement, so instead of using one of des_crypt / md5_crypt / bcrypt's crazy encoding schemes, I settled on standard base64 plus the two small changes. That way, other pbkdf2_sha256 implementations could leverage their language's existing base64 routines, and just make a couple of simple character replacements afterwards. Hence the "ab64" scheme.
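To illustrate how little is needed on top of a stock base64 routine, here is a minimal Python sketch. The names to_ab64 and from_ab64 are hypothetical helpers made up for this example; this is not passlib's actual code, just the rule described above applied to the standard library's base64 module.

```python
import base64


def to_ab64(data: bytes) -> str:
    """Hypothetical helper: standard base64, then the ab64 tweaks."""
    # b64encode() emits no whitespace, so only two changes remain:
    # swap "+" for "." and drop the trailing "=" padding.
    std = base64.b64encode(data).decode("ascii")
    return std.replace("+", ".").rstrip("=")


def from_ab64(text: str) -> bytes:
    """Hypothetical helper: undo the tweaks, then decode as base64."""
    std = text.replace(".", "+")
    # Restore the padding that was stripped, so the stdlib decoder accepts it.
    std += "=" * (-len(std) % 4)
    return base64.b64decode(std)


if __name__ == "__main__":
    raw = b"\xfb\xff"
    encoded = to_ab64(raw)      # standard base64 would give "+/8="
    print(encoded)
    print(from_ab64(encoded) == raw)
```

The decoder has to put the "=" characters back only because the standard library insists on them; the information they carry is already implied by the string length.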
Regarding the padding characters, I'm not sure what you mean by "Since you omit padding it is not exactly 27, 43, 86 but at most. As you see decision to trail output is error prone."
The digest portion of a pbkdf2_sha256 hash will always be 32 raw bytes, and will always require 43 characters when encoded. The governing equation is `encoded_size = ceil(raw_size * 4/3.0)`; the rounding corresponds to 0, 2, or 4 unused bits in the final character of the resulting string. Because the size is already known, the pbkdf2_sha256 parser will throw an error if the digest is the wrong size, and the trailing "=" characters are just wasted space. Omitting the padding is allowed by RFC4648 Section 3.2, but more to the point, RFC4648 Section 5 Paragraph 3 states that if the correct length is known implicitly, then the padding can be omitted.
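As a quick sanity check of that equation, the following Python sketch computes the unpadded length for a few digest sizes and compares the formula against the standard library. The function name unpadded_b64_size is made up for this example, and I'm assuming the 27, 43, and 86 figures in the question refer to 20-, 32-, and 64-byte digests.

```python
import base64


def unpadded_b64_size(raw_size: int) -> int:
    """Hypothetical helper: ceil(raw_size * 4 / 3) in integer arithmetic."""
    return (raw_size * 4 + 2) // 3


if __name__ == "__main__":
    # Assumed digest sizes in bytes (20, 32, and 64).
    for raw_size in (20, 32, 64):
        print(raw_size, "bytes ->", unpadded_b64_size(raw_size), "characters")

    # Cross-check the formula against base64 output with "=" stripped.
    for n in range(200):
        actual = len(base64.b64encode(bytes(n)).rstrip(b"="))
        assert actual == unpadded_b64_size(n)
```

Since the length is a pure function of the digest size, a parser can reject any digest whose length differs from the expected value, which is the check described above.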
Hope that clears things up!
I've got some questions about this encoding.
Is it your own invention, or is there an RFC or other standard for it?
If it is your own invention, why did you decide to create another "b64like" standard? (Slightly shorter output is not a good reason for me.)
"the output of this function is identical to stdlib’s b64_encode, except that it uses . instead of +, and omits trailing padding = and whitepsace."
Maybe it would be better to change the word "identical" to "similar".