Ldap module and base64 encoding

Joseph L. Casale

May 24, 2013, 5:00:01 PM

to pytho...@python.org

I have some data I am working with that is not being interpreted as a string requiring
base64 encoding when sent to the ldif module for output.

The base64 string parsed is ZGV0XDMzMTB3YmJccGc= and the raw string is det\3310wbb\pg.
I'll admit my understanding of the handling requirements of non ascii data in 2.7 is weak
and as such I am failing at adjusting the regex that deduces if the string contains characters
requiring base64 encoding when being output.

Any insight, or nudges in the right direction would be appreciated!
Thanks,
jlc

Carlos Nepomuceno

May 24, 2013, 9:10:18 PM

to pytho...@python.org

Can you give an example of the code you have?

----------------------------------------
> From: jca...@activenetwerx.com
> To: pytho...@python.org
> Subject: Ldap module and base64 encoding
> Date: Fri, 24 May 2013 21:00:01 +0000

Joseph L. Casale

May 25, 2013, 11:42:14 AM

to pytho...@python.org

> Can you give an example of the code you have?

I actually just overrode the regex used by the method in the LDIFWriter class to be far more broad
about what it interprets as a safe string. I really need to properly handle reading, manipulating and
writing non ascii data to solve this...

Shame there is no ldap module (with the ldifwriter) in Python 3.
jlc

Michael Ströder

May 26, 2013, 11:07:35 AM

Joseph L. Casale wrote:
> I have some data I am working with that is not being interpreted as a string requiring
> base64 encoding when sent to the ldif module for output.
>
> The base64 string parsed is ZGV0XDMzMTB3YmJccGc= and the raw string is det\3310wbb\pg.
> I'll admit my understanding of the handling requirements of non ascii data in 2.7 is weak
> and as such I am failing at adjusting the regex that deduces if the string contains characters
> requiring base64 encoding when being output.

I'm not sure what exactly you're asking for.
Especially "is not being interpreted as a string requiring base64 encoding" is
written without giving the right context.

So I'm just guessing that this might be the usual misunderstandings with use
of base64 in LDIF. Read more about when LDIF requires base64-encoding here:

http://tools.ietf.org/html/rfc2849

To me everything looks right:

Python 2.7.3 (default, Apr 14 2012, 08:58:41) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'ZGV0XDMzMTB3YmJccGc='.decode('base64').decode('utf-8')
u'det\\3310wbb\\pg'
>>>

What do you think is a problem?

Ciao, Michael.
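
For reference, a minimal sketch of the RFC 2849 rule for when a value has to be written base64-encoded. It follows the SAFE-STRING grammar in the RFC and is not python-ldap's own code; the name needs_base64 is made up for illustration, and values are assumed to be byte strings.

```python
import re

# Bytes that RFC 2849 does not allow in a plain (non-base64) value:
#   - NUL, LF, CR, space, ':' or '<' as the first byte
#   - NUL, LF, CR or any byte above 0x7F anywhere
#   - a trailing space (the RFC says such values SHOULD be base64-encoded)
_UNSAFE = re.compile(br'^[\x00\n\r :<]|[\x00\n\r\x80-\xff]| \Z')


def needs_base64(value):
    """Return True if the byte string `value` needs base64 in LDIF."""
    return _UNSAFE.search(value) is not None


print(needs_base64(b'det\\3310wbb\\pg'))                            # False
print(needs_base64(u'Otto-Me\xdfmer-Stra\xdfe 1'.encode('utf-8')))  # True
```

Under this grammar most ASCII control characters count as safe, so a value is not base64-encoded merely because it contains one.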

Michael Ströder

May 26, 2013, 11:16:28 AM

Joseph L. Casale wrote:
>> Can you give an example of the code you have?
>
> I actually just overrode the regex used by the method in the LDIFWriter class to be far more broad
> about what it interprets as a safe string.

Are you sure that you fully understood RFC 2849 before doing this?
Which version of python-ldap are you using?

> I really need to properly handle reading, manipulating and
> writing non ascii data to solve this...

Module ldif in python-ldap does that for you based on RFC 2849.
Without seeing your code using it I cannot tell what's wrong.

> Shame there is no ldap module (with the ldifwriter) in Python 3.

1. The module ldif is stand-alone. So you could easily make it available for
Python 3.

2. "Shame" is the wrong term here. Personally I currently have no requirement
to use Python 3 and I'm quite busy with other things. So contributors are
welcome. But they should be willing to do some serious work giving continuous
support - not only a half-baked patch.

Ciao, Michael.

Joseph L. Casale

May 26, 2013, 12:19:57 PM

to Michael Ströder, pytho...@python.org

> I'm not sure what exactly you're asking for.
> Especially "is not being interpreted as a string requiring base64 encoding" is
> written without giving the right context.
>
> So I'm just guessing that this might be the usual misunderstandings with use
> of base64 in LDIF. Read more about when LDIF requires base64-encoding here:
>
> http://tools.ietf.org/html/rfc2849
>
> To me everything looks right:
>
> Python 2.7.3 (default, Apr 14 2012, 08:58:41) [GCC] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> 'ZGV0XDMzMTB3YmJccGc='.decode('base64').decode('utf-8')
> u'det\\3310wbb\\pg'
> >>>
>
> What do you think is a problem?

Michael,
Thanks for the reply. I am sure the issues are in my code. I read the ldif source file and end up
with values such as 'det\3310wbb\pg' after the base64 encoded entries are decoded.

The problem I am having is when I add this to an add/mod entry list and write it back out.
Because it does not get re-encoded to base64, the ldif file ends up with a text entry containing
a ^] character. When I re-read the file with the parser, this causes the handle method to break
midway through the entry dict, so the last half reappears disjoint, without a dn.

Like I said, I am pretty sure it's my poor understanding of decoding and encoding.
I am using the build from http://www.lfd.uci.edu/~gohlke/pythonlibs/ on a windows
2008 r2 server.

I have re-implemented handle to create a cidict holding all the dn/entry's that are parsed as
I then perform some processing such as manipulating attribute values in the entry dict. I
am pretty sure I am breaking things here. The data I am reading is coming from utf-16-le
encoded files and has Unicode characters as the source directory is globally available, being
written to in just about every country.

Is there a process for manipulating/adding data to the entry dict before I write it out that I
should adhere to? For example, if I am adding a new attribute to be composed of part of
another parsed attr for use in a modlist:

{'customAttr': ['foo.{}.bar'.format(entry['uid'])]}

By looking at the value from above, 'det\3310wbb\pg', I gather the entry dict was parsed
into byte strings. I should have decoded this, whereas some of the data is Unicode and
as such I should have encoded it?

I really appreciate the time.

Grazie per tutto,
jlc

Michael Ströder

May 26, 2013, 3:48:38 PM

Joseph L. Casale wrote:
>> I'm not sure what exactly you're asking for.
>> Especially "is not being interpreted as a string requiring base64 encoding" is
>> written without giving the right context.
>>
>> So I'm just guessing that this might be the usual misunderstandings with use
>> of base64 in LDIF. Read more about when LDIF requires base64-encoding here:
>>
>> http://tools.ietf.org/html/rfc2849
>>
>> To me everything looks right:
>>
>> Python 2.7.3 (default, Apr 14 2012, 08:58:41) [GCC] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> 'ZGV0XDMzMTB3YmJccGc='.decode('base64').decode('utf-8')
>> u'det\\3310wbb\\pg'
>> >>>
>>
>> What do you think is a problem?
>

> Thanks for the reply. I am sure the issues are in my code. I read the ldif source file and end up
> with values such as 'det\3310wbb\pg' after the base64 encoded entries are decoded.
>
> The problem I am having is when I add this to an add/mod entry list and write it back out.
> Because it does not get re-encoded to base64, the ldif file ends up with a text entry containing
> a ^] character. When I re-read the file with the parser, this causes the handle method to break
> midway through the entry dict, so the last half reappears disjoint, without a dn.
>
> Like I said, I am pretty sure it's my poor understanding of decoding and encoding.
> I am using the build from http://www.lfd.uci.edu/~gohlke/pythonlibs/ on a windows
> 2008 r2 server.
>
> I have re-implemented handle to create a cidict holding all the dn/entry's that are parsed as
> I then perform some processing such as manipulating attribute values in the entry dict. I
> am pretty sure I am breaking things here. The data I am reading is coming from utf-16-le
> encoded files and has Unicode characters as the source directory is globally available, being
> written to in just about every country.

Processing LDIF is one thing, doing LDAP operations another.

LDIF itself is meant to be ASCII-clean. But each attribute value can carry any
byte sequence (e.g. attribute 'jpegPhoto'). There's no further processing by
module LDIF - it simply returns byte sequences.

The access protocol LDAPv3 mandates UTF-8 encoding for Unicode strings on the
wire if attribute syntax is DirectoryString, IA5String (mainly ASCII) or similar.

So if your LDIF input returns UTF-16 encoded attribute values for e.g.
attribute 'cn' or 'o' or another attribute not being of OctetString or Binary
syntax something's wrong with the producer of the LDIF data.

> Is there a process for manipulating/adding data to the entry dict before I write it out that I
> should adhere to? For example, if I am adding a new attribute to be composed of part of
> another parsed attr for use in a modlist:
>
> {'customAttr': ['foo.{}.bar'.format(entry['uid'])]}
>
> By looking at the value from above, 'det\3310wbb\pg', I gather the entry dict was parsed
> into byte strings. I should have decoded this, whereas some of the data is Unicode and
> as such I should have encoded it?

I wonder what the string really is. At least the base64-encoding you provided
before decodes as UTF-8 but I'm not sure whether it's the right sequence of
Unicode code points you're expecting.

>>> 'ZGV0XDMzMTB3YmJccGc='.decode('base64').decode('utf-8')
u'det\\3310wbb\\pg'

I still can't figure out what you're really doing though. I'd recommend to
strip down your operations to a very simple test code snippet illustrating the
issue and post that here.

Ciao, Michael.
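
To make the encoding point concrete: the same text yields different byte sequences in UTF-8 and UTF-16-LE, and only the UTF-8 form is what LDAPv3 expects for string-valued attributes. A small sketch using only built-in string methods follows. The sample value and the helper name transcode_utf16le_to_utf8 are made up for illustration, and nothing here establishes which encoding the data in this thread actually has.

```python
text = u'Stra\xdfe'  # made-up sample value with one non-ASCII character

utf8_bytes = text.encode('utf-8')        # 2 bytes for the non-ASCII character
utf16_bytes = text.encode('utf-16-le')   # 2 bytes per character, NULs for ASCII

print(repr(utf8_bytes))
print(repr(utf16_bytes))


def transcode_utf16le_to_utf8(raw):
    """Re-encode a UTF-16-LE byte string as UTF-8 (hypothetical helper)."""
    return raw.decode('utf-16-le').encode('utf-8')


# The transcoded UTF-16-LE bytes match the direct UTF-8 encoding.
print(transcode_utf16le_to_utf8(utf16_bytes) == utf8_bytes)  # True
```

The NUL bytes in the UTF-16-LE form are the reason such a value cannot simply be treated as ASCII-compatible text.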

Joseph L. Casale

May 27, 2013, 1:15:01 AM

to Michael Ströder, pytho...@python.org

Hi Michael,

> Processing LDIF is one thing, doing LDAP operations another.
>
> LDIF itself is meant to be ASCII-clean. But each attribute value can carry any
> byte sequence (e.g. attribute 'jpegPhoto'). There's no further processing by
> module LDIF - it simply returns byte sequences.
>
> The access protocol LDAPv3 mandates UTF-8 encoding for Unicode strings on the
> wire if attribute syntax is DirectoryString, IA5String (mainly ASCII) or similar.
>
> So if your LDIF input returns UTF-16 encoded attribute values for e.g.
> attribute 'cn' or 'o' or another attribute not being of OctetString or Binary
> syntax something's wrong with the producer of the LDIF data.

That could be, I am using Microsoft's ldifde.exe to dump a Domino and AD directory for
comparative processing. The problem is I don't have much control over the data in
the directory and I do know that DN's have non ascii characters unique to the

> I wonder what the string really is. At least the base64-encoding you provided
> before decodes as UTF-8 but I'm not sure whether it's the right sequence of
> Unicode code points you're expecting.
>
> >>> 'ZGV0XDMzMTB3YmJccGc='.decode('base64').decode('utf-8')
> u'det\\3310wbb\\pg'
>
> I still can't figure out what you're really doing though. I'd recommend to
> strip down your operations to a very simple test code snippet illustrating the
> issue and post that here.

So I have removed all my likely broken attempts at working with this data and will
soon have some simple code but at this point I may have an indication of what is
awry with my data.

After parsing the data for a user I am simply taking a value from the ldif file and writing
it back out to another which fails, the value parsed is:

officestreetaddress:: T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ==

  File "C:\Python27\lib\site-packages\ldif.py", line 202, in unparse
    self._unparseChangeRecord(record)
  File "C:\Python27\lib\site-packages\ldif.py", line 181, in _unparseChangeRecord
    self._unparseAttrTypeandValue(mod_type,mod_val)
  File "C:\Python27\lib\site-packages\ldif.py", line 142, in _unparseAttrTypeandValue
    self._unfoldLDIFLine(':: '.join([attr_type,base64.encodestring(attr_value).replace('\n','')]))
  File "C:\Python27\lib\base64.py", line 315, in encodestring
    pieces.append(binascii.b2a_base64(chunk))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 7: ordinal not in range(128)

> c:\python27\lib\base64.py(315)encodestring()
-> pieces.append(binascii.b2a_base64(chunk))
(Pdb) l
310     def encodestring(s):
311         """Encode a string into multiple lines of base-64 data."""
312         pieces = []
313         for i in range(0, len(s), MAXBINSIZE):
314             chunk = s[i : i + MAXBINSIZE]
315  ->         pieces.append(binascii.b2a_base64(chunk))
316         return "".join(pieces)
317
318
319     def decodestring(s):
320         """Decode a string."""
(Pdb) args
s = Otto-Meßmer-Straße 1

So moving up a frame or two and looking at the entry dict, I see a modlist entry of:
('streetAddress', [u'Otto-Me\xdfmer-Stra\xdfe 1']) which is correct:

In [2]: 'T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ=='.decode('base64').decode('utf-8')
Out[2]: u'Otto-Me\xdfmer-Stra\xdfe 1'

Looking at the stack trace, I think I see the issue:
(Pdb) import base64
(Pdb) base64.encodestring(u'Otto-Me\xdfmer-Stra\xdfe 1'.encode('utf-8')).replace('\n','')
'T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ=='

I now have the exact value I started with. Ensuring that wherever I handle the original
values I return UTF-8-decoded objects for use in a modlist to write later, and
subclassing LDIFWriter and overriding _unparseAttrTypeandValue to do the encoding, has
eliminated all the errors.

What remains finally is ldifde.exe's output of what looks like U+00BF, or an inverted question
mark for some values, otherwise this issue looks solved.

Thanks for everything,
jlc
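
The round trip described above, condensed into a short sketch: decode the base64 value to bytes, decode those bytes as UTF-8 to get a text object for manipulation, then encode back to UTF-8 bytes before base64-encoding. It uses base64.b64decode and base64.b64encode rather than the Python 2 'base64' codec from the session above, and it assumes the value really is UTF-8 text.

```python
import base64

ldif_value = 'T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ=='

raw = base64.b64decode(ldif_value)   # bytes as stored in the LDIF file
text = raw.decode('utf-8')           # text object, safe to manipulate

# Encode back to UTF-8 bytes *before* handing the value to base64.
round_tripped = base64.b64encode(text.encode('utf-8')).decode('ascii')

print(round_tripped == ldif_value)  # True
```

Passing the text object itself to the base64 step, without the explicit encode, is what triggers the implicit ASCII conversion seen in the traceback.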

dieter

May 27, 2013, 2:04:16 AM

to pytho...@python.org

"Joseph L. Casale" <jca...@activenetwerx.com> writes:
> ...

> After parsing the data for a user I am simply taking a value from the ldif file and writing
> it back out to another which fails, the value parsed is:
>
> officestreetaddress:: T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ==
>
>
>   File "C:\Python27\lib\site-packages\ldif.py", line 202, in unparse
>     self._unparseChangeRecord(record)
>   File "C:\Python27\lib\site-packages\ldif.py", line 181, in _unparseChangeRecord
>     self._unparseAttrTypeandValue(mod_type,mod_val)
>   File "C:\Python27\lib\site-packages\ldif.py", line 142, in _unparseAttrTypeandValue
>     self._unfoldLDIFLine(':: '.join([attr_type,base64.encodestring(attr_value).replace('\n','')]))
>   File "C:\Python27\lib\base64.py", line 315, in encodestring
>     pieces.append(binascii.b2a_base64(chunk))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 7: ordinal not in range(128)
>
> > c:\python27\lib\base64.py(315)encodestring()
> -> pieces.append(binascii.b2a_base64(chunk))

This looks like a coding bug: "chunk" seems to be a unicode string;
"b2a_base64" expects an encoded string ("str/bytes"); as
a consequence, Python tries to convert the unicode to "str" by
encoding with its "default encoding" ("ascii" by default) - and
fails.

You could try to find out, why "chunk" is unicode (rather than "str").
That will probably bring you to the real problem.

Michael Ströder

May 27, 2013, 3:56:43 AM

Joseph L. Casale wrote:
> After parsing the data for a user I am simply taking a value from the ldif file and writing
> it back out to another which fails, the value parsed is:
>
> officestreetaddress:: T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ==
>
>
>   File "C:\Python27\lib\site-packages\ldif.py", line 202, in unparse
>     self._unparseChangeRecord(record)
>   File "C:\Python27\lib\site-packages\ldif.py", line 181, in _unparseChangeRecord
>     self._unparseAttrTypeandValue(mod_type,mod_val)
>   File "C:\Python27\lib\site-packages\ldif.py", line 142, in _unparseAttrTypeandValue
>     self._unfoldLDIFLine(':: '.join([attr_type,base64.encodestring(attr_value).replace('\n','')]))
>   File "C:\Python27\lib\base64.py", line 315, in encodestring
>     pieces.append(binascii.b2a_base64(chunk))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 7: ordinal not in range(128)

Note that all modules in python-ldap up to 2.4.10 including module 'ldif'
expect raw byte strings to be passed as arguments. It seems to me you're
passing a Unicode object in the entry dictionary which will fail in case an
attribute value contains NON-ASCII chars.

python-ldap expects raw strings since it's not schema-aware and therefore does
not have any knowledge about the LDAP syntax used for a particular attribute
type. So automagically converting Unicode strings will likely fail in many cases.
=> The calling application has to deal with it.

> I now have the exact value I started with. Ensuring that wherever I handle the original
> values I return UTF-8-decoded objects for use in a modlist to write later, and
> subclassing LDIFWriter and overriding _unparseAttrTypeandValue to do the encoding, has
> eliminated all the errors.

Don't muck with overriding _unparseAttrTypeandValue(). Simply pass the
properly encoded data into ldif module.

Ciao, Michael.
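
A sketch of what "pass the properly encoded data" could look like in the calling application: encode every text value to UTF-8 bytes just before writing, and leave values that are already byte strings alone. The helper name encode_values and the (attr_type, values) pair shape are assumptions modelled on the modlist entry quoted earlier in this thread. The sketch also assumes every text value belongs to an attribute whose syntax takes UTF-8, which only the application can know.

```python
def encode_values(modlist, encoding='utf-8'):
    """Return a copy of `modlist` with all text values encoded to bytes.

    `modlist` is assumed to be a list of (attr_type, [value, ...]) pairs.
    Values that are already byte strings are passed through unchanged.
    """
    encoded = []
    for attr_type, values in modlist:
        encoded.append((
            attr_type,
            [v if isinstance(v, bytes) else v.encode(encoding) for v in values],
        ))
    return encoded


modlist = [('streetAddress', [u'Otto-Me\xdfmer-Stra\xdfe 1'])]
print(encode_values(modlist))
```

Doing the conversion in one place at the output boundary keeps the rest of the processing free to work on text objects.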

Joseph L. Casale

May 27, 2013, 8:12:58 PM

to Michael Ströder, pytho...@python.org

> Note that all modules in python-ldap up to 2.4.10 including module 'ldif'
> expect raw byte strings to be passed as arguments. It seems to me you're
> passing a Unicode object in the entry dictionary which will fail in case an
> attribute value contains NON-ASCII chars.

Yup, I was.

> python-ldap expects raw strings since it's not schema-aware and therefore does
> not have any knowledge about the LDAP syntax used for a particular attribute
> type. So automagically converting Unicode strings will likely fail in many cases.
> => The calling application has to deal with it.

I see, that recommendation went a long way in cleaning up my code actually and making the
handling of decoding and encoding more consistent.

> Don't muck with overriding _unparseAttrTypeandValue(). Simply pass the
> properly encoded data into ldif module.

I had some time today, so I attempted to open the ldif files in binary mode to simply
work with the raw byte strings but the moment the first entry was parsed, parse()
stumbled on a character in the first entry's dict and passed a dn of None for the last half?

If the option to avoid worrying about decoding and encoding could work, I would be
happy to process the whole lot in byte strings. Any idea what may cause this?

Thanks a lot Michael,
jlc

Michael Ströder

May 28, 2013, 3:45:32 AM

Joseph L. Casale wrote:
> I had some time today, so I attempted to open the ldif files in binary mode to simply
> work with the raw byte strings but the moment the first entry was parsed, parse()
> stumbled on a character in the first entry's dict and passed a dn of None for the last half?

Without seeing the LDIF data and your code I can't tell what's going on.

> If the option to avoid worrying about decoding and encoding could work, I would be
> happy to process the whole lot in byte strings. Any idea what may cause this?

I would not claim that module 'ldif' has really awesome docs.
But did you follow the example with LDIFParser in the docs?

http://www.python-ldap.org/doc/html/ldif.html#example

It illustrates that for LDIF stream processing one basically derives a class
from ldif.LDIFParser overriding method handle(). The most basic test would be
something like this:

[..]
    def handle(self,dn,entry):
        print '***dn',repr(dn)
        pprint.pprint(entry)

And then carefully look at the output.

Ciao, Michael.