Python library

Samuel Stauffer

Oct 21, 2009, 11:38:31 PM

to BERT-RPC

BERT-RPC looks like something I've been wanting for a while. Thanks
for creating such a cool project.

I've started a couple of Python projects for BERT-RPC and Erlang
(Erlastic which will be Erlastic for Python). Currently there's only
support for Erlang External Term Format (de)serialization and BERT
(de)serialization. Next I want to add client-side support for BERT-RPC
and then look at modifying ernie to be able to host Python. Once
everything is working, the plan is to make an optimized C version of
the codecs, but the most important thing at first is to get it all
working in pure Python.

If anyone wants to help out or just take a look, here are the projects:

http://github.com/samuel/python-erlastic

http://github.com/samuel/python-bert

A couple of things that are still incomplete or undecided:

* Python has no atoms/symbols. For now I've just subclassed 'str' to
form an Atom class to use as a marker of what's an atom. They'll
behave like strings, but the serializer knows to encode them properly.

* Same with binary types. Basically, in Python 2.x strings and binaries
are the same. I was thinking of using 'buffer' for binaries, but it
has its own incompatibilities if you're expecting a str. For now I did
the same as Atom and created a Binary class. For both of these I'm open
to better solutions. Side note: unicode gets encoded as a list of
integers. I believe this is the best way to handle it. If the code
calling the library wants a specific string encoding, it can
encode everything before passing it to the encoder.

* In CPython you can't serialize a compiled regex. The regex instance
doesn't expose the expression or the options. I'm not sure of the best
way around this either. For now you can deserialize regexes but not
serialize them.
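
The marker-class idea from the first two bullets can be sketched roughly as follows. This is only an illustration written for Python 3 (where Binary would subclass bytes; in 2.x both would subclass str), and erlang_kind is a made-up helper, not a function from python-erlastic:

```python
class Atom(str):
    """Marker type: behaves like a string, but tells an encoder to emit an atom."""

    def __repr__(self):
        return "Atom(%s)" % str.__repr__(self)


class Binary(bytes):
    """Marker type: behaves like bytes, but tells an encoder to emit a binary."""

    def __repr__(self):
        return "Binary(%s)" % bytes.__repr__(self)


def erlang_kind(value):
    """Return which Erlang type a hypothetical encoder would pick for value."""
    # Check the marker subclasses before their base types.
    if isinstance(value, Atom):
        return "atom"
    if isinstance(value, Binary):
        return "binary"
    if isinstance(value, (str, bytes)):
        return "byte list"
    if isinstance(value, list):
        return "list"
    raise TypeError("unsupported type: %r" % type(value))
```

Because Atom is a str subclass, Atom("ok") == "ok" still holds, so calling code can treat atoms as ordinary strings.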

-
Samuel

Paul Davis

Oct 22, 2009, 12:22:16 AM

to bert...@googlegroups.com

On Wed, Oct 21, 2009 at 11:38 PM, Samuel Stauffer <sam...@lefora.com> wrote:
>
> BERT-RPC looks like something I've been wanting for a while. Thanks
> for creating such a cool project.
>
> I've started a couple of Python projects for BERT-RPC and Erlang
> (Erlastic which will be Erlastic for Python). Currently there's only
> support for Erlang External Term Format (de)serialization and BERT
> (de)serialization. Next I want to add client-side support for BERT-RPC
> and then look at modifying ernie to be able to host Python. Once
> everything is working, the plan is to make an optimized C version of
> the codecs, but the most important thing at first is to get it all
> working in pure Python.
>
> If anyone wants to help out or just take a look, here are the projects:
>
> http://github.com/samuel/python-erlastic
>
> http://github.com/samuel/python-bert
>
> A couple of things that are still incomplete or undecided:
>
> * Python has no atoms/symbols. For now I've just subclassed 'str' to
> form an Atom class to use as a marker of what's an atom. They'll
> behave like strings, but the serializer knows to encode them properly.
>

I was contemplating this the other day. That's the best solution I
came up with as well.

> * Same with binary types. Basically, in Python 2.x strings and binaries
> are the same. I was thinking of using 'buffer' for binaries, but it
> has its own incompatibilities if you're expecting a str. For now I did
> the same as Atom and created a Binary class. For both of these I'm open
> to better solutions. Side note: unicode gets encoded as a list of
> integers. I believe this is the best way to handle it. If the code
> calling the library wants a specific string encoding, it can
> encode everything before passing it to the encoder.
>

I would treat non-atom strings as binaries. There's no reason to
subclass here AFAICT. As for Unicode as lists of ints, I'd rather see a
UTF-8 default or similar. That way you can round trip without
lots of horridness. Granted, in my quick read of BERT I didn't
see any encoding specification. Depending on who you ask and which
module's documentation you look at, either a list of integers or a
binary as UTF-8 is acceptable. Read: no clear winner on which is
best. Though, would unicode -> bert -> python come back as a list of ints?

> * In CPython you can't serialize a compiled regex. The regex instance
> doesn't expose the expression or the options. I'm not sure of the best
> way around this either. For now you can deserialize regexes but not
> serialize them.

Firstly, I thought regex instances had a pattern attribute. I haven't had
to investigate the C API for those instances though. Secondly, regex
serialization just amuses me for some reason. The differences in
syntax and options ought to be fun to reconcile :D

I'll take a closer look at those Python libs in the next day or two,
but overall it sounds like the right direction.

Paul J. Davis

Samuel Stauffer

Oct 22, 2009, 12:50:51 AM

to bert...@googlegroups.com

For string/atom encoding, here's how it currently works.

From Erlang to Python:

atom -> Atom("str")

8-bit/byte list -> str

list -> list

binary -> Binary("str")

From Python to Erlang:

str -> 8-bit/byte list

Atom("str") -> atom

list -> list

Binary("str") -> binary

Mainly I added the Binary and Atom classes to tell the encoder to make the value a binary or an atom on the Erlang side. It's important to be able to pass both byte strings and binaries.

Another option I thought of, and your comment on encoding makes me think it might even be better, is to always have str be a binary and have unicode encode to a byte string. This would more closely match how Python 3k looks at the world, and it eliminates yet another marker class.

For unicode, I mainly thought to use integer lists as that's how Erlang handles Unicode. Then again, Erlang's string handling isn't exactly sane.. heh. I wouldn't be opposed to adding a default_encoding argument to the encoder/decoder. If given then encode/decode to/from byte lists, but if it's None then use an integer list. Then again, you're probably right and it should just encode always. That's probably the most common case.

Hmm.. thinking about it now I think I'll change it to be:

Python str -> Erlang binary

Python unicode -> Erlang string .. encoded by default to utf-8 but configurable
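
A minimal sketch of that configurable behaviour, written for Python 3 (where text is str rather than unicode). The function names and the encoding parameter are illustrative only and are not part of python-erlastic:

```python
def encode_text(text, encoding="utf-8"):
    """Turn a text string into the value handed to the term encoder.

    With an encoding, return the encoded bytes. With encoding=None, fall back
    to a list of integer code points, the way Erlang represents strings.
    """
    if encoding is None:
        return [ord(ch) for ch in text]
    return text.encode(encoding)


def decode_text(data, encoding="utf-8"):
    """Inverse of encode_text for both representations."""
    if encoding is None:
        return "".join(chr(cp) for cp in data)
    return bytes(data).decode(encoding)
```

For example, encode_text("\u00e9") gives the two UTF-8 bytes b"\xc3\xa9", while encode_text("\u00e9", encoding=None) gives the single code point [233].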

I'm somewhat surprised by the inclusion of a regex type. The options mapped pretty well to Python (though I would have liked to see a 'dotall' option), but I'm not sure I'll ever need regexes over RPC myself. If you find a way to get info about a compiled regex I would love to know. I couldn't figure it out. If all else fails it might be possible using a C extension (though that would certainly be a hack). Then again, I'm not too motivated to figure it out, but I definitely welcome any patches.

Samuel Stauffer

Oct 22, 2009, 1:05:27 AM

to BERT-RPC

Err, sorry. You're right about regex objects having a pattern
attribute (and flags). I was looking at dir(..) and it wasn't showing
up.

Thanks for pointing that out. I'll add support for it.
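
For reference, a small sketch of reading those two attributes back from a compiled regex with the standard re module. The example pattern and the variable names are arbitrary, and how the flags would map onto BERT regex options is left out:

```python
import re

compiled = re.compile(r"^h(.+)o$", re.IGNORECASE | re.MULTILINE)

# The source expression and the option bits are available as attributes.
source = compiled.pattern
caseless = bool(compiled.flags & re.IGNORECASE)
multiline = bool(compiled.flags & re.MULTILINE)
dotall = bool(compiled.flags & re.DOTALL)

# That is enough to rebuild an equivalent regex elsewhere.
rebuilt = re.compile(source, compiled.flags)
```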

Paul Davis

Oct 22, 2009, 1:34:55 AM

to bert...@googlegroups.com

Samuel,

I would make the mappings something like:

Erlang -> Python
atom() -> Atom()
str() (code 107) -> str()
binary() -> str()
list() -> list()

Python -> Erlang
str() -> binary()
Atom() -> atom()
list() -> list()

No Python Binary class needed. The 107 type is IMO just implementation
leaking through the abstraction. Binaries are lists of 8-bit values, so
it's a duplication of concerns and icky. Such is life.

In terms of Unicode, I'd rather see Python's unicode() --(UTF-8)->
binary(), where UTF-8 is the default (but configurable) encoding for
serialization. I wrote an Erlang JSON parser the other week to be
compliant with Douglas Crockford's idea of JSON. Passing any of that
through the Erlang xmerl routines will most assuredly cause lots and
lots of havoc. This is important because Python does similar things
with Unicode. I.e., Python is lax in its Unicode workings whereas
Erlang is not. Thus, binaries are probably a better signal of "binary
data, interpret at your own risk".

Concrete example:

Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\uFFFF".encode("utf-8")
'\xef\xbf\xbf'
>>> u"\uFFFF".encode("utf-16")
'\xff\xfe\xff\xff'

Erlang R13B01 (erts-5.7.2) [source] [64-bit] [smp:2:2] [rq:2]
[async-threads:0] [kernel-poll:false]

Eshell V5.7.2 (abort with ^G)
1> xmerl_ucs:from_utf8(<<254, 191, 191>>).
** exception exit: {ucs,{bad_utf8_character_code}}
in function xmerl_ucs:from_utf8/1
2> xmerl_ucs:from_utf16be(<<255, 254, 255, 255>>).
** exception error: no true branch found when evaluating an if expression
in function xmerl_ucs:from_utf16be/3
3>

Also, Unicode hurts my brain.

Paul J. Davis

Samuel Stauffer

Oct 22, 2009, 2:17:51 AM

to bert...@googlegroups.com

After writing everything below I realized that there's already a way to serialize complex data types: BERT. heh. We can just add a new atom such as 'unicode' and either standardize on an encoding or specify it in the tuple, e.g. {bert, unicode, EncodedString OR ListOfInts OR Whatever}

I still think the default handling of the byte list (107) type is important.

I'm starting to think that doing things in the way that's least likely to cause unexpected results is best. Instead of converting a byte list (107) to a str, it should be converted to a list of ints.

The reason is that it could cause hard-to-find bugs if a type suddenly changed. Take for instance a service that returns a list of ints: [432, 522, 14]. It would work fine on the client side receiving a list of ints, but if the list happens to contain only numbers less than 256, then all of a sudden you get back a str (if byte lists were converted to str).

There's also the issue that you can encode the same value multiple ways. So you could encode [1,2,3] as a list of ints or as a byte list. Erlang looks at them the same way so I think the other languages should as well.
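
To make the two encodings concrete, here is a rough sketch of decoding [1,2,3] from both wire forms into the same Python list. The tag values are those of the Erlang external term format; decode_int_list is a made-up helper that only handles small integers in proper lists, not a general decoder:

```python
import struct

VERSION = 131            # leading version byte of the external term format
SMALL_INTEGER_EXT = 97   # one unsigned byte
NIL_EXT = 106            # empty list, also the tail of a proper list
STRING_EXT = 107         # 2-byte length followed by raw bytes
LIST_EXT = 108           # 4-byte length, the elements, then a tail


def decode_int_list(data):
    """Decode a STRING_EXT or a LIST_EXT of small integers to a list of ints."""
    if data[0] != VERSION:
        raise ValueError("bad version byte")
    tag = data[1]
    if tag == STRING_EXT:
        (length,) = struct.unpack(">H", data[2:4])
        return list(data[4:4 + length])
    if tag == LIST_EXT:
        (length,) = struct.unpack(">I", data[2:6])
        pos, items = 6, []
        for _ in range(length):
            if data[pos] != SMALL_INTEGER_EXT:
                raise ValueError("sketch only handles small integers")
            items.append(data[pos + 1])
            pos += 2
        if data[pos] != NIL_EXT:
            raise ValueError("sketch only handles proper lists")
        return items
    raise ValueError("unsupported tag %d" % tag)


# The same Erlang term [1,2,3], serialized two different ways.
as_byte_list = bytes([131, 107, 0, 3, 1, 2, 3])
as_plain_list = bytes([131, 108, 0, 0, 0, 3, 97, 1, 97, 2, 97, 3, 106])
```

Both inputs decode to the same list of ints, so the client never sees the type change depending on the values.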

Paul Davis

Oct 22, 2009, 11:37:17 AM

to bert...@googlegroups.com

Samuel,

Hmm. I do see your point on coercing lists of small ints to binaries
being weird. Looking back at the mappings maybe we can do something
like this:

From Erlang to Python:
atom() -> Atom()
binary() -> str()
107_list() -> list
list -> list

From Python to Erlang
Atom() -> atom()
list -> list
str() -> binary

This way we push the weird byte list vs list of anything to the Erlang
side. If people care enough about the list of bytes case for some
reason, exposing an optional bytelist() type to Python that would
check all operations for data type and range may be the best bet.
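
Something like the following could serve as that optional type. It is only a sketch for Python 3: the ByteList name is made up, and it guards construction, append and item assignment but not every mutating list method (extend, insert and += would need the same treatment):

```python
class ByteList(list):
    """A list that only accepts integers in the range 0..255."""

    @staticmethod
    def _check(value):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("expected an int, got %r" % type(value))
        if not 0 <= value <= 255:
            raise ValueError("byte out of range: %d" % value)
        return value

    def __init__(self, values=()):
        super().__init__(self._check(v) for v in values)

    def append(self, value):
        super().append(self._check(value))

    def __setitem__(self, index, value):
        if isinstance(index, slice):
            value = [self._check(v) for v in value]
        else:
            self._check(value)
        super().__setitem__(index, value)
```

An encoder could then emit the 107 byte-list form only for ByteList instances and use a plain list encoding for everything else.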

My biggest point is that Erlang binaries should be mapped directly to
Python strings and vice versa. The important part here is that I'm
pretty certain the 107 byte-list type is just a serialization
optimization; the lists still use four bytes per element in memory.
This leads to lots of issues down the road, with people accidentally
pushing data from Python and ending up with a 4x explosion when they
hit the Erlang side because of a weird coercion they didn't notice (in
Python). I.e., it's the same as the Python Unicode hell that led to a
splitting of that hierarchy. Oh, and it'd make sense going forward
when we have the bytes class instead of str.

As to {unicode, ...} I would definitely support something like
{unicode, binary(), [Options]}. It worries me a bit for languages that
don't have a native unicode type and can't easily subclass the native
string to at least provide round tripping though. But after
contemplating, I don't see how Python would be able to round trip
without the info available for deserialization.
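
As a purely hypothetical illustration of how such a tagged term would carry the info needed to round trip, here is a Python 3 sketch. The tuple shape, the encoding option and the helper names are all assumptions for the example, not an agreed BERT type:

```python
class Atom(str):
    """Marker for values that should become Erlang atoms."""


def wrap_unicode(text, encoding="utf-8"):
    """Build a hypothetical {unicode, Binary, Options} term as a Python tuple."""
    options = [(Atom("encoding"), Atom(encoding))]
    return (Atom("unicode"), text.encode(encoding), options)


def unwrap_unicode(term):
    """Recover the text from a term built by wrap_unicode."""
    tag, payload, options = term
    if not isinstance(tag, Atom) or tag != "unicode":
        raise ValueError("not a unicode term")
    # The options list says how the payload was encoded.
    encoding = dict(options).get("encoding", "utf-8")
    return payload.decode(str(encoding))
```

Without the tag, the decoder would only see a binary and could not tell whether the sender meant text or raw bytes.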

HTH,
Paul Davis

Samuel Stauffer

Oct 22, 2009, 1:58:42 PM

to bert...@googlegroups.com

That mapping looks good to me.

I guess it's up to Tom to figure out what to do with unicode if it's going to be a BERT complex type. For the Erlang encoder I'll just change it to encode unicode and emit it as a binary by default. Users of the class can either subclass or pass in arguments to change the behavior. The BERT encoder can handle it differently and pass a different primitive to the Erlang encoder once the best way to handle it has been decided.

Paul Davis

Oct 22, 2009, 2:08:17 PM

to bert...@googlegroups.com

Samuel,

Sounds good to me. If I have time this weekend I'll write a C module
to pass your tests. Also, +1 on how you've split the modules like
that.

Paul Davis

Tom Preston-Werner

Oct 27, 2009, 2:24:32 PM

to BERT-RPC

On Oct 22, 10:58 am, Samuel Stauffer <sam...@lefora.com> wrote:
> That mapping looks good to me.
> I guess it's up to Tom to figure out what to do with the unicode if it'll be
> a bert complex type.

I've added a proposal for a complex string type outlined here:

http://groups.google.com/group/bert-rpc/browse_thread/thread/b3ccda7b76a3a631

Tom
