I was contemplating this the other day. That's the best solution I
came up with as well.
> * Same with binary types. Basically in Python 2.x strings and binaries
> are the same. I was thinking of using 'buffer' for binaries, but it
> has its own incompatibilities if you're expecting a str. For now I did
> the same as Atom and created a Binary class. For both of these I'm up
> for better solutions. Sidenote, unicode gets encoded as a list of
> integers. I believe this is the best way to handle this. If the code
> calling the library wants a specific string encoding then it can
> encode everything before passing it to the encoder.
>
I would treat non-atom strings as binaries. There's no reason to
subclass here AFAICT. As for Unicode as lists of ints, I'd rather see a
UTF-8 default or similar. That way you can round trip without lots of
horridness. Granted, in my quick read of BERT I didn't see any
encoding specification. Depending on who you ask and which
module's documentation you look at, either a list of integers or
a UTF-8 binary is acceptable. Read: no clear winner on which is
best. Though, does unicode -> bert -> python == list of ints?
> * In CPython you can't serialize a compiled regex. The regex instance
> doesn't expose the expression nor the options. I'm not sure of the best
> way around this either. For now you can deserialize regexes but not
> serialize them.
Firstly, I thought regex instances had a pattern attribute. I haven't had
to investigate the C API for those instances though. Secondly, regex
serialization just amuses me for some reason. The differences in
syntax and options between Python and Erlang ought to be fun to reconcile :D
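For what it's worth, a quick check suggests compiled patterns expose both the source and the options as attributes (pattern and flags). A minimal sketch, written against Python 3; regex_to_parts is a hypothetical helper name, not part of any BERT library, and translating Python flags into Erlang re options is deliberately left out:

```python
import re


def regex_to_parts(compiled):
    """Return (source, flags) for a compiled pattern.

    Hypothetical helper, for illustration only.
    """
    return compiled.pattern, compiled.flags


rx = re.compile(r"a+b", re.IGNORECASE)
source, flags = regex_to_parts(rx)

# Rebuilding from the two parts gives an equivalent pattern object.
rebuilt = re.compile(source, flags)
```

Those two values are all a serializer would need from the Python side; the hard part is still agreeing on what they mean in Erlang.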
I'll take a closer look at those Python libs in the next day or two,
but overall it sounds like the right direction.
Paul J. Davis
I would make the mappings something like:
Erlang -> Python
atom() -> Atom()
str() (code 107) -> str()
binary() -> str()
list() -> list()
Python -> Erlang
str() -> binary()
Atom() -> atom()
list() -> list()
No Python Binary class needed. The 107 type is IMO just implementation
leaking through the abstraction. Binaries are lists of 8-bit values, so
it's a duplication of concerns and icky. Such is life.
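Roughly what the Python -> Erlang half of that mapping could look like, as a sketch rather than a real encoder. The tag numbers are the ones from the Erlang external term format; Atom and encode_term are hypothetical names, and the code is written against Python 3, where bytes plays the role of the 2.x str:

```python
import struct

# Tag bytes from the Erlang external term format.
SMALL_INTEGER_EXT = 97
ATOM_EXT = 100
NIL_EXT = 106
LIST_EXT = 108
BINARY_EXT = 109


class Atom(str):
    """Hypothetical marker type: a string that should become an Erlang atom."""


def encode_term(term):
    """Encode one Python value following the mapping proposed above."""
    if isinstance(term, Atom):
        name = term.encode("latin-1")
        return bytes([ATOM_EXT]) + struct.pack(">H", len(name)) + name
    if isinstance(term, bytes):
        # Byte strings always become binaries, never type 107.
        return bytes([BINARY_EXT]) + struct.pack(">I", len(term)) + term
    if isinstance(term, int) and not isinstance(term, bool) and 0 <= term <= 255:
        return bytes([SMALL_INTEGER_EXT, term])
    if isinstance(term, list):
        if not term:
            return bytes([NIL_EXT])
        body = b"".join(encode_term(item) for item in term)
        # A proper list is its elements followed by a NIL tail.
        header = bytes([LIST_EXT]) + struct.pack(">I", len(term))
        return header + body + bytes([NIL_EXT])
    raise TypeError("unsupported type in this sketch: %r" % type(term))
```

Note that nothing here ever emits type 107; a list of small ints goes out as an ordinary list.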
In terms of Unicode, I'd rather see Python's unicode() --(UTF-8)->
binary(), where UTF-8 is the default (but configurable) encoding for
serialization. I wrote an Erlang JSON parser the other week to be
compliant with Douglas Crockford's idea of JSON. Passing any of that
through the Erlang xmerl routines will most assuredly cause lots and
lots of havoc. This is important because Python does similar things
with Unicode. I.e., Python is lax in its Unicode handling whereas
Erlang is not. Thus, binaries are probably a better signal of "binary
data, interpret at your own risk".
Concrete example:
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\uFFFF".encode("utf-8")
'\xef\xbf\xbf'
>>> u"\uFFFF".encode("utf-16")
'\xff\xfe\xff\xff'
Erlang R13B01 (erts-5.7.2) [source] [64-bit] [smp:2:2] [rq:2]
[async-threads:0] [kernel-poll:false]
Eshell V5.7.2 (abort with ^G)
1> xmerl_ucs:from_utf8(<<254, 191, 191>>).
** exception exit: {ucs,{bad_utf8_character_code}}
in function xmerl_ucs:from_utf8/1
2> xmerl_ucs:from_utf16be(<<255, 254, 255, 255>>).
** exception error: no true branch found when evaluating an if expression
in function xmerl_ucs:from_utf16be/3
3>
Also, Unicode hurts my brain.
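To make the "UTF-8 by default, but configurable" idea concrete, here is a minimal sketch. text_to_binary is a hypothetical name, not an existing API, and the code is written against Python 3, where str is the Unicode type:

```python
def text_to_binary(value, encoding="utf-8"):
    """Return the bytes that would be sent as an Erlang binary.

    Byte strings pass through untouched; text is encoded with `encoding`.
    """
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


payload = text_to_binary(u"caf\u00e9")            # default UTF-8
legacy = text_to_binary(u"caf\u00e9", "latin-1")  # caller-chosen encoding
```

The receiving side only ever sees bytes, which is the point: it has to know (or be told) the encoding before interpreting them.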
Paul J. Davis
Hmm. I do see your point on coercing lists of small ints to binaries
being weird. Looking back at the mappings maybe we can do something
like this:
From Erlang to Python:
atom() -> Atom()
binary() -> str()
107_list() -> list()
list() -> list()
From Python to Erlang:
Atom() -> atom()
list() -> list()
str() -> binary()
This way we push the weird byte-list vs. list-of-anything distinction to
the Erlang side. If people care enough about the list of bytes case for some
reason, exposing an optional bytelist() type to Python that would
check all operations for data type and range may be the best bet.
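The Erlang -> Python direction of that mapping, sketched as a tiny decoder. Tag numbers are from the Erlang external term format; Atom and decode_term are hypothetical names, only the tags discussed here are handled, and the code is written against Python 3 (bytes in place of the 2.x str):

```python
import struct

SMALL_INTEGER_EXT = 97
ATOM_EXT = 100
NIL_EXT = 106
STRING_EXT = 107
LIST_EXT = 108
BINARY_EXT = 109


class Atom(str):
    """Hypothetical marker type for Erlang atoms on the Python side."""


def decode_term(data, pos=0):
    """Decode one term starting at data[pos]; return (value, next_pos)."""
    tag = data[pos]
    pos += 1
    if tag == SMALL_INTEGER_EXT:
        return data[pos], pos + 1
    if tag == ATOM_EXT:
        (size,) = struct.unpack_from(">H", data, pos)
        pos += 2
        return Atom(data[pos:pos + size].decode("latin-1")), pos + size
    if tag == BINARY_EXT:
        (size,) = struct.unpack_from(">I", data, pos)
        pos += 4
        return data[pos:pos + size], pos + size
    if tag == STRING_EXT:
        # Type 107: a compact byte list. Surface it as a plain list of ints.
        (size,) = struct.unpack_from(">H", data, pos)
        pos += 2
        return list(data[pos:pos + size]), pos + size
    if tag == NIL_EXT:
        return [], pos
    if tag == LIST_EXT:
        (count,) = struct.unpack_from(">I", data, pos)
        pos += 4
        items = []
        for _ in range(count):
            item, pos = decode_term(data, pos)
            items.append(item)
        tail, pos = decode_term(data, pos)  # NIL for a proper list
        if tail != []:
            raise ValueError("improper lists are not handled in this sketch")
        return items, pos
    raise ValueError("unsupported tag in this sketch: %d" % tag)
```

Both 107 and the general list tag come out as a Python list, so the caller never has to know which one Erlang picked on the wire.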
My biggest point is that Erlang binaries should be mapped directly to
Python strings and vice versa. The important part here is that I'm
pretty certain the 107 byte-list type is actually just a serialization
optimization; the lists still use four bytes per element in memory.
This leads to lots of issues down the road, with people accidentally
pushing data from Python and ending up with a 4x explosion when they
hit the Erlang side because of a weird coercion they didn't notice (in
Python). I.e., it's the same as the Python Unicode hell that led to the
splitting of that hierarchy. Oh, and it'd make sense going forward
when we have the bytes class instead of str.
As to {unicode, ...} I would definitely support something like
{unicode, binary(), [Options]}. It worries me a bit for languages that
don't have a native unicode type and can't easily subclass the native
string to at least provide round tripping though. But after
contemplating, I don't see how Python would be able to round trip
without the info available for deserialization.
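A sketch of how the tagged form could round trip on the Python side. Everything here is an assumption for illustration: the Atom class, the function names, and in particular the layout of the options list (a single {encoding, Name} pair), which nothing in this thread has settled. Written against Python 3:

```python
class Atom(str):
    """Hypothetical marker type for Erlang atoms."""


def tag_unicode(text, encoding="utf-8"):
    """Wrap text as a {unicode, Binary, Options}-style tuple."""
    options = [(Atom("encoding"), Atom(encoding))]  # assumed option layout
    return (Atom("unicode"), text.encode(encoding), options)


def untag_unicode(term):
    """Recover the original text from a tuple built by tag_unicode."""
    tag, data, options = term
    if tag != "unicode":
        raise ValueError("not a unicode-tagged term")
    encoding = dict(options).get(Atom("encoding"), "utf-8")
    return data.decode(encoding)
```

Without the tag and the encoding option, the decoder would only have the bytes and no way to tell text from arbitrary binary data.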
HTH,
Paul Davis
Sounds good to me. If I have time this weekend I'll write a C module
to pass your tests. Also, +1 on how you've split the modules like
that.
Paul Davis