I've been working on the same issue. So far it has mostly been just
researching various options, but I can give you my two cents...
It really depends on your goals and constraints. I have narrowed it down
to two major families of serialization for storage and networking. One
is the JSON/YAML/XML style, where you generate a serialized version of
data structures primarily based on vectors and hashes that contain only
simple data types. (Note, JSON is essentially a subset of YAML, so a YAML
parser can read JSON but not vice versa.) This is by far the fastest to
develop and the most lightweight in terms of programmer time.
Basically one line each for read/write. The potential hidden cost
depends on what data structures you use in your program. If you have
clearly defined chunks of data to serialize, YAML works nicely, but for
more complex structures you often have to do an intermediate conversion
to simpler data structures where you deal by hand with things like
circular references and pointers to ephemeral data that you don't want
serialized.
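
To make the "one line each" point and the hidden cost concrete, here is a
rough sketch in Python using the standard json module (JSON standing in for
YAML, since it needs no extra library). The Node class and the to_plain and
from_plain helpers are made up for illustration; they show the kind of
hand-written intermediate conversion I mean, not a recommended design.

```python
import json

# Plain dicts/lists of simple types round-trip in one call each way.
config = {"name": "worker-1", "ports": [8080, 8081], "debug": False}
text = json.dumps(config)
restored = json.loads(text)

# Hypothetical application class with a back-reference (a cycle) and a
# piece of ephemeral state that should not be written out.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # circular: child -> parent -> children
        self.children = []
        self.cache = object()     # ephemeral, not JSON-serializable
        if parent is not None:
            parent.children.append(self)

def to_plain(node):
    """Hand-written intermediate form: drop the cycle and the cache."""
    return {"name": node.name,
            "children": [to_plain(c) for c in node.children]}

def from_plain(data, parent=None):
    """Rebuild the objects, restoring parent links as we go."""
    node = Node(data["name"], parent)
    for child in data["children"]:
        from_plain(child, node)
    return node

root = Node("root")
Node("a", root)
Node("b", root)
tree_text = json.dumps(to_plain(root))
rebuilt = from_plain(json.loads(tree_text))
```

Passing root straight to json.dumps would fail, because a Node is not one of
the simple types the encoder understands. The conversion functions are where
the extra programmer time goes.
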
The previous options are, however, inefficient for storage, transmission,
and parsing in comparison to a more strictly defined protocol. If you
need raw performance and you are willing to spend the effort defining
your protocol, then I think something like Google Protocol Buffers
or Facebook Thrift are good options. They are basically the new-school
versions of CORBA RPC. In essence, you define a schema for your
messages or data serialization units, and then some tools generate
classes or functions that are used to read/write and transmit this
data. (SOAP pretty much works the same way, but it idiotically sits on
XML too, so you get the worst of both worlds...) Again, if your data
units to be serialized are self-contained, this can work pretty smoothly,
but in more complex structures you will also have to convert between the
simple, generated classes and your more complex application classes.
The real work, though, is in creating and maintaining your protocol
definitions and the code that uses the generated classes.
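
To give a feel for the size difference, here is a small sketch comparing a
JSON encoding with a fixed binary layout. Note this is NOT Protocol Buffers
or Thrift: it is a hand-rolled stand-in using Python's standard struct
module, and the message fields and the READING_FORMAT layout are invented
for the example. The real tools generate the read/write code for you and
use their own wire formats.

```python
import json
import struct

# Hypothetical message: a sensor reading with three fields.
reading = {"sensor_id": 42, "timestamp": 1700000000, "value": 21.5}

# The "schema", agreed on ahead of time by both sides: unsigned 32-bit id,
# unsigned 64-bit timestamp, 64-bit float, all in network byte order.
READING_FORMAT = "!IQd"

def encode_reading(r):
    """Pack the fields in schema order; field names never hit the wire."""
    return struct.pack(READING_FORMAT,
                       r["sensor_id"], r["timestamp"], r["value"])

def decode_reading(data):
    """Unpack the fields and reattach the names from the schema."""
    sensor_id, timestamp, value = struct.unpack(READING_FORMAT, data)
    return {"sensor_id": sensor_id, "timestamp": timestamp, "value": value}

binary = encode_reading(reading)                # 4 + 8 + 8 = 20 bytes
text = json.dumps(reading).encode("utf-8")      # field names repeated as text

print(len(binary), len(text))
```

The binary form is smaller because the field names and types live in the
schema rather than in every message. The price is exactly the one described
above: both ends have to keep that schema in sync.
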
I think the default for a language like Clojure should be YAML too. For
dynamic languages where developer time is the focus it is by far the
quickest mechanism to get up and running using databases, configuration
files, networking, etc. Maybe we should look into integrating the
built-in Clojure data types with a YAML library, or otherwise creating a
new one, so we can dump and load directly between serialized strings and
Clojure data structures.
If you run up against the limits of YAML, then I would go with Protocol
Buffers. They seem like a clean and efficient way to support
multi-language communication without wasting time writing a bunch of
custom serialization methods. It would be interesting if there were a
way to sort of generate .proto files by example, by sniffing YAML on the
wire or something... It could at least help bootstrap the protocol
definition phase.
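
As a very rough sketch of what "by example" could look like, the following
takes one sample message and prints a proto3-style message definition.
Everything here is an assumption for illustration: the proto_type mapping,
the sketch_proto helper, and the limits (flat messages with scalars and
lists of scalars only, field numbers assigned in sorted key order). The
sample is parsed as JSON to stay in the standard library. As far as I know
no existing tool works this way.

```python
import json

def proto_type(value):
    """Guess a proto3 scalar type from a sample value (assumed mapping)."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value: {value!r}")

def sketch_proto(message_name, sample):
    """Emit a proto3-style message for a flat dict of scalars and lists."""
    lines = [f"message {message_name} {{"]
    for number, (key, value) in enumerate(sorted(sample.items()), start=1):
        if isinstance(value, list):
            if not value:
                raise ValueError(f"cannot infer element type of {key!r}")
            lines.append(f"  repeated {proto_type(value[0])} {key} = {number};")
        else:
            lines.append(f"  {proto_type(value)} {key} = {number};")
    lines.append("}")
    return "\n".join(lines)

sample = json.loads('{"name": "worker-1", "ports": [8080, 8081], "debug": false}')
proto_text = sketch_proto("Worker", sample)
print(proto_text)
```

A single sample cannot tell you which fields are optional or how wide an
integer really needs to be, so the output would only be a starting point to
edit by hand.
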
Hopefully that helps.
-Jeff