CBOR / JSON don't show any performance differences?


Kevin Burton

Sep 20, 2015, 2:41:13 PM
to jackson-user
I have two compiled parsers / generators using JsonParser and JsonGenerator.

They allow me to swap in a JsonFactory.  

The two I'm testing with are CBORFactory and JsonFactory (for JSON).
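
Roughly like this (a simplified sketch, not the actual code -- the real parsers map tokens into our own objects; only the factory differs between the two runs):

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
    import java.io.IOException;

    public class ParseBench {
        // Same streaming loop for both formats; only the factory differs.
        static long countTokens(JsonFactory factory, byte[] doc) throws IOException {
            try (JsonParser p = factory.createParser(doc)) {
                long tokens = 0;
                while (p.nextToken() != null) {
                    tokens++;
                }
                return tokens;
            }
        }

        public static void main(String[] args) throws IOException {
            byte[] json = "{\"name\":\"value\"}".getBytes("UTF-8");
            System.out.println(countTokens(new JsonFactory(), json));
            // for CBOR, the documents are CBOR-encoded and the factory is new CBORFactory()
        }
    }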

The problem is my benchmarks show that they're nearly identical in terms of performance when parsing a large number of documents.

It takes about 30 seconds to parse 1M JSON documents.

That might be acceptable, but I would have assumed CBOR would be 2-3x faster than JSON.

Any thoughts here?


Tatu Saloranta

Sep 20, 2015, 2:58:54 PM
to jackson-user
First of all, for most use cases you should see some performance improvement. The only case where the difference should be modest or non-existent is when all your data consists of String values, maps-of-maps style; in that case the amount of information as well as the resulting size are very close, and there is not much room for improvement.

Second, the assumption that CBOR should be 2x-3x faster is incorrect for most typical payloads; a 20%-50% improvement is more typical. The improvement is usually proportional to the reduction in size, which in turn depends on the kind of content: text values, for example, do not benefit much from binary encoding (they are almost identical, except for the lack of escaping/quoting), whereas numeric values benefit more.
The assumption of vast improvement is bolstered by lots of folklore on the web that claims extraordinary improvements without backing them up, or that compares against poorly implemented textual-format encoders/decoders. This is somewhat understandable, since it is easier to write a decently performing binary codec than a textual one: on some platforms the default JSON codec is inefficient, so a well-written binary codec could well be much faster.

As to CBOR specifically: since it includes the same information as JSON (and Smile) and is self-describing, it is not quite as efficient as schema-requiring formats like Protobuf or Avro. This is beneficial for usability, but it means that the size reduction is more modest, and so is the performance improvement.

The test setup I use for comparing Jackson codecs shows, just as an example, a 20% improvement for reading and 20-30% for writing, with a bit higher improvements when using Afterburner.
The performance improvement using the Smile format is slightly better for this case, although for larger documents Smile should do quite a bit better.

So what is happening in your case? Maybe sharing some of the test code would help. There are probably 3 common cases that could explain it:

1. You are not using the CBOR codec at all, but JSON in both cases. You can rule this out by checking the length of the encoded documents; the lengths should always differ (see the sketch below).
2. Your content consists mostly or entirely of String values (as per above) and is processed as untyped data (JsonNode or Map). If so, performance really might be very similar.
3. Usage itself is accidentally inefficient, like not reusing the ObjectMapper, in which case overhead unrelated to the data format makes up most of the time.
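
Just to illustrate #1 and #3, a minimal sketch (using a Map as a stand-in for your actual data): mappers are constructed once and reused, and the encoded lengths are compared to verify two different formats are really in use.

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.cbor.CBORFactory;
    import java.util.Collections;
    import java.util.Map;

    public class SanityCheck {
        // Case 3: construct mappers once and reuse them, not once per document.
        static final ObjectMapper JSON_MAPPER = new ObjectMapper();
        static final ObjectMapper CBOR_MAPPER = new ObjectMapper(new CBORFactory());

        public static void main(String[] args) throws Exception {
            Map<String, Object> doc = Collections.singletonMap("name", "value");
            byte[] json = JSON_MAPPER.writeValueAsBytes(doc);
            byte[] cbor = CBOR_MAPPER.writeValueAsBytes(doc);
            // Case 1: if both paths really use their own format, the lengths should differ.
            System.out.println("json=" + json.length + " bytes, cbor=" + cbor.length + " bytes");
        }
    }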

-+ Tatu +-



Kevin Burton

Sep 21, 2015, 12:00:50 AM
to jackson-user


On Sunday, September 20, 2015 at 11:58:54 AM UTC-7, Tatu Saloranta wrote:
First of all, for most use cases you should see some performance improvement. The only case where difference should be modest or non-existing would be if all your data consists of String values, Maps of Maps style; in this case amount of information as well as resulting size are very close and there is not much room for improvement.


It might be that we fall into this situation. Most of our fields are strings.
 

This assumption of vast improvement is bolstered by lots of folklore on the web, claiming extraordinary improvements, without backing up, or comparing poorly implemented textual format encoders/decoders: as it is easier to write a decently performing binary codec than textual format one, this is somewhat understandable: on some platforms default JSON codec is inefficient, and thereby well-written binary codecs could well be much faster.


Well I admit this was just an assumption of mine based on my experience that binary protocols can have dramatic improvements over textual protocols.  

In a prior life I helped invent RSS and my company pushed about 100TB a month in protocol data (Spinn3r).  In the past we pushed protocol buffers but have migrated to JSON/Jackson (which we really like btw).
 
As to CBOR specifically, since CBOR includes same information as JSON (and Smile), and is self-describing, it is not quite as efficient as schema-requiring formats like Protobuf or Avro. This is beneficial for usage, but means that size reduction is more modest, and similarly performance improvement.


Yes. I was thinking about going back to protobuf for storing our data. But since we serve JSON anyway, it might be more efficient to just store it as a UTF-8 blob and then serve the blob.
 
Test setup I use for comparing Jackson codecs:


shows, just as an example, 20% improvement for reading and 20-30% for writing; and bit higher improvements when using Afterburner.
Performance improvement using Smile format is slightly better for this case, although for larger documents Smile should perform quite a bit better.

So what is happening in your case? Maybe sharing some of test code would help. There are probably 3 common cases that could occur:

1. You are not using CBOR codec at all, but JSON in both cases. You can rule this out by checking length of encoded document; lengths should always differ

Ah. I ruled it out by running it under a profiler and it does show CBOR... 
 
2. Content you have consists of mostly or completely of String values (as per above), and is processed as untyped data (JsonNode or Map). If so, performance really might be very similar
3. Usage itself is accidentally inefficient, like not reusing ObjectMapper, in which case overhead not related to data format makes up most of time used.


Yes.  I profiled the code and didn't see anything like this.  No obvious hotspots other than usual CBOR or JSON variable handling.

Thanks for the feedback. I'll probably look more into the internals this week to see if I can squeeze any more performance or redesign the stack a bit more... 

 

Tatu Saloranta

Sep 21, 2015, 2:20:47 PM
to jackso...@googlegroups.com
On Sun, Sep 20, 2015 at 9:00 PM, Kevin Burton <burto...@gmail.com> wrote:


On Sunday, September 20, 2015 at 11:58:54 AM UTC-7, Tatu Saloranta wrote:
First of all, for most use cases you should see some performance improvement. The only case where difference should be modest or non-existing would be if all your data consists of String values, Maps of Maps style; in this case amount of information as well as resulting size are very close and there is not much room for improvement.


It might be that we fall into this situation. Most of our fields are strings.. 


Ok. Also make sure to use the latest version (2.6.2), if possible. There have been some improvements to the CBOR codec in 2.6; it wasn't quite as fully optimized as the Smile and JSON codecs are.
 

This assumption of vast improvement is bolstered by lots of folklore on the web, claiming extraordinary improvements, without backing up, or comparing poorly implemented textual format encoders/decoders: as it is easier to write a decently performing binary codec than textual format one, this is somewhat understandable: on some platforms default JSON codec is inefficient, and thereby well-written binary codecs could well be much faster.


Well I admit this was just an assumption of mine based on my experience that binary protocols can have dramatic improvements over textual protocols.  

Right, they can. I am just frustrated at unqualified comments made in many articles -- there are cases where the improvement is significant (for example, passing floating-point numeric values), and others where it is less so.
 

In a prior life I helped invent RSS and my company pushed about 100TB a month in protocol data (Spinn3r).  In the past we pushed protocol buffers but have migrated to JSON/Jackson (which we really like btw).

Thanks!
 
One other thing, just in case you do want to use protobuf in places: Jackson now has a protobuf backend as well, as of 2.6. I actually like protobuf in many ways as a binary protocol.
It definitely makes different trade-offs than JSON/CBOR/Smile, but I like its simplicity, and it makes a good format for some use cases.
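
Usage is roughly like this (a sketch from memory, with a made-up Point type and an inline schema; the exact details may differ):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.protobuf.ProtobufFactory;
    import com.fasterxml.jackson.dataformat.protobuf.schema.ProtobufSchema;
    import com.fasterxml.jackson.dataformat.protobuf.schema.ProtobufSchemaLoader;

    public class ProtobufExample {
        public static class Point {
            public int x, y;
            public Point() { }
            public Point(int x, int y) { this.x = x; this.y = y; }
        }

        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper(new ProtobufFactory());
            // protobuf requires a schema; here it is parsed from an inline .proto definition
            ProtobufSchema schema = ProtobufSchemaLoader.std.parse(
                    "message Point { required int32 x = 1; required int32 y = 2; }");
            byte[] bytes = mapper.writer(schema).writeValueAsBytes(new Point(1, 2));
            Point back = mapper.readerFor(Point.class).with(schema).readValue(bytes);
            System.out.println(bytes.length + " bytes; x=" + back.x + ", y=" + back.y);
        }
    }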

 
As to CBOR specifically, since CBOR includes same information as JSON (and Smile), and is self-describing, it is not quite as efficient as schema-requiring formats like Protobuf or Avro. This is beneficial for usage, but means that size reduction is more modest, and similarly performance improvement.


Yes. I was thinking about going back to protobuf for storing our data.  But since we serve JSON anyway it might be more efficient to just store it as a UTF8 blob and then serve the blob.  

Right, it depends a lot on the kind of data being stored, as well as the amount of change.
My main concern with schema-requiring formats is that the cost of changing definitions can be huge, as they tend to require external schema repositories, adding one more moving piece to the architecture. Conversely, if the format is unlikely to change (or changes are small and rare), they work quite well.

Anyway: if it is easy enough to test, I would suggest you look at the Smile codec. It can eliminate the overhead of field names (for cases where field names repeat, especially streams of similarly structured objects), and it is optimized for efficient writing as well as reading.
In many cases it can match the size reduction of protobuf (not always, but often).
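
Trying it out is basically a one-line change if you already swap factories; roughly (the shared-string-values feature is the opt-in one that can help string-heavy data):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.smile.SmileFactory;
    import com.fasterxml.jackson.dataformat.smile.SmileGenerator;

    SmileFactory f = new SmileFactory();
    // shared field names are checked by default; shared String values are opt-in
    f.enable(SmileGenerator.Feature.CHECK_SHARED_STRING_VALUES);
    ObjectMapper smileMapper = new ObjectMapper(f);
    // then use smileMapper exactly like the JSON or CBOR mapper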

 
 
Test setup I use for comparing Jackson codecs:


shows, just as an example, 20% improvement for reading and 20-30% for writing; and bit higher improvements when using Afterburner.
Performance improvement using Smile format is slightly better for this case, although for larger documents Smile should perform quite a bit better.

So what is happening in your case? Maybe sharing some of test code would help. There are probably 3 common cases that could occur:

1. You are not using CBOR codec at all, but JSON in both cases. You can rule this out by checking length of encoded document; lengths should always differ

Ah. I ruled it out by running it under a profiler and it does show CBOR... 

Ok good.
 
 
2. Content you have consists of mostly or completely of String values (as per above), and is processed as untyped data (JsonNode or Map). If so, performance really might be very similar
3. Usage itself is accidentally inefficient, like not reusing ObjectMapper, in which case overhead not related to data format makes up most of time used.


Yes.  I profiled the code and didn't see anything like this.  No obvious hotspots other than usual CBOR or JSON variable handling.

Good.
 

Thanks for the feedback. I'll probably look more into the internals this week to see if I can squeeze any more performance or redesign the stack a bit more... 

Sounds good.

-+ Tatu +-

ps. I try to be on the #jackson IRC channel on freenode, if you want to bounce ideas -- the mailing list is useful, but sometimes chat is better for a rapid exchange of ideas

Kevin Burton

Sep 22, 2015, 1:55:06 PM
to jackson-user


Ok. Also make sure to use latest (2.6.2) version, if possible. There have been some improvements to CBOR codec with 2.6. It wasn't quite as fully optimized as Smile and JSON codecs are.
 

Agreed... we actually just did the upgrade.  
 

Well I admit this was just an assumption of mine based on my experience that binary protocols can have dramatic improvements over textual protocols.  

Right, they can. I am just frustrated at unqualified comments made on many articles -- there are cases where improvement is significant (for example, passing floating-point numeric values), and others where it is less so.


Yup. I have the same problem. It's frustrating when people make overreaching statements about areas that are very complicated, like compression algorithms or data encoding. Maybe an issue of Dunning-Kruger :)
 
One other thing, just in case you do want to use protobuf in places: Jackson now has protobuf backend as well, with 2.6. I actually like protobuf in many ways, as a binary protocol.
It definitely makes different trade-offs than JSON/CBOR/Smile, but I like its simplicity, and it makes for a good format for some use cases.


Agreed.  I had thought of going this route.  However, I think what we're going to do is store the JSON in our Cassandra backend as a byte[] which is just UTF-8 encoded JSON.

This way, to re-assemble a document we just have to take 10 documents and splice them into a larger parent JSON structure.

This means there's very, very little CPU overhead when sending data to our customers.

The only downside is that to use the data on our end we have to parse JSON, which won't be as fast as, say, protobufs. So in the end it might be a wash.
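
The splicing would look roughly like this (a sketch, not our actual code; the "documents" field name is made up and the real structure has more wrapping):

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonGenerator;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class Assembler {
        static final JsonFactory FACTORY = new JsonFactory();

        // blobs are the UTF-8 encoded JSON documents pulled from Cassandra
        static byte[] assemble(List<byte[]> blobs) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (JsonGenerator gen = FACTORY.createGenerator(out)) {
                gen.writeStartObject();
                gen.writeArrayFieldStart("documents");
                for (byte[] blob : blobs) {
                    // splice the stored JSON in as-is, without re-parsing it
                    gen.writeRawValue(new String(blob, StandardCharsets.UTF_8));
                }
                gen.writeEndArray();
                gen.writeEndObject();
            }
            return out.toByteArray();
        }
    }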
 
ps. I try to be on #jackson IRC channel at freenode, if you want to bounce ideas -- mailing list is useful, but sometimes chat is good for rapid exchange of ideas


I appreciate it. I think we're good now. It just makes the most sense to store the JSON as UTF-8 bytes and then serve that directly. Should be super fast.
 