byte length for objects and arrays?

Jochen Kornitzky

Sep 12, 2013, 8:58:28 AM
to universal-...@googlegroups.com
Reading the UBJSON spec, I was very pleased to find an efficient, JSON-compatible binary container format. In particular, the leading length specifier (for big strings, aka blobs) makes it easy for parsers to skip over content.
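
A minimal sketch of that kind of skip, assuming the Draft-9-style string layout (an `S` marker, then an integer length that carries its own type marker, then the raw bytes, all big-endian). The helper name `skip_string` and the sample bytes are made up for illustration:

```python
import io
import struct

# struct formats for the big-endian integer types assumed for the length field.
LENGTH_FORMATS = {b"i": ">b", b"U": ">B", b"I": ">h", b"l": ">i", b"L": ">q"}

def skip_string(stream):
    """Advance a seekable binary stream past one UBJSON string value."""
    if stream.read(1) != b"S":
        raise ValueError("not a string value")
    fmt = LENGTH_FORMATS.get(stream.read(1))
    if fmt is None:
        raise ValueError("unsupported length type")
    (length,) = struct.unpack(fmt, stream.read(struct.calcsize(fmt)))
    if length < 0:
        raise ValueError("negative length")
    # Jump over the payload without reading it.
    stream.seek(length, io.SEEK_CUR)

# A 5-byte string followed by an int8 value.
stream = io.BytesIO(b"Si\x05helloi\x2a")
skip_string(stream)
position_after_skip = stream.tell()
```

The payload is never read or decoded; only the marker and the length are touched, which is what makes large blobs cheap to pass over.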

However, this feature is limited for objects and arrays. Not knowing the size (in bytes) but only the number of items in advance, I have to recurse into the (possibly nested) structure. On the other hand, if I actually want to iterate over object or array items, I could easily determine the end by looking at its overall length (in bytes).

Would someone mind explaining why UBJSON uses the number of items, rather than the byte length, as the length of objects and arrays?

Of course, as a workaround, I could embed every object/array in a string, but this would move the type knowledge of the embedded entity from UBJSON to the application.

Jochen

Alexander Shorin

Sep 12, 2013, 9:04:56 AM
to universal-...@googlegroups.com
Hi Jochen,

As of the latest Draft-9, neither arrays nor objects have any length.
They are designed for streaming processing, where you only need to
know when to stop.

http://ubjson.org/type-reference/container-types

Jochen Kornitzky

Sep 12, 2013, 9:22:00 AM
to universal-...@googlegroups.com
Hi kxepal,

On Thursday, September 12, 2013 3:04:56 PM UTC+2, kxepal wrote:
> Hi Jochen,
>
> As of the latest Draft-9, neither arrays nor objects have any length.
> They are designed for streaming processing, where you only need to
> know when to stop.
>
> http://ubjson.org/type-reference/container-types
 

Ah, I hadn't read that far. I had only looked at http://ubjson.org/
so far, where unknown-length and given-length containers are discussed.

The markers of the optimized formats are interesting.
I'd still be interested in a byte-size (read-)optimization;
it looks like this could be implemented with an additional
marker, like =, both for containers with identical types
and for containers with different types.

Jochen

Alexander Shorin

Sep 12, 2013, 9:42:10 AM
to universal-...@googlegroups.com
On Thu, Sep 12, 2013 at 5:22 PM, Jochen Kornitzky <korn...@gmail.com> wrote:
> The markers of the optimized formats are interesting.
> I'd still be interested in a byte-size (read-)optimization,
> it looks like this could be implemented with an additional
> marker, like =, both for containers with identical types
> or different types.

May I ask why? Let's assume containers carry byte-size information.
What will you do if you receive an array of size 1KB? 1MB? 1GB? 10GB?
In the last two cases you won't care about the size: you cannot
read the whole data into memory, and you cannot allocate 10GB of memory
at once, because memory is finite. Moreover, size information limits
you by preventing on-the-fly operations. Say you merge the items of two
10GB arrays into pairs and remove any items that don't fulfil some
condition. With size information you would have to apply all these
functions to your data before sending any result to the client(s),
which means time spent waiting and a lot of wasted resources.

In any case, I can't imagine a use case where byte/item length
information would be useful, but I could give you a lot of cases
where it is useless.

See the following issues for additional discussion on this topic, and
feel free to post your thoughts there:
https://github.com/thebuzzmedia/universal-binary-json/issues/16
https://github.com/thebuzzmedia/universal-binary-json/issues/27





Jochen Kornitzky

Sep 12, 2013, 9:57:47 AM
to universal-...@googlegroups.com


On Thursday, September 12, 2013 3:42:10 PM UTC+2, kxepal wrote:
> On Thu, Sep 12, 2013 at 5:22 PM, Jochen Kornitzky <korn...@gmail.com> wrote:
> > The markers of the optimized formats are interesting.
> > I'd still be interested in a byte-size (read-)optimization,
> > it looks like this could be implemented with an additional
> > marker, like =, both for containers with identical types
> > or different types.
>
> May I ask why? Let's assume containers carry byte-size information.
> What will you do if you receive an array of size 1KB? 1MB? 1GB? 10GB?

If it's on disk, I can lseek() over it without ever reading data I'm
not interested in. If it's coming in from some stream, it's either in
an input buffer, where I can skip over a known number of bytes, or I
can instruct the reader to discard the data early on.

> In the last two cases you won't care about the size: you cannot
> read the whole data into memory, and you cannot allocate 10GB of memory
> at once, because memory is finite. Moreover, size information limits
> you by preventing on-the-fly operations. Say you merge the items of two
> 10GB arrays into pairs and remove any items that don't fulfil some
> condition. With size information you would have to apply all these
> functions to your data before sending any result to the client(s),
> which means time spent waiting and a lot of wasted resources.

I'd only see byte size as an optional, specific optimization, like
the 'count' marker. If I do a stream operation as you describe,
I simply do not know the byte size in advance and
cannot add the marker, just as I wouldn't know
the number of items in advance.
 

> In any case, I can't imagine a use case where byte/item length
> information would be useful, but I could give you a lot of cases
> where it is useless.
>
> See the following issues for additional discussion on this topic, and
> feel free to post your thoughts there:
> https://github.com/thebuzzmedia/universal-binary-json/issues/16
> https://github.com/thebuzzmedia/universal-binary-json/issues/27


Thanks for the pointer.

Jochen 

Alexander Shorin

Sep 12, 2013, 10:23:40 AM
to universal-...@googlegroups.com
> If it's on disk, I can lseek() over it without ever reading data I'm
> not interested in. If it's coming in from some stream, it's either in
> an input buffer, where I can skip over a known number of bytes, or I
> can instruct the reader to discard the data early on.

I see. This is useful when the data is stored on disk or on some other
medium that can be read multiple times. However, merely travelling
quickly across the data with seek is meaningless. If your data is
structured, there will be IDs, keys or something else unique that
distinguishes one value from another. If there is nothing like that,
you'll probably want to create something. In any case, your
application will never operate on file offsets directly; you'll want
to use something meaningful. This leads you to build an index of your
data. This may be a separate file, also UBJSON, with a single object
container that holds a meaningful value as the key and a data offset
as the value. Such indexes help you not only to scan UBJSON data
faster, but also to jump to any point of interest.

Let's say you have the following UBJSON data:
[[i\x01i\x02i\x03][i\x04i\x05i\x06][i\x07i\x08i\x0b]]

After the first read, you build an index that points to each array:

{C1i\x01C2i\x09C3i\x11}

which, after decoding, gives you: {'1':1, '2':9, '3':17}
first array at offset 1
second array at offset 9
last array at offset 17
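
A rough sketch of how such an index could be built in one pass over the data above. It assumes Draft-9-style unsized arrays closed by `]` and handles only a handful of fixed-size value markers; the names `skip_value` and `index_top_level` are invented here and are not part of any UBJSON library:

```python
# Payload sizes in bytes for a subset of fixed-size UBJSON value markers.
FIXED = {b"Z": 0, b"T": 0, b"F": 0, b"i": 1, b"U": 1, b"C": 1,
         b"I": 2, b"l": 4, b"d": 4, b"L": 8, b"D": 8}

def skip_value(buf, pos):
    """Return the offset just past the value that starts at buf[pos]."""
    marker = buf[pos:pos + 1]
    if marker in FIXED:
        return pos + 1 + FIXED[marker]
    if marker == b"[":
        pos += 1
        # No length is stored, so every nested item has to be walked.
        while buf[pos:pos + 1] != b"]":
            pos = skip_value(buf, pos)
        return pos + 1
    raise ValueError("unsupported or missing marker at offset %d" % pos)

def index_top_level(buf):
    """Map '1', '2', ... to the byte offset of each item in the outer array."""
    if buf[0:1] != b"[":
        raise ValueError("expected an array at offset 0")
    index, pos, n = {}, 1, 1
    while buf[pos:pos + 1] != b"]":
        index[str(n)] = pos
        pos = skip_value(buf, pos)
        n += 1
    return index

data = b"[[i\x01i\x02i\x03][i\x04i\x05i\x06][i\x07i\x08i\x0b]]"
index = index_top_level(data)
print(index)  # {'1': 1, '2': 9, '3': 17}
```

Each inner array is 8 bytes long (two bracket bytes plus three 2-byte values), which is where the offsets 1, 9 and 17 come from.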

So you can jump to any array at any time by using a meaningful key,
without needing to read the whole file from the start, even if you
were able to skip large containers. The more structured and complex
your data is, the more you gain. However, indexes are out of scope
for the UBJSON format, so feel free to design your own to fit your
needs. (: Hope this helps.

