compliant clients: binary keys

Alex Leverington

Aug 2, 2012, 2:17:04 AM
to couc...@googlegroups.com

Hi,

I'm writing a Mac OS client for Couchbase and hope to gracefully handle good and bad binary data (keys and values alike). Values are easy because they are either JSON or are treated as binary. Keys, as I understand it, are a little different because UTF-8 should be supported -- as views can be sorted by _id. This is different from some stores, which reject non-ASCII keys (a terrible waste of keyspace if you ask me).

It appears that libcouchbase passes through binary keys untouched. I like this and hope it doesn't change -- although I've found it renders views useless if bad binary keys are set.

Are binary keys internally encapsulated in any way (storage, wireline, otherwise)?

Do binary keys affect performance?

Is it safe to allow freeform binary keys?

In other systems a key could be the contents of a PNG file. I don't intend to support writing such content (as keys) -- but would like to be aware of what the data store COULD send the app (as a key).



Thanks,
Alex Leverington

Matt Ingenthron

Aug 7, 2012, 7:42:09 PM
to couc...@googlegroups.com
Hi Alex,

That's cool! Out of curiosity, what kind of app are you writing?

We've actually been discussing key format recently. I don't
think we want to support binary keys, actually.

The problem is some parts of Couchbase Server starting in 2.0 do treat the
keys as UTF-8 strings. This means we need to restrict things a bit more
so the 'right things' happen out of the box.

Currently, most people use the 7-bit ASCII set without the memcached
special characters (something that would be a problem with your binary
keys), and we're proposing that we restrict the key space to UTF-8,
excluding the bytes the memcached protocol does not allow.

At the libcouchbase level, we do not expect to enforce this, but we also
don't plan to test the boundary conditions. You can certainly do this
with thorough testing and at your own risk.
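
A minimal sketch of the kind of client-side check this proposal implies, for anyone who wants to test keys before sending them. The helper name `acceptable_key?` is hypothetical and not part of libcouchbase or the Ruby client. The rule set is an assumption based on the memcached text protocol (at most 250 bytes, no whitespace or control characters) plus the requirement that the bytes be valid UTF-8. The restriction Couchbase finally adopts may differ.

```ruby
# Hypothetical helper; not part of libcouchbase or the couchbase gem.
MAX_KEY_BYTES = 250 # key length limit from the memcached protocol

def acceptable_key?(key)
  # Re-tag a copy as UTF-8 so the bytes can be checked without modifying the caller's string.
  utf8 = key.dup.force_encoding(Encoding::UTF_8)
  return false unless utf8.valid_encoding?
  return false if utf8.empty? || utf8.bytesize > MAX_KEY_BYTES
  # Assumed forbidden set: ASCII control characters, space, and DEL.
  utf8.each_byte.none? { |b| b <= 0x20 || b == 0x7F }
end
```

With these assumptions, a key such as "user:42" or a well-formed UTF-8 emoji passes, while a key containing a space or arbitrary binary bytes is rejected before it reaches the server.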

Matt

Alex Leverington

Aug 8, 2012, 4:03:19 AM
to couc...@googlegroups.com

Matt,

Thanks for asking. I've written a Mac OS app (Pair) and am adding support for Couchbase (well, memcached/Membase first). I would like to know whether binary keys are supported (or not) to improve usability. When viewing binary values I can use a preference for decoding -- I'd like to note whether the same will be supported for keys. More importantly, I expect to support accessing keys or values whose content is in a foreign language.

I would propose allowing free-form binary and rejecting (as a document) any key/value pair where the key OR value can't be encoded as UTF-8 (as in, don't even create an 'invalid_json' hash). Then, as the document database evolves, additional encodings (UTF-16) could be added by means of a re-index. Until then, documentation and/or client-side code could warn developers so they know what to expect.

Will it be possible to use couchbase without document storage in the future?

Of course, the use case is emoji, which is often UTF-16. Emoji support is a rather gentle way of pushing for UTF-16, which is a more efficient encoding for several non-English languages.
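
To put rough numbers on the efficiency point, here is a small sketch that compares encoded sizes. The helper `byte_sizes` and the sample strings are illustrative assumptions, not anything from the Couchbase client.

```ruby
# Illustrative helper: returns [UTF-8 byte count, UTF-16 byte count] for a string.
def byte_sizes(text)
  [
    text.encode(Encoding::UTF_8).bytesize,
    text.encode(Encoding::UTF_16LE).bytesize # LE variant, so no byte-order mark is counted
  ]
end

samples = {
  "ASCII"    => "couchbase",
  "Japanese" => "\u65E5\u672C\u8A9E", # three CJK characters
  "Emoji"    => "\u{1F489}"           # one code point outside the Basic Multilingual Plane
}

samples.each do |label, text|
  utf8, utf16 = byte_sizes(text)
  puts "#{label}: UTF-8 #{utf8} bytes, UTF-16 #{utf16} bytes"
end
```

CJK characters in the Basic Multilingual Plane take three bytes each in UTF-8 and two in UTF-16, ASCII text doubles in size under UTF-16, and a code point outside the BMP takes four bytes in both encodings.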


All things said, and somewhat unrelated, UTF-8 keys break with multi-get. I haven't had the chance to trace this down yet.

1.9.2-p290 :207 > c.get("a")
=> 1
1.9.2-p290 :208 > c.set("\uD83D\uDC89",1)
=> 17295494660134797312
1.9.2-p290 :209 > c.set("b",1)
=> 13531672831342936064
1.9.2-p290 :210 > c.get("a","\uD83D\uDC89","b")
=> [1, nil, 1]

Here's an insightful post on utf8/16:
http://programmers.stackexchange.com/questions/82396/why-doesnt-everybody-switch-from-japanese-specific-encodings-to-utf-8



--
Alex Leverington

Sergey Avseyev

Aug 8, 2012, 9:46:07 AM
to couc...@googlegroups.com
Good catch, thanks. Actually, there is an issue when I'm building keys
from the response. All of them have #<Encoding:ASCII-8BIT> as the encoding.
This is why the client cannot look up the original key when extracting the
result array of the multi-get:

1.9.3p194 (main):001:0> keys = ["a", "\uD83D\uDC89"]
=> ["a", "\xED\xA0\xBD\xED\xB2\x89"]
1.9.3p194 (main):002:0> keys.map(&:encoding)
=> [#<Encoding:UTF-8>, #<Encoding:UTF-8>]
1.9.3p194 (main):003:0> Couchbase.bucket.get(keys)
=> [1, nil]
1.9.3p194 (main):004:0> Couchbase.bucket.get(keys, :extended => true)
=> {"a"=>[1, 0, 15824216877734363136],
"\xED\xA0\xBD\xED\xB2\x89"=>[1, 0, 3983500633983090688]}
1.9.3p194 (main):005:0> _.keys.map(&:encoding)
=> [#<Encoding:ASCII-8BIT>, #<Encoding:ASCII-8BIT>]

So I need to either remember the encoding for each key or force UTF-8
everywhere (or use Encoding.default_external). What do you think? I like
the second option.
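
The lookup failure can be reproduced in plain Ruby without a server. This is a sketch under assumptions: the `results` hash merely stands in for what a multi-get returns, and the variable names are illustrative. It shows why a key tagged ASCII-8BIT does not match the same bytes tagged UTF-8, and how re-tagging returned keys (the second option) restores the lookup.

```ruby
# Illustrative only: the hash below stands in for a multi-get result.
requested = "\u{1F489}"                                     # UTF-8 key, as passed by the caller
returned  = requested.dup.force_encoding(Encoding::BINARY)  # same bytes, tagged ASCII-8BIT

results = { returned => 1 }
missing = results[requested] # nil: identical bytes, but non-ASCII strings with different encodings never match

# Second option: tag every returned key with a single encoding (UTF-8 here;
# Encoding.default_external would be the configurable variant).
retagged = {}
results.each { |k, v| retagged[k.dup.force_encoding(Encoding::UTF_8)] = v }
found = retagged[requested]  # the original key now matches
```

Pure ASCII keys such as "a" are unaffected, because Ruby treats ASCII-only strings as equal across compatible encodings. That is consistent with only the emoji key coming back as nil.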

Thanks again

--
Sergey Avseyev

Alex Leverington

Aug 9, 2012, 3:18:53 AM
to couc...@googlegroups.com

Sergey,

At this time the Couchbase Ruby driver doesn't take advantage of Encoding.default_external. A preferable solution would be to update the Couchbase Ruby client to respect encoding preferences. This will consist of changing rb_str_new calls to rb_external_str_new.

I have tested this and it fixes the problem for me. However, my patch only applies to synchronous get requests, and all of the other methods will need to be updated. Here are my notes; the second commit has a macro which can be used to replace rb_str_new:

https://github.com/nessence/couchbase-ruby-client/commit/20b7510bea65292e62819890cc86446dfc8a579a
https://github.com/nessence/couchbase-ruby-client/commit/e62e6dd5f412bec6044a91c6a311a9692e54b4aa#ext/couchbase_ext/couchbase_ext.c



--
Alex Leverington

Sergey Avseyev

Aug 9, 2012, 9:04:46 AM
to couc...@googlegroups.com

Thanks, I've updated your patch and published it on our code review site.

http://review.couchbase.org/19405

You can easily log in there using Google OpenID and review/verify the patch.

-- Sergey Avseyev

Sergey Avseyev

Aug 9, 2012, 9:08:29 AM
to couc...@googlegroups.com
> http://review.couchbase.org/19405

Note that the patch is for the 'release11' branch. I will merge it back to master
after the next stable release is tagged.

Alex Leverington

Aug 10, 2012, 1:35:56 AM
to couc...@googlegroups.com

Looks great, thanks!



--
Alex Leverington