Inserting Blob Keys

Chris

May 7, 2016, 6:48:30 PM
to DataStax Python Driver for Apache Cassandra User Mailing List
What's the proper way to insert a key of type `blob` when using the ByteOrderedPartitioner?

My schema looks like this:

CREATE TABLE keys (
  key blob,
  PRIMARY KEY (key)
);

Using Python 2, I can insert a key just fine with the following:

session.execute(
  session.prepare('INSERT INTO keys (key) VALUES (?)'),
  [bytearray('key_a')])

However, the insert fails if the byte array contains non-ASCII data:

session.execute(
  session.prepare('INSERT INTO keys (key) VALUES (?)'),
  [bytearray('key_\x9a')])
...
  File "cassandra/metadata.py", line 1434, in cassandra.metadata.TokenMap.get_replicas (cassandra/metadata.c:30612)
    point = bisect_right(self.ring, token)
  File "cassandra/metadata.py", line 1467, in cassandra.metadata.Token.__lt__ (cassandra/metadata.c:31398)
    return self.value < other.value
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9a in position 4: ordinal not in range(128)

It seems the `bytearray` is converted to a `str` for the BytesToken, and Python 2 tries to decode `str` objects as ASCII when comparing them to unicode strings (the BytesToken values in self.ring).
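As a rough illustration of that implicit conversion (a sketch written for Python 3, not code taken from the driver): when Python 2 orders a `str` against a `unicode` object, it first decodes the `str` as ASCII. Performing the same decode by hand on the failing key raises the same kind of error as the traceback above.

```python
# Python 3 sketch: reproduce by hand the ASCII decode that Python 2
# performs implicitly when a byte string is compared with a unicode string.
key = bytearray(b'key_\x9a')

message = None
try:
    # Byte 0x9a is outside the 7-bit ASCII range, so this decode fails.
    bytes(key).decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```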

With Python 3, the `bytes` object is not converted to a `str` (which makes sense since it would need to know what codec to use), so I can't figure out how to insert even ASCII-only values.

session.execute(
  session.prepare('INSERT INTO keys (key) VALUES (?)'),
  [bytearray('key_a', encoding='utf-8')])
...
TypeError: Tokens for ByteOrderedPartitioner should be strings (got <class 'bytes'>)

session.execute(
  session.prepare('INSERT INTO keys (key) VALUES (?)'),
  ['key_a'])
...
TypeError: Received an argument of invalid type for column "key". Expected: <class 'cassandra.cqltypes.BytesType'>, Got: <class 'str'>; (string argument without an encoding)

Was it a mistake to use a `blob` as the primary key with the ByteOrderedPartitioner, or should I be passing in something other than a `bytearray` during the inserts?

Also, is there a reason BytesToken values are stored as strings? The values are validated against six.string_types, which allows both encoded bytes and decoded string data in Python 2. It seems like it would be easier to order those tokens if they were always raw bytes, but I'm probably missing something.

I'm using Cassandra 2.0.7 with version 3.3.0 of the cassandra-driver.

Thanks,
Chris

Adam Holmberg

May 11, 2016, 12:10:56 PM
to python-dr...@lists.datastax.com
Chris,

Thanks for bringing this up. Unicode has been a tough subject for this driver, especially when supporting both Python 2 and 3. It looks like you've found a lingering deficiency in a not-oft-used corner: Byte Ordered Partitioning. I've opened a ticket to get this resolved, but it may be a while before we can get to it.

In the meantime, I am obliged to ask if you really need the ByteOrderedPartitioner. It is not recommended for most applications. If you do need to use it, I may be able to provide some workarounds, but it's not ideal.

Apologies for the inconvenience.

Regards,
Adam Holmberg

--
You received this message because you are subscribed to the Google Groups "DataStax Python Driver for Apache Cassandra User Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to python-driver-u...@lists.datastax.com.

Chris

May 11, 2016, 2:17:13 PM
to DataStax Python Driver for Apache Cassandra User Mailing List
Thanks, Adam. I know there are many downsides to the ByteOrderedPartitioner, and I would love to move away from it. I'll explain a bit about our use case and why I think we still need it, but it's very possible that I'm mistaken.

Our project implements the Google Cloud Datastore API. Applications that use our platform can define their own data model, change it at any time, and perform range queries on any aspect of their data. To accommodate this, we have a generic "entities" table and separate index tables for single property and composite queries.

As far as I can tell, using range queries with the Murmur3Partitioner would require us to have some knowledge about what type of data we are storing so that we can choose appropriate partition and clustering keys. Unfortunately, I haven't been able to find a schema that can allow us to use the Murmur3Partitioner while still conforming to the flexible requirements of the model we are implementing.

Adam Holmberg

May 13, 2016, 4:56:30 PM
to python-dr...@lists.datastax.com
Chris,

I just had to ask on the BOP. I think you understand the problem.

> Also, is there a reason BytesToken values are stored as strings?

Not that I can tell. The tokens come back as strings from the database, and six.string_types was just there from what I *think* was a naive update to support Python 3. In any event, as you saw, these unicode tokens were causing an implicit decode('ascii')/encode('utf-8') in the comparison function. While looking into this I also noticed another issue with BytesToken: the bytes value was not parsed out of the hex string.
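For readers following along, the two issues described above can be sketched roughly as follows. This is not the driver's code or the contents of the branch: `RawBytesToken` and `from_hex_string` are hypothetical names, and the optional `0x` prefix on the token string is an assumption. The idea is only that a token holding raw bytes, parsed out of the hex string, orders cleanly under `bisect_right` without any implicit decoding.

```python
import binascii
from bisect import bisect_right
from functools import total_ordering


@total_ordering
class RawBytesToken(object):
    """Hypothetical token that always holds raw bytes (not the driver's class)."""

    def __init__(self, value):
        if not isinstance(value, bytes):
            raise TypeError("expected bytes, got %r" % type(value))
        self.value = value

    @classmethod
    def from_hex_string(cls, token_string):
        # Assumption: the token arrives as a hex string, possibly
        # prefixed with '0x'. Parse it into the bytes it represents.
        if token_string.startswith('0x'):
            token_string = token_string[2:]
        return cls(binascii.unhexlify(token_string))

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        # bytes-to-bytes comparison; no codec is involved.
        return self.value < other.value

    def __hash__(self):
        return hash(self.value)


# A made-up three-node ring; the hex strings decode to
# b'key_a', b'key_\x9a' and b'\x00'.
ring = sorted(RawBytesToken.from_hex_string(h)
              for h in ['6b65795f61', '6b65795f9a', '00'])

# Look up the position for a key containing a non-ASCII byte.
key_token = RawBytesToken(bytes(bytearray(b'key_\x9a')))
point = bisect_right(ring, key_token)
print(point)
```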

I have a branch resolving these issues if you're interested in trying it:

We're going to get this into the current release cycle. Please ping this thread or the PR if you have a chance to try it or if anything else comes up.

Regards,
Adam Holmberg

 
