get_range_by_token, resuming after failure


William Oberman

Apr 7, 2014, 1:25:17 PM
to phpc...@googlegroups.com
I'm trying to use get_range_by_token to do a mass delete against one of my CFs.  But after a while, I hit a glitch (usually a timeout) that causes the process to die.

I could attack the settings on timeouts/retries, but a more fundamental solution would be to note the key where I left off and start from there.  But I'm not sure if I can mix & match keys/tokens.  

Right now my code is basically:
// ...setup stuff, including $localCassandra connection string...
$systemManager = new SystemManager($localCassandra);
$ring = $systemManager->describe_ring($keyspace);
// ...find $startToken/$endToken by finding myself in $ring...
$pool = new ConnectionPool($keyspace, array($localCassandra));
$column_family = new ColumnFamily($pool, $cf);
foreach ($column_family->get_range_by_token($startToken, $endToken, null) as $key => $columns) {
    // ...do stuff...
}

Digging into the code, it seems like it boils down to the KeyRange argument to get_range_slices.   get_range sets KeyRange->start_key and KeyRange->end_key.  get_range_by_token sets either KeyRange->start_token or KeyRange->start_key (depending on first call vs Nth) and KeyRange->end_token.  

I don't see a way to do "start with key, end with token", which is what I'd need for a proper "resume", right?
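
For anyone landing on this thread later, here is a rough, library-free sketch of the "note the key where I left off" idea, written in Python for brevity. `fetch_rows`, `Timeout`, and `scan_with_resume` are hypothetical names, not phpcassa or pycassa APIs; the sketch assumes the client can issue a "start key, end token" range and that start keys are inclusive (so the checkpoint row is returned again and has to be skipped).

```python
class Timeout(Exception):
    """Stand-in for a client-side timeout; not a real phpcassa/pycassa class."""


def scan_with_resume(fetch_rows, start_token, end_token, handle, max_retries=3):
    """Walk a token range, restarting from the last handled key after a failure.

    fetch_rows(start_token=..., start_key=..., end_token=...) is a hypothetical
    callable yielding (key, columns) pairs; exactly one of start_token and
    start_key is non-None on each call.
    """
    last_key = None
    retries = 0
    while True:
        try:
            if last_key is None:
                # Nothing handled yet: begin at the start of the token range.
                rows = fetch_rows(start_token=start_token, start_key=None,
                                  end_token=end_token)
            else:
                # Resume: "start with key, end with token".
                rows = fetch_rows(start_token=None, start_key=last_key,
                                  end_token=end_token)
            for key, columns in rows:
                if key == last_key:
                    continue  # checkpoint row was already handled
                handle(key, columns)
                last_key = key
            return last_key
        except Timeout:
            retries += 1
            if retries > max_retries:
                raise
```

The checkpoint here only lives in memory; surviving a process crash would mean persisting `last_key` somewhere after each row or batch.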

will


Tyler Hobbs

Apr 9, 2014, 1:10:19 PM
to phpc...@googlegroups.com

On Mon, Apr 7, 2014 at 12:25 PM, William Oberman <obe...@civicscience.com> wrote:
> I don't see a way to do "start with key, end with token", which is what I'd need for a proper "resume", right?

You're right, that's what's needed to make a proper resume easy. Pycassa has supported exactly this case for a while now; I just need to add support for it to phpcassa.  Perhaps a "key_start" argument for get_range_by_token would be easiest.

As a workaround, you could subclass RangeTokenColumnFamilyIterator and change this section of the code:

        if ($this->buffer_number == 0){
            // First time use start token
            $key_range->start_token = $this->token_start;
        } else {
            // Next times use start key
            $key_range->start_key = $this->column_family->pack_key($this->next_start_key, $handle_serialize);
        }
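
To make the suggested change concrete, here is a small Python model of that branch with a resume key added. It is only a sketch of the decision logic, not phpcassa code: `choose_range_start` and `resume_key` are hypothetical names, and a real subclass would still need to pack the key (as `pack_key` does above) before putting it on the KeyRange.

```python
def choose_range_start(buffer_number, token_start, next_start_key, resume_key=None):
    """Return the (KeyRange field, value) pair to set for this buffer.

    resume_key is a hypothetical extra input: the key saved by a previous,
    interrupted scan.
    """
    if buffer_number == 0:
        if resume_key is not None:
            # Resuming: begin at the saved key instead of the ring token.
            return ("start_key", resume_key)
        # Fresh scan: the first buffer starts at the token.
        return ("start_token", token_start)
    # Later buffers continue from the last key of the previous buffer.
    return ("start_key", next_start_key)
```

One thing to watch in a real subclass: a key-based start is inclusive, so the row for the resume key will most likely come back again at the top of the first buffer.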



--
Tyler Hobbs
DataStax

William Oberman

Apr 9, 2014, 2:02:22 PM
to phpc...@googlegroups.com
After thinking about my use case (which is deleting 90%+ of the data), "on resume" the data set is smaller so it's not terrible (i.e. I didn't have to solve this problem).  But, for other people, or my "future self" that wants to do some other kind of non-delete mutation, I'll try to think this through :-)

Before I saw your email, my thinking was to add (yet another) method/iterator pair: "get_range_by_key_token" and "RangeKeyTokenColumnFamilyIterator".  I didn't *like* it though, as it expands the method/class space too much.  Adding another argument solves that issue.  But, if you add key_start, what if the API consumer passes in token_start AND key_start?  Which do you use?  I guess the new argument could be a boolean, "startIsToken=true".  Of course, at that point you might as well add "endIsToken=true".  Then you could collapse get_range/get_range_by_token (and the two iterators) into one method/class with booleans that drive the initial key/token setup...

Of course, at that point you could remove 5 (!) method arguments by having the consumer pass in the KeyRange object instead...

A lot of options!  I'll leave the API designing to the API author.

will




Tyler Hobbs

Apr 9, 2014, 2:18:36 PM
to phpc...@googlegroups.com
On Wed, Apr 9, 2014 at 1:02 PM, William Oberman <obe...@civicscience.com> wrote:
> After thinking about my use case (which is deleting 90%+ of the data), "on resume" the data set is smaller so it's not terrible (i.e. I didn't have to solve this problem).  But, for other people, or my "future self" that wants to do some other kind of non-delete mutation, I'll try to think this through :-)

That's true, since you're doing deletions you would technically be safe starting from the original token.  But in practice, Cassandra would have to skip over a lot of tombstones, which would make performance kinda suck.  It's definitely worth adding.
 

> Before I saw your email, my thinking was to add (yet another) method/iterator pair: "get_range_by_key_token" and "RangeKeyTokenColumnFamilyIterator".  I didn't *like* it though, as it expands the method/class space too much.  Adding another argument solves that issue.  But, if you add key_start, what if the API consumer passes in token_start AND key_start?  Which do you use?

I would raise an Exception :)
 
> I guess the new argument could be a boolean, "startIsToken=true".  Of course, at that point you might as well add "endIsToken=true".  Then you could collapse get_range/get_range_by_token (and the two iterators) into one method/class with booleans that drive the initial key/token setup...

In pycassa, I chose to have four separate arguments: start, finish, start_token, and finish_token.  It validates that you only use a proper combination of those and raises otherwise.  That works pretty well due to how named/optional parameters work in Python.  However, PHP's named parameter support is... shall we say, half-assed, because you still have to list all params positionally.  For backwards compatibility, I would have to add new options to the end (after count, slice, column names, consistency level, and buffer size).  That's pretty ugly, so perhaps I would just make a new get_range_from_key_to_token() method that takes a start key and an end token.
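
As an illustration of the "validate the combination and raise otherwise" approach, here is a minimal Python sketch. It is not pycassa's actual code: `build_key_range` is a hypothetical helper, and the set of combinations it accepts (key to key, token to token, key to token, but not token to key) is an assumption about what Cassandra's KeyRange allows.

```python
def build_key_range(start=None, finish=None, start_token=None, finish_token=None):
    """Return the KeyRange fields to set, or raise on an unsupported combination.

    Hypothetical helper; the accepted combinations are an assumption.
    """
    # Exactly one way of specifying the start of the range.
    if (start is None) == (start_token is None):
        raise ValueError("give exactly one of start / start_token")
    # Exactly one way of specifying the end of the range.
    if (finish is None) == (finish_token is None):
        raise ValueError("give exactly one of finish / finish_token")
    # Assumed unsupported: token start combined with key end.
    if start_token is not None and finish is not None:
        raise ValueError("start_token with a finish key is not supported")
    fields = {
        "start_key": start,
        "end_key": finish,
        "start_token": start_token,
        "end_token": finish_token,
    }
    # Keep only the fields the caller actually supplied.
    return {name: value for name, value in fields.items() if value is not None}
```

The resume case from this thread would then be `build_key_range(start=last_key, finish_token=end_token)`, while passing both `start` and `start_token` raises instead of silently picking one.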
 
 

> Of course, at that point you could remove 5 (!) method arguments by having the consumer pass in the KeyRange object instead...

True, it could take something like a KeyRange argument with native types for the start and finish and handle serializing those itself.  That's not a bad idea.  It would be similar to the ColumnSlice class that phpcassa already has.


--
Tyler Hobbs
DataStax