Index class

utvara

Apr 29, 2010, 7:10:49 AM

to pandra-dev

Hi,

I found a Python abstraction layer for Cassandra
(http://github.com/enki/tragedy) that tries to tackle some data
modelling problems, namely indexes.

The issue I'm addressing here is that if you store a reference
between a subject and an object, and want to use the column name for
sorting, you need a secondary column family for the reverse
reference, using the same column names in both column families.

The question, of course, is whether a class that wraps this problem
has a place in Pandra :)

Hope to hear your opinion on this.

utvara

Problem example:

I will use the same example Enki used in his example script: Twitter.
So, copy-pasting from the Python script:

dave = User(username='dave', firstname='dave', password='test').save()
merlin = User(username='merlin', firstname='merlin', password='sunshine').save()
peter = User(username='peter', firstname='Peter', password='secret').save()

dave.follow(merlin, peter)
peter.follow(merlin)
merlin.follow(dave)

Behind this there are two column families (Follow and FollowedBy):

Follow:
dave:{timeUUID1:merlin, timeUUID2:peter},
peter:{timeUUID3:merlin},
merlin:{timeUUID4:dave}

FollowedBy:
dave:{timeUUID4:merlin},
peter:{timeUUID2:dave},
merlin:{timeUUID1:dave,timeUUID3:peter}

NOTE that I'm using the same column names in both column families so
that someone can actually stop following: the column found in one
family can be deleted from the other under the same name.
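
A minimal sketch of why the shared column names matter, using plain
Python dicts in place of the two column families. No Cassandra client
is involved, and add_follow / remove_follow are hypothetical helper
names, not part of tragedy or Pandra:

```python
import uuid
from collections import defaultdict

# In-memory stand-ins for the two column families:
# row key -> {column name: value}
follow = defaultdict(dict)
followed_by = defaultdict(dict)


def add_follow(follower, followee):
    """Record the relation in both directions under one shared column name."""
    col = uuid.uuid1()  # time-based UUID, standing in for a TimeUUID column name
    follow[follower][col] = followee
    followed_by[followee][col] = follower
    return col


def remove_follow(follower, followee):
    """Find the column in Follow, then delete the same name from both rows."""
    for col, target in list(follow[follower].items()):
        if target == followee:
            del follow[follower][col]
            del followed_by[followee][col]


add_follow("dave", "merlin")
add_follow("dave", "peter")
add_follow("peter", "merlin")
add_follow("merlin", "dave")

# dave stops following merlin; both directions are cleaned up.
remove_follow("dave", "merlin")
```

Because the column name is identical on both sides, the reverse entry
can be removed without scanning the FollowedBy row for a matching value.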

mjpearson

Apr 29, 2010, 11:03:30 PM

to pandra-dev

Hi

On Apr 29, 9:10 pm, utvara <utv...@gmail.com> wrote:

> Question of course is if class that would wrap this problem has a
> place in Pandra :)

Absolutely (are you offering to write one? :P), but mostly waiting
until 0.7's live schema update approaches release before supporting
dynamic non-normalised indexes/migrations. Pandra doesn't really have
a relational model outside of a single row, but it's fairly trivial to
implement columns as keys into other column paths as a starting point
if you're looking for a tragedy analogue. Complex relations with
possibly huge datasets will need a much deeper level of thought if
pandra is to maintain its flexibility.

-michael

utvara

May 2, 2010, 8:29:48 AM

to pandra-dev

Hi

> Absolutely (are you offering to write one? :P), but mostly waiting

Yes I am :) I'm working on it, but it will take me some time; I still
need to get used to the whole idea.

> until 0.7's live schema update approaches release before supporting
> dynamic non-normalised indexes/migrations. Pandra doesn't really have

Hmm, from what I read on http://wiki.apache.org/cassandra/LiveSchemaUpdates
it will just add dynamic modification of the storage configuration
("just" here probably means wooooooooooooow), so that would be just an
extra feature in the whole model.

> relational model outside of a single row, but it's fairly trivial to
> implement columns as keys into other column paths as a starting point
> if you're looking for a tragedy analogue. Complex relations with
> possibly huge datasets will need a much deeper level of thought if
> pandra is to maintain its flexibility.

One thing I have a problem with is the way iteration is implemented
in PandraColumnContainer. If I understand it correctly, it loads the
whole row/SuperColumn into an array and deals with it in memory. While
this works when rows hold profile data
{username:"joe",address:"blabla",...}, once you hit indexes (the
lazyboy project refers to these as Views) it will start to rain cats
and dogs :(

A possible solution is to connect PandraColumnContainer directly to
Cassandra's row (or SuperColumn; I'll stick to row for now). The big
obstacle I see in this line of thinking is:

> How can I iterate over all the rows in a ColumnFamily?
> Simple but slow: Use get_range_slices, start with the empty string, and after each call use the last key read as the start key in the next iteration.

Lazyboy implementation of count:

def __len__(self):
    """Return the number of records in this view."""
    return self._get_cas().get_count(
        self.key.keyspace, self.key.key, self.key,
        self.consistency)

So they get the count directly from Cassandra. I'm sorry for using so
many Python examples, but it seems that more Cassandra wrappers are
currently being written in Python than in beloved PHP :(

As for the index structure, I have one more idea that is a bit of a
hack:

Follow:
dave:{timeUUID1:merlin, timeUUID2:peter},
peter:{timeUUID3:merlin},
merlin:{timeUUID4:dave},
ref_prefx_dave:{timeUUID4:merlin},
ref_prefx_peter:{timeUUID2:dave},
ref_prefx_merlin:{timeUUID1:dave,timeUUID3:peter}

This way all data resides in the same ColumnFamily, but the reverse
part is distinguished by a prefix. Again, this allows data to be
retrieved both by object key and by reference key.

utvara

utvara

May 2, 2010, 11:29:57 AM

to pandra-dev

Here is an example of how iteration is handled in lazyboy. They use a
Python generator, which only implements next(); there is no previous,
reset ... as we have in an iterator. From what I can read, data is
retrieved in chunks. I'm not sure what the appropriate equivalent in
PHP would be, especially keeping in mind the usual practice of PHP
developers.

def _cols(self, start_col=None, end_col=None):
    """Yield columns in the view."""
    client = self._get_cas()
    assert isinstance(client, Client), \
        "Incorrect client instance: %s" % client.__class__
    last_col = start_col or self.start_col or ""
    end_col = end_col or ""
    chunk_size = self.chunk_size
    passes = 0
    while True:
        # When you give Cassandra a start key, it's included in the
        # results. We want it in the first pass, but subsequent iterations
        # need the count adjusted and the first record dropped.
        fudge = 1 if self.exclusive else int(passes > 0)

        cols = client.get_slice(
            self.key.keyspace, self.key.key, self.key,
            SlicePredicate(slice_range=SliceRange(
                last_col, end_col, self.reversed, chunk_size + fudge)),
            self.consistency)

        if len(cols) == 0:
            raise StopIteration()

        for col in unpack(cols[fudge:]):
            yield col
            last_col = col.name

        passes += 1

        if len(cols) < self.chunk_size:
            raise StopIteration()
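
For anyone who wants to try the chunking idea without a Cassandra
connection, here is a self-contained sketch of the same technique. It
is not lazyboy's code: get_slice below is a hypothetical stand-in that
slices a plain dict, and iter_columns is a made-up name. It also ends
the generator with return, since on Python 3.7 and later raising
StopIteration inside a generator is turned into a RuntimeError
(PEP 479).

```python
from bisect import bisect_left


def get_slice(row, start, count):
    """Hypothetical stand-in for a slice call: return up to `count`
    (name, value) pairs with name >= start, in column-name order."""
    names = sorted(row)
    i = bisect_left(names, start)
    return [(n, row[n]) for n in names[i:i + count]]


def iter_columns(row, chunk_size=3):
    """Yield every column of `row`, fetching at most chunk_size + 1 per call."""
    last = ""
    first_pass = True
    while True:
        # The start column is inclusive, so after the first pass we ask
        # for one extra column and drop the repeated one.
        skip = 0 if first_pass else 1
        cols = get_slice(row, last, chunk_size + skip)
        fresh = cols[skip:]
        if not fresh:
            return  # in a generator, a plain return ends iteration
        for name, value in fresh:
            yield name, value
            last = name
        first_pass = False
        if len(fresh) < chunk_size:
            return  # short chunk: nothing left to fetch


row = {c: c.upper() for c in "abcdefg"}
names = [name for name, _ in iter_columns(row, chunk_size=3)]
```

The caller only ever sees one column at a time, so memory use is
bounded by the chunk size rather than by the width of the row.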

utvara

mjpearson

May 4, 2010, 10:25:17 PM

to pandra-dev

Nice! Just a few points I want to mention...

> If I get it right it will load the whole row/
> SupperColumn into array and deal with it in memory. While this will
> work if rows are profile data {username:"joe",address:"blabla",...}
> once you hit Indexes (lazyboy project refers to this as Views) it
> will start to rain cats and dogs :(

This was kind of the idea behind the 'autocreate' flag in the column
container: if it's turned on, all columns etc. are returned; if it's
turned off, only the columns defined in the row schema (via
addColumn() etc.) are loaded. Used alongside the start/finish/limit
predicate helpers, it's easy enough to paginate through data within a
container without extra overhead.

e.g.:

// an 'anonymous' column family
$cf = new PandraColumnFamily('key', 'keyspace', 'cf');
$cf->setAutoCreate(TRUE);
$start = '';
while ($cf->start($start)->limit(10)->load()) {
    foreach ($cf as $column) {
        // render the column for example
        echo $column->getName();
    }
    $start = $cf->current()->getName();
}

I'm not saying that it's a complete solution, just that writing an
index/object relational model which can handle these big datasets may
not be too difficult :)

> Again this allows for data to be retrieved both by object key and reference key.

This is definitely the major limitation of the library. I'm working on
a YAML schema parser to build the models, but it needs some notation
for handling relations. I'll flick it through once it's ready so it
can all be integrated.

Sorry for the slow replies btw, it has been one of those weeks!

thanks :)
-michael
