Homebrew CF-indexing vs secondary indexing

Ron Siemens

Feb 24, 2011, 6:27:33 PM
to us...@cassandra.apache.org, hector...@googlegroups.com

I am experimenting with indexing. My data CF has about 25,000 rows of around 1 KB each. I set up a dedicated boolean-valued column to use as the secondary index. I also created my own index in a separate CF, where each index is one row and the column names are the data keys.

The implementation uses Hector 0.7.0-27, and the JVM run options are -Xms64m -Xmx256m.

Below are two sample runs. The first uses the secondary index with IndexedSlicesQuery. The second uses my homebrew CF index: createSliceQuery for the index, followed by createMultigetSliceQuery for the data. The timing output is from result.getExecutionTimeMicro(), but the values look like ms. I'm not sure its purpose is what I'm assuming here. By the way, THS is just the name of the index, which covers a subset of 7293 of the roughly 25,000 rows.
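
For readers who want to picture the homebrew layout, here is a minimal in-memory sketch of the two-step read described above: slice one index row to get the data keys, then multiget the data rows. It is only a model under stated assumptions. The class and method names (CfIndexSketch, sliceIndexRow, multiget) are hypothetical, the maps merely stand in for column families, and none of this is Hector or Cassandra API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// In-memory model only: these maps stand in for column families and are
// NOT Hector or Cassandra API. All names here are hypothetical.
class CfIndexSketch {
    // Index CF: row key = index name, column names = data row keys (kept sorted).
    private final Map<String, TreeSet<String>> indexCf = new HashMap<>();
    // Data CF: row key -> (column name -> value).
    private final Map<String, Map<String, String>> dataCf = new HashMap<>();

    // Write a data row and record its key as a column in the index row.
    void put(String indexName, String dataKey, Map<String, String> columns) {
        dataCf.put(dataKey, columns);
        indexCf.computeIfAbsent(indexName, k -> new TreeSet<>()).add(dataKey);
    }

    // Step 1: a single slice over one index row returns the data keys.
    List<String> sliceIndexRow(String indexName) {
        return new ArrayList<>(indexCf.getOrDefault(indexName, new TreeSet<>()));
    }

    // Step 2: fetch the data rows for those keys in one multiget.
    Map<String, Map<String, String>> multiget(List<String> keys) {
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        for (String key : keys) {
            Map<String, String> row = dataCf.get(key);
            if (row != null) {
                rows.put(key, row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        CfIndexSketch store = new CfIndexSketch();
        store.put("THS", "row-2", Map.of("title", "b"));
        store.put("THS", "row-1", Map.of("title", "a"));
        store.put("TRS", "row-3", Map.of("title", "c"));

        List<String> keys = store.sliceIndexRow("THS");
        Map<String, Map<String, String>> rows = store.multiget(keys);
        System.out.println(keys + " -> " + rows.size() + " rows");
    }
}
```

The point of the layout is that step 1 touches exactly one row, however many data keys it lists.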

Anyway, it looks like the custom index does significantly better. Is this expected? Why? I expected them to be about the same, having read that the secondary index also uses a column family internally. More disconcerting, the secondary index implementation runs out of heap space, while the custom one runs along with only a few notable slowdowns. Both implementations use the same column-processing/deserialization code, so that doesn't seem to be to blame. What gives?

Ron


Sample run: Secondary index.

DEBUG Retrieved THS / 7293 rows, in 2012 ms
DEBUG Retrieved THS / 7293 rows, in 1956 ms
DEBUG Retrieved THS / 7293 rows, in 1843 ms
DEBUG Retrieved THS / 7293 rows, in 2295 ms
DEBUG Retrieved THS / 7293 rows, in 1828 ms
DEBUG Retrieved THS / 7293 rows, in 1740 ms
DEBUG Retrieved THS / 7293 rows, in 1899 ms
DEBUG Retrieved THS / 7293 rows, in 2266 ms
DEBUG Retrieved THS / 7293 rows, in 2310 ms
DEBUG Retrieved THS / 7293 rows, in 2395 ms
DEBUG Retrieved THS / 7293 rows, in 2829 ms
DEBUG Retrieved THS / 7293 rows, in 2725 ms
DEBUG Retrieved THS / 7293 rows, in 3752
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.nio.CharBuffer.wrap(CharBuffer.java:350)
at java.nio.CharBuffer.wrap(CharBuffer.java:373)
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
at java.lang.StringCoding.decode(StringCoding.java:173)
at java.lang.String.<init>(String.java:443)
at me.prettyprint.cassandra.serializers.StringSerializer.fromByteBuffer(StringSerializer.java:40)
at me.prettyprint.cassandra.serializers.StringSerializer.fromByteBuffer(StringSerializer.java:13)
at me.prettyprint.cassandra.serializers.AbstractSerializer.fromBytes(AbstractSerializer.java:38)
at me.prettyprint.cassandra.model.HColumnImpl.<init>(HColumnImpl.java:48)
at me.prettyprint.cassandra.model.ColumnSliceImpl.<init>(ColumnSliceImpl.java:27)
at me.prettyprint.cassandra.model.RowImpl.<init>(RowImpl.java:32)
at me.prettyprint.cassandra.model.RowsImpl.<init>(RowsImpl.java:33)
at me.prettyprint.cassandra.model.OrderedRowsImpl.<init>(OrderedRowsImpl.java:30)
at me.prettyprint.cassandra.model.IndexedSlicesQuery$1.doInKeyspace(IndexedSlicesQuery.java:143)
at me.prettyprint.cassandra.model.IndexedSlicesQuery$1.doInKeyspace(IndexedSlicesQuery.java:131)
at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
at me.prettyprint.cassandra.model.IndexedSlicesQuery.execute(IndexedSlicesQuery.java:130)

Sample run: Homebrew CF-indexing

DEBUG CFIndex THS / 7293 read in 262 ms
DEBUG Retrieved THS / 7293 rows, in 1579 ms
DEBUG CFIndex THS / 7293 read in 44 ms
DEBUG Retrieved THS / 7293 rows, in 1771 ms
DEBUG CFIndex THS / 7293 read in 38 ms
DEBUG Retrieved THS / 7293 rows, in 1275 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1364 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1590 ms
DEBUG CFIndex THS / 7293 read in 22 ms
DEBUG Retrieved THS / 7293 rows, in 1118 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1280 ms
DEBUG CFIndex THS / 7293 read in 21 ms
DEBUG Retrieved THS / 7293 rows, in 1466 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1589 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1772 ms
DEBUG CFIndex THS / 7293 read in 20 ms
DEBUG Retrieved THS / 7293 rows, in 1660 ms
DEBUG CFIndex THS / 7293 read in 20 ms
DEBUG Retrieved THS / 7293 rows, in 1931 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1626 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1750 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1557 ms
DEBUG CFIndex THS / 7293 read in 19 ms
DEBUG Retrieved THS / 7293 rows, in 9409 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1709 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1750 ms
DEBUG CFIndex THS / 7293 read in 45 ms
DEBUG Retrieved THS / 7293 rows, in 1629 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1596 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1879 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1597 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1662 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9362 ms
DEBUG CFIndex THS / 7293 read in 26 ms
DEBUG Retrieved THS / 7293 rows, in 1900 ms
DEBUG CFIndex THS / 7293 read in 22 ms
DEBUG Retrieved THS / 7293 rows, in 1972 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1631 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1579 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1606 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1582 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1784 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9522 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1628 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1551 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1627 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1539 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1563 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1623 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1804 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9010 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1444 ms
DEBUG CFIndex THS / 7293 read in 41 ms
DEBUG Retrieved THS / 7293 rows, in 1528 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1451 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1558 ms
DEBUG CFIndex THS / 7293 read in 16 ms
DEBUG Retrieved THS / 7293 rows, in 1585 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1659 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1708 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9195 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1590 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1572 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1582 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1568 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1689 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1810 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1556 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 8922 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1549 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1782 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1824 ms
DEBUG CFIndex THS / 7293 read in 16 ms
DEBUG Retrieved THS / 7293 rows, in 1579 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1531 ms
DEBUG CFIndex THS / 7293 read in 22 ms
DEBUG Retrieved THS / 7293 rows, in 1576 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1533 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9533 ms
DEBUG CFIndex THS / 7293 read in 48 ms
DEBUG Retrieved THS / 7293 rows, in 1544 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1467 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1557 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1714 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1888 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1588 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1612 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9529 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1653 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1813 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1650 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1572 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1646 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1566 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1727 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9480 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1577 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1529 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1566 ms
DEBUG CFIndex THS / 7293 read in 18 ms
DEBUG Retrieved THS / 7293 rows, in 1555 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1570 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1550 ms
DEBUG CFIndex THS / 7293 read in 19 ms
DEBUG Retrieved THS / 7293 rows, in 1455 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 10318 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1566 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1576 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1572 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1654 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1578 ms
DEBUG CFIndex THS / 7293 read in 19 ms
DEBUG Retrieved THS / 7293 rows, in 1571 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1710 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 9903 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1571 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1596 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1556 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1607 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1655 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1882 ms
DEBUG CFIndex THS / 7293 read in 19 ms
DEBUG Retrieved THS / 7293 rows, in 1535 ms
DEBUG CFIndex THS / 7293 read in 19 ms
DEBUG Retrieved THS / 7293 rows, in 8502 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1538 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1578 ms
DEBUG CFIndex THS / 7293 read in 24 ms
DEBUG Retrieved THS / 7293 rows, in 1540 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1561 ms
DEBUG CFIndex THS / 7293 read in 56 ms
DEBUG Retrieved THS / 7293 rows, in 1745 ms
DEBUG CFIndex THS / 7293 read in 17 ms
DEBUG Retrieved THS / 7293 rows, in 1454 ms

Ron Siemens

Feb 24, 2011, 6:39:31 PM
to us...@cassandra.apache.org, hector...@googlegroups.com

I failed to mention: this is just doing repeated data retrievals using the index.

> ...


>
> Sample run: Secondary index.
>
> DEBUG Retrieved THS / 7293 rows, in 2012 ms
> DEBUG Retrieved THS / 7293 rows, in 1956 ms
> DEBUG Retrieved THS / 7293 rows, in 1843 ms

...

Ron Siemens

Feb 25, 2011, 1:23:24 PM
to Ron Siemens, us...@cassandra.apache.org, hector...@googlegroups.com

I updated the Cassandra version in the Hector package from 0.7.0 to 0.7.2. The occasional slowdown in the CF index went away. I then raised the heap to 512 MB, and the secondary index now works. That seems awfully memory-hungry for my small dataset. Even the CF index was faster with more heap. These are the times with Cassandra 0.7.2 and a 512 MB heap. The test is slightly different: I'm varying the index used, which gives result sets of different sizes. It still surprises me that the CF index does substantially better.

Secondary Index

DEBUG Retrieved THS / 7293 rows, in 1051 ms
DEBUG Retrieved TRS / 7289 rows, in 1448 ms
DEBUG Retrieved BCS / 7788 rows, in 1553 ms
DEBUG Retrieved ARS / 7426 rows, in 1479 ms
DEBUG Retrieved CHS / 7290 rows, in 1575 ms
DEBUG Retrieved MS / 4523 rows, in 766 ms
DEBUG Retrieved PRS / 562 rows, in 40 ms
DEBUG Retrieved GGF / 1162 rows, in 122 ms
DEBUG Retrieved VET / 7313 rows, in 1193 ms
DEBUG Retrieved AUT / 7287 rows, in 1746 ms
DEBUG Retrieved LIT / 7291 rows, in 1331 ms

CF Index

DEBUG Retrieved THS / 7293 rows, in 17 + 759 ms
DEBUG Retrieved TRS / 7289 rows, in 19 + 734 ms
DEBUG Retrieved BCS / 7788 rows, in 23 + 736 ms
DEBUG Retrieved ARS / 7426 rows, in 23 + 1448 ms
DEBUG Retrieved CHS / 7290 rows, in 18 + 638 ms
DEBUG Retrieved MS / 4523 rows, in 32 + 622 ms
DEBUG Retrieved PRS / 562 rows, in 2 + 50 ms
DEBUG Retrieved GGF / 1162 rows, in 3 + 79 ms
DEBUG Retrieved VET / 7313 rows, in 17 + 686 ms
DEBUG Retrieved AUT / 7287 rows, in 17 + 758 ms
DEBUG Retrieved LIT / 7291 rows, in 17 + 745 ms
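
To put a number on "substantially better", the two listings above can be summed. This is a rough sketch, assuming the CF index figures are "index read + data retrieval" in ms, as the earlier log format suggests. The class name TimingSummary is hypothetical, and the arrays are simply the posted values copied in order. With one run per index, the result is indicative only.

```java
import java.util.Arrays;

// Sums the timings posted above; one run per index, so treat as indicative only.
class TimingSummary {
    // Secondary index: time per query in ms (THS, TRS, BCS, ARS, CHS, MS, PRS, GGF, VET, AUT, LIT).
    static final int[] SECONDARY_MS = {1051, 1448, 1553, 1479, 1575, 766, 40, 122, 1193, 1746, 1331};
    // CF index: {index row read, data retrieval} in ms, same order (assumed meaning of "a + b").
    static final int[][] CF_MS = {
        {17, 759}, {19, 734}, {23, 736}, {23, 1448}, {18, 638}, {32, 622},
        {2, 50}, {3, 79}, {17, 686}, {17, 758}, {17, 745}
    };

    static int totalSecondary() {
        return Arrays.stream(SECONDARY_MS).sum();
    }

    // Total CF time: index read plus data retrieval for every query.
    static int totalCf() {
        return Arrays.stream(CF_MS).mapToInt(pair -> pair[0] + pair[1]).sum();
    }

    // Time spent reading index rows only.
    static int totalCfIndexRead() {
        return Arrays.stream(CF_MS).mapToInt(pair -> pair[0]).sum();
    }

    public static void main(String[] args) {
        int secondary = totalSecondary();
        int cf = totalCf();
        System.out.printf("secondary=%d ms, cf=%d ms (index reads %d ms), ratio=%.2f%n",
                secondary, cf, totalCfIndexRead(), (double) secondary / cf);
    }
}
```

Over these eleven queries the secondary index totals 12304 ms against 7443 ms for the CF index, a ratio of roughly 1.65, and the index-row reads account for only 188 ms of the CF total. The small PRS index is the one case where the secondary index was faster (40 ms vs 52 ms).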

Ed Anuff

Feb 25, 2011, 2:14:06 PM
to hector...@googlegroups.com
It's nice to see some testing in this regard. However, it's worth pointing out something that gets lost in CF index vs secondary index discussions: what you're really proving is that get_slice (across columns) is faster than get_indexed_slices (across keys). Up to a certain size (and it would be nice if there were some empirical testing to determine what that size is), get_slice should be one of the most performant operations Cassandra can do. CF index approaches are basically all about getting your data into a situation where you can use get_slice to quickly perform the search.

The reasons for using Cassandra's built-in secondary index support, IMHO, are that (1) it's easy to use, whereas CF indexes are managed by the client, and (2) there's concern about how large an index you'd be able to effectively store in a CF index row. The first point is more about Cassandra being easier for newcomers; the latter point is something I'd like to see some more data around. Maybe you want to run your tests up to much larger sizes and see if there's a point where the results change?

FWIW, I recently switched back to CF-based indexes from secondary indexes, largely for the flexibility in the types of queries that became possible, but it's nice to see there's some performance benefit. The other thing that would be good to look at is timing the overhead of what it takes to update your index as you change the values that are being indexed.
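
As a rough illustration of the update overhead mentioned above, here is a minimal in-memory sketch of what a client-managed index has to do when an indexed value changes: drop the key from the old index row and add it to the new one. The names (IndexMaintenanceSketch, setIndexedValue) are hypothetical and this is not Cassandra or Hector API. The sketch also assumes the client can find the previous value, which it tracks in a local map here; against a real cluster that would presumably mean a read before the write or carrying the old value along.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// In-memory model of client-side index maintenance; hypothetical names throughout.
class IndexMaintenanceSketch {
    // Index CF: row key = index name, column names = data row keys.
    private final Map<String, TreeSet<String>> indexCf = new HashMap<>();
    // Stand-in for "what is this row currently indexed under?" (assumption:
    // a real client would need a read or cached state to know this).
    private final Map<String, String> currentIndexOf = new HashMap<>();

    // Moves dataKey to newIndexName and returns the number of index-row
    // mutations performed: 0 if unchanged, 1 if newly indexed, 2 if moved.
    int setIndexedValue(String dataKey, String newIndexName) {
        String old = currentIndexOf.put(dataKey, newIndexName);
        if (newIndexName.equals(old)) {
            return 0;
        }
        int mutations = 0;
        if (old != null) {
            indexCf.get(old).remove(dataKey); // delete column from the old index row
            mutations++;
        }
        indexCf.computeIfAbsent(newIndexName, k -> new TreeSet<>()).add(dataKey);
        mutations++; // insert column into the new index row
        return mutations;
    }

    boolean isIndexed(String indexName, String dataKey) {
        TreeSet<String> row = indexCf.get(indexName);
        return row != null && row.contains(dataKey);
    }

    public static void main(String[] args) {
        IndexMaintenanceSketch idx = new IndexMaintenanceSketch();
        System.out.println(idx.setIndexedValue("row-1", "THS")); // first time indexed
        System.out.println(idx.setIndexedValue("row-1", "TRS")); // moved between index rows
        System.out.println(idx.setIndexedValue("row-1", "TRS")); // no change
    }
}
```

In this model every change to an indexed value costs up to two index mutations on top of the data write, which is the overhead worth timing.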
