Pseudo-table proposal

Doug Judd

Mar 5, 2013, 10:14:45 AM

to hypertable-user, hyperta...@googlegroups.com

This is a proposal for the introduction of pseudo-tables into Hypertable. The idea came about while we were looking for an inexpensive way to discover large rows in a table. We zeroed in on the CellStore indexes because they contain information that can be used to estimate row sizes cheaply. The next question was how to provide access to the CellStore indexes through the API. Instead of adding a special-purpose ReadCellStoreIndexes API, I propose that we use the existing API as-is and surface the CellStore index information via a pseudo-table. A pseudo-table is a virtual table with no real table behind it. When a query comes in for the CellStore index pseudo-table, the CellStore indexes are read directly to satisfy the query. This approach is analogous to the /proc filesystem in Linux.

The pseudo-table that represents the CellStore indexes for a given table, foo, would have the name foo^.cellstore.index and the following schema:

create table foo^.cellstore.index (
  Size,
  CompressedSize,
  KeyCount
);

For each column family, there would be one qualified column for each block in the CellStore indexes. The column qualifier would have the format: <filename>:<hex-offset>. Also, the row key would be the same as the row key in the CellStore index entries (we assume that's what most people will want to aggregate this info on). So for example, the CellStore index block entry for file 2/2/default/ZwmE_ShYJKgim-IL/cs103 at offset 0x28A61 might generate the following keys:

help...@premiermiles.com Size:2/2/default/ZwmE_ShYJKgim-IL/cs103:0000000000028A61 171728

help...@premiermiles.com CompressedSize:2/2/default/ZwmE_ShYJKgim-IL/cs103:0000000000028A61 65231

help...@premiermiles.com KeyCount:2/2/default/ZwmE_ShYJKgim-IL/cs103:0000000000028A61 281
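To make the key layout above concrete, here is a minimal sketch of how a client might split one of these cells back into its parts. The `IndexCell` type, the `parse_index_cell` helper, and the `row-key-1` row key are hypothetical names made up for illustration; nothing like them exists in the Hypertable API.

```python
from typing import NamedTuple


class IndexCell(NamedTuple):
    row: str       # row key taken from the CellStore index entry
    family: str    # Size, CompressedSize, or KeyCount
    filename: str  # CellStore file the index block belongs to
    offset: int    # block offset within that file
    value: int     # the number stored in the cell


def parse_index_cell(row: str, column: str, value: str) -> IndexCell:
    """Split '<family>:<filename>:<hex-offset>' into its parts (hypothetical helper)."""
    family, qualifier = column.split(":", 1)
    # Split from the right so the hex offset is always the last field.
    filename, hex_offset = qualifier.rsplit(":", 1)
    return IndexCell(row, family, filename, int(hex_offset, 16), int(value))


cell = parse_index_cell(
    "row-key-1",  # placeholder row key, not from the original example
    "Size:2/2/default/ZwmE_ShYJKgim-IL/cs103:0000000000028A61",
    "171728",
)
```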

To query the cellstore.index pseudo-table for table foo to find an estimate of large rows, you would issue a query along the lines of the following:

SELECT sum(Size) FROM foo^.cellstore.index WHERE sum(Size) > 100000000;
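As a rough illustration of what that query is meant to compute, the sketch below does the same aggregation client-side in Python over cells that have already been fetched. The `large_rows` function, the row keys, and the shortened file names `cs1` and `cs2` are made up for the example. Because each index block is attributed to the row key of its index entry, the totals are estimates rather than exact row sizes.

```python
from collections import defaultdict


def large_rows(cells, threshold):
    """Sum the Size column per row key and keep rows whose total exceeds threshold."""
    totals = defaultdict(int)
    for row, column, value in cells:
        family = column.split(":", 1)[0]
        if family == "Size":  # ignore CompressedSize and KeyCount cells
            totals[row] += int(value)
    return {row: total for row, total in totals.items() if total > threshold}


# Hypothetical (row, column, value) cells as they might come back from a scan.
cells = [
    ("row-a", "Size:cs1:0000000000000000", "60000000"),
    ("row-a", "Size:cs2:0000000000028A61", "50000000"),
    ("row-a", "KeyCount:cs1:0000000000000000", "281"),
    ("row-b", "Size:cs1:0000000000010000", "171728"),
]
result = large_rows(cells, 100_000_000)
```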

Please respond with feedback or if you have any questions. Thanks!

- Doug

Doug Judd

Mar 5, 2013, 12:30:50 PM

to hypertable-user, hyperta...@googlegroups.com

Comments inline ...

On Tue, Mar 5, 2013 at 9:15 AM, ddorian <dorian...@gmail.com> wrote:

> why do people want to discover large rows?

Hypertable (like other Bigtable-based databases) limits the maximum size of a row: a row cannot grow larger than a single range. The default range split size is 256MB. Once a row hits this limit, it stops accepting updates and returns a ROW_OVERFLOW error. Some applications want to anticipate this situation before it happens and handle large rows differently.
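A minimal sketch of how an application might anticipate this, given per-row size estimates it has already collected. The `rows_near_limit` helper and the 80% warning fraction are assumptions made for illustration, as is reading 256MB as 256 * 1024 * 1024 bytes.

```python
# Assumption: "256MB" is taken as 256 * 1024 * 1024 bytes.
DEFAULT_SPLIT_SIZE = 256 * 1024 * 1024


def rows_near_limit(estimates, split_size=DEFAULT_SPLIT_SIZE, fraction=0.8):
    """Return row keys whose estimated size is at least `fraction` of the split size.

    `fraction` is an arbitrary warning level chosen for this example.
    """
    cutoff = split_size * fraction
    return sorted(row for row, size in estimates.items() if size >= cutoff)


# Hypothetical per-row size estimates in bytes.
estimates = {
    "row-a": 250 * 1024 * 1024,
    "row-b": 10 * 1024 * 1024,
}
at_risk = rows_near_limit(estimates)
```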

> if this will have overhead make it optional?

It will have no overhead. Pre-existing applications (that don't use the feature) will see no impact.

> other ways that pseudo-tables will be used?

I can imagine using the pseudo-table interface for exposing CellStore trailer data as well as in-memory statistics from AccessGroups, Ranges, and RangeServers.

> very happy to see the first signs of advanced queries

Me too. The 0.9.8 release and beyond will mostly be focused on advanced query capabilities. We'll get a roadmap document in place soon.

- Doug

Doug Judd

Mar 5, 2013, 5:51:15 PM

to hypertable-user, hyperta...@googlegroups.com

Yes to both questions. When it comes to API change proposals, it's good to get feedback from people on both lists. Aggregate functions, like sum(), will be evaluated as far down as possible to avoid unnecessarily passing data around.
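To illustrate the general idea of evaluating an aggregate close to the data, here is a small sketch in which each server reduces its own cells to partial sums and the client only merges those partials. The function names and the sample data are hypothetical, and this is not a description of how Hypertable implements it.

```python
from collections import Counter


def partial_sums(cells):
    """Run where the data lives: collapse (row, size) cells into per-row sums."""
    sums = Counter()
    for row, size in cells:
        sums[row] += size
    return sums


def merge_partials(partials):
    """Run at the client: combine the already-reduced results from each server."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return total


# Hypothetical cells held by two different servers.
server_a = [("r1", 10), ("r1", 5), ("r2", 7)]
server_b = [("r3", 3)]
merged = merge_partials([partial_sums(server_a), partial_sums(server_b)])
```

Only one number per row crosses the network here instead of one per index block, which is the point of pushing the aggregate down.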

- Doug

On Tue, Mar 5, 2013 at 2:41 PM, ddorian <dorian...@gmail.com> wrote:

> Looks like when i reply (and you too) it gets posted at both hypertable-user and hypertable-dev (intentional?).
>
> I hope that the sum() and new other operations happen first at the rangeserver and then (if necessary) at the thriftclient.

--
You received this message because you are subscribed to the Google Groups "Hypertable User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hypertable-us...@googlegroups.com.
To post to this group, send email to hyperta...@googlegroups.com.
Visit this group at http://groups.google.com/group/hypertable-user?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.

--

Doug Judd

CEO, Hypertable Inc.

Christoph Rupp

Mar 6, 2013, 11:54:39 AM

to hyperta...@googlegroups.com, hyperta...@googlegroups.com

It reminds me of the /proc filesystem, but also of SQL views. Basically we provide additional metadata, and there is a view on that data which looks and feels like a regular HQL table. I have always found views very useful for giving applications a consistent view of a table even if the underlying table structure changes between versions.

In Hypertable a view would just be a dispatcher to either the regular column families or to the pseudo-tables. Later we could maybe implement "real" views if the need comes up.

bye
Christoph

Doug Judd

Mar 6, 2013, 8:32:11 PM

to hyperta...@googlegroups.com, hypertable-user

I agree that views are useful and we should add support for them at some point. However, a view is just a query and is orthogonal to this pseudo-table idea. A view could reference normal tables, pseudo-tables, or both.

- Doug
