Improved FAQ

Doug Judd

Feb 21, 2013, 5:43:00 PM

to hypertable-user, hyperta...@googlegroups.com

I've re-formatted the Hypertable FAQ page to be more useful. There are two improvements worth noting:

1. It's now all on one page, so you can easily search through the entire document by pressing Ctrl-F in your browser.
2. There is now a Table of Contents at the top with links to each individual entry. You can obtain a direct link to any entry by clicking on it in the Table of Contents and then copying the URL from the address bar in your browser.

I've also added the following two new entries:

Q: The RangeServer machines are swapping, how do I fix this?

Q: How do I get Hypertable to run on top of CDH4?

If you have any suggestions for how to improve the FAQ, including any entries that should be added, please let me know.

- Doug

ddorian

Feb 21, 2013, 7:29:54 PM

to hyperta...@googlegroups.com

I would like the FAQ to answer more general questions (right now the entries are mostly about issues). For example:

How indexing works:

Index data is stored in another range (unlike MongoDB, for example, where each range holds the index for its own data).
Currently only "=" and "^=" are supported on indexes, not "<" and ">"?
Is it possible to change the index block size and compression?
Ascending and descending indexes?
Since the index and the data can be on different RangeServers, how is consistency guaranteed?
No indexing on COUNTER columns! Any alternative?

Can't delete ranges of rows?

Can Hypertable work on Windows? Include a link to the Windows distribution download and a link to the Windows build FAQ.

Hypertable can fulfill many kinds of queries: which are fast and which are slow (large skips? some regexes? value searches without indexes? all queries that result in table scans)?

Can I store large blobs? How large, optimally?
Is there a GUI admin tool?
When I insert a batch of cells to multiple RangeServers, what happens if some ranges fail? What if the Thrift client fails (will no cells be inserted? some?)?
Is it possible to specify multiple Thrift clients in the connection code?
What about cell typing? Not yet.
Any roadmap? From what I've read, the hardest part of the whole Hypertable development was failover, so now that it's implemented, maybe it is easier to create an official roadmap?
How long does failover of one server take? (This depends on many factors; list which ones and show some examples.) What happens to the client during failover (does it not see the data)? What about inserts to that RangeServer?
I read somewhere that the block cache (caching uncompressed data in RAM) is now disabled because it was slower than caching compressed blocks (memory mapped?). Put it in the FAQ. Maybe even change the last section on the architecture page.
When a CellCache is being written to disk, does it check for expired cells? Maybe check only those cells that have TTL < x seconds?
Can I iterate over rows in reverse order?
Should I use ECC RAM? Yes!

Doug Judd

Feb 21, 2013, 8:21:54 PM

to hypertable-user

Thank you, Dorian! Responses inline ...

On Thu, Feb 21, 2013 at 4:29 PM, ddorian <dorian...@gmail.com> wrote:

> I would like the FAQ to answer more general questions (right now the entries are mostly about issues). For example:

> How indexing works:

I'll add a FAQ entry for this.

> Index data is stored in another range (unlike MongoDB, for example, where each range holds the index for its own data).
> Currently only "=" and "^=" are supported on indexes, not "<" and ">"?

Just filed issue 1021 for this.
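
For readers wondering why "=" and "^=" are natural first operations for an index, here is a minimal Python sketch. It is a generic illustration of lookups against a sorted list of (value, row key) pairs, not Hypertable's actual index implementation; build_index, lookup_exact, and lookup_prefix are hypothetical names made up for the example.

```python
import bisect


def build_index(cells):
    """Return a sorted list of (value, row_key) pairs, one per indexed cell."""
    return sorted((value, row) for row, value in cells)


def lookup_exact(index, value):
    """Row keys whose indexed value equals `value` (the "=" case)."""
    start = bisect.bisect_left(index, (value, ""))
    rows = []
    for indexed_value, row in index[start:]:
        if indexed_value != value:
            break  # sorted order: no later entry can match
        rows.append(row)
    return rows


def lookup_prefix(index, prefix):
    """Row keys whose indexed value starts with `prefix` (the "^=" case)."""
    start = bisect.bisect_left(index, (prefix, ""))
    rows = []
    for indexed_value, row in index[start:]:
        if not indexed_value.startswith(prefix):
            break  # all values sharing the prefix are contiguous
        rows.append(row)
    return rows


# Hypothetical data: (row key, indexed column value)
cells = [("row1", "apple"), ("row2", "apricot"), ("row3", "banana"), ("row4", "apple")]
index = build_index(cells)
exact_rows = lookup_exact(index, "apple")
prefix_rows = lookup_prefix(index, "ap")
```

Both lookups jump to one position in the sorted data and read forward until the first non-match, which is why they avoid a full table scan.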

> Is it possible to change the index block size and compression?

Issue 845 has already been filed to add support for this.

> Ascending and descending indexes?

I think this could be handled with a reverse scanner (see below).

> Since the index and the data can be on different RangeServers, how is consistency guaranteed?

I'll add a FAQ entry for this.

> No indexing on COUNTER columns! Any alternative?

There's no easy way to do this with our current COUNTER implementation.

> Can't delete ranges of rows?

This will get fixed in the next major release (0.9.8).

> Can Hypertable work on Windows? Include a link to the Windows distribution download and a link to the Windows build FAQ.

I'll add a FAQ entry for this.

> Hypertable can fulfill many kinds of queries: which are fast and which are slow (large skips? some regexes? value searches without indexes? all queries that result in table scans)?

This really deserves its own document because it's a big subject. I just filed issue 1023 to track it.

> Can I store large blobs? How large, optimally?

I'll add a FAQ entry for this.

> Is there a GUI admin tool?

See the Monitoring System document. I'll add a FAQ entry pointing to this document.

> When I insert a batch of cells to multiple RangeServers, what happens if some ranges fail? What if the Thrift client fails (will no cells be inserted? some?)?

Let me think a bit about where this information should be presented in the documentation. A FAQ entry might be appropriate.

> Is it possible to specify multiple Thrift clients in the connection code?

There is a bug in Thrift that is currently preventing this. I'll add a FAQ entry.

> What about cell typing? Not yet.

Cell typing will come immediately after the 1.0 release.

> Any roadmap? From what I've read, the hardest part of the whole Hypertable development was failover, so now that it's implemented, maybe it is easier to create an official roadmap?

I'll add a Roadmap page under the Community section of the www.hypertable.com site.

> How long does failover of one server take? (This depends on many factors; list which ones and show some examples.) What happens to the client during failover (does it not see the data)? What about inserts to that RangeServer?

We recently added a Machine Failure page that describes some of this. I'll flesh it out more.

> I read somewhere that the block cache (caching uncompressed data in RAM) is now disabled because it was slower than caching compressed blocks (memory mapped?). Put it in the FAQ. Maybe even change the last section on the architecture page.

This information might be more appropriate in a "Performance Optimization" page. We'll tackle it with issue 1023.

> When a CellCache is being written to disk, does it check for expired cells? Maybe check only those cells that have TTL < x seconds?

Yes, it purges expired cells (TTL, MAX_VERSIONS, and deletes) when written to disk. The system also periodically estimates the amount of garbage that has accumulated and once it exceeds a threshold, it will do a "Garbage Collection" compaction. I've filed issue 1024 to track adding documentation for how this works.
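
To make the purge rules concrete, here is a minimal Python sketch of what "purging expired cells (TTL, MAX_VERSIONS, and deletes)" could look like. It is a simplified model written for this thread, not Hypertable's code: the cell layout (row, column, timestamp, value), the use of None as a delete marker, and the names purge and garbage_fraction are all assumptions made for illustration.

```python
from collections import defaultdict


def purge(cells, now, ttl, max_versions):
    """Return the cells that survive a flush to disk.

    Each cell is (row, column, timestamp, value); a value of None stands
    in for a delete marker (an assumption of this sketch).
    """
    by_key = defaultdict(list)
    for row, column, ts, value in cells:
        by_key[(row, column)].append((ts, value))

    survivors = []
    for (row, column), versions in sorted(by_key.items()):
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first
        kept = 0
        for ts, value in versions:
            if value is None:
                break  # delete marker hides this and every older version
            if now - ts > ttl:
                break  # this version and every older one has expired
            if kept == max_versions:
                break  # enough versions kept already
            survivors.append((row, column, ts, value))
            kept += 1
    return survivors


def garbage_fraction(cells, survivors):
    """Share of cells that a purge would discard (0.0 for empty input)."""
    return 1 - len(survivors) / len(cells) if cells else 0.0


# Hypothetical data
cells = [
    ("r1", "c", 100, "a"), ("r1", "c", 90, "b"), ("r1", "c", 80, "c"),
    ("r2", "c", 10, "old"),
    ("r3", "c", 95, None), ("r3", "c", 90, "x"),
]
kept = purge(cells, now=100, ttl=50, max_versions=2)
fraction = garbage_fraction(cells, kept)
```

A "Garbage Collection" compaction as described above would then be triggered once an estimate like garbage_fraction exceeds some threshold; how the real system computes that estimate is not covered here.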

> Can I iterate over rows in reverse order?

Just filed issue 1022 to track this.

> Should I use ECC RAM? Yes!

I'll add a FAQ entry for this.

--
You received this message because you are subscribed to the Google Groups "Hypertable User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hypertable-us...@googlegroups.com.
To post to this group, send email to hyperta...@googlegroups.com.
Visit this group at http://groups.google.com/group/hypertable-user?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.

--

Doug Judd

CEO, Hypertable Inc.

dorian i

Feb 21, 2013, 10:20:02 PM

to hyperta...@googlegroups.com

Some others:

What happens when I insert into an indexed column and only the RangeServer that holds the index data is currently down?

What about a new type of group commit (probably makes no sense):

CURRENT WAY: (add to the FAQ that data loss is possible)

old_group_commit_interval = 1 second
many clients send data every second
Hypertable immediately acknowledges and returns success to the client
every second it writes to the log

NEW WAY: (when I read the docs, I thought it worked like this)

new_group_commit_interval = 1 second
many clients send data to Hypertable every second
every second it writes to the log
it then returns success to all the clients that are waiting

Also add the roadmap to the FAQ.

If I use the C++ interface, I don't need a Thrift client because my client becomes the Thrift client? Since the thrift_client is written in C++, can't you just import it and run it?

In streaming MapReduce, is each job run on the same RangeServer that it is getting data from?

When to use Hypertable, and when not to?

Datacenter replication?

What about a large number of tables? Is it efficient, or how much overhead is there per table?

Does Hypertable have a problem like HBase with the low number of column families? No. Explain? (What is the reason in HBase? Does Hypertable work differently?)

Don't forget to create an issue on hypertable-develop for people to discuss different typing implementations?

There are many questions that can be answered with simple links; I myself like a 'fat' FAQ like MongoDB's.

Doug Judd

Feb 22, 2013, 12:39:06 AM

to hypertable-user

Hi Dorian,

Comments inline ...

On Thu, Feb 21, 2013 at 7:20 PM, dorian i <dorian...@gmail.com> wrote:

> Some others:

> What happens when I insert into an indexed column and only the RangeServer that holds the index data is currently down?

Prior to RangeServer failover, the insert would fail leaving the index consistent. Now with RangeServer failover, the insert would be delayed until the RangeServer is recovered and then complete successfully.

> What about a new type of group commit (probably makes no sense):
>
> CURRENT WAY: (add to the FAQ that data loss is possible)
>
> old_group_commit_interval = 1 second
> many clients send data every second
> Hypertable immediately acknowledges and returns success to the client
> every second it writes to the log
>
> NEW WAY: (when I read the docs, I thought it worked like this)
>
> new_group_commit_interval = 1 second
> many clients send data to Hypertable every second
> every second it writes to the log
> it then returns success to all the clients that are waiting

The group commit has always worked the NEW WAY. There has never been a scenario where successful inserts via group commit reported back to clients would result in data loss.
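
The ordering described here (log first, acknowledge second) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the RangeServer implementation: GroupCommitLog, submit, and commit are hypothetical names, a plain list stands in for the durable commit log, and the timer that would call commit once per interval is left out.

```python
class GroupCommitLog:
    """Toy model of group commit: updates are acknowledged only after the log write."""

    def __init__(self):
        self.pending = []  # (client_id, data) received since the last commit
        self.log = []      # stands in for the durable commit log
        self.acked = []    # client ids that have been told "success"

    def submit(self, client_id, data):
        """Buffer an update; the client is NOT acknowledged yet."""
        self.pending.append((client_id, data))

    def commit(self):
        """Called once per group-commit interval; returns the batch size."""
        batch, self.pending = self.pending, []
        # 1. Write the whole batch to the log in one go.
        self.log.extend(data for _, data in batch)
        # 2. Only then report success to every waiting client.
        self.acked.extend(client_id for client_id, _ in batch)
        return len(batch)


commit_log = GroupCommitLog()
commit_log.submit("client-a", "update 1")
commit_log.submit("client-b", "update 2")
acked_before_commit = list(commit_log.acked)  # nobody has been acknowledged yet
batch_size = commit_log.commit()
```

Because no client sees success before its update is in the log, a crash between two commits can only lose updates that were never acknowledged.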

> Also add the roadmap to the FAQ.

Ok I'll add that to the FAQ.

> If I use the C++ interface, I don't need a Thrift client because my client becomes the Thrift client? Since the thrift_client is written in C++, can't you just import it and run it?

There are two C++ interfaces, the native interface and the Thrift interface. There are reasons why you might want to use one vs. the other. I'll add a FAQ for this.

> In streaming MapReduce, is each job run on the same RangeServer that it is getting data from?

Yes. I'll fatten up the MapReduce documentation to mention this.

> When to use Hypertable, and when not to?

Hypertable should always be used. ;) I'll add a FAQ entry for this.

> Datacenter replication?

We're nearly finished with the implementation. It will be announced soon.

> What about a large number of tables? Is it efficient, or how much overhead is there per table?

There is definitely some overhead to creating a table. We'll need to quantify it and then add a FAQ entry.

> Does Hypertable have a problem like HBase with the low number of column families? No. Explain? (What is the reason in HBase? Does Hypertable work differently?)

No, Hypertable doesn't have a problem with low number of column families.

> Don't forget to create an issue on hypertable-develop for people to discuss different typing implementations?

We'll definitely engage the community before we settle on a design for column typing.

PapyRef

Feb 22, 2013, 2:40:17 AM

to hyperta...@googlegroups.com, hyperta...@googlegroups.com, do...@hypertable.com

I would like the FAQ to answer more general questions about querying.
For example:
- searching for rows (by row key, column, ...)
- multi-criteria search

Doug Judd

Feb 22, 2013, 10:29:32 AM

to hypertable-user, hyperta...@googlegroups.com

Thanks for the feedback. I think this warrants its own section. I just added issue 1025 to track it. BTW, someone contacted me off-list with a great FAQ question: Is there a single point of failure in Hypertable? The short answer is "no" and we describe it in some detail under Machine Failure, but we'll add a FAQ entry for it as well.

- Doug


sambus

Feb 22, 2013, 7:17:20 PM

to hyperta...@googlegroups.com, hyperta...@googlegroups.com, do...@hypertable.com

Fixing typos damn laptop

What should maybe also be mentioned, for people who use PHP, is that PHP slows down when you retrieve a large number of rows, because PHP takes ages to create objects. Before, my script was running 24 hours a day and one execution took up to 30-40 minutes to work the queue down, and it was always behind. Now I have made a dirty PHP/C++ solution: the script runs every 45 minutes, does double or triple the work within 5-10 minutes, and empties the queue in one run. What I want to say is that people (new PHP users) should have an idea of whose fault it is. It's slow because of PHP, not because of Hypertable.

PapyRef

Feb 22, 2013, 10:53:55 PM

to hyperta...@googlegroups.com, hyperta...@googlegroups.com, do...@hypertable.com

Sambus,

I use PHP and I have the same problem of PHP slowing down.
I am interested in your PHP/C++ solution.
Could you share some examples?

irXpoder

Feb 23, 2013, 6:22:48 AM

to hyperta...@googlegroups.com

ddorian: yes, unless I get the cells unserialized, in which case I have only binary data, which is not an option. PHP is slow when it comes to serializing and creating objects.

PapyRef:

I'll explain what I have done. I have one main PHP script, which uses exec to run two C++ apps:

<?php

// ... does something ...

// cc1app is just a C++ app that runs one HQL statement to retrieve the data.
// Sample: http://pastebin.com/1niAvi1P
exec('/bla/cc1app > text');

$text = file_get_contents('text');

// ... convert the text to an array (I did it with explode() and preg_match()),
// work with that array, and build the HQL statements needed next in $hql ...

file_put_contents('2ndfile', $hql);

// cc2app retrieves data again, using the HQL statements from 2ndfile.
// Sample: http://pastebin.com/RamQ9iRW
exec('/bla/cc2app > 2ndoutput');

// ... work with the data ...

?>

What you may need to change are the HQL statements where this function starts:
"void test_hql(Thrift::Client *client, std::ostream &out) {"

The first app only runs an HQL statement to retrieve the data and prints the result to the shell, so I just execute it, redirect the output to a file, and then read that file from PHP.

The second app reads a file full of HQL statements, executes them, and prints the output again.

As I said, dirty and simple. In short, I use C++ to retrieve the data instead of thrift-php. See here if you need more help: http://hypertable.com/documentation/code_examples/cpp/
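
For anyone who wants to try the same "let a compiled helper fetch the data, then parse its text output" approach from Python, here is a minimal sketch. Everything in it is an assumption made for illustration: run_helper and parse_cells are hypothetical names, the tab-separated row/column/value format is invented (the output format of the C++ apps above is not shown in this thread), and a small child Python process stands in for the C++ helper so the example runs anywhere.

```python
import subprocess
import sys


def run_helper(command):
    """Run an external helper program and return its stdout as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def parse_cells(text):
    """Turn tab-separated lines (row, column, value) into a list of dicts."""
    cells = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row, column, value = line.split("\t", 2)
        cells.append({"row": row, "column": column, "value": value})
    return cells


# Stand-in for the C++ helper: a child process that prints two cells.
fake_helper = [
    sys.executable,
    "-c",
    r"print('url1\tcrawl_status\t0'); print('url2\tcrawl_status\t1')",
]
cells = parse_cells(run_helper(fake_helper))
```

Reading the helper's stdout directly skips the intermediate text file; whether that matters depends on how much data comes back in one run.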

sambus

Feb 23, 2013, 6:36:14 AM

to hyperta...@googlegroups.com, irxp...@googlemail.com

One more thing, PapyRef: if you use only PHP, it also speeds things up to fetch small result sets, e.g. LIMIT 1000 instead of 10000, and to use loops and dynamic queries with LIMIT and OFFSET. This also improves performance, but the bigger your loop gets, the slower it gets again.

e.g. http://pastebin.com/DfMkVqJ5

PapyRef

Feb 23, 2013, 9:21:08 AM

to hyperta...@googlegroups.com, irxp...@googlemail.com

irXpoder & sambus, Thanks for your help.

Christian.

PapyRef

Feb 24, 2013, 4:50:12 AM

to hyperta...@googlegroups.com, irxp...@googlemail.com

sambus,

I rewrote our PHP code using the Thrift API scanner functions like this:

$limit = 2;
for ($i = 0; $i <= 100; $i++) {
    $offset = $limit * $i;
    // Equivalent HQL:
    // SELECT crawl_status FROM Urls WHERE VALUE REGEXP '^0$' OFFSET $offset LIMIT $limit
    $HTscanner = $HTclient->scanner_open($HTnamespace, "Urls", new Hypertable_ThriftGen_ScanSpec(array(
        'cell_offset'  => $offset,
        'cell_limit'   => $limit,
        'columns'      => array('crawl_status'),
        'value_regexp' => '^0$')));
    $HTcells = $HTclient->scanner_get_cells($HTscanner);
    // Close each scanner once its cells have been fetched so scanners don't pile up.
    $HTclient->scanner_close($HTscanner);
    if (empty($HTcells)) { break; }

    foreach ($HTcells as $cell) {
        // ... do some work with $cell->key->timestamp, $cell->key->row, $cell->key->column_qualifier, ...
    }
}
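
The paging pattern in the PHP loop above, sketched in Python so it can be run without a cluster. This is only an illustration: scan_in_pages and fake_fetch are hypothetical names, and fake_fetch slices an in-memory list where a real client would open a scanner with an offset and a limit.

```python
def scan_in_pages(fetch_page, limit, max_pages=None):
    """Yield cells page by page until an empty page comes back."""
    page = 0
    while max_pages is None or page < max_pages:
        cells = fetch_page(offset=page * limit, limit=limit)
        if not cells:
            break  # no more data
        yield from cells
        page += 1


# Stand-in data source (assumption: a real client would query the database here).
DATA = [f"cell{i}" for i in range(7)]
requested_offsets = []


def fake_fetch(offset, limit):
    requested_offsets.append(offset)
    return DATA[offset:offset + limit]


all_cells = list(scan_in_pages(fake_fetch, limit=3))
```

Each request carries a larger offset than the one before. If the server has to skip past `offset` cells before it can return a page (a plausible reading of the slowdown reported earlier in this thread, not something confirmed here), the total work grows much faster than the number of pages.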

dorian i

Feb 28, 2013, 4:12:29 PM

to hyperta...@googlegroups.com

Another FAQ entry:

Hypertable stores compressed blocks if there is a good compression ratio for that block. What is the ratio? And is it configurable?

ddorian

Mar 7, 2013, 8:53:32 PM

to hyperta...@googlegroups.com, hyperta...@googlegroups.com, do...@hypertable.com

Another:

set_cell_as_arrays cannot be used for deletes?

irXpoder

Mar 8, 2013, 3:59:46 AM

to hyperta...@googlegroups.com

No, it can be used for deletes.


ddorian

Mar 12, 2013, 10:48:39 PM

to hyperta...@googlegroups.com, hyperta...@googlegroups.com, do...@hypertable.com

@doug

Maybe you can implement the datacenter replication in a flexible way like Couchbase (they use datacenter replication to feed changes to Elasticsearch; MongoDB does the same with its replication).

@irXpoder

Is the API for deleting the same with Cells and cells_as_arrays?

Doug Judd

Mar 13, 2013, 12:03:37 AM

to hypertable-user, hyperta...@googlegroups.com

That's a great idea. We'll definitely consider it.

- Doug


ddorian

Mar 13, 2013, 10:44:56 AM

to hyperta...@googlegroups.com, hyperta...@googlegroups.com, do...@hypertable.com

@doug:

irXpoder said that it is possible to delete cells using cells_as_arrays (without creating Cell and Key objects)?

I tried some examples but they didn't work. What should the format of the array be when deleting, and at which index should the cell flag go? (I'm using Python.)

irXpoder

Mar 13, 2013, 12:43:19 PM

to hyperta...@googlegroups.com

ddorian: check this and read the #keyflag section:

foreach($a as $b){
    $key = new Hypertable_ThriftGen_Key(array('row'=> $tcell[0], 'flag' => 'DELETE_ROW'));
    $cellt[] = new Hypertable_ThriftGen_Cell(array('key' => $key, 'value'=> ''));
}
$client->mutator_set_cells($mutatorurls, $cellt);

http://hypertable.com/documentation/reference_manual/thrift_api/#keyflag


dorian i

Mar 13, 2013, 12:52:23 PM

to hyperta...@googlegroups.com

That works for me too. Notice the difference between mutator_set_cells and mutator_set_cells_as_arrays. Your code, translated to Python, works, but you are still creating objects for the Key and Cell. I asked whether it was possible to use just arrays (without creating objects). Example:

client.mutator_set_cell(mutator, Cell(Key(k, None, None, None, None, 0)))
(here we are creating Cell and Key objects)

client.mutator_set_cell_as_array(mutator, ['row','column','qualifier','value'])
(notice that [ ] is a Python list, like an array in PHP; we are not creating Cell and Key objects)
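
To spell out the difference, here is a minimal Python sketch that uses stand-in dataclasses instead of the generated Thrift classes. The field names, the DELETE_ROW = 0 and INSERT = 255 values, and the helper names row_deletes and array_to_cell are assumptions for illustration (the 0 matches the last argument of Key(...) above); the real Key and Cell come from the Hypertable Thrift bindings.

```python
from dataclasses import dataclass
from typing import Optional

DELETE_ROW = 0  # assumed flag value, matching the 0 passed to Key(...) above
INSERT = 255    # assumed default flag for ordinary inserts


@dataclass
class Key:
    """Stand-in for the generated Thrift Key class (field names assumed)."""
    row: str
    column_family: Optional[str] = None
    column_qualifier: Optional[str] = None
    timestamp: Optional[int] = None
    revision: Optional[int] = None
    flag: int = INSERT


@dataclass
class Cell:
    """Stand-in for the generated Thrift Cell class."""
    key: Key
    value: Optional[str] = None


def row_deletes(row_keys):
    """Object form: one Cell per row, with the delete flag set on its Key."""
    return [Cell(Key(row, flag=DELETE_ROW)) for row in row_keys]


def array_to_cell(fields):
    """Array form as used above: four strings, with no slot for a flag."""
    row, column_family, column_qualifier, value = fields
    return Cell(Key(row, column_family, column_qualifier), value)


deletes = row_deletes(["row1", "row2"])
plain = array_to_cell(["row", "column", "qualifier", "value"])
```

With the real bindings, a list like deletes is what would be handed to mutator_set_cells, as in the PHP example earlier in the thread. The four-element array only ever produces an ordinary insert in this sketch, which is the limitation being asked about.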

irXpoder

Mar 13, 2013, 1:02:11 PM

to hyperta...@googlegroups.com

Ah sorry, then I think it's not possible.

dorian i

Mar 13, 2013, 1:16:01 PM

to hyperta...@googlegroups.com

That's what I thought too, and it's why I said that it needs to be put in the FAQ.

Doug Judd

Mar 13, 2013, 7:49:28 PM

to hypertable-user

We'll add it to the docs. However, with the 0.9.8 release, we're going to do some API overhaul and we'll be sure to improve the "as_arrays" interfaces to support deletes.

- Doug


ddorian

Mar 16, 2013, 2:40:49 PM

to hyperta...@googlegroups.com, hyperta...@googlegroups.com, do...@hypertable.com

The timestamp can be supplied by the application at insert time, or can be auto-generated (default).

Is the timestamp generated on the Thrift client, on the RangeServer, or on the client side?

Doug Judd

Mar 16, 2013, 5:36:13 PM

to hypertable-user, hyperta...@googlegroups.com

Currently it's on the RangeServer, but this will probably be changed in the near future to be inside the Client library (e.g. ThriftBroker).

- Doug

Alex Kashirin

Mar 18, 2013, 12:45:31 PM

to hyperta...@googlegroups.com, hyperta...@googlegroups.com, do...@hypertable.com

PapyRef,

Can you refer to this post: https://groups.google.com/forum/?fromgroups=#!topic/hypertable-user/9Qv49Y-424I

Do you know much about compiling PHP extensions?

That should definitely help the Thrift execution.

Thanks,

Alex

ddorian

Apr 6, 2013, 9:15:06 AM

to hyperta...@googlegroups.com, hyperta...@googlegroups.com, do...@hypertable.com

Add to the Community page a subpage called Papers/WhitePapers/Slides/Presentations/Videos and merge it with the case studies page under Customers?
