Zookeeper as lock manager for opentsdb.


Andrey Stepachev

Feb 6, 2013, 2:07:51 AM
to open...@googlegroups.com
Hi,

Three months ago https://github.com/OpenTSDB/opentsdb/pull/132 was written. It helped me solve a scaling problem caused by the high volume of new metrics coming into OpenTSDB in my clusters.

I think that using HBase row locks is a very bad idea: such locks don't scale and can easily kill an HBase region server during a storm of new metrics.
ZooKeeper is a more robust solution, and it doesn't bring in additional dependencies: ZooKeeper is already used by HBase.

Does this pull request have any chance of being applied? It is really tedious to backport such a useful feature every time.

Let's think about how we can adapt this pull request if it doesn't fit OpenTSDB well. The additional jar dependency can be removed, of course.
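
For illustration, a minimal sketch of what "ZooKeeper as a lock manager" for UID assignment could look like is shown below. It uses Apache Curator's InterProcessMutex purely for brevity; the actual pull request may use a different recipe, and the connect string, lock path, and timeout are illustrative assumptions.

// Minimal sketch: serialize UID assignment across TSDs with a ZooKeeper lock.
// Curator, the ZK connect string, lock path and timeout are illustrative.
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class ZkUidLockSketch {
  public static void main(final String[] args) throws Exception {
    // Reuse the ZooKeeper quorum that HBase already depends on.
    final CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zkhost:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    // One lock per UID kind ("metrics", "tagk", "tagv"): only one TSD at a time
    // runs the read-MAX_ID / increment / write-both-mappings sequence.
    final InterProcessMutex lock = new InterProcessMutex(zk, "/opentsdb/uid-locks/metrics");
    if (!lock.acquire(5, TimeUnit.SECONDS)) {
      throw new IllegalStateException("timed out waiting for the UID lock");
    }
    try {
      // ... allocate the UID and write the forward and reverse mappings here ...
    } finally {
      lock.release();
    }
    zk.close();
  }
}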

--
Andrey.

ManOLamancha

Feb 7, 2013, 8:24:31 PM
to open...@googlegroups.com
On Wednesday, February 6, 2013 2:07:51 AM UTC-5, Andrey Stepachev wrote:
> Does this pull request have any chance of being applied? It is really tedious to backport such a useful feature every time.

I like this feature too but if 2.0 includes a backend abstraction layer, you'll want to port it for that. These ZK locks certainly helped our prod instance when we first turned it up, though even with a heap of 4GB we were killing our ZK instances. Thanks!

tsuna

Feb 14, 2013, 4:00:21 AM
to Andrey Stepachev, ManOLamancha, open...@googlegroups.com
On Tue, Feb 5, 2013 at 11:07 PM, Andrey Stepachev <oct...@gmail.com> wrote:
> I think that using HBase row locks is a very bad idea: such locks don't
> scale and can easily kill an HBase region server during a storm of new metrics.

Yes, it was a design mistake to store all the MAX_ID counters in the same row,
and to use an explicit row lock. In addition to this, there was a
limitation in HBase that made this whole dance more complicated than
necessary.

I believe it's possible to rewrite the code using atomicIncrement()
and compareAndSet() to have a lock-free, ZooKeeper-free version of the
code.

The algorithm would be roughly as follows:
0. Check whether the name already has a UID.
1. Do an atomicIncrement() on the MAX_ID to get a new UID.
2. compareAndSet() the forward mapping (UID => name).
3. compareAndSet() the reverse mapping (name => UID).

If we die between steps 1 and 2: we just waste a UID.
If we die between steps 2 and 3: we waste a UID and leave an orphan UID.
Step 1 is atomic, so no race condition is possible here.
Step 2 technically doesn't need to be a compareAndSet() because the ID
we're trying to assign is guaranteed unique by the atomicIncrement();
however it helps us be absolutely sure we don't unintentionally
overwrite an existing cell (which would only happen in really
weird / data-corruption cases).
Step 3 is racy, because if two TSDs attempt to assign a UID to the
same name at the same time, only one of them will have its
compareAndSet() succeed. The other one will have to retry from step 0
and find that the UID has been assigned, so the one it attempted to
assign will just be wasted.

We can always have the "tsdb uid fsck" command find the unused and
orphaned UIDs and put them in some kind of a free list for later use,
so that no UID is "wasted".

Does that sound reasonable? It would be a good excuse to rewrite the
UniqueId code to be fully asynchronous / non-blocking.
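
For concreteness, here is a minimal sketch of steps 0-3 on top of the asynchbase client (org.hbase.async) that OpenTSDB uses. The table, family and qualifier names, the MAX_ID row key, and the 8-byte UID width are illustrative assumptions rather than the exact OpenTSDB schema, and the Deferreds are join()ed for readability instead of being chained asynchronously.

// Hedged sketch of the lock-free UID assignment described above.
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.ArrayList;

import org.hbase.async.AtomicIncrementRequest;
import org.hbase.async.GetRequest;
import org.hbase.async.HBaseClient;
import org.hbase.async.KeyValue;
import org.hbase.async.PutRequest;

public final class LockFreeUidSketch {

  private static final Charset CHARSET = Charset.forName("ISO-8859-1");
  private static final byte[] TABLE = "tsdb-uid".getBytes(CHARSET);   // illustrative table name
  private static final byte[] MAXID_ROW = new byte[] { 0 };           // row holding the MAX_ID counters
  private static final byte[] ID_FAMILY = "id".getBytes(CHARSET);     // name => UID mappings
  private static final byte[] NAME_FAMILY = "name".getBytes(CHARSET); // UID => name mappings

  /** Returns the UID for `name` of the given kind (e.g. "metrics"), assigning one if needed. */
  static long getOrCreateUid(final HBaseClient client, final byte[] kind, final byte[] name)
      throws Exception {
    while (true) {
      // Step 0: check whether the name already has a UID.
      final GetRequest get = new GetRequest(TABLE, name).family(ID_FAMILY).qualifier(kind);
      final ArrayList<KeyValue> row = client.get(get).join();
      if (!row.isEmpty()) {
        return toLong(row.get(0).value());
      }

      // Step 1: atomically increment MAX_ID to claim a fresh UID.  No lock needed.
      final long uid = client.atomicIncrement(
          new AtomicIncrementRequest(TABLE, MAXID_ROW, ID_FAMILY, kind)).join();
      final byte[] uid_bytes = ByteBuffer.allocate(8).putLong(uid).array();

      // Step 2: CAS the forward mapping (UID => name), expecting an empty cell.
      // This only guards against overwriting an existing cell in corruption-like cases.
      final PutRequest fwd = new PutRequest(TABLE, uid_bytes, NAME_FAMILY, kind, name);
      if (!client.compareAndSet(fwd, HBaseClient.EMPTY_ARRAY).join()) {
        continue;  // should essentially never happen; the UID is wasted, retry from step 0
      }

      // Step 3: CAS the reverse mapping (name => UID).  If another TSD won the race
      // for this name, retry from step 0 and pick up the winner's UID; the UID we
      // claimed in step 1 is simply wasted.
      final PutRequest rev = new PutRequest(TABLE, name, ID_FAMILY, kind, uid_bytes);
      if (client.compareAndSet(rev, HBaseClient.EMPTY_ARRAY).join()) {
        return uid;
      }
    }
  }

  private static long toLong(final byte[] b) {
    long v = 0;
    for (final byte x : b) {
      v = (v << 8) | (x & 0xFF);
    }
    return v;
  }
}

In the real code the Deferreds would be chained with callbacks instead of join()ed, which is what would make the rewritten UniqueId code fully asynchronous / non-blocking.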

On Thu, Feb 7, 2013 at 5:24 PM, ManOLamancha <clars...@gmail.com> wrote:
> I like this feature too but if 2.0 includes a backend abstraction layer,
> you'll want to port it for that. These ZK locks certainly helped our prod
> instance when we first turned it up, though even with a heap of 4GB we were
> killing our ZK instances. Thanks!

Uhh, how on earth were you able to drive ZK out of memory by assigning
UIDs with TSD? That sounds just wrong.

--
Benoit "tsuna" Sigoure

Andrey Stepachev

Feb 14, 2013, 5:20:07 AM
to tsuna, ManOLamancha, open...@googlegroups.com
On Thu, Feb 14, 2013 at 1:00 PM, tsuna <tsun...@gmail.com> wrote:
> On Tue, Feb 5, 2013 at 11:07 PM, Andrey Stepachev <oct...@gmail.com> wrote:
>> I think that using HBase row locks is a very bad idea: such locks don't
>> scale and can easily kill an HBase region server during a storm of new metrics.
>
> Yes, it was a design mistake to store all the MAX_ID counters in the same row,
> and to use an explicit row lock. In addition to this, there was a
> limitation in HBase that made this whole dance more complicated than
> necessary.
>
> I believe it's possible to rewrite the code using atomicIncrement()
> and compareAndSet() to have a lock-free, ZooKeeper-free version of the
> code.

I think ZooKeeper could still be helpful in other ways, such as coordinating concurrent compactions.
But it's not a bad idea at all to keep ZooKeeper out of this until it's absolutely necessary.
 

> The algorithm would be roughly as follows:
>   0. Check whether the name already has a UID.
>   1. Do an atomicIncrement() on the MAX_ID to get a new UID.
>   2. compareAndSet() the forward mapping (UID => name).
>   3. compareAndSet() the reverse mapping (name => UID).
>
> If we die between steps 1 and 2: we just waste a UID.
> If we die between steps 2 and 3: we waste a UID and leave an orphan UID.
> Step 1 is atomic, so no race condition is possible here.
> Step 2 technically doesn't need to be a compareAndSet() because the ID
> we're trying to assign is guaranteed unique by the atomicIncrement();
> however it helps us be absolutely sure we don't unintentionally
> overwrite an existing cell (which would only happen in really
> weird / data-corruption cases).
> Step 3 is racy, because if two TSDs attempt to assign a UID to the
> same name at the same time, only one of them will have its
> compareAndSet() succeed. The other one will have to retry from step 0
> and find that the UID has been assigned, so the one it attempted to
> assign will just be wasted.
>
> We can always have the "tsdb uid fsck" command find the unused and
> orphaned UIDs and put them in some kind of a free list for later use,
> so that no UID is "wasted".
>
> Does that sound reasonable? It would be a good excuse to rewrite the
> UniqueId code to be fully asynchronous / non-blocking.

Looks reasonable to me. I can implement this the next time I have some spare time,
unless someone implements it earlier, for example in tsdb 2.0.
 

> On Thu, Feb 7, 2013 at 5:24 PM, ManOLamancha <clars...@gmail.com> wrote:
>> I like this feature too but if 2.0 includes a backend abstraction layer,
>> you'll want to port it for that. These ZK locks certainly helped our prod
>> instance when we first turned it up, though even with a heap of 4GB we were
>> killing our ZK instances. Thanks!
>
> Uhh, how on earth were you able to drive ZK out of memory by assigning
> UIDs with TSD? That sounds just wrong.

+1 to that question: how was that even possible?
 

> --
> Benoit "tsuna" Sigoure



--
Andrey.

tsuna

May 30, 2013, 5:56:51 PM
to Andrey Stepachev, ManOLamancha, open...@googlegroups.com
On Thu, Feb 14, 2013 at 1:00 AM, tsuna <tsun...@gmail.com> wrote:
> I believe it's possible to rewrite the code using atomicIncrement()
> and compareAndSet() to have a lock-free, ZooKeeper-free version of the
> code.
>
> The algorithm would be roughly as follows:
> 0. Check whether the name already has a UID.
> 1. Do an atomicIncrement() on the MAX_ID to get a new UID.
> 2. compareAndSet() the forward mapping (UID => name).
> 3. compareAndSet() the reverse mapping (name => UID).
>
> If we die between steps 1 and 2: we just waste a UID.
> If we die between steps 2 and 3: we waste a UID and leave an orphan UID.
> Step 1 is atomic, so no race condition is possible here.
> Step 2 technically doesn't need to be a compareAndSet() because the ID
> we're trying to assign is guaranteed unique by the atomicIncrement();
> however it helps us be absolutely sure we don't unintentionally
> overwrite an existing cell (which would only happen in really
> weird / data-corruption cases).
> Step 3 is racy, because if two TSDs attempt to assign a UID to the
> same name at the same time, only one of them will have its
> compareAndSet() succeed. The other one will have to retry from step 0
> and find that the UID has been assigned, so the one it attempted to
> assign will just be wasted.

I ended up implementing the approach described above. The benchmark and
code are linked from https://gist.github.com/tsuna/5681407
Bottom line: with this approach the number of UIDs we can allocate
concurrently is dramatically higher, and it also scales quasi-linearly.

I believe this should address the issue various people ran into with
the UID allocation dance.

--
Benoit "tsuna" Sigoure