How to implement deduplication in elliptics?

Sergey S.

Feb 7, 2014, 4:06:39 AM

to rever...@googlegroups.com

Hi.
I want to try Elliptics in a project.
It will hold over 50 million files, about 250,000 bytes each.
I want to try rift (http://doc.reverbrain.com/rift:rift) and have some questions:

1. How can I implement deduplication (across the whole storage, not just within one bucket)?
Or should I implement it in my application (calculate a hash of the content, keep a file list, calculate links for each file, etc.)?

2. As I understand it, I choose the groups for replication in the bucket configuration (--data-groups).
How can other elliptics software (for example the dnet_recovery script) find out which groups hold which replicas, for example to start the replica recovery process?

Evgeniy Polyakov

Feb 7, 2014, 3:39:42 PM

to Sergey S., rever...@googlegroups.com

Hi

07.02.2014, 13:06, "Sergey S." <ses...@gmail.com>:

> I want to try Elliptics in a project.
> It will hold over 50 million files, about 250,000 bytes each.
> I want to try rift (http://doc.reverbrain.com/rift:rift) and have some questions:
>
> 1. How can I implement deduplication (across the whole storage, not just within one bucket)?
> Or should I implement it in my application (calculate a hash of the content, keep a file list, calculate links for each file, etc.)?

A bucket is by its nature a wall between different clients.
It is possible (and desirable for storage that grows without bound) to store different buckets in different groups.

Thus objects stored in different buckets are deliberately disconnected from each other.

You can use the index-by-id method, where you provide a numeric ID instead of an object name.
Authentication has to be turned off to allow this, which bypasses both security and the scalability that buckets provide.

> 2. As I understand it, I choose the groups for replication in the bucket configuration (--data-groups).
> How can other elliptics software (for example the dnet_recovery script) find out which groups hold which replicas, for example to start the replica recovery process?

You can read this info via rift_meta_ctl and pass it to the recovery scripts.

Anton Putau

Oct 7, 2016, 8:45:36 AM

to reverbrain, ses...@gmail.com, z...@ioremap.net

Hi,

Sorry, I still don't fully understand.

Is deduplication possible?

Can I achieve it using backrunner?

Regards, Anton.

Evgeniy Polyakov

Oct 7, 2016, 12:37:38 PM

to Anton Putau, reverbrain, ses...@gmail.com

Hi Anton

07.10.2016, 15:45, "Anton Putau" <xuse...@gmail.com>:

> Sorry, I still don't fully understand.
>
> Is deduplication possible?
> Can I achieve it using backrunner?

To implement deduplication, keys have to be assigned according to data content, so that the same content always gets the same key.
Backrunner doesn't do this itself; it uses the keys provided by the client.

If the client uses string keys that are derived from the content it uploads, deduplication will work transparently.
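
A minimal sketch of that idea in Python, assuming the key is the SHA-256 hex digest of the object's bytes. The names content_key, InMemoryStore and upload are hypothetical and only stand in for whatever client actually writes to the storage; no Elliptics or backrunner API is used here.

```python
import hashlib
from typing import Dict


def content_key(data: bytes) -> str:
    """Return a string key that depends only on the bytes of the object."""
    return hashlib.sha256(data).hexdigest()


class InMemoryStore:
    """Hypothetical stand-in for the storage: maps string keys to blobs."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def upload(self, key: str, data: bytes) -> None:
        # Writing under an existing key overwrites it, so identical
        # content never produces a second stored object.
        self.blobs[key] = data


store = InMemoryStore()
names: Dict[str, str] = {}  # file name -> content key, kept by the application

files = [("a.jpg", b"same bytes"), ("b.jpg", b"same bytes"), ("c.jpg", b"other")]
for name, data in files:
    key = content_key(data)
    store.upload(key, data)
    names[name] = key

print(len(names), "names,", len(store.blobs), "stored objects")
# prints: 3 names, 2 stored objects
```

Note that the application still has to keep the name-to-key mapping itself (the "file list" from the original question), since the storage only sees the content-derived keys.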
