
deduplicating file systems: VDO with Debian?


hw

Nov 6, 2022, 9:10:06 PM
Hi,

I discovered that Red Hat has VDO[1] to take care of deduplicating file systems.
Aptitude didn't find any packages for it.

Is there no VDO in Debian, and what would be good to use for deduplication with
Debian? Why isn't VDO in the standard kernel? Or is it?

I'm not looking for deduplication that happens some time after files have
already been written, as btrfs would allow: there is no point in deduplicating
backups after they're done, because I don't need to save disk space for them
when I can fit them in the first place.


[1]:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/deduplicating_and_compressing_storage/deploying-vdo_deduplicating-and-compressing-storage#doc-wrapper

Anders Andersson

Nov 7, 2022, 3:20:06 AM
You could always buy a Red Hat Enterprise Linux license, sign up for a support contract, and ask if they could start supporting other operating systems? ("Each branch on this project is intended to work with a specific release of Enterprise Linux").

I would be more worried if my backup storage didn't have enough room for at least one full, fresh, unique backup from one client.
 - If it doesn't, and something unexpected happens (a user fills the whole disk with something, malware encrypts all data = changes everything to unique files, etc.), then it will fill up the disk and ruin every other backup.
 - If you *do* have room for one client but not many more, you can always deduplicate after each client backup, which should reclaim everything if nothing changed.

hw

Nov 7, 2022, 4:40:05 AM
On Mon, 2022-11-07 at 09:14 +0100, Anders Andersson wrote:
> On Mon, Nov 7, 2022 at 3:04 AM hw <h...@adminart.net> wrote:
>
> > Hi,
> >
> > I discovered that Redhat has VDO[1] to take care of deduplicating file
> > systems.
> > Aptitude didn't find any packages towards that.
> >
> > Is there no VDO in Debian, and what would be good to use for deduplication
> > with
> > Debian?  Why isn't VDO in the standard kernel? Or is it?
> >
> > I'm not looking for deduplication that happens some time after files have
> > already been written like btrfs would allow: There is no point in
> > deduplicating
> > backups after they're done because I don't need to save disk space for
> > them when
> > I can fit them in the first place.
> >
> >
> > [1]:
> >
> > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/deduplicating_and_compressing_storage/deploying-vdo_deduplicating-and-compressing-storage#doc-wrapper
> >
> >
> You could always buy Red Hat Enterprise Linux license, sign up for a
> support contract, and ask if they could start supporting other operating
> systems? ("Each branch on this project is intended to work with a specific
> release of Enterprise Linux").

Huh? What would that accomplish?

> I would be more worried if my backup storage didn't have enough room for at
> least a full fresh and unique backup from one client.
>  - If it doesn't and something unexpected happens (user fills the whole
> disk with something, malware encrypts all data = changes everything to
> unique files, etc) then it will fill up the disk and ruin every other
> backup.
>  - If you *do* have room for one client but not many more, you can always
> deduplicate after each client backup which should regain everything if
> nothing changed.

None of this applies in this case. Are you saying that deduplication is not
possible with Debian?

didier gaumet

Nov 7, 2022, 5:40:05 AM
On 07/11/2022 at 10:30, hw wrote:

Hello,

Disclaimer: I am really almost ignorant about deduplication

> On Mon, 2022-11-07 at 09:14 +0100, Anders Andersson wrote:
>> On Mon, Nov 7, 2022 at 3:04 AM hw <h...@adminart.net> wrote:
[...]
>> You could always buy Red Hat Enterprise Linux license, sign up for a
>> support contract, and ask if they could start supporting other operating
>> systems? ("Each branch on this project is intended to work with a specific
>> release of Enterprise Linux").
>
> Huh? What would that accomplish?

I think what Anders is trying to point out is that each VDO release is
specific to Red Hat, and further, specific to a particular Red Hat
release. By definition that would complicate potential VDO integration
into Debian.

[...]
> Are you saying that deduplication is not
> possible with Debian?

I may be mistaken, but I think there is a confusion here between
deduplication at the filesystem level and at the backup-tool level.

At the (Linux) filesystem level, I think in-line deduplication is only
provided by ZFS (and perhaps, out-of-tree, btrfs).

I do not know your use case precisely, but if it is to prevent
duplication during backup, just use a deduplicating backup tool; it does
just that: it avoids duplicating backup objects before the duplication
can occur. Searching for deduplicating software packaged in Debian
('apt search dedup' in a terminal) and sorting out the backup tools
would give you clues.

Curt

Nov 7, 2022, 8:10:05 AM
On 2022-11-07, hw <h...@adminart.net> wrote:
>
> None of this applies in this case. Are you saying that deduplication is not
> possible with Debian?
>
>

curty@einstein:~$ apt-cache search deduplication
backuppc - high-performance, enterprise-grade system for backing up PCs
borgbackup - deduplicating and compressing backup program
borgbackup-doc - deduplicating and compressing backup program (documentation)
jdupes - identify and delete or link duplicate files
libsxclient-dev - Scalable public and private cloud storage
libsxclient3 - Scalable public and private cloud storage
sx - Scalable public and private cloud storage
zbackup - Versatile deduplicating backup tool


I had to look up the word deduplicate (I was going to say, "That isn't even a
word!"), which reveals my extensive knowledge of the matter.

Dan Ritter

Nov 7, 2022, 9:50:05 AM
didier gaumet wrote:
>
> I may be mistaken, but I think there is a confusion here about a
> deduplication at filesystem level and at backup tool level.
>
> At (linux) filesystem level, I think in-line deduplication is only provided
> by ZFS (and perhaps, out-of-tree, BTRFS)

ZFS deduplication is a special beast that usually does not make
people happy. It is an enterprise feature that really only works
for special cases, and requires a lot of RAM - 1GB per 1TB of
storage - to work. Worst of all, it cannot be gracefully turned
off.

As you say, deduplication in backup systems is quite common, and works
pretty well. There's also an on-disk non-filesystem utility, rdfind,
which is packaged in Debian. It can discover identical files and make
them hardlinks.
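A minimal rdfind session might look like this (a sketch, assuming rdfind is installed; the demo directory and its files are created on the fly rather than taken from a real backup tree):

```shell
# Requires the 'rdfind' package; exit quietly if it is not there.
command -v rdfind >/dev/null || { echo "rdfind not installed"; exit 0; }

# Two files with identical content but separate inodes.
demo=$(mktemp -d)
printf 'same bytes\n' > "$demo/a.txt"
printf 'same bytes\n' > "$demo/b.txt"
cd "$demo"   # rdfind writes its results file into the current directory

rdfind -dryrun true -makehardlinks true "$demo"   # report only, changes nothing
rdfind -makehardlinks true "$demo"                # actually hardlink duplicates

# Both names now point at the same inode:
stat -c '%i' "$demo/a.txt" "$demo/b.txt"
```

The dry-run first is the sensible order on real data: rdfind's report shows what it would collapse before any hardlinking happens.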

-dsr-

hede

Nov 7, 2022, 10:30:05 AM
On 07.11.2022 02:57, hw wrote:
> Hi,
>
> Is there no VDO in Debian, and what would be good to use for
> deduplication with
> Debian? Why isn't VDO in the standard kernel? Or is it?

I used VDO on Debian some time ago and don't remember big
problems. AFAIR I compiled it myself - there were no prebuilt packages.

I switched to btrfs for other reasons, not for performance. The VDO
layer eats performance, yes, but compared to naked ext4 even btrfs is
slow.

> I'm not looking for deduplication that happens some time after files
> have
> already been written like btrfs would allow: There is no point in
> deduplicating
> backups after they're done because I don't need to save disk space for
> them when
> I can fit them in the first place.

That's only one point, and not really a valid one, I think, as
you typically do not run into space problems with a single action
(YMMV). Running multiple sessions with out-of-band deduplication between
them works for me.

In-band deduplication (the kind you want) has some drawbacks, too:
high resource usage. You need plenty of RAM (up to several gigabytes
per terabyte of storage), and write success is delayed (-> slow direct I/O).

For out-of-band deduplication there are multiple different
implementations. File-based dedup on a directory basis can be very fast
and economical with resources, for example via rdfind or jdupes.
Block-based dedup, like bees for btrfs (the one I use), is closer to
in-band deduplication (including the high RAM usage). Bees can be
switched off and on at any time (for example on a small home system
which runs more demanding tasks from time to time), and switching it on
again resumes at the last state (it starts at the last transaction id
which was processed -> btrfs knows its transactions).
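For the file-based case, a jdupes run can be sketched like this (assuming jdupes is installed and that I recall its -L/--linkhard option correctly; the demo directory is created on the fly):

```shell
# Requires the 'jdupes' package; exit quietly if it is not there.
command -v jdupes >/dev/null || { echo "jdupes not installed"; exit 0; }

demo=$(mktemp -d)
printf 'same bytes\n' > "$demo/a.txt"
printf 'same bytes\n' > "$demo/b.txt"

jdupes -r "$demo"       # -r: recurse; prints groups of identical files
jdupes -r -L "$demo"    # -L: replace duplicates with hardlinks

stat -c '%i' "$demo/a.txt" "$demo/b.txt"   # same inode number twice afterwards
```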

regards
hede

hede

Nov 7, 2022, 11:10:05 AM
On 07.11.2022 16:29, hede wrote:
> On 07.11.2022 02:57, hw wrote:
>> Hi,
>>
>> Is there no VDO in Debian, and what would be good to use for
>> deduplication with
>> Debian? Why isn't VDO in the standard kernel? Or is it?
>
> I have used vdo in Debian some time ago and didn't remember big
> problems.

Btw., please keep in mind: VDO is transparent to the filesystem on top,
and deduplication (likewise compression) is a non-deterministic task.

Where btrfs' calculation of the real free space is tricky when compression
and/or dedup is in use, it's quite impossible for a filesystem on top of
VDO. It's much worse with VDO: the filesystem on top sees a "virtual"
size of the device which is a vague guess at best and is fixed at
creation time. You need to carefully monitor the actual disk usage of
the VDO device and stop writing data to the filesystem if it fills up.
It stalls if the filesystem wants to write more data than is available.
(At least if I remember correctly. Please correct me if I'm wrong here.)
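A monitoring sketch for that, assuming the RHEL-style VDO userspace tools are installed and a made-up volume name 'vdo0' (the column layout of vdostats is an assumption on my side, so check it on your system first):

```shell
# Needs the VDO userspace tools (RHEL-like systems); exit quietly otherwise.
command -v vdostats >/dev/null || { echo "VDO tools not installed"; exit 0; }

# Physical vs. logical usage of the (made-up) volume 'vdo0'.
vdostats --human-readable vdo0 || exit 0

# Crude cron-able guard: warn once physical usage exceeds 80%.
# Assumption: column 5 of the data row is the Use% figure.
used=$(vdostats vdo0 | awk 'NR==2 {gsub(/%/,"",$5); print $5}')
if [ "$used" -gt 80 ]; then
    echo "vdo0 is ${used}% full -- stop writing to it!"
fi
```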

So if you are expecting issues with space, there's some risk of damaging
your (file)system.

With something like btrfs or ZFS there's less risk of that. Both know
their free space, and even if this was indeed a problem in the early
days*, rebalancing and filled-up filesystems are (AFAIK) no longer a
problem with btrfs.

*) Running out of space on btrfs could render the filesystem read-only,
and deleting files was no longer possible: COW means even deleting a
file needs some space, so that broke. This is AFAIK resolved; for
deleting files there's now always some reserved space.

regards
hede

rhkr...@gmail.com

Nov 7, 2022, 2:00:05 PM
> didier gaumet wrote:
> > I may be mistaken, but I think there is a confusion here about a
> > deduplication at filesystem level and at backup tool level.

I didn't (and don't) know much about deduplication (beyond what you might
deduce from the name), so I googled and found this article, which was helpful
to me:

* Lets know about VDO (virtual data optimizer):
https://www.linkedin.com/pulse/lets-know-vdo-virtual-data-optimizer-ganesh-gaikwad


--
rhk

If you reply: snip, snip, and snip again; leave attributions; avoid HTML;
avoid top posting; and keep it "on list". (Oxford comma included at no
charge.) If you change topics, change the Subject: line.

Writing is often meant for others to read and understand (legal agreements
excepted?) -- make it easier for your reader by various means, including
liberal use of whitespace and minimal use of (obscure?) jargon, abbreviations,
acronyms, and references.

If someone else has already responded to a question, decide whether any
response you add will be helpful or not ...

A picture is worth a thousand words -- divide by 10 for each minute of video
(or audio) or create a transcript and edit it to 10% of the original.

hw

Nov 7, 2022, 10:50:05 PM
On Mon, 2022-11-07 at 11:32 +0100, didier gaumet wrote:
> On 07/11/2022 at 10:30, hw wrote:
>
> Hello,
>
> Disclaimer: I am really almost ignorant about deduplication
>
> > On Mon, 2022-11-07 at 09:14 +0100, Anders Andersson wrote:
> > > On Mon, Nov 7, 2022 at 3:04 AM hw <h...@adminart.net> wrote:
> [...]
> > > You could always buy Red Hat Enterprise Linux license, sign up for a
> > > support contract, and ask if they could start supporting other operating
> > > systems? ("Each branch on this project is intended to work with a specific
> > > release of Enterprise Linux").
> >
> > Huh?  What would that accomplish?
>
> I think that what Anders tries to underline is that each VDO release is
> specific to Redhat, and further, is specific to a particular Redhat
> release. By definition that would complicate potential VDO integration
> in Debian.

At least in theory, it should be in CentOS, but if it's so specific, who knows
if it causes compatibility issues ...

> [...]
> > Are you saying that deduplication is not
> > possible with Debian?
>
> I may be mistaken, but I think there is a confusion here about a
> deduplication at filesystem level and at backup tool level.
>
> At (linux) filesystem level, I think in-line deduplication is only
> provided by ZFS (and perhaps, out-of-tree, BTRFS)

That's what it seems like, except for VDO. Unfortunately, ZFS is said to need
5--6GB of RAM for each 1TB of data, and that would require upgrading my server.

> I do not know precisely your usecase, but if it is to prevent
> duplication during backup, just use a deduplicating backup tool, it just
> do that: avoid duplicating backup objects before it could occur.
> searching for deduplicating software packaged in Debian ('apt search
> dedup' in a terminal) and sorting backup ones would give you clues.

Actually that's a good idea I didn't think of. But thinking about it, is it
a good idea?

When I want to have 2 (or more) generations of backups, do I actually want
deduplication? It leaves me with only one actual copy of the data, which seems
to defeat the idea of having multiple generations of backups, at least to some
extent.

The question is then whether it makes a difference. It also raises the question
whether I need (want) multiple generations of backups, especially when I end up
with only one copy anyway. Hmm ...

hw

Nov 7, 2022, 11:10:06 PM
On Mon, 2022-11-07 at 09:30 -0500, Dan Ritter wrote:
> didier gaumet wrote:
> >
> > I may be mistaken, but I think there is a confusion here about a
> > deduplication at filesystem level and at backup tool level.
> >
> > At (linux) filesystem level, I think in-line deduplication is only provided
> > by ZFS (and perhaps, out-of-tree, BTRFS)
>
> ZFS deduplication is a special beast that usually does not make
> people happy. It is an enterprise feature that really only works
> for special cases, and requires a lot of RAM - 1GB per 1TB of
> storage - to work. Worst of all, it cannot be gracefully turned
> off.

Only 1GB/1TB? The FreeBSD handbook says 5--6GB per 1TB. I could live with 1:1,
and I wouldn't need to turn it off. The idea, in this case, is to make two
generations of backups of the "same" data without having all the disk space
needed for both of them.

> As you say, deduplication in backup systems is quite common, and works
> pretty well. There's also an on-disk non-filesystem utility, rdfind,
> which is packaged in Debian. It can discover identical files and make
> them hardlinks.

Well, if I had all the disk space to hold 2 full copies of the data to be able
to deduplicate it only later, I wouldn't need to deduplicate anything.

And how would pretending there are two backups, while there's actually only one
because it got deduplicated, be better than having only one backup to begin with?
(Yeah, I hadn't thought of that before ...)

Maybe use a snapshot to create the 2nd backup? Or what?

hw

Nov 7, 2022, 11:20:05 PM
On Mon, 2022-11-07 at 13:57 -0500, rhkr...@gmail.com wrote:
>
>
> I didn't (and don't) know much about deduplication (beyond what you might
> deduce from the name), so I google and found this article which was helpful to
> me:
>
>    * Lets know about VDO (virtual data optimizer):
> https://www.linkedin.com/pulse/lets-know-vdo-virtual-data-optimizer-ganesh-gaikwad

That's a good pointer, but I still wonder how VDO actually works. For example,
if I have a volume with 5TB of data on it and I write a 500kB file to that
volume a week later or whenever, and the file I'm writing is identical to
another file somewhere within the 5TB of data already on the volume, how does
VDO figure out that both files are identical? ZFS does it by keeping lots of
data in memory so it can look it up right away, but VDO? Will it write the new
file at first and check it later in the background and re-use the space later,
or will it delay the write to check it first? Or does it do something else?

hw

Nov 7, 2022, 11:40:05 PM
On Mon, 2022-11-07 at 16:29 +0100, hede wrote:
> On 07.11.2022 02:57, hw wrote:
> > Hi,
> >
> > Is there no VDO in Debian, and what would be good to use for
> > deduplication with
> > Debian?  Why isn't VDO in the standard kernel? Or is it?
>
> I have used vdo in Debian some time ago and didn't remember big
> problems. AFAIR I did compile it myself - no prebuild packages.

Cool, I could give that a try, ty.

> I switched to btrfs for other reasons. Not even for performance. The VDO
> Layer eats performance, yes, but compared to naked ext4 even btrfs is
> slow.

Really? I never noticed that btrfs was slow. But then, it's been a long
time since I used ext4 ...

> > There is no point in
> > deduplicating
> > backups after they're done because I don't need to save disk space for
> > them when
> > I can fit them in the first place.
>
> That's only one point.

What are the others?

> And it's not really some valid one, I think, as
> you do typically not run into space problems with one single action
> (YMMV). Running multiple sessions and out-of-band deduplication between
> them works for me.

That still requires you to have enough disk space for at least two full backups.
I can see it working for three backups because you can deduplicate the first
two, but not for two. And why would I deduplicate when I have sufficient disk
space?

> In-band deduplication (that's the one you want) has some drawbacks, too:
> High Ressource usage. You need plenty of RAM (up to several Gigabytes
> per Terabyte Storage) and write success is delayed (-> slow direct i/o).

Well, if it takes 5 days or so to make a backup, that won't be very useful. It
already takes more than long enough because my disks can only sustain so much.

> For Out-of-Band deduplication there are multiple different
> implementations. File based dedup on directory basis can be very fast
> and resource economical, for example via rdfind or jdupes. Block based
> like via bees for btrfs (that's the one I use) is more close to in-band
> deduplication (including high RAM usage). Bees can be switched off and
> on at any time (for example if it's a small home-system which runs more
> demanding tasks from time to time) and switching it on again resumes at
> the last state (it starts at the last transaction id which was processed
> -> btrfs knows its transactions).

Hm. I wouldn't mind running it from time to time, though I don't know that I
would have a lot of duplicate data other than backups. How much space might I
expect to gain from using bees, and how much memory does it require to run?

hw

Nov 7, 2022, 11:50:05 PM
On Mon, 2022-11-07 at 17:01 +0100, hede wrote:
> On 07.11.2022 16:29, hede wrote:
> > On 07.11.2022 02:57, hw wrote:
> > > Hi,
> > >
> > > Is there no VDO in Debian, and what would be good to use for
> > > deduplication with
> > > Debian?  Why isn't VDO in the standard kernel? Or is it?
> >
> > I have used vdo in Debian some time ago and didn't remember big
> > problems.
>
> Btw. please keep in mind: VDO is transparent to the filesystem on-top.
> And deduplication (likewise compression) is some non-deterministic task.
>
> Where btrfs' calculation of the real free space is tricky if compression
> and/or dedup is in use, it's quite impossible for a filesystem ontop of
> VDO. It's much worse with VDO. The filesystem on top sees a "virtual"
> size of the device which is a vague guess at best and is predefined on
> creation time. You need to carefully monitor the actual disk usage of
> the VDO device and stop writing data to the filesystem if it fills up.

Yes, I figured that would be a problem.

> It stalls if the filesystem wants to write more data than is available.
> (At least if I remember correctly. Please correct me if I'm wrong here.)

Like NFS? And then what? There isn't really a way to resolve that problem once
you've run into it, is there?

> So if you are expecting issues with space, there's some risk in damaging
> your (file-)system.

Even damage it? That would really suck.

> With something like btrfs or ZFS there's less risk in that. Both do know
> the free space and even if this was indeed a Problem in first days*,
> rebalancing and filled up filesystems are (AFAIK) no longer a problem
> with btrfs.

Well, I'm finding btrfs somewhat disappointing since it doesn't support
deduplication like ZFS does, and even RAID56 is still broken. It feels like the
available file systems haven't been up to the task for almost a decade and might
never catch up.

David Christensen

Nov 8, 2022, 12:50:06 AM
On 11/7/22 19:49, hw wrote:
> On Mon, 2022-11-07 at 11:32 +0100, didier gaumet wrote:

>> At (linux) filesystem level, I think in-line deduplication is only
>> provided by ZFS (and perhaps, out-of-tree, BTRFS)
>
> That's what it seems like, except VDO. Unfortunately, ZFS is said to need 5--
> 6GB of RAM for each 1TB of data, and that would require upgrading my server.


On my ZFS storage and backup servers, ZFS seems to grab the majority of
available memory. I have been unable to figure out a way to measure
memory consumed by deduplication.


> When I want to have 2 (or more) generations of backups, do I actually want
> deduplication? It leaves me with only one actual copy of the data which seems
> to defeat the idea of having multiple generations of backups at least to some
> extent.
>
> The question is then if it makes a difference. It also creates the question if
> I need (want) multiple generations of backups, especially when I end up with
> only one copy anyway. Hmm ...


I put rsync-based backups on ZFS storage with compression and
de-duplication. du(1) reports 33 GiB for the current backups (i.e. the
uncompressed and/or duplicated size). zfs-auto-snapshot takes snapshots
of the backup filesystems daily and monthly, and I take snapshots
manually every week. I have 78 snapshots going back ~6 months. du(1)
reports ~3.5 TiB for the snapshots. 'zfs list' reports 86.2 GiB of
actual disk usage for all 79 backups. So, ZFS de-duplication and
compression leverage my backup storage by 41:1.


ZFS compression and de-duplication also works well for jails/ VM's.


For general data, I use compression alone.


For compressed and/or encrypted archives, images, etc., I do not use
compression or de-duplication.


The key is to only use de-duplication when there is a lot of duplication.


And, to a lesser extent, to only use compression on uncompressed data
(lz4 detects compressed data and does not try to compress it further).


My ZFS pools are built with HDD's. I recently added an SSD-based vdev
as a dedicated 'dedup' device, and write performance improved
significantly when receiving replication streams.
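For reference, the ratios and the dedup vdev described above can be inspected with standard OpenZFS commands; a sketch, with the pool name 'tank' and the device path made up:

```shell
# Needs OpenZFS and a pool named 'tank'; exit quietly otherwise.
command -v zpool >/dev/null || { echo "OpenZFS not installed"; exit 0; }
zpool list tank >/dev/null 2>&1 || { echo "no pool 'tank'"; exit 0; }

zpool get dedupratio tank      # overall dedup ratio of the pool
zfs get compressratio tank     # compression ratio of a dataset

# Attaching a dedicated SSD-backed dedup vdev (OpenZFS 0.8+ allocation
# classes); left commented out because it changes the pool:
# zpool add tank dedup /dev/disk/by-id/some-ssd
```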


David

hw

Nov 8, 2022, 2:20:05 AM
On Mon, 2022-11-07 at 21:46 -0800, David Christensen wrote:
> On 11/7/22 19:49, hw wrote:
> > On Mon, 2022-11-07 at 11:32 +0100, didier gaumet wrote:
>
> > > At (linux) filesystem level, I think in-line deduplication is only
> > > provided by ZFS (and perhaps, out-of-tree, BTRFS)
> >
> > That's what it seems like, except VDO.  Unfortunately, ZFS is said to need
> > 5--
> > 6GB of RAM for each 1TB of data, and that would require upgrading my server.
>
>
> On my ZFS storage and backup servers, ZFS seems to grab the majority of
> available memory.  I have been unable to figure out a way to measure
> memory consumed by deduplication.

Are you deduplicating? Apparently some people say bad things happen when ZFS
runs out of memory from deduplication.

> > The question is then if it makes a difference.  It also creates the question
> > if
> > I need (want) multiple generations of backups, especially when I end up with
> > only one copy anyway.  Hmm ...
>
>
> I put rsync based backups on ZFS storage with compression and
> de-duplication.  du(1) reports 33 GiB for the current backups (e.g.
> uncompressed and/or duplicated size).  zfs-auto-snapshot takes snapshots
> of the backup filesystems daily and monthly, and I take snapshots
> manually every week.  I have 78 snapshots going back ~6 months.  du(1)
> reports ~3.5 TiB for the snapshots.  'zfs list' reports 86.2 GiB of
> actual disk usage for all 79 backups.  So, ZFS de-duplication and
> compression leverage my backup storage by 41:1.

I'm unclear as to how snapshots come in when it comes to making backups. What
if you have a bunch of snapshots and want to get a file from 6 generations of
backups ago? I never figured out how to get something out of an old snapshot
and found it all confusing, so I don't even use them.

33GB in backups is far from a terabyte. I have a lot more than that.

> ZFS compression and de-duplication also works well for jails/ VM's.
>
>
> For general data, I use compression alone.
>
>
> For compressed and/or encrypted archives, image, etc., I do not use
> compression or de-duplication

Yeah, they wouldn't compress. Why no deduplication?

> The key is to only use de-duplication when there is a lot of duplication.

How do you know if there's much to deduplicate before deduplicating?

> And, to a lesser extend, to only use compression on uncompressed data
> (lz4 detects compressed data and does not try to compress it further).
>
>
> My ZFS pools are built with HDD's.  I recently added an SSD-based vdev
> as a dedicated 'dedup' device, and write performance improved
> significantly when receiving replication streams.

Hm, with the ZFS I set up a couple years ago, the SSDs wore out and removing
them without any replacement didn't decrease performance.

I'm not too fond of ZFS, especially not when considering performance. But for
backups, it won't matter.

didier gaumet

Nov 8, 2022, 4:10:06 AM
There are elements of an answer in the Red Hat documentation:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/vdo-integration
and blog, that exposes performance trade-off:
https://www.redhat.com/en/blog/look-vdo-new-linux-compression-layer

From what I understand, VDO was designed as a layer in kernel space to
provide deduplication and compression features to local or distributed
filesystems that lack them. The goal was primarily to optimize storage
space for a provider of networked virtual machines to entities or customers.

didier gaumet

Nov 8, 2022, 4:30:05 AM
On 08/11/2022 at 04:49, hw wrote:
[...]
> When I want to have 2 (or more) generations of backups, do I actually want
> deduplication? It leaves me with only one actual copy of the data which seems
> to defeat the idea of having multiple generations of backups at least to some
> extent.
[...]

I would think there is also a confusion here (in my opinion, but I may
be wrong):

- Deduplication(1) is the action of preventing or correcting multiple
occurrences of an object. The criterion here is: are the objects identical?

- Incremental/differential backup(2) is the action of backing up only
objects (or deltas of objects) that have changed between backups, thus
forbidding duplicates (on the target storage) of objects that have not
changed.
But that definitely does not suppress duplicates on the source storage
(the one you want to back up), nor prevent backing up those duplicates,
thus producing duplicates on the target storage.


(1) Wikipedia article on deduplication
https://en.wikipedia.org/wiki/Data_deduplication
(2) Wikipedia article on backups, with incremental, differential, and
deduplication explanations:
https://en.wikipedia.org/wiki/Backup#Deduplication
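The hardlink flavour of such incremental backups can be sketched with rsync's --link-dest: unchanged files in the new generation become hardlinks into the previous one, so only changed objects consume new space (a sketch, assuming rsync is installed; the directories here are created on the fly as stand-ins for real data and backup trees):

```shell
# Requires rsync; exit quietly if it is not there.
command -v rsync >/dev/null || { echo "rsync not installed"; exit 0; }

src=$(mktemp -d); dest=$(mktemp -d)   # stand-ins for real data/backup dirs
echo "some data" > "$src/file.txt"

# First generation: a full copy.
rsync -a "$src/" "$dest/gen1/"

# Second generation: files unchanged since gen1 are stored as hardlinks.
rsync -a --link-dest="$dest/gen1" "$src/" "$dest/gen2/"

# Same inode number twice -- the file occupies space only once:
stat -c '%i' "$dest/gen1/file.txt" "$dest/gen2/file.txt"
```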

Thomas Schmitt

Nov 8, 2022, 5:20:05 AM
Hi,

hw wrote:
> I still wonder how VDO actually works.

There is a comparer/decider named UDS which holds an index of the valid
storage blocks, and a device driver named VDO which performs the
deduplication and hides its internals from the user by providing a
block device on top of the real storage device file.
https://www.marksei.com/vdo-linux-deduplication/


> if I have a volume with 5TB of data on it and I write a 500kB file to that
> volume a week later or whenever, and the file I'm writing is identical to
> another file somewhere within the 5TB of data alreading on the volume, how
> does VDO figure out that both files are identical?

I understand that it chops your file into 4 KiB blocks
https://github.com/dm-vdo/kvdo/issues/18
and lets UDS look up the checksum of each such block in the index. If a
match is found, then the new block is not stored as itself but only as a
reference to the found block.


This might yield deduplication more often than if the file were considered
as a whole. But I still have doubts that it would yield much advantage with
my own data.
The main obstacle for partial matches is probably the demand for 4 KiB
alignment. Neither text-oriented files nor compressed files will necessarily
hold their identical file parts at that alignment. Any shift of not exactly
4 KiB would make the similarity invisible to UDS/VDO.
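That 4 KiB chunking can be mimicked with plain coreutils to estimate how much block-level dedup could save on a given file; a rough sketch (VDO's real index and hashing are of course more involved, and the demo file here is generated on the fly):

```shell
tmp=$(mktemp -d)

# Build a demo file of eight 4 KiB blocks: six identical, two random.
dd if=/dev/zero    of="$tmp/f" bs=4096 count=6 2>/dev/null
dd if=/dev/urandom bs=4096 count=2 2>/dev/null >> "$tmp/f"

# Cut the file into 4096-byte chunks, checksum each, count unique chunks;
# duplicate chunks would be stored only once by a block-level deduplicator.
split -b 4096 -a 6 "$tmp/f" "$tmp/chunk."
total=$(ls "$tmp"/chunk.* | wc -l)
unique=$(md5sum "$tmp"/chunk.* | awk '{print $1}' | sort -u | wc -l)
echo "$total blocks, $unique unique: dedup would save $((total - unique)) blocks"

rm -rf "$tmp"
```

Shifting the same content by anything other than a multiple of 4096 bytes changes every chunk checksum, which is exactly the alignment problem described above.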


didier gaumet wrote:
> The goal being primarily to optimize storage space
> for a provider of networked virtual machines to entities or customers

Deduplicating over several nearly identical filesystem images might indeed
bring good size reduction.


hw wrote:
> When I want to have 2 (or more) generations of backups, do I actually want
> deduplication?

Deduplication reduces uncontrolled redundancy, while backups create
controlled redundancy. So the two are not exactly contrary in their goals,
but they surely need to be coordinated.

In the case of VDO I expect that you need to use different deduplicating
devices to get controlled redundancy.
I do something similar with incremental backups at file granularity. My
backup Blu-rays hold 200+ sessions which mostly re-use the file data storage
of previous sessions. If a bad spot damages file content, then it is
damaged in all sessions which refer to it.
To reduce the probability of such a loss, I run several backups per day,
each on a separate BD disc.

From time to time I make verification runs on the backup discs in order
to check for any damage. It is extremely rare to find a bad spot after the
written session was verified directly after being written.
(The verification is based on MD5 checksums, which I deem sufficient,
because my use case avoids the birthday paradox of probability theory.
UDS/VDO looks like a giant birthday party, so I assume that it uses larger
checksums or verifies content identity when checksums match.)


Have a nice day :)

Thomas

Dan Ritter

Nov 8, 2022, 7:40:05 AM
hw wrote:
> > As you say, deduplication in backup systems is quite common, and works
> > pretty well. There's also an on-disk non-filesystem utility, rdfind,
> > which is packaged in Debian. It can discover identical files and make
> > them hardlinks.
>
> Well, if I had all the disk space to hold 2 full copies of the data to be able
> to deduplicate it only later, I wouldn't need to deduplicate anything.

Only two copies? That's not a good use case for any of the
deduplicators.

The point of rdfind is to use it in a cron job while some process is
generating duplicate files. For example, a backup process that copies a
filesystem every six hours will generate four identical copies of almost
every file each day. (rsnapshot would do a better job, here.)


> And how would pretending there are two backups while there's actually only one
> because it got deduplicated be better than having only one backup to begin with?
> (Yeah I haven't thought of that before ...)

It's not two backups, it's two very similar backups taken at
different times, so the majority of the files are the same but
some are different. If you want a second backup, it needs to go
on a different machine, preferably in a different location.

Maybe you should tell us what your actual use case is rather
than asking about realtime deduplication? It could be that
there's a completely different solution which would make you
happy.

-dsr-

hede

Nov 8, 2022, 9:10:06 AM
to
On 08.11.2022 05:31, hw wrote:
> That still requires you to have enough disk space for at least two full
> backups.

Correct: if you always do full backups, then the second run will consume
a full backup's worth of space in the first place. (Not fully correct with bees
running -> *)

That would be the first thing I'd address. Even the simplest backup
solutions (e.g. based on rsync) make use of destination rotation and
only transmit changes to the backup (-> incremental or differential
backups). I never considered successive full backups a backup
"solution".

For me only the first backup is a full backup, every other backup is
incremental.

Regarding deduplication, I see benefits when the user moves files
from one directory to some other directory, with partly changed files
(my backup solution dedupes on a file basis via hardlinks only), and
with system backups of several different machines.

I prefer file-based backups, so my backup solution's deduplication skills
are really limited. But a good block-based backup solution can handle
all these cases by itself. Then no filesystem-based deduplication is
needed.

If your problem is only backup related and you are flexible regarding
your backup solution, then choosing a backup solution with a good
deduplication feature is probably your best choice. The solution
doesn't have to be complex. Even simple backup solutions like borg
are fine here (borg: chunk-based deduplication, even of parts of files,
across several backups of several different machines). Even your
criterion of not writing duplicate data in the first place is fulfilled
here.

(see borgbackup in Debian repository; disclaimer: I do not have personal
experience with borg as I'm using other solutions)
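The chunk-based scheme described for borg can be illustrated with a toy content-addressed store (fixed-size chunks and an in-memory dict are assumptions for brevity; borg itself uses content-defined chunk boundaries plus compression and encryption):

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store in the spirit of borg: data is split
    into fixed-size chunks, each unique chunk is stored once under its
    hash, and a backup is just an ordered list of chunk hashes."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}          # hash -> chunk bytes (stored once)
        self.bytes_deduped = 0    # duplicate bytes never written again

    def backup(self, data):
        manifest = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest in self.chunks:
                self.bytes_deduped += len(chunk)  # seen before: store nothing
            else:
                self.chunks[digest] = chunk
            manifest.append(digest)
        return manifest

    def restore(self, manifest):
        return b"".join(self.chunks[d] for d in manifest)
```

A second backup of unchanged data costs only its manifest: every chunk is already in the store, which is exactly the "do not write duplicate data in the first place" property.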

> I wouldn't mind running it from time to time, though I don't know that
> I
> would have a lot of duplicate data other than backups. How much space
> might I
> expect to gain from using bees, and how much memory does it require to
> run?

Bees should run as a service 24/7 and catch all written data right
after it gets written. That's comparable to in-band deduplication even
if it's out-of-band by definition. (*) This way, writing many duplicate
files will potentially result in removing duplicates even before all
the data has been written to disk.

Therefore memory consumption is also like that of in-band deduplication
(ZFS...), which means you should reserve more than 1 GB of RAM per 1 TB
of data. But it's flexible: even less memory is usable, though then it
cannot find all duplicates, as the hash table of all the data doesn't fit
into memory. (Nevertheless, even then deduplication is more efficient than
expected: if it finds some duplicate block, it looks at the blocks
around it, so for big files a single match in the hash table is
sufficient to deduplicate the whole file.)
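The per-terabyte rule of thumb is just arithmetic on the size of the block index; the bytes-per-entry constants below are rough, commonly quoted figures, not exact values for any implementation:

```python
def dedup_index_ram(data_bytes, block_size, entry_bytes):
    """RAM needed to hold one index entry per block of unique data."""
    return (data_bytes // block_size) * entry_bytes

TIB = 1 << 40

# 1 TiB of unique 4 KiB blocks at ~320 B/entry (a figure often quoted
# for ZFS DDT entries): about 80 GiB of index.
zfs_like = dedup_index_ram(TIB, 4096, 320)

# The same data at 128 KiB records needs 32x less, about 2.5 GiB.
zfs_128k = dedup_index_ram(TIB, 128 * 1024, 320)

# A compact 16 B/entry table over 4 KiB blocks: about 4 GiB per TiB.
compact = dedup_index_ram(TIB, 4096, 16)
```

This is why small dedup blocks are so memory-hungry: halving the block size doubles the number of index entries for the same amount of data.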

regards
hede

Curt

Nov 8, 2022, 9:40:06 AM
to
On 2022-11-08, DdB <debia...@potentially-spam.de-bruyn.de> wrote:
>>
> Your wording likely confuses 2 different concepts:
>
> Deduplication avoids storing identical data more than once.
> whereas
> Redundancy stores information on more than one place on purpose to avoid
> loss of data in case of havoc.

So they're antithetical concepts? Redundancy sounds a lot like a
backup.

There always seems to be havoc, BTW, sooner or later.

Nicolas George

Nov 8, 2022, 9:40:06 AM
to
Curt (12022-11-08):
> Redundancy sounds a lot like a back up.

RAID also sounds a lot like a backup, and the R means redundant.

Yet RAID is not a backup.

--
Nicolas George

The Wanderer

Nov 8, 2022, 10:11:33 AM
to
On 2022-11-08 at 09:36, Nicolas George wrote:

> Curt (12022-11-08):
>
>> Redundancy sounds a lot like a back up.
>
> RAID also sounds a lot like a backup, and the R means redundant.
>
> Yet raid is not a backup.

That depends on which sense of the word "backup" you are using.

No, it's not a "backup" in the technical "back it up to tape" sense of
the word. There are many types of data-loss scenarios in which it will
not protect you at all.

But it does mean that if one drive fails, you can still fall back to the
copy on the other drive, and thus that copy is serving as a backup to
the copy on the first drive. There are some data-loss scenarios in which
RAID will protect you.

That more general sense of "backup" as in "something that you can fall
back on" is no less legitimate than the technical sense given above, and
it always rubs me the wrong way to see the unconditional "RAID is not a
backup" trotted out blindly as if that technical sense were the only one
that could possibly be considered applicable, and without any
acknowledgment of the limited sense of "backup" which is being used in
that statement.

--
The Wanderer

The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man. -- George Bernard Shaw


Dan Ritter

Nov 8, 2022, 10:40:05 AM
to
Curt wrote:
> On 2022-11-08, DdB <debia...@potentially-spam.de-bruyn.de> wrote:
> >>
> > Your wording likely confuses 2 different concepts:
> >
> > Deduplication avoids storing identical data more than once.
> > whereas
> > Redundancy stores information on more than one place on purpose to avoid
> > loss of data in case of havoc.
>
> So they're antithetical concepts? Redundancy sounds a lot like a back
> up.


Think of it this way:

You have some data that you want to protect against the machine
dying.

So you copy it to another machine. Now you have a backup.

You need to do this repeatedly, or else your backup is stale:
lacking information that was recently changed.

If you do it repeatedly to the same target, that's a lot of
information. Maybe you can only send the changes? rsync, ZFS
send, and some other methods make that pretty easy.

But what if you accidentally deleted a file a week ago, and the
backups are done every night? You're out of luck... unless you
have somehow got a record of all the changes that you saved, or
you have a second backup that happened before the deletion.

Snapshots (rsnapshot, ZFS snapshots, others...) make it easy to
go back in time to any snapshot and retrieve the state of the
data then, while not storing full copies of all the data all the
time.

Now, let's suppose that you want your live data -- the source --
to withstand a disk dying. If all the data is on one disk,
that's not going to happen. You can stripe the data on N disks,
but since there's only one copy of any given chunk of data, that
doesn't help with resiliency to a disk failure.

Instead, you can make multiple complete copies every time you do
a write: disk mirroring, or RAID 1. This is very fast, but eats
twice the disk space.

If you can accept slower performance, you can write the data in
chunks to N disks, and write parity calculated from that data
to M disks, such that any 1 disk of the N+M can fail and you can
still reconstruct the whole data. That's RAID 5.

A slightly more complicated calculation withstands the failure of any 2
disks of the N+M: RAID 6. ZFS even has a three-disk resiliency mode (raidz3).
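The parity idea above boils down to XOR: the parity block is the XOR of the data blocks, so any one missing block is the XOR of all the survivors. A miniature sketch (real arrays rotate parity across disks, and RAID 6 adds a second, differently computed syndrome):

```python
def xor_blocks(*blocks):
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# RAID 5 in miniature: data striped over three "disks", parity on a fourth.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Disk 1 dies; its contents are the XOR of everything that survived.
rebuilt_d1 = xor_blocks(d0, d2, parity)
```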

Depending on your risk tolerance and performance needs, you
might use RAID 10 (striping and mirroring) on your main machine,
and backup to a more efficient but slower RAID 6 on your backup
target.

What we've left out is compression and deduplication.

On modern CPUs, compression is really fast. So fast that it
usually makes sense for the filesystem to try compressing all
the data it is about to write, and store the compressed data
with a flag that says it will need to be uncompressed when read.
This not only increases your available storage capacity, it can
make some reads and writes faster, because less has to be
transferred to/from the relatively slow disk. There is more of an
impact on rotating disks than on SSDs.
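That try-compress-and-flag behaviour can be sketched like this (zlib stands in here for the lz4/zstd codecs real filesystems use):

```python
import zlib

# Filesystems like ZFS try compressing each record and keep the
# compressed form only if it actually got smaller.
record = b"GET /index.html HTTP/1.1\r\n" * 150   # text-like, repetitive

compressed = zlib.compress(record)
if len(compressed) < len(record):
    stored, was_compressed = compressed, True    # store + "compressed" flag
else:
    stored, was_compressed = record, False       # incompressible: as-is

ratio = len(record) / len(stored)
# On read, the flag tells the filesystem whether to decompress.
restored = zlib.decompress(stored) if was_compressed else stored
```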

Deduplication tries to match data that has already been written
and store a pointer to the existing data instead. This is an
easy problem as long as you have two things: a fast way to match
the data perfectly, and a very fast way to look up everything
that has previously been written.

It turns out that both of those subproblems scale badly. The
main use case is for storing multiple virtual machine instances,
or something similar, where you can expect every one of them to
have a large percentage of identical files stemming from the
operating system installation.

-dsr-

Stefan Monnier

Nov 8, 2022, 5:50:05 PM
to
> I had to look up the word deduplicate (I was going to say, "That isn't even a
> word!"), which reveals my extensive knowledge of the matter.

It was originally called to "duplicate duplicate", but then
self-application came in and the rest is history.


Stefan

David Christensen

Nov 8, 2022, 8:40:05 PM
to
On 11/7/22 23:13, hw wrote:
> On Mon, 2022-11-07 at 21:46 -0800, David Christensen wrote:

> Are you deduplicating?


Yes.


> Apparently some people say bad things happen when ZFS
> runs out of memory from deduplication.


Okay.


16 GiB seems to be enough for my SOHO server.


>> I put rsync based backups on ZFS storage with compression and
>> de-duplication.  du(1) reports 33 GiB for the current backups (e.g.
>> uncompressed and/or duplicated size).  zfs-auto-snapshot takes snapshots
>> of the backup filesystems daily and monthly, and I take snapshots
>> manually every week.  I have 78 snapshots going back ~6 months.  du(1)
>> reports ~3.5 TiB for the snapshots.  'zfs list' reports 86.2 GiB of
>> actual disk usage for all 79 backups.  So, ZFS de-duplication and
>> compression leverage my backup storage by 41:1.
>
> I'm unclear as to how snapshots come in when it comes to making backups.


I run my backup script each night. It uses rsync to copy files and
directories from various LAN machines into ZFS filesystems named after
each host -- e.g. pool/backup/hostname (ZFS namespace) and
/var/local/backup/hostname (Unix filesystem namespace). I have a
cron(8) job that runs zfs-auto-snapshot once each day and once each month
to take a recursive snapshot of the pool/backup filesystems. Their
contents are then available via Unix namespace at
/var/local/backup/hostname/.zfs/snapshot/snapshotname. If I want to
restore a file from, say, two months ago, I use Unix filesystem tools to
get it.



> What
> if you have a bunch of snapshots and want to get a file from 6 generations of
> backups ago?


Use Unix filesystem tools to copy it out of the snapshot tree. For
example, a file from two months ago:

cp
/var/local/backup/hostname/.zfs/snapshot/zfs-auto-snap_m-2022-09-01-03h21/path/to/file
~/restored-file


> I never figured out how to get something out of an old snapshot
> and found it all confusing, so I don't even use them.


Snapshots are a killer feature. You want to figure them out. I found
the Lucas books to be very helpful:

https://mwl.io/nonfiction/os#fmzfs

https://mwl.io/nonfiction/os#fmaz


> 33GB in backups is far from a terabyte. I have a lot more than that.


I have 3.5 TiB of backups.


>> For compressed and/or encrypted archives, image, etc., I do not use
>> compression or de-duplication
>
> Yeah, they wouldn't compress. Why no deduplication?


Because I very much doubt that there will be duplicate blocks in such files.
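That doubt is easy to confirm with a small experiment: data with heavy internal duplication dedupes well in raw form, but the compressed stream of the same data contains essentially no repeated blocks. Here a 64 KiB pseudo-random buffer is repeated 8 times, too far apart for deflate's 32 KiB window to collapse:

```python
import hashlib
import random
import zlib

def unique_4k_blocks(data):
    """(unique, total) count of 4 KiB blocks, compared by SHA-256."""
    hashes = [hashlib.sha256(data[i:i + 4096]).digest()
              for i in range(0, len(data), 4096)]
    return len(set(hashes)), len(hashes)

base = random.Random(42).randbytes(64 * 1024)   # incompressible 64 KiB
plain = base * 8                                # 512 KiB, 8x duplicated

# Raw data: 128 blocks, only 16 of them unique -- great for dedup.
uniq_raw, total_raw = unique_4k_blocks(plain)

# The compressed stream: the repeats are gone or misaligned, so
# essentially every 4 KiB block is unique -- nothing to dedup.
uniq_zip, total_zip = unique_4k_blocks(zlib.compress(plain))
```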


>> The key is to only use de-duplication when there is a lot of duplication.
>
> How do you know if there's much to deduplicate before deduplicating?


Think about the files and how often they change ("churn"). If I'm
rsync'ing the root filesystem of a half dozen FreeBSD and Linux machines
to a backup directory once a day, most of the churn will be in /home,
/tmp, and /var. When I update the OS and/or packages, install software,
etc., there will be more churn that day.


If you want hard numbers, fdupes(1), jdupes(1), or other tools should
be able to tell you.



>> My ZFS pools are built with HDD's.  I recently added an SSD-based vdev
>> as a dedicated 'dedup' device, and write performance improved
>> significantly when receiving replication streams.
>
> Hm, with the ZFS I set up a couple years ago, the SSDs wore out and removing
> them without any replacement didn't decrease performance.


My LAN has Gigabit Ethernet. I have operated with a degraded ZFS pool
in my SOHO server, and did not notice a performance drop on my client.
If I had run benchmarks on the server before and after losing a
redundant device, I expect the performance drop would be obvious. But,
losing redundant device means increased risk of losing all of the data
in the pool.


> I'm not too fond of ZFS, especially not when considering performance. But for
> backups, it won't matter.


Learn more about ZFS and invest in hardware to get performance.


David

to...@tuxteam.de

Nov 9, 2022, 1:00:06 AM
to
And then there are those (human) languages which mark plural by
duplicating a noun.

Cheers
--
t

hw

Nov 9, 2022, 3:30:05 AM
to
On Tue, 2022-11-08 at 17:30 -0800, David Christensen wrote:
> On 11/7/22 23:13, hw wrote:
> > On Mon, 2022-11-07 at 21:46 -0800, David Christensen wrote:
>
> > Are you deduplicating? 
>
>
> Yes.
>
>
> > Apparently some people say bad things happen when ZFS
> > runs out of memory from deduplication.
>
>
> Okay.
>
>
> 16 GiB seems to be enough for my SOHO server.

Hmm, when you can backup like 3.5TB with that, maybe I should put FreeBSD on my
server and give ZFS a try. Worst thing that can happen is that it crashes and
I'd have made an experiment that wasn't successful. Best thing, I guess, could
be that it works and backups are way faster because the server doesn't have to
actually write so much data because it gets deduplicated and reading from the
clients is faster than writing to the server.

> > > I put rsync based backups on ZFS storage with compression and
> > > de-duplication.  du(1) reports 33 GiB for the current backups (e.g.
> > > uncompressed and/or duplicated size).  zfs-auto-snapshot takes snapshots
> > > of the backup filesystems daily and monthly, and I take snapshots
> > > manually every week.  I have 78 snapshots going back ~6 months.  du(1)
> > > reports ~3.5 TiB for the snapshots.  'zfs list' reports 86.2 GiB of
> > > actual disk usage for all 79 backups.  So, ZFS de-duplication and
> > > compression leverage my backup storage by 41:1.
> >
> > I'm unclear as to how snapshots come in when it comes to making backups.
>
>
> I run my backup script each night.  It uses rsync to copy files and

Aww, I can't really do that because my server eats like 200-300W because it has
so many disks in it. Electricity is outrageously expensive here.

> directories from various LAN machines into ZFS filesystems named after
> each host -- e.g. pool/backup/hostname (ZFS namespace) and
> /var/local/backup/hostname (Unix filesystem namespace).  I have a
> cron(8) that runs zfs-auto-snapshot once each day and once each month
> that takes a recursive snapshot of the pool/backup filesystems.  Their
> contents are then available via Unix namespace at
> /var/local/backup/hostname/.zfs/snapshot/snapshotname.  If I want to
> restore a file from, say, two months ago, I use Unix filesystem tools to
> get it.

Sounds like a nice setup. Does that mean you use snapshots to keep multiple
generations of backups and make backups by overwriting everything after you made
a snapshot?

In that case, is deduplication that important/worthwhile? You're not
duplicating it all by writing another generation of the backup, but storing only
what's different by making use of the snapshots.

> > What
> > if you have a bunch of snapshots and want to get a file from 6 generations
> > of
> > backups ago? 
>
>
> Use Unix filesystem tools to copy it out of the snapshot tree.  For
> example, a file from two months ago:
>
>      cp
> /var/local/backup/hostname/.zfs/snapshot/zfs-auto-snap_m-2022-09-01-
> 03h21/path/to/file
> ~/restored-file
>

cool

> > I never figured out how to get something out of an old snapshot
> > and found it all confusing, so I don't even use them.
>
>
> Snapshots are a killer feature.  You want to figure them out.  I found
> the Lucas books to be very helpful:
>
> https://mwl.io/nonfiction/os#fmzfs
>
> https://mwl.io/nonfiction/os#fmaz

I know, I only never got around to figure it out because I didn't have the need.
But it could also be useful for "little" things like taking a snapshot of the
root volume before updating or changing some configuration and being able to
easily undo that.

> > 33GB in backups is far from a terabyte. I have a lot more than that.
>
>
> I have 3.5 TiB of backups.
>
>
> > > For compressed and/or encrypted archives, image, etc., I do not use
> > > compression or de-duplication
> >
> > Yeah, they wouldn't compress.  Why no deduplication?
>
>
> Because I very much doubt that there will be duplicate blocks in such files.

Hm, would it hurt?

> > > The key is to only use de-duplication when there is a lot of duplication.
> >
> > How do you know if there's much to deduplicate before deduplicating?
>
>
> Think about the files and how often they change ("churn").  If I'm
> rsync'ing the root filesystem of a half dozen FreeBSD and Linux machines
> to a backup directory once a day, most of the churn will be in /home,
> /tmp, and /var.  When I update the OS and/or packages, install software,
> etc., there will be more churn that day.
>
>
> If you want hard numebers, fdupes(1), jdupes(1), or other tools should
> be able to tell you.

ok, ty

> > > My ZFS pools are built with HDD's.  I recently added an SSD-based vdev
> > > as a dedicated 'dedup' device, and write performance improved
> > > significantly when receiving replication streams.
> >
> > Hm, with the ZFS I set up a couple years ago, the SSDs wore out and removing
> > them without any replacement didn't decrease performance.
>
>
> My LAN has Gigabit Ethernet.  I have operated with a degraded ZFS pool
> in my SOHO server, and did not notice a performance drop on my client.
> If I had run benchmarks on the server before and after losing a
> redundant device, I expect the performance drop would be obvious.  But,
> losing redundant device means increased risk of losing all of the data
> in the pool.

Oh, it's not about performance when degraded, but about performance in general. IIRC when
you have a ZFS pool that uses the equivalent of RAID5, you're still limited to
the speed of a single disk. When you have a mysql database on such a ZFS
volume, it's dead slow, and removing the SSD cache when the SSDs failed didn't
make it any slower. Obviously, it was a bad idea to put the database there, and
I wouldn't do it again if I can avoid it. I also had my data on such a volume
and found that the performance with 6 disks left much to be desired.

>
> > I'm not too fond of ZFS, especially not when considering performance.  But
> > for
> > backups, it won't matter.
>
>
> Learn more about ZFS and invest in hardware to get performance.

Hardware like? In theory, using SSDs for cache with ZFS should improve
performance. In practice, it only wore out the SSDs after a while, and now it's
not any faster without the SSD cache.

to...@tuxteam.de

Nov 9, 2022, 3:50:05 AM
to
On Wed, Nov 09, 2022 at 09:39:45AM +0100, hw wrote:

[...]

> When you keep N full generations of backups it's different. Using rsync, you'll
> only write the changes anyway, switching between the generations. Most of the
> data is being stored N times.

Perhaps you don't know about rsync's --link-dest option: you can, with rsync,
keep generations without duplicating data between them.
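The --link-dest idea can be sketched in a few lines: unchanged files in the new generation become hardlinks into the previous one, so each extra generation only costs the changed files. (A deliberate simplification: real rsync compares size/mtime first, handles whole trees, and transfers deltas.)

```python
import os
import shutil

def _same_content(a, b):
    with open(a, "rb") as fa, open(b, "rb") as fb:
        return fa.read() == fb.read()

def backup_generation(source, dest, link_dest=None):
    """Copy the flat directory `source` into `dest`. Files whose
    content is unchanged relative to the previous generation in
    `link_dest` are hardlinked instead of copied -- the core idea of
    rsync --link-dest."""
    os.makedirs(dest, exist_ok=True)
    for name in sorted(os.listdir(source)):
        src = os.path.join(source, name)
        new = os.path.join(dest, name)
        old = os.path.join(link_dest, name) if link_dest else None
        if old and os.path.isfile(old) and _same_content(old, src):
            os.link(old, new)        # unchanged: costs only a dir entry
        else:
            shutil.copy2(src, new)   # new or changed: a real copy
```

Each generation looks like a complete backup, yet the disk only holds one copy of each unchanged file, shared via its link count.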

But, as others have said, deduplication at the file system level (or below,
as VDO does) is mainly interesting where you have a whole herd of VMs or
containers which are constantly being cloned and too few sysadmins. That's
where those solutions shine (i.e. dedup across "space", not "time").

Cheers
--
t

hw

Nov 9, 2022, 3:50:06 AM
to
On Tue, 2022-11-08 at 10:26 +0100, didier gaumet wrote:
> Le 08/11/2022 à 04:49, hw a écrit :
> [...]
> > When I want to have 2 (or more) generations of backups, do I actually want
> > deduplication?  It leaves me with only one actual copy of the data which
> > seems
> > to defeat the idea of having multiple generations of backups at least to
> > some
> > extent.
> [...]
>
> I would think there is also a confusion here (in my opinion, but I may
> be wrong):
>
> - deduplication is the action of preventing or correcting an object from
> having multiple occurrences. The criterion here is: are objects identical?
>
> - incremental/differential backup is the action of backing up only
> objects (or deltas of objects) that have varied between backups, thus
> forbidding duplicates (on the target storage) of objects that have not
> varied.
> But that definitely does not suppress duplicates on the source storage
> (that you want to back up) nor prevent backing up these duplicates, thus
> creating duplicates on the target storage

When you keep N full generations of backups it's different. Using rsync, you'll
only write the changes anyway, switching between the generations. Most of the
data is being stored N times.

Now the question is if it makes sense to keep N full generations of backups when
you can use snapshots and/or deduplication to save space. Since the data isn't
stored N times anymore, you save space but you have only one copy of most of the
data.

Do you actually need these N copies? With backups on tapes that you can rotate
between, or with backups on multiple machines, it's easily an advantage to have
N copies. But when you have it all on the same machine, is there an advantage
to having N copies?

One reason for having N copies would be to be able to go back in time. But you
can do that with snapshots and that reason goes away.

Another reason is that the single copy may get damaged. But when it's all on
the same machine anyway, does it matter?

hw

Nov 9, 2022, 4:30:05 AM
to
On Tue, 2022-11-08 at 10:04 +0100, didier gaumet wrote:
> Le 08/11/2022 à 05:13, hw a écrit :
> > On Mon, 2022-11-07 at 13:57 -0500, rhkr...@gmail.com wrote:
> > >
> > >
> > > I didn't (and don't) know much about deduplication (beyond what you might
> > > deduce from the name), so I google and found this article which was
> > > helpful to
> > > me:
> > >
> > >     *
> > > [[https://www.linkedin.com/pulse/lets-know-vdo-virtual-data-optimizer-
> > > ganesh-gaikwad][Lets know about VDO (virtual data optimizer)]]
> >
> > That's a good pointer, but I still wonder how VDO actually works.


> [...]

> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/vdo-integration
> and blog, that exposes performance trade-off:
> https://www.redhat.com/en/blog/look-vdo-new-linux-compression-layer
>
> from what I understand, VDO was designed as a layer in kernel space to
> provide deduplication and compression features to local or distributed
> filesystems that lack it. The goal being primarily to optimize storage
> space for a provider of networked virtual machines to entities or customers
>

Yes, I've seen those. I can only wonder how much performance impact VDO would
have for backups. And I wonder why it doesn't require as much memory as ZFS
seems to need for deduplication.

didier gaumet

Nov 9, 2022, 5:10:05 AM
to
Le 09/11/2022 à 10:27, hw a écrit :
[...]
> Yes, I've seen those. I can only wonder how much performance impact VDO would
> have for backups. And I wonder why it doesn't require as much memory as ZFS
> seems to need for deduplication.

It's *only* a hypothesis, but I would suppose that ZFS was designed
(originally by Sun, a hardware vendor) primarily with performance in
mind, at the expense of heavy hardware requirements, while Red Hat
(primarily a software editor before its acquisition by IBM) designed VDO
more with TCO and integration into customers' existing infrastructure in
mind, at the expense of raw performance.

hw

Nov 9, 2022, 5:20:06 AM
to
On Tue, 2022-11-08 at 11:11 +0100, Thomas Schmitt wrote:
> Hi,
>
> hw wrote:
> > I still wonder how VDO actually works.
>
> There is a comparer/decider named UDS which holds an index of the valid
> storage blocks, and a device driver named VDO which performs the
> deduplication and hides its internals from the user by providing a
> block device on top of the real storage device file.
>   https://www.marksei.com/vdo-linux-deduplication/
>

And how come that it doesn't require as much memory as ZFS seems to need for
deduplication? Apparently, ZFS uses either 128kB or variable block sizes[1] and
could use much less memory than VDO would have to because VDO uses much smaller
blocks.


[1]: https://en.wikipedia.org/wiki/ZFS#Variable_block_sizes

> > if I have a volume with 5TB of data on it and I write a 500kB file to that
> > volume a week later or whenever, and the file I'm writing is identical to
> > another file somewhere within the 5TB of data alreading on the volume, how
> > does VDO figure out that both files are identical?
>
> I understand that it chops your file into 4 KiB blocks
>   https://github.com/dm-vdo/kvdo/issues/18
> and lets UDS look up the checksum of each such block in the index. If a
> match is found, then the new block is not stored as itself but only as
> reference to the found block.

So the VDO ppl say 4kB is a good block size and larger blocks would suck for
performance. Does ZFS suck for performance because it uses larger block sizes,
and why doesn't ZFS use the smaller block sizes when those are the most
advantageous ones?

> This might yield more often deduplication than if the file was looked as a
> whole. But i still have doubts that this would yield much advantage with
> my own data.
> Main obstacle for partial matches is probably the demand for 4 KiB alignment.
> Neither text oriented files nor compressed files will necessarily hold their
> identical file parts with that alignment. Any shift of not exactly 4 KiB
> would make the similarity invisible to UDS/VDO.

Deduplication doesn't work when files aren't sufficiently identical, no matter
what block size is used for comparing. It seems to make sense that the larger
the blocks are, the lower chances are that two blocks are identical.

So how come that deduplication with ZFS works at all? The large block sizes
would prevent that. Maybe it doesn't work well enough to be worth it?

Is ZFS compressing blocks or files when compression is enabled? Using variable
block sizes when compression is enabled might indicate that it compresses
blocks.

> didier gaumet wrote:
> > The goal being primarily to optimize storage space
> > for a provider of networked virtual machines to entities or customers
>
> Deduplicating over several nearly identical filesystem images might indeed
> bring good size reduction.

Well, it's independent of the file system. For VM images on whatever file
system, or for N copies of the same backup differing only by the time the backup
was made, I don't see why both shouldn't work well.

> hw wrote:
> > When I want to have 2 (or more) generations of backups, do I actually want
> > deduplication?
>
> Deduplication reduces uncontrolled redundancy, backups shall create
> controlled redundancy. So both are not exactly contrary in their goal
> but surely need to be coordinated.

That's a really nice way to put it. Do I want/need controlled redundancy with
backups on the same machine, or is it better to use snapshots and/or
deduplication to reduce the controlled redundancy?

> In case of VDO i expect that you need to use different deduplicating
> devices to get controlled redundancy.

How would the devices matter? It's the volume residing on devices that gets
deduplicated, not the devices.

> I do similar with incremental backups with file granularity. My backup
> Blu-rays hold 200+ sessions which mostly re-use the file data storage
> of previous sessions. If a bad spot damages file content, then it is
> damaged in all sessions which refer to it.
> To reduce the probability of such a loss, i run several backups per day,
> each on a separate BD disc.
>
> From time to time I make verification runs on the backup discs in order
> to check for any damage. It is extremely rare to find a bad spot after the
> written session was verified directly after being written.
> (The verification is based on MD5 checksums, which I deem sufficient,
> because my use case avoids the birthday paradox of probability theory.
> UDS/VDO looks like a giant birthday party. So I assume that it uses larger
> checksums or verifies content identity when checksums match.)

How can you make backups on Blu-rays? They hold only 50GB or so and I'd need
thousands of them. Do you have an automatic changer that juggles 1000
DVDs or so? :)

hw

Nov 9, 2022, 5:20:06 AM
to
On Wed, 2022-11-09 at 09:46 +0100, to...@tuxteam.de wrote:
> On Wed, Nov 09, 2022 at 09:39:45AM +0100, hw wrote:
>
> [...]
>
> > When you keep N full generations of backups it's different.  Using rsync,
> > you'll
> > only write the changes anyway, switching between the generations.  Most of
> > the
> > data is being stored N times.
>
> Pehaps you don't know about rsync's --link-dest option: you can, with rsync,
> keep generations without duplicating between them.

No, I didn't know that. My intention has always been to create N copies. I'm
trying to figure out if I should change that.

> But, as others have said, deduplication at the file system level (or below,
> as VDO does) is mainly interesting where you have a whole herd of VMs or
> containers which are constantly being cloned and too few sysadmins. That's
> where those solutions shine (i.e. dedup across "space", not "time").

Well, I'm undecided about that when it comes to backups ... Why would I clone
VMs or containers all the time?

didier gaumet

Nov 9, 2022, 5:40:05 AM
to

I am no expert (in Linux, backups or anything else) and cannot give
viable advice about what your backup plan should be. You are in a better
position to evaluate your needs and your means, and to design a satisfying
backup plan accordingly.

What I was underlining is that, in my opinion, you are confusing
deduplication during backup with incremental/differential backups.
(Perhaps in your context that has no consequences and is thus unimportant.)
To *me*, what you are talking about is incremental/differential backups,
not deduplicating backups.

For example, I am myself using Deja-Dup (based upon Duplicity, itself
based upon librsync) for my basic home laptop backups: it does incremental
backups, but I would not call it deduplicating backups.

The "Deduplication" paragraph of the Wikipedia backup article has an
example of 100 identical workstations whose backup storage need is
divided by 100 by deduplication.

Wikipedia "Deduplication" paragraph of the backup article:
https://en.wikipedia.org/wiki/Backup#Deduplication
Wikipedia "Incremental" paragraph of the backup article:
https://en.wikipedia.org/wiki/Backup#Incremental
Wikipedia "Differential" paragraph of the backup article:
https://en.wikipedia.org/wiki/Backup#Differential

Thomas Schmitt

Nov 9, 2022, 6:10:05 AM
to
Hi,

I wrote:
> >   https://github.com/dm-vdo/kvdo/issues/18

hw wrote:
> So the VDO ppl say 4kB is a good block size

They actually say that it's the only size which they support.


> Deduplication doesn't work when files aren't sufficiently identical,

The definition of "sufficiently identical" probably differs considerably
between VDO and ZFS.
ZFS has more knowledge about the files than VDO has, so it might be
worthwhile for it to hold more info in memory.


> It seems to make sense that the larger
> the blocks are, the lower chances are that two blocks are identical.

Especially if the filesystem's block size is smaller than the VDO
block size, or if the filesystem does not align file content intervals
to block size, like ReiserFS does.


> So how come that deduplication with ZFS works at all?

Inner magic and knowledge about how blocks of data form a file object.
A filesystem does not have to hope that identical file content is
aligned to a fixed block size.


didier gaumet wrote:
> > > The goal being primarily to optimize storage space
> > > for a provider of networked virtual machines to entities or customers

I wrote:
> > Deduplicating over several nearly identical filesystem images might indeed
> > bring good size reduction.

hw wrote:
> Well, it's independent of the file system.

Not entirely. As stated above, i would expect VDO to not work well for
ReiserFS, with its habit of squeezing data into unused parts of storage blocks.
(This made it great for storing many small files, but also led to some
performance loss through more fragmentation.)


> Do I want/need controlled redundancy with
> backups on the same machine, or is it better to use snapshots and/or
> deduplication to reduce the controlled redundancy?

First and foremost, i would want several independent backups.

The highest risk for backup is when a backup storage gets overwritten or
updated. So i want several backups still untouched and valid when the
storage hardware or the backup software begins to spoil things.

Deduplication increases the risk that a partial failure of the backup
storage damages more than one backup. On the other hand it decreases the
work load on the storage and the time window in which the backed-up data
can become inconsistent on the application level.
A snapshot before backup reduces that window size to 0. But this still
does not prevent application-level inconsistencies if the application is
caught in the act of reworking its files.

So i would use at least four independent storage facilities interchangeably.
I would make snapshots, if the filesystem supports them, and backup those
instead of the changeable filesystem.
I would try to reduce the activity of applications on the filesystem when
the snapshot is made.
I would allow each independent backup storage to do its own deduplication,
not sharing it with the other backup storages.
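A minimal sketch of the snapshot-then-backup step described above, assuming an LVM setup (volume group `vg0`, logical volume `data`; names, sizes, and paths are illustrative, not from the thread):

```shell
# Freeze a point-in-time view of the live volume; needs free extents in vg0
lvcreate --snapshot --size 5G --name data-snap /dev/vg0/data

# Back up the frozen snapshot instead of the changeable filesystem
mount -o ro /dev/vg0/data-snap /mnt/snap
rsync -a /mnt/snap/ /backup/$(date +%F)/

# Drop the snapshot once the backup is done
umount /mnt/snap
lvremove -y /dev/vg0/data-snap
```

On btrfs or ZFS the same idea would use `btrfs subvolume snapshot` or `zfs snapshot` instead of `lvcreate --snapshot`.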


> > In case of VDO i expect that you need to use different deduplicating
> > devices to get controlled redundancy.

> How would the devices matter? It's the volume residing on devices that gets
> deduplicated, not the devices.

I understand that one VDO device implements one deduplication.
So if no sharing of deduplication is desired between the backups, then i
expect that each backup storage needs its own VDO device.
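As a sketch of that layout (assuming Red Hat's `vdo` tooling as in the documentation linked earlier; device names and sizes are illustrative, untested here):

```shell
# One VDO device per backup store, so no deduplication is shared between them
vdo create --name=vdo_backup1 --device=/dev/sdb --vdoLogicalSize=10T
vdo create --name=vdo_backup2 --device=/dev/sdc --vdoLogicalSize=10T

# Give each its own filesystem; -K skips discards so mkfs does not try to
# trim the whole (thinly provisioned) logical space
mkfs.xfs -K /dev/mapper/vdo_backup1
mkfs.xfs -K /dev/mapper/vdo_backup2
```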


> How can you make backups on Blu-rays? They hold only 50GB or so and I'd
> need thousands of them.

My backup needs are much smaller than yours, obviously.
I have an active $HOME tree of about 4 GB and some large but less agile
data hoard of about 500 GB.
The former gets backed up 5 times per day on appendable 25 GB BD media
(as stated, 200+ days fit on one BD).
The latter gets an incremental update on a single-session 25 GB BD every
other day. A new base backup needs about 20 BD media. Each time the
single update BD is full, it joins the base backup in its cake box and a
new incremental level gets started.

If you have much more valuable data to backup then you will probably
decide for rotating magnetic storage. Not only for capacity but also for
the price/capacity ratio.
But you should consider having at least some of your backups on
removable media, e.g. hard disks in USB boxes. Only those can be isolated
from the risks of daily operation, which i deem crucial for safe backup.

to...@tuxteam.de
Nov 9, 2022, 6:20:05 AM

On Wed, Nov 09, 2022 at 11:15:15AM +0100, hw wrote:
> On Wed, 2022-11-09 at 09:46 +0100, to...@tuxteam.de wrote:

[...]

> > Pehaps you don't know about rsync's --link-dest option: you can, with rsync,
> > keep generations without duplicating between them.
>
> No, I didn't know that. My intention has always been to create N copies. I'm
> trying to figure out if I should change that.

It's nifty, believe me :)

> > But, as others have said, deduplication at the file system level (or below,
> > as VDO does) is mainly interesting where you have a whole herd of VMs [...]

> Well, I'm undecided about that when it comes to backups ... Why I would clone
> VMs or containers all the time?

Then you're better served with deduplication built into the backup, not
into the file system or (worse!) into the block layer. Either the simple
rsync --link-dest approach or the more "pro" tools à la BackupPC et al.
(most of them do have deduplication).

Cheers
--
t

hw
Nov 9, 2022, 6:30:06 AM

On Tue, 2022-11-08 at 07:19 -0500, Dan Ritter wrote:
> hw wrote:
> > > As you say, deduplication in backup systems is quite common, and works
> > > pretty well. There's also an on-disk non-filesystem utility, rdfind,
> > > which is packaged in Debian. It can discover identical files and make
> > > them hardlinks.
> >
> > Well, if I had all the disk space to hold 2 full copies of the data to be
> > able
> > to deduplicate it only later, I wouldn't need to deduplicate anything.
>
> Only two copies? That's not a good use case for any of the
> deduplicators.

Why not?

> The point of rdfind is to use it in a cron job while some process is
> generating duplicate files. For example, a backup process that copies a
> filesystem every six hours will generate four identical copies of almost
> every file each day. (rsnapshot would do a better job, here.)

That only works when you can make the backups fast enough and have sufficient
disk space to create so many copies.
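For illustration, the file-level hardlinking that rdfind performs can be sketched with plain coreutils (a toy version, not rdfind itself: it assumes paths without spaces and uses MD5 only to spot candidates):

```shell
#!/bin/sh
# Hash every file under the given directory, keep the first file seen for
# each hash, and replace later duplicates with hard links to it.
set -e
dedup_tree() {
    find "$1" -type f -exec md5sum {} + | sort -k1,1 |
    while read -r hash path; do
        if [ "$hash" = "$prev_hash" ]; then
            ln -f "$prev_path" "$path"    # duplicate content -> hard link
        else
            prev_hash=$hash
            prev_path=$path
        fi
    done
}
```

After `dedup_tree /some/backup/tree`, files with identical content share an inode, like rdfind's `-makehardlinks true` mode.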

> > And how would pretending there are two backups while there's actually only
> > one
> > because it got deduplicated be better than having only one backup to begin
> > with?
> > (Yeah I haven't thought of that before ...)
>
> It's not two backups, it's two very similar backups taken at
> different times, so the majority of the files are the same but
> some are different.

right

> If you want a second backup, it needs to go
> on a different machine, preferably in a different location.

That would certainly be an advantage, and I wouldn't want to deduplicate the
copies.

> Maybe you should tell us what your actual use case is rather
> than asking about realtime deduplication? It could be that
> there's a completely different solution which would make you
> happy.

The use case comes down to making backups once in a while. When making another
backup, at least the latest previous backup must not be overwritten. Sooner or
later, there won't be enough disk space to keep two full backups. With disk
prices as crazy high as they currently are, I might even move disks from the
backup server to the active server when it runs out of space before I move data
into archive (without backup) or start deleting stuff. All prices keep going
up, so I don't expect disk prices to go down.

Deduplication is only one possible way to go about it. I'm undecided whether
it's better to have only one full backup and to use snapshots instead.
Deduplicating the backups would kinda turn two copies into only one for
whatever gets deduplicated, so that might not be better than snapshots. Or I
could use both and perhaps save even more space.

hw
Nov 9, 2022, 6:50:04 AM

On Wed, 2022-11-09 at 10:35 +0100, DdB wrote:
> Am 09.11.2022 um 09:24 schrieb hw:
> > > Learn more about ZFS and invest in hardware to get performance.
> > Hardware like?  In theory, using SSDs for cache with ZFS should improve
> > performance.  In practise, it only wore out the SSDs after a while, and now
> > it's
> > not any faster without SSD cache.
> >
> >
>
Me too: I had the unpleasant experience of a worn-out SSD that had been
used as a ZFS cache and ZIL device. After that, my next comp got huge
amounts of ECC RAM, freeing up the SSD for the OS and stuff.

I don't have anything without ECC RAM, and my server was never meant for ZFS.

> Also i did
> change the geometry of the main pool to a collection of mirrors (much
> faster than raid) and left the raid only on the slower backup server.

But then you have less capacity ...

> Due to snapshots and increments, i am now backing up only once in 2
> weeks, which takes somewhat around 1 hour bcoz of a slow connection. But
> i am satisfied with zfs performance from spinning rust, if i dont fill
> up the pool too much, and defrag after a while ... ;-)

With mirroring, I could fit only one backup, not two.

In any case, I'm currently tending to think that putting FreeBSD with ZFS on my
server might be the best option. But then, apparently I won't be able to
configure the controller cards, so that won't really work. And ZFS with Linux
isn't so great because it keeps fuse in between.

hw
Nov 9, 2022, 7:00:05 AM

On Wed, 2022-11-09 at 12:13 +0100, to...@tuxteam.de wrote:
> On Wed, Nov 09, 2022 at 11:15:15AM +0100, hw wrote:
> > On Wed, 2022-11-09 at 09:46 +0100, to...@tuxteam.de wrote:
>
> [...]
> > > But, as others have said, deduplication at the file system level (or
> > > below,
> > > as VDO does) is mainly interesting where you have a whole herd of VMs
> > > [...]
>
> > Well, I'm undecided about that when it comes to backups ...  Why I would
> > clone
> > VMs or containers all the time?
>
> Then you're better served with deduplication built into the backup, not
> into the file system or (worse!) into the block layer. Either the simple
> rsync --link-dest or the more "pro" à la backuppc et al (most of them do
> have deduplication).

Both only work when I don't keep two copies. Aren't snapshots better than
using hardlinks like that?

hw
Nov 9, 2022, 7:20:05 AM

On Wed, 2022-11-09 at 11:37 +0100, didier gaumet wrote:
>
> I am no expert (in Linux, backporting or anything else) and cannot emit
> a viable advice about what your backup plan should be. You are in better
> position to evaluate your needs, your means and design a satisfying
> backup plan accordingly.
>
> What I was underlining is that in my opinion you are confusing
> deduplicating during backup and incremental/differential backups.
> (Perhaps in your context that has no consequences and is thus unimportant).
> To *me* what you are talking about is incremental/differential backups,
> not deduplicating backups.

I don't know why you think that. To clarify, I haven't been making incremental
backups. Instead, I keep two full backups that were created at different times.
Making incremental backups through snapshots or other means is very different.
Perhaps it's a better solution because it needs less disk space.

> For example, I am myself using Deja-Dup (based upon Duplicity, itself
> based upon librsync) for my basic home laptop backup: it's incremental
> backups but I would not call it deduplicating backups.
>
> The Wikipedia deduplication paragraph of their backup article has an
> example of 100 identical workstations having a backup storage need
> divided by 100 by deduplication.
>
> Wikipedia "deduplication" paragraph of their backup article:
> https://en.wikipedia.org/wiki/Backup#Deduplication
> Wikipedia "incremental" paragraph of their backup article:
> https://en.wikipedia.org/wiki/Backup#Incremental
> Wikipedia "differential" paragraph of their backup article:
> https://en.wikipedia.org/wiki/Backup#Differential

I've made it simple and haven't done any of this :) It's only two full copies
that get updated interchangeably.

This isn't to say that it would be better or worse. It's just that when you
have two copies of almost the same data, it's easy to think it would make
sense to deduplicate them. What's confusing about that?

hw
Nov 9, 2022, 7:30:05 AM

On Tue, 2022-11-08 at 09:52 +0100, DdB wrote:
> Am 08.11.2022 um 05:31 schrieb hw:
> > > That's only one point.
> > What are the others?
> >
> > >  And it's not really some valid one, I think, as
> > > you do typically not run into space problems with one single action
> > > (YMMV). Running multiple sessions and out-of-band deduplication between
> > > them works for me.
> > That still requires you to have enough disk space for at least two full
> > backups.
> > I can see it working for three backups because you can deduplicate the first
> > two, but not for two.  And why would I deduplicate when I have sufficient
> > disk
> > space.
> >
> Your wording likely confuses 2 different concepts:

Noooo, I'm not confusing that :) Everyone says so and I don't know why ...

> Deduplication avoids storing identical data more than once.
> whereas
> Redundancy stores information in more than one place on purpose to avoid
> loss of data in case of havoc.
> ZFS can do both, as it combines the features of a volume manager with
> those of a filesystem and a software RAID.( I am using zfsonlinux since
> its early days, for over 10 years now, but without dedup. )
>
> In the past, i used shifting/rotating external backup media for that
> purpose, because, as the saying goes: RAID is NOT a backup! Today, i
> have a second server only for the backups, using zfs as well, which
> allows for easy incremental backups, minimizing traffic and disk usage.
>
> but you should be clear as to what you want: redundancy or deduplication?

The question is rather if it makes sense to have two full backups on the same
machine for redundancy and to be able to go back in time, or if it's better to
give up on redundancy and to have only one copy and use snapshots or whatever to
be able to go back in time.

Of course it would better to have more than one machine, but I don't have that.

hw
Nov 9, 2022, 7:50:05 AM

On Tue, 2022-11-08 at 10:04 -0500, The Wanderer wrote:
> On 2022-11-08 at 09:36, Nicolas George wrote:
>
> > Curt (12022-11-08):
> >
> > > Redundancy sounds a lot like a back up.
> >
> > RAID also sounds a lot like a backup, and the R means redundant.
> >
> > Yet RAID is not a backup.
>
> That depends on which sense of the word "backup" you are using.
>
> No, it's not a "backup" in the technical "back it up to tape" sense of
> the word. There are many types of data-loss scenarios in which it will
> not protect you at all.
>
> But it does mean that if one drive fails, you can still fall back to the
> copy on the other drive, and thus that copy is serving as a backup to
> the copy on the first drive. There are some data-loss scenarios in which
> RAID will protect you.
>
> That more general sense of "backup" as in "something that you can fall
> back on" is no less legitimate than the technical sense given above, and
> it always rubs me the wrong way to see the unconditional "RAID is not a
> backup" trotted out blindly as if that technical sense were the only one
> that could possibly be considered applicable, and without any
> acknowledgment of the limited sense of "backup" which is being used in
> that statement.

RAID is like the incarnation of (having) something to fall back on. Backups
aren't.

Dan Ritter
Nov 9, 2022, 8:00:06 AM

hw wrote:
>
> The question is rather if it makes sense to have two full backups on the same
> machine for redundancy and to be able to go back in time, or if it's better to
> give up on redundancy and to have only one copy and use snapshots or whatever to
> be able to go back in time.


And for this, we have a clear answer: if you have two full
backups on one machine, it is very likely that a disaster will
kill them both. Therefore, you should consider it one full
backup.

Therefore, improving your functionality with some snapshot
mechanism will make your life better without increasing your
risk.

You might consider identifying some subset of the data that you
really, really care about, and using a locally encrypting remote
backup service to make a second copy elsewhere -- or, if it
doesn't change much, occasionally attaching a storage device,
copying it, and moving the storage device to some safe location
elsewhere.

-dsr-

hw

unread,
Nov 9, 2022, 8:00:06 AM11/9/22
to
On Tue, 2022-11-08 at 15:07 +0100, hede wrote:
> On 08.11.2022 05:31, hw wrote:
> > That still requires you to have enough disk space for at least two full
> > backups.
>
> Correct: if you always do full backups, then the second run will consume
> full backup space in the first place. (Not fully correct with bees
> running -> *.)

Does that work? Does bees run as long as there's something to deduplicate and
only stop when there isn't? I thought you start it when the data is in place
and not before that.

> That would be the first thing I'd address. Even the simplest backup
> solutions (i.e. based on rsync) do make use of destination rotation and
> only submitting changes to the backup (-> incremental or differential
> backups). I never considered successive full backups as a backup
> "solution".

You can easily make changes to two full copies --- "make changes" meaning that
you only change what has been changed since the last time you made the backup.

> For me only the first backup is a full backup, every other backup is
> incremental.

When you make a second full backup, that second copy is not incremental. It's a
full backup.

> Regarding deduplication, I do see benefits when the user moves files
> from one directory to some other directory, with partly changed files
> (my backup solution dedupes on a file basis via hardlinks only), and
> with system backups of several different machines.

But not with copies?

> I prefer file-based backups, so my backup solution's deduplication skills
> are really limited. But a good block-based backup solution can handle
> all these cases by itself. Then no filesystem-based deduplication is
> needed.

What difference does it make whether the deduplication is block-based or
somehow file-based (whatever that means)?

> If your problem is only backup-related and you are flexible regarding
> your backup solution, then choosing a backup solution with a
> good deduplication feature is probably your best choice. The solution
> doesn't have to be complex. Even simple backup solutions like borg backup
> are fine here (borg: chunk-based deduplication, even of parts of files,
> across several backups of several different machines). Even your
> criterion of not writing duplicate data in the first place is fulfilled
> here.
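For reference, a minimal borg workflow looks roughly like this (a sketch assuming the Debian borgbackup package; the repository path and source directories are illustrative):

```shell
# Create a deduplicating repository once
borg init --encryption=repokey /backup/repo

# Each run stores only chunks not already in the repository, so a second
# "full" backup of mostly unchanged data costs very little extra space
borg create --stats /backup/repo::'{hostname}-{now}' /home /etc

# Keep a bounded number of generations
borg prune --keep-daily 7 --keep-weekly 4 /backup/repo
```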

I'm flexible, but I distrust "backup solutions".

> (see borgbackup in Debian repository; disclaimer: I do not have personal
> experience with borg as I'm using other solutions)
>
> >  I wouldn't mind running it from time to time, though I don't know that
> > I
> > would have a lot of duplicate data other than backups.  How much space
> > might I
> > expect to gain from using bees, and how much memory does it require to
> > run?
>
> Bees should run as a service 24/7 and catch all written data right
> after it gets written. That's comparable to in-band deduplication even
> if it's out-of-band by definition. (*) This way, writing many duplicate
> files will potentially result in removing duplicates even before all the
> data has been written to disk.
>
> Therefore memory consumption is also like with in-band deduplication
> (ZFS...), which means you should reserve more than 1 GB RAM per 1 TB of
> data. But it's flexible: even less memory is usable, but then it cannot
> find all duplicates, as the hash table of all the data doesn't fit into
> memory. (Nevertheless, even then deduplication is more efficient than
> expected: if it finds some duplicate block, it looks at the blocks
> around that block, so for big files a single match in the hash table is
> sufficient to deduplicate the whole file.)

Sounds good. Before I try it, I need to make a backup in case something goes
wrong.

didier gaumet
Nov 9, 2022, 8:20:06 AM

Le 09/11/2022 à 13:12, hw a écrit :
> On Wed, 2022-11-09 at 11:37 +0100, didier gaumet wrote:
[...]
>> in my opinion you are confusing
>> deduplicating during backup and incremental/differential backups.
[...]
> I don't know why you think that.[...]

Because earlier in a previous message you stated:
"When I want to have 2 (or more) generations of backups, do I actually
want deduplication? It leaves me with only one actual copy of the data
which seems to defeat the idea of having multiple generations of backups
at least to some extent."

To me you are considering that this deduplication leaves only one backup
object, derived from multiple objects across time (the multiple variations
of a single object). And I do not agree: if you do two full (not
differential nor incremental) backups with a deduplicating backup tool,
you will obtain 2 backup objects from one source object, having two
different states of evolution.

So I think that you are using the word "deduplication" but are really
talking about incremental or differential backup features. But I am
perhaps nitpicking here and that is probably not important in your
context :-)

Nicolas George
Nov 9, 2022, 8:30:06 AM

hw (12022-11-08):
> When I want to have 2 (or more) generations of backups, do I actually want
> deduplication? It leaves me with only one actual copy of the data which seems
> to defeat the idea of having multiple generations of backups at least to some
> extent.

The idea of having multiple generations of backups is not to have the
data physically present in multiple places, this is the role of RAID.

The idea of having multiple generations of backups is that if you
accidentally overwrite half your almost-completed novel with lines of
ALL WORK AND NO PLAY MAKES JACK A DULL BOY and the backup tool runs
before you notice it, you still have the precious data in the previous
generation.

Regards,

--
Nicolas George

hw
Nov 9, 2022, 8:30:07 AM

On Wed, 2022-11-09 at 11:05 +0100, didier gaumet wrote:
> Le 09/11/2022 à 10:27, hw a écrit :
> [...]
> > Yes, I've seen those.  I can only wonder how much performance impact VDO
> > would
> > have for backups.  And I wonder why it doesn't require as much memory as ZFS
> > seems to need for deduplication.
>
> It's *only* an hypothesis, but I would suppose that ZFS was designed
> (originally by Sun, hardware vendor) primarily with performances in
> mind,

I don't think it was, see https://docs.freebsd.org/en/books/handbook/zfs/

It does mention performance, but I remember other statements saying that it
was designed for arrays with 40+ disks and, besides data integrity, with ease
of use in mind. Performance doesn't seem paramount. Also see
https://wiki.gentoo.org/wiki/ZFS

> at the expense of strong hardware needs, while RedHat (primarily
> software editor before its acquisition by IBM) designed VDO more with
> TCO and integration of already existant customer infrastructure in mind,
> at the expense of pure performances.

Well, the question is what you mean by performance. Maybe ZFS can deduplicate
faster than VDO, but eating tons of RAM and/or having to replace all the
hardware may not be a kind of performance one would be looking for.

didier gaumet
Nov 9, 2022, 8:40:05 AM

Le 09/11/2022 à 12:41, hw a écrit :
[...]
> In any case, I'm currently tending to think that putting FreeBSD with ZFS on my
> server might be the best option. But then, apparently I won't be able to
> configure the controller cards, so that won't really work. And ZFS with Linux
> isn't so great because it keeps fuse in between.

I am really not so well aware of the ZFS state, but my impression was that:
- the FUSE implementation of ZoL (ZFS on Linux) is deprecated and,
Ubuntu excepted (classic module?), ZFS is now integrated via a DKMS module
- *BSDs integrate ZFS directly because there are no licence conflicts
- *BSDs nowadays have departed from the old ZFS code and use the same source
code stack as Linux (OpenZFS)
- Linux distros don't directly integrate ZFS because they generally
consider there are licence conflicts. The notable exception is
Ubuntu, which considers that after legal review the situation is clear and
there are no licence conflicts.

didier gaumet
Nov 9, 2022, 8:50:05 AM

Le 09/11/2022 à 14:25, hw a écrit :

> I don't think it was, see https://docs.freebsd.org/en/books/handbook/zfs/
>
> I does mention performance, but I remember other statements saying that was
> designed for arrays with 40+ disks and, besides data integrity, with ease of use
> in mind. Performance doesn't seem paramount. Also see
> https://wiki.gentoo.org/wiki/ZFS

> Well, the question is what you mean by performance. Maybe ZFS can deduplicate
> faster than VDO, but eating tons of RAM and/or having to replace all the
> hardware may not be a kind of performance one would be looking for.

My bad: I'm French and my English is not as fluent as I would like it to
be ;-)

I was using the word "performance" here as I would have in French (same
word), thinking of technical abilities (speed, scalability and so on),
without realizing that in English, in the particular context of computer
science, it primarily means speed (if I understand correctly) :-)

hw
Nov 9, 2022, 11:40:05 AM

Hm, no, performance refers to how well something measures up to something,
like fulfilling some requirements or achieving some goal. If something is
fast, it can be performant. If something doesn't require much RAM, it can be
performant as well, whether it is slow or not. It all depends on what you
want from something.

hw
Nov 9, 2022, 12:20:05 PM

On Wed, 2022-11-09 at 14:29 +0100, didier gaumet wrote:
> Le 09/11/2022 à 12:41, hw a écrit :
> [...]
> > In any case, I'm currently tending to think that putting FreeBSD with ZFS on
> > my
> > server might be the best option.  But then, apparently I won't be able to
> > configure the controller cards, so that won't really work.  And ZFS with
> > Linux
> > isn't so great because it keeps fuse in between.
>
> I am really not so well aware of ZFS state but my impression was that:
> - FUSE implementation of ZoL (ZFS on Linux) is deprecated and that,
> Ubuntu excepted (classic module?), ZFS is now integrated by a DKMS module

Hm that could be. Debian doesn't seem to have it as a module.

> - *BSDs integrate directly ZFS because there are no licences conflicts
> - *BSDs nowadays have departed from old ZFS code and use the same source
> code stack as Linux (OpenZFS)
> - Linux distros don't directly integrate ZFS because they generally
> consider there are licences conflicts. The notable exception being
> Ubuntu that considers that after legal review the situation is clear and
> there is no licence conflicts.

Well, I'm not touching Ubuntu. I want to get away from Fedora because of their
hostility, and that includes CentOS since it has become a derivative of it.
FreeBSD has ZFS but can't even configure the disk controllers, so that won't
work. I don't want to go with Gentoo because updating is a nightmare to the
point where you suddenly find yourself unable to update at all because they
broke something. Arch is apparently for masochists, and I don't want
derivatives, especially not Ubuntu, and that leaves only Debian. I don't want
Debian either, because when they introduced their brokenarch, they managed to
make it so that NVIDIA drivers didn't work anymore with no fix in sight, and
broke other stuff as well; you can't let your users down like that. But
what's the alternative?

However, Debian apparently has bad ZFS support (apparently still only Gentoo
actually supports it), so I'd go with btrfs. Now that's gonna suck, because
I'd have to use mdadm to create a RAID5 (or use the hardware RAID, but that
isn't fun after I've seen the hardware RAID refusing to rebuild a volume after
a failed disk was replaced) and put btrfs on that, because btrfs doesn't even
support RAID5.

Or what else?
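The mdadm-underneath-btrfs layering described here would look roughly like this (an untested sketch; device names and mount points are illustrative):

```shell
# Software RAID5 over three disks
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

# btrfs on top of the md device; btrfs still provides checksums, subvolumes
# and snapshots, while mdadm handles the parity RAID
mkfs.btrfs /dev/md0
mount /dev/md0 /srv/backup
btrfs subvolume create /srv/backup/current
btrfs subvolume snapshot -r /srv/backup/current /srv/backup/snap-$(date +%F)
```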

hw
Nov 9, 2022, 12:30:05 PM

On Wed, 2022-11-09 at 17:29 +0100, DdB wrote:
> Am 09.11.2022 um 12:41 schrieb hw:
> > In any case, I'm currently tending to think that putting FreeBSD with ZFS on
> > my
> > server might be the best option.  But then, apparently I won't be able to
> > configure the controller cards, so that won't really work.  And ZFS with
> > Linux
> > isn't so great because it keeps fuse in between.
>
> NO fuse, neither FreeBSD nor debian would need the outdated zfs-fuse,
> use the in.kernel modules from zfsonlinux.org (packages for debian are
> in contrib IIRC).
>

Ok, all the better --- I only looked at the package management. Ah, let me see,
I have a Debian VM and no contrib ... hm, zfs-dkms and such? That's promising,
thank you :)

Christoph Brinkhaus
Nov 9, 2022, 12:40:06 PM

Am Wed, Nov 09, 2022 at 06:11:34PM +0100 schrieb hw:

Hi hw,

> On Wed, 2022-11-09 at 14:29 +0100, didier gaumet wrote:
> > Le 09/11/2022 à 12:41, hw a écrit :
> > [...]
> > > In any case, I'm currently tending to think that putting FreeBSD with ZFS on
> > > my
> > > server might be the best option.  But then, apparently I won't be able to
> > > configure the controller cards, so that won't really work.  And ZFS with
> > > Linux
> > > isn't so great because it keeps fuse in between.
> >
> > I am really not so well aware of ZFS state but my impression was that:
> > - FUSE implementation of ZoL (ZFS on Linux) is deprecated and that,
> > Ubuntu excepted (classic module?), ZFS is now integrated by a DKMS module
>
> Hm that could be. Debian doesn't seem to have it as a module.
>
> > - *BSDs integrate directly ZFS because there are no licences conflicts
> > - *BSDs nowadays have departed from old ZFS code and use the same source
> > code stack as Linux (OpenZFS)
> > - Linux distros don't directly integrate ZFS because they generally
> > consider there are licences conflicts. The notable exception being
> > Ubuntu that considers that after legal review the situation is clear and
> > there is no licence conflicts.
>
> Well, I'm not touching Ubuntu. I want to get away from Fedora because of their
> hostility and that includes Centos since that has become a derivative of it.
> FreeBSD has ZFS but can't even configure the disk controllers, so that won't
> work.

If I understand you right, you mean RAID controllers?
To my knowledge, ZFS should be used without any RAID
controllers. Disks, or better, partitions are fine.

> I don't want to go with Gentoo because updating is a nightmare to the
> point where you suddenly find yourself unable to update at all because they
> broke something. Arch is apparently for machosists, and I don't want
> derivatives, especially not Ubuntu, and that leaves only Debian. I don't want
> Debian either because when they introduced their brokenarch, they managed to
> make it so that NVIDIA drivers didn't work anymore with no fix in sight and
> broke other stuff as well, and you can't let your users down like that. But
> what's the alternative?
>
> However, Debian has apparently bad ZFS support (apparently still only Gentoo
> actually supports it), so I'd go with btrfs.

I have no knowledge about the status of ZFS on Linux distributions,
just about FreeBSD.

> Now that's gona suck because I'd
> have to use mdadm to create a RAID5 (or use the hardware RAID but that isn't fun
> after I've seen the hardware RAID refusing to rebuild a volume after a failed
> disk was replaced) and put btrfs on that because btrfs doesn't even support
> RAID5.
>
> Or what else?
>
Kind regards,
Christoph

Linux-Fan
Nov 9, 2022, 1:30:05 PM

hw writes:

> On Wed, 2022-11-09 at 14:29 +0100, didier gaumet wrote:
> > Le 09/11/2022 à 12:41, hw a écrit :

[...]

> > I am really not so well aware of ZFS state but my impression was that:
> > - FUSE implementation of ZoL (ZFS on Linux) is deprecated and that,
> > Ubuntu excepted (classic module?), ZFS is now integrated by a DKMS module
>
> Hm that could be. Debian doesn't seem to have it as a module.

As already mentioned by others, zfs-dkms is readily available in the contrib
section along with zfsutils-linux. Here is what I noted down back when I
installed it:

https://masysma.net/37/zfs_commands_shortref.xhtml
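In short, the install boils down to something like this (a sketch; it assumes a Debian release with the contrib component already enabled in sources.list):

```shell
# Install the DKMS-built ZFS module and the userland tools from contrib
sudo apt-get update
sudo apt-get install -y linux-headers-amd64 zfs-dkms zfsutils-linux
sudo modprobe zfs
zfs version    # confirm that module and tools agree
```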

I have been using ZFS on Linux on Debian since end of 2020 without any
issues. In fact, the dkms-based approach has run much more reliably than
my previous experiences with out-of-tree modules would have suggested...

My setup works with a mirrored zpool and no deduplication, I did not need
nor test anything else yet.

> > - *BSDs integrate directly ZFS because there are no licences conflicts
> > - *BSDs nowadays have departed from old ZFS code and use the same source
> > code stack as Linux (OpenZFS)
> > - Linux distros don't directly integrate ZFS because they generally
> > consider there are licences conflicts. The notable exception being
> > Ubuntu that considers that after legal review the situation is clear and
> > there is no licence conflicts.

[...]

> broke something. Arch is apparently for masochists, and I don't want
> derivatives, especially not Ubuntu, and that leaves only Debian. I don't
> want
> Debian either because when they introduced their brokenarch, they managed to
> make it so that NVIDIA drivers didn't work anymore with no fix in sight and
> broke other stuff as well, and you can't let your users down like that. But
> what's the alternative?

Nvidia drivers have been working for me in all releases from Debian 6 to 10 both
inclusive. I did not have any need for them on Debian 11 yet, since I have
switched to an AMD card for my most recent system.

> However, Debian has apparently bad ZFS support (apparently still only Gentoo
> actually supports it), so I'd go with btrfs. Now that's gonna suck because

You can use ZFS on Debian (see link above). Of course it remains your choice
whether you want to trust your data to the older, but less-well-integrated
technology (ZFS) or to the newer, but more easily integrated technology
(BTRFS).

> I'd
> have to use mdadm to create a RAID5 (or use the hardware RAID but that isn't

AFAIK BTRFS also includes some integrated RAID support such that you do not
necessarily need to pair it with mdadm. Using it for RAID 5 or 6 is advised
against even in the most recent Linux kernels, though:

https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid56-status-and-recommended-practices

RAID 5 and 6 have their own issues you should be aware of even when running
them with the time-proven and reliable mdadm stack. You can find a lot of
interesting results by searching for “RAID5 considered harmful” online. This
one is the classic that does not seem to make it to the top results, though:

https://www.baarf.dk/BAARF/RAID5_versus_RAID10.txt

If you want to go with mdadm (irrespective of RAID level), you might also
consider running ext4 and trade the complexity and features of the advanced
file systems for a good combination of stability and support.
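As a rough sketch of that combination (device names are placeholders, not a recommendation for your hardware):

```shell
# Sketch: RAID5 over four disks with mdadm, plain ext4 on top
sudo mdadm --create /dev/md0 --level=5 --raid-devices=4 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde
cat /proc/mdstat                  # watch the initial sync progress
sudo mkfs.ext4 -L backup /dev/md0
sudo mount /dev/md0 /srv/backup
```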

> fun
> after I've seen the hardware RAID refusing to rebuild a volume after a failed
> disk was replaced) and put btrfs on that because btrfs doesn't even support
> RAID5.

YMMV
Linux-Fan


hw

Nov 9, 2022, 10:50:05 PM
On Wed, 2022-11-09 at 18:26 +0100, Christoph Brinkhaus wrote:
> Am Wed, Nov 09, 2022 at 06:11:34PM +0100 schrieb hw:
> [...]
> > FreeBSD has ZFS but can't even configure the disk controllers, so that won't
> > work. 
>
> If I understand you right you mean RAID controllers?

yes

> According to my knowledge ZFS should be used without any RAID
> controllers. Disks or better partions are fine.

I know, but it's what I have. JBOD controllers are difficult to find. And it
doesn't really matter because I can configure each disk as a single disk ---
still RAID though. It may even be an advantage because the controllers have 1GB
cache each and the computer's CPU doesn't need to do command queuing.

And I've been reading that when using ZFS, you shouldn't make volumes with more
than 8 disks. That's very inconvenient.

Why would partitions be better than the block device itself? They're like an
additional layer and what could be faster and easier than directly using the
block devices?

hw

Nov 10, 2022, 12:00:05 AM
On Wed, 2022-11-09 at 19:17 +0100, Linux-Fan wrote:
> hw writes:
>
> > On Wed, 2022-11-09 at 14:29 +0100, didier gaumet wrote:
> > > Le 09/11/2022 à 12:41, hw a écrit :
>
> [...]
>
> > > I am really not so well aware of ZFS state but my impression was that:
> > > - FUSE implementation of ZoL (ZFS on Linux) is deprecated and that,
> > > Ubuntu excepted (classic module?), ZFS is now integrated by a DKMS module
> >
> > Hm that could be.  Debian doesn't seem to have it as a module.
>
> As already mentioned by others, zfs-dkms is readily available in the contrib 
> section along with zfsutils-linux. Here is what I noted down back when I 
> installed it:
>
> https://masysma.net/37/zfs_commands_shortref.xhtml

Thanks, that's good information.

> I have been using ZFS on Linux on Debian since end of 2020 without any 
> issues. In fact, the dkms-based approach has run much more reliably than 
> my previous experiences with out-of-tree modules would have suggested...

Hm, issues? I have one:


ls -la
total 5
drwxr-xr-x 3 namefoo namefoo 3 Aug 16 22:36 .
drwxr-xr-x 24 root root 4096 Nov 1 2017 ..
drwxr-xr-x 2 namefoo namefoo 2 Jan 21 2020 ?
namefoo@host /srv/datadir $ ls -la '?'
ls: cannot access '?': No such file or directory
namefoo@host /srv/datadir $


This directory named ? appeared on a ZFS volume for no reason and I can't access
it and can't delete it. A scrub doesn't repair it. It doesn't seem to do any
harm yet, but it's annoying.

Any idea how to fix that?

> Nvidia drivers have been working for me in all releases from Debian 6 to 10
> both 
> inclusive. I did not have any need for them on Debian 11 yet, since I have 
> switched to an AMD card for my most recent system.
>

Maybe it was longer ago. I recently switched to AMD, too. NVIDIA remains
uncooperative and their drivers are a hassle, so why would I support NVIDIA by
buying their products? It was a good choice and it just works out of the box.

I can't get the 2nd monitor to work, but that's probably not an AMD issue.

> > However, Debian has apparently bad ZFS support (apparently still only Gentoo
> > actually supports it), so I'd go with btrfs.  Now that's gonna suck because
>
> You can use ZFS on Debian (see link above). Of course it remains your choice 
> whether you want to trust your data to the older, but less-well-integrated 
> technology (ZFS) or to the newer, but more easily integrated technology 
> (BTRFS).
>
>

It's fine when using the kernel module. This isn't about newer, and ZFS seems
more mature than btrfs. Somehow, development of btrfs is excruciatingly slow.

If it doesn't work out, I can always do something else and make a new backup.

> > I'd
> > have to use mdadm to create a RAID5 (or use the hardware RAID but that isn't
>
> AFAIK BTRFS also includes some integrated RAID support such that you do not 
> necessarily need to pair it with mdadm.

Yes, but RAID56 is broken in btrfs.

> Using it for RAID 5 or 6 is advised
> against even in the most recent Linux kernels, though:
>
> https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid56-status-and-recommended-practices
>

Yes, that's why I would have to use btrfs on mdadm when I want to make a RAID5.
That kinda sucks.

> RAID 5 and 6 have their own issues you should be aware of even when running 
> them with the time-proven and reliable mdadm stack. You can find a lot of 
> interesting results by searching for “RAID5 considered harmful” online. This 
> one is the classic that does not seem to make it to the top results, though:

Hm, really? The only time that RAID5 gave me trouble was when the hardware RAID
controller steadfastly refused to rebuild the array after a failed disk was
replaced. How often does that happen?

So yes, there are people saying that RAID5 is so bad, and I think it's exaggerated.
At the end of the day, for all I know lightning could strike the server and
burn out all the disks and no alternative to RAID5 could prevent that. So all
variants of RAID are bad and ZFS and btrfs and whatever are all just as bad and
any way of storing data is bad because something could happen to the data.
Gathering data is actually bad to begin with and getting worse all the time.
The less data you have, the better, because less data is less unwieldy.

There is a write hole with RAID5? Well, I have a UPS and the controllers have
backup batteries. So is there really gonna be a write hole? When I use mdadm, I
don't have a backup battery. Then what? Do JBOD controllers have backup
batteries or are you forced to use file systems that make them unnecessary?
Bits can flip and maybe whatever controls the RAID may not be able to tell which
copy is the one to use. The checksums ZFS and btrfs use may be insufficient and
then what. ZFS and btrfs may not be a good idea to use because the software,
like Centos 7, is too old and prefers xfs instead. Now what? Rebuild the
server like every year or so to use the latest and greatest? Oh no, the latest
and greatest may be unstable ...

More than one disk can fail? Sure can, and it's one of the reasons why I make
backups.

You also have to consider costs. How much do you want to spend on storage and
and on backups? And do you want make yourself crazy worrying about your data?

> https://www.baarf.dk/BAARF/RAID5_versus_RAID10.txt
>
> If you want to go with mdadm (irrespective of RAID level), you might also 
> consider running ext4 and trade the complexity and features of the advanced 
> file systems for a good combination of stability and support.
>

Is anyone still using ext4? I'm not saying it's bad or anything, it only seems
that it has gone out of fashion.

I'm considering using snapshots. Ext4 didn't have those last time I checked.

David Christensen

Nov 10, 2022, 12:40:05 AM
On 11/9/22 05:29, didier gaumet wrote:

> - *BSDs nowadays have departed from old ZFS code and use the same source
> code stack as Linux (OpenZFS)


AIUI FreeBSD 12 and prior use ZFS-on-Linux code, while FreeBSD 13 and
later use OpenZFS code.



On 11/9/22 05:44, didier gaumet wrote:

> I was using the word "performance" here as I would have in french (same
> word), thinking of technical abilities (speed, scalability and so on)
> without realizing that in english in the particular context of computer
> science that means primarily speed (if I understand correctly) :-)


I tend to use the term "performance" to mean minimum processor cycles,
minimum memory consumption, minimum latency, and/or maximum data
transfer per unit time.


David

David Christensen

Nov 10, 2022, 12:40:05 AM
On 11/9/22 03:08, Thomas Schmitt wrote:

> So i would use at least four independent storage facilities interchangeably.
> I would make snapshots, if the filesystem supports them, and backup those
> instead of the changeable filesystem.
> I would try to reduce the activity of applications on the filesystem when
> the snapshot is made.
> I would allow each independent backup storage to do its own deduplication,
> not sharing it with the other backup storages.


+1


David

David Christensen

Nov 10, 2022, 12:40:05 AM
On 11/9/22 00:24, hw wrote:
> On Tue, 2022-11-08 at 17:30 -0800, David Christensen wrote:

> Hmm, when you can backup like 3.5TB with that, maybe I should put FreeBSD on
> my server and give ZFS a try. Worst thing that can happen is that it crashes
> and I'd have made an experiment that wasn't successful. Best thing, I guess,
> could be that it works and backups are way faster because the server doesn't
> have to actually write so much data because it gets deduplicated and reading
> from the clients is faster than writing to the server.


Be careful that you do not confuse a ~33 GiB full backup set, and 78
snapshots over six months of that same full backup set, with a full
backup of 3.5 TiB of data. I would suggest a 10 TiB pool to back up the
latter.


Writing to a ZFS filesystem with deduplication is much slower than
simply writing to, say, an ext4 filesystem -- because ZFS has to hash
every incoming block and see if it matches the hash of any existing
block in the destination pool. Storing the existing block hashes in a
dedicated dedup virtual device will expedite this process.
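For reference, such a dedicated dedup vdev is declared at pool creation time. A sketch with placeholder devices, not a tested layout:

```shell
# Sketch: raidz data vdev plus a mirrored dedup vdev to hold the dedup
# table on fast storage, then a filesystem with dedup enabled
sudo zpool create tank raidz /dev/sda /dev/sdb /dev/sdc \
    dedup mirror /dev/nvme0n1 /dev/nvme1n1
sudo zfs create -o dedup=on -o compression=on tank/backup
zpool status -D tank    # -D prints dedup table statistics
```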


>> I run my backup script each night. It uses rsync to copy files and
>
> Aww, I can't really do that because my server eats like 200-300W
because it has
> so many disks in it. Electricity is outrageously expensive here.


Perhaps platinum rated power supplies? Energy efficient HDD's/ SSD's?


>> directories from various LAN machines into ZFS filesystems named after
>> each host -- e.g. pool/backup/hostname (ZFS namespace) and
>> /var/local/backup/hostname (Unix filesystem namespace). I have a
>> cron(8) that runs zfs-auto-snapshot once each day and once each month
>> that takes a recursive snapshot of the pool/backup filesystems. Their
>> contents are then available via Unix namespace at
>> /var/local/backup/hostname/.zfs/snapshot/snapshotname. If I want to
>> restore a file from, say, two months ago, I use Unix filesystem tools to
>> get it.
>
> Sounds like a nice setup. Does that mean you use snapshots to keep multiple
> generations of backups and make backups by overwriting everything after you
> made a snapshot?


Yes.


> In that case, is deduplication that important/worthwhile? You're not
> duplicating it all by writing another generation of the backup but store
> only what's different through making use of the snapshots.


Without deduplication or compression, my backup set and 78 snapshots
would require 3.5 TiB of storage. With deduplication and compression,
they require 86 GiB of storage.


> ... I only never got around to figure [ZFS snapshots] out because I didn't
> have the need.


I accidentally trash files on occasion. Being able to restore them
quickly and easily with a cp(1), scp(1), etc., is a killer feature.
Users can recover their own files without needing help from a system
administrator.
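With the layout described above, such a restore is just a cp. A sketch (the hostname and snapshot name here are invented for illustration):

```shell
# List the snapshots available for one host's backup filesystem
ls /var/local/backup/myhost/.zfs/snapshot/
# Copy a file back out of a two-month-old monthly snapshot
cp /var/local/backup/myhost/.zfs/snapshot/zfs-auto-snap_monthly-2022-09-01-0000/etc/fstab \
   /tmp/fstab.restored
```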


> But it could also be useful for "little" things like taking a snapshot of
> the root volume before updating or changing some configuration and being
> able to easily undo that.


FreeBSD with ZFS-on-root has a killer feature called "Boot Environments"
that has taken that idea to the next level:

https://klarasystems.com/articles/managing-boot-environments/
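In outline, it works like this (a sketch using FreeBSD's bectl; the environment name is arbitrary):

```shell
# Snapshot the running system into a new boot environment before a risky change
bectl create before-upgrade
# ... perform the upgrade or configuration change ...
bectl list                      # show environments and which one is active
bectl activate before-upgrade   # roll back: boot the old environment next reboot
```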


>> I have 3.5 TiB of backups.


It is useful to group files with similar characteristics (size,
workload, compressibility, duplicates, backup strategy, etc.) into
specific ZFS filesystems (or filesystem trees). You can then adjust ZFS
properties and backup strategies to match.


>>>> For compressed and/or encrypted archives, image, etc., I do not use
>>>> compression or de-duplication
>>>
>>> Yeah, they wouldn't compress. Why no deduplication?
>>
>>
>> Because I very much doubt that there will be duplicate blocks in such
>> files.
>
> Hm, would it hurt?


Yes. ZFS deduplication is resource intensive.


> Oh it's not about performance when degraded, but about performance. IIRC
> when you have a ZFS pool that uses the equivalent of RAID5, you're still
> limited to the speed of a single disk. When you have a mysql database on
> such a ZFS volume, it's dead slow, and removing the SSD cache when the SSDs
> failed didn't make it any slower. Obviously, it was a bad idea to put the
> database there, and I wouldn't do it again when I can avoid it. I also had
> my data on such a volume and I found that the performance with 6 disks left
> much to be desired.


What were the makes and models of the 6 disks? Of the SSD's? If you
have a 'zpool status' console session from then, please post it.


Constructing a ZFS pool to match the workload is not easy. STFW there
are plenty of articles. Here is a general article I found recently:

https://klarasystems.com/articles/choosing-the-right-zfs-pool-layout/


MySQL appears to have the ability to use raw disks. Tuned correctly,
this should give the best results:

https://dev.mysql.com/doc/refman/8.0/en/innodb-system-tablespace.html#innodb-raw-devices


If ZFS performance is not up to your expectations, and there are no
hardware problems, next steps include benchmarking, tuning, and/or
adding or adjusting the hardware and its usage.


>> ... invest in hardware to get performance.

> Hardware like?


Server chassis, motherboards, chipsets, processors, memory, disk host
bus adapters, disk racks, disk drives, network interface cards, etc..



> In theory, using SSDs for cache with ZFS should improve performance. In
> practice, it only wore out the SSDs after a while, and now it's not any
> faster without SSD cache.


Please run 'zpool status' and post the console session (prompt, command
entered, output displayed). Please correlate the vdev's to disk drive
makes and models.


On 11/9/22 03:41, hw wrote:

> I don't have anything without ECC RAM,


Nice.


> and my server was never meant for ZFS.


What is the make and model of your server?


> With mirroring, I could fit only one backup, not two.


Add another mirror to your pool. Or, use a process of substitution and
resilvering to replace existing drives with larger capacity drives.
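Either route is only a command or two (device names are placeholders):

```shell
# Option 1: add a second mirror vdev, growing the pool's capacity
sudo zpool add tank mirror /dev/sdd /dev/sde
# Option 2: swap in bigger disks one at a time; once every member of a
# vdev is larger, the pool can expand automatically
sudo zpool set autoexpand=on tank
sudo zpool replace tank /dev/sdb /dev/sdf   # wait for resilver, then repeat
```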


> In any case, I'm currently tending to think that putting FreeBSD with ZFS on my
> server might be the best option. But then, apparently I won't be able to
> configure the controller cards, so that won't really work.


What is the make and model of your controller cards?


> And ZFS with Linux
> isn't so great because it keeps fuse in between.


+1

https://packages.debian.org/bullseye/zfs-dkms

https://packages.debian.org/bullseye/zfsutils-linux


On 11/9/22 03:20, hw wrote:

> The use case comes down to making backups once in a while. When making
> another backup, at least the latest previous backup must not be overwritten.
> Sooner or later, there won't be enough disk space to keep two full backups.
> With disk prices as crazy high as they currently are, I might even move
> discs from the backup server to the active server when it runs out of space
> before I move data into archive (without backup) or start deleting stuff.
> All prices keep going up, so I don't expect disc prices to go down.
>
> Deduplication is only one possible way to go about it. I'm undecided if it's
> better to have only one full backup and to use snapshots instead.
> Deduplicating the backups would kinda turn two copies into only one for
> whatever gets deduplicated, so that might not be better than snapshots. Or I
> could use both and perhaps save even more space.

On 11/9/22 04:28, hw wrote:
> Of course it would be better to have more than one machine, but I don't
> have that.


If you already have a ZFS pool, the way to back it up is to replicate
the pool to another pool. Set up an external drive with a pool and
replicate your server pool to that periodically.
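In sketch form (pool and snapshot names are placeholders):

```shell
# One-time: create a pool on the external drive
sudo zpool create extbak /dev/sdX
# Each backup run: snapshot recursively, then send the stream
sudo zfs snapshot -r tank@2022-11-10
sudo zfs send -R tank@2022-11-10 | sudo zfs receive -Fu extbak/tank
# Later runs send only the delta between two snapshots:
sudo zfs send -R -i tank@2022-11-10 tank@2022-11-17 | sudo zfs receive -Fu extbak/tank
```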


David

David Christensen

Nov 10, 2022, 12:40:05 AM
On 11/9/22 01:35, DdB wrote:
> > But
> i am satisfied with zfs performance from spinning rust, if i dont fill
> up the pool too much, and defrag after a while ...


What is your technique for defragmenting ZFS?


David

gene heskett

Nov 10, 2022, 2:20:06 AM
Which brings up another suggestion in two parts:

1: use amanda, with tar and compression to reduce the size of the
backups. And use a backup cycle of a week or 2, because amanda, when
advancing a level, will only back up that which has been changed since the
last backup. On a quiet system, a level 3 backup for a 50gb network of
several machines can be under 100 megs. More on a busy system of course.
Amanda keeps track of all that automatically.

2: As disks fail, replace them with SSD's which use much less power than
spinning rust. And they are typically 5x faster than commodity spinning
rust.

Here, and historically with spinning rust, backing up 5 machines, at 3am
every morning is around 10gb total and under 45 minutes. This includes
the level 0's it does by self adjusting the schedule to spread the level
0's, AKA the fulls, out over the backup cycle so the amount of storage
used for any one backup run is fairly consistent.

Cheers, Gene Heskett.
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author, 1940)
If we desire respect for the law, we must first make the law respectable.
- Louis D. Brandeis
Genes Web page <http://geneslinuxbox.net:6309/>

Christoph Brinkhaus

Nov 10, 2022, 4:40:06 AM
Using the block device is no issue until you have a mirror or so.
In case of a mirror ZFS will use the capacity of the smallest drive.

I have read that a disk sold as, for example, 100GB might actually be
slightly larger than 100GB. When you want to replace such a disk with a
spare that is slightly smaller than the original, the pool will not fit
on the new disk and the replacement fails.

With partitions you can specify the exact space. It does not hurt if a
few MB are left unallocated, but then the partitions on the disks have
exactly the same size.
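That sizing trick looks roughly like this (a sketch with parted and placeholder devices):

```shell
# Create identically sized partitions a bit below the nominal disk size,
# so a marginally smaller replacement disk can still hold one
sudo parted -s /dev/sdb mklabel gpt mkpart zfs 1MiB 99GiB
sudo parted -s /dev/sdc mklabel gpt mkpart zfs 1MiB 99GiB
sudo zpool create tank mirror /dev/sdb1 /dev/sdc1
```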

Kind regards,
Christoph

hede

Nov 10, 2022, 5:50:05 AM
On Wed, 09 Nov 2022 13:52:26 +0100 hw <h...@adminart.net> wrote:

> Does that work? Does bees run as long as there's something to
> deduplicate and
> only stops when there isn't?

Bees is a service (daemon) which runs 24/7 watching btrfs transaction
state (the checkpoints). If there are new transactions then it kicks in.
But it's a niced service (man nice, man ionice). If your backup process
has higher priority than "idle" (which is typically the case) and
produces high load it will potentially block out bees until the backup
is finished (maybe!).

> I thought you start it when the data is in place and
> not before that.

That's the case with fdupes, duperemove, etc.
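With those, the run happens after the fact; a one-shot pass over finished backups might look like this (a sketch; the path and hashfile location are placeholders):

```shell
# Block-based, after-the-fact dedup on a btrfs (or XFS) filesystem.
# -r recurse, -d actually submit the dedup requests, -h human-readable
# sizes; --hashfile caches block hashes so reruns are cheap
sudo duperemove -rdh --hashfile=/var/tmp/backup.hash /srv/backups
```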

> You can easily make changes to two full copies --- "make changes"
> meaning that
> you only change what has been changed since last time you made the
> backup.

Do you mean to modify (make changes) to one of the backups? I never
considered making changes to my backups. I do make changes to the live
data and next time (when the incremental backup process runs) these
changes do get into backup storage. Making changes to some backups ... I
won't call that backups anymore.

Or do you mean you have two copies and alternatively "update" these
copies to reflect the live state? I do not see a benefit in this. At
least if both reside on the same storage system. There's a waste in
storage space (doubled files). One copy with many incremental backups
would be better. And if you plan to deduplicate both copies, simply use a
backup solution with incremental backups.

Syncing two adjacent copies means to submit all changes a second time,
which was already transferred for the first copy. The second copy is
still on some older state the moment you update this one.

Yet again I do prefer a single process for having one[sic] consistent
backup storage with a working history.

Two copies on two different locations is some other story, that indeed
can have benefits.

> > For me only the first backup is a full backup, every other backup is
> > incremental.
>
> When you make a second full backup, that second copy is not
> incremental. It's a
> full backup.

correct. That's the reason I do make incremental backups. And with
incremental backups I do mean that I can restore "full" backups for
several days: every day of the last week, one day for every month of the
year, even several days of past years and so on. But the whole backup of
all those "full" backups is not even two full backups in size. It's less
in size but offers more.

For me a single full backup needs several days (Terabytes via DSL upload
to the backup location) while incremental backups are MUCH faster
(typically a few minutes if there wasn't changed that much). So I use
the later one.

> What difference does it make wether the deduplication is block based or
> somehow
> file based (whatever that means).

File-based deduplication means files get compared as a whole. The
result: two big, nearly identical files both need to get stored in
full, because they differ.
Say for example a backup of a virtual machine image which got started
between two backup runs. More than 99% of the image is the same as
before, but because there's some log written inside the VM image they do
differ. Those files are nearly identical, even in position of identical
data.

Block based deduplication can find parts of a file to be exclusive
(changed blocks) and other parts to set shared (blocks with same
content):

#####
# btrfs fi du file1 file2

Total Exclusive Set shared Filename
2.30GiB 23.00MiB 2.28GiB file1
2.30GiB 149.62MiB 2.16GiB file2
#####
here both files share data but do also have their exclusive data.

> I'm flexible, but I distrust "backup solutions".

I would say, it depends on. I do also distrust everything, but some sane
solution maybe I do distrust a little less than my "self-built" one. ;-)

Don't trust your own solution more than others "on principle", without
some real reasons for distrust.

> Sounds good. Before I try it, I need to make a backup in case
> something goes
> wrong.

;-)

regards
hede

Greg Wooledge

Nov 10, 2022, 7:10:06 AM
On Thu, Nov 10, 2022 at 05:54:00AM +0100, hw wrote:
> ls -la
> total 5
> drwxr-xr-x 3 namefoo namefoo 3 Aug 16 22:36 .
> drwxr-xr-x 24 root root 4096 Nov 1 2017 ..
> drwxr-xr-x 2 namefoo namefoo 2 Jan 21 2020 ?
> namefoo@host /srv/datadir $ ls -la '?'
> ls: cannot access '?': No such file or directory
> namefoo@host /srv/datadir $
>
>
> This directory named ? appeared on a ZFS volume for no reason and I can't access
> it and can't delete it. A scrub doesn't repair it. It doesn't seem to do any
> harm yet, but it's annoying.
>
> Any idea how to fix that?

ls -la might not be showing you the true name. Try this:

printf %s * | hd

That should give you a hex dump of the bytes in the actual filename.

If you misrepresented the situation, and there's actually more than one
file in this directory, then use something like this instead:

shopt -s failglob
printf '%s\0' ? | hd

Note that the ? is *not* quoted here, because we want it to match any
one-character filename, no matter what that character actually is. If
this doesn't work, try ?? or * as the glob, until you manage to find it.

If it turns out that '?' really is the filename, then it becomes a ZFS
issue with which I can't help.
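One more avenue: if the name contains unprintable bytes, you can operate on the entry by inode number without ever typing the name. A self-contained demonstration (the bad filename here is fabricated):

```shell
# Scratch directory with a file whose name contains an invalid byte
cd "$(mktemp -d)"
touch "$(printf 'bad\270name')"
ls -lib                                   # -b escapes odd bytes, -i shows inodes
inum=$(ls -i | awk '{print $1; exit}')    # inode of the (only) entry
find . -maxdepth 1 -inum "$inum" -delete  # remove it by inode, not by name
ls -A                                     # nothing left
```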

hw

Nov 10, 2022, 8:20:05 AM
On Wed, 2022-11-09 at 12:08 +0100, Thomas Schmitt wrote:
> Hi,
>
> i wrote:
> > >   https://github.com/dm-vdo/kvdo/issues/18
>
> hw wrote:
> > So the VDO ppl say 4kB is a good block size
>
> They actually say that it's the only size which they support.
>
>
> > Deduplication doesn't work when files aren't sufficiently identical,
>
> The definition of sufficiently identical probably differs much between
> VDO and ZFS.
> ZFS has more knowledge about the files than VDO has. So it might be worth
> for it to hold more info in memory.

Dunno, apparently they keep checksums of blocks in memory. More checksums, more
memory ...

> > It seems to make sense that the larger
> > the blocks are, the lower chances are that two blocks are identical.
>
> Especially if the filesystem's block size is smaller than the VDO
> block size, or if the filesystem does not align file content intervals
> to block size, like ReiserFS does.

That would depend on the files.

> > So how come that deduplication with ZFS works at all?
>
> Inner magic and knowledge about how blocks of data form a file object.
> A filesystem does not have to hope that identical file content is
> aligned to a fixed block size.

No, but when it uses large blocks it can store more files in a block and won't
be able to deduplicate the identical files in a block because the blocks are
atoms in deduplication. The larger the blocks are, the less likely it seems
that multiple blocks are identical.

> didier gaumet wrote:
> > > > The goal being primarily to optimize storage space
> > > > for a provider of networked virtual machines to entities or customers
>
> I wrote:
> > > Deduplicating over several nearly identical filesystem images might indeed
> > > bring good size reduction.
>
> hw wrote:
> > Well, it's independent of the file system.
>
> Not entirely. As stated above, i would expect VDO to work not well for
> ReiserFS with its habit to squeeze data into unused parts of storage blocks.
> (This made it great for storing many small files, but also led to some
> performance loss by more fragmentation.)

VDO is independent of the file system, and 4k blocks are kinda small. It
doesn't matter how files are aligned to blocks of a file system because VDO
always uses chunks of 4k each and compares them and always works the same. You
can always create a file system with an unlucky block size for the files on it
or even one that makes sure that all the 4k blocks are not identical. We could
call it spitefs maybe :)
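The fixed 4k chunking is easy to illustrate with coreutils alone: split a file into 4 KiB pieces and hash them, the way a fixed-block deduplicator would (a toy illustration, not VDO's actual internals):

```shell
# Three 4 KiB zero blocks: a fixed-block deduplicator would store one copy
cd "$(mktemp -d)"
head -c 8192 /dev/zero > a        # file a = two identical 4 KiB blocks
head -c 4096 /dev/zero > b        # file b = one more identical block
cat a b > data
split -b 4096 data blk_           # -> blk_aa blk_ab blk_ac
sha256sum blk_* | awk '{print $1}' | sort | uniq -c   # one hash, count 3
```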

> > Do I want/need controlled redundancy with
> > backups on the same machine, or is it better to use snapshots and/or
> > deduplication to reduce the controlled redundancy?
>
> I would want several independent backups on the first hand.

Independent? Like two full copies like I'm making?

> The highest risk for backup is when a backup storage gets overwritten or
> updated. So i want several backups still untouched and valid, when the
> storage hardware or the backup software begin to spoil things.

That's what I thought, but I'm about to run out of disk space for multiple full
copies.

> Deduplication increases the risk that a partial failure of the backup
> storage damages more than one backup. On the other hand it decreases the
> work load on the storage

It may make all backups unusable because the single copy that deduplication
has left has been damaged. However, how likely is a partial failure of a
storage volume, and how relevant is it? How often does a storage volume --- the
underlying media doesn't necessarily matter; for example, when a disk goes bad
in a RAID, you replace it and keep going --- go bad in only one place? When
the volume has gone away, so have all the copies.

> and the time window in which the backuped data
> can become inconsistent on the application level.

Huh?

> Snapshot before backup reduces that window size to 0. But this still
> does not prevent application level inconsistencies if the application is
> caught in the act of reworking its files.

You make the snapshot of the backup before starting to make a backup, not while
making one.

Or are you referring to the data being altered while a backup is in progress?

> So i would use at least four independent storage facilities interchangeably.
> I would make snapshots, if the filesystem supports them, and backup those
> instead of the changeable filesystem.
> I would try to reduce the activity of applications on the filesystem when
> the snapshot is made.

right

> I would allow each independent backup storage to do its own deduplication,
> not sharing it with the other backup storages.

If you have them on different machines or volumes, it would be difficult to do
it otherwise.

> > > In case of VDO i expect that you need to use different deduplicating
> > > devices to get controlled redundancy.
>
> > How would the devices matter?  It's the volume residing on devices that gets
> > deduplicated, not the devices.
>
> I understand that one VDO device implements one deduplication.
> So if no sharing of deduplication is desired between the backups, then i
> expect that each backup storage needs its own VDO device.

right

Would you even make so many backups on the same machine?

> > How can you make backups on Bluerays? They hold only 50GB or so and I'd
> > need thousands of them.
>
> My backup needs are much smaller than yours, obviously.
> I have an active $HOME tree of about 4 GB and some large but less agile
> data hoard of about 500 GB.
> The former gets backuped 5 times per day on appendable 25 GB BD media
> (as stated, 200+ days fit on one BD).

That makes it a lot easier. Isn't 5 times a day a bit much? And it's an odd
number.

> The latter gets an incremental update on a single-session 25 GB BD every
> other day. A new base backup needs about 20 BD media. Each time the
> single update BD is full, it joins the base backup in its cake box and a
> new incremental level gets started.
>
> If you have much more valuable data to backup then you will probably
> decide for rotating magnetic storage. Not only for capacity but also for
> the price/capacity ratio.

Yes, I'm re-using the many small hard discs that have accumulated over the
years. It's much easier and way more efficient to use few large discs for the
active data than many small ones, and using the small ones for backups is way
better than just having them laying around unused.

I wish we could still (relatively) easily make backups on tapes. Just change
the tape every day and you can have a reasonable number of full backups. Of
course, spooling and seeking tapes kinda sucks, but how often do you need to do
that.

> But you should consider to have at least some of your backups on
> removable media, e.g. hard disks in USB boxes. Only those can be isolated
> from the risks of daily operation, which i deem crucial for safe backup.

The backup server is turned off unless I'm making backups and its PDU port is
switched off, so not much will happen to it. I'm bad because I'm making them
only once in a while, and last time was very long ago ...

Once I've figured out what to do, I'll make a backup. A full new backup takes
ages and I need to stop modifying stuff and not start all over again all the
time. I think last time I created a btrfs RAID5, being unaware that that's a
bad idea ...

hw

Nov 10, 2022, 8:30:05 AM
On Thu, 2022-11-10 at 10:34 +0100, Christoph Brinkhaus wrote:
> Am Thu, Nov 10, 2022 at 04:46:12AM +0100 schrieb hw:
> > On Wed, 2022-11-09 at 18:26 +0100, Christoph Brinkhaus wrote:
> > > Am Wed, Nov 09, 2022 at 06:11:34PM +0100 schrieb hw:
> > > [...]
> [...]
> > >
> >
> > Why would partitions be better than the block device itself?  They're like
> > an
> > additional layer and what could be faster and easier than directly using the
> > block devices?
>  
>  Using the block device is no issue until you have a mirror or so.
>  In case of a mirror ZFS will use the capacity of the smallest drive.

But you can't make partitions larger than the drive.

>  I have read that, for example, a 100GB disk might be slightly larger
>  than 100GB. When you want to replace a 100GB disk with a spare one
>  which is slightly smaller than the original one, the pool will not fit
>  on the disk and the replacement fails.

Ah yes, right! I kinda did that a while ago for spinning disks that might be
replaced by SSDs eventually and wanted to make sure that the SSDs wouldn't be
too small. I forgot about that, my memory really isn't what it used to be ...

>  With partitions you can specify the space. It does not hurt if there
>  are a few MB unallocated. But then the partitions of the disks have
>  exactly the same size.

yeah

hw

Nov 10, 2022, 8:30:05 AM
On Thu, 2022-11-10 at 10:59 +0100, DdB wrote:
> Am 10.11.2022 um 04:46 schrieb hw:
> > On Wed, 2022-11-09 at 18:26 +0100, Christoph Brinkhaus wrote:
> > > Am Wed, Nov 09, 2022 at 06:11:34PM +0100 schrieb hw:
> > > [...]
> [...]
> > >
> > Why would partitions be better than the block device itself?  They're like
> > an
> > additional layer and what could be faster and easier than directly using the
> > block devices?
> >
> >
> hurts my eyes to see such disinformation circulating.

What's wrong about it?

Curt

Nov 10, 2022, 8:50:05 AM
On 2022-11-08, The Wanderer <wand...@fastmail.fm> wrote:
>
> That more general sense of "backup" as in "something that you can fall
> back on" is no less legitimate than the technical sense given above, and
> it always rubs me the wrong way to see the unconditional "RAID is not a
> backup" trotted out blindly as if that technical sense were the only one
> that could possibly be considered applicable, and without any
> acknowledgment of the limited sense of "backup" which is being used in
> that statement.
>

Maybe it's a question of intent more than anything else. I thought RAID
was intended for a server scenario where if a disk fails, your down
time is virtually null, whereas a backup is intended to prevent data
loss. RAID isn't ideal for the latter because it doesn't ship the saved
data off-site from the original data (or maybe a RAID array is
conceivable over a network and a distance?).

Of course, I wouldn't know one way or another, but the complexity (and
substantial verbosity) of this thread seem to indicate that that all
these concepts cannot be expressed clearly and succinctly, from which I
draw my own conclusions.

hw

Nov 10, 2022, 8:50:05 AM
On Thu, 2022-11-10 at 07:03 -0500, Greg Wooledge wrote:
> On Thu, Nov 10, 2022 at 05:54:00AM +0100, hw wrote:
> > ls -la
> > insgesamt 5
> > drwxr-xr-x  3 namefoo namefoo    3 16. Aug 22:36 .
> > drwxr-xr-x 24 root    root    4096  1. Nov 2017  ..
> > drwxr-xr-x  2 namefoo namefoo    2 21. Jan 2020  ?
> > namefoo@host /srv/datadir $ ls -la '?'
> > ls: Zugriff auf ? nicht möglich: Datei oder Verzeichnis nicht gefunden
> > namefoo@host /srv/datadir $
> >
> >
> > This directory named ? appeared on a ZFS volume for no reason and I can't
> > access
> > it and can't delete it.  A scrub doesn't repair it.  It doesn't seem to do
> > any
> > harm yet, but it's annoying.
> >
> > Any idea how to fix that?
>
> ls -la might not be showing you the true name.  Try this:
>
> printf %s * | hd
>
> That should give you a hex dump of the bytes in the actual filename.

good idea:

printf %s * | hexdump
0000000 77c2 6861 0074
0000005

> If you misrepresented the situation, and there's actually more than one
> file in this directory, then use something like this instead:
>
> shopt -s failglob
> printf '%s\0' ? | hd

shopt -s failglob
printf '%s\0' ? | hexdump
0000000 00c2
0000002

> Note that the ? is *not* quoted here, because we want it to match any
> one-character filename, no matter what that character actually is.  If
> this doesn't work, try ?? or * as the glob, until you manage to find it.

printf '%s\0' ?? | hexdump
-bash: Keine Entsprechung: ??

(meaning something like "no equivalent")


printf '%s\0' * | hexdump
0000000 00c2 6177 7468 0000
0000007


> If it turns out that '?' really is the filename, then it becomes a ZFS
> issue with which I can't help.

I would think it is. Is it?

perl -e 'print chr(0xc2) . "\n"'

... prints a blank line. What's 0xc2? I guess that should be UTF8 ...
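For what it's worth, 0xc2 by itself is only the lead byte of a two-byte UTF-8
sequence, so on its own it is invalid UTF-8, which is presumably why ls falls
back to displaying '?'. A quick check:

```shell
# 0xc2 alone is an incomplete UTF-8 sequence; followed by a continuation
# byte (0x80-0xbf) it encodes U+0080..U+00BF (0xc2 0xa0 is a no-break space)
printf '\xc2'     | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 \
  && echo valid || echo invalid    # invalid
printf '\xc2\xa0' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 \
  && echo valid || echo invalid    # valid
```

So the directory's name is most likely the first half of some two-byte
character that got truncated somewhere along the way.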


printf %s *
aht

What would you expect it to print after shopt?

The Wanderer

Nov 10, 2022, 9:00:06 AM
On 2022-11-10 at 08:40, Curt wrote:

> On 2022-11-08, The Wanderer <wand...@fastmail.fm> wrote:
>
>> That more general sense of "backup" as in "something that you can
>> fall back on" is no less legitimate than the technical sense given
>> above, and it always rubs me the wrong way to see the unconditional
>> "RAID is not a backup" trotted out blindly as if that technical
>> sense were the only one that could possibly be considered
>> applicable, and without any acknowledgment of the limited sense of
>> "backup" which is being used in that statement.
>
> Maybe it's a question of intent more than anything else. I thought
> RAID was intended for a server scenario where if a disk fails, your
> down time is virtually null, whereas a backup is intended to
> prevent data loss.

If the disk fails, the data stored on the disk is lost (short of
forensic-style data recovery, anyway), so anything that ensures that
that data is still available serves to prevent data loss.

RAID ensures that the data is still available even if the single disk
fails, so it qualifies under that criterion.

> RAID isn't ideal for the latter because it doesn't ship the saved
> data off-site from the original data (or maybe a RAID array is
> conceivable over a network and a distance?).

Shipping the data off-site is helpful to protect against most possible
causes for data loss, such as damage to or theft of the on-site
equipment. (Or, for that matter, accidental deletion of the live data.)

It's not necessary to protect against some causes, however, such as
failure of a local disk. For that cause, RAID fulfills the purpose just
fine.

RAID does not protect against most of those other scenarios, however, so
there's certainly still a role for - and a reason to recommend! -
off-site backup. It's just that the existence of those options does not
mean RAID does not have a role to play in avoiding data loss, and
thereby a valid sense in which it can be considered to provide something
to fall back on, which is the approximate root meaning of the
nontechnical sense of "backup".

--
The Wanderer

The reasonable man adapts himself to the world; the unreasonable one
persists in trying to adapt the world to himself. Therefore all
progress depends on the unreasonable man. -- George Bernard Shaw


Nicolas George

Nov 10, 2022, 9:00:06 AM
Curt (12022-11-10):
> Maybe it's a question of intent more than anything else. I thought RAID
> was intended for a server scenario where if a disk fails, you're down
> time is virtually null, whereas as a backup is intended to prevent data
> loss.

Maybe just use common sense. RAID means your data is present on several
drives. You can just deduce what it can help for:

one drive fails → you can replace it immediately, no downtime

one drive fails → the data is present elsewhere, no data loss

several¹ drive fail → downtime and data loss²

1: depending on RAID level
2: or not if you have backups too

> RAID isn't ideal for the latter because it doesn't ship the saved
> data off-site from the original data (or maybe a RAID array is
> conceivable over a network and a distance?).

It is always a matter of compromise. You cannot duplicate your data
off-site at the same rate as you duplicate it on a second local drive.

That means your off-site data will survive an EMP, but you will lose
minutes / hours / days of data prior to the EMP. OTOH, RAID will not
survive an EMP, but it will prevent all data loss caused by isolated
hardware failure.

--
Nicolas George

Dan Ritter

Nov 10, 2022, 9:10:05 AM
hw wrote:
> And I've been reading that when using ZFS, you shouldn't make volumes with more
> than 8 disks. That's very inconvenient.


Where do you read these things?

The number of disks in a vdev can be optimized, depending on
your desired redundancy method, total number of drives, and
tolerance for reduced performance during resilvering.

Multiple vdevs together form a zpool. Filesystems are allocated from
a zpool.

8 is not a magic number.

-dsr-

Dan Ritter

Nov 10, 2022, 9:30:05 AM
Curt wrote:
> On 2022-11-08, The Wanderer <wand...@fastmail.fm> wrote:
> >
> > That more general sense of "backup" as in "something that you can fall
> > back on" is no less legitimate than the technical sense given above, and
> > it always rubs me the wrong way to see the unconditional "RAID is not a
> > backup" trotted out blindly as if that technical sense were the only one
> > that could possibly be considered applicable, and without any
> > acknowledgment of the limited sense of "backup" which is being used in
> > that statement.
> >
>
> Maybe it's a question of intent more than anything else. I thought RAID
> was intended for a server scenario where if a disk fails, your down
> time is virtually null, whereas a backup is intended to prevent data
> loss. RAID isn't ideal for the latter because it doesn't ship the saved
> data off-site from the original data (or maybe a RAID array is
> conceivable over a network and a distance?).

RAID means "redundant array of inexpensive disks". The idea, in the
name, is to bring together a bunch of cheap disks to mimic a single more
expensive disk, in a way which hopefully is more resilient to failure.

If you need a filesystem that is larger than a single disk (that you can
afford, or that exists), RAID is the name for the general approach to
solving that.

The three basic technologies of RAID are:

striping: increase capacity by writing parts of a data stream to N
disks. Can increase performance in some situations.

mirroring: increase resiliency by redundantly writing the same data to
multiple disks. Can increase performance of reads.

checksums/erasure coding: increase resilency by writing data calculated
from the real data (but not a full copy) that allows reconstruction of
the real data from a subset of disks. RAID5 allows one failure, RAID6
allows recovery from two simultaneous failures, fancier schemes may
allow even more.

You can work these together, or separately.
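The parity idea can be sketched in a few lines of shell arithmetic (a toy with
one byte per "disk", not anything like a real implementation):

```shell
# RAID-5-style parity: the parity block is the XOR of the data blocks,
# so any single lost block can be rebuilt from the survivors
d1=170                      # data block on disk 1 (one byte as a number)
d2=204                      # data block on disk 2
p=$(( d1 ^ d2 ))            # parity block stored on disk 3
# disk 1 dies; rebuild its block from disk 2 and the parity:
echo $(( p ^ d2 ))          # prints 170
```

Real implementations do this per stripe across whole sectors, and RAID6 adds a
second, differently computed syndrome so two failures are survivable.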

Now, RAID is not a backup because it is a single store of data: if you
delete something from it, it is deleted. If you suffer a lightning
strike to the server, there's no recovery from molten metal.

Some filesystems have snapshotting. Snapshotting can protect you from
the accidental deletion scenario, by allowing you to recover quickly,
but does not protect you from lightning.

The lightning scenario requires a copy of the data in some other
location. That's a backup.

You can store the backup on a RAID. You might need to store the backup
on a RAID, or perhaps by breaking it up into pieces to store on tapes or
optical disks or individual hard disks. The kind of RAID you choose for
the backup is not related to the kind of RAID you use on your primary
storage.

> Of course, I wouldn't know one way or another, but the complexity (and
> substantial verbosity) of this thread seem to indicate that that all
> these concepts cannot be expressed clearly and succinctly, from which I
> draw my own conclusions.

The fact that many people talk about things that they don't understand
does not restrict the existence of people who do understand it. Only
people who understand what they are talking about can do so clearly and
succinctly.

-dsr-

Greg Wooledge

Nov 10, 2022, 9:40:05 AM
On Thu, Nov 10, 2022 at 02:48:28PM +0100, hw wrote:
> On Thu, 2022-11-10 at 07:03 -0500, Greg Wooledge wrote:
> good idea:
>
> printf %s * | hexdump
> 0000000 77c2 6861 0074
> 0000005

Looks like there might be more than one file here.

> > If you misrepresented the situation, and there's actually more than one
> > file in this directory, then use something like this instead:
> >
> > shopt -s failglob
> > printf '%s\0' ? | hd
>
> shopt -s failglob
> printf '%s\0' ? | hexdump
> 0000000 00c2
> 0000002

OK, that's a good result.

> > Note that the ? is *not* quoted here, because we want it to match any
> > one-character filename, no matter what that character actually is.  If
> > this doesn't work, try ?? or * as the glob, until you manage to find it.
>
> printf '%s\0' ?? | hexdump
> -bash: Keine Entsprechung: ??
>
> (meaning something like "no equivalent")

The English version is "No match".

> printf '%s\0' * | hexdump
> 0000000 00c2 6177 7468 0000
> 0000007

I dislike this output format, but it looks like there are two files
here. The first is 0xc2, and the second is 0x77 0x61 0x68 0x74 if
I'm reversing and splitting the silly output correctly. (This spells
"waht", if I got it right.)
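The silliness is plain hexdump's default format, which prints little-endian
16-bit words rather than bytes; `hexdump -C` or `od -An -tx1` shows the bytes
in file order. A quick illustration with the same five bytes:

```shell
# the five bytes 0xc2 'w' 'a' 'h' 't' under the two output formats
printf '\xc2waht' | hexdump        # 0000000 77c2 6861 0074  (LE words)
printf '\xc2waht' | od -An -tx1    # c2 77 61 68 74          (file order)
```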

> > If it turns out that '?' really is the filename, then it becomes a ZFS
> > issue with which I can't help.
>
> I would think it is. Is it?

The file in question appears to have a name which is the single byte 0xc2.
Since that's not a valid UTF-8 character, ls chooses something to display
instead. In your case, it chose a '?' character. I'm guessing this is on
an older release of Debian.

In my case, it does this:

unicorn:~$ mkdir /tmp/x && cd "$_"
unicorn:/tmp/x$ touch $'\xc2'
unicorn:/tmp/x$ ls -la
total 80
-rw-r--r-- 1 greg greg 0 Nov 10 09:21 ''$'\302'
drwxr-xr-x 2 greg greg 4096 Nov 10 09:21 ./
drwxrwxrwt 20 root root 73728 Nov 10 09:21 ../

In my version of ls, there's a --quoting-style= option that can help
control what you see. But that's a tangent you can explore later.

Since we know the actual name of the file (subdirectory) now, let's just
rename it to something sane.

mv $'\xc2' subdir

Then you can investigate it, remove it, or do whatever else you want.
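Putting the whole procedure together in a throwaway directory (a sketch; the
target name "recovered" is just an example):

```shell
# end-to-end: create a file whose name is the lone byte 0xc2, recover
# the real name via the shell glob, and rename it to something sane
dir=$(mktemp -d) && cd "$dir"
touch $'\xc2'
name=$(printf '%s' *)               # the glob expands to the true name
printf '%s' "$name" | od -An -tx1   # shows: c2
mv "$name" recovered
ls                                  # recovered
```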

Thomas Schmitt

Nov 10, 2022, 9:40:05 AM
Hi,

i wrote:
> > the time window in which the backuped data
> > can become inconsistent on the application level.

hw wrote:
> Or are you referring to the data being altered while a backup is in
> progress?

Yes. Data of different files or at different places in the same file
may have relations which may become inconsistent during change operations
until the overall change is complete.
If you are unlucky you can even catch a plain text file that is only half
stored.

The risk for this is not 0 with filesystem snapshots, but it grows further
if there is a time interval during which changes may or may not be copied
into the backup, depending on filesystem internals and bad luck.


> Would you even make so many backups on the same machine?

It depends on the alternatives.
If you have other storage systems which can host backups, then it is of
course good to use them for backup storage. But if you have less separate
storage than independent backups, then it is still worthwhile to put more
than one backup on the same storage.


> Isn't 5 times a day a bit much?

It depends on how much you are willing to lose in case of a mishap.
My $HOME backup runs last about 90 seconds each. So it is not overly
cumbersome.


> And it's an odd number.

That's because the early afternoon backup is done twice. (A tradition
which started when one of my BD burners began to become unreliable.)


> Yes, I'm re-using the many small hard discs that have accumulated over the
> years.

If it's only their size which disqualifies them for production purposes,
then it's ok. But if they are nearing the end of their life time, then
i would consider to decommission them.


> I wish we could still (relatively) easily make backups on tapes.

My personal endeavor with backups on optical media began when a customer
had a major data mishap and all backup tapes turned out to be unusable.
Frequent backups had been made and allegedly check-read. But in the
end it was big drama.
I then proposed to use a storage where the boss of the department can
make random tests with the applications which made and read the files.
So i came to writing backup scripts which used mkisofs and cdrecord
for CD-RW media.


> Just change
> the tape every day and you can have a reasonable number of full backups.

If you have thousandfold the size of Blu-rays worth of backup, then
probably a tape library would be needed. (I find LTO tapes with up to
12 TB in the web, which is equivalent to 480 BD-R.)


> A full new backup takes ages

It would help if you could divide your backups into small agile parts and
larger parts which don't change often.
The agile ones need frequent backup, whereas the lazy ones would not suffer
so much damage if the newest available backup is a few days old.


> I need to stop modifying stuff and not start all over again

The backup part of a computer system should be its most solid and artless
part. No shortcuts, no fancy novelties, no cumbersome user procedures.


Have a nice day :)

Thomas

The Wanderer

Nov 10, 2022, 9:40:06 AM
On 2022-11-10 at 09:06, Dan Ritter wrote:

> Now, RAID is not a backup because it is a single store of data: if
> you delete something from it, it is deleted. If you suffer a
> lightning strike to the server, there's no recovery from molten
> metal.

Here's where I find disagreement.

Say you didn't use RAID, and you had two disks in the same machine.

In order to avoid data loss in the event that one of the disks failed,
you engaged in a practice of copying all files from one disk onto the other.

That process could, and would, easily be referred to as backing up the
files. It's not a very distant backup, and it wouldn't protect against
that lightning strike, but it's still a separate backed-up copy.

But copying those files manually is a pain, so you might well set up a
process to automate it. That then becomes a scheduled backup, from one
disk onto another.

That scheduled process means that you have periods where the most
recently updated copy of the live data hasn't made it into the backup,
so there's still a time window where you're at risk of data loss if the
first disk fails. So you might set things up for the automated process
to in effect run continuously, writing the data to both disks in
parallel as it comes in.

And at that point you've basically reinvented mirroring RAID.

You've also lost the protection against "if you delete something from
it"; unlike deeper, more robust forms of backup, RAID does not protect
against accidental deletion. But you still have the protection against
"if one disk fails" - and that one single layer of protection against
one single cause of data loss is, I contend, still valid to refer to as
a "backup" just as much as the original manually-made copies were.

> Some filesystems have snapshotting. Snapshotting can protect you
> from the accidental deletion scenario, by allowing you to recover
> quickly, but does not protect you from lightning.
>
> The lightning scenario requires a copy of the data in some other
> location. That's a backup.

There are many possible causes of data loss. My contention is that
anything that protects against *any* of them qualifies as some level of
backup, and that there are consequently multiple levels / tiers /
degrees / etc. of backup.

RAID is not an advanced form of protection against data loss; it only
protects against one type of cause. But it still does protect against
that one type, and thus it is not valid to kick it out of that circle
entirely.

Curt

Nov 10, 2022, 9:50:05 AM
On 2022-11-10, Nicolas George <geo...@nsup.org> wrote:
> Curt (12022-11-10):
>> Maybe it's a question of intent more than anything else. I thought RAID
>> was intended for a server scenario where if a disk fails, you're down
>> time is virtually null, whereas as a backup is intended to prevent data
>> loss.
>
> Maybe just use common sense. RAID means your data is present on several
> drives. You can just deduce what it can help for:
>
> one drive fails → you can replace it immediately, no downtime

That's precisely what I said, so I'm baffled by the redundancy of your
words. Or are you a human RAID?

Nicolas George

Nov 10, 2022, 10:00:06 AM
Curt (12022-11-10):
> > one drive fails → you can replace it immediately, no downtime
> That's precisely what I said,

I was not stating that THIS PART of what you said was wrong.

> so I'm baffled by the redundancy of your
> words.

Hint: my mail did not stop at the line you quoted. Reading mails to the
end is usually a good practice to avoid missing information.

--
Nicolas George

Nicolas George

Nov 10, 2022, 10:10:06 AM
Curt (12022-11-10):
> Why restate it then needlessly?

To NOT state that you were wrong when you were not.

This branch of the discussion bores me. Goodbye.

--
Nicolas George

Curt

Nov 10, 2022, 10:10:06 AM
On 2022-11-10, Nicolas George <geo...@nsup.org> wrote:
> Curt (12022-11-10):
>> > one drive fails → you can replace it immediately, no downtime
>> That's precisely what I said,
>
> I was not stating that THIS PART of what you said was srong.

Why restate it then needlessly?

>> so I'm baffled by the redundancy of your
>> words.
>
> Hint: my mail did not stop at the line you quoted. Reading mails to the
> end is usually a good practice to avoid missing information.
>

It's also an insect repellent.

Curt

Nov 10, 2022, 10:20:05 AM
On 2022-11-10, Nicolas George <geo...@nsup.org> wrote:
> Curt (12022-11-10):
>> Why restate it then needlessly?
>
> To NOT state that you were wrong when you were not.
>
> This branch of the discussion bores me. Goodbye.
>

This isn't solid enough for a branch. It couldn't support a hummingbird.
And me too! That old ennui! Adieu!

Dan Ritter

Nov 10, 2022, 10:30:05 AM
Brad Rogers wrote:
> On Thu, 10 Nov 2022 08:48:43 -0500
> Dan Ritter <d...@randomstring.org> wrote:
>
> Hello Dan,
>
> >8 is not a magic number.
>
> Clearly, you don't read Terry Pratchett. :-)

In the context of ZFS, 8 is not a magic number.

May you be ridiculed by Pictsies.

-dsr-

d...@chris.oldnest.ca

Nov 10, 2022, 10:50:05 AM
On Wed, 09 Nov 2022 13:28:46 +0100
hw <h...@adminart.net> wrote:

> On Tue, 2022-11-08 at 09:52 +0100, DdB wrote:
> > Am 08.11.2022 um 05:31 schrieb hw:
> > > > That's only one point.
> > > What are the others?
> > >
> > > >  And it's not really some valid one, I think, as
> > > > you do typically not run into space problems with one single
> > > > action (YMMV). Running multiple sessions and out-of-band
> > > > deduplication between them works for me.
> > > That still requires you to have enough disk space for at least
> > > two full backups.
> > > I can see it working for three backups because you can
> > > deduplicate the first two, but not for two.  And why would I
> > > deduplicate when I have sufficient disk
> > > space.
> > >
> > Your wording likely confuses 2 different concepts:
>
> Noooo, I'm not confusing that :) Everyone says so and I don't know
> why ...
>
> > Deduplication avoids storing identical data more than once.
> > whereas
> > Redundancy stores information on more than one place on purpose to
> > avoid loos of data in case of havoc.
> > ZFS can do both, as it combines the features of a volume manager
> > with those of a filesystem and a software RAID.( I am using
> > zfsonlinux since its early days, for over 10 years now, but without
> > dedup. )
> >
> > In the past, i used shifting/rotating external backup media for that
> > purpose, because, as the saying goes: RAID is NOT a backup! Today, i
> > have a second server only for the backups, using zfs as well, which
> > allows for easy incremental backups, minimizing traffic and disk
> > usage.
> >
> > but you should be clear as to what you want: redundancy or
> > deduplication?
>
> The question is rather if it makes sense to have two full backups on
> the same machine for redundancy and to be able to go back in time, or
> if it's better to give up on redundancy and to have only one copy and
> use snapshots or whatever to be able to go back in time.

And the answer is no. The redundancy you gain from this is almost,
though not quite, meaningless, because of the large set of common
data-loss scenarios against which it offers no protection. You've made
it clear that the cost of storage media is a problem in your situation.
Doubling your backup server's requirement for scarce and expensive disk
space in order to gain a tiny fraction of the resiliency that's
normally implied by "redundancy" doesn't make sense. And being able to
go "back in time" can be achieved much more efficiently by using a
solution (be it off-the-shelf or roll-your-own) that starts with a full
backup and then just stores deltas of changes over time (aka incremental
backups). None of this, for the record, is "deduplication", and I
haven't seen any indication in this thread so far that actual
deduplication is relevant to your use case.

> Of course it would better to have more than one machine, but I don't
> have that.

Fine, just be realistic about the fact that this means you cannot in
any meaningful sense have "two full backups" or "redundancy". If and
when you can some day devote an RPi tethered to some disks to the job,
then you can set it up to hold a second, completely independent,
store of "full backup plus deltas". And *then* you would have
meaningful redundancy that offers some real resilience. Even better if
the second one is physically offsite.

In the meantime, storing multiple full copies of your data on one
backup server is just a way to rapidly run out of disk space on your
backup server for essentially no reason.


Cheers!
-Chris

hw

Nov 10, 2022, 10:50:06 AM
On Wed, 2022-11-09 at 21:36 -0800, David Christensen wrote:
> On 11/9/22 00:24, hw wrote:
>  > On Tue, 2022-11-08 at 17:30 -0800, David Christensen wrote:
>
>  > Hmm, when you can backup like 3.5TB with that, maybe I should put
> FreeBSD on my
>  > server and give ZFS a try.  Worst thing that can happen is that it
> crashes and
>  > I'd have made an experiment that wasn't successful.  Best thing, I
> guess, could
>  > be that it works and backups are way faster because the server
> doesn't have to
>  > actually write so much data because it gets deduplicated and reading
> from the
>  > clients is faster than writing to the server.
>
>
> Be careful that you do not confuse a ~33 GiB full backup set, and 78
> snapshots over six months of that same full backup set, with a full
> backup of 3.5 TiB of data.  I would suggest a 10 TiB pool to backup the
> latter.

The full backup isn't deduplicated?

> Writing to a ZFS filesystem with deduplication is much slower than
> simply writing to, say, an ext4 filesystem -- because ZFS has to hash
> every incoming block and see if it matches the hash of any existing
> block in the destination pool.  Storing the existing block hashes in a
> dedicated dedup virtual device will expedite this process.

But when it needs to write almost nothing because almost everything gets
deduplicated, can't it be faster than having to write everything?
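Whether that pays off can be estimated without ZFS at all: hashing fixed-size
blocks in userspace mimics the per-block check described above (a rough sketch,
not a ZFS tool; the input path is made up):

```shell
# count total vs unique 128 KiB blocks in a file to estimate how much a
# block-level dedup could save on it
f=/path/to/backup.img                       # hypothetical input file
split -b 131072 --filter='sha256sum' "$f" > /tmp/block-hashes
total=$(wc -l < /tmp/block-hashes)
unique=$(sort -u /tmp/block-hashes | wc -l)
echo "$total blocks, $unique unique"
```

If "unique" is close to "total", dedup would buy little for that data and only
add hashing overhead on every write.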

>  >> I run my backup script each night.  It uses rsync to copy files and
>  >
>  > Aww, I can't really do that because my server eats like 200-300W
> because it has
>  > so many disks in it.  Electricity is outrageously expensive here.
>
>
> Perhaps platinum rated power supplies?  Energy efficient HDD's/ SSD's?

If you pay for it ... :)

Running it once in a while for some hours to make backups is still possible.
Replacing the hardware is way more expensive.

> [...]
>  > Sounds like a nice setup.  Does that mean you use snapshots to keep
> multiple
>  > generations of backups and make backups by overwriting everything
> after you made
>  > a snapshot?
>
> Yes.

I start thinking more and more that I should make use of snapshots.

>  > In that case, is deduplication that important/worthwhile?  You're not
>  > duplicating it all by writing another generation of the backup but
> store only
>  > what's different through making use of the snapshots.
>
> Without deduplication or compression, my backup set and 78 snapshots
> would require 3.5 TiB of storage.  With deduplication and compression,
> they require 86 GiB of storage.

Wow that's quite a difference! What makes this difference, the compression or
the deduplication? When you have snapshots, you would store only the
differences from one snapshot to the next, and that would mean that there aren't
so many duplicates that could be deduplicated.

>  > ... I only never got around to figure [ZFS snapshots] out because I
> didn't have the need.
>
>
> I accidentally trash files on occasion.  Being able to restore them
> quickly and easily with a cp(1), scp(1), etc., is a killer feature.

indeed

> Users can recover their own files without needing help from a system
> administrator.

You have users who know how to get files out of snapshots?

>  > But it could also be useful for "little" things like taking a
> snapshot of the
>  > root volume before updating or changing some configuration and being
> able to
>  > easily to undo that.
>
>
> FreeBSD with ZFS-on-root has a killer feature called "Boot Environments"
> that has taken that idea to the next level:
>
> https://klarasystems.com/articles/managing-boot-environments/

That's really cool. Linux is missing out on a lot by treating ZFS as an alien.

I guess btrfs could, in theory, make something like boot environments possible,
but you can't really boot from btrfs because it will fail to boot as soon as
the boot volume is degraded, like when a disk has failed.  Then you're screwed
because you can't log in through ssh to fix anything and have to physically go
to the machine to get it back up.  That's a non-option, so you have to use
something other than btrfs to boot from.

>  >> I have 3.5 TiB of backups.
>
>
> It is useful to group files with similar characteristics (size,
> workload, compressibility, duplicates, backup strategy, etc.) into
> specific ZFS filesystems (or filesystem trees).  You can then adjust ZFS
> properties and backup strategies to match.

That's a good idea.

>  >>>> For compressed and/or encrypted archives, image, etc., I do not use
>  >>>> compression or de-duplication
>  >>>
>  >>> Yeah, they wouldn't compress.  Why no deduplication?
>  >>
>  >>
>  >> Because I very much doubt that there will be duplicate blocks in
> such files.
>  >
>  > Hm, would it hurt?
>
>
> Yes.  ZFS deduplication is resource intensive.

But you're using it already.

>  > Oh it's not about performance when degraded, but about performance.
> IIRC when
>  > you have a ZFS pool that uses the equivalent of RAID5, you're still
> limited to
>  > the speed of a single disk.  When you have a mysql database on such a ZFS
>  > volume, it's dead slow, and removing the SSD cache when the SSDs
> failed didn't
>  > make it any slower.  Obviously, it was a bad idea to put the database
> there, and
>  > I wouldn't do again when I can avoid it.  I also had my data on such
> a volume
>  > and I found that the performance with 6 disks left much to desire.
>
>
> What were the makes and models of the 6 disks?  Of the SSD's?  If you
> have a 'zpool status' console session from then, please post it.

They were (and still are) 6x4TB WD Red (though one or two have failed over time)
and two Samsung 850 PRO, IIRC. I don't have an old session anymore.

These WD Red are slow to begin with.  IIRC, both SSDs failed and I removed them.

The other instance didn't use SSDs but 6x2TB HGST Ultrastar. Those aren't
exactly slow but ZFS is slow.

> Constructing a ZFS pool to match the workload is not easy.

Well, back then there wasn't much information because ZFS was a pretty new
thing.

>   STFW there
> are plenty of articles.  Here is a general article I found recently:
>
> https://klarasystems.com/articles/choosing-the-right-zfs-pool-layout/

Thanks! If I make a zpool for backups (or anything else), I need to do some
reading beforehand anyway.

> MySQL appears to have the ability to use raw disks.  Tuned correctly,
> this should give the best results:
>
> https://dev.mysql.com/doc/refman/8.0/en/innodb-system-tablespace.html#innodb-raw-devices

Could mysql 5.6 already do that? I'll have to see if mariadb can do that now
...

> If ZFS performance is not up to your expectations, and there are no
> hardware problems, next steps include benchmarking, tuning, and/or
> adding or adjusting the hardware and its usage.

In theory, yes :)

I'm very reluctant to mess with the default settings of file systems.  When xfs
became available for Linux sometime in the '90s, I managed to lose data when an
xfs file system got messed up.  Fortunately, I was able to recover almost all of
it from backups and from the file system.  I never really found out what caused
it, but a long time later I figured that I probably hadn't used the mount
options I should have used.  I had messed with the defaults for some reason I
don't remember.  That taught me a lesson.

>  >> ... invest in hardware to get performance.
>
>  > Hardware like?
>
>
> Server chassis, motherboards, chipsets, processors, memory, disk host
> bus adapters, disk racks, disk drives, network interface cards, etc..

Well, who's gonna pay for that?

>  > In theory, using SSDs for cache with ZFS should improve
>  > performance.  In practise, it only wore out the SSDs after a while,
> and now it's
>  > not any faster without SSD cache.
>
>
> Please run 'zpool status' and post the console session (prompt, command
> entered, output displayed).  Please correlate the vdev's to disk drive
> makes and models.

See above ... The pool is a raidz1-0 with the 6x4TB Red drives, and no SSDs are
left.

> On 11/9/22 03:41, hw wrote:
>
> > I don't have anything without ECC RAM,
>
>
> Nice.

Yes :)  Buying used has its advantages.  You don't get the fastest, but you get
tons of ECC RAM and awesome CPUs and reliability.

> > and my server was never meant for ZFS.
>
>
> What is the make and model of your server?

I put it together myself. The backup server uses a MSI mainboard with the
designation S0121 C204 SKU in a Chenbro case that has a 16xLFF backplane. It
has only 16GB RAM and would max out at 32GB. Unless you want ZFS with
deduplication, that's more than enough to make backups :)

I could replace it with a Dell R720 to get more RAM, but those can have only
12xLFF.  I could buy a new Tyan S7012 WGM4NR for EUR 50 before they're sold out
and stuff at least 48GB RAM into it plus two X5690 Xeons (which are supposed to
go into a Z800 I have sitting around and could try to sell, but I'm lazy), but
then I'd probably have to buy CPU coolers for it (I'm not sure the coolers of
the Z800 fit) and a new UPS because it would need so much power.  (I also have
the 48GB because they came in a server I bought for the price of the X5690s (to
get the X5690s) and another 48GB in the Z800, but not all of it might fit ...)

It would be fun, but I don't really feel like throwing money at technology that
old just for making a backup once in a while.  If you can tell me that the
coolers of the Z800 definitely fit the Tyan board, I'll buy one and throw it
into my server.  It would be worth spending the EUR 50.  Hm, maybe I should find
out, but that'll be difficult ... and the fan connectors won't fit even if the
coolers do.  They're 4-pin.  ... Ok, I could replace the fans, but I don't have
any 90mm fans.  Those shouldn't cost too much, though.

> > With mirroring, I could fit only one backup, not two.
>
>
> Add another mirror to your pool.  Or, use a process of substitution and
> resilvering to replace existing drives with larger capacity drives.

Lol, I can't create a pool in thin air. Wouldn't it be great if ZFS could do
that? :) Use ambient air for storage ... just make sure the air doesn't escape
;)

There's nothing to resilver, the backup server is currently using btrfs.

Have you checked disk prices recently?  Maybe I'll get lucky on Black Friday,
but if I get some, they'll go into my active server.

> > In any case, I'm currently tending to think that putting FreeBSD with ZFS on
> > my
> > server might be the best option.  But then, apparently I won't be able to
> > configure the controller cards, so that won't really work.
>
>
> What is the make and model of your controller cards?

They're HP smart array P410. FreeBSD doesn't seem to support those.

> [...]
> I have a Debian VM and no contrib ... hm, zfs-dkms and such?  That's
> promising,
>
>
> +1
>
> https://packages.debian.org/bullseye/zfs-dkms
>
> https://packages.debian.org/bullseye/zfsutils-linux

yeah

> [...]
> If you already have a ZFS pool, the way to back it up is to replicate
> the pool to another pool.  Set up an external drive with a pool and
> replicate your server pool to that periodically.

No, the data to back up is mostly (or even all) on btrfs.  IIRC, btrfs has a
send feature, but I'd rather not do anything complicated and just copy the
files over with rsync.  It's not like I could replicate some volume/pool because
the data comes from different machines and all backs up to one volume.
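A pull-style rsync run of that sort can be sketched roughly like this (the host
names and paths here are made up):

```shell
# Pull each machine's data into its own subdirectory on the backup volume.
# -a: archive mode, -H: preserve hard links, -A/-X: preserve ACLs and xattrs.
for host in alpha beta gamma; do
    rsync -aHAX --delete --numeric-ids \
        "root@${host}:/home/" "/srv/backup/${host}/home/"
done
```

Note that --delete prunes files that disappeared on the source, so it pairs
naturally with taking a snapshot of the backup volume before each run if you
want older generations.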

hw

unread,
Nov 10, 2022, 11:40:05 AM11/10/22
to
On Thu, 2022-11-10 at 10:47 +0100, DdB wrote:
> Am 10.11.2022 um 06:38 schrieb David Christensen:
> > What is your technique for defragmenting ZFS?
> well, that was meant more or less a joke: there is none apart from
> offloading all the data, destroying and rebuilding the pool, and filling
> it again from the backup. But i do it from time to time if fragmentation
> got high, the speed improvements are obvious. OTOH the process takes
> days on my SOHO servers
>

Does the faster access afterwards save you more time than the days you spend on
defragmenting?

Perhaps after so many days of not defragging, but how many days?

Maybe use an archive pool that doesn't get deleted from?

hw

unread,
Nov 10, 2022, 11:40:06 AM11/10/22
to
On Thu, 2022-11-10 at 02:19 -0500, gene heskett wrote:
> On 11/10/22 00:37, David Christensen wrote:
> > On 11/9/22 00:24, hw wrote:
> >  > On Tue, 2022-11-08 at 17:30 -0800, David Christensen wrote:
>
> [...]
> Which brings up another suggestion in two parts:
>
> 1: use amanda, with tar and compression to reduce the size of the
> backups.  And use a backup cycle of a week or 2 because amanda will if
> advancing a level, only backup that which has been changed since the
> last backup. On a quiet system, a level 3 backup for a 50gb network of
> several machines can be under 100 megs. More on a busy system of course.
> Amanda keeps track of all that automatically.

Amanda is nice, yet quite unwieldy (try getting a file out of the backups ...).
I used it a long time ago (with tapes), and I'd have to remember or re-learn how
to use amanda to back up particular directories and such ...

I think I might be better off learning more about snapshots.

> 2: As disks fail, replace them with SSD's which use much less power than
> spinning rust. And they are typically 5x faster than commodity spinning
> rust.

Is this a joke?

https://www.dell.com/en-us/shop/visiontek-16tb-class-qlc-7mm-25-ssd/apd/ab329068/storage-drives-media

Cool, a 30% discount on Black Friday saves you $2280 for every pair of disks,
and it even starts right now. (Do they really mean that? What if I had a datacenter
and ordered 512 or so of them? I'd save almost $1.2 million, what a great
deal!)

And mind you, SSDs are *designed to fail*: the more data you write to them, the
sooner they fail.  They have their uses, maybe even for storage if you're so
desperate, but not for backup storage.

> Here, and historically with spinning rust, backing up 5 machines, at 3am
> every morning is around 10gb total and under 45 minutes. This includes
> the level 0's it does by self adjusting the schedule to spread the level
> 0's, AKA the fulls, out over the backup cycle so the amount of storage
> used for any one backup run is fairly consistent.

That's almost half a month for 4TB. Why does it take so long?

Michael Stone

unread,
Nov 10, 2022, 12:00:05 PM11/10/22
to
On Thu, Nov 10, 2022 at 05:34:32PM +0100, hw wrote:
>And mind you, SSDs are *designed to fail* the sooner the more data you write to
>them. They have their uses, maybe even for storage if you're so desperate, but
>not for backup storage.

It's unlikely you'll "wear out" your SSDs faster than you wear out your
HDs.

hw

unread,
Nov 10, 2022, 12:30:05 PM11/10/22
to
On Wed, 2022-11-09 at 14:22 +0100, Nicolas George wrote:
> hw (12022-11-08):
> > When I want to have 2 (or more) generations of backups, do I actually want
> > deduplication?  It leaves me with only one actual copy of the data which
> > seems
> > to defeat the idea of having multiple generations of backups at least to
> > some
> > extent.
>
> The idea of having multiple generations of backups is not to have the
> data physically present in multiple places, this is the role of RAID.
>
> The idea if having multiple generations of backups is that if you
> accidentally overwrite half your almost-completed novel with lines of
> ALL WORK AND NO PLAY MAKES JACK A DULL BOY and the backup tool runs
> before you notice it, you still have the precious data in the previous
> generation.

Nicely put :)

Let me rephrase a little:

How likely is it that a storage volume (not the underlying media, like disks in
a RAID array) would become unreadable in only some places, so that it could be
an advantage to have multiple copies of the same data on the volume?

It's like I can't help unconsciously thinking that it's an advantage to have
multiple copies on a volume for some reason other than not overwriting the
almost-complete novel.  At the same time, I find it difficult to imagine how
a volume could get damaged only in some places, and I don't see other reasons
than that.

Ok, another reason to keep multiple full copies on a volume is making things
simple, easy and thus perhaps more reliable than more complicated solutions. At
least that's an intention. But it costs a lot of disk space.

hw

unread,
Nov 10, 2022, 1:00:06 PM11/10/22
to
I have already done that.

hw

unread,
Nov 10, 2022, 1:00:06 PM11/10/22
to
On Thu, 2022-11-10 at 09:30 -0500, Greg Wooledge wrote:
> On Thu, Nov 10, 2022 at 02:48:28PM +0100, hw wrote:
> > On Thu, 2022-11-10 at 07:03 -0500, Greg Wooledge wrote:
>
> [...]
> > printf '%s\0' * | hexdump
> > 0000000 00c2 6177 7468 0000                   
> > 0000007
>
> I dislike this output format, but it looks like there are two files
> here.  The first is 0xc2, and the second is 0x77 0x61 0x68 0x74 if
> I'm reversing and splitting the silly output correctly.  (This spells
> "waht", if I got it right.)
> >

Ah, yes. I tricked myself because I don't have hd installed, so I redirected
the output of printf into a file --- which I wanted to name 'what' but I
mistyped as 'waht' --- so I could load it into emacs and use hexl-mode. But the
display kinda sucked and I found I have hexdump installed and used that.
Meanwhile I totally forgot about the file I had created.

> [...]
> >
> The file in question appears to have a name which is the single byte 0xc2.
> Since that's not a valid UTF-8 character, ls chooses something to display
> instead.  In your case, it chose a '?' character.

I'm the only one who can create files there, and I didn't create that.  Using
0xc2 as a file name speaks loudly against the idea that I'd have created it
accidentally.

>   I'm guessing this is on
> an older release of Debian.

It's an ancient Gentoo which couldn't be updated in years because they broke the
update process. Back then, Gentoo was the only Linux distribution that didn't
need fuse for ZFS that I could find.

> In my case, it does this:
>
> unicorn:~$ mkdir /tmp/x && cd "$_"
> unicorn:/tmp/x$ touch $'\xc2'
> unicorn:/tmp/x$ ls -la
> total 80
> -rw-r--r--  1 greg greg     0 Nov 10 09:21 ''$'\302'
> drwxr-xr-x  2 greg greg  4096 Nov 10 09:21  ./
> drwxrwxrwt 20 root root 73728 Nov 10 09:21  ../
>
> In my version of ls, there's a --quoting-style= option that can help
> control what you see.  But that's a tangent you can explore later.
>
> Since we know the actual name of the file (subdirectory) now, let's just
> rename it to something sane.
>
> mv $'\xc2' subdir
>
> Then you can investigate it, remove it, or do whatever else you want.

Cool, I've renamed it, thank you very much :)  I'm afraid that the file system
will crash when I remove it ...  It's an empty directory.  Ever since I noticed
it, I couldn't do anything with it and thought it was some bug in the file
system.

Greg Wooledge

unread,
Nov 10, 2022, 1:10:05 PM11/10/22
to
On Thu, Nov 10, 2022 at 06:54:31PM +0100, hw wrote:
> Ah, yes. I tricked myself because I don't have hd installed,

It's just a symlink to hexdump.

lrwxrwxrwx 1 root root 7 Jan 20 2022 /usr/bin/hd -> hexdump

unicorn:~$ dpkg -S usr/bin/hd
bsdextrautils: /usr/bin/hd
unicorn:~$ dpkg -S usr/bin/hexdump
bsdextrautils: /usr/bin/hexdump

> It's an ancient Gentoo

Ahhhh. Anyway, from the Debian man page:

-C, --canonical
    Canonical hex+ASCII display.  Display the input offset in hexadecimal,
    followed by sixteen space-separated, two-column, hexadecimal bytes,
    followed by the same sixteen bytes in %_p format enclosed in '|'
    characters.  Invoking the program as hd implies this option.

Why on earth the default format of "hexdump" uses that weird 16-bit
little endian nonsense is beyond me.
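A quick way to see the difference, assuming the bsdextrautils hexdump, is to
feed it the same four bytes both ways:

```shell
# Default format: 16-bit little-endian words, so byte pairs appear swapped
printf 'waht' | hexdump
# Canonical format: one byte at a time, plus an ASCII column
printf 'waht' | hexdump -C
```

In the default output 'waht' shows up as the words 6177 7468 (as in the thread
above); with -C it is the plain byte sequence 77 61 68 74 followed by |waht|.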

Linux-Fan

unread,
Nov 10, 2022, 4:41:42 PM11/10/22
to
hw writes:

> On Wed, 2022-11-09 at 19:17 +0100, Linux-Fan wrote:
> > hw writes:
> > > On Wed, 2022-11-09 at 14:29 +0100, didier gaumet wrote:
> > > > Le 09/11/2022 à 12:41, hw a écrit :

[...]

> > > I'd
> > > have to use mdadm to create a RAID5 (or use the hardware RAID but that
> > > isn't
> >
> > AFAIK BTRFS also includes some integrated RAID support such that you do
> > not necessarily need to pair it with mdadm.
>
> Yes, but RAID56 is broken in btrfs.
>
> > It is advised against using for RAID 
> > 5 or 6 even in most recent Linux kernels, though:
> >
> > https://btrfs.readthedocs.io/en/latest/btrfs-man5.html#raid56-status-and-recommended-practices
>
> Yes, that's why I would have to use btrfs on mdadm when I want to make a
> RAID5.
> That kinda sucks.
>
> > RAID 5 and 6 have their own issues you should be aware of even when
> > running 
> > them with the time-proven and reliable mdadm stack. You can find a lot of 
> > interesting results by searching for “RAID5 considered harmful” online.
> > This 
> > one is the classic that does not seem to make it to the top results,
> > though:
>
> Hm, really? The only time that RAID5 gave me trouble was when the hardware

[...]

I have never used RAID5 so how would I know :)

I think the arguments of the RAID5/6 critics summarized were as follows:

* Running a RAID at level 5 or 6 significantly degrades performance
while a disk is offline. RAID 10 keeps most of its speed and
RAID 1 only degrades slightly for most use cases.

* During a rebuild, RAID 5 and 6 are known to degrade performance more
than the other RAID levels do.

* Disk space has become so cheap that the savings of RAID5 may
no longer justify the performance and reliability degradation
compared to RAID1 or 10.

All of these arguments come from a “server” point of view where it is
assumed that

(1) You win something by running the server so you can actually
tell that there is an economic value in it. This allows for
arguments like “storage is cheap” which may not be the case at
all if you are using up some tightly limited private budget.

(2) Uptime and delivering the service is paramount. Hence there
are some considerations regarding the online performance of
the server while the RAID is degraded and while it is restoring.
If you are fine to take your machine offline or accept degraded
performance for prolonged times then this does not apply of
course. If you do not value the uptime making actual (even
scheduled) copies of the data may be recommendable over
using a RAID because such schemes may (among other advantages)
protect you from accidental file deletions, too.

Also note that in today's computing landscape, not all unwanted file
deletions are accidental. With the advent of “crypto trojans” adversaries
exist that actually try to encrypt or delete your data to extort a ransom.

> More than one disk can fail? Sure can, and it's one of the reasons why I
> make
> backups.
>
> You also have to consider costs. How much do you want to spend on storage
> and
> and on backups? And do you want make yourself crazy worrying about your
> data?

I am pretty sure that if I break my PC down into GPU, CPU, RAM and storage, I
actually spent the most on storage. Well-established schemes of redundancy and
backups make me worry less about my data.

I still worry enough about backups to have written my own software:
https://masysma.net/32/jmbb.xhtml
and that I am also evaluating new developments in that area to probably
replace my self-written program by a more reliable (because used by more
people!) alternative:
https://masysma.net/37/backup_tests_borg_bupstash_kopia.xhtml

> > https://www.baarf.dk/BAARF/RAID5_versus_RAID10.txt
> >
> > If you want to go with mdadm (irrespective of RAID level), you might also 
> > consider running ext4 and trade the complexity and features of the
> > advanced file systems for a good combination of stability and support.
>
> Is anyone still using ext4? I'm not saying it's bad or anything, it only
> seems that it has gone out of fashion.

IIRC it's still Debian's default. It's my file system of choice unless I have
very specific reasons against it. I have never seen it fail outside of
hardware issues. Performance of ext4 is quite acceptable out of the box.
E.g. it seems to be slightly faster than ZFS for my use cases.
Almost every Linux live system can read it. There are no problematic
licensing or stability issues whatsoever. By its popularity it's probably one
of the most widely-deployed Linux file systems, which may enhance the chance
that whatever problem you incur with ext4 someone else has had before...

> I'm considering using snapshots. Ext4 didn't have those last time I checked.

Ext4 still does not offer snapshots. The traditional way to do snapshots
outside of fancy BTRFS and ZFS file systems is to add LVM to the equation
although I do not have any useful experience with that. Specifically, I am
not using snapshots at all so far, besides them being readily available on
ZFS :)
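For what it's worth, the LVM route would look roughly like this (untested by
me; the volume group and LV names are hypothetical, and the snapshot needs
enough reserved space to absorb the changes made while it exists):

```shell
# Take a copy-on-write snapshot of an existing logical volume
lvcreate --size 5G --snapshot --name root-snap /dev/vg0/root
# Mount it read-only to pull files out of it
mount -o ro /dev/vg0/root-snap /mnt/snap
# Remove it once it is no longer needed
umount /mnt/snap
lvremove /dev/vg0/root-snap
```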

HTH and YMMV
Linux-Fan

öö

Dan Ritter

unread,
Nov 10, 2022, 8:50:05 PM11/10/22
to
Linux-Fan wrote:
> I think the arguments of the RAID5/6 critics summarized were as follows:
>
> * Running in a RAID level that is 5 or 6 degrades performance while
> a disk is offline significantly. RAID 10 keeps most of its speed and
> RAID 1 only degrades slightly for most use cases.
>
> * During restore, RAID5 and 6 are known to degrade performance more compared
> to restoring one of the other RAID levels.

* RAID 5 and 6 restoration incurs additional stress on the other
disks in the RAID which makes it more likely that one of them
will fail. The advantage of RAID 6 is that it can then recover
from that...

* RAID 10 gets you better read performance in terms of both
throughput and IOPS relative to the same number of disks in
RAID 5 or 6. Most disk activity is reading.

> * Disk space has become so cheap that the savings of RAID5 may
> no longer rectify the performance and reliability degradation
> compared to RAID1 or 10.

I think that's a case-by-case basis. Every situation is
different, and should be assessed for cost, reliability and
performance concerns.

> All of these arguments come from a “server” point of view where it is
> assumed that
>
> (1) You win something by running the server so you can actually
> tell that there is an economic value in it. This allows for
> arguments like “storage is cheap” which may not be the case at
> all if you are using up some thightly limited private budget.
>
> (2) Uptime and delivering the service is paramount. Hence there
> are some considerations regarding the online performance of
> the server while the RAID is degraded and while it is restoring.
> If you are fine to take your machine offline or accept degraded
> performance for prolonged times then this does not apply of
> course. If you do not value the uptime making actual (even
> scheduled) copies of the data may be recommendable over
> using a RAID because such schemes may (among other advantages)
> protect you from accidental file deletions, too.

Even in household situations, knowing that you could have traded $100
last year for a working computer right now is an incentive to set up
disk mirroring. If you're storing lots of data that other
people in the household depend on, that might factor in to your
decisions, too.

Everybody has a budget. Some have big budgets, and some have
small. The power of open source software is that we can make
opportunities open to people with small budgets that are
otherwise reserved for people with big budgets.

Most of the computers in my house have one disk. If I value any
data on that disk, I back it up to the server, which has 4 4TB
disks in ZFS RAID10. If a disk fails in that, I know I can
survive that and replace it within 24 hours for a reasonable
amount of money -- rather more reasonable in the last few
months.

> > Is anyone still using ext4? I'm not saying it's bad or anything, it
> > only seems that it has gone out of fashion.
>
> IIRC its still Debian's default. Its my file system of choice unless I have
> very specific reasons against it. I have never seen it fail outside of
> hardware issues. Performance of ext4 is quite acceptable out of the box.
> E.g. it seems to be slightly faster than ZFS for my use cases. Almost every
> Linux live system can read it. There are no problematic licensing or
> stability issues whatsoever. By its popularity its probably one of the most
> widely-deployed Linux file systems which may enhance the chance that
> whatever problem you incur with ext4 someone else has had before...

All excellent reasons to use ext4.

-dsr-

Stefan Monnier

unread,
Nov 10, 2022, 10:00:06 PM11/10/22
to
>> Or are you referring to the data being altered while a backup is in
>> progress?
> Yes. Data of different files or at different places in the same file
> may have relations which may become inconsistent during change operations
> until the overall change is complete.

Arguably this can be considered as a bug in the application (because
a failure in the middle could thus result in an inconsistent state).

> If you are unlucky you can even catch a plain text file that is only half
> stored.

Indeed, many such files are written in a non-atomic way.

> The risk for this is not 0 with filesystem snapshots, but it grows further
> if there is a time interval during which changes may or may not be copied
> into the backup, depending on filesystem internals and bad luck.

With snapshots, such problems can be considered application bugs, but if
you don't use snapshots, then your backup will not see "the state at time
T" but instead the state of different files at different times,
and in that case you can very easily see an inconsistent state even
without any bug in an application: the bug is in the backup
process itself.

If some part of your filesystem is frequently/constantly being modified,
then such inconsistent backups can be very common.


Stefan

Michael Stone

unread,
Nov 10, 2022, 11:10:06 PM11/10/22
to
Then you're either well into "not normal" territory and need to buy an
SSD with better write longevity (which I seriously doubt for a backup
drive) or you just got unlucky and got a bad copy (happens with
anything) or you've misdiagnosed some other issue.

Michael Stone

unread,
Nov 10, 2022, 11:20:05 PM11/10/22
to
On Thu, Nov 10, 2022 at 08:32:36PM -0500, Dan Ritter wrote:
>* RAID 5 and 6 restoration incurs additional stress on the other
> disks in the RAID which makes it more likely that one of them
> will fail.

I believe that's mostly apocryphal; I haven't seen science backing that
up, and it hasn't been my experience either.

> The advantage of RAID 6 is that it can then recover
> from that...

The advantage to RAID 6 is that it can tolerate a double disk failure.
With RAID 1 you need 3x your effective capacity to achieve that and even
though storage has gotten cheaper, it hasn't gotten that cheap. (e.g.,
an 8 disk RAID 6 has the same fault tolerance as an 18 disk RAID 1 of
equivalent capacity, ignoring pointless quibbling over probabilities.)
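The disk-count arithmetic above can be sanity-checked in a few lines (assuming
a 3-way mirror is what it takes for RAID 1 to survive a double disk failure):

```python
def raid6_disks(effective: int, parity: int = 2) -> int:
    """Disks needed for RAID 6: effective capacity plus two parity disks."""
    return effective + parity

def mirror_disks(effective: int, copies: int = 3) -> int:
    """Disks for an n-way mirror with the same double-failure tolerance."""
    return effective * copies

# An 8-disk RAID 6 yields 6 disks of capacity; matching that capacity with
# double-fault-tolerant mirrors takes 18 disks.
print(raid6_disks(6))   # 8
print(mirror_disks(6))  # 18
```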

David Christensen

unread,
Nov 11, 2022, 12:20:05 AM11/11/22
to
On 11/10/22 07:44, hw wrote:
> On Wed, 2022-11-09 at 21:36 -0800, David Christensen wrote:
>> On 11/9/22 00:24, hw wrote:
>>  > On Tue, 2022-11-08 at 17:30 -0800, David Christensen wrote:

>> Be careful that you do not confuse a ~33 GiB full backup set, and 78
>> snapshots over six months of that same full backup set, with a full
>> backup of 3.5 TiB of data.

> The full backup isn't deduplicated?


"Full", "incremental", etc., occur at the backup utility level -- e.g.
on top of the ZFS filesystem. (All of my backups are full backups using
rsync.) ZFS deduplication occurs at the block level -- e.g. the bottom
of the ZFS filesystem. If your backup tool is writing to, or reading
from, a ZFS filesystem, the backup tool is oblivious to the internal
operations of ZFS (compression or none, deduplicaton or none, etc.) so
long as the filesystem "just works".
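As a sketch (pool and dataset names hypothetical), deduplication and
compression are per-dataset properties, and they only affect blocks written
after the properties are set:

```shell
# Enable dedup and compression on a dataset used for backups
zfs create tank/backups
zfs set dedup=on tank/backups
zfs set compression=lz4 tank/backups
# Inspect the deduplication table and the achieved ratios
zpool status -D tank
zfs get compressratio,used,logicalused tank/backups
```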


>> Writing to a ZFS filesystem with deduplication is much slower than
>> simply writing to, say, an ext4 filesystem -- because ZFS has to hash
>> every incoming block and see if it matches the hash of any existing
>> block in the destination pool.  Storing the existing block hashes in a
>> dedicated dedup virtual device will expedite this process.
>
> But when it needs to write almost nothing because almost everthing gets
> deduplicated, can't it be faster than having to write everthing?


There are many factors that affect how fast ZFS can write files to disk.
You will get the best answers if you run benchmarks using your
hardware and data.


>>  >> I run my backup script each night.  It uses rsync to copy files and
>>  >
>>  > Aww, I can't really do that because my servers eats like 200-300W
>> because it has
>>  > so many disks in it.  Electricity is outrageously expensive here.
>>
>>
>> Perhaps platinum rated power supplies?  Energy efficient HDD's/ SSD's?
>
> If you pay for it ... :)
>
> Running it once in a while for some hours to make backups is still possible.
> Replacing the hardware is way more expensive.


My SOHO server has ~1 TiB of data. A ZFS snapshot takes a few seconds.
ZFS incremental replication to the backup server proceeds at anywhere
from 0 to 50 MB/s, depending upon how much content is new or has changed.
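That replication step can be sketched like this (pool names, host name, and
snapshot dates are hypothetical):

```shell
# Take a recursive snapshot, then send only the delta since the previous one
zfs snapshot -r tank@2022-11-10
zfs send -R -i tank@2022-11-09 tank@2022-11-10 | \
    ssh backuphost zfs receive -du backuppool
```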



>>  > Sounds like a nice setup.  Does that mean you use snapshots to keep
>> multiple
>>  > generations of backups and make backups by overwriting everything
>> after you made
>>  > a snapshot?
>>
>> Yes.
>
> I start thinking more and more that I should make use of snapshots.


Taking snapshots is fast and easy. The challenge is deciding when to
destroy them.


zfs-auto-snapshot can do both automatically:

https://packages.debian.org/bullseye/zfs-auto-snapshot

https://manpages.debian.org/bullseye/zfs-auto-snapshot/zfs-auto-snapshot.8.en.html
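Done by hand, the lifecycle is just (dataset and snapshot names hypothetical):

```shell
zfs snapshot tank/home@2022-11-10      # create
zfs list -t snapshot -r tank/home      # list
# restore a single file by copying it out of the hidden .zfs directory
cp /tank/home/.zfs/snapshot/2022-11-10/novel.txt /tank/home/
zfs destroy tank/home@2022-11-10       # destroy when no longer needed
```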


>> Without deduplication or compression, my backup set and 78 snapshots
>> would require 3.5 TiB of storage.  With deduplication and compression,
>> they require 86 GiB of storage.
>
> Wow that's quite a difference! What makes this difference, the compression or
> the deduplication?


Deduplication.


> When you have snapshots, you would store only the
> differences from one snapshot to the next,
> and that would mean that there aren't
> so many duplicates that could be deduplicated.


I do not know -- I have not crawled the ZFS code; I just use it.


>> Users can recover their own files without needing help from a system
>> administrator.
>
> You have users who know how to get files out of snapshots?


Not really; but the feature is there.


>>  >>>> For compressed and/or encrypted archives, image, etc., I do not use
>>  >>>> compression or de-duplication
>>  >>>
>>  >>> Yeah, they wouldn't compress.  Why no deduplication?
>>  >>
>>  >>
>>  >> Because I very much doubt that there will be duplicate blocks in
>> such files.
>>  >
>>  > Hm, would it hurt?
>>
>> Yes.  ZFS deduplication is resource intensive.
>
> But you're using it already.


I have learned the hard way to only use deduplication when it makes sense.


>> What were the makes and models of the 6 disks?  Of the SSD's?  If you
>> have a 'zpool status' console session from then, please post it.
>
> They were (and still are) 6x4TB WD Red (though one or two have failed over time)
> and two Samsung 850 PRO, IIRC. I don't have an old session anymore.
>
> These WD Red are slow to begin with. IIRC, both SSDs failed and I removed them.
>
> The other instance didn't use SSDs but 6x2TB HGST Ultrastar. Those aren't
> exactly slow but ZFS is slow.


Those HDD's should be fine with ZFS; but those SSD's are desktop drives,
not cache devices. That said, I am making the same mistake with Intel
SSD 520 Series. I have considered switching to one Intel Optane Memory
Series and a PCIe 4x adapter card in each server.


>> MySQL appears to have the ability to use raw disks.  Tuned correctly,
>> this should give the best results:
>>
>> https://dev.mysql.com/doc/refman/8.0/en/innodb-system-tablespace.html#innodb-raw-devices
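Per that page, raw devices are configured in my.cnf roughly like this (the partition path and size are assumptions; "newraw" is changed to "raw" after InnoDB has initialized the partition on first start):

```ini
[mysqld]
innodb_data_home_dir=
innodb_data_file_path=/dev/sdb1:3Gnewraw
```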
>
> Could mysql 5.6 already do that? I'll have to see if mariadb can do that now
> ...


I do not know -- I do not run MySQL or Maria.


>> Please run 'zpool status' and post the console session (prompt, command
>> entered, output displayed).  Please correlate the vdev's to disk drive
>> makes and models.
>
> See above ... The pool is a raidz1-0 with the 6x4TB Red drives, and no SSDs are
> left.


Please run and post the relevant command for LVM, btrfs, whatever.
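For reference, rough equivalents of 'zpool status' on the other stacks (the mount point is an assumption):

```shell
btrfs filesystem show            # devices backing each btrfs filesystem
btrfs device stats /srv/backup   # per-device error counters
pvs; vgs; lvs -a                 # LVM physical volumes, volume groups, LVs
cat /proc/mdstat                 # Linux software RAID status, if any
```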



>> On 11/9/22 03:41, hw wrote:

>> What is the make and model of your server?
>
> I put it together myself. The backup server uses an MSI mainboard with the
> designation S0121 C204 SKU in a Chenbro case that has a 16xLFF backplane. It
> has only 16GB RAM and would max out at 32GB.

> ... the backup server is currently using btrfs.


Okay.


>> What is the make and model of your controller cards?
>
> They're HP smart array P410. FreeBSD doesn't seem to support those.


I use the LSI 9207-8i with "IT Mode" firmware (e.g. host bus adapter,
not RAID):

https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=lsi+9207&_sacat=0


> ... the data to back up is mostly (or even all) on btrfs. ... copy the
> files over with rsync. ...
> the data comes from different machines and all backs up to one volume.


I suggest:

1.  Create a ZFS pool with a mirror vdev of two HDD's.  If you can get
    past your dislike of SSD's, add a mirror of two SSD's as a
    dedicated dedup vdev.  (These will not see the hard usage that
    cache devices get.)
2.  Create a filesystem 'backup'.
3.  Create child filesystems, one for each host.
4.  Create grandchild filesystems, one for the root filesystem on each
    host.
5.  Set up daily rsync backups of the root filesystems on the various
    hosts to the ZFS grandchild filesystems.
6.  Set up zfs-auto-snapshot to take daily snapshots of everything,
    and retain 10 snapshots.

Then watch what happens.


David

David Christensen

Nov 11, 2022, 12:30:05 AM
to
>> On Thu, Nov 10, 2022 at 05:54:00AM +0100, hw wrote:
>>> ls -la
>>> total 5
>>> drwxr-xr-x  3 namefoo namefoo    3 16. Aug 22:36 .
>>> drwxr-xr-x 24 root    root    4096  1. Nov 2017  ..
>>> drwxr-xr-x  2 namefoo namefoo    2 21. Jan 2020  ?
>>> namefoo@host /srv/datadir $ ls -la '?'
>>> ls: cannot access ?: No such file or directory
>>> namefoo@host /srv/datadir $
>>>
>>>
>>> This directory named ? appeared on a ZFS volume for no reason and I can't
>>> access
>>> it and can't delete it.  A scrub doesn't repair it.  It doesn't seem to do
>>> any
>>> harm yet, but it's annoying.
>>>
>>> Any idea how to fix that?


2022-11-10 21:24:23 dpchrist@f3 ~/foo
$ freebsd-version ; uname -a
12.3-RELEASE-p7
FreeBSD f3.tracy.holgerdanske.com 12.3-RELEASE-p6 FreeBSD
12.3-RELEASE-p6 GENERIC amd64

2022-11-10 21:24:45 dpchrist@f3 ~/foo
$ bash --version
GNU bash, version 5.2.0(3)-release (amd64-portbld-freebsd12.3)
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>

This is free software; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

2022-11-10 21:24:52 dpchrist@f3 ~/foo
$ ll
total 13
drwxr-xr-x 2 dpchrist dpchrist 2 2022/11/10 21:24:21 .
drwxr-xr-x 14 dpchrist dpchrist 30 2022/11/10 21:24:04 ..

2022-11-10 21:25:03 dpchrist@f3 ~/foo
$ touch '?'

2022-11-10 21:25:08 dpchrist@f3 ~/foo
$ ll
total 14
drwxr-xr-x 2 dpchrist dpchrist 3 2022/11/10 21:25:08 .
drwxr-xr-x 14 dpchrist dpchrist 30 2022/11/10 21:24:04 ..
-rw-r--r-- 1 dpchrist dpchrist 0 2022/11/10 21:25:08 ?

2022-11-10 21:25:11 dpchrist@f3 ~/foo
$ rm '?'
remove ?? y

2022-11-10 21:25:19 dpchrist@f3 ~/foo
$ ll
total 13
drwxr-xr-x 2 dpchrist dpchrist 2 2022/11/10 21:25:19 .
drwxr-xr-x 14 dpchrist dpchrist 30 2022/11/10 21:24:04 ..


David
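When a name is merely hard to type (a "?" in ls output often stands for unprintable characters), deleting by inode number sometimes works where deleting by name does not. A sketch, using a harmlessly named stand-in file; note that in the report above even 'ls' could not stat the entry, which suggests on-disk corruption rather than an odd name, and this trick will not help there:

```shell
# Remove a directory entry by inode number instead of by name.
mkdir -p /tmp/inum-demo && cd /tmp/inum-demo
touch '?'                             # stand-in for the awkward entry
ls -lai                               # the inode number is the first column
inode=$(stat -c %i '?')               # or read it off the ls output
find . -xdev -inum "$inode" -delete   # match by inode, not by name
```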