Building the most exciting backup with S3QL, need help


Amos T

unread,
Sep 9, 2021, 3:55:35 AM9/9/21
to s3ql
I am working on an exciting setup, a nice approach to having a huge cloud drive.
I think a lot of us will need something like this, since some cloud services have
shut down their unlimited storage plans (Google Drive, for example).

This time I want to do it right, which means:
1) encryption
2) compression
3) deduplication (to save my wallet)
4) sync on the fly

After many trials on both Windows and Linux, I had failures with 'cloud drive' solutions
such as tvtdrive, raidrive, expandrive, and Mountain Duck + Cryptomator
(and lost some data due to bad design, computer crashes, lost caches, etc.).

Then I considered Mega.nz, since they have a good offer for a lot of cloud space
with zero-knowledge encryption (is it, though?). But there was no way to check hashes, rclone has no 2FA support for Mega (yet) due to an outdated Go library, I would need to trust Go (rclone) while I prefer Python, and worse, it is based on symmetric encryption where you put your key into their client! And I saw no way to replicate or back up the storage. So I was done with that one as well.

So I finally found an excellent S3 cloud provider, Wasabi, since I hate paying for egress and that
matters when you use S3QL! No charges for uploads or downloads. And there are S3 providers in Europe as well, which is where I live.

So I ended up with S3QL + Wasabi,
and it is the best thing I could have done so far!

S3QL and Nikolaus Rath's team are far too humble about this project, in my honest opinion!
Believe me.

OK. Now on to my setup:

My setup is as follows. I use S3 cloud space from Wasabi. On top of that, I run S3QL so I have a POSIX filesystem with encryption, compression and deduplication! I use a permanent, big cache on a 16TB ZFS filesystem on FreeBSD. God, I like ZFS. That filesystem rocks; no other filesystem comes close to that piece of gold.

So in this setup, ZFS protects me against bitrot to some extent, although it only holds the S3QL cache of my files anyway.

S3QL gives me a mountable space on Wasabi S3, so I have a POSIX filesystem. It compresses, deduplicates and encrypts. Blocks that are not in the cache are fetched on the fly when needed. And when it crashes, S3QL checks the cache against the objects in S3, which I like as well...

Once it is mounted, I provide a share on the network using NFS or SMB.
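For the NFS case, an export entry along these lines should do on the Linux host; exporting a FUSE filesystem over kernel NFS needs an explicit fsid, and the path and network range are only examples:

    # /etc/exports
    /mnt/s3ql  192.168.1.0/24(rw,fsid=1,no_subtree_check)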

The big cache is used for backing up with borgbackup and for accessing files that are needed often. So I do something crazy here: I use a HUGE S3QL cache so that my whole bucket is cached
and files do not need to be downloaded from the S3 bucket.
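To give an idea, this is roughly how I mount it against Wasabi's S3-compatible endpoint; the bucket name, cache path, mount point and sizes below are placeholders for my real values:

    # create the filesystem once
    mkfs.s3ql s3c://s3.wasabisys.com:443/my-bucket/s3ql
    # mount with a deliberately huge cache; --cachesize is given in KiB,
    # so 8000000000 is roughly an 8 TB cache
    mount.s3ql --cachedir /tank/s3ql-cache \
        --cachesize 8000000000 \
        --max-cache-entries 1000000 \
        --compress lzma-6 \
        --allow-other \
        s3c://s3.wasabisys.com:443/my-bucket/s3ql /mnt/s3ql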

So in my setup the source files are in the S3 cloud, and the local storage is just a cache for those files. Not the other way around, where a NAS stores the files and syncs them to S3.
Needless to say, I need to have high trust in the S3QL filesystem for this, so for now I am testing with unimportant data that I can find again on the internet.

So now I need some help. I have my cloud drive, and of course I can replicate and back up my bucket within S3, but I am still very cautious and I want to have a local backup as well. I want triple security.

For this my eye is on borgbackup, since it has deduplication and is pretty fast.
It has no S3 endpoints, but since S3QL mounts my data into my Linux tree, that is not an issue.

However, the manual talks about inodes.
For detecting unmodified files, borg create offers several --files-cache modes (man borg create):
https://borgbackup.readthedocs.io/en/stable/usage/create.html

Backup speed is increased by not reprocessing files that are already part of existing archives and weren’t modified.  The detection of unmodified files is done by comparing multiple file metadata values with previous values kept in the files cache.

This comparison can operate in different modes as given by --files-cache:

    ctime,size,inode (default)
    mtime,size,inode (default behaviour of borg versions older than 1.1.0rc4)
    ctime,size (ignore the inode number)
    mtime,size (ignore the inode number)
    rechunk,ctime (all files are considered modified - rechunk, cache ctime)
    rechunk,mtime (all files are considered modified - rechunk, cache mtime)
    disabled (disable the files cache, all files considered modified - rechunk)
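
So if inodes turn out not to be stable on S3QL, I guess I could simply drop the inode comparison. Something along these lines is what I have in mind (repository path and archive name are just placeholders):

    # create the local repository once
    borg init --encryption=repokey /backup/borg-repo
    # back up the S3QL mount, ignoring inode numbers when deciding
    # whether a file has changed
    borg create --files-cache=ctime,size --stats \
        /backup/borg-repo::s3ql-{now} /mnt/s3ql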

At the moment, as said, I use a very large cache for my S3QL files. My cache is (almost) as large as the stored data, a few TB, which is not very space efficient locally,
but of course the files are fast to access. For that I use a huge 10TB drive.

I am not sure whether that drive really needs to be RAID or ZFS to avoid data loss as well...

In fact, what happens if the cache gets corrupted? How does S3QL detect corrupted cached files? Suppose I put this on a single drive, not on RAID, and the cache gets corrupted.

Since I use a large cache with a long expiry so it does not time out:
Would my source files be corrupted? Does S3QL repair my corrupted cache?
Do I need to schedule a command to check my local cache for corruption?
Will my corrupted cache keep being served until I perform some steps?

So why such a huge cache? I use it because I do not want files to be downloaded from the S3 space every time I run a backup. The files need to be compared to see whether they have changed, but my files stored in S3QL rarely change.

Borgbackup states that if inodes are stable, you can keep the default check (ctime, size, inode). They say inodes on SMB mounts are not stable...

So are inodes stable in S3QL? I see this is a FUSE filesystem, but what about inodes: once mounted, can they be considered stable for both cached and uncached files? Could it be that only the cached files have stable inodes?

What if I make S3QL use a much smaller (default size) cache, given that I rarely access the files?

How should I compare the uncached files with the borg repository in such a way that files which are not cached and have not changed are not downloaded? Do I need to drop the inode check?

What about cached files that are not changed?

So I need some guidance here.

Thanks.


Nikolaus Rath

unread,
Sep 9, 2021, 4:49:11 AM9/9/21
to s3...@googlegroups.com
On Sep 09 2021, Amos T <amtr...@gmail.com> wrote:
> In fact, what happens if the cache gets corrupted? How does S3QL detect
> corrupted cached files? Suppose I put this on a single drive, not on RAID,
> and the cache gets corrupted.

S3QL does not detect corruption on the cache. It would just upload/use
the corrupted data. You're expected to put the cache on a sufficiently
reliable filesystem.

> So are inodes stable in S3QL ?

Yes. A file's inode never changes, regardless of whether it is cached or not, or
whether the filesystem is remounted.

Best,
-Nikolaus

--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«

Amos T

unread,
Sep 9, 2021, 5:39:46 AM9/9/21
to s3ql
That is a very important thing to mention when people deploy S3QL, since a corrupt cache
due to bad local storage (bitrot, corruption, ...) can lead to data loss.

Inside the POSIX filesystem of S3QL you can of course protect individual files with parity (par2).

But what about the metadata and data of S3QL itself? Do you provide some parity there?
If not, could that be added via an s3ql command, so that in a worst-case scenario, when
corruption of S3QL's metadata and data happens, it can recover?
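In the meantime, as a stopgap, I can of course create the parity myself inside the mounted filesystem. Something along these lines would work for important archives (redundancy level and paths are only examples):

    # create 10% recovery data next to the archive, verify and repair later if needed
    par2 create -r10 /mnt/s3ql/archives/photos-2021.tar
    par2 verify /mnt/s3ql/archives/photos-2021.tar.par2
    par2 repair /mnt/s3ql/archives/photos-2021.tar.par2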

On Thursday, September 9, 2021 at 10:49:11 AM UTC+2, Niko...@rath.org wrote:

Amos T

unread,
Sep 9, 2021, 5:57:28 AM9/9/21
to s3ql
So in my setup, if I want to properly back up my files to local space, I can stick with
the default ctime,size,inode check on files.

But given your story about the critical cache, it looks like I should rather have a small cache,
only for files that need to be uploaded or accessed quickly.
When a file is not in the cache, it gets synced into the cache, and the cache then becomes the
primary source, not the blocks that S3QL has stored for that file.

But now my final question. If a lot of files are NOT cached, i.e. they live only in the S3QL
data objects in the S3 bucket, what happens when I run borgbackup? It will check the metadata
(ctime, size, inode), but the actual data of the file is not stored locally...

So in that case, may I safely assume that when borgbackup checks for data modification, the
file in question is not downloaded? Because otherwise my backup will generate a lot of traffic...



On Thursday, September 9, 2021 at 11:39:46 AM UTC+2, Amos T wrote:

Daniel Jagszent

unread,
Sep 9, 2021, 8:57:24 AM9/9/21
to s3ql

[...] But what about the metadata and data of S3QL itself? Do you provide some parity there?

If not, could that be added via an s3ql command, so that in a worst-case scenario, when
corruption of S3QL's metadata and data happens, it can recover?
The metadata of S3QL is an SQLite database. There is no parity or error correction here, but you can of course use the SQLite tools to try to recover a corrupt database. I would guess that this has a high chance of success for a single bit flip (bitrot). If your HDD has corrupt sectors, you might be out of luck.

If you use ZFS anyway, use RAID-Z1 or RAID-Z2 to protect your data against these kinds of problems and do regular scrubs to detect and correct bitrot or other hard-drive failures. If you only have one drive, you could use ditto blocks ( https://en.wikipedia.org/wiki/ZFS#Additional_capabilities ) and regular scrubs to secure the metadata (put it on its own sub-filesystem).
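On a single drive that could look roughly like this; pool and dataset names are just examples:

    # dedicated dataset for the S3QL cache directory (which also holds the metadata)
    zfs create tank/s3ql-cachedir
    # keep two copies of every block ("ditto blocks"), even on a single disk
    zfs set copies=2 tank/s3ql-cachedir
    # scrub regularly (e.g. from cron); with copies=2 a detected bad block can be repaired
    zpool scrub tank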

If the worst case happens and you lose your local metadata, S3QL keeps several backups in the backend storage. The default is to upload the metadata to the backend every 24 hours. You would of course lose up to 24 hours of filesystem changes.
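If 24 hours is too long for you, the interval can be shortened at mount time, for example (value in seconds, storage URL and mountpoint are placeholders):

    # upload the metadata every hour instead of once a day
    mount.s3ql --metadata-upload-interval 3600 <storage-url> <mountpoint>
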
[...] But given your story about the critical cache, it looks like I should rather have a small
cache, only for files that need to be uploaded or accessed quickly. [...]
Regardless of the size of the cache, S3QL will persist dirty (i.e. changed) data to the backend storage (Wasabi in your case) as soon as possible.
[...] So in that case, may I safely assume that when borgbackup checks for data modification,
the file in question is not downloaded? [...]
Yes. A simple stat will not trigger a download. That can be handled solely with the local metadata.

Nikolaus Rath

unread,
Sep 9, 2021, 9:30:15 AM9/9/21
to s3...@googlegroups.com
On Sep 09 2021, Amos T <amtr...@gmail.com> wrote:
> But what about the metadata and data of S3QL itself? Do you provide some
> parity there?

No, S3QL relies completely on the filesystem where this data is being
stored.


> If not, could that be added via an s3ql command, so that in a worst-case
> scenario, when corruption of S3QL's metadata and data happens, it can
> recover?

Well, in principle everything is possible. I don't think anyone is
planning to do that work though.

It's also not clear to me why people worried about filesystem corruption
can't just use a file system that provides the necessary guarantees
instead of relying on individual applications to make up for that...

Best,

Nikolaus Rath

unread,
Sep 9, 2021, 1:04:59 PM9/9/21
to s3...@googlegroups.com
On Sep 09 2021, Amos T <amtr...@gmail.com> wrote:
> But now my final question. If a lot of files are NOT cached, i.e. they live
> only in the S3QL data objects in the S3 bucket, what happens when I run
> borgbackup? It will check the metadata (ctime, size, inode), but the actual
> data of the file is not stored locally...
>
> So in that case, may I safely assume that when borgbackup checks for data
> modification, the file in question is not downloaded? Because otherwise my
> backup will generate a lot of traffic...

I do not know how borgbackup checks if a file has been modified. If it
just compares ctime, mtime and file size the file will not be downloaded
for that. If it tries to read the file contents, S3QL obviously has no
other choice than to download the data.

Amos T

unread,
Sep 9, 2021, 3:46:44 PM9/9/21
to s3ql
The endpoint of S3QL can point to local space, not only to S3.
I know that corruption is almost nonexistent with cloud storage and ZFS, but still, better safe than sorry.

So having a little redundancy on those files, with some parity, could save a lot of trouble
when rescuing an S3QL filesystem.

I still hope it can be an option in the future.

About borgbackup, you have answered my question. By default it checks the metadata, as said,
to speed up the backup. I will run some tests and see whether it behaves otherwise and traffic floods my router.

Thanks for your time.




On Thursday, September 9, 2021 at 3:30:15 PM UTC+2, Niko...@rath.org wrote:

David Gasaway

unread,
Sep 10, 2021, 5:37:23 PM9/10/21
to s3ql
On Thu, Sep 9, 2021 at 2:57 AM Amos T <amtr...@gmail.com> wrote:
 
So in that case, may I safely assume that when borgbackup checks for data modification, the
file in question is not downloaded? Because otherwise my backup will generate a lot of traffic...

The documentation for `borg create` explains the methodologies for detecting changed files.  Basically, with the defaults, it caches the metadata of the files it backed up in a local borg cache.  It doesn't need to read anything at all from the destination (S3QL) file system so long as that cache is intact.  If the cache is lost or corrupt, borg has to read (download, in the case of S3QL) the entire repository to rebuild it.
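If you are worried about where that cache lives, borg honors the BORG_CACHE_DIR environment variable, so you could put it on reliable storage; the paths below are just examples:

    # keep borg's files/chunks cache on the ZFS pool instead of ~/.cache/borg
    export BORG_CACHE_DIR=/tank/borg-cache
    borg create --stats /backup/borg-repo::s3ql-{now} /mnt/s3ql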
 
--
-:-:- David K. Gasaway
-:-:- Email: da...@gasaway.org

David Gasaway

unread,
Sep 10, 2021, 5:42:07 PM9/10/21
to s3...@googlegroups.com
On Thu, Sep 9, 2021 at 10:05 AM Nikolaus Rath <Niko...@rath.org> wrote:
I do not know how borgbackup checks if a file has been modified. If it
just compares ctime, mtime and file size the file will not be downloaded
for that. If it tries to read the file contents, S3QL obviously has no
other choice than to download the data.

For the record, borg blocks and chunks the files similarly to S3QL, so I don't think the metadata of the chunk files really corresponds to the metadata of the original files.
 

David Gasaway

unread,
Sep 10, 2021, 5:46:09 PM9/10/21
to s3ql
I may have to take some of that back.  It sounds like borg does keep components of the cache on the remote file system.  So a local cache rebuild doesn't need to download the whole repository.  `borg check` operations frequently do, however.  The point I was trying to make is, under normal circumstances, only the local cache is used for change detection.

Amos T

unread,
Sep 12, 2021, 7:41:30 AM9/12/21
to s3ql
I am trying to figure out whether I am going in the right direction. I am a HUGE fan of ZFS, because if you use that filesystem with ECC memory, take a JBOD of disks and put them in a
redundant setup (RAID-Z3 or RAID-Z2), you get not only snapshots and compression but also deduplication. The cache is held on a ZIL SSD/NVMe to speed up access, but that
cache should always have redundancy as well! The only thing I miss in ZFS is versioning...

So in my setup I see a few issues:

1) It looks to me like there is no versioning in S3QL either. This feels very uncomfortable to me, given the new range of viruses out there.

2) The cache (this goes for S3QL as well as borgbackup) should always be on a stable filesystem, so running it on a single platter is a very weak point, since the cache is what gets synced upstream first.
    Of course there is a way to mount read-only and use immutable trees, but if it is mounted in the regular way, corrupt files (bitrot etc.) are synced upstream.

3) There is no parity on the s3ql_data_* objects. I would like to see some error detection and correction there, in the form of an s3ql_data_NNNNN.par2 for each data block,
    plus an additional s3ql command to verify and, if needed, repair at that level in a worst-case scenario...

Compression saves me some space; deduplication not so much for my data. I think the latter is mostly useful in big corporations with potentially a lot of duplicated files (blocks).
Both take a toll on the CPU as well.

For me encryption is mandatory, and I much prefer asymmetric encryption over symmetric encryption, since in the first case you only hand out the public key.

So I still need to find the right approach here, but since there is no versioning... I have a problem.
I cannot snapshot the filesystem every hour.
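Unless I could use s3qlcp for this? If I read the docs correctly, it makes fast copy-on-write copies within the same mountpoint, so something like this (paths are only an example) might act as a cheap hourly snapshot, with deduplication keeping the cost low:

    # copy-on-write "snapshot" of the data tree inside the S3QL filesystem
    s3qlcp /mnt/s3ql/data /mnt/s3ql/snapshots/2021-09-12_14-00
    # optionally make the snapshot tree immutable afterwards
    s3qllock /mnt/s3ql/snapshots/2021-09-12_14-00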



On Friday, September 10, 2021 at 11:46:09 PM UTC+2, dgas...@gmail.com wrote:

David Gasaway

unread,
Sep 12, 2021, 5:11:40 PM9/12/21
to s3ql
On Sun, Sep 12, 2021 at 4:41 AM Amos T <amtr...@gmail.com> wrote:
So in my setup I see a few issues:

1) It looks to me like there is no versioning in S3QL either. This feels very uncomfortable to me, given the new range of viruses out there.

Perhaps I misunderstand your meaning for "versioning".  s3ql is a file system, not a backup tool.  If you're after a backup (providing earlier versions of files) stored in your s3ql file system, then use a backup tool that writes into the s3ql file system.  Have you read the following?


3) There is no parity on the s3ql_data_* objects. I would like to see some error detection and correction there, in the form of an s3ql_data_NNNNN.par2 for each data block,
    plus an additional s3ql command to verify and, if needed, repair at that level in a worst-case scenario...

From the S3QL features list: "Encryption. After compression (but before upload), all data can be AES encrypted with a 256 bit key. An additional SHA256 HMAC checksum is used to protect the data against manipulation."  Whether it is manipulation or corruption, a block will fail to decrypt if something goes wrong.
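If you want to check for that proactively, s3ql also ships an s3ql_verify tool; with --data it downloads and verifies every object instead of just checking that it exists, so be aware it will pull the whole bucket (the storage URL below is a placeholder):

    # verify that every backend object exists and that its contents are intact
    s3ql_verify --data s3c://s3.wasabisys.com:443/my-bucket/s3ql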