I have a web-server (nginx + apache + mysql, FreeBSD 7.3) with many
sites. Every night it creates a backup of /home on a separate disk.
/home is a RAID1 mirror on Adaptec 3405 (128M write cache) with SAS
drives; /backup is a single SATA drive on the same controller.
Rsync creates backups using hardlinks; it stores 7 daily and 4 weekly
copies. The total amount of data is ~300G in ~11M files. The server is
under heavy web load at all times (approx 100 queries/sec).
Every time the backup starts, the server slows down significantly and
disk operations become very slow. It may take up to 10 seconds to
stat() a file that is not in the filesystem cache. At the same time, an
rsync to a remote server does not affect disk load much; the server
works without slowdown.
I think the problem could have one of two causes:
* either the bulk of reads on the SATA /backup drive fills the OS
filesystem cache, so many file access operations require a real disk
read;
* or the bulk of writes to /backup fills the controller write cache,
the geom disk operation queue grows, and all disk operations have to
wait.
This is only my assumption, of course; I may be wrong.
How can I find the real reason for these slowdowns, so that I can
either conclude that it is impossible to solve given the
hardware/software limits, or tune my hardware/software to make all of
this work at an acceptable speed?
Here is my current sysctl setup (what should I tune?):
kern.maxvnodes=500000
vfs.ufs.dirhash_maxmem=67108864
vfs.lookup_shared=1
kern.dirdelay=6
kern.metadelay=5
kern.filedelay=7
sysctl counters (which others should I monitor?):
vfs.numvnodes: 407690
vfs.ufs.dirhash_mem: 27158118
I tried enabling async (in the hope it would make rsync faster) and
even disabling softupdates on the /backup partition (in the hope it
would make rsync slower so the OS filesystem cache would not be flushed
by backups); neither helped. I also want to try upgrading to an Adaptec
5405 (it has 256M of write cache) or moving the mysql databases to a
separate SAS disk, but I am not quite sure which would help more.
Where should I start to diagnose the issue?
Thanks in advance!
--
// cronfy
rsync has standard options to limit the bandwidth it will consume.
Making it write through a narrow pipe will also slow down the rate of
disk accesses, so should help control the impact on other services on
the machine.
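As a hedged sketch (the paths, rotation scheme, and the 5000 KB/s cap
are illustrative assumptions, not recommendations), a throttled
hardlink-rotation run might look like:

```shell
# daily.0 will become the newest copy; unchanged files are created as
# hardlinks into yesterday's copy via --link-dest. --bwlimit caps the
# transfer rate in KB/s -- note it throttles only data transfer, not
# the stat() scan of both trees.
rsync -aH --delete \
      --link-dest=/backup/daily.1 \
      --bwlimit=5000 \
      /home/ /backup/daily.0/
```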
However, taking backups slowly makes it harder to ensure you have a
consistent backup, so I recommend you investigate snapshotting the
filesystem (well supported for UFS, trivially easy for ZFS) and then
back up the snapshot as slowly as you like.
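A sketch of what that could look like with UFS snapshots on a 7.x
system; the snapshot name, mount point, md unit handling, and the rate
limit are all assumptions for illustration:

```shell
# Take a near-instant frozen image of /home.
mksnap_ffs /home /home/.snap/nightly

# Attach the snapshot file as a memory disk and mount it read-only.
MD=$(mdconfig -a -t vnode -o readonly -f /home/.snap/nightly)
mount -r /dev/$MD /mnt/snap

# Back up the frozen image as slowly as you like; /home stays consistent.
rsync -aH --bwlimit=5000 /mnt/snap/ /backup/daily.0/

# Tear down.
umount /mnt/snap
mdconfig -d -u ${MD#md}
rm -f /home/.snap/nightly
```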
Cheers,
Matthew
--
Dr Matthew J Seaman MA, D.Phil.
7 Priory Courtyard, Flat 3, Ramsgate, Kent, CT11 9PW
PGP: http://www.infracaninophile.co.uk/pgpkey
JID: mat...@infracaninophile.co.uk
I'm not sure snapshots are so well supported for UFS.
From sys/ufs/ffs/README.snapshot:
"As is detailed in the operational information below, snapshots are
definitely alpha-test code and are NOT yet ready for production use."
--
Bruce Cran
I took a look at the "cache management for the dump command" project on the
freebsd.org project ideas page.
http://www.freebsd.org/projects/ideas/ideas.html#p-extenddump
It appears that modifying dump to use a shared cache in a very simple way
(move the control structures to the shared segment and perform simple locking)
yields substantial speed increases.
A patch implementing this is attached.
Some numbers follow. The test system is an intel core i5 750, with recent
SATA disks. The tests all record an improvement with the shared cache, but
the values vary widely from 7% to 236%. It would be interesting to have
more tests on different configurations.
Would someone be interested in reviewing the patch and/or performing
more tests?
Regards,
J.F. Dockes
Some tests results
===================
The command used in all cases is "dump -0aC XX -f /dev/null filesystem"
The current dump actually uses 5 times the value of the -C option for
cache. The patched version uses a single shared memory segment. So "olddump
-C 10" and "newdump -C 50" are equivalent in terms of cache memory usage.
---------------
Tests performed on a small slice (3.7GB/4GB). The filesystem is quite
full, and has been pushed beyond full and then partially pruned a few
times to simulate one which has actually had a life. It contains a mix
of /home/ user files (avg size 68 kB). Tests were run both in single
disk and mirror mode.
Mirrored slice
Split cache -C 10: 18 MB/s
Shared cache -C 10: 42 MB/s (+133%)
Same slice, without the mirroring
Split cache -C 10: 11 MB/s
Shared cache -C 50: 37 MB/s (+236%)
--------
Tests on /var (500 MB / 5 GB). Mirrored slice
Split cache -C 10: 15 MB/s
Shared cache -C 50: 28 MB/s (+86%)
-----------
Tests on a bigger slice (24 GB / 43 GB) with mostly big files. Single disk
Split cache -C 10: 15 MB/s
Shared cache -C 50: 35 MB/s (+133%)
-----------
Tests on /usr (464 GB / 595 GB), mirrored
Split cache, -C 50: 57 MB/s
Shared cache, -C 250: 63 MB/s (+10%)
Level 1 tests (5GB dump)
Split cache, -C 50: 38 MB/s
Shared cache, -C 250: 41 MB/s (+7%)
Indeed. That's better than I expected.
>Would someone be interested in reviewing the patch and/or perform
>more tests ?
I've mostly converted to ZFS but still have UFS root (which is basically
a full base install without /var but including /usr/src - 94k inodes
and 1.7GB). I've run both the 8-stable ("stable") and patched ("jfd") dump
alternately 4 times with 50/250MB cache with the following results:
x stable
+ jfd
+------------------------------------------------------------+
| +|
| +|
| x +|
|x xx +|
||AM A|
+------------------------------------------------------------+
N Min Max Median Avg Stddev
x 4 9413 9673 9568 9555.5 107.12143
+ 4 15359 15359 15359 15359 0
Difference at 95.0% confidence
5803.5 +/- 131.063
60.7347% +/- 1.3716%
(Student's t, pooled s = 75.7463)
--
Peter Jeremy
On Sunday 24 October 2010 15:15:53 cronfy wrote:
> [...]
> I think that problem can be caused by two reasons:
> * either bulk of reads on SATA /backup drive, that fills OS
> filesystem cache and many file access operations require real disk
> read.
> * or bulk of writes on /backup fills controller write cache and geom
> disk operations queue grown, causing all disk operations to wait.
>
> This is only my assumption of course, I may be wrong.
Try "gstat -a" to see which one it is. I guess you'll see bulk reads on /home
and bulk reads on /backup mostly.
When rsync starts, it will index the source and the destination directory
structures using readdir() and stat() calls to see what files have changed
(and need to be copied later on).
rsync offers the "--bwlimit" option to lower the network bandwidth between an
rsync server and a client, but this won't change the stress the stat() calls
generate when rsync indexes the directories.
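To watch where the time goes while the backup runs, something as simple
as the following is enough (the interval is arbitrary):

```shell
# -a: show only providers with activity; -I 1s: refresh every second.
# Compare %busy and the ms/r, ms/w latencies of the SAS mirror vs the
# SATA backup drive while rsync is running to see which side saturates.
gstat -a -I 1s
```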
> How can I find a real reason of these slowdowns, to either conclude
> that it is not possible to solve this because of hardware/software
> limits, or tune my software/hardware system to make this all work at
> an acceptable speed?
You could try the patch below to rsync's "syscall.c" file, which will pause
rsync for short periods of time every second to reduce the IO pressure it
creates.
Changing "500" to an even lower value should almost linearly scale the
'busy' percentage "gstat -a" shows down to even lower levels.
--- syscall.c.org 2010-10-26 22:47:20.000000000 +0200
+++ syscall.c 2010-10-26 22:47:33.000000000 +0200
@@ -215,8 +215,19 @@
#endif
}
+void tiny_pause(void)
+{
+ struct timeval tv;
+
+ // only work in the first half of every second.
+ gettimeofday(&tv, NULL);
+ if (tv.tv_usec > 500 * 1000)
+ usleep(1000 * 1000 - tv.tv_usec);
+}
+
int do_stat(const char *fname, STRUCT_STAT *st)
{
+ tiny_pause();
#ifdef USE_STAT64_FUNCS
return stat64(fname, st);
#else
@@ -226,6 +237,7 @@
int do_lstat(const char *fname, STRUCT_STAT *st)
{
+ tiny_pause();
#ifdef SUPPORT_LINKS
# ifdef USE_STAT64_FUNCS
return lstat64(fname, st);
@@ -239,6 +251,7 @@
int do_fstat(int fd, STRUCT_STAT *st)
{
+ tiny_pause();
#ifdef USE_STAT64_FUNCS
return fstat64(fd, st);
#else
Regards,
--
Daan Vreeken
VEHosting
http://VEHosting.nl
tel: +31-(0)40-7113050 / +31-(0)6-46210825
KvK nr: 17174380
9413 what? Puppies?
DES
--
Dag-Erling Smørgrav - d...@des.no
Ooops, sorry - KB/sec as reported in the dump summary.
--
Peter Jeremy
Thank you :)
> Every time backup starts server slows down significantly, disk
> operations become very slow. [...]
Thank you all for the answers.
Matthew, yes, I know about --bwlimit in rsync, but it will mostly slow
down the transfer of data, not the stat() operations that are used for
comparing files. I'm afraid bwlimiting will not make things any better
(I tried it some time ago without success).
Daan, thanks for the patch! I will try it.
A lot of the impact is also produced by the rm -rf of old backups. I
assume the low performance is also related to the large number of
hardlinks. There was a moment when I had ~15 backups hardlinked by
rsync, and an rm -rf of a single backup was VERY slow and slowed down
the server dramatically. I had no choice except to install a new clean
disk for backups, limit the number of future backups, and newfs the old
backup disk. With a smaller number of hardlinked copies, backup cleanup
works much better. Can a large number of hardlinks have such an impact
on filesystem operations?
Maybe it is possible to increase disk performance somehow? The server
has a lot of memory. At this time vfs.ufs.dirhash_maxmem = 67108864
(the max monitored value for vfs.ufs.dirhash_mem was 52290119) and
kern.maxvnodes = 500000 (the max monitored value for vfs.numvnodes was
450567). Can increasing these (or other) sysctls help? I ask because
(as you can see) these tunables have already been raised, and I am not
sure a further increase really makes sense.
Also, is it possible to limit disk operations for rm -rf somehow? The
only idea I have at the moment is to replace rm -rf with 'find |
slow_down_script | xargs rm' (or to use a similar patch as for rsync)...
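That idea can be sketched as a portable pipeline; the batch size of 100
files, the one-second pause, and the target path are all arbitrary,
hypothetical knobs:

```shell
# Delete files in batches of 100, pausing between batches so the disk
# queue can drain; then remove the leftover empty directory tree.
TARGET=/backup/daily.7   # hypothetical path of the backup to expire
find "$TARGET" -type f -print0 |
    xargs -0 -n 100 sh -c 'rm -f "$@"; sleep 1' sh
rm -rf "$TARGET"
```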
Or, perhaps, is it possible to limit the IO bandwidth for a particular
device somehow? I would then limit disk operations for the SATA backup
disk to prevent it from affecting the rest of the services that run on
the SAS mirror.
And also, maybe there are other ways to create incremental backups
instead of using rsync/hardlinks? I was thinking about generating a
list of changed files with my own script and packing it with tar, but I
did not find a way to remove old backups as easily as I can with
hardlinks..
Yes, hardlinked backups pretty much destroy performance, mainly
because they destroy all locality of reference on the storage media:
as files are slowly modified they get their own copies, mixed in with
other 'old' files which have not been modified. But theoretically
that should only affect the backup target storage and not the server's
production storage.
Here is what I would suggest: Move the backups off the production
machine and onto another totally separate machine, then rsync between
the two machines. That will solve most of your problems I think.
If the backup disk is a single drive then just use a junk box lying
around somewhere for your backup system with the disk installed in it.
--
The other half of the problem is the stat()ing of every single file
on the production server (whether via local rsync or remote rsync).
If your original statement is accurate and you have in excess of
11 million files then the stat()ing will likely force the system vnode
cache on the production system to cycle, whether it has a max of
100,000 or 500,000... doesn't matter, it isn't 11 million so it will
cycle. This in turn will tend to cause the buffer and VM page caches
(which are linked to the vnode cache) to get blown away as well.
The vnode cache should have code to detect stat() style accesses and
avoid blowing away unrelated cached vnodes which have cached data
associated with them, but it's kinda hit-or-miss how well that works.
It is very hard to tune those sorts of algorithms and when one is
talking about an inode:cache ratio of 22:1 even a good algorithm will
tend to break down.
Generally speaking when caches become inefficient server throughput
goes to hell. You go from e.g. 10uS to access a file to 6mS to
access a file, a 1:600 loss.
:May be it is possible to increase disk performance somehow? Server has
:a lot of memory. At this time vfs.ufs.dirhash_maxmem = 67108864 (max
:monitored value for vfs.ufs.dirhash_mem was 52290119) and
:kern.maxvnodes = 500000 (max monitored value for vfs.numvnodes was
:450567). Can increasing of these (or other) sysctls help? I ask
:because (as you can see) these tunables are already incremented, and I
:am not sure further increment really makes sense.
I'm not sure how this can be best dealt with in FreeBSD. If you are
using ZFS it should be possible to localize or cache the meta-data
associated with those 11 million+ files in some very fast storage
(i.e. like a SSD). Doing so will make the stat() portion of the rsync
go very fast (getting it over with as quickly as possible).
With UFS the dirhash stuff only caches the directory entries, not the
inode contents (though I'm not 100% positive on that), so it won't help
much. The directory entries are already linear and unless you have
thousands of files in each directory ufs dirhash will not save much
in the way of I/O.
:Also, is it possible to limit disk operations for rm -rf somehow? The
:only idea I have at the moment is to replace rm -rf with 'find |
:slow_down_script | xargs rm' (or use similar patch as for rsync)...
No, unfortunately there isn't much you can do about this due to
the fact that the files are hardlinked, other than moving the backup
storage entirely off the production server or otherwise determining
why disk I/O to the backup storage is affecting your primary storage
and hacking a fix.
The effect could be indirect... the accesses to the backup
storage are blowing away the system caches and causing the
production storage to get overloaded with I/O. I don't think
there is an easy solution other than to move the work off
the production server entirely.
:And also, maybe there are other ways to create incremental backups
:instead of using rsync/hardlinks? I was thinking about generating
:list of changed files with own script and packing it with tar, but I
:did not find a way to remove old backups with such an easy way as it
:is with hardlnks..
:
:Thanks in advance!
:...
:--
:// cronfy
Yes. Use snapshots. ZFS is probably your best bet here in FreeBSDland
as ZFS not only has snapshots it also has a streaming backup feature
that you can use to stream changes from one ZFS filesystem (i.e. on
your production system) to another (i.e. on your backup system).
Both the production system AND the backup system would have to be
running ZFS to make proper use of the feature.
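The send/receive workflow Matt describes looks roughly like this (the
pool, dataset, snapshot, and host names are made up for illustration):

```shell
# On the production machine: snapshot, then stream only the changes
# since the previous snapshot to the backup machine.
zfs snapshot tank/home@2010-10-31
zfs send -i tank/home@2010-10-30 tank/home@2010-10-31 | \
    ssh backuphost zfs receive backup/home

# Old snapshots can be expired independently on each side:
zfs destroy tank/home@2010-10-24
```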
But before you start worrying about all of that I suggest taking the
first step, which is to move the backups entirely off the production
system. There are many ways to handle LAN backups. My personal
favorite (which doesn't help w/ the stat problem but which is easy
to set up) is for the backup system to NFS mount the production system
and periodically 'cpdup' the production system's filesystems over to
the backup system. Then create a snapshot (don't use hardlinks),
and repeat. As a fringe benefit the backup system does not have to
rely on backup management scripts running on the production system...
i.e. the production system can be oblivious to the mechanics of the
backup. And with NFS's (NFSv3 here) rdirplus scanning the production
filesystem via NFS should go pretty quickly.
It is possible for files to be caught mid-change but also fairly
easy to detect the case if it winds up being a problem. And, of
course, more sophisticated methodologies can be built on top.
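A sketch of that NFS + cpdup + snapshot cycle, run from the backup box;
the host name, mount points, and snapshot naming are illustrative
assumptions:

```shell
# Mount the production filesystem read-only over NFSv3 with rdirplus,
# which batches directory+attribute reads and speeds up the scan.
mount -t nfs -o ro,nfsv3,rdirplus prod:/home /mnt/prod-home

# Mirror changes into the local copy (cpdup only copies what differs;
# -i0 disables interactive confirmation).
cpdup -i0 /mnt/prod-home /backup/home

# Freeze tonight's state as a UFS snapshot instead of a hardlink farm.
mksnap_ffs /backup /backup/.snap/home.$(date +%Y%m%d)

umount /mnt/prod-home
```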
-Matt
Matthew Dillon
<dil...@backplane.com>
> And also, maybe there are other ways to create incremental backups
> instead of using rsync/hardlinks?
Yes. Use dump(8) -- that's what it's for. It reads the inodes,
directories, and files directly from the disk device, thereby
eliminating stat() overhead entirely.
Any replication mechanism -- rsync, tar, even dd -- can be used
as a backup mechanism, but dump was specifically designed for the
purpose.
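For completeness, the basic invocations (the output path is a
placeholder):

```shell
# Full (level 0) dump of /home: -L dumps from a snapshot so the live
# filesystem stays consistent, -a removes the tape-length limit, and
# -u records the dump date in /etc/dumpdates for incrementals.
dump -0Lau -f /backup/home.dump.0 /home

# Interactive restore: browse the dump and extract individual files.
restore -i -f /backup/home.dump.0
```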
Well, dump is 25+ years old and has some serious issues when it
comes to reliably backing data up. On a modern (large) filesystem
you are virtually guaranteed to get corruption due to the asynchronous
nature of the dump.
This can be partially mitigated by using a block level snapshot on
the UFS source filesystem and then dumping the snapshot instead of
the live filesystem, but that opens up a Pandora's box of other
issues (such as whether issuing the snapshot itself will destabilize
the machine), plus the live system will stall while it is making the
snapshot, assuming you want a consistent snapshot, plus the snapshot
itself may not be entirely consistent, depending on how it is made.
Plus dump uses mtime to detect changes, which is unreliable, and the
output produced by dump is not live-accessible whereas a snapshot /
live filesystem copy is. That makes the dump fairly worthless for
anything other than catastrophic recovery. From experience, most
people need to access backups to pull out small bits of information
rather than whole filesystems, such as to restore a user's home
directory or web pages, and dump/restore is really unsuitable for
that sort of thing.
Live backups are far, far, far superior to dump/restore.
-Matt
> Well, dump is 25+ years old ...
Why are you running BSD if you prefer newer (=> less mature) stuff?
Switch to Linux!
> ... On a modern (large) filesystem you are virtually guaranteed
> to get corruption due to the asynchronous nature of the dump.
>
> This can be partially mitigated by using a block level snapshot on
> the UFS source filesystem and then dumping the snapshot instead of
> the live filesystem ...
IOW by using "dump -L"
> Plus dump uses mtime to detect changes, which is unreliable, ...
Are you sure about that? Last I knew it used ctime.
> and the output produced by dump is not live-accessible whereas a
> snapshot / live filesystem copy is. That makes the dump fairly
> worthless for anything other than catastrophic recovery.
Ever heard of "restore -i"?
Have you ever tried to restore a single file from a 2 terabyte dump
file? Or even better, if you are using incremental dumps, try
restoring a single file from 6 dump files.
I'm not saying that dump/restore is completely unusable; I'm saying
that it is MOSTLY unusable for the use cases people have today for
backups.
There is a certain convenience to being able to restore a file from
a live backup in a few seconds versus having to struggle with large
multi-layered incremental dump/restore files that were designed to be
spooled off to tape units.
-Matt
Matthew Dillon
<dil...@backplane.com>
I'd argue that if you're routinely restoring single files, you aren't managing
your time or your users' expectations properly.
Backups are /for/ catastrophic recovery, imo, and users shouldn't expect
systems staff to be routinely restoring single files they've inadvertently
deleted. Users need to realise that when you delete something it goes away:
that's what delete does, which is why you're usually asked to confirm it.
Restoring single files for individual users should be very much a special case
and not a routine service; otherwise you risk being snowed under with file
recovery requests.
Jonathan
Isn't that the purpose of periodic snapshots anyhow (restoring a
minimal number of files)?
Thanks,
-Garrett
On Saturday 30 October 2010 23:48:45 cronfy wrote:
> [...]
> Also, is it possible to limit disk operations for rm -rf somehow? The
> only idea I have at the moment is to replace rm -rf with 'find |
> slow_down_script | xargs rm' (or use similar patch as for rsync)...
Yes, there is. You could use the same 'trick' I've added to rsync and
limit the number of I/O-creating system calls an application makes.
You could even create a small wrapper library that does this for a specific
application, without having to recompile or change the application.
You can find a working proof of concept in "slowdown.c" here :
http://vehosting.nl/pub_diffs/
The library can be compiled with :
gcc -Wall -fPIC -shared -o slowdown.so slowdown.c
Then start the application you want to I/O-limit with something like :
(
export LD_PRELOAD=slowdown.so
export LD_LIBRARY_PATH=.:${LD_LIBRARY_PATH}
ls -R /a/random/huge/directory/
)
(Assuming you start the application from within the directory
where "slowdown.so" resides.)
This should work with rsync, ls and rm "out of the box", without changing the
source of the applications.
Regards,
--
Daan Vreeken
VEHosting
http://VEHosting.nl
tel: +31-(0)40-7113050 / +31-(0)6-46210825
KvK nr: 17174380
> Might gsched(8) help ?
I am using 7.3; there is no gsched as far as I know..
I am going to try gjournal instead - there was a suggestion that
gjournal may help here with huge IO request bursts.
--
// cronfy
Thanks again.
> Yes, hardlinked backups pretty much destroy performance, mainly
> because it destroys all locality of reference on the storage media
> when files are slowly modified and get their own copies, mixed with
> other 'old' files which have not been modified. But theoretically
> that should only effect the backup target storage and not the server's
> production storage.
That is what surprised me when I experimented with backups. If I
move the backups off the production server (to another, less loaded,
production server, in fact), the server that should be backed up runs
fine while the backups are created. I think it means the problem is
not with the vnode/dir caches..
On the other side, the server that received the backups became very
slow. So the problem looks to be related to writes or to file
creation/hardlinking somehow...
At the moment I do not have a server with ZFS, but I will think in
this direction. But I have heard that ZFS has lower performance than
UFS; is it really so? I mean, I have seen benchmarks and system
requirements, but I would like to hear about your own experience.
--
// cronfy
It actually works just fine there; just take the code from
http://info.iet.unipi.it/~luigi/geom_sched/
cheers
luigi