Recovery unnecessarily slow due to copycontroller exclude_and_protect

Alan Jackson

Aug 2, 2021, 7:55:50 PM
to Barman, Backup and Recovery Manager for PostgreSQL
Hi,

We have a barman server which we use to restore a copy of our production database to a test system, using rsync + ssh. The database is only 17GB in size.

The basic recover operation takes around 70 minutes to complete, with the vast majority of this time spent in a single-threaded rsync process.

The basic command we run is:
barman recover project_name latest -j 2 --remote-ssh-command "ssh post...@192.168.2.2" /var/lib/postgresql/11/main/

We've tracked this time spent down to the use of the rsync '--filter merge /tmp/barman-okldjbe8/pgdata_exclude_and_protect.filter' argument. When this option is excluded (via commenting out the option in the copy controller source), the full restore takes only 5 minutes, and appears perfectly valid.

The pgdata_exclude_and_protect.filter file which is generated by barman consists of over 900,000 lines - which must be evaluated on each of the files being rsynced (which is also several hundred thousand files). Rsync spends an inordinate amount of time parsing and evaluating these filters and exclusion lists before even copying the first byte of data to the target.
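
To make the scaling concrete, here is a toy model of the cost, assuming each path is checked against the filter rules in order until the first match (which is how the rsync man page describes filter evaluation). It is only a sketch: the function name, the sample paths and the plain string comparison are illustrative assumptions, not rsync's or barman's actual code.

```python
from typing import Sequence


def linear_scan_comparisons(paths: Sequence[str], patterns: Sequence[str]) -> int:
    """Count comparisons when every path is checked against the patterns
    in order and the scan stops at the first matching pattern."""
    comparisons = 0
    for path in paths:
        for pattern in patterns:
            comparisons += 1
            # Toy stand-in for rsync's pattern matching: exact string equality.
            if pattern == path:
                break
    return comparisons


if __name__ == "__main__":
    # Hypothetical miniature of the situation: 1,000 files, two rules per file.
    paths = [f"base/16384/{n}" for n in range(1000)]
    patterns = [rule for path in paths for rule in (path, path)]
    print(linear_scan_comparisons(paths, patterns))
```

In this model the work grows with the product of file count and rule count, so hundreds of thousands of files against 900,000 rules could reach the order of 10^11 comparisons in the worst case. That is consistent with a long single-threaded phase before any data is copied, though the real cost depends on rsync's internals.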

In our use case, the target postgres environment is wiped (or at least, can be treated as a clean slate), so the purpose of this exclude_and_protect filter is very unclear.

There is no option in barman to disable this filter, but we are unsure as to whether commenting the lines (970 and 971 in https://github.com/EnterpriseDB/barman/blob/master/barman/copy_controller.py) would cause other issues - perhaps this method is used both for backups as well as restores, for instance. From the source code, the _RsyncCopyItem looks like it could be instantiated without an exclusion list, but I don't see hooks further up to disable its creation or use.

Does anyone have any guidance or thoughts? Perhaps we have a larger number of files in the database than usual, but I'd be surprised if this slowness hadn't been observed before. The 5 minute restore is a much more reasonable speed than 70 minutes for a 17GB database!

Regards,
--Alan

Abhijit Menon-Sen

Aug 3, 2021, 11:16:35 AM
to pgba...@googlegroups.com
On Tue, Aug 3, 2021 at 5:25 AM Alan Jackson <lan...@gmail.com> wrote:
>
> We've tracked this time spent down to the use of the rsync '--filter merge /tmp/barman-okldjbe8/pgdata_exclude_and_protect.filter' argument. When this option is excluded (via commenting out the option in the copy controller source), the full restore takes only 5 minutes, and appears perfectly valid.
>
> The pgdata_exclude_and_protect.filter file which is generated by barman consists of over 900,000 lines - which must be evaluated on each of the files being rsynced (which is also several hundred thousand files). Rsync spends an inordinate amount of time parsing and evaluating these filters and exclusion lists before even copying the first byte of data to the target.

Hi Alan.

Thanks for the report and your analysis. I agree that taking 70
minutes to restore 17GB into an empty PGDATA is unfortunate, but I'll
have to spend some time looking at the code before I can formulate an
approach to deal with this.

> There is no option in barman to disable this filter, but we are unsure as to whether commenting the lines (970 and 971 in https://github.com/EnterpriseDB/barman/blob/master/barman/copy_controller.py) would cause other issues - perhaps this method is used both for backups as well as restores, for instance.

The code is indeed used for both backups and restores, essentially to
prevent copying files within tablespace directories along with the
main PGDATA (because they are copied separately).

> Perhaps we have a larger number of files in the database than usual, but I'd be surprised if this slowness hadn't been observed before.

I think the reason you might be especially badly affected is that most
people don't have tablespaces whose location is within PGDATA. This is
a configuration that is strongly discouraged by Postgres, which will
even issue a warning if you try to create such a tablespace
("tablespace location should not be inside the data directory"). So if
I'm reading the code correctly, most people won't have much of a
filter at all, where you have ~900k lines.

-- Abhijit

Alan Jackson

Aug 4, 2021, 4:59:51 PM
to Barman, Backup and Recovery Manager for PostgreSQL
Thanks for looking into this.

It's a standard debian10 postgres11 install with very few configuration changes (only related to replication), so I don't believe we're doing anything unorthodox with the data directory/tablespace locations.

One thing that has become evident since talking with the app developers is that there are currently ~200 databases in this data set, and each has ~100 tables. So we are seeing ~450,000 files under /var/lib/postgresql/11/main/base - which seems to be where the list of files in the pgdata_exclude_and_protect.filter comes from (duplicated - one with a 'P' prefix, then another with a '-', hence the 900,000 lines). The resulting cross-product lookups are what cause the elongated single-threaded lookup time within rsync.
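
As a rough illustration of why the filter ends up twice as long as the file list, the sketch below writes one 'P' (protect) and one '-' (exclude) rule per path. The function name, the sample paths and the interleaved ordering of the two rules are assumptions for illustration. This is not barman's code, and the real file may be laid out differently.

```python
from typing import Iterable, List


def build_filter_lines(paths: Iterable[str]) -> List[str]:
    """Return rsync-style filter lines: a protect rule and an exclude
    rule for every path, so the output has two lines per input path."""
    lines: List[str] = []
    for path in paths:
        lines.append(f"P {path}")  # protect rule for this path
        lines.append(f"- {path}")  # exclude rule for this path
    return lines


if __name__ == "__main__":
    # Hypothetical relation file names under base/.
    sample = ["base/16384/2619", "base/16384/2619_fsm", "base/16385/1259"]
    for line in build_filter_lines(sample):
        print(line)
```

With ~450,000 paths as input, a generator like this yields the ~900,000 lines seen in the filter file.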

Regards,
--Alan

Abhijit Menon-Sen

Aug 5, 2021, 1:18:36 AM
to pgba...@googlegroups.com
On Thu, Aug 5, 2021 at 2:29 AM Alan Jackson <lan...@gmail.com> wrote:
>
> It's a standard debian10 postgres11 install with very few configuration changes (only related to replication), so I don't believe we're doing anything unorthodox with the data directory/tablespace locations.

Could you please share the output of \db+ in psql?

-- Abhijit

Michael Wallace

Aug 19, 2021, 7:52:48 AM
to pgba...@googlegroups.com
Hi Alan,

We spent a bit more time digging into this one and confirmed that the excessively large filter file is indeed caused simply by the number of files in PGDATA. I've filed https://github.com/EnterpriseDB/barman/issues/377 for the underlying issue and https://github.com/EnterpriseDB/barman/issues/378 for a short-term solution, which would address the "restoring to an empty directory" case by skipping the expensive rsync call when copying to empty target directories.

Issue 378 is targeted for release 2.14.
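
The idea behind the short-term solution can be sketched as a simple guard: if the target directory is missing or empty, there is nothing to protect, so the expensive filter can be skipped. This is a minimal local-filesystem sketch with hypothetical function names. It is not the implementation planned for issue 378, and for a remote target reached over ssh the check would have to run on the remote host instead.

```python
import os


def is_empty_dir(path: str) -> bool:
    """Return True if path does not exist or is a directory with no entries."""
    if not os.path.exists(path):
        return True
    # os.scandir raises NotADirectoryError if path is a regular file.
    with os.scandir(path) as entries:
        return next(entries, None) is None


def needs_exclude_and_protect(target_dir: str) -> bool:
    """Hypothetical guard: only build and apply the large filter when the
    target already contains something that could need protecting."""
    return not is_empty_dir(target_dir)


if __name__ == "__main__":
    target = "/var/lib/postgresql/11/main/"
    print(needs_exclude_and_protect(target))
```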

Regards,

Mike

Alan Jackson

Aug 19, 2021, 3:38:03 PM
to Barman, Backup and Recovery Manager for PostgreSQL
Fantastic, thanks very much for looking into this!

Regards,
--Alan
