Backing up the db using an asynchronous Python-Pandas plugin


Guido Cioni

Jul 28, 2022, 2:45:43 AM
to weewx-development
I've been looking for a solution to back up my db directly inside weewx, without calling external cron jobs. The backup will be written to a SQL database running on another server.
For this reason I don't want to make a physical copy of the DB, but instead copy only the rows that exist on the local machine and not yet on the remote one, using the epoch time as the key: this will decrease the amount of time spent reading, transferring and writing the data.

As far as I understand, the SQLBackup and RSYNC skins work differently, so my idea was to write a Python script that is called directly after weewx has written a record to the DB and does the following:
  1. get maximum epoch time on local copy of the db
  2. compare with maximum epoch time on remote copy of the db
  3. get rows in local db that have epoch time > epoch time on remote
  4. append rows to the remote copy of the db
(this of course assumes that both dbs are ordered by epoch time, which should be the case)
I will use Pandas to read the data from SQL. Every time this Python script is called, only the rows that need to be appended will be loaded from the local db into memory, so the transfer is not affected by newer appends and the db is left free for weewx to work on.
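
A minimal sketch of steps 1-4, assuming the standard weewx archive table with dateTime as the epoch key and SQLAlchemy engines for both databases (connection strings are placeholders):

import pandas as pd
from sqlalchemy import create_engine, text

# placeholder connection strings: local weewx sqlite archive, remote MySQL/MariaDB copy
local_engine = create_engine("sqlite:////var/lib/weewx/weewx.sdb")
remote_engine = create_engine("mysql+pymysql://user:password@remote-host/weewx")

def sync_archive(table="archive"):
    # steps 1-2: newest epoch time already present on the remote copy
    last_remote = pd.read_sql(text(f"SELECT MAX(dateTime) AS t FROM {table}"),
                              remote_engine)["t"].iloc[0]
    if pd.isna(last_remote):
        last_remote = 0  # remote table is empty: copy everything

    # step 3: read only the local rows newer than that
    new_rows = pd.read_sql(text(f"SELECT * FROM {table} WHERE dateTime > :t"),
                           local_engine, params={"t": int(last_remote)})

    # step 4: append them to the remote copy
    if not new_rows.empty:
        new_rows.to_sql(table, remote_engine, if_exists="append", index=False)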

Is this a good strategy, or does it have pitfalls? I should be able to implement it fairly easily (I have already done something similar in the past); I only need to understand how to package it as a weewx plugin.

Vince Skahan

Jul 28, 2022, 10:23:04 AM
to weewx-development
Can you get there by doing MySQL replication at the database level?

Guido Cioni

Jul 28, 2022, 10:27:43 AM
to weewx-development
From what I understand weewx uses sqlite by default, and with sqlite the only possibility is to dump the whole database every time (https://github.com/glennmckechnie/weewx-sqlbackup/wiki/SQLBackup-README), which is obviously time-consuming.

With Pandas you can get around that by doing queries and requesting only the data that you need to transfer each time; the data gets read into memory allocated by Python, so the db itself is left untouched. Of course one does not necessarily have to use Pandas, it's just one of the frameworks that I'm familiar with.

Plus, it seems that none of the extensions in the wiki allow saving data directly into a remote DB; they only make a physical copy of the DB (again, from my understanding).

Doug Jenkins

Jul 28, 2022, 10:31:55 AM
to Guido Cioni, weewx-development
Guido:

If this is that important to you, I was going to recommend configuring a database cluster. MariaDB has Galera Clustering (What is MariaDB Galera Cluster? - MariaDB Knowledge Base), which will perform database replication and distribution for you. You need at least three nodes/servers to have a proper cluster.

Keep in mind that WeeWX only writes to the database during the archive operation, so if you archive every minute, you are only writing 1,440 records per day to the archive table (and another 1,440+ records to the summary tables per measurement). That is literally kilobytes of data per day. With today's networks, performance is not an issue.

You can spend the time to build an incremental backup solution, but I think that is a lot of work for a system that writes so little data each day.

Just my $0.02 

DDJ


Guido Cioni

Jul 28, 2022, 10:35:48 AM
to weewx-development
Hey Doug, 
thanks for chiming in.
As a matter of fact, I'm not really worried about the data to be transmitted incrementally, which, as you say, is not large by any means.

What I'm trying to avoid is having to transfer the whole database dump every time a backup is performed, because that can become demanding as the archive grows over many years: I have another station running a similar DB, and after more than 10 years the full DB has grown to about 400 MB...

Doug Jenkins

Jul 28, 2022, 10:56:34 AM
to Guido Cioni, weewx-development
Ok, I am getting the gist of your concern.

A possible approach here, if you are running MariaDB on your WeeWX server, is to install MariaDB's client tools on the target server where you store your backups and run their Mariabackup tool to do incremental/differential backups. Here is an article to review: Incremental Backup and Restore with Mariabackup - MariaDB Knowledge Base

This configuration gives you the ability to back up the instance over the wire and keep incremental backups until you perform a full backup again.

If you take this approach, make sure you run periodic full backups between the incrementals, as it can be tedious to restore a database with a large number of incremental backups to apply.
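
Sketching how that could be scripted, in Python since that's the context of this thread; the paths, credentials and the full/incremental split are placeholders, and the --backup/--target-dir/--incremental-basedir options are the ones described in that Knowledge Base article:

import subprocess
from datetime import date
from pathlib import Path

# placeholder paths and credentials
BACKUP_ROOT = Path("/backup/weewx")
FULL_DIR = BACKUP_ROOT / "full"
CREDS = ["--user=backup", "--password=secret"]

def full_backup():
    # full baseline backup; the incrementals below are taken against this directory
    FULL_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(["mariabackup", "--backup", f"--target-dir={FULL_DIR}", *CREDS],
                   check=True)

def incremental_backup():
    # one incremental per day, each relative to the last full backup
    inc_dir = BACKUP_ROOT / f"inc-{date.today().isoformat()}"
    inc_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(["mariabackup", "--backup", f"--target-dir={inc_dir}",
                    f"--incremental-basedir={FULL_DIR}", *CREDS],
                   check=True)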

Personally, I just run mysqldump on each of my databases each day, compress the export, and copy it over to my NAS. Not the most efficient process, but it is simple for me and has been helpful in those situations where I needed to restore a backup.

DDJ

Vince Skahan

Jul 28, 2022, 3:08:11 PM
to weewx-development
On Thursday, July 28, 2022 at 7:35:48 AM UTC-7 guido...@gmail.com wrote:
What I'm trying to avoid is having to transfer the whole database dump every time that a backup is performed, because that can become demanding when the archive grows for many years: I have another station running a similar DB and after more than 10 years the full DB has become about 400 MB in size...


Ummm - so what if it's 400 MB in size these days? Disk is cheap, and your old archives aren't worth much to you after your datalogger runs out of space anyway. Just keep the last few backups and gzip them before saving.

My weewx.sdb is 393 MB but it gzips down to 84 MB.  That's nothing these days.

I do the same thing Doug mentioned for MySQL, but I use the default sqlite3 db for simplicity reasons.

You're jumping to a solution, looking for a way to back up only diffs (why?) without using cron (again, why?), when you really have no problem to begin with. Just dump your db and compress the result. Check that your backups are actually good for restores, too. There are lots of threads in the weewx-user group with links to code folks have written to do that.
 

Glenn McKechnie

Jul 28, 2022, 8:28:01 PM
to Guido Cioni, weewx-development
On 29/07/2022, Guido Cioni wrote:
> From what I understand weewx uses sqlite because of the default, and with
> sqlite the only possibility is to dump the whole database every time
> (https://github.com/glennmckechnie/weewx-sqlbackup/wiki/SQLBackup-README),
> which is obviously time consuming.

Looks like the author (me) forgot to add a note to that section of the README.
I'd better go and sync the sqlbackupREADME.html with the github README.md !

The extension was modified to also do partial backups of the sqlite database, so in that respect there is now no difference between MySQL/MariaDB and sqlite when using this skin.

The skin.conf does default to partial sqlite backup and has comments
in that file regarding changing those settings.

I wrote this to scratch an itch. I wanted it all handled by weewx, and to run during a time when no writes were occurring. The weewx report cycle is perfect for that, provided we don't ask weewx to handle too much extra, in which case the reports may be delayed.
I've done full database backups using this skin and it works... for me. I do that on an occasional basis (by using report_timing on another, uniquely named instance) so that there is an updated, more recent baseline to start the recovery from. That means I interfere
with report generation during that time, but it recovers.
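
For illustration, a hedged sketch of what that second instance can look like under [StdReport] in weewx.conf (the section names follow the SQLBackup README, the "weekly" name is just a placeholder, and the full-versus-partial choice itself is made through the skin's own options, so check the README for those):

[StdReport]
    ...
    # regular instance: partial backups on the normal report cycle
    [[sqlbackup]]
        skin = sqlbackup
        enable = true

    # second, uniquely named instance that only runs occasionally,
    # configured to take a full dump as a fresh baseline
    [[sqlbackupweekly]]
        skin = sqlbackup
        enable = true
        report_timing = 5 3 * * 0    # 03:05 every Sunday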

Mine was set to daily (the default); it now only runs weekly (because I have a replicated, off-site, mysql database). It runs seamlessly with those settings. YMMV.

The recovery using sqldump files is a laborious process, but it's covered in the sqlbackupREADME.html and it worked when I wrote those notes. Anyone who uses this skin needs to test the recovery process themselves, as you would for ANY backup method.

Like I said, this scratched my itch. There's an abundance of other methods.

If you do find a problem with this method, one that needs fixing -
then let me know.

--


Cheers
Glenn

rorpi - read only raspberry pi & various weewx addons
github.com/glennmckechnie

Guido Cioni

Aug 7, 2022, 3:43:17 AM
to weewx-development
Well... space and bandwidth may be an issue in remote installations, like the one that I have, where everything goes through a metered connection.
That's why I'm trying to find a way to transmit only the data that actually needs to be transferred at each backup: this way I could back up more frequently.
Imagine sending a 400 MB file every hour through a connection that has a 4 GB monthly cap :) 
Even if it's compressed I still don't see the reason to send the entire db every time just because "it's easy and disk is cheap". 

About cron: it's not that I don't want to use it, but I'm trying to find a solution that works inside weewx directly so that it's more portable. I already have other processes running in cron, and it would be easier to manage if this were handled by weewx directly.

Anyway, thanks for the input. 

Vince Skahan

Aug 7, 2022, 8:33:07 PM
to weewx-development
Then do a full backup 'occasionally' and incrementals more often in between. That's a pretty typical backup strategy. How many incrementals you keep dictates how painful a restore would be.

Some example data:
  • my 15+ years of data is 68 MB if I dump just the archive table and gzip it
  • a 'day' of incremental additions to the archive table is 25 KB gzipped - so a month of those adds under 1 MB more
So you could easily do weekly fulls (4 × 68 MB) and incrementals on the other days (26 × 0.025 MB) and be far under 300 MB of total backup size per month.

While you 'could' do this within weewx, I would still strongly suggest an external utility 'and' cron. That way you leave weewx completely alone and can tune how often you do fulls and incrementals to your liking.

Ryan Stasel

Aug 7, 2022, 8:51:48 PM
to Vince Skahan, weewx-development
Maybe I'm crazy, but I just use cron: shut down weewx every night, gzip the db to a file share, then start weewx back up. Super lazy, but when ~15 years of data is only 80 MB or so gzipped, it's not a huge deal. I think I keep 30 days' worth of backups.

On Aug 7, 2022, at 17:33, Vince Skahan <vince...@gmail.com> wrote:

Then do a full 'occasionally' and incrementals more often in between.  That's pretty typical backup strategy.   How many incrementals dictates how painful a restore would be.

Graham Eddy

Aug 7, 2022, 10:56:45 PM
to weewx-de...@googlegroups.com
In practice I do the same (a copy of the 180 MB uncompressed db in cron-controlled daily backups to a local file share), but it is so quick I just time it to miss a weewx archive update rather than stopping weewx. I also copy the db itself offsite once daily (to an internet web server).
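
For anyone who wants that nightly copy without worrying about catching weewx mid-write, a minimal sketch using Python's sqlite3 online backup API (Connection.backup, available since Python 3.7), which yields a consistent snapshot of the live db; paths are placeholders:

import gzip
import shutil
import sqlite3
from datetime import date
from pathlib import Path

# placeholder paths
SRC = "/var/lib/weewx/weewx.sdb"
DEST_DIR = Path("/mnt/backups/weewx")

def nightly_copy():
    DEST_DIR.mkdir(parents=True, exist_ok=True)
    snapshot = DEST_DIR / f"weewx-{date.today().isoformat()}.sdb"

    # consistent snapshot of the live database, no need to stop weewx
    src = sqlite3.connect(SRC)
    dst = sqlite3.connect(str(snapshot))
    try:
        src.backup(dst)
    finally:
        src.close()
        dst.close()

    # compress the snapshot and drop the uncompressed copy
    with open(snapshot, "rb") as f_in, gzip.open(f"{snapshot}.gz", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    snapshot.unlink()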