[Cross-posting to news:comp.unix.shell, as related questions are
raised there from time to time.]
More than occasionally, there's a need to create and maintain an
"inventory" of the filesystem contents, whether to facilitate
backups, to maintain a history of file additions and removals, or
for other purposes. Below, I describe my current view of a
possible design for a tool aiming to solve such a task
(tentatively named FCCS, for Filesystem Content Checking
(Control?) System), and relate my experience with implementing and
using the versions of this tool that rely on SQLite and on a
custom ASN.1-based file format for history storage.
First try: SQLite
Somewhat recently, I've given this task one more try. My first
approach was to use a SQLite database to record the current
state and the history of filesystem changes. The basic "units"
of the filesystem state being tracked are:
* the results of the stat(2) system call;
* the message digests computed over (regular) file contents.
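For illustration, gathering these two kinds of data for a single
file in Perl might look roughly as follows. (This is only a
sketch, not the actual FCCS code; the helper names are made up,
and SHA-1 is assumed here as the digest algorithm.)

    use File::stat ();
    use Digest::SHA ();

    # Gather the stat(2) results for a path, without following
    # symbolic links.
    sub scan_stat {
        my ($path) = @_;
        my $st = File::stat::lstat($path)
            or die "lstat($path): $!";
        return $st;
    }

    # Compute a digest (SHA-1 here) over a regular file's contents.
    sub scan_digest {
        my ($path) = @_;
        my $sha = Digest::SHA->new(1);
        $sha->addfile($path);
        return $sha->hexdigest;
    }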
Each time a "filesystem scan" is performed, the software records
a "session", which binds filenames to stat(2) results and
contents' digests. Should there be a "previous" session, its
data is "linked" to the newly created one, and is then used to
avoid computing the digests for the unchanged files.
Specifically, the digest is computed if the stat(2) results
obtained for the file (sans the st_atime and st_dev fields) are
/not/ bound to any (filename, digest) pair within the current or
"previous" session.
On the other hand, if there /is/ such a binding, the digest may
be recomputed anyway if the binding is itself too old. For this
reason, the bindings are themselves timestamped. (And so are
the sessions.)
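With the above in place, the decision whether to (re)compute a
digest boils down to a lookup keyed on the relevant stat(2) fields
plus an age check on the binding's timestamp. A minimal sketch,
with made-up structures: here, %$bindings is assumed to map such a
key to the binding's timestamp, as loaded from the current and
"previous" sessions.

    # Build a lookup key from the stat(2) results, leaving out
    # st_atime and st_dev as described above (only a subset of the
    # remaining fields is shown here.)
    sub stat_key {
        my ($st) = @_;
        return join(",", $st->ino, $st->mode, $st->nlink,
                    $st->uid, $st->gid, $st->size,
                    $st->mtime, $st->ctime);
    }

    # Return true if the digest has to be (re)computed: either
    # there's no binding for these stat(2) results, or the binding
    # itself is older than $max_age seconds.
    sub need_digest {
        my ($st, $bindings, $max_age) = @_;
        my $stamp = $bindings->{stat_key($st)};
        return 1 unless defined $stamp;
        return (time() - $stamp) > $max_age;
    }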
(For anyone wishing to try this version, its sources can be
found at [1, 2].)
Trying a bunch of ASN.1-based files instead
Unfortunately, despite my attempts to implement a
space-efficient database schema, the resulting database is not
only large to begin with (about 360 bytes per tracked file for
the first session) but also grows rather rapidly (about 86 bytes
per tracked file per session.)
Also, as the database grows, adding new records takes
progressively longer. For the third session of a database
tracking about 300000 files, the rate could be as low as a few
regular files per second on fairly decent hardware. Indeed,
such a session may take a whole day to complete!
This, combined with the fact that there seem to be no free
software extensions for transparent SQLite database file
compression, prompted me to change my strategy.
Namely, I've decided that instead of using a single database
file for all the sessions, I'd use one binary "log" file per
session, while allowing the tool both to read the files
containing previous sessions (so as to compute digests only for
the new and changed files) and to /reference/ them from within
the resulting file (so as to conserve space.)
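To give an idea of how such a per-session file could be put
together: Convert::ASN1 encodes records into self-delimiting BER
blobs that can simply be appended to the session file and later
read back one after another. The record type below is made up
for illustration only (the actual schemata are the ones in the
repository), and the snippet reuses the scan_stat and scan_digest
helpers from the earlier sketch.

    use Convert::ASN1 ();

    # A made-up record type, for illustration only.
    my $asn = Convert::ASN1->new;
    $asn->prepare(q<
        FileRec ::= SEQUENCE {
            name    OCTET STRING,
            size    INTEGER,
            mtime   INTEGER,
            digest  OCTET STRING OPTIONAL
        }
    >) or die $asn->error;
    my $rec = $asn->find('FileRec')
        or die $asn->error;

    # Encode one record per file and append it to the session file.
    sub write_record {
        my ($log, $path, $st, $digest) = @_;
        my $pdu = $rec->encode(name   => $path,
                               size   => $st->size,
                               mtime  => $st->mtime,
                               digest => $digest)
            or die $rec->error;
        print {$log} $pdu
            or die "write: $!";
    }

    open(my $log, ">>:raw", "session-0001.log")
        or die "session-0001.log: $!";
    write_record($log, $_, scan_stat($_), scan_digest($_))
        for @ARGV;
    close($log)
        or die "session-0001.log: $!";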
As there's no longer a need to allow concurrent access to a
single file (files containing previously recorded sessions can
safely be examined while the new file is being written), in-file
database indices become less relevant.
Additionally, as the files in question are read mostly
sequentially, it becomes feasible to compress them even while
they're still in use. Also, as there are distinct files for
distinct sessions, it's easy to compress those no longer needed,
or to move them to cheaper storage (slow HDDs and DVD+Rs.)
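Since a session file is only ever read front to back, it can,
for example, be gzip-compressed and still consumed with a
streaming reader. A rough sketch (the file name is made up):

    use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

    # Stream a gzip-compressed session file; the records are only
    # ever read front to back, so no random access into the
    # compressed data is needed.
    my $z = IO::Uncompress::Gunzip->new("session-0001.log.gz")
        or die "gunzip: $GunzipError";
    my $buf = "";
    while ($z->read(my $chunk, 1 << 16) > 0) {
        $buf .= $chunk;
        # ... split $buf into complete records and decode them ...
    }
    $z->close;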
Now, to the numbers. For the test run, I've processed 47151
files (41285 of them regular, 2.6 GiB in total), and it took only
6:21 to produce the corresponding 251978 records (10 MiB in
total.) The time is comparable to that of a plain sha1sum(1) run
(about 4:20), and the space demands come to roughly 222 bytes per
file on average (10 MiB over 47151 files; compare to the
SQLite-based version's 360 bytes above.)
I: file_bind=251978, file_rec=251977, digest=251976
132.12user 13.32system 6:21.35elapsed 38%CPU (0avgtext+0avgdata 233008maxresident)k
4096210inputs+8outputs (4major+14613minor)pagefaults 0swaps
Unfortunately, the use of previously recorded sessions is not
implemented at the moment, so I don't have the space
requirements for the "incremental update" case at hand, but I
expect them to be several times lower than those for the
SQLite-based version. Combined with the ability to easily
compress or move away the older sessions, I hope this may
finally get the tool to a usable and useful state.
(For anyone wishing to try this version, its sources can also
be found at [1, 2], under the fccs-2012-03-asn.1 branch.)
Miscellaneous facts
The software in question is written in Perl and was primarily
tested on Debian GNU/Linux 6.0 "Squeeze" systems. It depends on
a few Perl packages, such as Digest::SHA and UUID, and also on
either DBD::SQLite and DBI (for the SQLite-based version) or
Convert::ASN1 (for the ASN.1-based one.)
It's my intention to make the software available under the
GNU General Public License version 3 or later, and the related
data schemata will probably be released into the public domain.
References
[1] http://gray.am-1.org/~ivan/archives/git/gitweb.cgi?p=fc-2012.git
[2] http://gray.am-1.org/~ivan/archives/git/fc-2012.git/
--
FSF associate member #7257