Re: filesystem content tracking

Ivan Shmakov

unread,

Mar 19, 2012, 7:03:17 AM3/19/12

to

>>>>> Ivan Shmakov <onei...@gmail.com> writes:

[Cross-posting to news:comp.software.testing, too.]

[...]

> Now, to the numbers. For the test run, I've processed 47151 files
> (41285 regular, 2.6 GiB in total), and it took only 6:21 to produce
> the corresponding 251978 records (10 MiB in total.)

More precisely, 251978 (or, rather, 251979) is the number of
/object definition/ records only.

The distribution of the record types is as follows:

251979 objectDef
48253 blob
47192 filename
47151 fileBind
42883 solidStat
42883 fileRecord
23615 digest
1 uuid
1 digestKind
47151 objectConfirm
2559 filePadding
381 timestamp
1 sessionHeader
1 sessionTrailer

(Please also note that for the testing purposes, the block
padding size was set to 4 KiB, which is a rather low value. I
guess that the reasonable default for the released version of
the code would be around 128 KiB, and the number of filePadding
records will be proportionally lower.)

> The time is comparable to raw sha1sum(1) (about 4:20), and the space
> demands are roughly 222 bytes per file on average (compare to the
> SQLite-based version's 360 above.)

> I: file_bind=251978, file_rec=251977, digest=251976
> 132.12user 13.32system 6:21.35elapsed 38%CPU (0avgtext+0avgdata 233008maxresident)k
> 4096210inputs+8outputs (4major+14613minor)pagefaults 0swaps

> Unfortunately, the use of the previously recorded sessions is not
> implemented at this moment, so I don't have the space requirements
> for the "incremental update" case at hand, but I expect them to be
> times lower than those for the SQLite-based version. Combined with
> the ability to easily compress or move away the older sessions, I
> hope this may finally get the tool to a usable and useful state.

And here they are:

I: loaded, id-start=0, id-next=251979
I: session
...
I: file_bind=251978, file_rec=251977, confirmed
128.16user 4.96system 2:37.34elapsed 84%CPU (0avgtext+0avgdata 2020096maxresident)k
35824inputs+0outputs (0major+127316minor)pagefaults 0swaps

This time, only 207 digests were computed (preasumably for the
new and changed files in the set), and most of the time was
seemingly spent reading the "previous" session.

The resulting file is less than 774 KiB, contains 61858 records
total, 47393 of whose are "objectConfirm" ones (whose number
should be equal to the number of files processed.) The average
space consumption is thus less than 17 bytes per file, which is
a huge win when compared to the 86 bytes per file per session in
the previous, SQLite-based version.

The distribution of the record types in this "incremental" file
is as follows:

47393 objectConfirm
14218 objectDef
6578 fileRecord
6578 fileBind
369 blob
272 solidStat
243 filename
177 digest
1 file
193 filePadding
51 timestamp
1 objectMapDef
1 sessionHeader
1 sessionTrailer

The objectDef records here should be mainly fileRecord and
fileBind ones, due to the changes in Unix access times (a-times)
of the files processed. Should a-time recording be disabled (or
the filesystem be mounted with a-time updates disabled), I
expect much fewer of these records.

> (For anyone wishing to try this version, its sources could also be
> found at [1, 2], under the fccs-2012-03-asn.1 branch.)

... Though I'm yet to upload the recent changes there.

[...]

> [1] http://gray.am-1.org/~ivan/archives/git/gitweb.cgi?p=fc-2012.git
> [2] http://gray.am-1.org/~ivan/archives/git/fc-2012.git/

--
FSF associate member #7257

Ivan Shmakov

unread,

Mar 20, 2012, 12:37:30 PM3/20/12

to

Just a couple more tests.

Case A: 37 GiB, 1.58 MiN file archive

$ df -- /mnt
Filesystem 1K-blocks Used Available Use% Mounted on
XXXX 41274541 38857603 320938 100% /mnt
$ df -i -- /mnt
Filesystem Inodes IUsed IFree IUse% Mounted on
XXXX 5242880 1661012 3581868 32% /mnt
$ find /mnt -xdev -print0 \
| LC_ALL=C /usr/bin/time fc-scan -o fc-asn > fc-asn.log 2>&1
$ tail -n3 < fc-asn.log
I: file_bind=9817000, file_rec=9816999, digest=9816998
5145.29user 340.74system 3:23:05elapsed 45%CPU (0avgtext+0avgdata 15393312maxresident)k
61162914inputs+0outputs (4768major+1013779minor)pagefaults 0swaps
$

The resulting ASN.1 (BER) "log" file and the program's own text
log ("debug output") are as follows:

410017823 Mar 20 03:50 fc-asn
292657597 Mar 20 03:50 fc-asn.log

Which gives roughly 247 bytes and 5.91 records per filesystem's
inode (or 10.6 bytes and 0.26 records per KiB), at 6.9 MiB and
303 inodes processed per 1 second CPU time.

Case B: 15.9 GiB file archive

$ df -- /mnt
Filesystem 1K-blocks Used Available Use% Mounted on
XXXX 24770940 16678072 6834580 71% /mnt
$ find /mnt -xdev -print0 \
| LC_ALL=C /usr/bin/time fc-scan -o fc-asn > fc-asn.log 2>&1
$ tail -n3 < fc-asn.log
I: file_bind=919966, file_rec=919965, undigestible
494.76user 28.93system 20:31.95elapsed 42%CPU (0avgtext+0avgdata 1495824maxresident)k
7575336inputs+0outputs (4major+93549minor)pagefaults 0swaps
$

The resulting ASN.1 (BER) "log" file and the program's own text
log ("debug output") are as follows:

38083266 Mar 19 23:42 fc-asn
24922215 Mar 19 23:42 fc-asn.log

Which gives roughly 2.28 bytes and 0.06 records per KiB, at
31.1 MiB processed per 1 second CPU time.

Ivan Shmakov

unread,

Mar 22, 2012, 12:41:12 AM3/22/12

to

>>>>> Ivan Shmakov <onei...@gmail.com> writes:
>>>>> Ivan Shmakov <onei...@gmail.com> writes:

[...]

>> Unfortunately, the use of the previously recorded sessions is not
>> implemented at this moment, so I don't have the space requirements
>> for the "incremental update" case at hand, but I expect them to be
>> times lower than those for the SQLite-based version. Combined with
>> the ability to easily compress or move away the older sessions, I
>> hope this may finally get the tool to a usable and useful state.

> And here they are:

[...]

One thing to note, however, is that there's no proper support
for "incrementally updating" a session which is in turn an
"incremental update." IOW, if one records session A, and then
records an "update" B, it's not currently possible to use both A
and B combined as the basis for a further update C.

As a work-around, the C session may be recorded as an update to
A instead. (Thus getting all the files changed or added since A
rehashed for both B, C, and all the subsequent sessions based on
A.) Like:

A --> B --> C --> D currently not implemented;

A possible, though not as time and space
|--> B efficient.
|--> C
`--> D

The support for such a mode of operation shouldn't be that hard
to implement, but I wish to make some clean-ups to the code
before that. (In particular: split the code into separate Perl
modules, commit the "dump" utility source to the Git repository,
provide a CPAN package, allow for a-times and other not that
important metadata to be omitted from the record.)

>> (For anyone wishing to try this version, its sources could also be
>> found at [1, 2], under the fccs-2012-03-asn.1 branch.)

> ... Though I'm yet to upload the recent changes there.

Done.