> Anyway, let us know what you find; we are certainly going to need
> such predictions over here.
Sorry, I didn't have time to deal with this problem until now. Your
two-pass solution (isi smartpools apply --dont-restripe --recurse ..., then
isi get -r -D) seems nice. Did you give it a try?
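In case it's useful, here is roughly how I picture chaining those two passes
(untested on my side; /ifs/data is only a placeholder path, adjust to your layout):

  # pass 1: re-evaluate file pool policies without actually restriping/moving data
  isi smartpools apply --dont-restripe --recurse /ifs/data

  # pass 2: dump detailed per-file attributes for review (which should show the
  # pool targeting, if I read your suggestion correctly)
  isi get -r -D /ifs/data > /tmp/isi_get_dump.txt

That should at least tell you which files a new policy would want to move, before letting the job loose.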
For now I'm using
http://sourceforge.net/apps/trac/robinhood over an NFSv3
non-root-squash export. It's clearly more tailored to Lustre filesystems
(where it uses live information from the transaction log), but it can handle "generic"
filesystems quite well too. It is not as powerful as InsightIQ could be,
but it's working for me now, which is quite an advantage =)
It basically scans the filesystem (using an efficient multithreaded scanning
algorithm) and puts the information in a MySQL database. After that, you've got
CLI utilities and a web GUI to retrieve information from it.
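If it helps, the one-shot scan is launched with something like this (from memory, so
double-check the exact options against the admin guide; the config file path is just an example):

  # scan the NFS-mounted filesystem once and populate the MySQL database
  robinhood -f /etc/robinhood.d/myfs.conf --scan --once

Everything below is then answered from the database, without touching the filer again.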
Sample outputs from my data:
========
# rbh-report -i 2>/dev/null
type , count, volume, avg_size
symlink , 24873, 1.80 MB, 76
dir , 2313600, 3.94 GB, 1.79 KB
file , 109642670, 365.35 TB, 3.49 MB
fifo , 4, 0, 0
sock , 5, 0, 0
Total: 111981152 entries, 401715151109476 bytes (365.36 TB)
# rbh-report -a 2>/dev/null
Filesystem scan activity:
Current scan interval: 7.0d
Last filesystem scan:
status: done
start: 2014/03/04 00:10:08
end: 2014/03/06 06:01:59
duration: 2d 5h 51min 51s
Statistics:
entries scanned: 111982086
errors: 1
timeouts: 0
# threads: 8
average speed: 577.62 entries/sec
Storage usage has never been checked
No purge was performed on this filesystem
========
(I'm filtering out stderr because of warnings concerning my Filesets definition; see below.)
So, two days to scan 365 TB and more than 100 million entries
(a HUGE number of small files; during the scan the rate oscillated between
6000 entries/s and 20 entries/s, and the filer is in production).
The result is a 40 GB MySQL database.
I've defined some policies based on last access time in the configuration file.
To give you an idea:
Filesets
{
    FileClass between_0m_and_1m
    {
        definition
        {
            last_access <= 30day
        }
    }
    FileClass between_1m_and_3m
    {
        definition
        {
            last_access > 30day
            and
            last_access <= 90day
        }
    }
    FileClass between_3m_and_6m
    {
        definition
        {
            last_access > 90day
            and
            last_access <= 180day
        }
    }
    FileClass more_than_6m
    {
        definition
        {
            last_access > 180day
        }
    }
}
FYI, Filesets need to be referenced in the purge_policies (even if you don't purge anything):
purge_policies {
    ignore_fileclass = between_0m_and_1m;
    ignore_fileclass = between_1m_and_3m;
    ignore_fileclass = between_3m_and_6m;
    ignore_fileclass = more_than_6m;
}
This gives me the following kind of output after 10 minutes of processing
(I'm clearly not a database expert; it's running on a VM with 8 GB of RAM,
on an NFS datastore hosted on the very production cluster I'm crawling... There
is some room for performance improvement here :D)
# rbh-report --class-info 2>/dev/null
[10 minutes later]
purge class , count, spc_used, volume, min_size, max_size, avg_size
between_0m_and_1m , 12421765, 80.48 TB, 63.41 TB, 0, 135.22 GB, 5.35 MB
between_1m_and_3m , 26631999, 68.34 TB, 50.29 TB, 0, 137.86 GB, 1.98 MB
between_3m_and_6m , 7976866, 100.27 TB, 79.80 TB, 0, 4.00 TB, 10.49 MB
more_than_6m , 62636922, 226.09 TB, 171.86 TB, 0, 1.60 TB, 2.88 MB
So now I've got some metrics (and some histograms/pie charts in the web GUI) to estimate how
much data a new SmartPools tiering policy would move around, plus some nice CLI tools to retrieve
information from the database (rbh-find, rbh-du); see the examples below. You have to keep in mind
obvious caveats, such as the access time being relative to the moment the robinhood scan engine
reaches a given file, but this still gives you a fairly precise view of the state of your data.
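For example, the kind of queries I run look like this (syntax from memory; both tools mimic
find(1) and du(1), so check --help for the exact options in your version; the path and user
are placeholders):

  # files bigger than 1 GB belonging to a given user, under a given subtree
  rbh-find /mnt/filer/project -user jsmith -type f -size +1G

  # du-like space usage for a subtree
  rbh-du /mnt/filer/project

Both are answered from the MySQL database rather than by walking the filesystem again.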
The full admin guide should give you a good idea of what is possible:
http://sourceforge.net/apps/trac/robinhood/wiki/Doc
It's a nice generic alternative to InsightIQ: an active project (last release was a few weeks ago)
with an active community. At the very least, it covers the initial need I exposed in this thread.
Of course, InsightIQ is a much more powerful solution when your FSAnalyze job has succeeded ;)
Jean-Baptiste