BeeGFS Data Management
by Peter Braam, CTO
January 2021
Welcome to the BeeGFS blog. I intend to occasionally publish some of our architectural thoughts here, specifically to solicit feedback, either publicly in blog comments or on our mailing list. Alternatively, you can send an email to cto <at> thinkparq <dot> com. We expect that posts here will be most useful during our early consideration of features, and that there will be two to four posts per year.
Many partners ask about managing data across collections of servers in their clusters. For example, how can data be rebalanced between emptier, new servers and fuller, old servers? Technically, we use the word pool, e.g. old pool and new pool, to refer to the available storage on the older and newer servers. There are variations on this question, for example: how can we leverage servers with NVMe drives that form a fast storage tier? Here the word tier is another technical term designating a collection of storage services. Utilizing tiers means moving files to a fast tier to have them ready for fast transfer to clients, and moving them back to a pool on a slower tier when they have been idle for some time and merely consume capacity on the expensive, fast devices. In yet another scenario, we would want to move files that have not been accessed for some time, or that belong to a finished project, into an S3 object store.
Figure 1: BeeGFS cluster requiring space rebalancing and fast tier management
During discussions with our partners, several other features were requested, and we began to see how closely related they are. The following features appear to be at the top of the wishlist:
handling tiers - placing or moving data into faster or slower tiers
data rebalancing - within and between pools
data movement - parallel data movement within, into, and out of BeeGFS
feeds into iRODS, S3-like object storage, HSM systems, and auditing systems
directory subtree quotas
ThinkParQ prefers to keep the BeeGFS file system fast, nimble, and easy to use, and in this spirit, the architecture proposed here combines minor features in the file system with a few external utilities, some of which exist as open source software or can be derived from it.
Consider the following file system interfaces and external utilities:
BeeGFS event logging - including read-only file access and quota events
Atomically swap the data between two BeeGFS files
Block or abort concurrent writes on a single file
A powerful data-movement tool, like Los Alamos’ pftool
A sub-tree aware file system inventory database
Directory quota management through the client file system's directory tree
In this blog post, we are going to briefly discuss these interfaces and utilities and see how they will enable the features above.
Let’s start with a summary. All of the feature requests share a common requirement, namely having knowledge about collections of files. Such knowledge can be extracted from the existing BeeGFS event log, which provides file access and modification events on a socket, in real time, on a file-by-file basis. For historical knowledge about larger collections of files, we will use a subtree-aware inventory database. Below, we will illustrate what these databases can do and how they are maintained. Most of the features under discussion involve data movement, and we include a powerful parallel file-moving tool as one of the desirable utilities.
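To make this concrete, here is a minimal sketch of an event consumer. The socket path, the event field names, and the newline-delimited JSON encoding are all assumptions made purely for illustration; the real BeeGFS event stream has its own binary format that a production tool would need to parse.

    import json
    import socket

    EVENT_SOCKET = "/var/run/beegfs/eventlog.sock"  # hypothetical path

    def consume_events(handler):
        # Dispatch each event (assumed to be a JSON object with "type"
        # and "path" fields) to the supplied handler as it arrives.
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.connect(EVENT_SOCKET)
            buf = b""
            while chunk := sock.recv(65536):
                buf += chunk
                while b"\n" in buf:
                    line, buf = buf.split(b"\n", 1)
                    handler(json.loads(line))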
Now, let’s look at the additional logic to provide the core functionality of the features.
If we want to exploit a fast tier dynamically, a dynamic-fast-tier (DFT) utility should observe the event log to detect common access patterns. For example, it can anticipate that an application is going to iterate through all the files in a directory when it sees that a few have already been opened.
Figure 2: event log and dynamic adaptation for fast tiers
When the DFT utility observes such a pattern, it will invoke the data mover to copy the file to a faster tier. LANL has created a sophisticated parallel data mover called pftool, which carefully manages work queues across a collection of data mover nodes and includes interfaces that can perform I/O with object stores. We will see below why a powerful utility will be useful.
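As a sketch of what the DFT utility's detection loop might look like, the handler below (written against the hypothetical consume_events() above) stages a directory onto the fast tier once it has seen a few read-opens there within a short window. The event type, the fast-tier mount point, and the pfcp invocation are illustrative assumptions, not fixed interfaces.

    import os
    import subprocess
    import time
    from collections import defaultdict

    OPEN_THRESHOLD = 3    # read-opens in one directory that suggest a sweep
    WINDOW_SECONDS = 30   # how recent those opens must be
    FAST_TIER_ROOT = "/mnt/beegfs-fast"   # hypothetical staging area

    recent_opens = defaultdict(list)      # directory -> open timestamps

    def on_event(event):
        if event["type"] != "open-read":  # hypothetical event type
            return
        dirname = os.path.dirname(event["path"])
        now = time.monotonic()
        opens = [t for t in recent_opens[dirname] if now - t < WINDOW_SECONDS]
        opens.append(now)
        recent_opens[dirname] = opens
        if len(opens) >= OPEN_THRESHOLD:
            # Anticipate a full directory sweep: hand the whole directory
            # to the parallel mover (illustrative pfcp invocation).
            subprocess.run(["pfcp", dirname, FAST_TIER_ROOT])
            recent_opens.pop(dirname)

Calling consume_events(on_event) ties the two sketches together into a single staging daemon.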
But there is a subtlety: just copying the file to the fast tier isn't quite enough. During the copy operation, we may want to abort the copy or block concurrent writes to the file by another process. This is necessary to keep the data migration consistent, and it is not a complicated feature: we simply abort our copy when another process opens the migrating file for writing. When the copy is done, the original file should point to the data on the new tier. A convenient interface for this is to swap the data in the original and the copied file, which is a quick metadata-only operation. After this, we can delete the inode to which we copied, which now references the data on the old tier.
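The sequence could look roughly like the sketch below. beegfs_copy_abortable() and beegfs_swap_data() are hypothetical names for the two proposed interfaces (abortable copy and atomic data swap); neither exists in BeeGFS today.

    import os

    def migrate(path, tier_path):
        tmp = tier_path + ".migrating"
        # 1. Copy the file, aborting if another process opens `path`
        #    for writing while the copy is in flight (hypothetical API).
        if not beegfs_copy_abortable(src=path, dst=tmp):
            os.unlink(tmp)        # a writer appeared; give up cleanly
            return False
        # 2. Atomically swap the data between the two inodes, a quick
        #    metadata-only operation: `path` now points at the fast-tier
        #    copy, and `tmp` at the stale data on the old tier.
        beegfs_swap_data(path, tmp)
        # 3. Delete the inode that now carries the old-tier data.
        os.unlink(tmp)
        return True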
Is it plausible that these relatively simple interfaces and a copy tool can give a reasonable solution to the tiering problem?
The other requested features are similar in spirit, and most are somewhat simpler than the tiering problem we just discussed. One difference to note is that in many cases the event logging system cannot provide the right stream of filenames upon which to act. This is where an inventory database comes into play.
We want the inventory database to record, for each directory and for a set of conditions (predicates), the number of files in that directory's subtree which satisfy the condition, and the aggregate bytes they consume. With no condition at all, we record the file count and bytes in the subtree of each directory. This is already quite useful, because it gives a snapshot of the subtree quota (often called directory quota). A near-instantaneous lookup produces these numbers, which are normally obtained by a run of a utility like “du”.
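A minimal sketch of such a database, using SQLite for illustration: one row per directory, whose counters cover the entire subtree, so a subtree quota lookup is a single indexed read instead of a full “du” run. The schema is an assumption of ours, not an existing BeeGFS component.

    import sqlite3

    conn = sqlite3.connect("inventory.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS dir_summary (
            path        TEXT PRIMARY KEY,  -- directory path
            file_count  INTEGER NOT NULL,  -- files anywhere in this subtree
            total_bytes INTEGER NOT NULL   -- bytes consumed by those files
        )""")

    def subtree_usage(path):
        # Near-instantaneous replacement for `du -s`: one lookup.
        row = conn.execute(
            "SELECT file_count, total_bytes FROM dir_summary WHERE path = ?",
            (path,)).fetchone()
        return row or (0, 0)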
Another condition might state that the file must reside on the fast tier of storage. For this condition, we record the count of and bytes consumed by files on that tier in the subtree of each directory, allowing us to see how full the tier is and initiate migration from it. Alternatively, we can look at files accessed or modified before or after a certain date, or at files owned by a particular user or group. If extended attributes are used, additional user-managed predicates can be leveraged as well. In fact, tracking storage location in BeeGFS is done with extended attributes. Figure 3 illustrates how the directory summaries that we record in the inventory add up what is in their subtrees.
Figure 3: Subtree inventory database records add up file counts and bytes consumed by files in every subtree satisfying a property
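Extending the sketch above to conditions, each row could be keyed by (directory, predicate), so one table holds the unconditional totals alongside per-condition ones such as “on the fast tier” or “not accessed in 90 days”. The predicate strings are, again, illustrative.

    conn.execute("""
        CREATE TABLE IF NOT EXISTS dir_predicate_summary (
            path        TEXT NOT NULL,
            predicate   TEXT NOT NULL,   -- e.g. 'all', 'tier=fast', 'atime>90d'
            file_count  INTEGER NOT NULL,
            total_bytes INTEGER NOT NULL,
            PRIMARY KEY (path, predicate)
        )""")

    def usage(path, predicate="all"):
        row = conn.execute(
            """SELECT file_count, total_bytes FROM dir_predicate_summary
               WHERE path = ? AND predicate = ?""",
            (path, predicate)).fetchone()
        return row or (0, 0)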
But before discussing the database further, let’s see how we can implement a directory quota system and the other features mentioned above.
A possible way to enforce directory quota is to introduce two extended attributes for directories subject to it. One holds the administrator's choice of the maximum allowed quota; the other contains the consumed quota, derived from the inventory database. Enforcement can proceed, for example, by not allowing files under the directory to be opened for writing when the consumed quota exceeds the maximum (rigorous enforcement must also account for quota changes arising from already-opened files, which we omit from the discussion here). The BeeGFS client file system can enforce this in the open system call with modest overhead: it would traverse from the file being opened to the ancestor in the dentry tree that has the quota attributes. When a user encounters such a denial to open a file, and subsequently removes some files to free up space, the client could create a file system event which is consumed by a directory-quota (DQ) utility. The DQ utility updates the inventory database and resets the consumed quota, and the user can try again.
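A sketch of the open-time check, with hypothetical xattr names; real enforcement would live in the BeeGFS client kernel module rather than in user space.

    import errno
    import os

    QUOTA_MAX  = "user.beegfs.dq.max"   # set by the administrator
    QUOTA_USED = "user.beegfs.dq.used"  # refreshed from the inventory DB

    def check_dir_quota(path):
        # Walk from the file toward the root; deny the open if any
        # quota-bearing ancestor is already over its limit.
        d = os.path.dirname(os.path.abspath(path))
        while True:
            try:
                limit = int(os.getxattr(d, QUOTA_MAX))
                used = int(os.getxattr(d, QUOTA_USED))
                if used >= limit:
                    raise OSError(errno.EDQUOT, "directory quota exceeded", d)
            except OSError as e:
                if e.errno not in (errno.ENODATA, errno.ENOTSUP):
                    raise  # re-raise EDQUOT and real I/O errors
            if d == "/":
                return
            d = os.path.dirname(d)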
The data movement cases for rebalancing pools, servers, and migration to and from the cloud or HSM mentioned above are simpler. A pool-rebalancing (PR) utility will use the following logic. Periodically after the inventory database has been updated, it queries the inventory to see if too much space is consumed in a particular pool. If that is the case, it uses the inventory to find directories (using a fast logarithmic search) in which candidate files for migration can be found. Note that a nice predicate for the search could select files that reside in the pool, are of reasonably large size, and haven’t been accessed too recently.
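A sketch of the PR utility's candidate search against the predicate table above. Because every row aggregates its whole subtree, a real implementation would descend from the root toward directories that still hold enough candidate bytes, which is the logarithmic search mentioned; the predicate string and thresholds here are illustrative.

    def find_candidate_dirs(min_bytes=2**40, predicate="pool=old,atime>30d"):
        # Directories holding at least min_bytes of candidate data,
        # largest first, so the mover starts with the biggest wins.
        return conn.execute(
            """SELECT path, total_bytes FROM dir_predicate_summary
               WHERE predicate = ? AND total_bytes >= ?
               ORDER BY total_bytes DESC""",
            (predicate, min_bytes)).fetchall()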
A common scenario for pool rebalancing is after servers have been added, in which case perhaps many files have to be migrated. Here a powerful parallel file mover can spread migration jobs evenly and efficiently across a set of mover nodes.
Figure 4: Space rebalancing and fast tier management with pftool, inventory, and events
The use of an inventory database is not a new idea; it was discussed as early as the 1990s in academia, and falls in the general class of Merkle tree structures. More recently, Apple included some of these features in APFS, Los Alamos' grand unified file index (GUFI) was introduced in 2018, and perhaps the simplest inventory database was described at MSST in 2017. A key difference between the database we propose to use and an HSM database such as Robinhood is that there will only be entries for directories. There will be no database entries for individual files; this is a deliberate choice to keep the database much smaller than the file system metadata itself.
The creation of such a database requires a file system scan, and one could propose to do this just occasionally at a low priority. However, the inventory database can often be updated very efficiently. Because the database has an additive structure, one can perform an update in a small subtree and propagate the changes to the root by adding them into the ancestors. Even better, a database update can skip subtrees which have not been accessed since the previous update. Particularly for interactive situations like the one we encountered for directory quota, there is usually a quick way to restore a user's standing with respect to quota. If snapshots, and differences between snapshots, are available, updates to this database can be very efficient, as they merely need to add and subtract counts and space used based on the differences.
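The additive update is simple to sketch: fold a delta measured in one subtree into that directory and every ancestor up to the root. Subtrees that saw no change since the previous update never have to be rescanned.

    import os

    def apply_delta(conn, dirpath, d_files, d_bytes):
        # Add a (possibly negative) change to `dirpath` and all ancestors.
        d = dirpath
        while True:
            conn.execute(
                """UPDATE dir_summary
                   SET file_count = file_count + ?,
                       total_bytes = total_bytes + ?
                   WHERE path = ?""",
                (d_files, d_bytes, d))
            if d == "/":
                break
            d = os.path.dirname(d)
        conn.commit()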
The database will not be 100% accurate. While that will not lead to any data loss, it is possible to imagine unlikely but unfortunate scenarios. For example, an out-of-date database may list too few candidates to migrate out of a pool, because the pool was updated more recently than the database.
Surprisingly many features can be created without complex changes to the file system. We are excited about this possibility and curious if you see further applications or drawbacks.
Our next post will be about a very different topic, again driven by our users. We will explore how BeeOND, the much-loved BeeGFS On Demand configuration tool, can evolve further.
_________________________________________________________________________________________________________
OK, some feedback on this blog article. The first comment is on the mechanism for feedback. It is a blog, right, which implies web pages, and yet the feedback mechanism is an email list? Clunky, guys.
An inventory database is a good idea, although I think covering files as well would enable extra functionality. For instance, on a scratch filesystem, a list of files and their access times would allow easy removal of inactive files. We already do this and have a pre-built list of candidate files to remove in the event the filesystem hits a high-water mark. The list is separate from the database. Before each file in the list is removed, its atime is checked again, and the file is skipped if it is more recent than in the list. The list is worked through until the fs hits a low-water mark.
I can see some use for rebalancing across pools, but in the case of adding servers, with a high data turnover, a random policy for file placement, and targets becoming unavailable when they are too full, balance is achieved through normal use. We have a single pool, so this feature is not of much use to us. Maybe the S3 stuff will be one day.
I am more interested in bugs being fixed so the filesystem becomes more reliable. I don’t see a priority list mentioned in this blog, but would like to see fixes (or code to provide workarounds) given the highest priority. We are a victim of buddy mirroring of metadata, which is just totally unreliable. Despite a support contract, ThinkParQ have yet to fix our bug/s or provide a workable solution to our up-and-down filesystem. It must be nearly 2 years now since we first logged a call on this. I would like to think those who have support contracts get preferential treatment on new feature requests too, but I don’t see any evidence of that.
Gazing into my crystal ball, I see more flash-based storage, and in particular NVMe, in servers. A recent article claimed there is now more flash storage produced in the world than traditional HDD, a fact I find hard to believe. Assuming it is true, the need for a fast, reliable filesystem that runs on this new storage platform is a must. Buddy mirroring on the storage side is much better reliability-wise, but a rather expensive solution with any storage medium. There is a bug here too, in that atime is only set on one of the buddy mirror targets (the primary). We have a workaround for this from ThinkParQ, thankfully. I see better support for this new HW as more important than tiering. I think Intel do as well, looking at the IO500.
Greg