NFS has been around for decades as the premier networked, clustered filesystem. If you're a unix/linux user, and you're storing a lot of files, you're probably using NFS right now, especially if you need multiple hosts accessing the same data.
If you're looking for high-performance NFS, NetApp's implementation is the best in the business. A lot of NetApp's market share was built on ONTAP's unique ability to deliver fast, easy-to-manage NFS storage for Oracle database workloads. It's an especially nice solution for Oracle RAC because it's an inherently clustered filesystem. The connected hosts are just reading and writing files. The actual filesystem management lives on the storage system itself. All the NFS clients on the hosts see the same logical data.
The NFSv3 specification was published in 1995, and that's still the version almost everyone is using today. You can store a huge number of files, it's easy to configure, and it's super-fast. There really wasn't much to improve, and as a result v3 has been the dominant version for decades.
Sometimes it's just perception. NFSv4 is newer, and 'newer' is often seen as 'better'. Most customers I see who are either migrating to NFSv4 or choosing NFSv4 for a new project honestly could have used either v3 or v4 and wouldn't notice a difference between the two. There are exceptions, though. There are subtle improvements in NFSv4 that sometimes make it a much better option, especially in cloud deployments.
This post is about the key practical differences between NFSv3 and NFSv4. I'll cover security improvements, changes in networking behavior, and changes in the locking model. It's especially critical you understand the section NFSv4.1 Locks and Leases. NFSv4 is significantly different from NFSv3. If you're running an application like an Oracle database over NFSv4, you need to change your management practices if you want to avoid accidentally crashing your database.
The most confusing part about the NFSv4 specification is the existence of optional features. The NFSv3 spec was quite rigid. A given client or server either supported NFSv3 or did not support NFSv3. In contrast, the NFSv4 spec is loaded with optional features.
Most of these optional NFSv4 features are disabled by default in ONTAP because they're not commonly used by sysadmins. You probably don't need to think about them, but there are some applications on the market that specifically require certain capabilities for optimum performance. If you have one of these applications, there should be a section in the documentation covering NFS that will explain what you need from your storage system and which options should be enabled.
If you plan to enable one of the options (delegations is the most commonly used optional feature), test it first and make sure your OS's NFS client fully supports the option and it's compatible with the application you're using. Some of the advanced features can be revolutionary, but only if the OS and application make use of those features. For more information on optional features, refer to the TR referenced above.
For example - locking. NFSv3 has some basic locking capabilities, but it's essentially an honor system lock. NFSv3 locks aren't enforced by the server. NFSv3 clients can ignore locks. In contrast, NFSv4 servers, including ONTAP, must honor and enforce locks.
That opens up new opportunities for applications. For example, IBM WebSphere and Tibco offer clusterable applications where locking is important. There's nothing stopping those vendors from writing application-level logic that tracks and controls which parts of the application are using which files, but that requires work. NFSv4 can do that work too, natively, right on the storage system itself. NFSv4 servers track the state of open and locked files, which means you can build clustered applications where individual files can be exclusively locked for use by a specific process. When that process is done with the file, it can release the lock and other processes can acquire the lock. The storage system enforces the locking.
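To make the acquire/release cycle above concrete, here is a minimal sketch of how a process might take and release an exclusive lock on a shared file from Python. It assumes a Linux client, where POSIX locks requested through `fcntl.lockf` are forwarded to the NFS server; the helper names `try_exclusive_lock` and `release_lock` are made up for illustration, and how strictly the lock is enforced depends on the server and protocol version in use.

```python
import fcntl


def try_exclusive_lock(path):
    """Try to take a non-blocking exclusive lock on the whole file.

    Returns the open file object if the lock was acquired, or None if
    another process already holds a conflicting lock.
    """
    f = open(path, "a+b")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        f.close()
        return None
    return f


def release_lock(f):
    """Release the lock and close the file so other processes can acquire it."""
    fcntl.lockf(f, fcntl.LOCK_UN)
    f.close()
```

A clustered application would call `try_exclusive_lock` on a file in the shared filesystem, do its work if it gets a file object back, and then call `release_lock`. Note that POSIX locks are per-process, so two lock attempts from the same process will not conflict with each other.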
That's a cool feature, but do you need any of that? If you have an Oracle database, it's mostly just doing reads and writes of various sizes and that's all. Oracle databases already manage locking and file access synchronization internally. NetApp does a lot of performance testing with real Oracle databases, and we're not seeing any significant performance difference between NFSv3 and NFSv4. Oracle simply hasn't coded their software to make use of the advanced NFSv4 features.
With NFSv4, you have a single target port (2049) and the NFSv4 clients are required to renew leases on files and filesystems on a regular basis (more on leases below). This activity keeps the TCP session active. You can normally just open port 2049 through the firewall and NFSv4 will work reliably.
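If you want a quick sanity check that port 2049 really is open through the firewall, something like the following sketch can help. It only tests whether a TCP connection can be established; it does not speak NFS. The function name `nfs_port_reachable` and the default timeout are my own choices for illustration.

```python
import socket

NFS_PORT = 2049  # the single target port NFSv4 uses


def nfs_port_reachable(host, port=NFS_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    A refused connection and a connection attempt that times out
    (for example, because a firewall silently drops the packets)
    both return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from the NFS client against the storage system's data interface. A result of `False` after the full timeout, rather than immediately, is a hint that packets are being dropped rather than rejected.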
In contrast, NFSv3 is often impossible to run through a firewall. Among the problems experienced by customers trying to make it work is NFSv3 filesystems hanging for up to 30 minutes or more. The problem is that firewalls are almost universally configured to drop a network packet that isn't part of a known TCP session. If you have a lot of NFSv3 filesystems, one of them will probably have quiet periods where the TCP session has low activity. If your TCP session timeout limit on the firewall is set to 15 minutes, and an NFSv3 filesystem is quiet for 15 minutes, the firewall will make the TCP session stale and cease passing packets.
If the firewall rejected the packets, that would prompt the client to open a new session, but that's not how firewalls normally work. They'll silently drop the packets. You don't usually want a firewall rejecting a packet because that tells an intruder that the destination exists. Silently dropping an invalid packet is safer because it doesn't reveal anything about the other side of the firewall.
The result of silent packet drops with NFSv3 is that the client hangs while it retransmits packets over and over. Eventually it gives up and opens a fresh TCP session. The firewall registers the new TCP session and traffic resumes, but in the interim your OS might have been stalled for 5, 10, 20 minutes or more. Most firewalls can't be configured to avoid this situation. You can increase the allowable timeout for an inactive TCP session, but there has to be some kind of timeout with a fixed number of seconds.
We've had a few customers write scripts that did a repeated "stat" on an NFSv3 mountpoint in order to ensure there's enough network activity on the wire to prevent the firewall from closing the session. This is okay as a one-off hack, but it's not something I'd want to rely on for anything mission-critical and it doesn't scale well.
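For reference, a keepalive hack of that kind might look roughly like the sketch below. This is not any customer's actual script; the function names and the `iterations` parameter are invented for illustration. It also assumes that each `stat` reaches the server, which isn't guaranteed: the NFS client can answer a `stat` from its attribute cache without putting anything on the wire, which is one more reason not to rely on this approach.

```python
import os
import time


def touch_mountpoints(mountpoints):
    """Stat each mountpoint once.

    Returns a dict mapping each path to True if the stat succeeded
    and False if it raised an error.
    """
    results = {}
    for path in mountpoints:
        try:
            os.stat(path)
            results[path] = True
        except OSError:
            results[path] = False
    return results


def keepalive_loop(mountpoints, interval_seconds, iterations=None):
    """Stat every mountpoint, sleep, and repeat.

    Runs forever when iterations is None; otherwise stops after that
    many passes and returns the number of passes completed.
    """
    completed = 0
    while iterations is None or completed < iterations:
        touch_mountpoints(mountpoints)
        completed += 1
        if iterations is None or completed < iterations:
            time.sleep(interval_seconds)
    return completed
```

The interval has to be comfortably shorter than the firewall's idle timeout. With the 15-minute timeout from the example above, something like `keepalive_loop(["/mnt/oradata"], 300)` would be a plausible starting point, where `/mnt/oradata` is a hypothetical mountpoint.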
NFSv4 is inherently more secure than NFSv3. For example, NFSv4 security is normally based on usernames, not user IDs. The result is that it's more difficult for an intruder to spoof credentials to gain access to data on an NFSv4 server. You can also easily tell which clients are actively using an NFSv4 filesystem. It's often impossible to know for sure with NFSv3. You might know a certain client mounted a filesystem at some point in the past, but are they still using the files? Is the filesystem still mounted now? You can't know for sure with NFSv3.
In a nutshell, basic krb5 security means better, more secure authentication for NFS access. It's not encryption per se, but it uses an encrypted process to ensure that whoever is accessing an NFS resource is who they claim to be. Think of it as a secure login process where the NFS client authenticates to the NFS server.
If you use krb5i, you add a validation layer to the payload of the NFS conversation. If a malicious middleman gained access to the network layer and tried to modify the data in transit, krb5i would detect and stop it. The intruder may be able to read data from the conversation, but they won't be able to intercept and tamper with the data.
In the field, few administrators use these options for a simple reason - what are the odds a malicious intruder is going to gain access to the data center and start snooping on IP packets on the wire? If someone were able to do that, they'd probably be able to get actual login credentials to the database server itself. They'd then be able to freely access data as an actual user.
With increased interest in cloud, some customers are demanding that all data on the wire be encrypted, no exceptions, ever, and they're demanding krb5p. They don't necessarily use it across all NFS filesystems, but they want the option to turn it on. This is also an example of how NFSv4 security is superior to NFSv3. While some of NFSv3 could be krb5p encrypted, not all NFSv3 functions could be "kerberized". NFSv4, however, can be 100% encrypted.
NFSv4 with krb5p is still not generally used because the encryption/decryption work has overhead. Latency will increase and maximum throughput will drop. Most databases would not be affected to the point users would notice a difference, but it depends on the IO load and latency sensitivity. Users of a very active database would probably experience a noticeable performance hit with full krb5p encryption. That's a lot of CPU work for both the OS and the storage system. CPU cycles are not free.
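On a Linux client, the security level is normally selected per mount with the `sec=` mount option, so you can turn krb5p on for one filesystem and leave others alone. The sketch below just assembles the mount command for a given version and security flavor; it doesn't run anything. The helper names, the export `filer01:/oradata`, and the mountpoint are hypothetical, and a working Kerberos setup on both client and storage system is assumed.

```python
# Security flavors discussed in this post, plus the default "sys".
VALID_SEC = ("sys", "krb5", "krb5i", "krb5p")


def nfs_mount_options(version, sec="sys"):
    """Build the -o option string for an NFS mount.

    version is a string such as "3" or "4.1"; sec is one of VALID_SEC.
    """
    if sec not in VALID_SEC:
        raise ValueError(f"unknown security flavor: {sec}")
    return f"vers={version},sec={sec}"


def mount_command(server_export, mountpoint, version, sec="sys"):
    """Return the mount command as an argument list (not executed)."""
    options = nfs_mount_options(version, sec)
    return ["mount", "-t", "nfs", "-o", options, server_export, mountpoint]


if __name__ == "__main__":
    # Hypothetical export and mountpoint, for illustration only.
    print(" ".join(mount_command("filer01:/oradata", "/mnt/oradata", "4.1", "krb5p")))
```

Because the flavor is chosen per mount, you can test the performance impact of krb5p on a single filesystem before deciding how widely to use it.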
If you're genuinely concerned about network traffic being intercepted and decoded in-transit, I would recommend looking at all available options. Yes, you could turn on krb5p, but you could also isolate certain NFS traffic to a dedicated switch. Many switches support private VLANs where individual network ports can communicate with the storage system, but all other port-to-port traffic is blocked. An outside intruder wouldn't be able to intercept network traffic because there would be no other ports on the logical network. It's just the client and the server. This option mitigates the risk of an intruder intercepting traffic without imposing a performance overhead.