It is designed to meet the needs of flexibility, feature extendability, and user-payload friendliness. Beyond that, it remains a simple, random-access-friendly, high-performance filesystem that avoids the unneeded I/O amplification and memory-resident overhead of similar approaches.
It aims to minimize extra storage space while guaranteeing end-to-end performance by using a compact layout, transparent file compression, and direct access, especially for embedded devices with limited memory and high-density hosts with numerous containers.
Transparent data compression is supported as an option: the LZ4, MicroLZMA, and DEFLATE algorithms can be used on a per-file basis. In addition, in-place decompression is also supported to avoid bounce buffers for compressed data and unnecessary page cache thrashing.
The following git tree provides the file system user-space tools under development, such as a formatting tool (mkfs.erofs), an on-disk consistency & compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs):
The size of the optional xattrs is indicated by i_xattr_count in the inode header. Large xattrs, or xattrs shared by many different files, can be stored in the shared xattr metadata rather than inlined right after the inode.
All directories are now organized in a compact on-disk format. Note that each directory block is divided into an index area and a name area in order to support random file lookup, and all directory entries are _strictly_ recorded in alphabetical order in order to support an improved prefix binary search algorithm (refer to the related source code).
In order to support chunk-based data deduplication, a new inode data layout has been supported since Linux v5.15: files are split into equal-sized data chunks, with the extents area of the inode metadata indicating how to get the chunk data. These can be as simple as a 4-byte block address array, or take the 8-byte chunk index form (see struct erofs_inode_chunk_index in erofs_fs.h for more details).
There are use cases where extended attributes with different values can have only a few common prefixes (such as overlayfs xattrs). The predefined prefixes work inefficiently in both image size and runtime performance in such cases.
When referring to a long xattr name prefix, the highest bit (bit 7) of erofs_xattr_entry.e_name_index is set, while the lower bits (bits 0-6) as a whole represent the index of the referred long name prefix among all long name prefixes. Therefore, only the trailing part of the name apart from the long xattr name prefix is stored in erofs_xattr_entry.e_name, which could be empty if the full xattr name matches its long xattr name prefix exactly.
All long xattr prefixes are stored one by one in the packed inode as long as the packed inode is valid, or in the meta inode otherwise. The xattr_prefix_count field of the on-disk superblock indicates the total number of long xattr name prefixes, while (xattr_prefix_start * 4) indicates the start offset of long name prefixes in the packed/meta inode. Note that long extended attribute name prefixes are disabled if xattr_prefix_count is 0.
EROFS implements fixed-sized output compression, which generates fixed-sized compressed data blocks from variable-sized input, in contrast to other existing fixed-sized input solutions. Relatively higher compression ratios can be achieved with fixed-sized output compression since popular data compression algorithms today are mostly LZ77-based, and such a fixed-sized output approach can benefit from the historical dictionary (aka the sliding window).
In detail, original (uncompressed) data is turned into several variable-sized extents and, meanwhile, compressed into physical clusters (pclusters). In order to record each variable-sized extent, logical clusters (lclusters) are introduced as the basic unit of compress indexes to indicate whether a new extent is generated within the range (HEAD) or not (NONHEAD). Lclusters are currently fixed at block size, as illustrated below:
A physical cluster can be seen as a container of physical compressed blocks which contains compressed data. Previously, only lcluster-sized (4KB) pclusters were supported. After the big pcluster feature was introduced (available since Linux v5.13), a pcluster can be a multiple of the lcluster size.
Containerization has been a trend in the DevOps industry in recent years. Through containerization, we can create a fully packaged and self-contained computing environment, which enables software developers to create and deploy their applications quickly. However, for a long time, due to limitations of the image format, loading a container startup image has been slow. For more details on this issue, please refer to the article "Container Technology: Container Image". To accelerate container startup, we can combine optimized container images with technologies such as P2P networks, which reduces the startup time of container deployment and ensures continuous and stable operation. The article "Dragonfly Releases the Nydus Container Image Acceleration Service" discusses this topic.
In addition to startup speed, core features such as image layering, deduplication, compression, and on-demand loading are important in the container image field. However, since there was no native file system support, most solutions chose a userspace approach, as Nydus initially did. As solutions and requirements evolved, userspace approaches faced growing challenges, such as a large performance gap compared with native file systems and high resource overhead in high-density deployment scenarios. The main reason is that image format parsing and on-demand loading in userspace incur a great deal of kernel/userspace communication overhead.
Facing this challenge, the OpenAnolis community made an attempt: we designed and implemented the RAFS v6 format, compatible with the kernel-native EROFS file system, expecting to move the container image scheme into the kernel. We also worked to push this scheme into the mainline kernel so that more people could benefit from it. Finally, with our continuous improvement, EROFS over fscache on-demand loading technology was merged into the 5.19 kernel (see the link at the end of the article), and the next-generation container image distribution scheme of the Nydus image service became clear. This is the first natively supported, out-of-the-box container image distribution solution in the Linux mainline kernel, giving container images high density, high performance, high availability, and ease of use.
This article will introduce the evolution of this technology, covering a review of the Nydus architecture, the RAFS v6 image format, and EROFS over fscache on-demand loading, and will show the excellent performance of the current solution through benchmark comparisons. We hope you can enjoy the quick container startup experience as soon as possible!
Nydus is an image acceleration service that optimizes the existing OCIv1 container image architecture, designs the RAFS (Registry Acceleration File System) disk format, and finally presents the container image in a file system format.
The basic requirement of a container image is to provide the container root directory (rootfs), which can be carried in a file system or archive format. It can also be implemented with a custom block format, but in any case it needs to be presented as a directory tree, providing the file interface to containers.
Let's take a look at the OCIv1 standard image first. The OCIv1 format is an image format specification based on the Docker Image Manifest Version 2 Schema 2 format. It consists of a manifest, an image index (optional), a series of container image layers, and configuration files; for details, please refer to the relevant documents. In essence, an OCI image is a layer-based image format, with each layer storing file-level diff data in tgz archive format, as follows.
Nydus is a file-system-based container image storage solution. The data (blobs) and metadata (bootstrap) of the container image file system are separated, so the original image layers store only the data part of the files. Files are divided at chunk granularity, and the blob in each layer stores the corresponding chunk data. Adopting chunk granularity refines the granularity of deduplication: chunk-level deduplication makes it easier to share data between layers and between images, and also makes on-demand loading easier. Since the metadata is separated out into a single place, accessing it does not require pulling the corresponding blob data; the amount of data to be pulled is much smaller and the I/O efficiency is higher. The following figure shows the Nydus RAFS image format.
Before the introduction of the RAFS v6 format, Nydus used a fully userspace-implemented image format that provided services through the FUSE or virtiofs interface. However, the userspace file system approach has the following design drawbacks.
These problems are inherent limitations of userspace file system solutions. If the implementation of the container image format is moved into the kernel, they can be solved in principle. Therefore, we introduced the RAFS v6 image format, a container image format based on the in-kernel EROFS file system.
The EROFS file system has existed in the Linux mainline since the Linux 4.19 kernel. In the past, it was mainly used on embedded and mobile devices, and it is available in popular distributions (such as Fedora, Ubuntu, Arch Linux, Debian, and Gentoo). The userspace tool erofs-utils also exists in these distributions and in the OIN Linux System definition lists, and the community is active.
In the past year, the Alibaba Cloud kernel team made a series of improvements to the EROFS file system, expanding its usage scenarios to cloud native and adapting it to the requirements of container image storage systems. The result is the RAFS v6 container image format, implemented in the kernel. In addition to moving the image format into the kernel, RAFS v6 also optimizes the image format itself, with block alignment and streamlined metadata, for example.