The Viva File System

Eric H. Herrin II
Raphael A. Finkel

Technical Report Number 225-93
915 Patterson Office Tower
Department of Computer Science
University of Kentucky
Lexington, Kentucky 40506
[email protected], [email protected]

Abstract

This paper describes the Viva File System (VIFS), a technique for high-performance file allocation on disks. VIFS uses bitmaps to represent both free blocks on the disk and allocated blocks in each file. Allocation bitmaps provide a very fast method of finding blocks close to the last allocated block in a file. Fragments (partial blocks) are used to store the overflow from the last file block; the minimum size of a fragment is chosen when the file system is initialized. Conventional file systems, such as the Berkeley Fast File System (FFS), can store a file containing 96KB of data without using indirect blocks and around 16MB of data for each indirect block. With the same block size, VIFS can store up to about 10MB of data without using indirect blocks and up to 500MB of data per indirect block. The design of VIFS allows some previously synchronous operations to occur asynchronously, resulting in significant speed improvements over FFS. VIFS provides multiple read-ahead to maintain its high speed when several processes are competing for disk accesses. This paper provides experimental measurements taken from an implementation of Viva in a BSD 4.3 kernel. The measurements show that VIFS is significantly faster than FFS for nearly all file system operations.

1 Introduction

Disk access speeds have become the limiting factor in computing systems today. CPU speeds continue to grow quickly, while disk speeds grow much more slowly. Due to this discrepancy in CPU versus disk speed, the throughput of a file system has become increasingly important. Some file systems, such as the Berkeley Fast File System (FFS) [3], use cylinder groups to increase the locality of a file's data blocks. FFS provides fragments for storing the part of the file that does not fit into an integral number of blocks, allowing the use of larger data blocks without sacrificing efficient data storage.

The Viva File System (VIFS) treats the entire disk partition as a single cylinder group and uses allocation bitmaps to provide fast allocation of data blocks. VIFS uses fragments, but reduces the number of fragments needed by storing a fixed amount of data in each inode. (In Unix jargon, the inode is the file descriptor stored on the disk that contains information on permissions, ownership, file statistics, and file layout.) Since reading an inode from disk also reads some data, accesses to small files are very fast. Locating free data blocks, fragments, and inodes is fast since each of these items is associated with a bitmap and free-item counts.

The notion of using bitmaps to keep track of used data blocks is not new; file systems such as FFS use bitmaps to record used blocks within cylinder groups. However, these other systems do not use bitmaps to record blocks used by individual files, probably because a full bitmap for each file would be wasteful, and file blocks would have to be allocated in ascending order. The sparse bitmaps used by VIFS allow compression of the bitmaps and arbitrary ordering of blocks within a file. Each inode contains only sections of the full bitmap relevant to a particular file.

We have constructed a kernel-resident VIFS prototype under a variant of BSD 4.3 Unix. The BSD 4.3 VFS (Virtual File System) interface is used, so VIFS and FFS may be used simultaneously (in different disk partitions). Our benchmarks show that VIFS is significantly faster than FFS for nearly all file operations. Although FFS performance can be improved by increasing the block size, VIFS is still faster and wastes less space.

We describe the design of VIFS (Section 2), detail its prototype implementation (Section 3), then discuss benchmark results (Section 4). We close by comparing VIFS to related work (Section 5), discussing future work (Section 6), and drawing some conclusions from our experience with VIFS (Section 7).

2 Design of the Viva File System

The initial design of the Viva file system was inspired by the following beliefs:

- Compressing disk block addresses in the inode (from 32 bits to one, if possible) and enlarging the size of the inode will largely eliminate the need for indirect blocks on the average file system.

- The average size of a file has increased significantly over the years [1, 8, 12].

- Keeping inodes and data close together only benefits small files and synchronous creates/deletes.

- Optimizing placement of disk blocks based upon the geometry and speed of the physical disk does not work well when the disk controller provides a logical abstraction of the physical disk.

The data structures used by VIFS can provide compression of the disk block addresses kept in the inode anywhere from a factor of 1 to almost 32 (taking into consideration the extra overhead needed by the sparse bitmaps, the actual maximum factor is approximately 31.5). Compression saves significant space without sacrificing speed (assuming a reasonably fast processor). Fewer disk reads are necessary, on average, because fewer files require indirect blocks. However, some restraint must be observed in allocating inodes for a new file system. Since the size of a VIFS disk inode is usually at least twice that of an FFS disk inode (VIFS inode size is set at file-system creation time), more kernel memory is used to store inodes, and more disk space is consumed by inodes.

The remainder of this section summarizes the structures and algorithms used by VIFS. We discuss the layout of the file system on the physical disk, various data structures, allocation policies, and some novel features of VIFS.

2.1 Disk layout

The physical layout of a Viva file system is portrayed in Figure 1. We have reverted to the BSD 4.0 method of keeping inode and data blocks in separate areas of the partition. File creation and deletion are not hindered by this placement, since we do not generally write inodes and directory blocks synchronously. Since file contents within the inode are read at the same time the inode is read, no extra seeks are necessary to read small files. The superblock contains global information about the file system and is not replicated, since it can easily be regenerated. Three primary bitmaps represent the allocation state of data blocks, inodes, and fragments (generically called items).

[Figure 1: Disk layout of the Viva File System (superblock, primary bitmaps, inodes, then data blocks and fragments).]

VIFS inodes are similar to FFS inodes except for how they represent file layout. Instead of keeping a dozen 32-bit disk addresses, VIFS keeps a tunable amount of space for bitmaps that record disk block allocation. Inodes also contain a small fixed number of pointers to indirect blocks.

2.2 Allocation bitmaps

Allocation bitmaps allow VIFS to find contiguous free items quickly if any are available. They are composed of the primary bitmap and derived information. A primary bitmap denotes the state of all items (1 for busy, 0 for free). All three primary bitmaps are stored in the bitmap area of the disk partition, and a copy is kept in memory.

Each primary bitmap is logically partitioned into segments for the purpose of deriving information that makes allocation fast. The length of the segments in a primary bitmap is the square root of its total size (in bytes). For example, a 4GB file system with 16KB blocks has 2^18 blocks, requiring 2^15 bytes of bitmap, so the primary data block bitmap is composed of ⌈√(2^15)⌉ = 182 segments, each of 182 bytes (except the last segment). (Our implementation is limited to 512 segments per primary bitmap and 2^15 bytes of primary bitmap for data blocks.) For each segment, VIFS records the number of contiguous 0 bits occupying 1-, 2-, 4-, and 8-bit regions in the primary bitmap that are aligned to 1-, 2-, 4-, and 8-bit slots, respectively. (A compile-time constant allows the maximum region to be modified; for example, one might also wish to record 16-bit contiguous regions.) These counts are kept in memory and in the superblock. The memory version is modified during allocation and deallocation, and the superblock version can be easily regenerated after a crash. Figure 2 shows the derived information for a sample 3-byte segment. Figure 3 shows the structure of an allocation bitmap.
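To make the derived information concrete, the following is a minimal sketch (not taken from the VIFS sources; the function names and the bit convention are our own assumptions) that computes, for one segment of a primary bitmap, the counts of free, width-aligned 1-, 2-, 4-, and 8-bit regions described above.

/*
 * Sketch: derive, for one segment of a primary bitmap, the number of
 * all-free regions of width 1, 2, 4, and 8 bits that are aligned to their
 * own width.  Assumed convention: bit i of byte b is (b >> i) & 1, 0 = free.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

struct seg_info {
    unsigned free1, free2, free4, free8;   /* aligned free regions per width */
};

static struct seg_info derive_segment(const uint8_t *seg, size_t seg_bytes)
{
    struct seg_info info = {0, 0, 0, 0};
    for (size_t byte = 0; byte < seg_bytes; byte++) {
        uint8_t busy = seg[byte];                 /* 1 bits are busy */
        if (busy == 0)
            info.free8++;                         /* whole byte is a free 8-bit slot */
        for (int slot = 0; slot < 8; slot += 4)   /* aligned 4-bit slots */
            if (((busy >> slot) & 0x0f) == 0)
                info.free4++;
        for (int slot = 0; slot < 8; slot += 2)   /* aligned 2-bit slots */
            if (((busy >> slot) & 0x03) == 0)
                info.free2++;
        for (int slot = 0; slot < 8; slot++)      /* individual free bits */
            if (((busy >> slot) & 0x01) == 0)
                info.free1++;
    }
    return info;
}

int main(void)
{
    /* A small 3-byte segment, in the spirit of Figure 2. */
    const uint8_t seg[3] = { 0x00, 0xa4, 0x61 };
    struct seg_info si = derive_segment(seg, sizeof seg);
    printf("8-bit %u, 4-bit %u, 2-bit %u, 1-bit %u\n",
           si.free8, si.free4, si.free2, si.free1);
    return 0;
}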

[Figure 2: Derived information for a segment of primary bitmap. The example uses a 3-byte piece of an allocation bitmap and yields 19 free 1-bit regions, 8 free 2-bit regions, 3 free 4-bit regions, and 1 free 8-bit region; the third byte does not contain a free 4-bit boundary, even though it has 4 contiguous free bits.]

[Figure 3: The allocation bitmap. For each of the m segments of the n-byte primary bitmap, the derived segment information records the number of free regions at the 8-, 4-, 2-, and 1-bit boundaries.]

2.3 Sparse bitmaps

In order to represent an object composed of particular items, we could use a full bitmap, which would be like a primary bitmap restricted to the items used by this object. However, a full bitmap would be very wasteful of space (it would generally be almost empty), and the object would have to be built out of the items in strictly increasing item order. Instead, we use a sparse bitmap to represent objects. It can be treated as a concise, ordered view of the full bitmap, which we never actually build. The conciseness depends on the locality of bits in the full bitmap; we get up to approximately 32-to-1 savings over storing block numbers in 32-bit integers.

Figure 4 shows the basic structure of a sparse bitmap and its relationship to the allocation bitmap for the item. The sparse bitmap is composed of entries, each of which describes one or more allocated items. Each entry is composed of a byte offset into the full bitmap, an entry length (in bytes), and a bitmap of up to 2048 items at that offset (2048 is an arbitrary implementation limit). The order of the items comprising the object is taken from the order of the entries; within an entry, items are in full-bitmap order. An object composed of a sorted, tightly clustered set of items may be described by a sparse bitmap with one entry. Our allocator tries to allocate items in such a fashion. Larger objects and those composed of poorly clustered items require more entries in the sparse bitmap.

An item with a particular index within the object can be quickly found by linear search through the sparse bitmap. For example, item number 24 can be found by summing the number of bits allocated by each entry (using table lookup to convert a byte of sparse bitmap into a count of bits set) until the entry containing the 24th allocated bit is found. Even though this linear scan is often fast enough, we divide the entries in a sparse bitmap into groups of entries (three groups for an inode, many more for an indirect block) and record the number of bits set in each group. Each group holds approximately the same number of bytes of bitmap. This indexing information can be scanned (again, linearly) to reduce the amount of work needed to find a particular item.
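The following is a minimal sketch of that lookup, assuming a hypothetical in-memory entry layout (the struct names and fields are ours, not the on-disk VIFS format): each entry carries a byte offset into the allocation bitmap, a length in bytes, and the corresponding bitmap bytes, and the Nth item of the object is found by summing per-byte popcounts until the containing byte is reached.

/*
 * Sketch: resolving the Nth allocated item of an object through its sparse
 * bitmap.  The entry layout is a simplification of the structure shown in
 * Figure 4; the real format packs entries and group indexes into the inode.
 */
#include <stdint.h>
#include <stddef.h>

struct sparse_entry {
    uint32_t byte_offset;   /* offset into the full (allocation) bitmap */
    uint16_t nbytes;        /* length of this entry's bitmap */
    const uint8_t *bits;    /* bitmap bytes; set bits belong to the object */
};

/* Count of set bits per byte, the "table lookup" mentioned in the text. */
static uint8_t popcount8(uint8_t b)
{
    static const uint8_t nibble[16] = { 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4 };
    return nibble[b & 0x0f] + nibble[b >> 4];
}

/*
 * Return the item (bit) number in the allocation bitmap that is the n-th
 * (0-based) item of the object, or -1 if the object has fewer items.
 * Bit ordering within a byte (LSB first) is an assumption.
 */
static long sparse_nth_item(const struct sparse_entry *e, size_t nentries,
                            unsigned long n)
{
    for (size_t i = 0; i < nentries; i++) {
        for (uint16_t byte = 0; byte < e[i].nbytes; byte++) {
            uint8_t b = e[i].bits[byte];
            uint8_t cnt = popcount8(b);
            if (n >= cnt) {                       /* n-th item lies beyond this byte */
                n -= cnt;
                continue;
            }
            for (int bit = 0; bit < 8; bit++) {   /* find the n-th set bit here */
                if (b & (1u << bit)) {
                    if (n-- == 0)
                        return ((long)e[i].byte_offset + byte) * 8 + bit;
                }
            }
        }
    }
    return -1;
}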

[Figure 4: Sparse bitmap. Each entry holds a byte offset into the allocation bitmap, a length in bytes, and the corresponding bits set in the allocation bitmap; groups of entries carry indexing information (the number of bits set in the group and the byte offset of the group within the sparse bitmap) for speed.]

2.4 Inode design

VIFS inodes are similar to traditional FFS inodes with two exceptions. First, VIFS inodes replace 32-bit data-block addresses with a sparse bitmap of data-block addresses. The sparse bitmap can map a large number of data blocks, thus reducing the number of indirect blocks required for large files. FFS requires indirect blocks for files over 192KB (assuming 16KB blocks). In contrast, VIFS with the same block size can store up to 20MB before needing an indirect block.

Second, some file contents may be stored with the inode, reducing the number of reads required to access data in small files. In particular, entries in small directories are acquired when the directory's inode is read.

Both the inode room allocated to the sparse bitmap and the amount of file contents stored in the inode are tunable parameters for each file system. However, we suggest restrictions on the total size of an inode:

- The file-system block size should be evenly divisible by the size of an inode, where the size of an inode includes both the sparse bitmap and file contents. If this advice is not followed, a small amount of space in each block of inodes is unusable.

- The minimum size of an inode should be 256 bytes, with zero bytes of file contents.

Each inode can store a small amount of file data, as mentioned earlier. The file contents stored in an inode may be taken from the beginning or end of the file. The default is a configuration parameter, but individual files may store the file contents from either location (the application selects the location with an ioctl call made directly after creat). Inodes describing directories always keep data from the beginning of the file to avoid copying directory entries, since directories tend to grow. For data files, each approach has advantages:

- Storing file contents from the end of the file: Programs that use knowledge of file system block size to perform internal buffering (such as the stdio library) work as efficiently as before.

- Storing file contents from the beginning of the file: Programs that need to read only the beginnings of files (such as some news or mail filters) are significantly sped up.

Test system summary
  Number of bytes                 689,198,205
  Number of files                 41742
  Number of files over 96K        1273
  Number of symlinks              1038
  Number of 512-byte directories  2628

Amount of overhead space in test system
  Type of overhead                             VIFS              FFS
  Indirect blocks                              0 bytes           10,428,416 bytes
  Unused bytes in data blocks and fragments    10,156,669 bytes  23,040,899 bytes
  Size of used inodes                          10,685,952 bytes  5,342,976 bytes
  Extra space used by unused inodes over FFS   2,137,600 bytes   0 bytes
  Total overhead                               22,980,221 bytes  38,812,291 bytes

Table 1: Comparison of FFS and VIFS overhead on a particular system

Files that are smaller than the amount of contents stored in the inode perform identically in either case.

We compared overhead space in VIFS and FFS for a file system with 1.2GB of disk space and 8KB blocks (later we increased the minimum block size in VIFS to 16KB). The file system contained a standard BSD 4.3 developer's environment, with mostly user files, source code, and executables. Table 1 shows that VIFS requires approximately 40% less overhead space than FFS, even though the size of an inode is twice as large. Part of this improvement is due to the fact that we are using 512-byte fragments, which save approximately 50% over using 1024-byte fragments (FFS allows at most eight fragments per block). The rest of the improvement is due to the fact that no indirect blocks are needed. (This measurement was taken on a file system that had been in use for a while, so VIFS did not have the advantage of perfect clustering for all files.)

Since the amount of space reserved for an inode and its included file contents is six times larger than that of an FFS inode (in our example), more thought must be given to the number of inodes that are allocated in a particular file system. We have found that we usually overallocate inodes in our FFS file systems by 50 to 75 percent.

[Figure 5: Fragment representation. The fragment allocation bitmap (112 KB; one bit per basic fragment), the fragment sparse bitmap (16 KB; one bit per block in the fragment file), and the block allocation bitmap (16 KB; one bit per data block on the disk). Sizes shown reflect the default parameters for a 2 GB VIFS file system with 16 KB block size, 512-byte fragment size, 20% maximum fragment blocks, and a minimum compression ratio of 8.]

2.5 Fragments

VIFS fragments may be set (on a per-filesystem basis) to 2^f bytes for any 9 ≤ f < n, where the data block size is 2^n bytes. Fragments allow us to use large block sizes (which is good for throughput) without the associated overhead space. Smaller fragment sizes reduce the amount of unused data space but require a larger fragment-space primary bitmap.

Instead of storing all fragments in a reserved region (as was our initial intent), we allow any data block to be used for fragments; such a block is called a fragment block. The set of all fragment blocks is a pseudo-file called the fragment file. The data blocks comprising the fragment file are represented by the fragment sparse bitmap. The allocation state of the fragments in the fragment file is denoted by the fragment allocation bitmap (see Figure 5). The fragment allocation bitmap has one bit for every fragment, whereas the fragment sparse bitmap has one bit for every block devoted to fragments. The sizes of both the fragment allocation bitmap and the fragment sparse bitmap are determined at file system creation time from the given block size, fragment size, minimum compression ratio (1-32), and the percentage of fragments allowed. Therefore, the level of compression in the sparse bitmap is important. For this reason, we allocate fragment blocks in large contiguous chunks (typically 32 blocks at a time). When a contiguous set of fragments is needed that does not exist in the current fragment allocation bitmap, a new chunk is allocated. The size of a chunk is tunable at file system creation time.
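As a rough illustration of how these sizes relate to the creation parameters, here is a hedged sketch (our own arithmetic, with assumed rounding). It prints raw lower bounds only: it does not include the sparse-bitmap entry overhead or the provisioning implied by the minimum compression ratio, so its outputs will not exactly match the sizes quoted in Figure 5.

/*
 * Sketch: rough sizing of the block allocation bitmap, fragment allocation
 * bitmap, and fragment sparse bitmap from the creation parameters.  The
 * rounding and the omission of entry overhead are assumptions.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t partition_bytes = 2ULL << 30;   /* 2 GB partition                 */
    uint64_t block_size      = 16 * 1024;    /* 16 KB blocks                   */
    uint64_t frag_size       = 512;          /* 512-byte fragments             */
    unsigned max_frag_pct    = 20;           /* at most 20% fragment blocks    */

    uint64_t nblocks     = partition_bytes / block_size;
    uint64_t frag_blocks = nblocks * max_frag_pct / 100;     /* blocks usable for fragments */
    uint64_t nfrags      = frag_blocks * (block_size / frag_size);

    /* One bit per item, rounded up to whole bytes. */
    uint64_t block_bm_bytes       = (nblocks + 7) / 8;
    uint64_t frag_alloc_bm_bytes  = (nfrags + 7) / 8;
    uint64_t frag_sparse_bm_bytes = (frag_blocks + 7) / 8;   /* one bit per fragment block */

    printf("data blocks: %llu, block allocation bitmap: %llu bytes\n",
           (unsigned long long)nblocks, (unsigned long long)block_bm_bytes);
    printf("fragment blocks: %llu, fragments: %llu, fragment allocation bitmap: %llu bytes\n",
           (unsigned long long)frag_blocks, (unsigned long long)nfrags,
           (unsigned long long)frag_alloc_bm_bytes);
    printf("fragment sparse bitmap (raw bits only): %llu bytes\n",
           (unsigned long long)frag_sparse_bm_bytes);
    return 0;
}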

2.6 Allocation policies

VIFS manages several dynamically allocated data structures. Inodes, free data blocks, and fragments must each be allocated in an efficient manner that guarantees the best placement available. For inodes, best placement means that inodes referenced by a directory should be close together; for data blocks it means that blocks in a single file should be close together. VIFS uses the following algorithm to determine the best placement:

INPUT:  The allocation bitmap.
        A goal indicating the desired placement.
        Desired bit boundary (e.g. 1-, 2-, 4-, or 8-bit free boundary).
OUTPUT: The bit number allocated in the allocation bitmap.

If there exists a free bit within the next 32 bits of the goal
Then
    Allocate that free bit.
Else
    Find a free bit based on the desired bit boundary.
    If such a free bit exists
    Then
        Allocate that free bit.
    Else
        Search through the lower bit boundaries until either a bit is
        found or there are no remaining free bits.
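A compact sketch of this placement search is shown below. It works directly on a raw primary bitmap and omits the per-segment derived counts that the real allocator consults first, so it illustrates the policy rather than the VIFS routine; the bit convention and function names are our own assumptions.

/*
 * Sketch of the best-placement search: prefer a free bit within 32 bits of
 * the goal, otherwise look for a width-aligned run of free bits, falling
 * back to smaller widths.  Bit i is bit (i % 8) of byte i/8; 0 = free.
 * `width` is assumed to be a power of two between 1 and 8.
 */
#include <stdint.h>
#include <stddef.h>

static int bit_is_free(const uint8_t *bm, size_t bit)
{
    return ((bm[bit / 8] >> (bit % 8)) & 1) == 0;
}

static int run_is_free(const uint8_t *bm, size_t start, unsigned width)
{
    for (unsigned i = 0; i < width; i++)
        if (!bit_is_free(bm, start + i))
            return 0;
    return 1;
}

/* Returns the allocated bit number, or -1 if the bitmap is completely full. */
static long viva_place(uint8_t *bm, size_t nbits, size_t goal, unsigned width)
{
    /* 1. Try the 32 bits following the goal. */
    for (size_t bit = goal; bit < goal + 32 && bit < nbits; bit++) {
        if (bit_is_free(bm, bit)) {
            bm[bit / 8] |= 1u << (bit % 8);
            return (long)bit;
        }
    }

    /* 2. Look for an aligned free run of the desired width, then narrower. */
    for (unsigned w = width; w >= 1; w /= 2) {
        for (size_t start = 0; start + w <= nbits; start += w) {
            if (run_is_free(bm, start, w)) {
                bm[start / 8] |= 1u << (start % 8);   /* allocate first bit of the run */
                return (long)start;
            }
        }
    }
    return -1;      /* no free bits remain */
}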

This algorithm allows locality of allocation in two important ways. First, VIFS uses it to cluster inodes within a directory and data blocks within a file. Next, we opt to begin a new entry in the sparse bitmap whenever more than 4 bytes are required to expand the last entry (in the current implementation, each entry requires at least 4 bytes). This allows the allocation algorithm to find a bit with neighboring free bits for future file expansion. (There is no guarantee these free bits will be free later, but they are lower priority for allocation in another file, since one bit is occupied by this file. We could, however, allocate multiple blocks at once in expectation of future requests. Unused blocks could then be freed upon the last close of the file [9] at the same time the last block in the file is moved into a fragment.)

The placement algorithm bases its choice upon the desired number of contiguous free bits. When a file is created, the block allocator assumes that the file will be small and requests a single bit. As the file grows, the number of contiguous free bits requested for each new entry grows exponentially up to the maximum available. This strategy tends to cluster small files together, while also clustering together the blocks of a large file.

Unlike FFS, VIFS does not use the geometry of the physical disk to provide optimal placement. We rely on the device driver to provide a linear block address space, and we assume that blocks with close addresses are somewhat close together on the disk. Disk technology is changing, and cylinder groups are not always the most efficient method of subdividing a disk partition. SCSI notched disks have a varying number of sectors per cylinder, disk striping makes cylinder groups irrelevant, and future non-platter storage will not have physical cylinders at all. Our benchmarks in Section 4.1 show that there is no performance degradation resulting from our simple algorithm.

2.7 Additional file system features

The Viva file system contains features that can increase performance. First, create and delete operations have been changed to allow asynchronous operation. These changes produce an order of magnitude increase in speed for file creation and deletion (depending upon CPU speed). Second, VIFS introduces circular files, which overwrite the start of a file when the established size is exceeded.

2.7.1 Asynchronous operation

FFS synchronously writes inodes and directories when creating or deleting a file. That is, the directory is not written until the write to the inode is complete, and file creation or deletion is not considered done until the write to the directory is complete. The remainder of this section explains the conditions that require FFS inodes and directories to be written synchronously and how we avoid those conditions.

When a file is created, if its directory entry is written first and the machine crashes before the inode is written, the directory entry will refer to an improper inode. To prevent this problem, FFS synchronously writes the inode and then synchronously writes the directory block. The inode bitmap is written asynchronously at a later time, and crash recovery can assume that the file has been allocated. VIFS asynchronously writes these structures, but ensures only that the inode bitmap is written after the inode is written. Crash recovery can discover an unreferenced inode or referenced inodes that are marked free in the inode allocation bitmap. The file is then assumed to be an incomplete allocation and is removed from the inode bitmap or directory.

For deletion, if an inode is reallocated before the directory entry is updated, two directory entries might refer to the same inode by mistake. Unless the map of used inodes has been updated to show the deletion of the file, this error is unrecoverable. FFS writes both the inode and the directory synchronously to prevent this problem. VIFS writes the directory block asynchronously and doesn't write the inode at all. The inode allocation bitmap is updated in memory to reflect the deletion after the directory block has been written. It is impossible to reallocate the inode until that update. A crash that occurs after the directory has been written will result in an unreferenced inode, which can be freed during recovery.

Although VIFS can base file creation and deletion on asynchronous operations without danger of losing file-system integrity, such an implementation does change the semantics of these operations from the point of view of an application. FFS guarantees that a created file is on the disk or an unlinked file is removed from the disk before returning from the system call. VIFS cannot make such a guarantee when operating in asynchronous mode. Since we wish programs to enjoy the same consistency semantics under VIFS as under FFS in those rare cases where it is important, we allow VIFS file systems to act in either a synchronous or an asynchronous manner. VIFS provides options that allow the system administrator to specify either FFS semantics or VIFS semantics. Totally synchronous semantics (data block writes are also synchronous) are also supported. These options can be set for the entire file system, for particular directories (subsequently created files in such directories inherit the semantics), and for individual files.

2.7.2 Circular files

Circular files are preallocated to a fixed size and can be implicitly (as the file grows) or explicitly truncated from either the beginning or end. (The application program makes a file circular by an ioctl call immediately after creation.) File I/O is performed through a translation table that maps logical offsets into true offsets. Circular files are particularly useful for log files. They reduce administrative overhead and often have good data-block locality.
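The offset translation can be pictured with a small sketch (our own simplification; the paper does not give the table format): once a circular file of fixed size has wrapped, a logical offset is mapped to a true offset by adding the current head position modulo the preallocated size.

/*
 * Sketch: translating a logical offset in a circular file to a true offset
 * within its preallocated region.  `head` is where the logically-first byte
 * currently lives; it advances as the beginning of the file is truncated.
 */
#include <stdint.h>

struct circular_file {
    uint64_t size;   /* preallocated size in bytes */
    uint64_t head;   /* true offset of logical offset 0 */
};

static uint64_t circ_translate(const struct circular_file *cf, uint64_t logical)
{
    return (cf->head + logical) % cf->size;
}

/* When an append would exceed the established size, the oldest data is
 * overwritten by advancing the head, i.e. implicitly truncating the start. */
static void circ_advance_head(struct circular_file *cf, uint64_t nbytes)
{
    cf->head = (cf->head + nbytes) % cf->size;
}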

2.7.3 Multiple read-ahead of sequentially accessed file blocks

When FFS reads a block from a file, it first checks to see if the file access pattern appears to be sequential. If so, an asynchronous read of the next file block is issued so that (hopefully) it will be in the buffer cache when the next block is requested. This pattern continues as long as the file is being sequentially accessed. Large read operations are performed in the same manner, one block at a time.

While the FFS scheme works well when exactly one process is sequentially accessing a file, it degrades severely when many processes are sequentially accessing different files. This decrease in performance is caused by frequently moving the disk head between different locations on the disk, thus often causing the next read of a block to wait instead of benefitting from the previously issued read-ahead request. Multiple read-ahead in VIFS has proven to be an extremely effective solution to this problem. Read operations queue as many blocks as allowed, thus effectively reducing disk head movement from n/2 to n/b (where n is the total number of disk blocks to be read, and b is the number queued during each read operation). (The number of blocks a process is allowed to read ahead is based on the number of sequential accesses the process has already performed, the number of processes accessing files sequentially, a predefined maximum that cannot be exceeded for any single file, and a predefined maximum amount of buffer space that all sequential processes are allowed to consume for multiple read-ahead. This value changes over time.) Single sequential processes generally show no benefit from multiple read-ahead. Section 4 demonstrates the advantages of this feature.
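The description above suggests a sizing rule along the following lines; this sketch is our own guess at such a policy (the names, the caps, and the exact formula are assumptions, not the kernel code): grow the window with the observed sequential run, then clamp it by a per-file maximum and by a fair share of the global read-ahead buffer budget.

/*
 * Sketch: choosing how many blocks to queue for multiple read-ahead.
 * Grows with the length of the sequential run, capped per file and by a
 * fair share of the buffer space reserved for read-ahead.
 */
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

static size_t readahead_blocks(size_t sequential_run,    /* consecutive sequential reads so far */
                               size_t per_file_max,      /* e.g. 8 blocks in the Section 4 test */
                               size_t readahead_budget,  /* cache blocks usable for read-ahead  */
                               size_t nsequential_procs) /* processes currently reading sequentially */
{
    if (sequential_run == 0 || nsequential_procs == 0)
        return 0;                                    /* not sequential: no read-ahead */

    size_t fair_share = readahead_budget / nsequential_procs;
    size_t want = sequential_run;                    /* queue more as the run gets longer */

    return min_sz(want, min_sz(per_file_max, fair_share));
}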

3 Implementing the Viva File System

The complexity of the Viva file system is approximately the same as BSD 4.3 FFS. The implementation of the allocation bitmap and sparse bitmap code is uniform across data blocks, inodes, and fragments, thus controlling complexity. Many of the VIFS directory routines are similar to those of FFS. The remainder of this section is devoted to specific implementation details and problems encountered during the implementation. We also discuss tunable parameters and various problems of VIFS.

3.1 File system creation parameters

VIFS file-system parameters tune data structure sizes and policies:

- Number of inodes. Total number of inodes in the file system.

- Number of blocks. Total number of blocks in the file system (including the superblock, inode blocks, and data blocks).

- Inode size. Number of bytes in an inode (including file contents); must be at least 256. The amount in excess of 256 is used for file contents.

- Block size. Number of bytes in a block (must be a multiple of 16K; the code can be changed to allow 8K blocks, but this change will reduce the maximum size of the file system, and an 8K block size would be of little use in VIFS unless most files were smaller than 16K).

- Maximum fragment percentage. Maximum percentage (from 0 to 100) of data blocks that can be split into fragments.

- Fragment size. Number of bytes in a fragment (must divide block size).

- Fragment chunk size. Number of fragment blocks to allocate when a new fragment block is needed.

- Inode file-contents policy. Whether the file contents in the inode are taken from the beginning or end of the file. This policy may be changed dynamically, but all files previously created remain the same.

Each of these parameters may be specified at the time the file system is created, or a default value will be used.
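A hedged sketch of how these creation parameters might be gathered and sanity-checked is shown below; the structure, field names, and checks are our own illustration of the rules stated in Sections 2.4 and 3.1, not the on-disk superblock format or the actual creation utility.

/*
 * Sketch: file-system creation parameters and the consistency checks
 * implied by the text.  Field names and layout are illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

enum contents_policy { CONTENTS_FROM_BEGINNING, CONTENTS_FROM_END };

struct viva_params {
    uint64_t ninodes;            /* total inodes                            */
    uint64_t nblocks;            /* total blocks, incl. superblock, inodes  */
    uint32_t inode_size;         /* bytes, >= 256; excess holds file data   */
    uint32_t block_size;         /* bytes, multiple of 16K                  */
    uint32_t frag_size;          /* bytes, must divide block_size           */
    uint32_t max_frag_pct;       /* 0..100                                  */
    uint32_t frag_chunk_blocks;  /* fragment blocks allocated per chunk     */
    enum contents_policy policy; /* where in-inode file contents come from  */
};

static int viva_params_ok(const struct viva_params *p)
{
    if (p->inode_size < 256)                                 return 0;
    if (p->block_size % (16 * 1024) != 0)                    return 0;
    if (p->frag_size == 0 || p->block_size % p->frag_size)   return 0;
    if (p->max_frag_pct > 100)                               return 0;
    /* Suggested restriction: block size evenly divisible by inode size,
       so no space in an inode block is wasted. */
    if (p->block_size % p->inode_size != 0)
        fprintf(stderr, "warning: inode size does not divide block size; "
                        "some space in each inode block will be unusable\n");
    return 1;
}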

3.2 Implementation notes

Since VIFS allows the use of large block sizes and small fragments, we require a minimum block size of 16K and a fragment size of 512 bytes. (We can easily reduce the minimum fragment size. For highly stable file systems, such a reduction would reduce wasted space, and the increased allocation cost would be inconsequential.) The larger block size significantly increases performance and allows larger files to avoid indirect blocks (up to about 20MB). Indirect blocks, when they are needed, map up to about 2GB of file data. Most system administrators don't use FFS with 16K blocks because the minimum fragment size becomes 2K. VIFS has a minimum fragment size of 512 bytes regardless of the block size.

When VIFS uses FFS-style synchronicity, creates and deletes become about 25% slower than under FFS since the inodes and data blocks are kept on opposite ends of the partition. Read and write performance remains the same regardless of the type of synchronicity. The synchronous nature of FFS create and delete operations is usually only for file-system integrity, not application-oriented semantics [2], and VIFS guarantees integrity without using synchronous operations. Therefore, we do not recommend using this option in VIFS unless the speed of these operations is not a consideration or the applications demand it.

The three primary bitmaps (for blocks, inodes, and fragments) require one bit for each item represented. We use at least one block per primary bitmap. The derived segment information for all three allocation bitmaps fits into a single 16K block; we place it in the superblock.

VIFS deviates from standard BSD 4.3 file systems in several ways. The layout of VIFS does not allow holes, which are areas of a file that are allocated on demand only when written for the first time. Holes save space in large files that are treated as sparse arrays. We are currently reviewing ways to represent holes efficiently in VIFS.

4 Experience with the Viva File System

The Viva file system is currently implemented under BSD/386 (a trademark of Berkeley Software Design, Inc., Falls Church, Virginia), a version of BSD 4.3. VIFS is operational and in use. We notice significant improvement in many applications such as news, mail, large compilations, database access, and general large file access. This section provides an overview of the characteristic speed and reliability of VIFS. We report the timings for several benchmarks and discuss crash recovery.

4.1 Benchmarks

This section discusses performance of VIFS and compares it with FFS. We provide FFS timings using both 8K and 16K blocks, while VIFS timings use 16K data blocks. The timings shown in Figures 6 and 7 compare the performance of both FFS and VIFS. These tests were performed on clean file systems on an 80486/33MHz machine with 8MB of memory, a 1MB buffer cache, and a Maxtor 4380E ESDI disk (18 millisecond average seek time and maximum 900KB/sec transfer rate). We used the same partition on the same disk for comparative tests between VIFS and FFS.

Figure 6 shows the time needed to create 1000 files and write them sequentially, rewrite them in the order of creation, sequentially read each file, and then delete those 1000 files. We ran this test for many fairly small file sizes. The create phase and rewriting phase in VIFS have nearly identical performance, while there is considerable discrepancy between the two phases in FFS. File creation in VIFS is limited only by CPU speed and the amount of buffer cache available. VIFS reads the 1000 files about 1.7 times as fast as FFS. The delete-phase time is independent of file size, but VIFS is an order of magnitude faster and is limited only by CPU speed. The large fluctuations for FFS with 8K blocks may be due to the allocation policy that spreads files across cylinders.

The timings for a single very large file shown in Figure 7 show similar results. The random reading test shown in Figure 7 consists of reading all 16K file blocks in a predetermined random order. The same order is used for all file systems. We believe that VIFS is faster than FFS largely because it needs fewer indirect blocks (at most one) and is likely to find it in the block cache.

The amount of compression of our sparse bitmaps in comparison with 32-bit block addresses is a major consideration when discussing our speed increases. We say that a file has perfect compression, or a compression ratio of 32, if its data blocks are completely contiguous. (We ignore the fact that there is some inode space taken up by the other size and index fields in making this definition.) A compression ratio of 16 means that, on the average, for every two bits in the sparse bitmap, one bit is used (that is, twice the space of perfect compression is required). The worst compression ratio is 1. Figure 8 shows that VIFS does not degrade significantly in comparison to FFS when files are forced to have low compression ratios. To produce a test case that forces particular compression ratios, we filled the file system with 16KB files and deleted every 32nd file for a compression ratio of 1, and similarly for other compression ratios. There are many ways a particular compression ratio may be reached. A compression ratio of 16, for example, may be reached by any of the following methods:

- Fill the file system with 16KB files, and then remove every other file. This is the worst case depicted in Figure 8.

- Fill the file system with 16KB files, and then remove all files numbered X, where X mod 32 < 16. This is the best case depicted in Figure 8.

- Use any method that produces exactly 16 free bits for every 32-bit section of the block allocation bitmap.

In all cases, the size of the tested file is 8MB. It is necessary (but not sufficient) for the file system to be over 97% full (32MB free in a 1GB partition) to achieve a compression ratio of 1. All compression ratios of 3 or below occur when the file system is at least 90% full. We could allow the system administrator to reserve a percentage of disk space to improve performance (as in FFS), but we have not observed a serious need for it. Unlike FFS, VIFS allocation performance does not degrade when the file system is close to full.

[Figure 6: Creating, rewriting, reading, and deleting 1000 files. Four panels (creation, rewrite, read, and deletion of 1000 files) plot time in seconds against file size in KB for FFS (8K/1K), FFS (16K/2K), and VIFS (16K/512).]

[Figure 7: Creating, rewriting, reading, and randomly reading a single large file. Four panels plot time in seconds against file size in MB for FFS (8K/1K), FFS (16K/2K), and VIFS (16K/512).]

Since a low compression ratio does not necessarily imply poor clustering, we introduce the notion of effective clustering: the fraction of all blocks in a file for which the next block is adjacent. It is calculated by E = (n - C + 1) / n, where C is the number of completely contiguous chunks of bits and n is the total number of bits set. Isolated bits are counted as a chunk. For example, the worst case for a compression ratio of 16 has an effective clustering of 1/n, where n is the number of blocks in the file. However, the best case for a compression ratio of 16 has an effective clustering of more than 0.96 for a file of 32 disk blocks.

While smaller compression ratios are unimportant for small and medium sized files (less than 1 MB or so), the poor read performance under low ratios is significant for larger files. A large file that was created over a period of months, perhaps by allocating only a few blocks per day, might very well have such a ratio. It is our experience that most of these files are log files, which should probably use circular files to preallocate the desired number of disk blocks.

We performed a series of tests to determine the amount of clustering as the partition ages. For these tests, we create files with sizes drawn from a hyperexponential distribution with mean of 16KB for 95% of the files and 4MB for 10% of the files. (We have performed these tests with both smaller and larger means. Smaller files provide much greater compression since most tiny files fit into fragments, and fragments are necessarily grouped closely together. Larger files also provide greater compression than the values shown.) We also delete files at random. The probability of creation (as opposed to deletion) is aimed to keep the file system fairly full:

  File System Fullness   Probability of a Create   Probability of a Delete
  0-50%                  0.9                       0.1
  50-85%                 0.7                       0.3
  85-92%                 0.5                       0.5
  92-98%                 0.3                       0.7
  98-100%                0.0                       1.0

Table 2 shows a set of snapshots of the file system state as it ages. We measure its age in generations: a generation finishes when each file from the previous generation has been deleted and replaced with a new file. (One generation typically lasts about 1000 file creations.) We measure the compression ratio and the effective clustering of large files (those at least 1MB in length). We also note the current free-space information, which varies significantly over time. We observe that as the file system ages, the mean compression ratio gradually decreases and then fluctuates, but files with lower compression ratios still maintain fairly high effective clustering. (The mean compression ratio and effective clustering fluctuate over time because some 1-2MB files are created when the file system is almost full. There are generally only one or two files with values close to the minimum shown in Table 2, but they can reduce the mean considerably.)
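To make the two metrics concrete, here is a small sketch (our own code, assuming the file's blocks are given as a sorted array of block numbers) that computes the number of contiguous chunks C, the effective clustering E = (n - C + 1)/n, and a rough compression ratio in the sense used above; the ratio estimate assumes a single sparse-bitmap entry spanning the file's blocks.

/*
 * Sketch: computing effective clustering and an approximate compression
 * ratio for a file whose data blocks are a sorted array of block numbers.
 */
#include <stddef.h>
#include <stdio.h>

struct cluster_stats {
    size_t chunks;          /* C: contiguous runs (isolated blocks count)  */
    double effective;       /* E = (n - C + 1) / n                         */
    double compression;     /* about 32 when blocks are fully contiguous   */
};

static struct cluster_stats analyze(const unsigned long *blocks, size_t n)
{
    struct cluster_stats s = { 0, 0.0, 0.0 };
    if (n == 0)
        return s;

    s.chunks = 1;
    for (size_t i = 1; i < n; i++)
        if (blocks[i] != blocks[i - 1] + 1)
            s.chunks++;

    s.effective = (double)(n - s.chunks + 1) / (double)n;

    unsigned long span = blocks[n - 1] - blocks[0] + 1;   /* bitmap bits spanned */
    s.compression = 32.0 * (double)n / (double)span;
    return s;
}

int main(void)
{
    /* 8 blocks in two contiguous runs: C = 2, E = 0.875. */
    unsigned long blocks[] = { 100, 101, 102, 103, 200, 201, 202, 203 };
    struct cluster_stats s = analyze(blocks, 8);
    printf("C = %zu, E = %.3f, compression ~ %.1f\n",
           s.chunks, s.effective, s.compression);
    return 0;
}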

[Figure 8: Create and read performance for an 8MB file with various compression ratios. Four panels (create performance and read performance, each in the worst and best cases) plot time in seconds against compression ratio for FFS (8K/1K), FFS (16K/2K), and VIFS (16K/512).]

Snapshots of file system state for files ≥ 1MB (15,373 total blocks in file system)

                 Comp. Ratio           Effective Clust.      Free-block Information
  Gen     Min     Max     Mean      Min     Max    Mean    8-bit  4-bit  2-bit  1-bit  % free
    1    24.71   31.81   29.90     0.97    1.00    0.99     617   1234   2478   4956   32.23
    2    21.24   31.85   29.51     0.96    1.00    0.99      29     67    168    385    2.50
    4    23.57   31.81   28.66     0.93    1.00    0.97     210    443    960   2007   13.05
    8    20.00   31.41   27.16     0.86    1.00    0.96     190    391    859   1817   11.81
   16    13.29   30.19   24.81     0.70    0.99    0.91     152    321    728   1577   10.26
   24    11.93   30.43   24.35     0.59    0.98    0.91     229    484   1024   2100   13.66
   32    15.59   30.09   24.88     0.68    0.99    0.92     183    383    903   1917   12.47
   40    12.60   31.03   25.12     0.44    0.99    0.91      99    220    525   1176    7.60
   44    13.48   30.34   23.87     0.68    0.98    0.92      47    104    251    543    3.53
   48    13.91   30.29   24.21     0.77    0.98    0.92      44    117    329    785    5.10
   64    12.15   31.16   24.04     0.74    0.99    0.90     356    746   1601   3339   21.71

Table 2: Analysis of disk fragmentation in an aging VIFS file system

We find that effective clustering is a good predictor of sequential read performance. For a file with size ≥ 1MB and effective clustering E, read time R may be predicted by R = W - E(W - B), where W is the worst-case time and B is the best-case time for files with the same size and compression ratio.

VIFS works well under load. Compilations achieve a 5% to 15% performance boost over the same compilation performed under FFS. News maintenance runs five to ten times faster under VIFS. Figure 9 shows the timings of multiple processes, each performing sequential 16KB reads of a different 16MB file. For this test, we configure VIFS to use a maximum read-ahead of eight blocks and allow the entire 1MB buffer cache to be used for multiple read-ahead. The performance of VIFS is almost perfectly linear until more than eight processes are running, at which time the buffer cache becomes the limiting factor.

[Figure 9: Multiple processes sequentially accessing different 16MB files. Time in seconds against the number of processes for FFS (8K/1K), FFS (16K/2K), and VIFS (16K/512).]

4.2 Discussion

We believe that VIFS gets its primary speed advantage from the way it clusters files. Clustering makes read-ahead more effective. Multiple read-ahead is crucial to maintaining this advantage under load. A secondary advantage comes from the fact that VIFS almost never needs indirect blocks, and when it does, a few indirect blocks always suffice. FFS is significantly faster with 16KB blocks than with 8KB blocks, first because of the increased effective disk throughput, but second because fewer indirect accesses are needed.

Our experience with storing file contents in the inode is varied. If inodes hold the beginning of files, compilations tend to run about 25% slower with VIFS, mainly because the library routines used to read the files are confused about block alignment. On the other hand, if the inodes hold the end of files, compilation is about the same as having no data in the inode at all. The primary advantage we see in storing the beginning of files is that various information frequently stored at the beginning of a file can be quickly accessed.

4.3 Crash recovery

As mentioned earlier, FFS uses synchronous create and delete operations to avoid the following scenarios: (1) a directory entry references an unallocated inode, (2) an inode is used by two separate files, (3) a disk block is used by two separate files. VIFS provides the same semantics by keeping two bitmaps, one from which free bits are allocated and a second bitmap which is updated for both deletions and allocations. When the inodes and directory blocks are all safely on the disk, the current bitmap is synchronously written to disk, and the in-memory bitmap is updated. If a directory entry points to an unallocated inode (as indicated by the inode allocation bitmap) after a crash, the file is deleted. Blocks duplicated in two separate files cannot occur (barring a severe hardware failure), and two separate directory entries cannot refer to the same inode unless they are referring to the same file.

A crash that occurs before or after the bitmaps are written does not cause a consistency problem; the crash is easily recoverable. A crash that occurs while the bitmaps are being written can cause newly allocated blocks and inodes to appear unallocated, but can never cause unallocated items to appear allocated unless the crash damages the bitmaps. In this case, we can rely on the structure of the file system to discover these items are allocated, since all inodes and directories have already been updated.

This brings us to the question of how the file system checker knows that the bitmaps have been damaged. When the inodes and directories are written, a tag is assigned and written to the disk. Only when this tag is on the disk do we write the bitmaps, and when the bitmaps are safely on the disk we write the tag again in a different location. The tags can only be different when the crash has occurred directly before, during, or directly after the time the bitmaps have been written. In this case, we can either rely on a previous bitmap (bitmap blocks are written in alternate locations) or on the structure of the file system to reconstruct the most recent bitmap. In all other cases, we rely on the bitmaps to restore the file system to a consistent state. To conserve the amount of bitmap information that must be written (and therefore reduce the chance of a crash occurring during a bitmap write), only those sections of the bitmaps that have changed are actually written.
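The tag protocol can be sketched as follows; this is our own rendering of the ordering described above, with in-memory stand-ins for the disk writes, not the VIFS kernel code or its checker.

/*
 * Sketch of the two-tag protocol: a tag is written before the bitmaps and
 * again, in a different location, after them, so the checker can tell
 * whether a crash interrupted the bitmap write.
 */
#include <stdint.h>
#include <stdbool.h>

enum { TAG_BEFORE = 0, TAG_AFTER = 1 };

static uint64_t disk_tags[2];    /* stand-in for the two on-disk tag slots */

static void sync_write_inodes_and_dirs(void)         { /* flush dirty inodes and directories */ }
static void sync_write_changed_bitmap_sections(void) { /* flush only changed bitmap sections */ }
static void sync_write_tag(int loc, uint64_t tag)    { disk_tags[loc] = tag; }

/* Commit path: the write ordering, not the data layout, is the point. */
static void commit_bitmaps(uint64_t tag)
{
    sync_write_inodes_and_dirs();            /* inodes and directories safely on disk */
    sync_write_tag(TAG_BEFORE, tag);         /* only then announce the bitmap update  */
    sync_write_changed_bitmap_sections();
    sync_write_tag(TAG_AFTER, tag);          /* bitmaps now known to be complete      */
}

/* Recovery path: trust the bitmaps only if both tags agree; otherwise fall
 * back to a previous bitmap copy or rebuild from inodes and directories.   */
static bool bitmaps_trustworthy(void)
{
    return disk_tags[TAG_BEFORE] == disk_tags[TAG_AFTER];
}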

5 Comparisons to related work

There has been much work in the area of file systems in the past two decades. Most of this work has focused on special-purpose file systems, but much early work focused on general-purpose file systems. Recently, there has been a surge of new file systems and modifications to old file systems.

Our main reference point has been the BSD 4.3 Fast File System [3]. This file system is an improvement of the original UNIX file system. FFS uses information about the geometry of the physical disk to determine optimal block placement. File blocks and inodes are allocated from cylinder groups of equal size, keeping file blocks from a single file within the same cylinder group if possible. Increasing disk rotational speeds, faster access times and unusual disk geometries make this method less desirable than it has been in the past. VIFS makes no assumptions about the physical disk geometry, but expects that the device driver provides a linear array of disk blocks. This method allows the disk controller to remap bad sectors and arrange blocks in any order it chooses without sacrificing file system performance. New disks with unusual geometries will require no modifications to VIFS.

Other file systems have improved general performance in various ways. Sprite LFS clusters I/O requests, uses large caches and, like VIFS, uses asynchronous creates and deletes [5, 11] to achieve better file system performance. Others use heuristics to tune caching and storage allocation algorithms for current usage patterns [13]. VIFS is related to these systems because it attempts to group blocks in a file close together and the allocation algorithm changes slightly as the file size grows. It differs from these file systems in that it achieves most of its performance from the layout scheme rather than fancy buffering mechanisms.

6 Further work

Although VIFS is useful and complete, we plan to improve it by adding several features and utilities. New file system features might include: (1) append-only files, (2) multiple inode sizes on a single partition, (3) allowing large files to use the data portion of an inode as a sparse bitmap to avoid indirect blocks, (4) the ability to handle holes, (5) allocation policy tuning with per-file granularity, and (6) a better, non-linear directory structure. Applications might include a performance analyzer that suggests parameters for the system administrator, a file system debugger for manually patching file systems, a backup tool (similar to dump), and a visual tool that gives information on the current state of a file system.

We used the buffer caching scheme inherent to BSD 4.3 in our implementation of VIFS. A unified virtual memory and buffer cache [7] or another scheme that makes better use of large memories should provide added performance benefits.

VIFS could easily be modified to act as a software library to be linked into ordinary applications to turn any random-access device or file into a file system. Such a package would be useful to let us experiment with new file structures.

7 Conclusion

We have shown that VIFS provides some beneficial features that are both novel and efficient. Our goal of increasing both file system performance and space conservation without large amounts of buffer cache has been achieved. VIFS virtually eliminates indirect blocks by only doubling the size of the inode. Allocation bitmaps allow us to allocate blocks very quickly. Large block sizes may be used while maintaining the space efficiency of small fragment sizes, thus eliminating the system administrators' best excuse for using smaller block sizes. Multiple read-ahead allows VIFS to maintain much of its speed under heavy load. Because VIFS maintains no knowledge of the physical disk, disks with unusual geometries may be used without modification to the file system. Circular files allow truncation at the beginning of a file and automatic truncation when a file reaches its predetermined size. Guaranteed high-speed access to small files or the beginning of larger files is gained by keeping a small amount of file contents with each inode.

The entire VIFS source distribution, including kernel sources, various utilities, installation instructions and other documentation will be publicly available in mid 1993 in ftp.ms.uky.edu:pub/unix/viva-.tar.Z.

8 Acknowledgements

We would like to thank Ken Kubota, Daniel Chaney and Kenneth Herron for their many helpful comments during the development of VIFS. The insistence of Ken Kubota that truncating the beginning of a file "is a good thing" directly led to the addition of circular files. We would also like to thank Raj Yavatkar, James Griffioen, and Brian Sturgill for reading a preliminary draft of this manuscript.

References

[1] Mary G. Baker, John H. Hartman, Michael D. Kupfer, Ken W. Shirriff, and John K. Ousterhout. Measurements of a distributed file system. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 198-212, 1991.

[2] Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley, 1989.

[3] Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181-197, 1984.

[4] Marshall Kirk McKusick. Fsck: the UNIX file system check program. July 16, 1985.

[5] L. W. McVoy and S. R. Kleiman. Extent-like performance from a UNIX file system. In USENIX Conference Proceedings, pages 33-43, Dallas, Texas, January 1991.

[6] John K. Ousterhout. Why aren't operating systems getting faster as fast as hardware? In USENIX Conference Proceedings, pages 247-256, Anaheim, California, June 1990.

[7] John K. Ousterhout, Andrew R. Cherenson, Frederick Douglis, Michael N. Nelson, and Brent B. Welch. The Sprite network operating system. Computer, 21(2):23-36, 1988.

[8] John K. Ousterhout, Herve Da Costa, David Harrison, John A. Kunze, Mike Kupfer, and James G. Thompson. A trace-driven analysis of the UNIX 4.2 BSD file system. In Proceedings of the 10th Symposium on Operating Systems Principles, pages 15-24. ACM, 1985.

[9] M. L. Powell. The DEMOS file system. In Proceedings of the Sixth Symposium on Operating Systems Principles, pages 33-42. ACM, 1977.

[10] Mendel Rosenblum and John K. Ousterhout. The LFS storage manager. In USENIX Conference Proceedings, pages 315-324, Anaheim, California, June 1990.

[11] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, pages 1-15, 1991.

[12] M. Satyanarayanan. A study of file sizes and functional lifetimes. In Proceedings of the 8th Symposium on Operating Systems Principles, pages 96-108. ACM, 1981.

[13] Carl Staelin and Hector Garcia-Molina. Smart filesystems. In USENIX Conference Proceedings, pages 45-51, Dallas, Texas, January 1991.
