EXT234

EXT

Released in 1992 with Linux 0.96c

EXT2

Released in 1993 with Linux 0.99

Schematics:

Block

A partition, disk, file or block device formatted with a Second Extended Filesystem is divided into small groups of sectors called "blocks". These blocks are then grouped into larger units called block groups.

cool stats (4KiB block size):

block size                   4,096 B (4 KiB)
file system blocks           2,147,483,647
blocks per block group       32,768
inodes per block group       32,768
bytes per block group        134,217,728 (128 MiB)
file system size (real)      17,592,186,036,224 (16 TiB)
file system size (Linux)     17,592,186,036,224 (16 TiB)
blocks per file              1,074,791,436
file size (real)             2,199,023,255,552 (2 TiB)
file size (Linux 2.6.28)     2,199,023,255,552 (2 TiB)

Block Groups

Blocks are clustered into block groups in order to reduce fragmentation and minimise the amount of head seeking when reading a large amount of consecutive data. Information about each block group is kept in a descriptor table stored in the block(s) immediately after the superblock. Two blocks near the start of each group are reserved for the block usage bitmap and the inode usage bitmap, which show which blocks and inodes are in use. Since each bitmap must fit in a single block, the maximum number of blocks per block group is 8 times the block size in bytes (one bit per block).

The block(s) following the bitmaps in each block group are designated as the inode table for that block group and the remainder are the data blocks. The block allocation algorithm attempts to allocate data blocks in the same block group as the inode which contains them.
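To sanity-check the stats above, here is a minimal C sketch (assuming a 4KiB block size and the classic 12 direct + single/double/triple indirect pointer scheme) that derives them from first principles:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t block_size = 4096;                    /* bytes, per the table */
    /* each usage bitmap must fit in one block, one bit per block: */
    uint64_t blocks_per_group = 8 * block_size;    /* 32,768 */
    uint64_t bytes_per_group = blocks_per_group * block_size;  /* 128 MiB */
    /* 12 direct pointers + one single, double and triple indirect block,
     * with block_size / 4 = 1,024 pointers per indirect block: */
    uint64_t p = block_size / 4;
    uint64_t blocks_per_file = 12 + p + p * p + p * p * p;

    printf("blocks per group: %llu\n", (unsigned long long)blocks_per_group);
    printf("bytes per group:  %llu\n", (unsigned long long)bytes_per_group);
    printf("blocks per file:  %llu\n", (unsigned long long)blocks_per_file);
    return 0;
}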

EXT3

Released in 2001 with Linux 2.4.15

Schematics:

Htree full explanation: need to read [2]

Found a BSD chad with actual readable material [3]

The full ext3 paper: [4]

JFS

"JFS" here is ext3's generic journaling layer (known in the kernel as JBD, the Journaling Block Device), added to Linux along with EXT3 to handle journal transactions independently of the FS. A journal, pointed to by an inode, is placed inside the EXT3 FS to provide the journaling capability. The journal can also be placed outside a given FS and can even manage multiple filesystems.

If EXT3 is unmounted cleanly, there is nothing to do: the journal will be empty and the superblock will clear the journal recovery flag; in other words, the partition could now be mounted as EXT2.

The journal is there to fix consistency problems after an unclean reboot; one of those problems is orphaned files.

They happen when a process has a file open while the directory entry for it is removed: you end up with a phantom inode, not referenced anywhere, but still taking up space on your FS. [5]
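A minimal userspace sketch of how such a phantom inode comes to be; the crash comment marks the window that ext3's orphan handling closes:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("phantom.tmp", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return 1;
    unlink("phantom.tmp");              /* directory entry removed...      */
    (void)write(fd, "still here", 10);  /* ...but the inode is still usable */
    /* crash here -> inode referenced by nothing, space still allocated */
    close(fd);                          /* normal path: inode freed on last close */
    return 0;
}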

Also, there is no need to do a total FSCK after a crash; only the journal needs to be checked, which is way faster than what happens with EXT2.

Htree

Htrees were added in Linux 2.5.42, NOT 2.5.40 like Wikipedia suggests.

Made by Daniel Phillips

Finding a file in a dir was insanely slow, so people wanted a Btree to solve the issue. The problem is that a Btree violates the simplicity "rule" that EXT2 was founded upon: the EXT2 implementation was about 5,000 lines of code, and the Btree code alone would have been double that. So the Btree idea was scrapped in favor of a new structure created for ext3: the Htree.

This is me going completely insane; for real info, skip ahead.

OK, this is past 3 hours spent on understanding how the DX block can be read with binary search... please never lose this link (info_length allows us to jump into the middle of the hashes and go divide and conquer on them). [6]

Update: only the root has info_length, and a root block should contain 508 hashes, which is more than the 8-bit (255) info_length can manage...

Final update on this: info_length is the length of the root record, and yes, the first hash-block pair is in reality a limit-count pair... had to go to ext4 for that one. [7]

Study: it looks like dx_countlimit (I think that's the thing that tells you how many DX entries you have) overlaps with dx_entry->hash... could the one hash slot stored at the start of each DX block contain the number of DX entries, allowing for binary search?

My source is that I made it the F@# up, OK, I'm starting to go insane.
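For what it's worth, that guess matches the layout in the kernel's fs/ext4/namei.c: the count/limit pair really does overlay the (unused) hash field of entry 0, which is what makes binary search possible. A simplified sketch of the structures and the search, with host-endian types standing in for the kernel's __le32/__le16:

#include <stdint.h>

struct dx_entry {
    uint32_t hash;   /* unused in entry 0 (overlaid by count/limit) */
    uint32_t block;  /* logical block in the directory file */
};

struct dx_countlimit {
    uint16_t limit;  /* max entries that fit in this block */
    uint16_t count;  /* entries currently in use (entry 0 included) */
};

/* Find the entry whose subtree may contain 'hash'. */
static struct dx_entry *dx_find(struct dx_entry *entries, uint32_t hash)
{
    uint16_t count = ((struct dx_countlimit *)entries)->count;
    struct dx_entry *p = entries + 1;   /* entry 0 holds count/limit */
    struct dx_entry *q = entries + count - 1;

    while (p <= q) {
        struct dx_entry *m = p + (q - p) / 2;
        if (m->hash > hash)
            q = m - 1;
        else
            p = m + 1;
    }
    return p - 1;  /* largest entry with entry->hash <= hash */
}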

Quick mental reminder: there is a flag on a dir's inode telling us whether it is Htree-indexed; ext2 would wipe that flag if any editing happened, which is how we are sure shit wasn't done to the dir between reboots.

Improving scalability

Quick reminder: these changes happened at a time when multicore CPUs weren't that common; the Pentium 4, a 1-core/2-thread CPU, was still current in 2004.

Under parallel load, both EXT3 and the journaling layer itself had room for scalability improvements:

  1. Before the patch, ext2/3 needed to lock the superblock in order to allocate new blocks and inodes, so as to keep counters like s_free_inodes_count and s_free_blocks_count updated. This causes issues in multithreaded situations, as only one thread can allocate at a time. After the patch, the locks moved to the block group descriptor level, and the superblock counters are only updated on a syscall like statfs() or at umount.
  2. Replaced the Big Kernel Lock (BKL) with smaller locks in the journal code.
  3. Replaced sleep_on() with wait_event(). sleep_on() was unsafe; most uses looked like this:

while (we_have_to_wait)
        sleep_on(some_wait_queue);

The problem is that if the queue wakeup event happens between the while check and the sleep_on() call, you are stuck forever... to prevent this, they used the Big Kernel Lock (BKL), which ended up being removed. [8]
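For contrast, here is a simplified form of the race-free pattern that wait_event() boils down to in the kernel: the task puts itself on the queue and marks itself as sleeping before re-checking the condition, so a wakeup firing between the check and the sleep is no longer lost.

DEFINE_WAIT(wait);

for (;;) {
        prepare_to_wait(&some_wait_queue, &wait, TASK_UNINTERRUPTIBLE);
        if (!we_have_to_wait)    /* condition re-checked after queuing,  */
                break;           /* so a wakeup in between is not missed */
        schedule();              /* actually sleep */
}
finish_wait(&some_wait_queue, &wait);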

These changes yielded a 10x improvement under parallel loads.

Reservation based block allocator

Preallocation (Old)

If you write 2 files from the same dir at the same time, their data blocks can become interleaved. To combat this, EXT2 did preallocation by changing the block bitmap to show that a few blocks beyond the current block were already in use. This was corrected after the write finished or, if the system crashed, by the FSCK at reboot.

With EXT3, that system doesn't work: we no longer run a complete FSCK at reboot, only a check of the journal, which is not compatible with the preallocation trick.

Reservation (New)

Please read [4], but if you don't, here is the TLDR.

Reservation is a process done 100% in memory instead of on disk.

You get a reservation window which is 8 blocks long by default (unless changed via an ioctl command).

Only the file that has opened the reservation window can write in it.

Here is the complete write process:

  • The first time an inode needs a new block, a block allocation structure is allocated and linked to the inode
  • The block allocator searches for a region of blocks that fulfills three criteria (see the sketch after this list):
  1. The region must be near the ideal “goal” block, based on ext2/3’s existing block placement algorithms.
  2. The region must not overlap with any other inode’s reservation windows.
  3. The region must have at least one free block.

  • As an inode keeps growing, free blocks inside its reservation window will eventually be exhausted. At that point, a new window will be created for that inode.
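Here is a simplified userspace sketch of the window search under the three criteria above. A plain array stands in for the per-FS red-black tree the real ext3 code uses, and every name in it is illustrative rather than the kernel's:

#include <stdbool.h>

#define NR_WINDOWS 16        /* illustrative: one window per writing inode */
#define SEARCH_RADIUS 4096   /* how far past the goal we're willing to look */

struct rsv_window { unsigned long start, end; bool in_use; };

static struct rsv_window windows[NR_WINDOWS];  /* every inode's current window */

/* criterion 2: must not overlap any other inode's reservation window */
static bool overlaps(unsigned long s, unsigned long e)
{
    for (int i = 0; i < NR_WINDOWS; i++)
        if (windows[i].in_use && s <= windows[i].end && windows[i].start <= e)
            return true;
    return false;
}

/* criterion 3: stub standing in for a real block-bitmap lookup */
static bool has_free_block(unsigned long s, unsigned long e)
{
    (void)s; (void)e;
    return true;
}

/* Try to place a default 8-block window near 'goal' (criterion 1). */
static long find_window(unsigned long goal, unsigned long fs_blocks)
{
    for (unsigned long s = goal; s + 8 <= fs_blocks && s < goal + SEARCH_RADIUS; s++)
        if (!overlaps(s, s + 7) && has_free_block(s, s + 7))
            return (long)s;
    return -1;   /* caller retries with another goal block */
}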

Online resizing

The availability of a Logical Volume Manager (LVM) motivated the desire for online resizing.

Here are the 3 parts of growing:

  1. Expand the last partial block group (if any) to be a full block group.
  2. Add a new entry in the group descriptor table and fill the new group; this works in increments of up to 16GiB for a 4KiB block size.
  3. Add a new block to the group descriptor table and add new groups to that block (the FS needs to be patched/prepared for this before mounting).

Growing is done incrementally; rinse and repeat until the desired size is reached.

EXT4

Released 10 October 2006 with Linux 2.6.19

Schematics:

Assume a 4KiB block size for all info in the EXT4 sections.

"To address this limit, in August 2006, we posted a series of patches introducing two key features to ext3: larger filesystem capacity and extents mapping. The patches unavoidably change the on-disk format and break forwards compatibility. In order to maintain the stability of ext3 for its massive user base, we decided to branch to ext4 from ext3. [...] We want to give the large number of ext3 users the opportunity to easily upgrade their filesystem, as was done from ext2 to ext3." [9]

Main source for everything [10]

Larger FS capacity

Under EXT3, the max size of the FS is 16TiB. To boost that, EXT4 brought a new "64-bit" mode (compared to "32-bit" before). This doesn't mean that every field that was 32 bits is now 64: it boosts the physical block numbers in extents from 32 to 48 bits, and splits a bunch of fields in the BGDT into pairs, with the _lo suffix holding the lower 32 bits and the _hi suffix holding the upper bits. The theoretical limit of the FS is 64ZiB, but the real practical limit is 1EiB because of the 48-bit extent pointers. This is not a real problem, as EXT4 was understood from the start to be a "fix" to extend the life of EXT3.
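The limit arithmetic, worked out in a small C program (4KiB = 2^12-byte blocks):

#include <stdio.h>

int main(void)
{
    /* ext3: 32-bit block numbers -> 2^32 * 2^12 = 2^44 B = 16 TiB  */
    /* ext4: 48-bit extent blocks -> 2^48 * 2^12 = 2^60 B = 1 EiB   */
    /* full 64-bit block numbers  -> 2^64 * 2^12 = 2^76 B = 64 ZiB  */
    printf("ext3 limit: %llu bytes (16 TiB)\n", 1ULL << 44);
    printf("ext4 practical limit: %llu bytes (1 EiB)\n", 1ULL << 60);
    /* 2^76 doesn't even fit in 64 bits, hence "theoretical" */
    return 0;
}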

Extents

See EXT4 I_Block, as extents are fairly self-explanatory. The goal is to combine pointer + length in order to map bigger contiguous chunks of data. It is also more resistant to corruption, as the extent tree contains checksums and logical block numbers, which allow one to see if an extent contains gibberish; that isn't possible on a big EXT3 partition with indirect pointers.
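For reference, here is the on-disk extent record as laid out in the kernel's fs/ext4/ext4_extents.h, simplified to host-endian fixed-width types; it also shows the _hi/_lo split mentioned above:

#include <stdint.h>

struct ext4_extent {
    uint32_t ee_block;    /* first logical block this extent covers */
    uint16_t ee_len;      /* number of blocks; high bit set = unwritten */
    uint16_t ee_start_hi; /* high 16 bits of the physical block number */
    uint32_t ee_start_lo; /* low 32 bits of the physical block number */
};

/* Reassemble the 48-bit physical block number from the hi/lo split. */
static inline uint64_t ext_pblock(const struct ext4_extent *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start_lo;
}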

Backward compatibility

EXT4 is backward compatible with EXT3, meaning that the EXT4 driver can mount an EXT3 partition (given that we provide it with flags to not use things like extents...). In fact, there is no EXT3 driver in the kernel anymore: once EXT4 was finished, the EXT4 driver with those flags had better perf than the EXT3 driver, so the latter was removed.

Huge Dir

With EXT3, the default limit of subdirs for a dir is 32,000. In EXT4, Htrees are enabled by default, meaning you can have around 10 million subdirs with a 2-level Htree, or around 2 billion if you enable the 'large_dir' option for 3-level Htrees. [11]

Fast FSCK

Inside the BGDT are bg_itable_unused_lo and bg_itable_unused_hi, which tell us that there are no used inodes after a certain offset in the inode table; this, along with a bunch of other small changes, makes FSCK way faster. [12]
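A tiny sketch of how a checker can exploit that field; the name bg_itable_unused is real, everything else is illustrative:

/* per-inode checks would go here; stubbed for the sketch */
static void check_inode(unsigned int ino) { (void)ino; }

static void scan_group(unsigned int inodes_per_group,
                       unsigned int bg_itable_unused)
{
    unsigned int used = inodes_per_group - bg_itable_unused;

    for (unsigned int i = 0; i < used; i++)
        check_inode(i);
    /* inodes [used, inodes_per_group) are known unused and skipped */
}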

Allocation

Multiblock allocation

Before, every block allocation was a single call, meaning that if you wanted to allocate 100MiB of data you would need 25,600 calls! Multiblock allocation fixes that with a single call for multiple blocks, with added optimizations.

Speculative allocation

The allocator will speculatively allocate what you need + 8KiB to reduce fragmentation. [13]

Delayed allocation

Delayed allocation, rather, delays the allocation of the blocks while the file is kept in cache, until it is really going to be written to the disk. This gives the block allocator the opportunity to optimize the allocation in situations where the old system couldn't. Delayed allocation plays very nicely with the two previous features mentioned, extents and multiblock allocation, because in many workloads, when the file is finally written to the disk, it will be allocated in extents whose block allocation is done with the mballoc allocator. The performance is much better, and the fragmentation is much improved in some workloads. [14]

TLDR: keep writes in cache as long as you can (until memory is full or sync() is called) and optimize when you finally write, using multiblock allocation, extents and more.

Persistent preallocation

Persistent preallocation allows preallocating space for a file in the form of extents (see ee_len in EXT4 I_Block). The preallocation is done with fallocate(), and the blocks will read back as zeros even though they were never zeroed on disk. The preallocated space will normally be contiguous and is often used by software like P2P clients.
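Usage example: a minimal program preallocating a 1GiB file with fallocate(2). On ext4 this creates unwritten extents, so reads return zeros without the blocks ever having been written:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    int fd = open("prealloc.bin", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* mode 0: allocate blocks and extend the file size to 1 GiB */
    if (fallocate(fd, 0, 0, 1024L * 1024 * 1024) != 0) {
        perror("fallocate");
        return 1;
    }
    close(fd);
    return 0;
}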

Preallocation helped a bit. [15]

Metadata checksumming

Not much to say: pretty simple checksums applied to metadata for increased reliability.

"Mke2fs will now create file systems with the metadata_csum and 64bit features enabled by default." [16]

Details on metadata checksumming: [17]

Better time

Added fields in the inode allow for nanosecond precision on timestamps and also avoid the year-2038 problem by adding 408 years of range.
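A sketch of how that extra 32-bit field is split in ext4: the low 2 bits extend the epoch (3 extra epochs of ~136 years each, hence the 408 years) and the upper 30 bits hold the nanoseconds.

#include <stdint.h>

struct ext4_time {
    int64_t  sec;   /* seconds, now with more than 32 bits of range */
    uint32_t nsec;  /* nanosecond part */
};

static struct ext4_time decode_time(uint32_t ondisk_sec, uint32_t extra)
{
    struct ext4_time t;
    /* low 2 bits of 'extra' become the high bits of the second count */
    t.sec  = (int64_t)(int32_t)ondisk_sec + ((int64_t)(extra & 0x3) << 32);
    t.nsec = extra >> 2;
    return t;
}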

Extra space for Extended attributes

There is space at the end of the inode to put extended attributes, and it can be expanded at filesystem creation if needed. The idea is that if you can get the values you need without seeking on the disk again, you may get crazy gains. Previously, EAs were handled via a pointer to a separate block, which is slow...

Barriers

Barriers are enabled by default in EXT4. Their goal is to prevent data from being written before the journal has fully written its record. This is necessary because of the large internal caches used by hard drives, which may reorder writes. This comes at a small perf cost.

Flexible Block Groups

Allows the bitmaps + inode tables of multiple block groups to be placed together right after the BGDT; this allows for more contiguous files, not interrupted by the bitmaps. Superblock and BGDT copies will still be there. [18]

Meta Block Group

The BGDT can be at most 1 block group large. Normal block groups are 128MiB, so at most 2^27 / 64B = 2^21 block group descriptors fit, giving a max FS size of 2^21 * 128MiB = 256TiB. To fix this, we cluster 64 BGD entries (filling a 4KiB block) and put a copy of that descriptor block (alongside the superblock backups) in the first, second and last block groups of the cluster. This new "meta" block group covers 64 * 128MiB = 8GiB of disk, and we can tile the whole disk with those. Because of the 2^32 limit on block group descriptors, we can now go up to 2^32 * 128MiB = 512PiB. Note that the real limit for the BGDT was the block group size; once that was taken out, we can grow up to the limit of possible block group descriptors. [18]
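The meta_bg arithmetic above, worked out in a small C program (4KiB blocks, 64-byte descriptors):

#include <stdio.h>

int main(void)
{
    unsigned long long group_bytes = 128ULL << 20;   /* 128 MiB per group */
    unsigned long long desc_per_blk = 4096 / 64;     /* 64 BGDs per block */

    /* Without meta_bg: the BGDT must fit inside one block group. */
    unsigned long long max_groups = group_bytes / 64;           /* 2^21 */
    printf("classic max FS: %llu TiB\n",
           max_groups * group_bytes >> 40);                     /* 256  */

    /* With meta_bg: one descriptor block per 64-group cluster,   */
    /* so each meta group covers 64 * 128 MiB = 8 GiB, and the    */
    /* 2^32 descriptor limit gives 2^32 * 128 MiB = 512 PiB.      */
    printf("meta group covers: %llu GiB\n",
           desc_per_blk * group_bytes >> 30);                    /* 8    */
    printf("meta_bg max FS: %llu PiB\n",
           (1ULL << 32) * group_bytes >> 50);                    /* 512  */
    return 0;
}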

Project quotas

Using the project ID (i_projid) field in the inode, we can apply a quota to multiple files/dirs and all their subdirs. This isn't strictly hierarchical, as it can be applied on a per-file/dir basis.

Transparent encryption

On a per-dir basis??? No clue how that works... for now, don't know, don't care. [19]

Multiple Mount Protection

Don't know, don't care. [20]

Extra

EXT2-3-4 are all the same FS:

"The way I've always treated it, and it's the way I believe most of the ext234 developers have treated it is, that what users call ext2, ext3, and ext4 are different _implementations_ of the same _file_ _system_ [format]. That is to say, ext4 simply happens to be a fuller, more complete implementation of the same file system as ext2 and ext3. Ext2 doesn't support certain features such as journaling and directory indexing; ext3 doesn't support some advanced features such as delayed allocation and extents, and requires that the journal always be present. Ext4 is a superset of ext2 plus ext3 plus delayed allocation, extents, a multi-block allocator, and a few other new features. But they are all the same file system." [21]

Kernel History

6.11:

20% perf gain in some tests with new JBD2 (journal) optimizations. [22]

6.13:

Limited support for atomic write operations has been added to ext4, XFS and some parts of md (RAID). This was first worked on in 6.11 for SCSI and NVMe storage. [23]

Sources: