Released in 1992 with Linux 0.96c
Released in 1993 with Linux 0.99
Schematics:
A partition, disk, file or block device formatted with a Second Extended Filesystem is divided into small groups of sectors called “blocks”. These blocks are then grouped into larger units called block groups.
cool stats:
Block size | 4KiB |
---|---|
file system blocks | 2,147,483,647 |
blocks per block group | 32,768 |
inodes per block group | 32,768 |
bytes per block group | 134,217,728 (128MiB) |
file system size (real) | 17,592,186,036,224 (16TiB) |
file system size (Linux) | 17,592,186,036,224 (16TiB) |
blocks per file | 1,074,791,436 |
file size (real) | 2,199,023,255,552 (2TiB) |
file size (Linux 2.6.28) | 2,199,023,255,552 (2TiB) |
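The "blocks per file" figure above can be sanity-checked from ext2's indirect-pointer scheme: with 4KiB blocks, each indirect block holds 1024 4-byte block pointers. A quick sketch (plain arithmetic, nothing ext2-specific assumed):

```python
BLOCK = 4096
PTR = 4                      # block pointers are 32-bit
p = BLOCK // PTR             # pointers per indirect block: 1024

# 12 direct pointers + 1 single-, 1 double-, 1 triple-indirect tree
blocks_per_file = 12 + p + p**2 + p**3
print(blocks_per_file)       # matches the table: 1,074,791,436
```

The 2TiB file-size cap is lower than 1,074,791,436 × 4KiB because i_blocks counts 512-byte sectors in a 32-bit field: 2^32 × 512B = 2,199,023,255,552 bytes.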
Blocks are clustered into block groups in order to reduce fragmentation and minimise the amount of head seeking when reading a large amount of consecutive data. Information about each block group is kept in a descriptor table stored in the block(s) immediately after the superblock. Two blocks near the start of each group are reserved for the block usage bitmap and the inode usage bitmap, which show which blocks and inodes are in use. Since each bitmap is limited to a single block, and a block holds 8 bits per byte, the maximum number of blocks in a block group is 8 times the block size in bytes.
The block(s) following the bitmaps in each block group are designated as the inode table for that block group and the remainder are the data blocks. The block allocation algorithm attempts to allocate data blocks in the same block group as the inode which contains them.
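The one-block-bitmap limit turns directly into the numbers in the stats table; a small check (plain arithmetic, assuming 4KiB blocks):

```python
BLOCK = 4096
blocks_per_group = BLOCK * 8              # one bit per block in a one-block bitmap
bytes_per_group = blocks_per_group * BLOCK
print(blocks_per_group, bytes_per_group)  # 32768 blocks, 134217728 bytes (128 MiB)
```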
Schematics:
Released in 2001 with Linux 2.4.15
HTree full explanation: need to read [2]
Found a BSD chad with actual readable material [3]
full ext3 [4]
JBD (the journaling layer, often mislabeled "JFS", which is actually IBM's separate filesystem) was added to Linux with EXT3 to handle journal transactions independently of the FS. A JBD journal, pointed to by an inode, is placed in the EXT3 FS to provide the journaling capability. A JBD journal can also be placed outside a given FS and even manage multiple FSes.
If EXT3 is unmounted cleanly, there is nothing to do: the journal will be empty and the superblock will clear the needs-recovery flag; in other words, it could now be mounted as EXT2.
There to fix consistency problems after an unclean reboot; one of those was orphaned files.
They happen when a process has a file open while the directory entry for it is removed. This makes it so you have a phantom inode, not referenced anywhere, but that still takes up space on your FS.[5]
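The phantom-inode situation is easy to reproduce from userspace; this sketch uses plain POSIX semantics (open + unlink), nothing ext-specific:

```python
import os, tempfile

d = tempfile.mkdtemp()
path = os.path.join(d, "phantom")
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
os.write(fd, b"still here")

os.unlink(path)               # directory entry gone: the inode is now "orphaned"
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 10)        # ...but its data blocks stay allocated until close
os.close(fd)                  # only now can the FS reclaim the space
print(data)
```

Crash before the close and ext2 needs a full FSCK to find the lost blocks; ext3 instead tracks such inodes in an on-disk orphan list processed at the next mount.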
Also, no need to do a total FSCK after a crash: only the journal is replayed, which is way faster than whatever happens with EXT2.
Added HTrees in Linux 2.5.42, NOT 2.5.40 like Wikipedia suggests
Made by Daniel Phillips
Finding a file in a dir was insanely slow, so people wanted a BTree to solve the issue. Problem is, a BTree violates the simplicity "rule" that EXT2 was founded upon. In fact, the EXT2 implementation was about 5,000 lines of code, and BTree code alone would have been double that. So the BTree idea was scrapped in favor of a new structure created for ext3: the HTree.
OK, this is past 3h on understanding how the DX block can be read in "binary search"... pls never lose this link (info_length allows us to jump to the middle of the hashes and go divide and conquer on them)[6]
Update: only the root has info_length, and a root block should contain 508 hashes, which is more than the 8-bit (255) info_length can manage....
Final update on this: info_length is the length of the root record, and yes, the first hash-block pair is in reality a limit-count pair... had to go to ext4 for that one.[7]
Study: looks like dx_countlimit (I think the thing that tells you how many DX entries you have) overlaps with dx_entry->hash.... could the first hash stored in each DX block contain the quantity of DX entries, allowing for binary search?
My source is that I made it the F@# up ok i'm starting to go insane.
Quick mental reminder: there is a flag on the inode of a dir block telling us if it is HTree-infused; ext2 would wipe that flag if any editing happened, which is how we are sure shit wasn't changed in between reboots
Under parallel load, both EXT3 and the journal itself had room for scalability improvements.
Replaced sleep_on() by wait_event(). sleep_on() was unsafe; most implementations looked like this:

```c
while (we_have_to_wait)
        sleep_on(some_wait_queue);
```

The problem comes if the queue wakeup event happens between the while test and the sleep_on(): you are stuck forever... for this not to happen they used the Big Kernel Lock (BKL), which ended up being removed. [8]
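The same lost-wakeup window exists in userspace, and the wait_event() fix amounts to re-checking the predicate under a lock. A minimal analogue with a condition variable (illustration only, not kernel code):

```python
import threading

cond = threading.Condition()
ready = False
log = []

def waiter():
    with cond:
        # like wait_event(): the predicate is tested under the lock, so a
        # notify between the test and the sleep cannot slip through
        cond.wait_for(lambda: ready)
        log.append("woke")

def waker():
    global ready
    with cond:
        ready = True
        cond.notify_all()

t = threading.Thread(target=waiter)
t.start()
waker()
t.join()
print(log)   # the waiter always wakes, even if the notify races with the wait
```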
These changes yielded a 10x speedup under parallel loads
If you write 2 files in the same dir at the same time, their data blocks could become interleaved. To combat this, EXT2 did preallocation by changing the block bitmap to show that a few blocks beyond the current block were already used. This would be corrected after the write finished or, if the system crashed, by the FSCK at reboot.
With EXT3, that system doesn't work: we don't have a complete FSCK at reboot, we only have a check of the journal, which is not compatible with the preallocation.
pls read [4] but if you don't, here is the TLDR.
Reservation is a process done 100% in memory instead of on disk.
You have a reservation window which is 8 blocks long (unless specified via an IOCTL command)
Only the file that has opened the reservation window can write in it.
Here is the complete write process:
The availability of a Logical Volume Manager (LVM) motivated the desire for on-line resizing.
Here are the 3 parts of growing:
Done incrementally; rinse and repeat until the desired size is reached
released 10 October 2006 with Linux 2.6.19
Schematics:
To address this limit, in August 2006, we posted a series of patches introducing two key features to ext3: larger filesystem capacity and extents mapping. The patches unavoidably change the on-disk format and break forwards compatibility. In order to maintain the stability of ext3 for its massive user base, we decided to branch to ext4 from ext3
We want to give the large number of ext3 users the opportunity to easily upgrade their filesystem, as was done from ext2 to ext3.[9]
Main source for everything [10]
Under EXT3, the max size of the FS is 16TiB. To boost that, EXT4 brought a new "64-bit" mode (compared to "32-bit" before). This doesn't mean that every field that was 32 bits is now 64: it boosts the physical block number in extents from 32 to 48 bits, plus a bunch of fields in the BGDT with the suffix _lo (holding the low 32 bits) and _hi (holding the high bits). The theoretical limit of the FS is 64ZiB, but the real practical limit is 1EiB because of the 48-bit block pointers in extents. This is not a real problem, as EXT4 was understood from the start to be a "fix" to extend the life of EXT3.
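Those limits fall straight out of the pointer widths; a quick check (plain arithmetic, assuming 4KiB blocks):

```python
BLOCK = 4096
TiB, EiB, ZiB = 2**40, 2**60, 2**70

assert (2**32) * BLOCK == 16 * TiB   # 32-bit block numbers: 16 TiB (EXT3)
assert (2**48) * BLOCK == 1 * EiB    # 48-bit extent block numbers: 1 EiB practical
assert (2**64) * BLOCK == 64 * ZiB   # full 64-bit numbers: 64 ZiB theoretical
print("limits check out")
```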
See EXT4 I_Block; extents are self-explanatory. The goal is to mix pointer + length in order to make bigger contiguous chunks of data. It is also more resistant to corruption, as the extent tree contains checksums and logical block numbers, which allow one to see if an extent contains gibberish; that isn't possible under a big EXT3 partition with indirect pointers.
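The on-disk extent record is tiny: per the kernel's ext4 layout docs it is 12 bytes, with a 32-bit logical block, a 16-bit length, and a 48-bit physical start split into _hi/_lo halves. A sketch of packing/unpacking it (field layout from the ext4 documentation; the helper names are mine):

```python
import struct

def pack_extent(logical, length, physical):
    """On-disk ext4_extent: __le32 ee_block, __le16 ee_len,
    __le16 ee_start_hi, __le32 ee_start_lo (12 bytes total)."""
    hi, lo = physical >> 32, physical & 0xFFFFFFFF
    return struct.pack("<IHHI", logical, length, hi, lo)

def unpack_extent(raw):
    logical, length, hi, lo = struct.unpack("<IHHI", raw)
    return logical, length, (hi << 32) | lo

raw = pack_extent(0, 8, (1 << 40) + 5)   # 8 blocks starting past block 2^40
assert len(raw) == 12
assert unpack_extent(raw) == (0, 8, (1 << 40) + 5)
```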
EXT4 is backward compatible with EXT3, meaning that the EXT4 driver can mount an EXT3 partition (given that we provide it with flags to not use things like extents...). In fact, there is no EXT3 driver in the kernel anymore: by the time EXT4 was finished, the EXT4 driver with those flags had better perf than the EXT3 driver.
With EXT3 the default limit of subdirs for a dir is 32,000. In EXT4, HTrees are enabled by default, meaning that you can have around 10 million subdirs with a 2-level HTree, or around 2 billion if you enable the 'large_dir' option for 3-level HTrees [11]
Inside the BGDT are bg_itable_unused_lo and bg_itable_unused_hi, which tell us that there is no used inode after a certain offset; this, along with a bunch of other small changes, makes FSCK way faster.[12]
Before, any block allocation was a single call, meaning that if you want to allocate 100MiB of data you would need 25,600 calls! Multiblock allocation fixes that with a single call for multiple blocks, with added optimizations.
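The 25,600 figure assumes 4KiB blocks and 100MiB of data:

```python
BLOCK = 4 * 1024
data = 100 * 2**20            # 100 MiB
calls_old = data // BLOCK     # one allocator call per block
calls_mballoc = 1             # one request covering the whole range
print(calls_old, calls_mballoc)
```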
The allocator will speculatively allocate what you need + 8KiB to reduce fragmentation. [13]
Delayed allocation, rather, delays the allocation of the blocks while the file is kept in cache, until it is really going to be written to the disk. This gives the block allocator the opportunity to optimize the allocation in situations where the old system couldn't. Delayed allocation plays very nicely with the two previous features mentioned, extents and multiblock allocation, because in many workloads when the file is finally written to the disk it will be allocated in extents whose block allocation is done with the mballoc allocator. The performance is much better, and the fragmentation is much improved in some workloads. [14]
TLDR: Keep writes in cache as long as you can (Mem full or sync()) and optimize when you write using multiblock allocation, extent and more.
Persistent preallocation (PP) lets you preallocate data in the form of extents (see ee_len in EXT4 I_Block). This preallocation can be done with fallocate() and will read back as zeros even if it wasn't zeroed beforehand. The preallocation will normally be contiguous and is often used by software like P2P clients.
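From userspace this is just the fallocate()/posix_fallocate() call; a sketch of the behavior described above (portable POSIX call; on ext4 the reserved range is backed by unwritten extents):

```python
import os, tempfile

fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, 1 << 20)  # reserve 1 MiB up front
    size = os.fstat(fd).st_size         # the file is already 1 MiB long
    head = os.pread(fd, 16, 0)          # unwritten range reads back as zeros
finally:
    os.close(fd)
    os.unlink(path)

print(size, head)
```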
This helped a bit[15]
Not much to say: a pretty simple checksum applied to metadata for increased reliability.
Mke2fs will now create file systems with the metadata_csum and 64bit features enabled by default.[16]
detail on metadata checksum[17]
Added fields in the inode allow for nanosecond precision on times and also avoid the year 2038 problem by adding 408 years.
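The 408 years come from the 2 extra epoch bits stored in the inode's extra time field: each extra wrap of the 32-bit seconds counter is ~136 years, and 3 more wraps are available past the signed-32-bit limit. A quick check (assuming the 2-bit epoch extension described in the ext4 docs):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
wraps = 3                        # 2 epoch bits buy 3 extra wraps of 2^32 seconds
extra_years = wraps * 2**32 / SECONDS_PER_YEAR
print(round(extra_years))        # ~408, pushing the limit from 2038 to ~2446
```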
Space at the end of the inode to put extended attributes; it can be expanded at creation if needed. The idea is that if you can have the values you need without seeking on the disk again, you may get crazy gains. PS: EAs used to be handled by a pointer to a separate block, which is slow...
Barriers are enabled by default in EXT4. Their goal is to prevent the writing of data before the journal has fully written its record. That is necessary because of the large internal cache used by hard drives, which may rearrange the order of writes. This comes at a small perf cost.
Allows the bitmaps + inode tables of multiple block groups to be grouped right after the BGDT; this allows for more contiguous files, not broken up by the bitmaps. Superblock and BGDT copies will still be there.[18]
The BGDT can be at most 1 block group large. Normal block groups are 128MiB, meaning max 2^27/64B = 2^21 block groups possible inside the FS, so max FS size = 2^21 × 128MiB = 256TiB. To fix this, we cluster 64 BGD entries (to fill a 4KiB block) and put a copy of that block along with the superblock into the 1st, 2nd and last block group of each meta block group. This new "meta" block group controls 8GiB of disk, and we can just split our whole disk into those. Because of the 2^32 block group descriptor limit, we can now store up to 512PiB. Note that the real limit for the BGDT was the block size, so once that was taken out, we can go up to the limit of possible block group descriptors[18]
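The meta_bg arithmetic checks out (assuming 4KiB blocks and 64B descriptors):

```python
BLOCK = 4096
DESC = 64
group_bytes = (BLOCK * 8) * BLOCK          # 32,768 blocks/group = 128 MiB

# classic layout: the whole BGDT must fit inside one block group
max_groups = group_bytes // DESC           # 2^21 groups
assert max_groups * group_bytes == 256 * 2**40   # 256 TiB

# meta_bg: 64 descriptors fill exactly one 4 KiB block...
assert DESC * 64 == BLOCK
assert 64 * group_bytes == 8 * 2**30       # ...so each meta group covers 8 GiB
# ...and the limit becomes the 2^32 descriptor cap
assert (2**32) * group_bytes == 512 * 2**50      # 512 PiB
print("meta_bg math checks out")
```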
Using the project ID (i_projid) field in the inode, we can apply a quota to multiple files/dirs and all their subdirs. This isn't strictly hierarchical, as it can be applied on a per-file/dir basis.
-on a per dir basis??? no clue how that works.. for now, don't know don't care.[19]
-Don't know don't care [20]
EXT2-3-4 are all the same FS:
The way I've always treated it, and it's the way I believe most of the ext234
developers have treated it is, that what users call ext2, ext3, and ext4 are
different _implementations_ of the same _file_ _system_ [format]. That is
to say, ext4 simply happens to be a fuller, more complete implementation of
the same file system as ext2 and ext3. Ext2 doesn't support certain
features such as journaling and directory indexing; ext3 doesn't support some
advanced features such as delayed allocation and extents, and requires that
the journal always be present. Ext4 is a superset of ext2 plus ext3 plus
delayed allocation, extents, a multi-block allocator, and a few other new
features. But they are all the same file system.[21]
20% perf gain in some tests with new JBD2 (journal) system optimizations. [22]
Limited support for atomic write operations has been added to ext4, XFS and parts of md (RAID). It was first worked on in 6.11 for SCSI and NVMe storage. [23]
- https://www.nongnu.org/ext2-doc/ext2.html#def-blocks
- namei.c - fs/ext3/namei.c - Linux source code v2.5.42 - Bootlin Elixir Cross Referencer
- ext4 Data Structures and Algorithms — The Linux Kernel documentation
- raw.githubusercontent.com/tytso/e2fsprogs/master/doc/RelNotes/v1.43.0.txt
- linux/Documentation/filesystems/ext4/mmp.rst at master · torvalds/linux · GitHub
- Re: Introducing Next3 - built-in snapshots support for Ext3 [LWN.net]
- EXT4 Has A Very Nice Performance Optimization For Linux 6.11 - Phoronix
- Linux 6.13 To Expand Atomic Write Support To EXT4 XFS - Phoronix