This is my attempt to go through the kernel and explain as much as possible of its inner workings. It is (for now) heavily inspired by Understanding the Linux Kernel, 3rd Edition, but I will still keep my content up to date.
Keep in mind that the book was written when the best consumer hardware you could get your hands on was a single-core CPU with brand new 64 bits (;
We will be focusing on X86_64 for simplicity.
Linux is a True Unix Kernel.
Linux joins such well-known commercial Unix operating systems as System V Release 4 (SVR4) (pronounced "System Five"); the CSRG BSD releases from the University of California at Berkeley; Digital UNIX from Digital Equipment Corporation (now Hewlett-Packard); AIX from IBM; HP-UX from Hewlett-Packard; Solaris from Sun Microsystems (now Oracle); and Mac OS from Apple Computer, Inc. Besides Linux, a few other open source Unix-like kernels exist, such as FreeBSD, NetBSD, and OpenBSD.
Also, most commercial Unix kernels were bundled as complete operating systems, unlike Linux.
Commercial Unix kernels often introduce new features to gain a larger slice of the market, but these features are not necessarily useful, stable, or productive. As a matter of fact, modern Unix kernels tend to be quite bloated. By contrast, Linux, together with the other open source operating systems, tends to avoid this kind of bloat.
Back in the day there was a thriving ecosystem of Unix kernels and OSes, but they have mostly been rendered obsolete by the existence of Linux.
Even with their differences, all Unix systems tend to respect IEEE's Portable Operating System Interface standard 1003.1 (POSIX), which makes software written for one Unix easily portable to any other.
Those standards specify only an application programming interface (API) and do not restrict the implementation.
The kernel has 2 jobs:
In contrast to OSes like MS-DOS that let you play directly with the hardware, Unix kernels (including Linux) abstract away the low-level details of the hardware. When a program needs resources from the system, it has to ask the kernel. The kernel evaluates the request and, if it chooses to grant the resource, interacts with the proper hardware components on behalf of the user program.
To enforce this, the CPU has a feature called execution modes (protection rings). The kernel runs in Kernel mode (Ring 0) while any other user program will run in User mode (Ring 3). Under this system, a process in User mode cannot access kernel memory while a process in Kernel mode can access everything.
PS: There are also a Ring 1 and a Ring 2 inside the CPU, which are essentially never used and were almost removed by Intel under their failed X86S initiative [1]
Some will even say that there is a Ring -1 (the hypervisor), a Ring -2 (x86's System Management Mode) and a Ring -3 (the firmware itself)
For 99% of cases, only User mode and Kernel mode matter.
Linux is also a MultiUser system, which means that it needs to handle multiple user programs concurrently without the programs themselves caring about or knowing of the others.
A MultiUser system also needs to have:
Again, User mode and Kernel mode come into play to make this possible.
PS: One example of a multiprocessing OS that is not Multiuser would be Windows 98.
Users and groups are used for file permissions inside the system. Identified by a User ID (UID) and a Group ID (GID), a process can only interact with files whose permissions allow access to its user/group.
Unix-like systems also have a Root or SuperUser, a user that is exempt from the permission checks and can therefore access almost any resource on the system.
A process can be defined either as “an instance of a program in execution” or as the “execution context” of a running program. At its core, it is nothing more than a handful of register values and a list of instructions to be run from memory.
It is the job of the Kernel to split the CPU between the different processes. To do that, the Kernel uses a Scheduler that decides when to preempt (remove from the CPU) a running process in favor of another. This is mostly done by giving a process a time slice and removing it when that time runs out.
Unix-like systems subscribe to the process/kernel model. Each process runs as if it were the only process on the machine and had exclusive access to the kernel's services. The process accesses those services using a system call (a request to the kernel). As a result of a system call, the hardware switches into Kernel mode and runs the related kernel routine. Once the request is fully executed, the hardware switches back into User mode and the result is returned. GOOD TIME FOR A VENDING MACHINE AND CANADA DRY ANALOGY. WINK WINK
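To make this concrete, here is a minimal user-space sketch of a program asking the kernel for services. Both calls below trap into Kernel mode, run a kernel routine, and come back to User mode with a result; the raw syscall() form is only there to show that the libc wrappers are thin.

```c
/* A minimal sketch of a user program asking the kernel for services. */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>      /* write(), getpid(), syscall() */
#include <sys/syscall.h> /* SYS_getpid for the raw form */

int main(void)
{
    /* Libc wrapper: under the hood this traps into the write syscall. */
    write(1, "hello from user mode\n", 21);

    /* The same idea without the wrapper: invoke the syscall number directly. */
    long pid = syscall(SYS_getpid);
    printf("the kernel says my PID is %ld\n", pid);
    return 0;
}
```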
Time for us to talk about the age old fight: Monolithic Kernel VS Microkernel
Let's define both:
Microkernel:
Monolithic Kernel:
Linux, FreeBSD, OpenBSD, and NetBSD are all Monolithic Kernels.
GNU Hurd is a Microkernel.
Mac OS uses a Hybrid Kernel (in between).
There once was a time when it was theorized that Microkernels were superior: they were smaller, easier to maintain, and less prone to bugs.
One fact broke this belief: switching between User and Kernel Mode adds overhead to a simple call. Let's take a read call: under a Monolithic Kernel you only switch twice, once at the start of the syscall and once at the end. In a Microkernel, even if we assume everything is perfectly executed, this is already bumped up to 4 switches, since the request also has to travel to the user-space server handling it and back.
This is a big reason why Mac OS, which started out with a pure Microkernel (Mach), ended up merging some features back into the kernel to reduce the overhead.
The module feature also gives Linux most of the benefits of the Microkernel without the drawbacks. It is possible in Linux to load any part of the kernel as a module when needed (using object files). Because the code is linked at runtime, it doesn't incur the same performance overhead (except for the initial loading, of course).
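As a sketch of what that looks like, here is the classic minimal module (the name hello and the messages are arbitrary). It assumes the usual out-of-tree module build setup (a Makefile with obj-m pointing at the kernel headers); once built, the object is linked into the running kernel with insmod and removed with rmmod.

```c
/* hello.c - a minimal sketch of a loadable kernel module. */
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

static int __init hello_init(void)
{
    pr_info("hello: linked into the running kernel\n");
    return 0; /* 0 means the module loaded successfully */
}

static void __exit hello_exit(void)
{
    pr_info("hello: unlinked from the running kernel\n");
}

module_init(hello_init);
module_exit(hello_exit);

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Hello world module sketch");
```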
The kernel offers syscalls to access files but does not seek to interpret them.
From the user's point of view, files are organized in a tree structure, with each node being a directory and the leaves being the files.
Each Unix process has a Current Working Directory, which tells you the directory the process is currently using. To access a file, a process needs a Pathname, constructed from the nodes of the tree structure separated by slashes.
A Pathname starting with a "/" is deemed absolute, as it ignores the Current Working Directory and starts at the root of our tree.
A Pathname that doesn't start with a "/" is deemed relative, as it starts at the Current Working Directory.
"." and ".." can also be used:
"." means the Current Working Directory
".." the directory before the Current Working Directory
".." can also be chained to go higher and higher in our tree, ".." at the root level will leave you at the root level
There are 7 file types:
Unix makes a distinction between the content of a file and the information about the file.
When you access a file, you will only be accessing its content.
To access the information about the file, you will need to access the inode of the file.
There is 1 inode per file and the inode will contain:
There are three types of access rights, read, write, and execute, for each of three classes (owner, group, other/everyone). In reality, the set of access rights associated with a file consists of nine different binary flags. Three additional flags, called suid (Set User ID), sgid (Set Group ID), and sticky, complete the file mode. These flags have the following meanings when applied to executable files:
suid
A process executing a file normally keeps the User ID (UID) of the process owner. However, if the executable file has the suid flag set, the process gets the UID of the file owner.
sgid
FOR FILES: A process executing a file normally keeps the group ID of the process. However, if the executable file has the sgid flag set, the process gets the group ID of the file.
FOR DIRECTORIES: files created under the directory get the directory's group as their default group.
sticky
FOR UNIX: An executable file with the sticky flag set corresponds to a request to the kernel to keep the program in memory after its execution terminates.
FOR LINUX: this isn't used on files anymore. It can be used on a directory to restrict file deletion so that only a file's owner (not the group) can delete it.
When a file is created by a process, it takes the UID and GID of the process, unless the sgid flag of the directory is set, in which case it takes the directory's GID.
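To see these flags in practice, here is a small sketch that reads a file's mode with the stat() syscall and picks apart the nine rwx bits plus suid/sgid/sticky. The default path /usr/bin/passwd is just a convenient example of a file that is usually suid.

```c
/* Read a file's mode with stat() and inspect its permission bits. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;
    const char *path = argc > 1 ? argv[1] : "/usr/bin/passwd";

    if (stat(path, &st) != 0) {
        perror("stat");
        return 1;
    }

    /* The low 12 bits: 9 rwx flags + suid/sgid/sticky. */
    printf("%s: mode %04o\n", path, (unsigned)(st.st_mode & 07777));
    printf("owner rwx: %c%c%c\n",
           st.st_mode & S_IRUSR ? 'r' : '-',
           st.st_mode & S_IWUSR ? 'w' : '-',
           st.st_mode & S_IXUSR ? 'x' : '-');
    if (st.st_mode & S_ISUID) printf("suid is set\n");
    if (st.st_mode & S_ISGID) printf("sgid is set\n");
    if (st.st_mode & S_ISVTX) printf("sticky is set\n");
    return 0;
}
```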
When a process accesses the contents of either a regular file or a directory, it is accessing some data stored in a hardware block device. Because a process in User Mode cannot directly interact with the low-level hardware components, each actual file operation must be performed in Kernel Mode. Therefore, the Unix operating system defines several system calls for file handling.
Here are these syscalls:
fd = open(path, flag, mode)
path
Path of the file to be opened.
flag
Specifies how the file must be opened (e.g., read, write, read/write, append). It can also specify whether a nonexistent file should be created.
mode
Specifies the access rights of a newly created file.
Opening a file returns a File Descriptor (fd). For each open file, the kernel keeps an offset field that denotes the current position in the file from which the next operation will take place (the so-called file pointer), along with the flag parameter passed to open(). File descriptors are arranged in a table (the File Descriptor Table), which is unique per process.
PS: By default, a process has 3 File Descriptors:
0 Standard input
1 Standard output
2 Standard error
The file descriptor will point to an open file object (struct file), which itself points to an inode [2]
One open file object can be shared by multiple file descriptors (even in different processes, for instance after a fork()), but each file descriptor belongs to a single process's table.
Regular Unix files can be addressed either sequentially or randomly, while device files and named pipes are usually accessed sequentially.
By default, sequential access is assumed: the read() and write() system calls always refer to the position of the current file pointer. To modify the value, a program must explicitly invoke the lseek() system call.
newoffset = lseek(fd, offset, whence);
fd
Indicates the file descriptor of the opened file
offset
Specifies a signed integer value that will be used for computing the new position of the file pointer
whence
Specifies whether the new position should be computed by adding the offset value to the start of the file (SEEK_SET), to the end of the file (SEEK_END), or to the current file pointer (SEEK_CUR)
nread = read(fd, buf, count);
fd
Indicates the file descriptor of the opened file
buf
Specifies the address of the buffer in the process’s address space to which the data will be transferred
count
Denotes the number of bytes to read
nread
Returns the number of bytes successfully read.
Write is used in the same way.
res = close(fd);
Returns 0 on success and -1 on error
When a process is terminated, all its files are automatically closed.
res = rename(oldpath, newpath);
res = unlink(pathname);
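Putting the calls above together, here is a minimal sketch that creates a file, writes to it, rewinds with lseek(), reads the data back, and removes the file with unlink(). The file name demo.txt is arbitrary.

```c
/* Sketch of the file syscalls above: open, write, lseek, read, close, unlink. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    char buf[64];
    const char *msg = "hello file\n";

    /* O_CREAT asks for the file to be created if it does not exist;
     * 0644 are the access rights of the newly created file. */
    int fd = open("demo.txt", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    write(fd, msg, strlen(msg));            /* advances the file pointer        */
    lseek(fd, 0, SEEK_SET);                 /* move the file pointer back to 0  */

    ssize_t nread = read(fd, buf, sizeof(buf) - 1);
    if (nread < 0) { perror("read"); return 1; }
    buf[nread] = '\0';
    printf("read back %zd bytes: %s", nread, buf);

    close(fd);
    unlink("demo.txt");                     /* remove the file again            */
    return 0;
}
```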
We know that our CPU can either run in User Mode or Kernel Mode, and that when a process requests a kernel service (with a syscall) it will switch to Kernel Mode temporarily to fulfil the request.
Let's keep in mind that a Process is a dynamic entity that has a limited life span.
It is the Kernel's job to Create, Eliminate, and Synchronize processes using kernel routines.
The kernel is NOT a process but it IS a process manager.
Besides user processes, Unix systems also include a few privileged processes called Kernel Threads with the following characteristics:
Let's talk about processes again.
Syscalls are not the only way to reach kernel mode while running a Process:
To be able to manage a process, the kernel will need some information about the current state of the process.
That information is stored in a Process Descriptor (struct task_struct).
When the kernel stops the execution of a process, it saves the contents of several processor registers in the Process Descriptor; here are some examples:
When the kernel wants to resume the execution of the process, it loads all those registers back onto the processor.
A process not currently being executed will be waiting, most likely in a queue that will wake it up at a specific event.
Each process runs in its private address space. A process running in User Mode refers to private stack, data, and code areas. When running in Kernel Mode, the process addresses the kernel data and code areas and uses another private stack.
While it appears to each process that it has access to a private address space, there are times when part of the address space is shared among processes. In some cases, this sharing is explicitly requested by processes; in others, it is done automatically by the kernel to reduce memory usage.
If the same program, say an editor, is needed simultaneously by several users, the program is loaded into memory only once, and its instructions can be shared by all of the users who need it. Its data, of course, must not be shared, because each user will have separate data. This kind of shared address space is done automatically by the kernel to save memory.
Processes also can share parts of their address space as a kind of interprocess communication, using the “shared memory” technique introduced in System V and supported by Linux.
Finally, Linux supports the mmap() system call, which allows part of a file or the information stored on a block device to be mapped into a part of a process address space. Memory mapping can provide an alternative to normal reads and writes for transferring data. If the same file is shared by several processes, its memory mapping is included in the address space of each of the processes that share it.
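Here is a minimal sketch of mmap() in action: it maps /etc/hostname (an arbitrary, assumed non-empty example file) into the process address space and prints it without ever calling read().

```c
/* Sketch of mmap(): map a file into the address space and read it like memory. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);                         /* we need the file size to map it */

    /* The kernel inserts the file into our address space; no read() needed. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    fwrite(data, 1, st.st_size, stdout);    /* file content, accessed as memory */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```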
All Unix kernels are reentrant. This means that several processes may be executing in Kernel Mode at the same time. Of course, on uniprocessor systems, only one process can progress, but many can be blocked in Kernel Mode when waiting for the CPU or the completion of some I/O operation. For instance, after issuing a read to a disk on behalf of a process, the kernel lets the disk controller handle it and resumes executing other processes. An interrupt notifies the kernel when the device has satisfied the read, so the former process can resume the execution.
One way to provide reentrancy is to write functions so that they modify only local variables and do not alter global data structures. Such functions are called reentrant functions . But a reentrant kernel is not limited only to such reentrant functions. Instead, the kernel can include nonreentrant functions and use locking mechanisms to ensure that only one process can execute a nonreentrant function at a time.
If a hardware interrupt occurs, a reentrant kernel is able to suspend the current running process even if that process is in Kernel Mode. This capability is very important, because it improves the throughput of the device controllers that issue interrupts. Once a device has issued an interrupt, it waits until the CPU acknowledges it. If the kernel is able to answer quickly, the device controller will be able to perform other tasks while the CPU handles the interrupt.
Now let’s look at kernel reentrancy and its impact on the organization of the kernel.
First we need a definition.
A Kernel Control Path denotes the sequence of instructions (some code) executed by the kernel to handle a system call, an exception, or an interrupt.
One example of a Kernel Control Path would be an interrupt handler.
A Kernel Control Path can be interrupted or interleaved with others under some circumstances (a hardware interrupt or a process switch, for instance).
Implementing a reentrant kernel requires the use of synchronization.
For example, say a Kernel Control Path needs to read and then write a global variable in order to increment it. If it is preempted by another path in the middle of that read-write sequence, the global variable can end up corrupted.
This issue is called a Race Condition.
Most of the time, the solution to this is using Atomic Operations (which are single, non-interruptible operations).
If the data structure cannot be accessed in a single operation (for example a linked list), a simple Atomic Operation isn't enough. These sections of code are called Critical Regions.
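Since it is easier to run outside the kernel, here is a user-space sketch of the exact same race using two threads: a plain increment (read, add, write) loses updates when the threads interleave, while the atomic version does not.

```c
/* Two threads each increment a shared counter 1,000,000 times.
 * The plain int loses updates because "read, add, write" can be interleaved;
 * the atomic counter is bumped with a single non-interruptible operation. */
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

static int plain_counter = 0;           /* racy */
static atomic_int atomic_counter = 0;   /* each increment is one atomic op */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        plain_counter++;                       /* read + add + write: can interleave */
        atomic_fetch_add(&atomic_counter, 1);  /* single atomic operation            */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    printf("plain:  %d (usually less than 2000000)\n", plain_counter);
    printf("atomic: %d (always 2000000)\n", atomic_load(&atomic_counter));
    return 0;
}
```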
This can also apply to processes sharing data, but we will be focusing on how to synchronize Kernel Control Paths.
One solution would be to remove the possibility of being preempted in Kernel Mode.
A Kernel Control Path could even relinquish control voluntarily, but it would need to leave all data structures in a consistent state (not broken), and it would need to recheck any previously accessed value that could have changed.
If you wanted to implement this with more granularity, you could disable kernel preemption only in critical regions and reenable it when you leave them.
This would work on a single-core system but fails as soon as you add more cores: a Kernel Control Path running on another core could still interfere.
Another way would be to disable hardware interrupts while in a critical region, but this has the same issues as the previous solution on multicore systems.
Semaphores are effective on single core and multicore machines:
A semaphore is essentially a counter paired with a list of waiting processes and two atomic methods, down() and up():
down() decreases the counter by 1; if the counter becomes negative, the caller adds itself to the list of waiting processes and sleeps.
up() increases the counter by 1 and then wakes up a process from the waiting list (if one exists).
Each data structure in a critical region would need to be protected by a Semaphore.
For simplicity, a Mutex is like a Semaphore but:
In the kernel, Semaphores and Mutexes are called Sleeping locks[3]
Semaphores, while useful, can be slow: if your goal is simply to keep another core from using a resource that you will hold for a very short time, the upkeep of the waiting list can be a limitation.
The solution is a Spin Lock. It works like a semaphore but without the waiting list, which forces a process waiting on a locked Spin Lock to.... spin.
Note that Spin Locks are useless on a single-core machine, as spinning there would just be an infinite loop.
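As a rough kernel-code sketch (the names counter_lock, item_mutex, and struct item are purely illustrative, and this is not a complete module), here is how the two kinds of locks are typically used: a spin lock for a very short critical region, and a mutex (a sleeping lock) when the holder may block for longer.

```c
/* Kernel-code sketch: protecting a shared counter with a spinlock and a
 * shared list with a mutex (sleeping lock). */
#include <linux/spinlock.h>
#include <linux/mutex.h>
#include <linux/list.h>

static DEFINE_SPINLOCK(counter_lock);
static unsigned long counter;

static DEFINE_MUTEX(item_mutex);
static LIST_HEAD(item_list);

struct item {
    struct list_head node;
    int value;
};

void counter_bump(void)
{
    spin_lock(&counter_lock);   /* other CPUs trying to enter will spin here */
    counter++;
    spin_unlock(&counter_lock); /* keep this critical region as short as possible */
}

void item_add(struct item *it)
{
    mutex_lock(&item_mutex);    /* may put the caller to sleep until it is free */
    list_add_tail(&it->node, &item_list);
    mutex_unlock(&item_mutex);
}
```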
Deadlocks occur when 2 Kernel Control Paths each lock a resource needed by the other.
For example:
Kernel Control Path 1 has locked resource A and needs resource B,
while
Kernel Control Path 2 has locked resource B and needs resource A.
This blocks both Kernel Control Paths and prevents either from going further.
The solution to this is better design: the Linux kernel requests resources in a consistent order to sidestep the issue.
When we talk about processes here, think User Mode.
Unix signals provide a mechanism for notifying processes of system events. Each event has its own signal number, which is usually referred to by a symbolic constant such as SIGTERM. There are two kinds of system events:
Asynchronous notifications (Triggered from the outside)
For instance, a user can send the interrupt signal SIGINT to a foreground process by pressing the interrupt keycode (usually Ctrl-C) at the terminal.
Synchronous notifications (Triggered from the inside)
For instance, the kernel sends the signal SIGSEGV to a process when it accesses a memory location at an invalid address (a memory access violation).
Linux defines about 31 different signals, 2 of which are user-definable and may be used as a primitive mechanism for communication and synchronization among processes in User Mode. [4]
In general, a process may react to a signal delivery in two possible ways:
If the process does not specify one of these alternatives, the kernel performs a default action that depends on the signal number. The five possible default actions are:
Kernel signal handling is rather elaborate, because the POSIX semantics allow processes to temporarily block signals. Moreover, the SIGKILL and SIGSTOP signals cannot be directly handled by the process or ignored.
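Here is a small sketch of the two reactions described above: the program installs a handler for SIGINT (the Ctrl-C example) with sigaction() and explicitly ignores SIGTERM. Trying the same with SIGKILL would simply have no effect.

```c
/* Sketch: catch SIGINT with a handler, ignore SIGTERM. */
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint = 0;

static void on_sigint(int signo)
{
    (void)signo;
    got_sigint = 1;              /* only async-signal-safe work inside a handler */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);

    sa.sa_handler = on_sigint;
    sigaction(SIGINT, &sa, NULL);    /* react by catching the signal */

    sa.sa_handler = SIG_IGN;
    sigaction(SIGTERM, &sa, NULL);   /* react by ignoring this one   */

    printf("press Ctrl-C to deliver SIGINT...\n");
    while (!got_sigint)
        pause();                     /* sleep until a signal is delivered */

    printf("caught SIGINT, exiting cleanly\n");
    return 0;
}
```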
Since it was introduced by AT&T's Unix System V, Unix kernels have provided IPC (InterProcess Communication).
It is composed of:
shmget( )
msgget( )
semget( )
You must use syscalls to acquire them, and IPC resources are persistent: they must be deallocated by their owner (or the superuser) to be removed.
Shared memory provides the fastest way for processes to exchange and share data. A process starts by issuing a shmget() system call to create a new shared memory region of a specific size. After obtaining the IPC resource identifier, the process invokes the shmat() system call, which returns the starting address of the new region within the process address space. When the process wants to detach the shared memory from its address space, it invokes the shmdt() system call.
The implementation of shared memory depends on how the kernel implements process address spaces.
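Here is a minimal single-process sketch of those three calls (a real use would have a second process attach the same segment). IPC_PRIVATE is used only to keep the example self-contained, and the final shmctl(IPC_RMID) is there because, as noted above, the resource is persistent and would otherwise outlive the process.

```c
/* Sketch of System V shared memory: create, attach, use, detach, remove. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    /* Create a private 4 KB segment. */
    int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (shmid < 0) { perror("shmget"); return 1; }

    /* Attach: the kernel returns the start address of the new region
     * inside our address space. */
    char *mem = shmat(shmid, NULL, 0);
    if (mem == (char *)-1) { perror("shmat"); return 1; }

    strcpy(mem, "data visible to anyone who attaches this segment");
    printf("%s\n", mem);

    shmdt(mem);                      /* detach from our address space      */
    shmctl(shmid, IPC_RMID, NULL);   /* deallocate the persistent resource */
    return 0;
}
```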
Message queues allow processes to exchange messages by using the msgsnd() and msgrcv() system calls, which insert a message into a specific message queue and extract a message from it, respectively (not to be confused with POSIX message queues, defined by IEEE Std 1003.1-2001).
And Semaphores are the ones you already know, but made for User Mode.
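For completeness, here is a single-process sketch of the message queue calls mentioned above; struct message is just the conventional "type followed by data" layout that an application defines itself.

```c
/* Sketch of a System V message queue: create it, send one message,
 * receive it back, then remove the (persistent) queue. */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct message {            /* conventional layout: a type followed by data */
    long mtype;
    char mtext[64];
};

int main(void)
{
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid < 0) { perror("msgget"); return 1; }

    struct message out = { .mtype = 1 };
    strcpy(out.mtext, "hello through the kernel");
    msgsnd(qid, &out, sizeof(out.mtext), 0);     /* insert into the queue     */

    struct message in;
    msgrcv(qid, &in, sizeof(in.mtext), 1, 0);    /* extract a type-1 message  */
    printf("received: %s\n", in.mtext);

    msgctl(qid, IPC_RMID, NULL);                 /* deallocate the queue      */
    return 0;
}
```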
Let's start by talking about 3 syscalls:
exec( )
Will run a given program in the process, in a brand new address space (replacing the old one).
_exit( )
Will terminate the process.[5]
fork( )
Will create a new child process. The process that ran fork() will be the parent; they can find each other through a data structure that gives the parent access to all of its children and the child access to its parent.
The child created will have a "copy" of the data and the code from the parent process (this is done with the hardware paging unit, which does Copy on Write and defers page duplication for as long as possible).
If a child calls _exit(), it will send the parent process a SIGCHLD signal (ignored by default).
When a process gets terminated, it becomes a Zombie until its parent releases it.
The way a parent process does that is with wait3() or wait4(), which wait until a child terminates, extract data about the zombie process, and release its memory.
What if the parent gets terminated? What happens to the child?
The child process keeps on running, and the terminated parent (now a zombie) still has to be released with wait4(). That job falls to PID 1, the init process (systemd on most modern systems): init issues frequent wait4() calls to take care of any zombie process directly under it, and the zombie's children are re-parented under init's control.
This is how a process started by a shell can keep running even after the shell gets terminated.
Take note that wait4() is considered nonstandard in Linux; it is recommended to use waitpid() or waitid(), which do the same thing but require you to specify an identification for the process.
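Here is a small sketch tying fork(), _exit(), and waitpid() together: the parent creates a child, the child terminates with an exit code, and the parent reaps it so no zombie is left behind.

```c
/* Sketch of the process-creation calls above. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t child = fork();            /* both parent and child return from here */

    if (child == 0) {
        /* Child: a copy-on-write copy of the parent's data and code. */
        printf("child  %d (parent %d) exiting\n", getpid(), getppid());
        _exit(42);                   /* sends SIGCHLD to the parent */
    }

    int status = 0;
    waitpid(child, &status, 0);      /* reap the zombie and read its exit code */
    if (WIFEXITED(status))
        printf("parent %d reaped child %d, exit code %d\n",
               getpid(), child, WEXITSTATUS(status));
    return 0;
}
```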
Processes can also be part of groups; groups are used to send signals to multiple processes at the same time.
To identify them, the Process Descriptor contains a Process Group ID (PGID), which is equal to the Process ID of the first process of the group, called the Process Group Leader.[6]
A shell that supports process groups, like Bash, will put all the processes of this command under the same group:
ls | sort | more
When opening a Unix system, you will be prompted to log into your account. The system will then execute a Shell (Like Bash) for you to interact with the system.
That Shell will be a Session Leader and will have a Session ID equal to its Process ID.
Multiple groups can be part of one session, but only one of them can be in the foreground and output to the shell.
The other process groups will be in the background, and if they attempt to read from or write to the terminal, they will receive a signal (SIGTTIN or SIGTTOU) to let them know.
A New process will inherit the Process Group ID and the Session ID of its parent.
All recent Unix systems provide a useful abstraction called Virtual Memory . Virtual memory acts as a logical layer between the application memory requests and the hardware Memory Management Unit (MMU). Virtual memory has many purposes and advantages:
The way to achieve this is to have a Virtual Address Space distinct from the Physical Memory Addresses. The CPU is equipped to handle Virtual Memory and will translate each virtual address to its physical equivalent.
Memory on x86_64 CPUs is handled in Pages, which are 4KB long (by default); the kernel keeps a Page Table that specifies where the CPU should look for a given Page.
This makes Memory handling easier as you don't need Physical Memory to be contiguous for Virtual Memory to be.
The kernel has to do a few things with RAM:
Along with these tasks, a kernel should:
To achieve this goal, Linux uses 2 systems:
The buddy system divides memory into chunks and grants the smallest power-of-2 block that can fulfill the request.
Here is how it works:
Let's say that we have 128KB of memory and we need to store 25KB of data.
The kernel will divide the 128KB into 2 "Buddies" of 64KB and repeat this operation until it cannot go smaller while still fulfilling the request.
This will result in a 32KB block being allocated.
When the 32KB block is released by the process, the kernel will merge as many unused buddies as possible back together.
For the slab allocator, the kernel takes 1 or more pages of contiguous memory and creates a "slab"; the inside of the slab is divided into chunks of a single pre-allocated size.
For example:
Let's take a Slab 3 pages long (12KB) storing 2KB chunks.
This would allow us to store 6 chunks.
Slabs are used for small objects.
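As a rough kernel-code sketch (struct foo and the cache name are purely illustrative), this is how the slab allocator is typically used through the kmem_cache API: one cache hands out objects of a single fixed size, carved out of slabs of contiguous pages.

```c
/* Kernel-code sketch of the slab allocator: a cache of fixed-size objects. */
#include <linux/slab.h>
#include <linux/errno.h>

struct foo {
    int id;
    char name[32];
};

static struct kmem_cache *foo_cache;

int foo_cache_setup(void)
{
    /* Every object handed out by this cache is sizeof(struct foo) bytes. */
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo), 0, 0, NULL);
    return foo_cache ? 0 : -ENOMEM;
}

struct foo *foo_alloc(void)
{
    return kmem_cache_alloc(foo_cache, GFP_KERNEL); /* grab one slot from a slab */
}

void foo_free(struct foo *f)
{
    kmem_cache_free(foo_cache, f);                  /* return the slot to its slab */
}
```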
The kernel stores a process's virtual address space in a Memory Descriptor (struct mm_struct), which holds a list of memory region descriptors.
This will contain pointers to:
Now, if that seems like a lot, keep in mind that Unix kernels use Demand Paging.
Demand paging is the idea that we are not going to put data into memory if it wasn't requested. In practice, your program will run into non-present pages; that triggers a page fault exception, which gets the kernel to find a free page and initialize it with the required data.
The most limiting factor on your computer is most likely your hard drive. Even with an SSD, it is still at least 100x slower than RAM.
To improve on this limitation, the kernel will delay disk access as much as possible by creating caches.
A read might stay in memory long after the process that requested it has terminated, and a write might be kept in RAM for a long time before reaching the disk.
This allows the kernel to check the faster cache instead of the slower hard drive; when it finds what it needs there, it is called a Cache Hit.
The kernel can be forced to write out all the outstanding writes by issuing a sync() syscall. This will synchronize all the dirty buffers (buffers containing data not yet written) to the disk.
The system will write buffers to storage periodically to avoid data loss.
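Here is a small sketch of forcing writes out of the cache: fsync() flushes the dirty buffers of one file, while sync() asks for every dirty buffer in the system to be written. The file name log.txt is arbitrary.

```c
/* Sketch of flushing dirty buffers to disk. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    int fd = open("log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char *line = "important record\n";
    write(fd, line, strlen(line));   /* may sit in the page cache for a while */

    fsync(fd);                       /* flush this file's dirty buffers now   */
    close(fd);

    sync();                          /* flush every dirty buffer system-wide  */
    return 0;
}
```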
The kernel interacts with I/O devices using Device Drivers. A Device Driver allows a few things:
The user will have access to the device either by using syscalls or by interacting with the Device File (normally in /dev, like /dev/sda for the first drive).