
October 19, 2022

Introduction to Syscalls

 
A Linux system conveniently hides a lot of things from the user, including the standard libraries and system calls.
 
Have you ever wondered what happens when the command echo "hello" is entered in a terminal? Let’s try it:

$ echo "hello"
hello

The string hello is displayed! 🎉🎉

But wait, how does that happen?

In this introduction post, we explain how an application communicates with the operating system.

Note: This post is aimed at programmers with an interest in low-level details and does not assume any prior experience. Some topics verge on intermediate territory, but nothing here is what I would consider advanced. Everything was written using Debian bullseye on x86-64, but the concepts are similar for other OSes and CPUs.

Displaying Text

The echo command comes with the OS and is part of GNU Coreutils. As the type command shows, it is also built into the shell, but a standalone binary can still be found at /bin/echo:

$ type echo
echo is a shell builtin
$ which echo
/bin/echo

The echo command's source code can be found in src/echo.c. Looking at the code, we can see that basically, echo does the following:

while (argc > 0)
{
  fputs (argv[0], stdout);
  argc--;
  argv++;
  if (argc > 0)
    putchar (' ');
}

which sends each argument to stdout, one by one, separated by a single space.

But where do the fputs and putchar functions come from? They aren't declared anywhere in the project!

Standard Libraries

The functions fputs and putchar come from the C standard library, and are declared in stdio.h.

What's a standard library? Well, usually, a language has a formal documentation that specifies the language's syntax, semantics, type system, and the base functionalities that come with it. That last part is the standard library (stdlib).

Most languages, except perhaps assembly, have a stdlib that defines the functions a programmer can use without the need for third-party packages. Language standards are often versioned (e.g. C17, Ruby 2.7), so projects are compiled (or interpreted) against a specific version of the language.

Their implementations may differ from system to system, and some languages even have multiple implementations for the same system. The C stdlib, for example, has glibc, which favors performance, and musl, which favors size (used in the Alpine Docker image).

With that in mind, we can understand that fputs is offered by the language itself. Its function signature is thus part of the C stdlib:

/* Writes a character string to a file stream.  */
extern int fputs (const char *str, FILE *stream);

But wait a minute, the code earlier provided stdout as the second argument, which isn't a file, is it?

It is, actually!

extern FILE *stdout;    /* Standard output stream.  */

In Unix, everything is a file -- normal files, folders, links, pipes, network sockets, keyboards, printers… Files are an abstraction layer for I/O streams, and it makes sense when you think about it, as files are just places where bytes can be read from and written to.

When a file is opened, a file descriptor (fd) is created, which can then be used to operate on the file. An fd is basically an integer index into a table of the files that a process has open. Three of them are reserved by convention and provided by default to new processes when they are started:

name    fd
stdin   0
stdout  1
stderr  2

These are really files that can be found on the system and used like normal files.

$ ls -la /dev/std*
lrwxrwxrwx 1 root root 15 Sep 23 18:02 /dev/stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root 15 Sep 23 18:02 /dev/stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root 15 Sep 23 18:02 /dev/stdout -> /proc/self/fd/1
$ cat myfile
hello
$ cp myfile /dev/stdout
hello

All this to say that echo calls the fputs function, which copies a string to stdout. Because the shell starts echo with stdout connected to the terminal, the string appears on the screen!

But there is something weird about what happens inside fputs. OK, the process already has access to stdout (we can see it under /proc/self/fd/), but fputs can be used with normal files too. Can any process get an fd for any file (e.g. /etc/passwd) and modify it?

What prevents a malicious process from wiping out the content of a device, or hijacking a browser's HTTP connection to a bank? That's a good question because there needs to be someone in charge of security. That someone is the OS kernel (and the CPU, but that's too much detail for now).

Huh, does that mean that fputs must ask the OS to write to stdout?

System Calls

To keep a system secure, a process must never have access to hardware - in fact, a process has a very limited set of things that it can do by itself! Whenever it needs to do anything more than its own little harmless calculations, a process must go through the OS kernel, via system calls (syscalls). Syscalls differ by OS and architecture (CPU).

Is there any way to look at which syscalls a process makes? Certainly, using the strace command:

Note: Processes make a lot of syscalls. Unless stated otherwise, only the relevant ones are shown here. These syscalls also depend on the actual implementation of a command, so specific syscalls may or may not be present on different systems.

$ strace echo "hello"
write(1, "hello\n", 6)                  = 6

Well, that makes sense! The syscall needed to write to an fd is write. The target fd is 1 (stdout), and the call asks to write 6 bytes from the "hello\n" string. The syscall returns 6, indicating that all bytes were written.

Creating a File

With all the basic concepts out of the way, we can look at other examples a lot faster. So, what happens when a file is created?

$ strace touch myfile
openat(AT_FDCWD, "myfile", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3
close(3)                                = 0

openat and close are used to create (O_CREAT) and close a file. The result of openat is an fd, which is then passed to close. Closing an fd simply releases it, making its index available for reuse.

Displaying a File

Ok, what about displaying the content of a file using cat?

$ echo "hello" > myfile
$ strace cat myfile
openat(AT_FDCWD, "myfile", O_RDONLY)    = 3
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd2d170a000
read(3, "hello\n", 131072)              = 6
write(1, "hello\n", 6)                  = 6
read(3, "", 131072)                     = 0
munmap(0x7fd2d170a000, 139264)          = 0
close(3)                                = 0

Here we can see the read syscall, used to read from an fd (in this case 3, which corresponds to myfile), and mmap and munmap, which allocate and deallocate the memory that we can safely assume holds the content of myfile until it is written to stdout.

An interesting thing to wonder about now is why read asks for 131,072 bytes. This buffer size (128 KiB) is simply a value chosen by the implementation for performance. But does that mean that cat only displays the first 128 KiB of a file? Let's try it:

$ dd if=/dev/zero of=myfile2 bs=1k count=300
...
$ strace cat myfile2
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 45056
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 45056) = 45056
read(3, "", 131072)                     = 0

So it just loops until there is nothing left to read.

Note: A final read is always required because a partial read does not imply that a stream is exhausted. Only when read returns 0 bytes is EOF assumed. EOT is also of interest here.

Copying a File

A bit more complicated now - what happens when a file is copied?

$ strace cp myfile myfile3
stat("myfile3", 0x7fff648a6020)         = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "myfile", O_RDONLY)    = 3
openat(AT_FDCWD, "myfile3", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f89c1dd7000
read(3, "hello\n", 131072)              = 6
write(4, "hello\n", 6)                  = 6
read(3, "", 131072)                     = 0
close(4)                                = 0
close(3)                                = 0
munmap(0x7f89c1dd7000, 139264)          = 0

The main difference here is that cat writes to stdout, while cp writes to a file.

What if the destination file already exists? Then myfile3 is opened with different flags:

openat(AT_FDCWD, "myfile3", O_WRONLY|O_TRUNC) = 4

Moving a File

What about moving a file? We can imagine that it is the same as copying a file, but the source file is deleted after, right?

$ strace mv myfile3 myfile4
renameat2(AT_FDCWD, "myfile3", AT_FDCWD, "myfile4", RENAME_NOREPLACE) = 0

Oh, that's it? There's a syscall specifically for moving a file? Sure! There's unlink and rmdir to delete a file or folder, and chdir to change the current working directory. There's also getuid to get the user's own identity, kill, and reboot.

Networking

Last one, a simple curl request. Should be easy, right?

$ strace curl example.com

Woah, there's a lot more stuff going on here! There are many ways to communicate over a network, but the interesting syscalls are:

socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("93.184.216.34")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=5, events=POLLPRI|POLLOUT|POLLWRNORM}], 1, 0) = 1 ([{fd=5, revents=POLLOUT|POLLWRNORM}])
getpeername(5, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("93.184.216.34")}, [128->16]) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(58534), sin_addr=inet_addr("172.16.0.1")}, [128->16]) = 0
sendto(5, "GET / HTTP/1.1\r\nHost: example.co"..., 75, MSG_NOSIGNAL, NULL, 0) = 75
recvfrom(5, "HTTP/1.1 200 OK\r\nAge: 432713\r\nCa"..., 102400, 0, NULL, NULL) = 1615
close(5)                                = 0

Although a bit out of scope, socket creates a socket, which is then connect'ed to example.com's IP address, 93.184.216.34, on port 80. The HTTP request is sent with sendto, and the response is received with recvfrom.

Usage of Syscalls

This section was based on the excellent book Linux Insides, from 0xAX.

Is there a way to manually make a syscall? The following assembly code writes "hello" to stdout:

.data
msg:
    .ascii "hello\n"
    len = . - msg

.text
    .global _start

_start:
    movq  $1, %rax
    movq  $1, %rdi
    movq  $msg, %rsi
    movq  $len, %rdx
    syscall

    movq  $60, %rax
    movq  $0, %rdi
    syscall

It results in the following complete output:

$ strace ./test
execve("./test", ["./test"], 0x7ffe23220b40) = 0
write(1, "hello\n", 6)                  = 6
exit(0)                                 = ?
+++ exited with 0 +++

How does it work? The syscall instruction executes the syscall whose number is in register rax. The syscall's parameters are placed in other registers: rdi, rsi, and rdx hold the first three.

So the first syscall instruction invokes syscall 1 (write) with fd 1 (stdout), a pointer to msg, and msg's length. The second one invokes syscall 60 (exit) with the exit code 0.

Although it executes exactly the needed syscalls, there are obvious issues with coding in assembly, one of them being portability: the assembly code above can only be assembled and run on x86-64 Linux. Other CPUs, especially older ones, might have different instructions and trigger syscalls using interrupts (32-bit x86 Linux, for example, uses int 0x80).

Developing

Developers of embedded systems and critical applications may be interested in taking a look at what syscalls their program makes. Although very low-level, this can help with debugging and profiling the internals of a process.

Emulation

Whenever a program runs on a system that has a different OS/CPU than the one it was intended to run on, syscalls need to be emulated.

Hypervisors and virtual machines have syscall mappings. Syscalls detected in the guest are either simulated to do the intended operation in the guest itself, or converted to syscalls that the host system can understand. Which one is needed depends on the intended action, and whether access to the host hardware is required. For example, mounting a physical hard disk inside a virtual machine will definitely need to access it, while a virtual hard disk may be completely handled by the VM manager.

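On a Linux host, Docker containers share the host's kernel, so their syscalls go straight to it. On macOS and Windows, Docker Desktop runs containers inside a virtual machine that runs Linux.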

QEMU is a popular emulator that can run programs compiled for a different architecture, and emulate complete systems.

CPU and console emulators are similar to virtual machines, in the sense that the operations that are performed are understood by the emulator, and converted to their equivalent. CPUs and early consoles do not have any OS, but the concept is similar.

Wine is a compatibility layer and dynamic loader for Windows executables, which translates the Windows API calls they make into POSIX-compliant ones. This allows most Windows applications to run on Linux. Some advanced information about Wine internals can be found here.

WSL allows Windows systems to run a Linux bash shell seamlessly. WSL 1 emulates the Linux kernel by translating Linux syscalls to Windows ones. Similarly to how Docker Desktop runs containers on Windows and macOS, WSL 2 runs a modified Linux kernel inside a managed virtual machine.

Conclusion

After all this, we can understand that a syscall is just how a process can ask the OS kernel to do something on its behalf. We also saw a few easy examples to give a feeling of which syscalls are required for a given task.

The main goal of this post was to stay easy to read while teaching the concept of what bridges a process and the operating system. A lot of information was kept hidden (e.g. user and kernel spaces) because the post was not meant to go into advanced details.

Simon Bernier
