October 19, 2022
Introduction to Syscalls
What happens when echo "hello" is entered in a terminal? Let’s try it:
$ echo "hello"
hello
The string hello is displayed! 🎉🎉
But wait, how does that happen?
In this introductory post, we explain how an application communicates with the operating system.
Note: This post is aimed at programmers with an interest in low-level details, but does not assume any experience. Some topics may go into intermediate territory, but there is nothing I would consider to be advanced. Everything was written using Debian bullseye on x86-64, but the concepts are similar for other OSes and CPUs.
Displaying Text
The echo command comes with the OS and is part of GNU coreutils. As the type command shows, it is built into the shell, but a standalone binary can still be found at /bin/echo:
$ type echo
echo is a shell builtin
$ which echo
/bin/echo
The echo command's source code can be found in src/echo.c. Looking at the code, we can see that echo basically does the following:
while (argc > 0)
  {
    fputs (argv[0], stdout);
    argc--;
    argv++;
    if (argc > 0)
      putchar (' ');
  }
which sends each argument to stdout, one by one, separated by a space.
But where do the fputs and putchar functions come from? They aren't declared anywhere in the project!
Standard Libraries
The functions fputs and putchar come from the C standard library, and are declared in stdio.h.
What's a standard library? Well, usually, a language has a formal documentation that specifies the language's syntax, semantics, type system, and the base functionalities that come with it. That last part is the standard library (stdlib).
Nearly all languages, except maybe for assembly, have a stdlib that defines the function signatures that a programmer can use without the need for third-party packages. These stdlibs are often versioned (e.g. C17, Ruby 2.7), so projects are compiled (or interpreted) against a specific version of the language.
Their implementations may differ from system to system, and some languages even have multiple implementations for the same system, e.g. the C stdlib with glibc for performance, and musl for size (used in the alpine Docker image).
With that in mind, we can understand that fputs is offered by the language itself. Its function signature is thus part of the C stdlib:
/* Writes a character string to a file stream. */
extern int fputs (const char *str, FILE *stream);
But wait a minute, the code earlier provided stdout as the second argument, which isn't a file, is it?
It is, actually!
extern FILE *stdout; /* Standard output stream. */
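To illustrate why stdout fits that signature, here is a minimal sketch that passes either stdout or a regular file stream to the same helper. The name write_line is hypothetical and is not part of echo or the C stdlib; only fputs and fputc are standard.

```c
#include <stdio.h>

/* Hypothetical helper (not from echo.c): writes text followed by a newline
   to any stream. Returns 0 on success, -1 if fputs or fputc reports an error. */
int write_line(FILE *stream, const char *text)
{
    if (fputs(text, stream) == EOF)
        return -1;
    if (fputc('\n', stream) == EOF)
        return -1;
    return 0;
}
```

Calling write_line(stdout, "hello") and write_line(f, "hello"), with f obtained from fopen or tmpfile, goes through exactly the same code path.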
In Unix, everything is a file -- normal files, folders, links, pipes, network sockets, keyboards, printers… Files are an abstraction layer for I/O streams, and it makes sense when you think about it, as files are just places where bytes can be read from and written to.
When a file is opened, a file descriptor (fd) is created, which can be used afterwards to operate on the file. An fd is basically an integer index into a table of files that a process can access, three of which are hard-coded and provided by default to new processes when they are started:
name | fd |
---|---|
stdin | 0 |
stdout | 1 |
stderr | 2 |
These are really files that can be found on the system and used like normal files.
$ ls -la /dev/std*
lrwxrwxrwx 1 root root 15 Sep 23 18:02 /dev/stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root root 15 Sep 23 18:02 /dev/stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root root 15 Sep 23 18:02 /dev/stdout -> /proc/self/fd/1
$ cat myfile
hello
$ cp myfile /dev/stdout
hello
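The fd numbers from the table can also be checked from C. The sketch below assumes a POSIX system, where fileno returns the fd behind a stdio stream; the function name standard_fds_match is made up for this example.

```c
#include <stdio.h>
#include <unistd.h>

/* Prints the fd behind each standard stream and returns 1 if they match the
   POSIX constants STDIN_FILENO, STDOUT_FILENO and STDERR_FILENO, 0 otherwise. */
int standard_fds_match(void)
{
    int in = fileno(stdin);
    int out = fileno(stdout);
    int err = fileno(stderr);

    printf("stdin=%d stdout=%d stderr=%d\n", in, out, err);
    return in == STDIN_FILENO && out == STDOUT_FILENO && err == STDERR_FILENO;
}
```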
All this to say that echo calls the fputs function, which writes a string to stdout. Because stdout is connected to the terminal by default, the string appears on the screen!
But there is something weird about what happens inside fputs. Ok, the process already has access to stdout (we can see it under /proc/self/fd/), but fputs can be used with normal files. Can any process get an fd for any file (e.g. /etc/passwd) and modify it?
What prevents a malicious process from wiping out the content of a device, or hijacking a browser's HTTP connection to a bank? That's a good question because there needs to be someone in charge of security. That someone is the OS kernel (and the CPU, but that's too much detail for now).
Huh, does that mean that fputs must ask the OS to write to stdout?
System Calls
To keep a system secure, a process must never have access to hardware - in fact, a process has a very limited set of things that it can do by itself! Whenever it needs to do anything more than its own little harmless calculations, a process must go through the OS kernel, via system calls (syscalls). Syscalls differ by OS and architecture (CPU).
Is there any way to look at which syscalls a process makes? Certainly, using the strace command:
Note: Processes make a lot of syscalls. Unless stated otherwise, only the relevant ones are shown here. These syscalls also depend on the actual implementation of a command, so specific syscalls may or may not be present on different systems.
$ strace echo "hello"
write(1, "hello\n", 6) = 6
Well, that makes sense! The syscall needed to write to an fd is write. Here, it targets fd 1 (stdout) and writes 6 bytes from the "hello\n" string. The syscall returns 6, indicating that all bytes were written.
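The same call can be made from C through the write wrapper declared in unistd.h, skipping stdio entirely. This is a minimal sketch; write_string is a hypothetical helper name.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: hands a whole string to the write() wrapper.
   Returns the number of bytes written, or -1 on error (errno is set). */
ssize_t write_string(int fd, const char *s)
{
    return write(fd, s, strlen(s));
}
```

write_string(STDOUT_FILENO, "hello\n") should produce the same write(1, "hello\n", 6) line under strace. A return value smaller than the requested length is possible (a partial write), so robust code retries with the remaining bytes.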
Creating a File
With all the basic concepts out of the way, we can look at other examples a lot faster. So, what happens when a file is created?
$ strace touch myfile
openat(AT_FDCWD, "myfile", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3
close(3) = 0
openat and close are used to create (O_CREAT) and close a file. The result of openat is an fd, which is then passed to close. Closing an fd simply releases it, so that its index value can be reused.
Displaying a File
Ok, what about displaying the content of a file using cat?
$ echo "hello" > myfile
$ strace cat myfile
openat(AT_FDCWD, "myfile", O_RDONLY) = 3
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd2d170a000
read(3, "hello\n", 131072) = 6
write(1, "hello\n", 6) = 6
read(3, "", 131072) = 0
munmap(0x7fd2d170a000, 139264) = 0
close(3) = 0
Here we can see the read syscall, used to read from an fd (in this case 3, which corresponds to myfile), and mmap and munmap, which allocate and deallocate the memory that we can safely assume is used to hold the content of myfile until it is written to stdout.
An interesting thing to wonder about right now is why read asks for 131,072 bytes. This is simply the size of the buffer that cat uses, a value picked for performance. But does that mean that cat only displays the first 128 KiB of a file? Let's try it:
$ dd if=/dev/zero of=myfile2 bs=1k count=300
...
$ strace cat myfile2
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 131072
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 131072) = 45056
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 45056) = 45056
read(3, "", 131072) = 0
So it just loops until there is nothing left to read.
Note: A final read is always required because a partial read does not imply that a stream is exhausted. It is only when read returns 0 that an EOF is assumed. EOT is also of interest here.
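The loop seen in the trace can be sketched in C as follows. This is an illustration under assumptions, not the actual cat source: copy_fd and BUF_SIZE are hypothetical names, and the buffer is a static array rather than an mmap'ed region.

```c
#include <errno.h>
#include <unistd.h>

#define BUF_SIZE 131072 /* same request size as in the strace output */

/* Hypothetical helper: copies everything from in_fd to out_fd.
   Returns the number of bytes copied, or -1 on error. */
long copy_fd(int in_fd, int out_fd)
{
    static char buf[BUF_SIZE];
    long total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            return total;          /* read returned 0: end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal, retry */
            return -1;
        }

        /* write may accept fewer bytes than requested, so loop until
           the whole chunk has been handed over. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

With an fd obtained from open, copy_fd(fd, STDOUT_FILENO) behaves like a bare-bones cat: it only stops once read returns 0.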
Copying a File
A bit more complicated now - what happens when a file is copied?
$ strace cp myfile myfile3
stat("myfile3", 0x7fff648a6020) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "myfile", O_RDONLY) = 3
openat(AT_FDCWD, "myfile3", O_WRONLY|O_CREAT|O_EXCL, 0644) = 4
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f89c1dd7000
read(3, "hello\n", 131072) = 6
write(4, "hello\n", 6) = 6
read(3, "", 131072) = 0
close(4) = 0
close(3) = 0
munmap(0x7f89c1dd7000, 139264) = 0
The main difference here is that cat writes to stdout, while cp writes to a file.
What if the destination file already exists? Then myfile3 is opened with different flags:
openat(AT_FDCWD, "myfile3", O_WRONLY|O_TRUNC) = 4
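The two sets of flags can be combined in a small sketch. Note the assumption: cp decided which flags to use after the stat call seen earlier, whereas this hypothetical write_destination helper simply tries O_CREAT|O_EXCL first and falls back to O_TRUNC when the file already exists.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: writes len bytes of data to path, creating the file
   if needed or truncating it if it already exists.
   Returns 0 on success, -1 on error. */
int write_destination(const char *path, const char *data, size_t len)
{
    /* O_EXCL makes the call fail with EEXIST instead of reusing a file. */
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd == -1 && errno == EEXIST)
        fd = open(path, O_WRONLY | O_TRUNC); /* drop the old content */
    if (fd == -1)
        return -1;

    ssize_t written = write(fd, data, len);
    int close_result = close(fd);
    if (written < 0 || (size_t)written != len || close_result == -1)
        return -1;
    return 0;
}
```

Under strace, the open calls will most likely show up as openat with AT_FDCWD, since that is how glibc usually implements open on Linux.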
Moving a File
What about moving a file? We can imagine that it is the same as copying a file, but the source file is deleted after, right?
$ strace mv myfile3 myfile4
renameat2(AT_FDCWD, "myfile3", AT_FDCWD, "myfile4", RENAME_NOREPLACE) = 0
Oh, that's it? There's a syscall specifically for moving a file? Sure! There are also unlink and rmdir to delete a file or folder, chdir to change the current working directory, getuid to get the user's own identity, and even kill and reboot.
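From C, a move is usually done with the standard rename function rather than by calling renameat2 directly. The sketch below uses a hypothetical try_rename helper. It also shows where the "copy, then delete" intuition does apply: a rename cannot cross filesystems and fails with EXDEV, in which case a tool has to fall back to copying the data and unlinking the source.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: tries to move a file with a single rename.
   Returns 0 on success, 1 if source and destination are on different
   filesystems (a copy followed by a delete would be needed), -1 otherwise. */
int try_rename(const char *from, const char *to)
{
    if (rename(from, to) == 0)
        return 0;
    return errno == EXDEV ? 1 : -1;
}
```

Unlike the RENAME_NOREPLACE flag seen in the trace, plain rename silently replaces an existing destination file.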
Networking
Last one, a simple curl. Should be easy, right?
$ strace curl example.com
Woah, there's a lot more going on here! There are many ways to communicate through a network, but the interesting syscalls are:
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 5
connect(5, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("93.184.216.34")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=5, events=POLLPRI|POLLOUT|POLLWRNORM}], 1, 0) = 1 ([{fd=5, revents=POLLOUT|POLLWRNORM}])
getpeername(5, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("93.184.216.34")}, [128->16]) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(58534), sin_addr=inet_addr("172.16.0.1")}, [128->16]) = 0
sendto(5, "GET / HTTP/1.1\r\nHost: example.co"..., 75, MSG_NOSIGNAL, NULL, 0) = 75
recvfrom(5, "HTTP/1.1 200 OK\r\nAge: 432713\r\nCa"..., 102400, 0, NULL, NULL) = 1615
close(5) = 0
Although a bit out of scope, socket creates a socket that is then connect'ed to example.com's IP address, 93.184.216.34. The HTTP request is sent with sendto, and the response is received with recvfrom.
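The client side of that sequence can be sketched in C. To stay self-contained, the example talks to a listener on 127.0.0.1 instead of example.com, and it uses a blocking connect, whereas the trace shows curl using a non-blocking connect followed by poll. Both function names are hypothetical. On x86-64 Linux, glibc's send and recv typically appear as sendto and recvfrom under strace.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: opens a TCP listener on 127.0.0.1 and lets the kernel
   pick a free port, stored in *port. Returns the listening fd, or -1. */
int listen_on_loopback(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); /* 0 asks the kernel for any free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        listen(fd, 1) == -1 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) == -1) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Hypothetical helper: socket, connect, send, in the same order as the trace.
   Returns the connected fd (the caller closes it), or -1 on error. */
int connect_and_send(const char *ip, unsigned short port, const char *msg)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);

    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1 ||
        connect(fd, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        send(fd, msg, strlen(msg), MSG_NOSIGNAL) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A server would then call accept on the listening fd and recv on the accepted connection to read the request.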
Usage of Syscalls
This section was based on the excellent book Linux Insides, from 0xAX.
Is there a way to manually make a syscall? The following assembly code writes "hello" to stdout:
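    .data
msg:
    .ascii "hello\n"
    len = . - msg           # length of msg in bytes
    .text
    .global _start
_start:
    movq $1, %rax           # syscall number 1: write
    movq $1, %rdi           # 1st argument: fd 1 (stdout)
    movq $msg, %rsi         # 2nd argument: address of the string
    movq $len, %rdx         # 3rd argument: number of bytes to write
    syscall
    movq $60, %rax          # syscall number 60: exit
    movq $0, %rdi           # 1st argument: exit code 0
    syscall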
It results in the following complete output:
$ strace ./test
execve("./test", ["./test"], 0x7ffe23220b40) = 0
write(1, "hello\n", 6) = 6
exit(0) = ?
+++ exited with 0 +++
How does it work? The syscall instruction executes the syscall whose number is stored in register rax. On x86-64 Linux, the first three syscall parameters are passed in registers rdi, rsi, and rdx, in that order.
So the first syscall executes syscall 1 (write), with fd 1 (stdout), a pointer to msg, and the length of msg. The second syscall executes syscall 60 (exit) with the exit code 0.
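For comparison, the same request can be made from C with glibc's generic syscall function, which takes the syscall number followed by its arguments. This is a sketch assuming Linux with glibc; raw_write is a hypothetical name, and SYS_write comes from sys/syscall.h, so the number 1 does not need to be hard-coded.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: invokes the write syscall by number instead of going
   through the write() wrapper. Returns bytes written, or -1 with errno set. */
long raw_write(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

raw_write(1, "hello\n", 6) should show up under strace as the same write(1, "hello\n", 6) line as the assembly version, while staying portable across the architectures that glibc supports.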
Although it executes exactly the needed syscalls, there are obvious issues with coding in assembly, one of them being portability - the assembly code above can only be assembled and run on x86-64 Linux. Other CPUs and OSes use different instructions and syscall numbers, and older CPUs might trigger syscalls using interrupts.
Developing
Developers of embedded systems and critical applications may be interested in taking a look at what syscalls their program makes. Although very low-level, this can help with debugging and profiling the internals of a process.
Emulation
Whenever a program runs on a system that has a different OS/CPU than the one it was intended to run on, syscalls need to be emulated.
Hypervisors and virtual machines have syscall mappings. Syscalls detected on the guest are either simulated to do the intended operation on the guest itself or converted to a syscall that the host system can understand. Which one is needed depends on the intended action, and whether access to the host hardware is required. For example, mounting a physical hard disk inside a virtual machine will definitely need to access it, while a virtual hard disk may be completely handled by the VM manager.
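On a Linux host, Docker containers are not emulated at all: they share the host kernel and make syscalls to it directly. On macOS and Windows, Docker Desktop runs containers inside a virtual machine that runs Linux on the host architecture.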
QEMU is a popular emulator that can run programs compiled for a different architecture, and emulate complete systems.
CPU and console emulators are similar to virtual machines, in the sense that the operations that are performed are understood by the emulator, and converted to their equivalent. CPUs and early consoles do not have any OS, but the concept is similar.
Wine is a compatibility layer that loads Windows executables and translates their Windows API calls to POSIX-compliant ones. This allows many Windows applications to run on Linux. Some advanced information about Wine internals can be found here.
WSL allows Windows systems to run a Linux bash shell seamlessly. WSL 1 emulates the Linux kernel by translating Linux syscalls to Windows ones. WSL 2 instead runs a modified Linux kernel inside a managed virtual machine, similarly to how Docker Desktop runs containers on Windows and macOS.
Conclusion
After all this, we can understand that a syscall is just how a process can ask the OS kernel to do something on its behalf. We also saw a few easy examples to give a feeling of which syscalls are required for a given task.
The main goal of this post was to keep everything easy to read while teaching the concept of what bridges a process with the operating system. A lot of information was left out (e.g. user and kernel spaces) because the post was not meant to go into advanced details.