One of the Linux kernel features that have gained the most traction in the last
few years is probably (e)BPF. Originally, the "Berkeley Packet Filter" was
intended as a means of filtering network packets in kernel mode. However, BPF
quickly developed into a fully-featured VM used for all kinds of purposes. The
appeal of BPF is not hard to see: It allows you to load kernel mode code at
system runtime (similar to kernel modules) while keeping some degree of
sandboxing and fault tolerance afforded by the VM. It is much more difficult to
break your kernel with a BPF program than with a regular kernel module. One of
the most prominent current users of BPF is sched_ext, a framework for writing
scheduler implementations in BPF. This lets you easily tinker with your
scheduler and see results live, without the risk of breaking your kernel if
your implementation crashes.
All of this made me curious about what it would take to put my own VM into the Linux kernel. To be entirely honest, there is no practical use to this project. I have no intentions or delusions of replacing the BPF ecosystem, I'm just doing this for fun. The first step, then, would be to develop my own VM - a non-trivial task on its own. Luckily, I can use the Soil VM developed by my friend Marcel. He developed it because he wanted a lightweight VM for his programming language Martinaise. Soil is a relatively low-level VM whose instruction set maps well to x86 assembly. Importantly for me, there also exists a C implementation that I can mostly reuse for my in-kernel VM.
The basic architecture
Before I could think about any Soil specifics, I had to come up with a general architecture for the project. If you want to get your code running in the Linux kernel, there are two approaches:
- Develop in-tree: Many drivers live "in-tree", i.e. in the kernel's git repository. The main advantage of this setup is that your code updates along with the kernel. Unless you pull in the latest upstream, other kernel code cannot break your own code. Besides, built-in code has some capabilities that are not available to out-of-tree modules. For example, a loadable module cannot add its own system calls.
- Develop out-of-tree: You can also develop a kernel module which can be loaded at system runtime. Kernel modules are usually developed out-of-tree, meaning they live in their own repositories and have to be built for a specific kernel version. This is also their biggest disadvantage: There is no guarantee your module will work with newer kernels. The kernel generally makes no promises about ABI or (internal) API compatibility. However, you don't have to carry around the weight of the entire kernel repository for just your module. Also, an out-of-tree module can be easier to distribute. Your users don't need an entirely custom kernel, they can just load your module.
I decided in favor of an out-of-tree module for Soil. I have some previous experience working with kernel modules and I honestly just didn't feel like building a custom kernel. However, this means that I can't go the BPF route of using a system call to interact with the Soil VM. Instead, I opted for an IOCTL-based interface.
What is an IOCTL?
On a very abstract level, IO is simple: You can read from or write to a device.
In practice, it tends to be more complicated. Besides exchanging data, you often
have to deal with control interfaces. For example, if you are dealing with GPIO,
you first need to configure the pins you need as input or output. The Linux GPIO
driver separates this configuration procedure from regular data exchange using
IOCTLs. IOCTL stands for "IO Control" and represents an operation related to a
device's configuration. If you have a file descriptor to a device, you can use
the ioctl
system call on it with the correct IOCTL number and (if required)
arguments to trigger a control operation. For example, to query information
about a GPIO line (i.e. a pin), you can use the following IOCTL (definition
taken from here):
struct gpio_v2_line_info {
    char name[GPIO_MAX_NAME_SIZE];
    char consumer[GPIO_MAX_NAME_SIZE];
    __u32 offset;
    __u32 num_attrs;
    __aligned_u64 flags;
    struct gpio_v2_line_attribute attrs[GPIO_V2_LINE_NUM_ATTRS_MAX];
    /* Space reserved for future use. */
    __u32 padding[4];
};
#define GPIO_V2_GET_LINEINFO_IOCTL _IOWR(0xB4, 0x05, struct gpio_v2_line_info)
In theory, IOCTLs are identified by arbitrary 32-bit integers. In practice,
there are conventions to describe an IOCTL's behavior. The _IOWR
macro
indicates that this IOCTL both reads from and writes to the device. Its first
parameter is a magic number associated with the specific device driver. You can
think of the second as a driver-internal IOCTL number. You would run into
trouble if you tried to use the IOCTL number 5 globally, but by combining it
with the driver's magic number it becomes unique. Finally, the IOCTL definition
contains the type of the IOCTL's parameter (if it has one). This is relevant
because the function handling the IOCTL on the kernel side only receives an
opaque pointer to the parameter. While there is no physical Soil device, I could
still use IOCTLs for this project. As I mentioned earlier, kernel modules cannot
define syscalls. However, by creating a virtual device, I could define IOCTLs to
provide a userspace interface to the VM running in the kernel.
An alternative may have been a sysfs-based interface. Its manual page describes sysfs as a "filesystem for exporting kernel objects". The issue here is that the Soil VM is not really a kernel object (I'll get to the VM's internals in a bit). Also, the kernel documentation recommends IOCTLs as an alternative to writing your own system calls and there seems to be more information available online on the IOCTL-based approach than the sysfs one. On an abstract level, the interface should look like this:
- When loaded, the Soil kernel module provides a virtual device /dev/soil
- A program that wants to execute Soil code opens /dev/soil
- It sends an IOCTL to load Soil bytecode, sending the bytecode as a payload
- The kernel module handles the IOCTL by setting up a VM and running the bytecode
This meant that I had to figure out how to create a device.
Module Setup and Creating a Character Device
The most basic Linux kernel module (courtesy of The Linux Kernel Module Programming Guide) looks like this:
/*
 * hello-1.c - The simplest kernel module.
 */
#include <linux/module.h> /* Needed by all modules */
#include <linux/printk.h> /* Needed for pr_info() */

int init_module(void)
{
    pr_info("Hello world 1.\n");
    /* A non 0 return means init_module failed; module can't be loaded. */
    return 0;
}

void cleanup_module(void)
{
    pr_info("Goodbye world 1.\n");
}

MODULE_LICENSE("GPL");
In order to build this, you will also need a Makefile:
obj-m += hello-1.o

PWD := $(CURDIR)

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
Running make will create a file hello-1.ko which you can load into your
kernel using the command insmod hello-1.ko. Once you are done, you can
unload the module using rmmod hello-1. If you check your dmesg output,
you will see the "Hello world 1" and "Goodbye world 1" messages.
Within this basic framework, I now needed to create a new character device.
The Linux kernel distinguishes between block-based and character-based IO. For example, hard drives are block-based: The data on the device is arranged in blocks (sectors in HDD parlance). Therefore, you can only read and write in fixed-size blocks. Character devices on the other hand communicate one byte at a time. For example, serial adapters are usually character-based. I didn't actually want to implement any read/write operations on my virtual Soil device, so this distinction wasn't particularly important here. However, since block devices are strongly tied into the mechanics of filesystems, and it is generally easier to build a new character device, I opted for a character device.
The Linux kernel has many interfaces to efficiently handle character devices,
most of which are not relevant to this use case, so I used the very basic
register_chrdev function to create a new device. This function takes three
parameters: a major number, a name and a struct file_operations vtable. Let's
go through these in order.
Linux identifies devices through a major and minor number. The major number is
associated with the device driver (the magic number shown earlier in the IOCTL
definition), whereas the minor number describes a specific device instance. For
example, on my system the major number 4
seems to describe the TTY driver with
the different TTY instances using minor numbers 0
through 64
. The device
name is not particularly important to the kernel, it just provides a
human-readable name for your driver (although this has nothing to do with the
device file in /dev
yet!). Finally, most of the magic happens in the
file_operations
vtable. A vtable is a structure that contains function
pointers, a concept that appears quite commonly in the Linux kernel. Here, the
vtable describes the operations the Soil device file should support. Userspace
applications must be able to open and close the file, as well as send IOCTLs. In
the code, it looks like this:
struct soil_program
{
    Byte *program;
    int len;
};

static int handle_open(struct inode *inode, struct file *file)
{
    return 0;
}

static int handle_release(struct inode *inode, struct file *file)
{
    return 0;
}

static long handle_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    if (cmd == SOIL_IOCTL_LOAD)
    {
        struct soil_program prog;
        char program[1024];

        if (copy_from_user(&prog, (struct soil_program *)arg, sizeof(struct soil_program)) != 0)
        {
            pr_err("Failed to copy param from user\n");
            return -EFAULT;
        }
        /* Make sure the bytecode fits into the fixed-size buffer. */
        if (prog.len < 0 || (size_t)prog.len > sizeof(program))
            return -EINVAL;
        if (copy_from_user(program, prog.program, prog.len) != 0)
            return -EFAULT;
        // Start the VM
    }
    return 0;
}

struct file_operations soil_fops = {
    .open = handle_open,
    .release = handle_release,
    .unlocked_ioctl = handle_ioctl,
};
Let's go through this, starting at the bottom. You'll see that the module
defines an instance of the struct file_operations
vtable. By populating its
members with function pointers, I defined which operations my character device
supports and how they work. As mentioned above, you should be able to open and
close (here called "release") a file descriptor to /dev/soil
. Since userspace
applications don't need to read from or write to the device directly, I didn't
implement these functions. If you tried to use them anyway, the system calls
should fail with a return value of EINVAL
(invalid argument). The open and
release implementations are pretty simple. Since the Soil device isn't an actual
device, these calls should always succeed, so the functions simply return 0. The
function to handle IOCTLs does the most work here, but I'll get to that later.
Armed with this vtable, I could now register a character device and make it show
up in /dev
:
struct device *dev_file;
struct class *cls;

static int __init
init_soil_km(void)
{
    int res = register_chrdev(IOC_MAGIC, "soil", &soil_fops);
    if (res != 0)
    {
        pr_alert("Failed to register character device %d\n", IOC_MAGIC);
        return -1;
    }
    cls = class_create("soil");
    dev_file = device_create(cls, NULL, MKDEV(IOC_MAGIC, 0), NULL, "soil");
    return 0;
}
The call to register_chrdev creates a new device that will appear in
/proc/devices. However, it only shows up in /dev after a call to
device_create. A device needs a device class, which is set up by the
class_create function. For a normal device, the class would carry various
kinds of functionality common to all devices of that class, but for Soil I
only needed it for formal reasons. If you are curious, you can find more
information in the kernel docs.
The MKDEV macro combines a major and minor number to identify a specific
device, in my case minor number 0 under the Soil driver's major number. Note
also that cls and dev_file are defined globally because I need to clean them
up on module exit:
static void __exit
exit_soil_km(void)
{
    device_destroy(cls, MKDEV(IOC_MAGIC, 0));
    class_destroy(cls);
    unregister_chrdev(IOC_MAGIC, "soil");
}
With the character device set up, I'm going to return to the function that makes it do anything useful:
#define SOIL_IOCTL_LOAD _IOW(IOC_MAGIC, 0, struct soil_program)
static long handle_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    if (cmd == SOIL_IOCTL_LOAD)
    {
        struct soil_program prog;
        char program[1024];

        if (copy_from_user(&prog, (struct soil_program *)arg, sizeof(struct soil_program)) != 0)
        {
            pr_err("Failed to copy param from user\n");
            return -EFAULT;
        }
        /* Make sure the bytecode fits into the fixed-size buffer. */
        if (prog.len < 0 || (size_t)prog.len > sizeof(program))
            return -EINVAL;
        if (copy_from_user(program, prog.program, prog.len) != 0)
            return -EFAULT;
        // Start the VM
    }
    return 0;
}
The IOCTL handler receives a pointer to the device file an IOCTL was performed
on, the IOCTL number and an (optional) argument. Since there is only one
/dev/soil
at a time, it can ignore the file pointer. However, the IOCTL number
and argument are very relevant. As mentioned earlier, you can think of the IOCTL
number as analogous to a system call number. If the Soil device supported
multiple IOCTLs (and it may in the future!), I would need some way of
distinguishing them. Finally, the argument can be used to pass userspace data to
the kernel code handling the IOCTL. For Soil, I used this to pass the bytecode
to the VM. While arg
is declared as unsigned long
in the implementation, it
is actually a pointer into userspace memory.
An IOCTL is triggered by a system call in a userspace process. Therefore, the
kernel code to handle it runs in the context of the calling process. However,
you can't just access userspace memory because the page underlying the virtual
address might not be mapped. In order to safely access the IOCTL argument, you
need to use the copy_from_user
function to copy it into kernel memory. As you
can see, struct soil_program
contains another pointer to the actual bytecode
along with the bytecode's length. Since this Byte*
(Byte
being a typedef to
uint8_t
) is created by the calling userspace program, it is another userspace
pointer which needs to be copied to kernelspace. Before I describe how the
module sets up the VM, I want to take a look at the userspace side of things.
Welcome from the User Side
As described earlier, the Soil interface from userspace should at this point be pretty simple. For now, there is only one IOCTL which both loads and runs a program. I set up a basic C program that reads a file from disk and sends the contents to my kernel module. The result looks something like this:
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

#include "soil_common.h"

int main(int argc, char **argv)
{
    if (argc < 2)
    {
        return -1;
    }

    char buf[1024];
    FILE *file = fopen(argv[1], "rb");
    if (file == NULL)
    {
        perror("fopen");
        return -1;
    }
    fseek(file, 0L, SEEK_END);
    size_t len = ftell(file);
    rewind(file);
    /* Don't read past the end of the buffer. */
    if (len > sizeof(buf))
    {
        fprintf(stderr, "Program too large\n");
        fclose(file);
        return -1;
    }
    fread(buf, sizeof(char), len, file);
    fclose(file);

    int fd = open("/dev/soil", O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return -1;
    }
    struct soil_program prog = {
        .program = (Byte *)buf,
        .len = len,
    };
    int res = ioctl(fd, SOIL_IOCTL_LOAD, &prog);
    if (res < 0)
    {
        perror("ioctl");
    }
    close(fd);
    return 0;
}
The program initially checks that it received a file path as a command line
argument, then opens that file. Using a combination of fseek and related
functions, it determines the file's length. After reading the bytecode into a
buffer, the program opens /dev/soil and constructs a struct soil_program as
the IOCTL's parameter. With all of this, it can finally make the ioctl system
call. You may be wondering where this program gets the struct soil_program
type and SOIL_IOCTL_LOAD from. In an earlier example, I simplified things a
bit. The definitions shared between the user and kernel side live in a shared
header soil_common.h.
Aside: Who should load the bytecode?
Initially, I was unsure whether the userspace code should only transmit the bytecode path or the bytecode itself to the kernel. In terms of efficiency, I thought it might be better to have the kernel do the file IO rather than transmitting possibly large bytecode over the syscall boundary. However, the internet seems to agree that doing file IO in the kernel is considered bad practice, so I decided in favor of the current solution.
My kingdom for a VM!
I now had all the plumbing required to load Soil bytecode into a kernel module.
All that was missing was the VM to run it. As I mentioned earlier, a C
implementation of the VM already exists, and thankfully it doesn't rely on the
standard library too much. Since the C standard library is built atop the
interfaces exposed by the Linux kernel, you can't use it in a kernel module. You
may have noticed earlier that instead of printf
, the module uses functions
like pr_alert
. This meant that I had to make some changes to the VM
implementation.
The Soil VM I am using is not intended to be embedded into other applications.
It's a single C file that compiles down to an executable. However, that was not
going to stop me. Most of the VM file's main function deals with loading the
bytecode and finding out its size (using the method described above). In order
to prepare a VM, it calls a function init_vm
to reset the VM's registers and
load the bytecode into memory. Then, a call to run
actually starts the
execution.
Looking at init_vm, I quickly found some library calls I had to change:
malloc has to become kmalloc, and the Soil-specific eprintf and panic
functions (which are used for debug printing and abnormal exits respectively)
use a printf variant that is unavailable in kernel mode. I want to highlight
kmalloc here because it looks slightly different from userspace malloc.
Unlike in userspace, you have to specify which properties your memory should
have. The Soil module can get away with simply using GFP_KERNEL memory, but
if you are curious about other options, check the kernel documentation.
There's one more hurdle, and thankfully I already knew about it going in. As it turns out, Soil supports floating point math. In general, floating point is a sensible trade-off between accuracy and performance and is necessary for many applications, e.g. in computer graphics. However, for historic reasons, floating point instructions tend to be complex, with their own set of registers and state that has to be kept intact. For this reason, you can't use floating point arithmetic in the Linux kernel. I could have worked around this, but since floating point support is tedious to implement and mostly unnecessary for kernel applications, I left it out for now.
This left one more part of the VM I had to deal with: Syscalls. While these
share a name with Linux's (and other operating systems') system calls, they
serve a slightly different purpose. In an operating system, syscalls are a means
to let userspace programs perform privileged operations (e.g. IO) safely. Soil,
on the other hand, uses syscalls to manage all interactions with the outside
world. For example, if you want to print something to stdout
or
open/read/write a file, you use a syscall. In order to avoid confusion, I'll use
the term "VM call" to distinguish them from Linux's system calls.
Most VM calls are not part of the minimum viable Soil kernel module, so for now
only three VM calls are implemented. The first of these is exit
. In the
userspace implementation, it kills the process by calling the exit
function
from the C library. However, I don't want to kill the entire kernel when a Soil
program exits. Since all the VM state is global, the VM call just sets a flag
that tells the fetch-decode-execute loop to stop executing. I also implemented
the print
and log
calls which just call eprintf
for now.
With all of this done, the Soil VM should run. Let's try out an example!
Kernelspace Fibonacci
My current Soil VM is pretty bare-bones. Apart from printing to the kernel log
and exiting, there is not much it can do in terms of interacting with the
outside world. Since I don't have any experience with Martinaise, Marcel
provided me with a simple program for testing the VM. So, I can proudly tell you
that it is now possible to calculate Fibonacci numbers in the Soil VM in a Linux
kernel module. The terminal session below shows the full set of commands to
load the module and run the example file.
[clemens@archlinux soil-km]$ sudo insmod soil.ko
[sudo] password for clemens:
[clemens@archlinux soil-km]$ sudo ./usoil fib.soil
Soil binary `fib.soil` is 127 bytes long.
[clemens@archlinux soil-km]$ sudo dmesg
[ 0.000000] Linux version 6.10.6-arch1-1 (linux@archlinux) (gcc (GCC) 14.2.1 20240805, GNU ld (GNU Binutils) 2.43.0) #1 SMP PREEMPT_DYNAMIC Mon, 19 Aug 2024 17:02:39 +0000
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux root=UUID=5bbe9fba-58d9-4ef3-9dcb-01b9c3fc4ba0 rw rootflags=subvol=@ zswap.enabled=0 rootfstype=btrfs loglevel=3 quiet
#
# Lots of output not related to soil...
#
[ 1236.449775] soil: loading out-of-tree module taints kernel.
[ 1236.449781] soil: module verification failed: signature and/or required key missing - tainting kernel
[ 1466.105055] Hello, soil!
[ 1474.259478] cmd = 1074291712, arg = 00000000d6e61c6a
[ 1474.259484] 127
[ 1474.259485] MEM SIZE 1000000
This first part is simple enough - you've seen the commands for loading a kernel
module earlier already, and we've looked at the userspace counterpart as well.
As a piece of debug information, it prints out the length of the loaded
bytecode. Looking at dmesg
, you can see that the module was loaded correctly,
printing Hello, soil!
. Since I'm not signing my module, the kernel complains,
but loads the module anyway. You can also see some debug output that gets
printed when we handle the IOCTL. cmd
is the actual IOCTL number generated by
the macros we used and arg
is the (userspace) pointer to the bytecode. After
copying the bytecode to kernel memory and setting up the VM, it starts executing
instructions. If you are curious, the hidden code listing shows the entire
tracing output - not particularly spectacular.
The VM's full tracing output
[ 1474.259528] ran d2 ->
[ 1474.259529] ip = 3, sp = f4240, st = 0, a = 1, b = 0, c = 0, d = 0, e = 0, f = 0
[ 1474.259531] ran d2 ->
[ 1474.259532] ip = 6, sp = f4240, st = 0, a = 1, b = 1, c = 0, d = 0, e = 0, f = 0
[ 1474.259533] ran d2 ->
[ 1474.259534] ip = 9, sp = f4240, st = 0, a = 1, b = 1, c = a, d = 0, e = 0, f = 0
[ 1474.259535] ran d2 ->
[ 1474.259536] ip = c, sp = f4240, st = 0, a = 1, b = 1, c = a, d = 1, e = 0, f = 0
[ 1474.259537] ran a1 ->
[ 1474.259538] ip = e, sp = f4240, st = 0, a = 1, b = 1, c = 9, d = 1, e = 0, f = 0
[ 1474.259539] ran d2 ->
[ 1474.259539] ip = 11, sp = f4240, st = 0, a = 1, b = 1, c = 9, d = 0, e = 0, f = 0
[ 1474.259541] ran c0 ->
[ 1474.259542] ip = 13, sp = f4240, st = 9, a = 1, b = 1, c = 9, d = 0, e = 0, f = 0
[ 1474.259543] ran c1 ->
[ 1474.259543] ip = 14, sp = f4240, st = 0, a = 1, b = 1, c = 9, d = 0, e = 0, f = 0
[ 1474.259545] ran f1 ->
[ 1474.259545] ip = 1d, sp = f4240, st = 0, a = 1, b = 1, c = 9, d = 0, e = 0, f = 0
[ 1474.259546] ran d0 ->
[ 1474.259547] ip = 1f, sp = f4240, st = 0, a = 1, b = 1, c = 9, d = 1, e = 0, f = 0
[ 1474.259548] ran a0 ->
[ 1474.259549] ip = 21, sp = f4240, st = 0, a = 1, b = 1, c = 9, d = 2, e = 0, f = 0
[ 1474.259550] ran d0 ->
[ 1474.259551] ip = 23, sp = f4240, st = 0, a = 1, b = 1, c = 9, d = 2, e = 0, f = 0
[ 1474.259552] ran d0 ->
[ 1474.259552] ip = 25, sp = f4240, st = 0, a = 1, b = 2, c = 9, d = 2, e = 0, f = 0
[ 1474.259554] ran f0 ->
[ 1474.259554] ip = 9, sp = f4240, st = 0, a = 1, b = 2, c = 9, d = 2, e = 0, f = 0
[ 1474.259555] ran d2 ->
[ 1474.259556] ip = c, sp = f4240, st = 0, a = 1, b = 2, c = 9, d = 1, e = 0, f = 0
[ 1474.259557] ran a1 ->
[ 1474.259558] ip = e, sp = f4240, st = 0, a = 1, b = 2, c = 8, d = 1, e = 0, f = 0
[ 1474.259559] ran d2 ->
[ 1474.259559] ip = 11, sp = f4240, st = 0, a = 1, b = 2, c = 8, d = 0, e = 0, f = 0
[ 1474.259561] ran c0 ->
[ 1474.259561] ip = 13, sp = f4240, st = 8, a = 1, b = 2, c = 8, d = 0, e = 0, f = 0
[ 1474.259563] ran c1 ->
[ 1474.259563] ip = 14, sp = f4240, st = 0, a = 1, b = 2, c = 8, d = 0, e = 0, f = 0
[ 1474.259564] ran f1 ->
[ 1474.259565] ip = 1d, sp = f4240, st = 0, a = 1, b = 2, c = 8, d = 0, e = 0, f = 0
[ 1474.259566] ran d0 ->
[ 1474.259567] ip = 1f, sp = f4240, st = 0, a = 1, b = 2, c = 8, d = 1, e = 0, f = 0
[ 1474.259568] ran a0 ->
[ 1474.259568] ip = 21, sp = f4240, st = 0, a = 1, b = 2, c = 8, d = 3, e = 0, f = 0
[ 1474.259570] ran d0 ->
[ 1474.259570] ip = 23, sp = f4240, st = 0, a = 2, b = 2, c = 8, d = 3, e = 0, f = 0
[ 1474.259571] ran d0 ->
[ 1474.259572] ip = 25, sp = f4240, st = 0, a = 2, b = 3, c = 8, d = 3, e = 0, f = 0
[ 1474.259573] ran f0 ->
[ 1474.259574] ip = 9, sp = f4240, st = 0, a = 2, b = 3, c = 8, d = 3, e = 0, f = 0
[ 1474.259575] ran d2 ->
[ 1474.259576] ip = c, sp = f4240, st = 0, a = 2, b = 3, c = 8, d = 1, e = 0, f = 0
[ 1474.259577] ran a1 ->
[ 1474.259577] ip = e, sp = f4240, st = 0, a = 2, b = 3, c = 7, d = 1, e = 0, f = 0
[ 1474.259579] ran d2 ->
[ 1474.259579] ip = 11, sp = f4240, st = 0, a = 2, b = 3, c = 7, d = 0, e = 0, f = 0
[ 1474.259580] ran c0 ->
[ 1474.259581] ip = 13, sp = f4240, st = 7, a = 2, b = 3, c = 7, d = 0, e = 0, f = 0
[ 1474.259582] ran c1 ->
[ 1474.259583] ip = 14, sp = f4240, st = 0, a = 2, b = 3, c = 7, d = 0, e = 0, f = 0
[ 1474.259584] ran f1 ->
[ 1474.259585] ip = 1d, sp = f4240, st = 0, a = 2, b = 3, c = 7, d = 0, e = 0, f = 0
[ 1474.259586] ran d0 ->
[ 1474.259586] ip = 1f, sp = f4240, st = 0, a = 2, b = 3, c = 7, d = 2, e = 0, f = 0
[ 1474.259588] ran a0 ->
[ 1474.259588] ip = 21, sp = f4240, st = 0, a = 2, b = 3, c = 7, d = 5, e = 0, f = 0
[ 1474.259589] ran d0 ->
[ 1474.259590] ip = 23, sp = f4240, st = 0, a = 3, b = 3, c = 7, d = 5, e = 0, f = 0
[ 1474.259591] ran d0 ->
[ 1474.259592] ip = 25, sp = f4240, st = 0, a = 3, b = 5, c = 7, d = 5, e = 0, f = 0
[ 1474.259593] ran f0 ->
[ 1474.259593] ip = 9, sp = f4240, st = 0, a = 3, b = 5, c = 7, d = 5, e = 0, f = 0
[ 1474.259595] ran d2 ->
[ 1474.259595] ip = c, sp = f4240, st = 0, a = 3, b = 5, c = 7, d = 1, e = 0, f = 0
[ 1474.259597] ran a1 ->
[ 1474.259597] ip = e, sp = f4240, st = 0, a = 3, b = 5, c = 6, d = 1, e = 0, f = 0
[ 1474.259598] ran d2 ->
[ 1474.259599] ip = 11, sp = f4240, st = 0, a = 3, b = 5, c = 6, d = 0, e = 0, f = 0
[ 1474.259600] ran c0 ->
[ 1474.259601] ip = 13, sp = f4240, st = 6, a = 3, b = 5, c = 6, d = 0, e = 0, f = 0
[ 1474.259602] ran c1 ->
[ 1474.259602] ip = 14, sp = f4240, st = 0, a = 3, b = 5, c = 6, d = 0, e = 0, f = 0
[ 1474.259604] ran f1 ->
[ 1474.259604] ip = 1d, sp = f4240, st = 0, a = 3, b = 5, c = 6, d = 0, e = 0, f = 0
[ 1474.259605] ran d0 ->
[ 1474.259606] ip = 1f, sp = f4240, st = 0, a = 3, b = 5, c = 6, d = 3, e = 0, f = 0
[ 1474.259607] ran a0 ->
[ 1474.259608] ip = 21, sp = f4240, st = 0, a = 3, b = 5, c = 6, d = 8, e = 0, f = 0
[ 1474.259609] ran d0 ->
[ 1474.259610] ip = 23, sp = f4240, st = 0, a = 5, b = 5, c = 6, d = 8, e = 0, f = 0
[ 1474.259611] ran d0 ->
[ 1474.259611] ip = 25, sp = f4240, st = 0, a = 5, b = 8, c = 6, d = 8, e = 0, f = 0
[ 1474.259613] ran f0 ->
[ 1474.259613] ip = 9, sp = f4240, st = 0, a = 5, b = 8, c = 6, d = 8, e = 0, f = 0
[ 1474.259614] ran d2 ->
[ 1474.259615] ip = c, sp = f4240, st = 0, a = 5, b = 8, c = 6, d = 1, e = 0, f = 0
[ 1474.259616] ran a1 ->
[ 1474.259617] ip = e, sp = f4240, st = 0, a = 5, b = 8, c = 5, d = 1, e = 0, f = 0
[ 1474.259618] ran d2 ->
[ 1474.259619] ip = 11, sp = f4240, st = 0, a = 5, b = 8, c = 5, d = 0, e = 0, f = 0
[ 1474.259620] ran c0 ->
[ 1474.259620] ip = 13, sp = f4240, st = 5, a = 5, b = 8, c = 5, d = 0, e = 0, f = 0
[ 1474.259622] ran c1 ->
[ 1474.259622] ip = 14, sp = f4240, st = 0, a = 5, b = 8, c = 5, d = 0, e = 0, f = 0
[ 1474.259623] ran f1 ->
[ 1474.259624] ip = 1d, sp = f4240, st = 0, a = 5, b = 8, c = 5, d = 0, e = 0, f = 0
[ 1474.259625] ran d0 ->
[ 1474.259626] ip = 1f, sp = f4240, st = 0, a = 5, b = 8, c = 5, d = 5, e = 0, f = 0
[ 1474.259627] ran a0 ->
[ 1474.259628] ip = 21, sp = f4240, st = 0, a = 5, b = 8, c = 5, d = d, e = 0, f = 0
[ 1474.259629] ran d0 ->
[ 1474.259629] ip = 23, sp = f4240, st = 0, a = 8, b = 8, c = 5, d = d, e = 0, f = 0
[ 1474.259631] ran d0 ->
[ 1474.259631] ip = 25, sp = f4240, st = 0, a = 8, b = d, c = 5, d = d, e = 0, f = 0
[ 1474.259632] ran f0 ->
[ 1474.259633] ip = 9, sp = f4240, st = 0, a = 8, b = d, c = 5, d = d, e = 0, f = 0
[ 1474.259634] ran d2 ->
[ 1474.259635] ip = c, sp = f4240, st = 0, a = 8, b = d, c = 5, d = 1, e = 0, f = 0
[ 1474.259636] ran a1 ->
[ 1474.259636] ip = e, sp = f4240, st = 0, a = 8, b = d, c = 4, d = 1, e = 0, f = 0
[ 1474.259638] ran d2 ->
[ 1474.259638] ip = 11, sp = f4240, st = 0, a = 8, b = d, c = 4, d = 0, e = 0, f = 0
[ 1474.259640] ran c0 ->
[ 1474.259640] ip = 13, sp = f4240, st = 4, a = 8, b = d, c = 4, d = 0, e = 0, f = 0
[ 1474.259641] ran c1 ->
[ 1474.259642] ip = 14, sp = f4240, st = 0, a = 8, b = d, c = 4, d = 0, e = 0, f = 0
[ 1474.259643] ran f1 ->
[ 1474.259644] ip = 1d, sp = f4240, st = 0, a = 8, b = d, c = 4, d = 0, e = 0, f = 0
[ 1474.259645] ran d0 ->
[ 1474.259645] ip = 1f, sp = f4240, st = 0, a = 8, b = d, c = 4, d = 8, e = 0, f = 0
[ 1474.259647] ran a0 ->
[ 1474.259647] ip = 21, sp = f4240, st = 0, a = 8, b = d, c = 4, d = 15, e = 0, f = 0
[ 1474.259649] ran d0 ->
[ 1474.259649] ip = 23, sp = f4240, st = 0, a = d, b = d, c = 4, d = 15, e = 0, f = 0
[ 1474.259650] ran d0 ->
[ 1474.259651] ip = 25, sp = f4240, st = 0, a = d, b = 15, c = 4, d = 15, e = 0, f = 0
[ 1474.259652] ran f0 ->
[ 1474.259653] ip = 9, sp = f4240, st = 0, a = d, b = 15, c = 4, d = 15, e = 0, f = 0
[ 1474.259654] ran d2 ->
[ 1474.259654] ip = c, sp = f4240, st = 0, a = d, b = 15, c = 4, d = 1, e = 0, f = 0
[ 1474.259656] ran a1 ->
[ 1474.259656] ip = e, sp = f4240, st = 0, a = d, b = 15, c = 3, d = 1, e = 0, f = 0
[ 1474.259658] ran d2 ->
[ 1474.259658] ip = 11, sp = f4240, st = 0, a = d, b = 15, c = 3, d = 0, e = 0, f = 0
[ 1474.259659] ran c0 ->
[ 1474.259660] ip = 13, sp = f4240, st = 3, a = d, b = 15, c = 3, d = 0, e = 0, f = 0
[ 1474.259661] ran c1 ->
[ 1474.259662] ip = 14, sp = f4240, st = 0, a = d, b = 15, c = 3, d = 0, e = 0, f = 0
[ 1474.259663] ran f1 ->
[ 1474.259663] ip = 1d, sp = f4240, st = 0, a = d, b = 15, c = 3, d = 0, e = 0, f = 0
[ 1474.259665] ran d0 ->
[ 1474.259665] ip = 1f, sp = f4240, st = 0, a = d, b = 15, c = 3, d = d, e = 0, f = 0
[ 1474.259667] ran a0 ->
[ 1474.259667] ip = 21, sp = f4240, st = 0, a = d, b = 15, c = 3, d = 22, e = 0, f = 0
[ 1474.259668] ran d0 ->
[ 1474.259669] ip = 23, sp = f4240, st = 0, a = 15, b = 15, c = 3, d = 22, e = 0, f = 0
[ 1474.259670] ran d0 ->
[ 1474.259671] ip = 25, sp = f4240, st = 0, a = 15, b = 22, c = 3, d = 22, e = 0, f = 0
[ 1474.259672] ran f0 ->
[ 1474.259672] ip = 9, sp = f4240, st = 0, a = 15, b = 22, c = 3, d = 22, e = 0, f = 0
[ 1474.259674] ran d2 ->
[ 1474.259674] ip = c, sp = f4240, st = 0, a = 15, b = 22, c = 3, d = 1, e = 0, f = 0
[ 1474.259676] ran a1 ->
[ 1474.259676] ip = e, sp = f4240, st = 0, a = 15, b = 22, c = 2, d = 1, e = 0, f = 0
[ 1474.259677] ran d2 ->
[ 1474.259678] ip = 11, sp = f4240, st = 0, a = 15, b = 22, c = 2, d = 0, e = 0, f = 0
[ 1474.259679] ran c0 ->
[ 1474.259680] ip = 13, sp = f4240, st = 2, a = 15, b = 22, c = 2, d = 0, e = 0, f = 0
[ 1474.259681] ran c1 ->
[ 1474.259682] ip = 14, sp = f4240, st = 0, a = 15, b = 22, c = 2, d = 0, e = 0, f = 0
[ 1474.259683] ran f1 ->
[ 1474.259683] ip = 1d, sp = f4240, st = 0, a = 15, b = 22, c = 2, d = 0, e = 0, f = 0
[ 1474.259685] ran d0 ->
[ 1474.259685] ip = 1f, sp = f4240, st = 0, a = 15, b = 22, c = 2, d = 15, e = 0, f = 0
[ 1474.259686] ran a0 ->
[ 1474.259687] ip = 21, sp = f4240, st = 0, a = 15, b = 22, c = 2, d = 37, e = 0, f = 0
[ 1474.259688] ran d0 ->
[ 1474.259689] ip = 23, sp = f4240, st = 0, a = 22, b = 22, c = 2, d = 37, e = 0, f = 0
[ 1474.259690] ran d0 ->
[ 1474.259690] ip = 25, sp = f4240, st = 0, a = 22, b = 37, c = 2, d = 37, e = 0, f = 0
[ 1474.259692] ran f0 ->
[ 1474.259692] ip = 9, sp = f4240, st = 0, a = 22, b = 37, c = 2, d = 37, e = 0, f = 0
[ 1474.259694] ran d2 ->
[ 1474.259694] ip = c, sp = f4240, st = 0, a = 22, b = 37, c = 2, d = 1, e = 0, f = 0
[ 1474.259695] ran a1 ->
[ 1474.259696] ip = e, sp = f4240, st = 0, a = 22, b = 37, c = 1, d = 1, e = 0, f = 0
[ 1474.259697] ran d2 ->
[ 1474.259698] ip = 11, sp = f4240, st = 0, a = 22, b = 37, c = 1, d = 0, e = 0, f = 0
[ 1474.259699] ran c0 ->
[ 1474.259699] ip = 13, sp = f4240, st = 1, a = 22, b = 37, c = 1, d = 0, e = 0, f = 0
[ 1474.259701] ran c1 ->
[ 1474.259701] ip = 14, sp = f4240, st = 0, a = 22, b = 37, c = 1, d = 0, e = 0, f = 0
[ 1474.259703] ran f1 ->
[ 1474.259703] ip = 1d, sp = f4240, st = 0, a = 22, b = 37, c = 1, d = 0, e = 0, f = 0
[ 1474.259704] ran d0 ->
[ 1474.259705] ip = 1f, sp = f4240, st = 0, a = 22, b = 37, c = 1, d = 22, e = 0, f = 0
[ 1474.259706] ran a0 ->
[ 1474.259707] ip = 21, sp = f4240, st = 0, a = 22, b = 37, c = 1, d = 59, e = 0, f = 0
[ 1474.259708] ran d0 ->
[ 1474.259708] ip = 23, sp = f4240, st = 0, a = 37, b = 37, c = 1, d = 59, e = 0, f = 0
[ 1474.259710] ran d0 ->
[ 1474.259710] ip = 25, sp = f4240, st = 0, a = 37, b = 59, c = 1, d = 59, e = 0, f = 0
[ 1474.259711] ran f0 ->
[ 1474.259712] ip = 9, sp = f4240, st = 0, a = 37, b = 59, c = 1, d = 59, e = 0, f = 0
[ 1474.259713] ran d2 ->
[ 1474.259714] ip = c, sp = f4240, st = 0, a = 37, b = 59, c = 1, d = 1, e = 0, f = 0
[ 1474.259715] ran a1 ->
[ 1474.259716] ip = e, sp = f4240, st = 0, a = 37, b = 59, c = 0, d = 1, e = 0, f = 0
[ 1474.259717] ran d2 ->
[ 1474.259717] ip = 11, sp = f4240, st = 0, a = 37, b = 59, c = 0, d = 0, e = 0, f = 0
[ 1474.259719] ran c0 ->
[ 1474.259719] ip = 13, sp = f4240, st = 0, a = 37, b = 59, c = 0, d = 0, e = 0, f = 0
[ 1474.259720] ran c1 ->
[ 1474.259721] ip = 14, sp = f4240, st = 1, a = 37, b = 59, c = 0, d = 0, e = 0, f = 0
[ 1474.259722] ran f1 ->
[ 1474.259723] ip = 2e, sp = f4240, st = 1, a = 37, b = 59, c = 0, d = 0, e = 0, f = 0
The final instruction, which makes an exit VM call, is probably the most immediately interesting:
[ 1474.259724] syscall exit(55)
[ 1474.259725] exited with 55
[ 1474.259726] ran f4 ->
[ 1474.259726] ip = 30, sp = f4240, st = 1, a = 37, b = 59, c = 0, d = 0, e = 0, f = 0
[clemens@archlinux soil-km]$ sudo rmmod soil
Since instruction tracing is enabled, the Soil VM prints the opcode it executed and the resulting register state for each instruction. As you can see, the program exits with a value of 55, the 10th Fibonacci number (if you start at 1). In other words, everything ran successfully!
What's still missing
At this point, most of the Soil VM works. The issue is all the machinery surrounding it. Here are some features I would still like to implement:
- Thread Safety: The current implementation relies on a lot of global state that only exists once in the kernel module. This means that you can't run two Soil programs concurrently - a major limitation if you want to use Soil for any long-running, daemon-like applications. There are two major options to solve this: I could introduce locks in the IOCTL handler, so only one Soil program can run at a time. Not ideal, so I'm probably going to implement the other option: Moving the VM state from global state to a struct. That shouldn't be too complicated, just a bit of work. Watch this statement come back to haunt me later.
- Memory Safety: The existing C code allocates heap memory for a variety of uses. However, it never frees that memory. That's not a huge deal for a userspace process. It certainly isn't good style, but not the end of the world. After all, the OS is there to clean up after you. In kernel mode, you probably should be more careful about leaking memory. Tracking memory should hopefully be easier when all the VM state lives in a struct.
- VM Calls: Right now, kernelspace Soil implements only the bare minimum of VM calls. I need to evaluate which of the existing VM calls make sense for a kernel module (e.g. interactive input is not useful here) and which new VM calls you may need for different purposes. BPF's kfuncs may be a source of inspiration.
- More Granular Interface: Right now, there is only one IOCTL that is responsible for both loading and executing bytecode. This works for a prototype, but ideally you should have some more control over the execution. For example, it might be useful to load a program once and then run it multiple times. Also, the Soil VM contains a tracing mode that is currently enabled via a #define macro. In short, there are a number of features that already exist or are easy to add that make for a more flexible developer experience.
- Asynchronous Execution: Right now, the IOCTL blocks until your program has exited. This means background tasks are currently not implementable in kernelspace Soil, a major limitation. I haven't thought too much about this yet, but spawning VMs in a kernel thread and providing a handle you can poll to read a VM's status seems like a reasonable solution.
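The last two bullets could be made concrete along these lines. Everything below is hypothetical, my own sketch rather than the module's actual interface: the command names, the request structs, and soil_vm_run are invented, and the IOCTL magic number would need to be chosen against Documentation/userspace-api/ioctl/ioctl-number.rst.

```c
#include <linux/ioctl.h>
#include <linux/kthread.h>
#include <linux/completion.h>

/* Hypothetical split of the single do-everything IOCTL: load once, run
 * repeatedly, toggle tracing at runtime instead of via a #define. */
#define SOIL_IOC_MAGIC 's'
#define SOIL_IOC_LOAD      _IOW(SOIL_IOC_MAGIC, 1, struct soil_binary)
#define SOIL_IOC_RUN       _IO(SOIL_IOC_MAGIC, 2)
#define SOIL_IOC_RUN_ASYNC _IO(SOIL_IOC_MAGIC, 3)
#define SOIL_IOC_STATUS    _IOR(SOIL_IOC_MAGIC, 4, struct soil_status)
#define SOIL_IOC_TRACE     _IOW(SOIL_IOC_MAGIC, 5, int)

struct soil_binary {
    const void __user *data;
    __u64 len;
};

struct soil_status {
    __u32 running;
    __s32 exit_code;
};

/* For SOIL_IOC_RUN_ASYNC: run the interpreter loop in a kernel thread
 * and let userspace poll SOIL_IOC_STATUS for the result. Assumes the
 * VM state struct carries a completion and an exit_code field. */
static int soil_run_thread(void *data)
{
    struct soil_vm *vm = data;

    vm->exit_code = soil_vm_run(vm); /* the existing interpreter loop */
    complete(&vm->done);
    return 0;
}

static long soil_start_async(struct soil_vm *vm)
{
    struct task_struct *t;

    init_completion(&vm->done);
    t = kthread_run(soil_run_thread, vm, "soil-vm");
    return IS_ERR(t) ? PTR_ERR(t) : 0;
}
```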
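To sketch what the struct-based refactor from the first two bullets could look like: all names here are hypothetical (the real code may carve things up differently), and in the kernel, calloc/free would become kzalloc/kfree, but the idea carries over directly.

```c
#include <stdlib.h>

/* Hypothetical: instead of file-scope globals, all VM state lives in
 * one struct, so multiple VMs can coexist and cleanup is one free path. */
struct soil_vm {
    unsigned long ip, sp, st;
    unsigned long regs[6];   /* a-f */
    unsigned char *binary;   /* loaded bytecode */
    size_t binary_len;
    unsigned char *memory;   /* VM memory, freed together with the struct */
    size_t memory_size;
};

struct soil_vm *soil_vm_create(size_t memory_size)
{
    struct soil_vm *vm = calloc(1, sizeof(*vm));

    if (!vm)
        return NULL;
    vm->memory = calloc(1, memory_size);
    if (!vm->memory) {
        free(vm);
        return NULL;
    }
    vm->memory_size = memory_size;
    /* The stack grows down from the top of VM memory; with the 1 MB from
     * the footnote below, that is presumably why the trace shows
     * sp = f4240 (0xf4240 == 1,000,000) throughout. */
    vm->sp = memory_size;
    return vm;
}

void soil_vm_destroy(struct soil_vm *vm)
{
    if (!vm)
        return;
    free(vm->binary);
    free(vm->memory);
    free(vm);
}
```

In the kernel, an instance of this struct could hang off `file->private_data`, so every open() of /dev/soil gets its own VM and the release handler frees everything in one place.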
Wrapping up and future plans
Achieving my initial goal of running Soil bytecode in the Linux kernel proved surprisingly easy. I learned quite a bit about how character devices work internally, and seeing /dev/soil pop up for the first time was a nice moment of success. Soil itself is also relatively easy to understand, especially given that I started with a working implementation. A couple of design decisions, both in the VM in general (e.g. VM calls for file IO) and in the implementation specifically (e.g. the approach to memory management), are clearly tailored to userspace applications, but the simple, RISC-like design made it easy enough to adapt.
As noted above, I have some ideas for where to take this project next, so expect further updates in the future.
- In order to build kernel modules you will need a number of packages, usually including a C compiler, make, and kernel headers. If you want to compile your own kernel modules, consult your distribution's documentation to find out how to set up these requirements. Also keep in mind that commands like insmod and rmmod will require superuser privileges. ↩
- The Linux Kernel Labs website was another very helpful source in figuring out how to create my own character device. ↩
- Credit for this idea goes to Marcel. You can't simply read the file and use strlen to get its length because Soil bytecode may contain null bytes. ↩
- As it turns out, this Soil implementation tries to allocate 1 GB of memory when starting a VM. Marcel tells me this is necessary for the Martinaise compiler, which also runs on Soil. For the kernel version, I reduced the initial memory to 1 MB. ↩