Minimal Syscalls in C
Created Saturday 20 July 2024
Zig implements syscalls as functions with extended inline assembly that puts each
of the function's arguments into the proper register.
(x86-64)
pub fn syscall6(
    number: SYS,
    arg1: usize,
    arg2: usize,
    arg3: usize,
    arg4: usize,
    arg5: usize,
    arg6: usize,
) usize {
    return asm volatile ("syscall"
        : [ret] "={rax}" (-> usize),
        : [number] "{rax}" (@intFromEnum(number)),
          [arg1] "{rdi}" (arg1),
          [arg2] "{rsi}" (arg2),
          [arg3] "{rdx}" (arg3),
          [arg4] "{r10}" (arg4),
          [arg5] "{r8}" (arg5),
          [arg6] "{r9}" (arg6),
        : "rcx", "r11", "memory"
    );
}
This inlines nicely, even with optimizations and other factors at play, because it
explicitly specifies the mapping between arguments and registers, as well as the
return value, and because the function has no fixed calling convention the way a
non-static C function might. Assuming the wrapper is always optimized down to just
the syscall instruction rather than an actual call to this function, you could say
that setting up a syscall costs the same at the call site as setting up any other
function call: the arguments have to be put into the right registers, just like
everyone else does for the arguments to their functions.
Following this Zig implementation, I've made a more or less equivalent version in C.
static __inline__ __attribute__((always_inline))
unsigned long __sysc6(unsigned long id_,
                      unsigned long a_,
                      unsigned long b_,
                      unsigned long c_,
                      unsigned long d_,
                      unsigned long e_,
                      unsigned long f_)
{
    /* Local register variables pin each value to the register the kernel expects. */
    register unsigned long id __asm__ ("rax") = id_;
    register unsigned long a __asm__ ("rdi") = a_;
    register unsigned long b __asm__ ("rsi") = b_;
    register unsigned long c __asm__ ("rdx") = c_;
    register unsigned long d __asm__ ("r10") = d_;
    register unsigned long e __asm__ ("r8") = e_;
    register unsigned long f __asm__ ("r9") = f_;
    /* rax is both input (ID) and output (return value). The syscall instruction
     * overwrites rcx and r11, and the kernel may read or write memory. */
    __asm__ __volatile__("syscall"
        : "+r"(id)
        : "r"(a), "r"(b), "r"(c), "r"(d), "r"(e), "r"(f)
        : "rcx", "r11", "memory");
    return id;
}
I think there are some problems with doing syscalls completely this way, however, and perhaps some of this would apply to calling conventions in general.
The first problem is portability. You are relying on a few things here for these syscalls to turn out well:
- optimizations
- inlining
- the GCC extension behind __asm__ ("rax") (local register variables)
- GCC extended asm syntax
If you want to inline through methods other than optimizations, I don't think there is a good way. GCC's extended asm syntax has no constraint letters for certain x86-64 registers that we need, specifically r10, r8, and r9 (rax, rdi, rsi, and rdx do have them), so we can't name those registers directly as operands if we were to put everything into a single extended asm block behind a macro. You could manually write a mov for each value to put it into its specific register, but that relies on not running out of registers, and guarantees a minimum additional cost of all of those mov instructions.
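To make the constraint limitation concrete, here is a minimal sketch, assuming GCC or Clang on x86-64 Linux. The helper names sys3_letters and sys4_mixed are made up for illustration. The first helper gets by with constraint letters alone; the second needs r10, which has no letter, so it falls back to a local register variable for that one operand.

```c
#include <sys/syscall.h> /* SYS_* syscall numbers */

/* Hypothetical helper: three arguments, using only constraint letters.
 * "a" = rax, "D" = rdi, "S" = rsi, "d" = rdx. */
static __inline__ long sys3_letters(long id, long a, long b, long c)
{
    long ret;
    __asm__ __volatile__("syscall"
        : "=a"(ret)
        : "a"(id), "D"(a), "S"(b), "d"(c)
        : "rcx", "r11", "memory");
    return ret;
}

/* Hypothetical helper: the fourth argument belongs in r10, which has no
 * constraint letter, so it is pinned with a local register variable and
 * passed with the generic "r" constraint. */
static __inline__ long sys4_mixed(long id, long a, long b, long c, long d)
{
    long ret;
    register long r10 __asm__ ("r10") = d;
    __asm__ __volatile__("syscall"
        : "=a"(ret)
        : "a"(id), "D"(a), "S"(b), "d"(c), "r"(r10)
        : "rcx", "r11", "memory");
    return ret;
}
```

Either helper can be called as, for example, sys3_letters(SYS_getpid, 0, 0, 0); unused arguments are simply ignored by the kernel.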
I've searched a bit on the internet regarding calling conventions in C, and I couldn't really find anything that would let you write your own custom calling convention for some function, or even just specify the argument registers without an intermediate step, so you can't just make a nearly empty function that is only syscall and then ret.
But, this isn't that much of a problem. We can say that this does rely on optimizations and inlining, and, even if that doesn't properly happen in our obscure and weird C89 implementation, that is fine.
The next issue, however, lies with the number of instructions actually necessary to perform the syscall.
The minimum number of instructions to call a syscall is really just the syscall instruction, assuming that all of your arguments are already in the appropriate registers and that you do in fact want the return value in the RAX register. However, if we cannot assume that every register is already in place, we need to perform movs to put our arguments in the right places, and valuable program size can be wasted for the sake of setting up the call to the syscall.
"How else are you supposed to perform a syscall?" you might rhetorically ask. And you would be right in the sense that you do in fact need to put all of the arguments in the right places for every call. But, in certain scenarios, we can still reduce the cost for a user to call a syscall wrapper.
Say we had a syscall wrapper written like this, where a user uses the C calling convention to call into this function, which then moves the values of the registers to match the syscall calling convention. (I did only 5 syscall arguments because a sixth would arrive on the stack instead of in a register, and I didn't want to deal with reading it from the stack.)
__asm__ (
".globl __sysc5\n"
"__sysc5:\n"
"mov %rdi, %rax\n"
"mov %rsi, %rdi\n"
"mov %rdx, %rsi\n"
"mov %rcx, %rdx\n"
"mov %r8, %r10\n"
"mov %r9, %r8\n"
"syscall\n"
"ret\n"
);
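For completeness, the six-argument variant that was skipped above could look roughly like the following sketch. It assumes the System V AMD64 calling convention, where the seventh integer argument is passed on the stack and sits at 8(%rsp) on entry, just above the return address. The name sysc6_cc is hypothetical, and the .pushsection/.popsection lines are only there so the code lands in the text section regardless of where the compiler places the top-level asm.

```c
#include <sys/syscall.h> /* SYS_* syscall numbers */

/* Hypothetical name. C arguments: rdi, rsi, rdx, rcx, r8, r9, then the stack. */
long sysc6_cc(long id, long a, long b, long c, long d, long e, long f);

__asm__ (
".pushsection .text\n"
".globl sysc6_cc\n"
".type sysc6_cc, @function\n"
"sysc6_cc:\n"
"mov %rdi, %rax\n"      /* id */
"mov %rsi, %rdi\n"      /* a  */
"mov %rdx, %rsi\n"      /* b  */
"mov %rcx, %rdx\n"      /* c  */
"mov %r8, %r10\n"       /* d  */
"mov %r9, %r8\n"        /* e  */
"mov 8(%rsp), %r9\n"    /* f: seventh C argument, read from the stack */
"syscall\n"
"ret\n"
".popsection\n"
);
```

Nothing is popped here; the stack slot is only read, so the caller's cleanup stays valid.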
The caller of this function must make sure all the arguments are in the right place every time it makes a call. But what if we could reuse the arguments to the syscall?
Say you are reading stdin over and over and printing what you read to stdout.
long rn, wn, i;
char buf[256];
while ((rn = read(STDIN_FILENO, buf, 256)) > 0)
{
    i = 0;
    while (i < rn)
    {
        if ((wn = write(STDOUT_FILENO, buf + i, rn - i)) <= 0) return 0;
        i += wn;
    }
}
None of the arguments of the read call change, and only two of the arguments of the write call change. Therefore, let's try storing our arguments somewhere else.
long read_args[4], write_args[4]; /* 4 args, including the syscall ID */
read_args[0] = SYSCALL_ID_READ;
read_args[1] = STDIN_FILENO;
read_args[2] = (long)buf;
read_args[3] = 256;
write_args[0] = SYSCALL_ID_WRITE;
write_args[1] = STDOUT_FILENO;
while ((rn = (long)__sysc3(read_args)) > 0)
{
    i = 0;
    while (i < rn)
    {
        write_args[2] = (long)(buf + i);
        write_args[3] = (long)(rn - i);
        if ((wn = (long)__sysc3(write_args)) <= 0) return 0;
        i += wn;
    }
}
Now, with this argument reuse, we only need to pass in a pointer to our argument array, a single argument, to our syscall wrapper, and our own cost of performing the syscall should have decreased.
Of course, this does assume an appropriate assembly implementation of the syscall wrapper. And we have that:
long __sysc3(long const *arr);
__asm__ (
".globl __sysc3\n"
"__sysc3:\n"
"mov 0x18(%rdi), %rdx\n"
"mov 0x10(%rdi), %rsi\n"
"mov 0x00(%rdi), %rax\n"
"mov 0x08(%rdi), %rdi\n" /* clobbering rdi here should be last since we
* rely on rdi for the array argument */
"syscall\n"
"ret\n"
);
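Putting the pieces together, here is a compilable sketch of the argument-reuse idea, assuming GCC or Clang on x86-64 Linux. The names sysc3_arr and copy_fd are hypothetical, and SYS_read and SYS_write from <sys/syscall.h> stand in for the SYSCALL_ID_* constants used above. The loop is wrapped in a function taking two file descriptors so that it can be tried on something other than stdin and stdout.

```c
#include <sys/syscall.h> /* SYS_read, SYS_write */

/* Hypothetical name; arr is { id, arg1, arg2, arg3 }. */
long sysc3_arr(long const *arr);

__asm__ (
".pushsection .text\n"
".globl sysc3_arr\n"
".type sysc3_arr, @function\n"
"sysc3_arr:\n"
"mov 0x18(%rdi), %rdx\n"
"mov 0x10(%rdi), %rsi\n"
"mov 0x00(%rdi), %rax\n"
"mov 0x08(%rdi), %rdi\n" /* last, because it overwrites the array pointer */
"syscall\n"
"ret\n"
".popsection\n"
);

/* Copy everything from in_fd to out_fd.
 * Returns the number of bytes copied, or -1 if a read or write fails. */
long copy_fd(int in_fd, int out_fd)
{
    char buf[256];
    long read_args[4], write_args[4];
    long rn, wn, i, total = 0;

    /* The read arguments never change, so they are filled in once. */
    read_args[0] = SYS_read;
    read_args[1] = in_fd;
    read_args[2] = (long)buf;
    read_args[3] = (long)sizeof buf;
    /* Only the last two write arguments change per call. */
    write_args[0] = SYS_write;
    write_args[1] = out_fd;

    while ((rn = sysc3_arr(read_args)) > 0)
    {
        for (i = 0; i < rn; i += wn)
        {
            write_args[2] = (long)(buf + i);
            write_args[3] = rn - i;
            if ((wn = sysc3_arr(write_args)) <= 0) return -1;
        }
        total += rn;
    }
    return rn < 0 ? -1 : total;
}
```

Calling copy_fd(0, 1) would reproduce the stdin-to-stdout loop from earlier.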
At the call site, the cost of setting up the read syscall should now be a single mov instruction, and with a little more optimization it basically is. The following is gcc at -O0, which just spends an extra instruction using the RAX register as an intermediary register.
1229: 7c 8e jl 11b9 <main+0x6e>
122b: 48 8d 85 b0 fe ff ff lea -0x150(%rbp),%rax
1232: 48 89 c7 mov %rax,%rdi
1235: e8 ff fe ff ff call 1139 <__sysc3>
123a: 48 89 85 a0 fe ff ff mov %rax,-0x160(%rbp)
1241: 48 83 bd a0 fe ff ff cmpq $0x0,-0x160(%rbp)
1248: 00
Under -Oz it is just a mov from one register to another.
1092: 48 83 64 24 10 00 andq $0x0,0x10(%rsp)
1098: 4c 89 ef mov %r13,%rdi
109b: e8 69 01 00 00 call 1209 <__sysc3>
10a0: 48 93 xchg %rax,%rbx
There is something else we can do, though this is something that is technically unnecessary.
Currently, the assembly for our array argument syscall implementation looks like this:
long __sysc0(long const *arr);
long __sysc1(long const *arr);
long __sysc2(long const *arr);
long __sysc3(long const *arr);
long __sysc4(long const *arr);
long __sysc5(long const *arr);
long __sysc6(long const *arr);
__asm__ (
".globl __sysc0\n"
"__sysc0:\n"
"mov 0x00(%rdi), %rax\n"
"syscall\n"
"ret\n"
);
__asm__ (
".globl __sysc1\n"
"__sysc1:\n"
"mov 0x00(%rdi), %rax\n"
"mov 0x08(%rdi), %rdi\n"
"syscall\n"
"ret\n"
);
__asm__ (
".globl __sysc2\n"
"__sysc2:\n"
"mov 0x10(%rdi), %rsi\n"
"mov 0x00(%rdi), %rax\n"
"mov 0x08(%rdi), %rdi\n"
"syscall\n"
"ret\n"
);
__asm__ (
".globl __sysc3\n"
"__sysc3:\n"
"mov 0x18(%rdi), %rdx\n"
"mov 0x10(%rdi), %rsi\n"
"mov 0x00(%rdi), %rax\n"
"mov 0x08(%rdi), %rdi\n"
"syscall\n"
"ret\n"
);
__asm__ (
".globl __sysc4\n"
"__sysc4:\n"
"mov 0x20(%rdi), %r10\n"
"mov 0x18(%rdi), %rdx\n"
"mov 0x10(%rdi), %rsi\n"
"mov 0x00(%rdi), %rax\n"
"mov 0x08(%rdi), %rdi\n"
"syscall\n"
"ret\n"
);
__asm__ (
".globl __sysc5\n"
"__sysc5:\n"
"mov 0x28(%rdi), %r8\n"
"mov 0x20(%rdi), %r10\n"
"mov 0x18(%rdi), %rdx\n"
"mov 0x10(%rdi), %rsi\n"
"mov 0x00(%rdi), %rax\n"
"mov 0x08(%rdi), %rdi\n"
"syscall\n"
"ret\n"
);
__asm__ (
".globl __sysc6\n"
"__sysc6:\n"
"mov 0x30(%rdi), %r9\n"
"mov 0x28(%rdi), %r8\n"
"mov 0x20(%rdi), %r10\n"
"mov 0x18(%rdi), %rdx\n"
"mov 0x10(%rdi), %rsi\n"
"mov 0x00(%rdi), %rax\n"
"mov 0x08(%rdi), %rdi\n"
"syscall\n"
"ret\n"
);
When I look at this, I see a lot of duplicated code. All of __sysc5 exists within __sysc6, so, surely, we could combine them? The answer is yes; the following code will still work the same.
__asm__ (
".globl __sysc5\n"
".globl __sysc6\n"
"__sysc6:\n"
"mov 0x30(%rdi), %r9\n"
"__sysc5:\n"
"mov 0x28(%rdi), %r8\n"
"mov 0x20(%rdi), %r10\n"
"mov 0x18(%rdi), %rdx\n"
"mov 0x10(%rdi), %rsi\n"
"mov 0x00(%rdi), %rax\n"
"mov 0x08(%rdi), %rdi\n"
"syscall\n"
"ret\n"
);
And we can do this with some of the others:
__asm__ (
".globl __sysc1\n"
".globl __sysc2\n"
".globl __sysc3\n"
".globl __sysc4\n"
".globl __sysc5\n"
".globl __sysc6\n"
"__sysc6:\n"
"mov 0x30(%rdi), %r9\n"
"__sysc5:\n"
"mov 0x28(%rdi), %r8\n"
"__sysc4:\n"
"mov 0x20(%rdi), %r10\n"
"__sysc3:\n"
"mov 0x18(%rdi), %rdx\n"
"__sysc2:\n"
"mov 0x10(%rdi), %rsi\n"
"__sysc1:\n"
"mov 0x00(%rdi), %rax\n"
"mov 0x08(%rdi), %rdi\n"
"syscall\n"
"ret\n"
);
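This works fine, all the way up to __sysc1. However, this currently doesn't work with __sysc0. __sysc0 needs the load into RAX but not the load into RDI, so its label would have to sit after the RDI load and before the RAX load. Since the RDI load overwrites the array pointer, it has to come last, and we therefore cannot just throw the __sysc0: label in there.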
To fix this, there are a couple of solutions. One solution would be to just use a temporary register to store the value we want in the RDI register, and then after using the array in RDI to retrieve the value for the RAX register, we can then clobber RDI.
__asm__ (
".globl __sysc0\n"
".globl __sysc1\n"
".globl __sysc2\n"
".globl __sysc3\n"
".globl __sysc4\n"
".globl __sysc5\n"
".globl __sysc6\n"
"__sysc6:\n"
"mov 0x30(%rdi), %r9\n"
"__sysc5:\n"
"mov 0x28(%rdi), %r8\n"
"__sysc4:\n"
"mov 0x20(%rdi), %r10\n"
"__sysc3:\n"
"mov 0x18(%rdi), %rdx\n"
"__sysc2:\n"
"mov 0x10(%rdi), %rsi\n"
"__sysc1:\n"
"mov 0x08(%rdi), %rcx\n"
"__sysc0:\n"
"mov 0x00(%rdi), %rax\n"
"mov %rcx, %rdi\n"
"syscall\n"
"ret\n"
);
This solution does come at the cost of another instruction for the sake of the temporary register, but it works fine with our existing format. (When entered at __sysc0, whatever happens to be in RCX is copied into RDI, which is harmless because a zero-argument syscall ignores RDI.) The thing about this, though, is that using arrays for __sysc0 makes no sense, because the only thing the syscall takes is the ID. So we might as well make another subroutine just for __sysc0 that doesn't use an array, and make its one argument the ID.
long __sysc0(long id);
__asm__ (
".globl __sysc0\n"
"__sysc0:\n"
"mov %rdi, %rax\n"
"syscall\n"
"ret\n"
);
In my search for other usable calling conventions in C, I did see something that I thought looked promising, but it didn't work out. The documentation for the regparm attribute says that it can put an argument into EAX, but it didn't seem to do that for me. Most likely that is because I am on x86-64, and the attribute is documented for x86-32 only.
Now that we are in the range of the original x86 registers, with none of the x86-64-only registers that the syscalls with more arguments need, surely we can actually use inline assembly for this? I couldn't get it to work with just a syscall instruction in extended asm, though, because it didn't like specifying a constant for a specific register.
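I don't know exactly which form failed here, but one form that should accept a constant ID is the "a" constraint, which ties an operand to RAX. The following is a minimal sketch under that assumption, for GCC or Clang on x86-64 Linux; the name sysc0_inline is hypothetical.

```c
#include <sys/syscall.h> /* SYS_* syscall numbers */

/* Hypothetical helper: zero-argument syscall done entirely in extended asm.
 * "a" places the ID in rax on the way in, "=a" reads the result from rax. */
static __inline__ long sysc0_inline(long id)
{
    long ret;
    __asm__ __volatile__("syscall"
        : "=a"(ret)
        : "a"(id)
        : "rcx", "r11", "memory");
    return ret;
}
```

With this, sysc0_inline(SYS_getpid) passes a compile-time constant as the ID, and the compiler is free to load it straight into RAX.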
Anyway, this is all fine. We have ourselves syscall wrappers for all 7 argument counts, 0 through 6, which (apart from __sysc0) take in arrays and minimize the cost of performing a single syscall.