Bare Hello World in C

Created Sunday 28 July 2024


The basic "Hello, world!" program in C that everyone knows looks like this:

#include <stdio.h>

int main(void)
{
  printf("Hello, world!\n");
  return 0;
}

$ gcc -Oz main.c && strip a.out && ./a.out
Hello, world!
$ ldd a.out
	linux-vdso.so.1 (0x0000747611e1f000)
	libc.so.6 => /usr/lib/libc.so.6 (0x0000747611bfc000)
	/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x0000747611e21000)

This results in a 15K binary that depends on libc.


I am a Zig programmer as well. When we compile a hello world with Zig, we get a 4.8K binary:


const std = @import("std");

pub fn main() !void {
    try std.io.getStdOut().writer().writeAll("Hello, world!\n");
}

$ zig build run -Doptimize=ReleaseSmall
Hello, world!
$ ldd ./zig-out/bin/zigtest0
	not a dynamic executable

And, not to mention, it isn't even linked to libc! What gives?


Well, there are a couple of drawbacks in Zig. Zig does not enable PIE by default, while the C toolchain seems to. So, we'll enable that in build.zig. Next, checksec says there are no stack canaries. It doesn't report stack canaries for the C program either, for that matter, but I suspect that is because our C program only has a main function that calls off to printf, while Zig must have many more functions of its own. At the moment, in order to enable stack canaries in Zig, you must both link libc and compile with a safe optimization mode like ReleaseSafe.


Enabling all of that for the Zig compiler brings us up all the way to... 78K ...what? Compiling with the same settings except without libc brings that down to 17K, and 6K on ReleaseSmall. It seems that linking libc is adding a boatload of extra stuff.


So, I tried static linking libc in the hopes that it would be able to optimize out a lot of the libc code while also having stack protectors.


glibc didn't like static linking, according to the Zig compiler:

$ zig build run -Dtarget="x86_64-linux-gnu" -Doptimize=ReleaseSafe
run
└─ run zigtest0
   └─ zig build-exe zigtest0 ReleaseSafe x86_64-linux-gnu failure
error: error: libc of the specified target requires dynamic linking

So then I tried with musl, and that compiled. But checksec still reports no stack canaries!


That's fine, I guess. We don't need to put down stack canaries. I guess. And so, if that is the case, then we'll just go for the ReleaseSmall PIE no-libc version. Which is 6K, and still a lot less than the C program. As a side note, gcc does seem to be able to statically include glibc. But, whatever.


Zig's standard library is made to be libc-optional: it works with libc, and it works without it. Instead of relying on libc, as is standard for many other compilers, Zig uses its own standard library as a replacement, handling all of the necessary startup code that you might normally rely on libc for.


But C is supposed to be the small, low-level language! Where you have control over everything! Why is Zig, the fancy new modern kid on the block, doing so much better out of the gate?


Well, that's because libc tends to be the default, and defaults aren't always the best, or even good.


Alright, so Zig does it better at first. But we're C, the low level programming language that can control everything. Can't we do the same, if not better?


The Zig standard library makes syscalls directly. In fact, looking at the Zig program's objdump -D, all of the syscalls used in the program appear to have ended up inlined.


So, let's try to do the same:


#include <unistd.h>

int main(void)
{
  write(STDOUT_FILENO, "Hello, world!\n", 14);
  return 0;
}

$ gcc -Oz main.c && strip a.out && ./a.out
Hello, world!

...and it still compiled down to 15K. We are still technically using libc, as we're actually using libc's syscall wrappers to do everything, and so libc is still doing everything else, including the startup code.


So, we simply must write our own startup code. And our own syscall wrapper.


typedef unsigned long ulong;

__asm__ (
  /* The first three arguments in the C calling convention (rdi, rsi, rdx)
   * line up with the first three syscall arguments, so we don't need to
   * move anything around other than the syscall ID, which goes in rax. */
  ".globl _sysc1\n"
  ".globl _sysc3\n"
  "_sysc3:\n" 
  "mov %rcx, %rax\n"
  "syscall\n"
  "ret\n"
  "_sysc1:\n"
  "mov %rsi, %rax\n"
  "syscall\n"
  "ret\n"
  );

ulong _sysc3(ulong a, ulong b, ulong c, ulong id);
ulong _sysc1(ulong a, ulong id);

/* Referenced https://github.com/ziglang/zig/blob/c15755092821c5c27727ebf416689084eab5b73e/lib/std/os/linux/syscalls.zig#L453 */
#define SYS_WRITE 1
#define SYS_EXIT  60

#define STDOUT_FILENO 1

void _start(void)
{
  _sysc3(STDOUT_FILENO, (ulong)"Hello, world!\n", 14, SYS_WRITE);
  _sysc1(0, SYS_EXIT);
}

Now, that program should be pretty small, right? All it contains is a "Hello, world!\n" string stored somewhere, two function calls, and two syscall definitions.


$ gcc -Oz -fno-builtin -nostdlib -ffreestanding -Wl,--no-dynamic-linker -no-pie -fno-stack-protector main.c && strip a.out && ./a.out
Hello, world!

And this comes out to... 8.8K. Still larger than Zig, but it is doing a lot better. In fact, the only executable code is this:


0000000000401000 <.text>:
  401000:       48 89 c8                mov    %rcx,%rax
  401003:       0f 05                   syscall
  401005:       c3                      ret
  401006:       48 89 f0                mov    %rsi,%rax
  401009:       0f 05                   syscall
  40100b:       c3                      ret
  40100c:       50                      push   %rax
  40100d:       48 8d 35 ec 0f 00 00    lea    0xfec(%rip),%rsi        # 0x402000
  401014:       6a 01                   push   $0x1
  401016:       59                      pop    %rcx
  401017:       6a 0e                   push   $0xe
  401019:       5a                      pop    %rdx
  40101a:       6a 01                   push   $0x1
  40101c:       5f                      pop    %rdi
  40101d:       e8 de ff ff ff          call   0x401000
  401022:       6a 3c                   push   $0x3c
  401024:       31 ff                   xor    %edi,%edi
  401026:       5e                      pop    %rsi
  401027:       5a                      pop    %rdx
  401028:       e9 d9 ff ff ff          jmp    0x401006


That itself is only 45 bytes. And the .rodata section takes up only another 14. So then why is the entire thing 8984 bytes?


Looking at readelf -a for both of the programs, it is strange, because Zig has even more program headers and sections. Then, I noticed something: gcc seems to be aligning the offset of each of the program headers to its corresponding alignment. However, Zig is not doing that.


C program:

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000000254 0x0000000000000254  R      0x1000
  LOAD           0x0000000000001000 0x0000000000401000 0x0000000000401000
                 0x000000000000002d 0x000000000000002d  R E    0x1000
  LOAD           0x0000000000002000 0x0000000000402000 0x0000000000402000
                 0x0000000000000058 0x0000000000000058  R      0x1000
  NOTE           0x0000000000000200 0x0000000000400200 0x0000000000400200
                 0x0000000000000030 0x0000000000000030  R      0x8
  NOTE           0x0000000000000230 0x0000000000400230 0x0000000000400230
                 0x0000000000000024 0x0000000000000024  R      0x4
  GNU_PROPERTY   0x0000000000000200 0x0000000000400200 0x0000000000400200
                 0x0000000000000030 0x0000000000000030  R      0x8
  GNU_EH_FRAME   0x0000000000002010 0x0000000000402010 0x0000000000402010
                 0x0000000000000014 0x0000000000000014  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10

Zig program:

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000001f8 0x00000000000001f8  R      0x8
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000968 0x0000000000000968  R      0x1000
  LOAD           0x0000000000000968 0x0000000000001968 0x0000000000001968
                 0x00000000000006d1 0x00000000000006d1  R E    0x1000
  LOAD           0x0000000000001040 0x0000000000003040 0x0000000000003040
                 0x00000000000002a0 0x0000000000000fc0  RW     0x1000
  LOAD           0x00000000000012e0 0x00000000000042e0 0x00000000000042e0
                 0x0000000000000004 0x0000000000000041  RW     0x1000
  DYNAMIC        0x0000000000001200 0x0000000000003200 0x0000000000003200
                 0x00000000000000e0 0x00000000000000e0  RW     0x8
  GNU_RELRO      0x0000000000001040 0x0000000000003040 0x0000000000003040
                 0x00000000000002a0 0x0000000000000fc0  R      0x1
  GNU_EH_FRAME   0x0000000000000708 0x0000000000000708 0x0000000000000708
                 0x0000000000000064 0x0000000000000064  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000001000000  RW     0x0
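
One way to read the two listings above: for loadable segments, the ELF format asks that the file offset and the virtual address agree modulo the alignment, not that the offset itself be a multiple of it. The sketch below checks both conditions against values copied from the tables. The helper names are made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical helper: true if the file offset and the virtual address
 * agree modulo the alignment (the ELF rule for loadable segments). */
static bool offset_matches_vaddr(unsigned long offset, unsigned long vaddr,
                                 unsigned long align)
{
  return (offset % align) == (vaddr % align);
}

/* Hypothetical helper: true if the file offset itself is a multiple of
 * the alignment, which is stricter than the rule above. */
static bool offset_is_aligned(unsigned long offset, unsigned long align)
{
  return offset % align == 0;
}
```

For Zig's executable segment, offset_matches_vaddr(0x968, 0x1968, 0x1000) is true while offset_is_aligned(0x968, 0x1000) is false. For the C program's executable segment (offset 0x1000, address 0x401000), both are true.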


Zig's highest offset only goes up to 0x12e0, or 4832 bytes. The C program's highest offset is 0x2010, or 8208 bytes. Those numbers seem proportional to the actual binary sizes.


The reason for the C program's size seems to be that its linker forces the offsets to be aligned, while Zig takes advantage of not needing to. So, is there a way to un-align our C program's program header offsets?
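
To put rough numbers on that, here is a small sketch that lays out the C program's three LOAD segments (file sizes 0x254, 0x2d and 0x58 from the listing) one after another, rounding each start offset up to a given alignment. The function names are made up for illustration, and the model ignores everything outside those segments, such as the section header table.

```c
#include <stddef.h>

/* Round x up to the next multiple of align (align must be a power of two). */
static unsigned long align_up(unsigned long x, unsigned long align)
{
  return (x + align - 1) & ~(align - 1);
}

/* Hypothetical model: place segments back to back, each starting at an
 * offset rounded up to align, and return the end offset of the last one. */
static unsigned long layout_end(const unsigned long *sizes, size_t n,
                                unsigned long align)
{
  unsigned long end = 0;
  for (size_t i = 0; i < n; i++) {
    unsigned long start = align_up(end, align);
    end = start + sizes[i];
  }
  return end;
}
```

With an alignment of 0x1000 the last segment ends at 0x2058 (8280 bytes), which is most of the 8984-byte file. With an alignment of 1 the same three segments would end at 0x2d9 (729 bytes). That second figure is only a lower bound for this simplified model, not something a real linker is guaranteed to produce.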


I thought maybe it had something to do with Zig using LLVM and GCC not using LLVM, so I tried compiling with Clang, but that only reduced it by 300 bytes, and the program header offsets were still aligned.


We're not out of luck though: we still have the option of writing a linker script. And with that, we should be able to explicitly specify the alignment of the program sections.


ENTRY(_start);
SECTIONS
{
  . = 0x10000 + SIZEOF_HEADERS;
  .text ALIGN(16) (READONLY) :
  {
    *(.text)
    *(.text*)
  }
  .rodata ALIGN(16) (READONLY) :
  {
    *(.rodata)
    *(.rodata*)
  }
}


With this script, the output is no longer constrained to 0x1000 alignment. And, surprisingly, we beat Zig.


$ gcc -g -Oz -fno-builtin -nostdlib -ffreestanding -Wl,--no-dynamic-linker -no-pie -fno-stack-protector -T linker.ld main.c && strip a.out && ./a.out
Hello, world!

This binary executable is a mere 1384 bytes, more than 4 times smaller than our 6K Zig binary.


Of course, we also need an appropriate Zig reference. Our original example is no longer a fair comparison, since we're now not using PIE and are just doing a single write syscall (which is technically not correct; the correct way would be to repeat the call until all of the data has been written, but we'll rely on the syscall on Linux tending to work the first time anyway).


const std = @import("std");

pub export fn _start() void {
    _ = std.os.linux.write(std.os.linux.STDOUT_FILENO, "Hello, world!\n", 14);
    std.os.linux.exit(0);
}
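
As an aside on the partial-write caveat mentioned above, the "repeat until everything is written" loop might look like the following in C. This is a sketch that uses libc's write() so that it can be compiled and tried on its own; write_all is a made-up name. A freestanding version would call a raw syscall wrapper instead and check for a negative return value rather than errno.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes are out.
 * Returns 0 on success, -1 on error (errno is left as set by write()). */
static int write_all(int fd, const char *buf, size_t len)
{
  while (len > 0) {
    ssize_t n = write(fd, buf, len);
    if (n < 0) {
      if (errno == EINTR)
        continue;          /* interrupted before writing anything: retry */
      return -1;
    }
    buf += n;              /* skip past the bytes that were written */
    len -= (size_t)n;
  }
  return 0;
}
```

A call such as write_all(STDOUT_FILENO, "Hello, world!\n", 14) would replace the single write. For a 14-byte write to a terminal the loop body will almost always run once, which is why the shortcut works in practice.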



And the result is...!


Zig still beats us at 1032 bytes. That was honestly quite funny when I saw that.


But, remember, this is only by 352 bytes. That's not a lot of data. Let's directly compare the sections:



At the very least, this accounts for 105 of those bytes. The C program also has more section headers and more program headers, so those probably take up some space as well. We are also forcing alignment to 16, which is not necessary, and some bytes may be lost to that.
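
To get a feel for how much the headers themselves cost, the sizes can be read off the structs in <elf.h>. This is a sketch assuming a Linux system where that header is available; elf64_header_overhead is a made-up name.

```c
#include <elf.h>
#include <stddef.h>

/* Hypothetical helper: bytes taken by the ELF header plus the program
 * header table and section header table of a 64-bit ELF file. */
static size_t elf64_header_overhead(size_t phnum, size_t shnum)
{
  return sizeof(Elf64_Ehdr)
       + phnum * sizeof(Elf64_Phdr)
       + shnum * sizeof(Elf64_Shdr);
}
```

On x86-64 the ELF header is 64 bytes, each program header is 56 bytes and each section header is 64 bytes. As a sanity check against the earlier readelf output, elf64_header_overhead(8, 0) is 512, which matches the first NOTE of the eight-header C binary starting at offset 0x200, and nine program headers take 504 bytes, which matches the 0x1f8 size of Zig's PHDR entry. So every program header or section header that can be dropped should save 56 or 64 bytes respectively.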


ENTRY(_start);
SECTIONS
{
  . = 0x10000 + SIZEOF_HEADERS;
  .text ALIGN(1) (READONLY) :
  {
    *(.text)
    *(.text*)
  }
  .rodata ALIGN(1) (READONLY) :
  {
    *(.rodata)
    *(.rodata*)
  }
  /DISCARD/ :
  {
    *(.note.gnu.property)
    *(.note.gnu.build-id)
  }
}


$ gcc -g -Oz -fno-builtin -nostdlib -ffreestanding -Wl,--no-dynamic-linker -Wl,--build-id=none -no-pie -fno-stack-protector -T linker.ld main.c && strip a.out && ./a.out
Hello, world!


So, by removing those two additional sections and changing the alignment from 16 to 1, we end up with 960 bytes. We beat Zig! For now. I'm sure if I tried for longer, I could get the Zig program to take up a comparable amount of space, but I think this is enough to show how small C can actually get if you get rid of the bloat.