How to execute an object file: part 4, AArch64 edition

Translating source code written in a high-level programming language into an executable binary typically involves a series of steps, namely compiling and assembling the code into object files, and then linking those object files into the final executable. However, there are certain scenarios where it can be useful to apply an alternate approach that involves executing object files directly, bypassing the linker. For example, we might use it for malware analysis or when part of the code requires an incompatible compiler. We’ll be focusing on the latter scenario: when one of our libraries needed to be compiled differently from the rest of the code. Learning how to execute an object file directly will give you a much better sense of how code is compiled and linked together.

To demonstrate how this was done, we have previously published a series of posts on executing an object file:

The initial posts are dedicated to the x86 architecture. Since then the fleet of our working machines has expanded to include a large and growing number of ARM CPUs. This time we’ll repeat this exercise for the aarch64 architecture. You can pause here to read the previous blog posts before proceeding with this one, or read through the brief summary below and reference the earlier posts for more detail. We might reiterate some theory as working with ELF files can be daunting, if it’s not your day-to-day routine. Also, please be mindful that for simplicity, these examples omit bounds and integrity checks. Let the journey begin!

Introduction

In order to obtain an object file or an executable binary from a high-level compiled programming language the code needs to be processed by three components: compiler, assembler and linker. The compiler generates an assembly listing. This assembly listing is picked up by the assembler and translated into an object file. All source files, if a program contains multiple, go through these two steps generating an object file for each source file. At the final step the linker unites all object files into one binary, additionally resolving references to the shared libraries (i.e. we don’t implement the printf function each time, rather we take it from a system library). Even though the approach is platform independent, the compiler output varies by platform as the assembly listing is closely tied to the CPU architecture.

GCC (GNU Compiler Collection) can run each step: compiler, assembler and linker separately for us:

main.c:

#include <stdio.h>

int main(void)
{
	puts("Hello, world!");
	return 0;
}

Compiler (output main.s - assembly listing):

$ gcc -S main.c
$ ls
main.c  main.s

Assembler (output main.o - an object file):

$ gcc -c main.s -o main.o
$ ls
main.c  main.o  main.s

Linker (main - an object file):

$ gcc main.o -o main
$ ls
main  main.c  main.o  main.s
$ ./main
Hello, world!

All the examples assume gcc is running on a native aarch64 architecture or include a cross compilation flag for those who want to reproduce and have no aarch64.

We have two object files in the output above: main.o and main. Object files are files encoded with the ELF (Executable and Linkable Format) standard. Although, main.o is an ELF file, it doesn’t contain all the information to be fully executable.

$ file main.o
main.o: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), not stripped

$ file main
main: ELF 64-bit LSB pie executable, ARM aarch64, version 1 (SYSV), dynamically
linked, interpreter /lib/ld-linux-aarch64.so.1,
BuildID[sha1]=d3ecd2f8ac3b2dec11ed4cc424f15b3e1f130dd4, for GNU/Linux 3.7.0, not stripped

The ELF File

The central idea of this series of blog posts is to understand how to resolve dependencies from object files without directly involving the linker. For illustrative purposes we generated an object file based on some C-code and used it as a library for our main program. Before switching to the code, we need to understand the basics of the ELF structure.

Each ELF file is made up of one ELF header, followed by file data. The data can include: a program header table, a section header table, and the data which is referred to by the program or section header tables.

The ELF Header

The ELF header provides some basic information about the file: what architecture the file is compiled for, the program entry point and the references to other tables.

The ELF Header:

$ readelf -h main
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Position-Independent Executable file)
  Machine:                           AArch64
  Version:                           0x1
  Entry point address:               0x640
  Start of program headers:          64 (bytes into file)
  Start of section headers:          68576 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         29
  Section header string table index: 28

The ELF Program Header

The execution process of almost every program starts from an auxiliary program, called loader, which arranges the memory and calls the program’s entry point. In the following output the loader is marked with a line “Requesting program interpreter: /lib/ld-linux-aarch64.so.1”. The whole program memory is split into different segments with associated size, permissions and type (which instructs the loader on how to interpret this block of memory). Because the execution process should be performed in the shortest possible time, the sections with the same characteristics and located nearby are grouped into bigger blocks — segments — and placed in the program header. We can say that the program header summarizes the types of data that appear in the section header.

The ELF Program Header:

$ readelf -Wl main

Elf file type is DYN (Position-Independent Executable file)
Entry point 0x640
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R   0x8
  INTERP         0x000238 0x0000000000000238 0x0000000000000238 0x00001b 0x00001b R   0x1
      [Requesting program interpreter: /lib/ld-linux-aarch64.so.1]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x00088c 0x00088c R E 0x10000
  LOAD           0x00fdc8 0x000000000001fdc8 0x000000000001fdc8 0x000270 0x000278 RW  0x10000
  DYNAMIC        0x00fdd8 0x000000000001fdd8 0x000000000001fdd8 0x0001e0 0x0001e0 RW  0x8
  NOTE           0x000254 0x0000000000000254 0x0000000000000254 0x000044 0x000044 R   0x4
  GNU_EH_FRAME   0x0007a0 0x00000000000007a0 0x00000000000007a0 0x00003c 0x00003c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x00fdc8 0x000000000001fdc8 0x000000000001fdc8 0x000238 0x000238 R   0x1

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame 
   03     .init_array .fini_array .dynamic .got .got.plt .data .bss 
   04     .dynamic 
   05     .note.gnu.build-id .note.ABI-tag 
   06     .eh_frame_hdr 
   07     
   08     .init_array .fini_array .dynamic .got

The ELF Section Header

In the source code of high-level languages, variables, functions, and constants are mixed together. However, in assembly you might see that the data and instructions are separated into different blocks. The ELF file content is divided in an even more granular way. For example, variables with initial values are placed into different sections than the uninitialized ones. This approach optimizes for space, otherwise the values for uninitialized variables would be filled with zeros. Along with the space efficiency, there are security reasons for stratification — executable instructions can’t have writable permissions, while memory containing variables can't be executable. The section header describes each of these sections.

The ELF Section Header:

$ readelf -SW main
There are 29 section headers, starting at offset 0x10be0:

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .interp           PROGBITS        0000000000000238 000238 00001b 00   A  0   0  1
  [ 2] .note.gnu.build-id NOTE            0000000000000254 000254 000024 00   A  0   0  4
  [ 3] .note.ABI-tag     NOTE            0000000000000278 000278 000020 00   A  0   0  4
  [ 4] .gnu.hash         GNU_HASH        0000000000000298 000298 00001c 00   A  5   0  8
  [ 5] .dynsym           DYNSYM          00000000000002b8 0002b8 0000f0 18   A  6   3  8
  [ 6] .dynstr           STRTAB          00000000000003a8 0003a8 000092 00   A  0   0  1
  [ 7] .gnu.version      VERSYM          000000000000043a 00043a 000014 02   A  5   0  2
  [ 8] .gnu.version_r    VERNEED         0000000000000450 000450 000030 00   A  6   1  8
  [ 9] .rela.dyn         RELA            0000000000000480 000480 0000c0 18   A  5   0  8
  [10] .rela.plt         RELA            0000000000000540 000540 000078 18  AI  5  22  8
  [11] .init             PROGBITS        00000000000005b8 0005b8 000018 00  AX  0   0  4
  [12] .plt              PROGBITS        00000000000005d0 0005d0 000070 00  AX  0   0 16
  [13] .text             PROGBITS        0000000000000640 000640 000134 00  AX  0   0 64
  [14] .fini             PROGBITS        0000000000000774 000774 000014 00  AX  0   0  4
  [15] .rodata           PROGBITS        0000000000000788 000788 000016 00   A  0   0  8
  [16] .eh_frame_hdr     PROGBITS        00000000000007a0 0007a0 00003c 00   A  0   0  4
  [17] .eh_frame         PROGBITS        00000000000007e0 0007e0 0000ac 00   A  0   0  8
  [18] .init_array       INIT_ARRAY      000000000001fdc8 00fdc8 000008 08  WA  0   0  8
  [19] .fini_array       FINI_ARRAY      000000000001fdd0 00fdd0 000008 08  WA  0   0  8
  [20] .dynamic          DYNAMIC         000000000001fdd8 00fdd8 0001e0 10  WA  6   0  8
  [21] .got              PROGBITS        000000000001ffb8 00ffb8 000030 08  WA  0   0  8
  [22] .got.plt          PROGBITS        000000000001ffe8 00ffe8 000040 08  WA  0   0  8
  [23] .data             PROGBITS        0000000000020028 010028 000010 00  WA  0   0  8
  [24] .bss              NOBITS          0000000000020038 010038 000008 00  WA  0   0  1
  [25] .comment          PROGBITS        0000000000000000 010038 00001f 01  MS  0   0  1
  [26] .symtab           SYMTAB          0000000000000000 010058 000858 18     27  66  8
  [27] .strtab           STRTAB          0000000000000000 0108b0 00022c 00      0   0  1
  [28] .shstrtab         STRTAB          0000000000000000 010adc 000103 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  D (mbind), p (processor specific)

Executing example from Part 1 on aarch64

Actually, our initial code from Part 1 works on aarch64 as is!

Let’s have a quick summary about what was done in the code:

We need to find the code of two functions (add5 and add10) in the .text section of our object file (obj.o)
Load the functions in the executable memory
Return the memory locations of the functions to the main program

There is one nuance: even though all the sections are in the section header, neither of them have a string name. Without the names we can’t identify them. However, having an additional character field for each section in the ELF structure would be inefficient for the space — it must be limited by some maximum length and those names which are shorter would leave the space unfilled. Instead, ELF provides an additional section, .shstrtab. This string table concatenates all the names where each name ends with a null terminated byte. We can iterate over the names and match with an offset held by other sections to reference their name. But how do we find .shstrtab itself if we don’t have a name? To solve this chicken and egg problem, the ELF program header provides a direct pointer to .shstrtab. The similar approach is applied to two other sections: .symtab and .strtab. Where .symtab contains all information about the symbols and .strtab holds the list of symbol names. In the code we work with these tables to resolve all their dependencies and find our functions.

Executing example from Part 2 on aarch64

At the beginning of the second blog post on how to execute an object file we made the function add10 depend on add5 instead of being self-contained. This is the first time when we faced relocations. Relocations is the process of loading symbols defined outside the current scope. The relocated symbols can present global or thread-local variables, constant, functions, etc. We’ll start from checking assembly instructions which trigger relocations and uncovering how the ELF format handles them in a more general way.

After making add10 depend on add5 our aarch64 version stopped working as well, similarly to the x86. Let’s take a look at assembly listing:

$ objdump --disassemble --section=.text obj.o

obj.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <add5>:
   0:	d10043ff 	sub	sp, sp, #0x10
   4:	b9000fe0 	str	w0, [sp, #12]
   8:	b9400fe0 	ldr	w0, [sp, #12]
   c:	11001400 	add	w0, w0, #0x5
  10:	910043ff 	add	sp, sp, #0x10
  14:	d65f03c0 	ret

0000000000000018 <add10>:
  18:	a9be7bfd 	stp	x29, x30, [sp, #-32]!
  1c:	910003fd 	mov	x29, sp
  20:	b9001fe0 	str	w0, [sp, #28]
  24:	b9401fe0 	ldr	w0, [sp, #28]
  28:	94000000 	bl	0 <add5>
  2c:	b9001fe0 	str	w0, [sp, #28]
  30:	b9401fe0 	ldr	w0, [sp, #28]
  34:	94000000 	bl	0 <add5>
  38:	a8c27bfd 	ldp	x29, x30, [sp], #32
  3c:	d65f03c0 	ret

Have you noticed that all the hex values in the second column are exactly the same length, in contrast with the instructions lengths seen for x86 in Part 2 of our series? This is because all Armv8-A instructions are presented in 32 bits. Since it is impossible to encode every immediate value into less than 32 bits, some operations require more than one instruction, as we’ll see later. For now, we’re interested in one instruction - bl (branch with link) on rows 28 and 34. The bl is a “jump” instruction, but before the jump it preserves the next instruction after the current one in the link register (lr). When the callee finishes execution the caller address is recovered from lr. Usually, the aarch64 instructions reserve the last 6 bits [31:26] for opcode and some auxiliary fields such as running architecture (32 or 64 bits), condition flag and others. Remaining bits are shared between arguments like source register, destination register and immediate value. Since the bl instruction does not require a source or destination register, the full 26 bits can be used to encode the immediate offset instead. However, 26 bits can only encode a small range (+/-32 MB), but because the jump can only target a beginning of an instruction, it must always be aligned to 4 bytes, which increases the effective range of the encoded immediate fourfold, to +/-128 MB.

Similarly to what we did in Part 2 we’re going to resolve our relocations - first by manually calculating the correct addresses and then by using an approach similar to what the linker does. The current value of our bl instruction is 94000000 or in binary representation 100101**00000000000000000000000000**. All 26 bits are zeros, so we don't jump anywhere. The address is calculated by an offset from the current program counter (pc), which can be positive or negative. In our case we expect it to be -0x28 and -0x34. As described above, it should be divided by 4 and taken as two's complements: -0x28 / 4 = -0xA == 0xFFFFFFF6 and -0x34 / 4 = -0xD == 0xFFFFFFF3. From these values we need to take the lower 26 bits and concatenate them with the initial 6 bits to get the final instruction. So, the final ones will be: 100101**11111111111111111111110110** == 0x97FFFFF6 and 100101**11111111111111111111110011** == 0x97FFFFF3. Have you noticed that all the distance calculations are done relative to the bl (or current pc), not the next instruction as in x86?

Let’s add to the code and execute:

... 

static void parse_obj(void)
{
	...
	/* copy the contents of `.text` section from the ELF file */
	memcpy(text_runtime_base, obj.base + text_hdr->sh_offset, text_hdr->sh_size);

	*((uint32_t *)(text_runtime_base + 0x28)) = 0x97FFFFF6;
	*((uint32_t *)(text_runtime_base + 0x34)) = 0x97FFFFF3;

	/* make the `.text` copy readonly and executable */
	if (mprotect(text_runtime_base, page_align(text_hdr->sh_size), PROT_READ | PROT_EXEC)) {
	...

Compile and run:

$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52

It works! But this is not how the linker handles the relocations. The linker resolves relocation based on the type and formula assigned to this type. In our Part 2 we investigated it quite well. Here again we need to find the type and check the formula for this type:

$ readelf --relocs obj.o

Relocation section '.rela.text' at offset 0x228 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000028  000a0000011b R_AARCH64_CALL26  0000000000000000 add5 + 0
000000000034  000a0000011b R_AARCH64_CALL26  0000000000000000 add5 + 0

Relocation section '.rela.eh_frame' at offset 0x258 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000001c  000200000105 R_AARCH64_PREL32  0000000000000000 .text + 0
000000000034  000200000105 R_AARCH64_PREL32  0000000000000000 .text + 18

Our Type is R_AARCH64_CALL26 and the formula for it is:

ELF64 Code	Name	Operation
283	R_<CLS>_CALL26	S + A - P

where:

S (when used on its own) is the address of the symbol
A is the addend for the relocation
P is the address of the place being relocated (derived from r_offset)

Here are the relevant changes to loader.c:

/* Replace `#define R_X86_64_PLT32 4` with our Type */
#define R_AARCH64_CALL26 283
...

static void do_text_relocations(void)
{
	...
	uint32_t val;

	switch (type)
	{
	case R_AARCH64_CALL26:
		/* The mask separates opcode (6 bits) and the immediate value */
		uint32_t mask_bl = (0xffffffff << 26);
		/* S+A-P, divided by 4 */
		val = (symbol_address + relocations[i].r_addend - patch_offset) >> 2;
		/* Concatenate opcode and value to get final instruction */
		*((uint32_t *)patch_offset) &= mask_bl;
		val &= ~mask_bl;
		*((uint32_t *)patch_offset) |= val;
		break;
	}
	...
}

Compile and run:

$ gcc -o loader loader.c 
$ ./loader
Calculated relocation: 0x97fffff6
Calculated relocation: 0x97fffff3
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52

So far so good. The next challenge is to add constant data and global variables to our object file and check relocations again:

$ readelf --relocs --wide obj.o

Relocation section '.rela.text' at offset 0x388 contains 8 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000000000  0000000500000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .rodata + 0
0000000000000004  0000000500000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .rodata + 0
000000000000000c  0000000300000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .data + 0
0000000000000010  0000000300000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .data + 0
0000000000000024  0000000300000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .data + 0
0000000000000028  0000000300000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .data + 0
0000000000000068  000000110000011b R_AARCH64_CALL26       0000000000000040 add5 + 0
0000000000000074  000000110000011b R_AARCH64_CALL26       0000000000000040 add5 + 0
...

We have even two new relocations: R_AARCH64_ADD_ABS_LO12_NC and R_AARCH64_ADR_PREL_PG_HI21. Their formulas are:

ELF64 Code	Name	Operation
275	R_<CLS>_ ADR_PREL_PG_HI21	Page(S+A) - Page(P)
277	R_<CLS>_ ADD_ABS_LO12_NC	S + A

where:

Page(expr) is the page address of the expression expr, defined as (expr & ~0xFFF). (This applies even if the machine page size supported by the platform has a different value.)

It’s a bit unclear why we have two new types, while in x86 we had only one. Let’s investigate the assembly code:

$ objdump --disassemble --section=.text obj.o

obj.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <get_hello>:
   0:	90000000 	adrp	x0, 0 <get_hello>
   4:	91000000 	add	x0, x0, #0x0
   8:	d65f03c0 	ret

000000000000000c <get_var>:
   c:	90000000 	adrp	x0, 0 <get_hello>
  10:	91000000 	add	x0, x0, #0x0
  14:	b9400000 	ldr	w0, [x0]
  18:	d65f03c0 	ret

000000000000001c <set_var>:
  1c:	d10043ff 	sub	sp, sp, #0x10
  20:	b9000fe0 	str	w0, [sp, #12]
  24:	90000000 	adrp	x0, 0 <get_hello>
  28:	91000000 	add	x0, x0, #0x0
  2c:	b9400fe1 	ldr	w1, [sp, #12]
  30:	b9000001 	str	w1, [x0]
  34:	d503201f 	nop
  38:	910043ff 	add	sp, sp, #0x10
  3c:	d65f03c0 	ret

We see that all adrp instructions are followed by add instructions. The [add](https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/ADD--immediate---Add--immediate--?lang=en) instruction adds an immediate value to the source register and writes the result to the destination register. The source and destination registers can be the same, the immediate value is 12 bits. The [adrp](https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/ADRP--Form-PC-relative-address-to-4KB-page-?lang=en) instruction generates a pc-relative (program counter) address and writes the result to the destination register. It takes pc of the instruction itself and adds a 21-bit immediate value shifted left by 12 bits. If the immediate value weren’t shifted it would lie in a range of +/-1 MB memory, which isn’t enough. The left shift increases the range up to +/-1 GB. However, because the 12 bits are masked out with the shift, we need to store them somewhere and restore later. That’s why we see add instruction following adrp and two types instead of one. Also, it’s a bit tricky to encode adrp: 2 low bits of immediate value are placed in the position 30:29 and the rest in the position 23:5. Due to size limitations, the aarch64 instructions try to make the most out of 32 bits.

In the code we are going to use the formulas to calculate the values and description of adrp and add instructions to obtain the final opcode:

#define R_AARCH64_CALL26 283
#define R_AARCH64_ADD_ABS_LO12_NC 277
#define R_AARCH64_ADR_PREL_PG_HI21 275
...

{
case R_AARCH64_CALL26:
	/* The mask separates opcode (6 bits) and the immediate value */
	uint32_t mask_bl = (0xffffffff << 26);
	/* S+A-P, divided by 4 */
	val = (symbol_address + relocations[i].r_addend - patch_offset) >> 2;
	/* Concatenate opcode and value to get final instruction */
	*((uint32_t *)patch_offset) &= mask_bl;
	val &= ~mask_bl;
	*((uint32_t *)patch_offset) |= val;
	break;
case R_AARCH64_ADD_ABS_LO12_NC:
	/* The mask of `add` instruction to separate 
	* opcode, registers and calculated value 
	*/
	uint32_t mask_add = 0b11111111110000000000001111111111;
	/* S + A */
	uint32_t val = *(symbol_address + relocations[i].r_addend);
	val &= ~mask_add;
	*((uint32_t *)patch_offset) &= mask_add;
	/* Final instruction */
	*((uint32_t *)patch_offset) |= val;
case R_AARCH64_ADR_PREL_PG_HI21:
	/* Page(S+A)-Page(P), Page(expr) is defined as (expr & ~0xFFF) */
	val = (((uint64_t)(symbol_address + relocations[i].r_addend)) & ~0xFFF) - (((uint64_t)patch_offset) & ~0xFFF);
	/* Shift right the calculated value by 12 bits.
	 * During decoding it will be shifted left as described above, 
	 * so we do the opposite.
	*/
	val >>= 12;
	/* Separate the lower and upper bits to place them in different positions */ 
	uint32_t immlo = (val & (0xf >> 2)) << 29 ;
	uint32_t immhi = (val & ((0xffffff >> 13) << 2)) << 22;
	*((uint32_t *)patch_offset) |= immlo;
	*((uint32_t *)patch_offset) |= immhi;
	break;
}

Compile and run:

$ gcc -o loader loader.c 
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42

It works! The final code is here.

Executing example from Part 3 on aarch64

Our Part 3 is about resolving external dependencies. When we write code we don’t think much about how to allocate memory or print debug information to the console. Instead, we involve functions from the system libraries. But the code of system libraries needs to be passed through to our programs somehow. Additionally, for optimization purposes, it would be nice if this code would be stored in one place and shared between all programs. And another wish — we don’t want to resolve all the functions and global variables from the libraries, only those which we need and at those times when we need them. To solve these problems, ELF introduced two sections: PLT (the procedure linkage table) and GOT (the global offset table). The dynamic loader creates a list which contains all external functions and variables from the shared library, but doesn’t resolve them immediately; instead they are placed in the PLT section. Each external symbol is presented by a small function, a stub, e.g. puts@plt. When an external symbol is requested, the stub checks if it was resolved previously. If not, the stub searches for an absolute address of the symbol, returns to the requester and writes it in the GOT table. The next time, the address returns directly from the GOT table.

In Part 3 we implemented a simplified PLT/GOT resolution. Firstly, we added a new function say_hello in the obj.c, which calls unresolved system library function puts. Further we added an optional wrapper my_puts in the loader.c. The wrapper isn’t required, we could’ve resolved directly to a standard function, but it's a good example of how the implementation of some functions can be overwritten with custom code. In the next steps we added our PLT/GOT resolution:

PLT section we replaced with a jumptable
GOT we replaced with assembly instructions

Basically, we created a small stub with assembly code (our jumptable) to resolve the global address of our my_puts wrapper and jump to it.

The approach for aarch64 is the same. But the jumptable is very different as it consists of different assembly instructions.

The big difference here compared to the other parts is that we need to work with a 64-bit address for the GOT resolution. Our custom PLT or jumptable is placed close to the main code of obj.c and can operate with the relative addresses as before. For the GOT or referencing my_puts wrapper we’ll use different branch instructions — br or blr. These instructions branch to the register, where the aarch64 registers can hold 64-bit values.

We can check how it resolves with the native PLT/GOT in our loader assembly code:

$ objdump --disassemble --section=.text loader
...
1d2c:	97fffb45 	bl	a40 <puts@plt>
1d30:	f94017e0 	ldr	x0, [sp, #40]
1d34:	d63f0000 	blr	x0
...

The first instruction is bl jump to puts@plt stub. The next ldr instruction tells us that some value was loaded into the register x0 from the stack. Each function has its own stack frame to hold the local variables. The last blr instruction makes a jump to the address stored in x0 register. There is an agreement in the register naming: if the stored value is 64-bits then the register is called x0-x30; if only 32-bits are used then it’s called w0-w30 (the value will be stored in the lower 32-bits and upper 32-bits will be zeroed).

We need to do something similar — place the absolute address of our my_puts wrapper in some register and call br on this register. We don’t need to store the link before branching, the call will be returned to say_hello from obj.c, which is why a plain br will be enough. Let’s check an assembly of simple C-function:

hello.c:

#include <stdint.h>

void say_hello(void)
{
    uint64_t reg = 0x555555550c14;
}

$ gcc -c hello.c
$ objdump --disassemble --section=.text hello.o

hello.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <say_hello>:
   0:	d10043ff 	sub	sp, sp, #0x10
   4:	d2818280 	mov	x0, #0xc14                 	// #3092
   8:	f2aaaaa0 	movk	x0, #0x5555, lsl #16
   c:	f2caaaa0 	movk	x0, #0x5555, lsl #32
  10:	f90007e0 	str	x0, [sp, #8]
  14:	d503201f 	nop
  18:	910043ff 	add	sp, sp, #0x10
  1c:	d65f03c0 	ret

The number 0x555555550c14 is the address returned by lookup_ext_function. We’ve printed it out to use as an example, but any 48-bits hex value can be used.

In our output we see that the value was split in three sections and written in x0 register with three instructions: one mov and two movk. The documentation says that there are only 16 bits for the immediate value, but a shift can be applied (in our case left shift lsl).

However, we can’t use x0 in our context. By convention the registers x0-x7 are caller-saved and used to pass function parameters between calls to other functions. Let’s use x9 then.

We need to modify our loader. Firstly let’s change the jumptable structure.

loader.c:

...
struct ext_jump {
	uint32_t instr[4];
};
...

As we saw above, we need four instructions: mov, movk, movk, br. We don’t need a stack frame as we aren’t preserving any local variables. We just want to load the address into the register and branch to it. But we can’t write human-readable code

e.g. mov x0, #0xc14 into instructions, we need machine binary or hex representation, e.g. d2818280.

Let’s write a simple assembly code to get it:

hw.s:

.global _start

_start: mov     x9, #0xc14 
        movk    x9, #0x5555, lsl #16
        movk    x9, #0x5555, lsl #32
        br      x9

$ as -o hw.o hw.s
$ objdump --disassemble --section=.text hw.o

hw.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <_start>:
   0:	d2818289 	mov	x9, #0xc14                 	// #3092
   4:	f2aaaaa9 	movk	x9, #0x5555, lsl #16
   8:	f2caaaa9 	movk	x9, #0x5555, lsl #32
   c:	d61f0120 	br	x9

Almost done! But there’s one more thing to consider. Even if the value 0x555555550c14 is a real my_puts wrapper address, it will be different on each run if ASLR(Address space layout randomization) is enabled. We need to patch these instructions to put the value which will be returned by lookup_ext_function on each run. We’ll split the obtained value in three parts, 16-bits each, and replace them in our mov and movk instructions according to the documentation, similar to what we did before for our second part.

if (symbols[symbol_idx].st_shndx == SHN_UNDEF) {
	static int curr_jmp_idx = 0;

	uint64_t addr = lookup_ext_function(strtab +  symbols[symbol_idx].st_name);
	uint32_t mov = 0b11010010100000000000000000001001 | ((addr << 48) >> 43);
	uint32_t movk1 = 0b11110010101000000000000000001001 | (((addr >> 16) << 48) >> 43);
	uint32_t movk2 = 0b11110010110000000000000000001001 | (((addr >> 32) << 48) >> 43);
	jumptable[curr_jmp_idx].instr[0] = mov;         // mov  x9, #0x0c14
	jumptable[curr_jmp_idx].instr[1] = movk1;       // movk x9, #0x5555, lsl #16
	jumptable[curr_jmp_idx].instr[2] = movk2;       // movk x9, #0x5555, lsl #32
	jumptable[curr_jmp_idx].instr[3] = 0xd61f0120;  // br   x9

	symbol_address = (uint8_t *)(&jumptable[curr_jmp_idx].instr[0]);
	curr_jmp_idx++;
} else {
	symbol_address = section_runtime_base(&sections[symbols[symbol_idx].st_shndx]) + symbols[symbol_idx].st_value;
}
uint32_t val;
switch (type)
{
case R_AARCH64_CALL26:
	/* The mask separates opcode (6 bits) and the immediate value */
	uint32_t mask_bl = (0xffffffff << 26);
	/* S+A-P, divided by 4 */
	val = (symbol_address + relocations[i].r_addend - patch_offset) >> 2;
	/* Concatenate opcode and value to get final instruction */
	*((uint32_t *)patch_offset) &= mask_bl;
	val &= ~mask_bl;
	*((uint32_t *)patch_offset) |= val;
	break;
...

In the code we took the address of the first instruction &jumptable[curr_jmp_idx].instr[0] and wrote it in the symbol_address, further because the type is still R_AARCH64_CALL26 it will be put into bl - jump to the relative address. Where our relative address is the first mov instruction. The whole jumptable code will be executed and finished with the blr instruction.

The final run:

$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42
Executing say_hello...
my_puts executed
Hello, world!

The final code is here.

Summary

There are several things we covered in this blog post. We gave a brief introduction on how the binary got executed on Linux and how all components are linked together. We saw a big difference between x86 and aarch64 assembly. We learned how we can hook into the code and change its behavior. But just as it was said in the first blog post of this series, the most important thing is to remember to always think about security first. Processing external inputs should always be done with great care. Bounds and integrity checks have been omitted for the purposes of keeping the examples short, so readers should be aware that the code is not production ready and is designed for educational purposes only.

The Cloudflare Blog

How to execute an object file: part 4, AArch64 edition

Introduction

The ELF File

The ELF Header

The ELF Program Header

The ELF Section Header

Executing example from Part 1 on aarch64

Executing example from Part 2 on aarch64

Executing example from Part 3 on aarch64

Summary

QUIC restarts, slow problems: udpgrm to the rescue

Sequential consistency without borders: how D1 implements global read replication

Pools across the sea: how Hyperdrive speeds up access to databases and why we’re making it free

A steam locomotive from 1993 broke my yarn test