This is the documentation for the Stahl project, including:

  • the Stahl programming language
  • the StahlOS kernel
  • the StahlOS Forth programming language, which the kernel is implemented in



Stahl uses levitation [The Gentle Art of Levitation] to define datatypes and generic operations on them. The basic idea is that one can avoid making datatype definitions built-in if they have a few built-in values. Datatypes are then described rather than defined -- there's a type Desc that directly mirrors the syntax of datatype declarations. An operator (Data) then turns a description into a type, by induction on the Desc.

This might seem a bit roundabout, since whatever "axiomness" exists in the built-in datatype definitions still exists in these new built-ins, but this also enables generic programming. Arbitrary other functions that induct over descriptions can be defined too, for example a function that takes a description, and returns a function that converts a value of the corresponding type to a string. New descriptions can be computed as well, for example a recursive datatype that represents expressions in a programming language can be transformed to add location annotations.

Further Reading

HOWTO: Add a Platform

Platforms are created as subdirectories of src/platforms. They require a few pieces:

  • A file, to describe the build.
  • A linker script, linker.ld.
  • One or more assembly files, to implement some platform-specific functions.
  • Optionally, some number of Forth files, to add code to the root process.

The file used to define a platform relies on the currently undocumented build system. You only need a small subset of the available functionality to define most platforms, though. An example is:

# This corresponds to the kernel/ file.
import stahl_build.kernel as kernel

# This defines the root subcomponent of the component for the directory.
# It depends on the kernel.common component, which all platforms should.
@configure("root", ["kernel.common"])
def configure(ctx):
	# We call out to the kernel configuration helper.
		# We pass the build context.
		# The name of our platform. This should be the same as
		# the name of the directory the platform is defined in.
		# The kernel subcomponents to include. This should always
		# include "common", as well as whatever drivers are useful
		# for the platform. The "debug" subcomponent is useful when
		# initially bringing up a target; it enables self-tests,
		# and in the future may enable other goodies.
		["common", "debug", "drivers/gicv3", "drivers/psci"],

For an example of a more complicated platform, see the js platform. It injects custom Forth code, and builds extra files. Note that since it depends on external files, it also requires them to be passed in via default.nix.

Assembly Files

The assembly files currently need to implement the following functions:



  • x0: A pointer to the FDT.



Side Effects

Can trash x0-x15.

Simplest Implementation

.global platform.init


Prints a string for debugging purposes.


  • x0: The address of the string to print
  • x1: The length of the string to print



Side Effects

Can trash x0-x15.

Simplest Implementation

.global platform.debug_print


Waits until an interrupt is received.

This is overridable in case extra actions need to be taken, for example power management functionality.





Side Effects

Can trash x0-x15.

Simplest Implementation

.global platform.idle

Frame Allocator

The frame allocator is responsible for allocating physical memory in units of 64KiB. 64KiB frames are used rather than 4KiB frames because individual objects should not be stored in frames. Rather, a process' heap should be allocated with these frames, and the GC should be run within them.

Because of this limited use-case, the frame allocator doesn't need to avoid fragmentation (since frames can't be fragmented), so it can be a simple singly-linked-list of frames, where the address of the next frame is the first double-word of the frame. This additionally makes allocating a frame atomically in only a few instructions quite practical.

Page Allocator

The page allocator is responsible for allocating virtual memory in units whose sizes are multiples of 64KiB. It sits directly on top of the frame allocator, and should almost always be used as the interface to memory allocation.

No auxiliary structures are needed by the page allocator besides the page tables themselves.

The allocator always allocates into high memory; low memory remains identity mapped. (Initially, just the kernel itself is paged into low memory; however, the page fault handler will map in the rest as it is used.) A guard page will always be allocated between two allocations.

Currently, running out of address space is a kernel panic, and address space will not be reused. In the future, this will be resolved in a GC-like manner: the (user-allocatable) virtual address space will be split into two halves, and whenever one half runs out of memory, allocations will be moved to the other half. (This is why the GC ABI requires support for moving a process' high-memory allocations.)


Disclaimer: I'm not a cryptographer, I'm rolling my own crypto, none of it's been audited, this is bad, don't use it for anything important.


StahlOS implements the following algorithms:

  • BLAKE2s (only the features present in RFC7693)
  • ChaCha20
  • Fortuna (using ChaCha20 and BLAKE2s)

Garbage Collector

Currently, the garbage collector is an implementation of Cheney's semispace collector. Right now, it's non-generational, but this is very likely to change in the future.

Memory Layout

Allocations' sizes are a multiple of 8 bytes, and must be at least 8 bytes.

Each allocation has an extra 8-byte dword of metadata before the allocation itself. The high 32 bits act as flags:

     addr-4           addr-3          addr-2          addr-1
LSB          MSB LSB         MSB LSB         MSB LSB          MSB
|   Reserved    |    Reserved   |D|M|  Reserved |    Reserved   |
  • D: All Data. This flag causes the low 32 bits to be interpreted differently. See below for details.
  • M: Moved. This flag declares that the value has been moved to the to-space, and the first dword of data is the forwarding pointer.

All reserved bits must be 0.

The low 32 bits are interpreted as either one or two length fields, depending on the value of the D flag.

If the D flag is zero, the low 16 bits (starting at addr-8) are interpreted as the length of the allocation, measured in 8-byte dwords. The next 16 bits (starting at addr-6) are interpreted as the number of data dwords in the allocation.

If the D flag is one, the low 32 bits (starting at addr-8) are interpreted as the length of the allocation in dwords, and it is assumed that all of these are data.


Allocation is very simple. A "bump pointer" allocator is used, which keeps track of the next free address. This value is added to with each allocation.

If this would go outside the bounds of the heap, instead a collection will be triggered. For this reason, allocation must always occur at a GC safepoint.


The roots of the garbage collector are found in the x16, x17, x18, and x19 registers, as well as the Forth data stack. From each of these values, a recursive function runs, taking the value and returning a new value for it to have.

  • If the high bit is zero, the same value is returned.
  • If the high bit is one, the value is treated as a pointer to the start of an allocation, whose M flag is checked.
    • If the M flag is one, the first dword is read, and its value is returned.
    • If the M flag is zero, a new allocation is made in the to-space with the same lengths and flags, and the data is then copied over. The M flag is then set on the original allocation, and the address of the new allocation is written to the first dword of the original allocation.

Once this has finished, this same process runs directly over values in the to-space. Any data values as specified by the header are skipped.

After this, the from-space is freed.

The JS Platform

To make trying StahlOS out easier, it can run under unicorn.js in the browser. This is rather slow, but actually works! This is implemented as the js platform, which this page will describe for reference.

Memory Layout

Due to JavaScript's number type not being able to store 64-bit integers, unicorn.js has some strange behavior around large addresses. As a result, things are mapped to an address that's lower than might often be expected.

RAM starts at 0x10000, and is 128MiB (0x8000000 bytes) long. The devicetree gets stored to 0x10000. The kernel is loaded at 0x20000.

A special region of memory is mapped from 0xffff0000 to 0xffffffff. This is the host interface, and is used to communicate between the browser and the kernel.

Host Interface

The host interface acts as a memory-mapped peripheral.


The main way the kernel communicates with the host interface is by writing a single byte to 0xffff0000. Depending on the value of the byte, this can mean several things:

0x00: Write to the Console

Writes the UTF-8 string whose address is in x0 and whose length is in x1 to the console.

0x01: Panic

Stops emulation permanently, and puts the UI into "panic mode."

0x02: Idle

Stops emulation temporarily, restarting it on the next user input.

User Input

User input is stored in a circular buffer, between addresses 0xffff1000 and 0xffff1fff (inclusive). The offset from 0xffff1000 where the next byte of input will be stored is in the halfword at 0xffff0002.


The boot process is fairly simple:

  • Perform a first pass over the devicetree, initialize the frame allocator with memory regions found from it.
  • Perform a second pass over the devicetree, initializing interrupt controllers.
  • Perform a third pass over the devicetree, initializing the rest of the devices.
  • Print "Hello, world!" to all serial ports.
  • Panic.


This page contains a set of notes about the ARM Generic Interrupt Controller v3.

Interrupt Types:

  • SPI: Shared Processor Interrupts -- most hw interrupts; e.g. uarts
  • PPI: Per-Processor Interrupts -- interrupt caused by hw attached to one core, e.g. timers
  • SGI: Software-Generated Interrupts -- interrupt caused by another core
  • LPI: Locality-specific Peripheral Interrupt -- ???, new in GICv3

Interrupt Numbers:

  • 0-15: SGIs
  • 16-31: PPIs
  • 32-1019: SPIs
  • 1020-1023: Special
  • 1024-1055: Reserved
  • 1056-1119: PPIs
  • 1120-4095: Reserved
  • 4096-5119: SPIs
  • 5120-8191: Reserved
  • 8192-???: LPIs

Affinity Routing:

PEs (cores) identified by a dotted-quad (like IPv4). In AArch32, the top octet is 0. The current core's affinity value can be read from the MPIDR_EL1 register, which also has other relevant flags.

Registers (3 Parts):

  • Distributor Interface: One per GIC
    • Routes SPIs and SGIs
  • Redistributor Interface: One per core
  • CPU Interface: One per core

Interrupts can be disabled/enabled with GICR_ICENABLER<n>/GICR_ISENABLER<n> registers; reading from each gets the same value, but writing to IC clears the positions with 1 bits, while IS sets the positions with 1 bits.


  • Interface Control Register
  • Interface Priority Mask Register
  • Interrupt Acknowledge Register
  • End of Interrupt Register


StahlOS contains an implementation of hashtables based on Google's Swiss table hashtables, using as reference "Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step". It also implements the "cache-line-aware groups" optimization described at 47:25 of "Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step," since it only supports 8-byte entries.

It's implemented in AArch64 assembly and follows the typical Procedure Call Standard, so it can be tested (and used) in userspace. The implementation (in kernel/src/common/hashtable/table.s) is hopefully quite readable, so it should be treated as part of this documentation as well, and is the primary reference for most of the algorithmic details of the hashtables.


The hashtables index hash groups (kinda sorta like buckets) with the low bits of the hash, and use the high 8 bits to find elements within the bucket. The number of hash groups is always a power of two, simplifying the implementation. Resizing occurs after a load factor of 15/16 (93.75%) is reached.

>>> def birthday_problem(hash_domain, num_entries):
...     """Returns the odds that no two entries in a group of `num_entries`
...     share the same random `hash_domain`-bit hash."""
...     acc = 1
...     for i in range(num_entries):
...         acc *= ((hash_domain - i) / hash_domain)
...     return acc

>>> def collision_chance(num_entries):
...     log2_num_groups = ceil(log2(num_entries / 7 * 16 / 15))
...     hash_domain = 2 ** 8 * 2 ** log2_num_groups
...     return birthday_problem(hash_domain, num_entries)

>>> plt.title("probability of no collisions")
... plt.gca(xlabel="number of entries", ylabel="probability")
... plt.plot(range(1, 500), [(collision_chance(n), 0.5) for n in range(1, 500)])
1980-01-01T00:00:00+00:00 image/svg+xml Matplotlib v3.4.1,

Since we do open addressing, the chance of a collision might actually be higher. TODO: Analyze this.

Bitwise Tricks

Estimate log2_initial_group_count from Initial Capacity

hashtable_init takes log2_initial_group_count (which is efficient to compute on) rather than an initial capacity, which is what most programmers expect. We provide an efficient approximation (which avoids the cost of a log and a division):

>>> def clz(x):
...     size = x.size()
...     expr = BitVecVal(size, size)
...     for i in reversed(range(x.size())):
...         j = size - i - 1
...         bit = Extract(j, j, x)
...         expr = If(bit == 1, i, expr)
...     return expr

>>> def estimate_log2_initial_group_count(initial_cap):
...     return 64 - clz((initial_cap * 3) >> 4)

>>> def estimate_log2_initial_group_count_correct(initial_cap):
...     log2_initial_group_count = estimate_log2_initial_group_count(initial_cap)
...     initial_group_count = BitVecVal(1, 64) << log2_initial_group_count
...     capacity = initial_group_count * 7 * 15 / 16
...     return And(ULE(initial_cap, capacity),
...                If(initial_cap < 7, ULT(capacity, initial_cap + 8),
...                                    ULT(capacity, initial_cap * 4)))

>>> n = BitVec("n", 64)
>>> solve(Not(estimate_log2_initial_group_count_correct(n)), ULE(1, n), ULT(n, 2 ** 56))
no solution

As Z3 shows (by failing to find a counter-example), for any initial capacity between 1 and 256, the approximation in estimate_log2_initial_group_count is correct, and not too wasteful.

Find Free Slot

During insertion, we want to be able to find a free slot in the hash group, given the metadata byte.

>>> cs = []

>>> def clz(x):
...     size = x.size()
...     expr = size
...     for i in reversed(range(x.size())):
...         j = size - i - 1
...         bit = Extract(j, j, x)
...         expr = If(bit == 1, i, expr)
...     return expr

>>> def index(l, n, canary):
...     cs.append(n >= 0)
...     cs.append(n < len(l))
...     expr = canary
...     for i in range(len(l)):
...         expr = If(i == n, l[i], expr)
...     return expr

>>> def find_free_slot(n):
...     return 63 - clz(~(n | 0xffffffffffffff01))

>>> def find_free_slot_correct(n):
...     p = [Bool(f"p{i}") for i in range(8)]
...     cs.append(Not(p[0]))
...     for i in range(1, 8):
...         cs.append(p[i] == (Extract(i, i, n) == 1))
...     cs.append(Extract(63, 8, n) == 0)
...     i = Int("i")
...     cs.append(i == find_free_slot(n))
...     return If(i == 0,
...         reduce(And, (Not(p[i]) for i in range(1, 8))),
...         Not(index(p, i, False)))

>>> n = BitVec("n", 64)
>>> solve(Not(find_free_slot_correct(n)), *cs)
no solution

Since Z3 couldn't find examples of incorrectness, find_free_slot must be a correct implementation.

StahlOS Forth Reference

This section is a reference for the words defined in StahlOS Forth.

TODO: Document notation

Language Words

TODO: Fix autogenerations.

These are the standard words provided as builtins, pseudobuiltins, or in a Forth file with priority <10.

Stack Manipulation


Signature: ( -- u )

Pushes the depth of the stack to the stack.


1 2 3 DEPTH
.S \ 1 2 3 3


Signature: ( x -- )

Pops an item from the top of the stack and discards it.


1 2 3 DROP
.S \ 1 2


Signature: ( x -- x x )

Duplicates the top item of the stack.


1 2 3 DUP
.S \ 1 2 3 3

Arithmetic, Comparison, and Logic



Signature: ( addr len n -- addr+n len-n )

Advances n bytes in a string.


$1000 16 4 ADJUST
.S \ $1004 12
S"Foo Bar" 4 ADJUST
TYPE \ Bar

Control Flow



Memory Access

Return Stack Manipulation

AArch64 Assembler

These words are for writing new code words. (At some point, these will probably be put into their own wordlist, and something like the typical ;CODE and END-CODE will be added.)



The constants X0, X1, ..., X29, X30, XZR, XSP exist, and are used to specify registers to instructions.

Other Constants

The shifts LSL, LSR, ASR, ROR are provided for instructions that take shifted registers as arguments.


All of these words consume all of their arguments and leave nothing on the stack.

INSN/ADD-IMM64!insn-addr dst-reg src-reg immadd dst-reg, src-reg, #imm
INSN/ADR!insn-addr reg valueadr reg, value
INSN/B!insn-addr dest-addrb dest-addr
INSN/BL!insn-addr dest-addrbl dest-addr
INSN/HVC!insn-addr immhvc #imm
INSN/LDP-POST64!insn-addr dst1-reg dst2-reg addr-reg immldp dst1-reg, dst2-reg, [addr-reg], #imm
INSN/LDR-IMM-POST64!insn-addr dst-reg addr-reg immldr dst-reg, [addr-reg], #imm
INSN/LDR-LIT64!insn-addr dst-reg mem-addrldr dst-reg, [mem-addr]
INSN/MOV-REG64!insn-addr dst-reg src-regmov dst-reg, src-reg
INSN/ORR-SREG64!insn-addr dst-reg src1-reg src2-reg shift-kind shift-amountorr dst-reg, src1-reg, src2-reg, shift-kind shift-amount
INSN/STP-PRE64!insn-addr src1-reg src2-reg addr-reg immstp src1-reg, src2-reg, [addr-reg, #imm]!
INSN/STR-IMM-PRE64!insn-addr src-reg addr-reg immstr src-reg, [addr-reg, #imm]!
INSN/SMC!insn-addr immsmc #imm
INSN/SVC!insn-addr immsvc #imm

AArch64 Intrinsics

These words do AArch64-specific things.

System Register Access

All of these words are of the form AARCH64/basename@ or AARCH64/basename!. The @ variety has a signature of ( -- u ), the ! variety has a signature of ( u -- ). They correspond to system registers.

Word Base NameRegister

Other Words


Signature: ( -- addr )

Pushes the address of the kernel-allocated exception vector table to the stack. This is allocated in the kernel so it is guaranteed to have proper alignment, since it needs a rather large one. (In the future, this might change, and it may instead be allocated inside a frame allocated by the standard frame allocator.)

StahlOS Forth Internals

Some internal details of the Forth implementation used in the StahlOS kernel is briefly described here. This ABI is tentative and may change.

Word Structure

StahlOS Forth is a DTC Forth.

Words must be aligned to a multiple of 8 bytes.

0x00 +-------------------+ --\
     |   Previous Word   |   |
     |   Header Address  |   |
0x08 +-------------------+   |
     |    Name Length    |   |
0x09 +-------------------+   |
     |     Word Flags    |   |
     |    (see below)    |   |
0x0c +-------------------+   |
     |     Word Name     |   +-- Word Header
     +-------------------+   |
     |      Padding      |   |  /- Multiple of 4 bytes
     |   (0 to 3 bytes)  |   |  |
0x?? +-------------------+ - | -+
     |    Name Length    |   |  |
0x?? +-------------------+   |  |
     | Padding (3 bytes) |   |  |
0x?? +-------------------+ --/--/
     |     Code Field    |   \
     +-------------------+   +-- Word Body
     |  Parameter Field  |   |
     +-------------------+ --/

Word Flags are:

       0x08            0x09            0x0a            0x0b
 LSB         MSB LSB         MSB LSB         MSB LSB         MSB
|  Name Length  |S|I|0|0|0|0|0|0|    Reserved   |    Reserved   |
  • 0: Reserved, must be zero.
  • S: Smudge. 0 for visible, 1 for hidden.
  • I: Immediate. 0 for non-immediate, 1 for immediate.


  • x0-x15: Temporaries
  • x16-x19: Reserved for processes
  • x20: The Forth instruction pointer
  • x21: The process context pointer
  • x22: The stack base pointer
  • x23: The stack top pointer
  • x24: The top stack item
  • x25: The return stack base pointer
  • x26: The return stack top pointer
  • x27: The top return stack item
  • x28: Temporary
  • x29: Temporary, used by DOES> / (DODOES)

The x22 register should always be x21 + 0x0200. The x23 register should be x22 + n, where n is the depth of the stack in bytes.

x25 and x26 have the same relationship x22 and x23 do but for the return stack, with x25 always containing x21 + 0x0100.

Process Context

0x0000 +---------------------------+---------------------------+
       |    Low 64 Bits of PID     |  Saved Forth Insn Pointer |
0x0010 +---------------------------+---------------------------+
       |       Dict Pointer        |    Dict Remaining Length  |
0x0020 +---------------------------+---------------------------+
       |    Last Defined Header    |       Process Flags       |
0x0030 +---------------------------+---------------------------+
       |       Source Address      |       Source Length       |
0x0040 +---------------------------+---------------------------+
       |                                                       |
      ...                       Reserved                      ...
       |                                                       |
0x00e0 +---------------------------+---------------------------+
       |         Saved x16         |         Saved x17         |
0x00f0 +---------------------------+---------------------------+
       |         Saved x18         |         Saved x19         |
0x0100 +---------------------------+---------------------------+
       | Saved Return Stack Depth  |                           |
0x0108 +---------------------------+                           |
       |                                                       |
      ...                     Return Stack                    ...
       |                                                       |
0x0200 +---------------------------+---------------------------+
       |     Saved Stack Depth     |                           |
0x0208 +---------------------------+                           |
       |                                                       |
      ...                        Stack                        ...
       |                                                       |
0x0400 +-------------------------------------------------------+

Process Flags are:

       0x28            0x29            0x2a            0x2b
 LSB         MSB LSB         MSB LSB         MSB LSB         MSB
|B|S| Reserved  |    Reserved   |    Reserved   |    Reserved   |

       0x2c            0x2d            0x2e            0x2f
 LSB         MSB LSB         MSB LSB         MSB LSB         MSB
|    Reserved   |    Reserved   |    Reserved   |    Reserved   |
  • B: Base. 0 for decimal, 1 for hex.
  • S: State. 0 for interpret, 1 for compile.