Introduction
This is the documentation for the Stahl project, including:
- the Stahl programming language
- the StahlOS kernel
- the StahlOS Forth programming language, which the kernel is implemented in
Levitation
Introduction
Stahl uses levitation [The Gentle Art of Levitation] to define datatypes and generic operations on them. The basic idea is that one can avoid making datatype definitions built-in if they have a few built-in values. Datatypes are then described rather than defined -- there's a type Desc
that directly mirrors the syntax of datatype declarations. An operator (Data
) then turns a description into a type, by induction on the Desc
.
This might seem a bit roundabout, since whatever "axiomness" exists in the built-in datatype definitions still exists in these new built-ins, but this also enables generic programming. Arbitrary other functions that induct over descriptions can be defined too, for example a function that takes a description, and returns a function that converts a value of the corresponding type to a string. New descriptions can be computed as well, for example a recursive datatype that represents expressions in a programming language can be transformed to add location annotations.
Further Reading
- Generic Constructors and Eliminators from Descriptions
- The Gentle Art of Levitation
- The Practical Guide to Levitation
HOWTO: Add a Platform
Platforms are created as subdirectories of src/platforms
. They require a few pieces:
- A
configure.py
file, to describe the build. - A linker script,
linker.ld
. - One or more assembly files, to implement some platform-specific functions.
- Optionally, some number of Forth files, to add code to the root process.
The configure.py
file used to define a platform relies on the currently undocumented build system. You only need a small subset of the available functionality to define most platforms, though. An example is:
# This corresponds to the kernel/configure.py file.
import stahl_build.kernel as kernel
# This defines the root subcomponent of the component for the directory.
# It depends on the kernel.common component, which all platforms should.
@configure("root", ["kernel.common"])
def configure(ctx):
# We call out to the kernel configuration helper.
kernel.configure(
# We pass the build context.
ctx,
# The name of our platform. This should be the same as
# the name of the directory the platform is defined in.
"my-cool-target",
# The kernel subcomponents to include. This should always
# include "common", as well as whatever drivers are useful
# for the platform. The "debug" subcomponent is useful when
# initially bringing up a target; it enables self-tests,
# and in the future may enable other goodies.
["common", "debug", "drivers/gicv3", "drivers/psci"],
)
For an example of a more complicated platform, see the js
platform. It injects custom Forth code, and builds extra files. Note that since it depends on external files, it also requires them to be passed in via default.nix
.
Assembly Files
The assembly files currently need to implement the following functions:
platform.init
Inputs
x0
: A pointer to the FDT.
Outputs
None.
Side Effects
Can trash x0
-x15
.
Simplest Implementation
.global platform.init
platform.init:
ret
platform.debug_print
Prints a string for debugging purposes.
Inputs
x0
: The address of the string to printx1
: The length of the string to print
Outputs
None.
Side Effects
Can trash x0
-x15
.
Simplest Implementation
.global platform.debug_print
platform.debug_print:
ret
platform.idle
Waits until an interrupt is received.
This is overridable in case extra actions need to be taken, for example power management functionality.
Inputs
None.
Outputs
None.
Side Effects
Can trash x0
-x15
.
Simplest Implementation
.global platform.idle
platform.idle:
wfi
ret
Frame Allocator
The frame allocator is responsible for allocating physical memory in units of 64KiB. 64KiB frames are used rather than 4KiB frames because individual objects should not be stored in frames. Rather, a process' heap should be allocated with these frames, and the GC should be run within them.
Because of this limited use-case, the frame allocator doesn't need to avoid fragmentation (since frames can't be fragmented), so it can be a simple singly-linked-list of frames, where the address of the next frame is the first double-word of the frame. This additionally makes allocating a frame atomically in only a few instructions quite practical.
Page Allocator
The page allocator is responsible for allocating virtual memory in units whose sizes are multiples of 64KiB. It sits directly on top of the frame allocator, and should almost always be used as the interface to memory allocation.
No auxiliary structures are needed by the page allocator besides the page tables themselves.
The allocator always allocates into high memory; low memory remains identity mapped. (Initially, just the kernel itself is paged into low memory; however, the page fault handler will map in the rest as it is used.) A guard page will always be allocated between two allocations.
Currently, running out of address space is a kernel panic, and address space will not be reused. In the future, this will be resolved in a GC-like manner: the (user-allocatable) virtual address space will be split into two halves, and whenever one half runs out of memory, allocations will be moved to the other half. (This is why the GC ABI requires support for moving a process' high-memory allocations.)
Cryptography
Disclaimer: I'm not a cryptographer, I'm rolling my own crypto, none of it's been audited, this is bad, don't use it for anything important.
Algorithms
StahlOS implements the following algorithms:
- BLAKE2s (only the features present in RFC7693)
- ChaCha20
- Fortuna (using ChaCha20 and BLAKE2s)
Garbage Collector
Currently, the garbage collector is an implementation of Cheney's semispace collector. Right now, it's non-generational, but this is very likely to change in the future.
Memory Layout
Allocations' sizes are a multiple of 8 bytes, and must be at least 8 bytes.
Each allocation has an extra 8-byte dword of metadata before the allocation itself. The high 32 bits act as flags:
addr-4 addr-3 addr-2 addr-1
LSB MSB LSB MSB LSB MSB LSB MSB
+---------------+---------------+-+-+-----------+---------------+
| Reserved | Reserved |D|M| Reserved | Reserved |
+---------------+---------------+-+-+-----------+---------------+
D
: All Data. This flag causes the low 32 bits to be interpreted differently. See below for details.M
: Moved. This flag declares that the value has been moved to the to-space, and the first dword of data is the forwarding pointer.
All reserved bits must be 0.
The low 32 bits are interpreted as either one or two length fields, depending on the value of the D
flag.
If the D
flag is zero, the low 16 bits (starting at addr-8
) are interpreted as the length of the allocation, measured in 8-byte dwords. The next 16 bits (starting at addr-6
) are interpreted as the number of data dwords in the allocation.
If the D
flag is one, the low 32 bits (starting at addr-8
) are interpreted as the length of the allocation in dwords, and it is assumed that all of these are data.
Allocation
Allocation is very simple. A "bump pointer" allocator is used, which keeps track of the next free address. This value is added to with each allocation.
If this would go outside the bounds of the heap, instead a collection will be triggered. For this reason, allocation must always occur at a GC safepoint.
Collection
The roots of the garbage collector are found in the x16, x17, x18, and x19 registers, as well as the Forth data stack. From each of these values, a recursive function runs, taking the value and returning a new value for it to have.
- If the high bit is zero, the same value is returned.
- If the high bit is one, the value is treated as a pointer to the start of an allocation, whose
M
flag is checked.- If the
M
flag is one, the first dword is read, and its value is returned. - If the
M
flag is zero, a new allocation is made in the to-space with the same lengths and flags, and the data is then copied over. TheM
flag is then set on the original allocation, and the address of the new allocation is written to the first dword of the original allocation.
- If the
Once this has finished, this same process runs directly over values in the to-space. Any data values as specified by the header are skipped.
After this, the from-space is freed.
The JS Platform
To make trying StahlOS out easier, it can run under unicorn.js in the browser. This is rather slow, but actually works! This is implemented as the js
platform, which this page will describe for reference.
Memory Layout
Due to JavaScript's number
type not being able to store 64-bit integers, unicorn.js has some strange behavior around large addresses. As a result, things are mapped to an address that's lower than might often be expected.
RAM starts at 0x10000
, and is 128MiB (0x8000000
bytes) long. The devicetree gets stored to 0x10000
. The kernel is loaded at 0x20000
.
A special region of memory is mapped from 0xffff0000
to 0xffffffff
. This is the host interface, and is used to communicate between the browser and the kernel.
Host Interface
The host interface acts as a memory-mapped peripheral.
Commands
The main way the kernel communicates with the host interface is by writing a single byte to 0xffff0000
. Depending on the value of the byte, this can mean several things:
0x00
: Write to the Console
Writes the UTF-8 string whose address is in x0
and whose length is in x1
to the console.
0x01
: Panic
Stops emulation permanently, and puts the UI into "panic mode."
0x02
: Idle
Stops emulation temporarily, restarting it on the next user input.
User Input
User input is stored in a circular buffer, between addresses 0xffff1000
and 0xffff1fff
(inclusive). The offset from 0xffff1000
where the next byte of input will be stored is in the halfword at 0xffff0002
.
Booting
The boot process is fairly simple:
- Perform a first pass over the devicetree, initialize the frame allocator with memory regions found from it.
- Perform a second pass over the devicetree, initializing interrupt controllers.
- Perform a third pass over the devicetree, initializing the rest of the devices.
- Print "Hello, world!" to all serial ports.
- Panic.
GICv3
This page contains a set of notes about the ARM Generic Interrupt Controller v3.
Interrupt Types:
- SPI: Shared Processor Interrupts -- most hw interrupts; e.g. uarts
- PPI: Per-Processor Interrupts -- interrupt caused by hw attached to one core, e.g. timers
- SGI: Software-Generated Interrupts -- interrupt caused by another core
- LPI: Locality-specific Peripheral Interrupt -- ???, new in GICv3
Interrupt Numbers:
- 0-15: SGIs
- 16-31: PPIs
- 32-1019: SPIs
- 1020-1023: Special
- 1024-1055: Reserved
- 1056-1119: PPIs
- 1120-4095: Reserved
- 4096-5119: SPIs
- 5120-8191: Reserved
- 8192-???: LPIs
Affinity Routing:
PEs (cores) identified by a dotted-quad (like IPv4). In AArch32, the top octet is 0. The current core's affinity value can be read from the MPIDR_EL1
register, which also has other relevant flags.
Registers (3 Parts):
- Distributor Interface: One per GIC
- Routes SPIs and SGIs
- Redistributor Interface: One per core
- CPU Interface: One per core
Interrupts can be disabled/enabled with GICR_ICENABLER<n>
/GICR_ISENABLER<n>
registers; reading from each gets the same value, but writing to IC
clears the positions with 1 bits, while IS
sets the positions with 1 bits.
f
- Interface Control Register
- Interface Priority Mask Register
- Interrupt Acknowledge Register
- End of Interrupt Register
Hashtables
StahlOS contains an implementation of hashtables based on Google's Swiss table hashtables, using as reference "Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step". It also implements the "cache-line-aware groups" optimization described at 47:25 of "Designing a Fast, Efficient, Cache-friendly Hash Table, Step by Step," since it only supports 8-byte entries.
It's implemented in AArch64 assembly and follows the typical Procedure Call Standard, so it can be tested (and used) in userspace. The implementation (in kernel/src/common/hashtable/table.s
) is hopefully quite readable, so it should be treated as part of this documentation as well, and is the primary reference for most of the algorithmic details of the hashtables.
Collisions
The hashtables index hash groups (kinda sorta like buckets) with the low bits of the hash, and use the high 8 bits to find elements within the bucket. The number of hash groups is always a power of two, simplifying the implementation. Resizing occurs after a load factor of 15/16 (93.75%) is reached.
>>> def birthday_problem(hash_domain, num_entries):
... """Returns the odds that no two entries in a group of `num_entries`
... share the same random `hash_domain`-bit hash."""
... acc = 1
... for i in range(num_entries):
... acc *= ((hash_domain - i) / hash_domain)
... return acc
>>> def collision_chance(num_entries):
... log2_num_groups = ceil(log2(num_entries / 7 * 16 / 15))
... hash_domain = 2 ** 8 * 2 ** log2_num_groups
... return birthday_problem(hash_domain, num_entries)
>>> plt.title("probability of no collisions")
... plt.gca(xlabel="number of entries", ylabel="probability")
... plt.plot(range(1, 500), [(collision_chance(n), 0.5) for n in range(1, 500)])
... plt.show()
Since we do open addressing, the chance of a collision might actually be higher. TODO: Analyze this.
Bitwise Tricks
Estimate log2_initial_group_count
from Initial Capacity
hashtable_init
takes log2_initial_group_count
(which is efficient to compute on) rather than an initial capacity, which is what most programmers expect. We provide an efficient approximation (which avoids the cost of a log
and a division):
>>> def clz(x):
... size = x.size()
... expr = BitVecVal(size, size)
... for i in reversed(range(x.size())):
... j = size - i - 1
... bit = Extract(j, j, x)
... expr = If(bit == 1, i, expr)
... return expr
>>> def estimate_log2_initial_group_count(initial_cap):
... return 64 - clz((initial_cap * 3) >> 4)
>>> def estimate_log2_initial_group_count_correct(initial_cap):
... log2_initial_group_count = estimate_log2_initial_group_count(initial_cap)
... initial_group_count = BitVecVal(1, 64) << log2_initial_group_count
... capacity = initial_group_count * 7 * 15 / 16
... return And(ULE(initial_cap, capacity),
... If(initial_cap < 7, ULT(capacity, initial_cap + 8),
... ULT(capacity, initial_cap * 4)))
>>> n = BitVec("n", 64)
>>> solve(Not(estimate_log2_initial_group_count_correct(n)), ULE(1, n), ULT(n, 2 ** 56))
no solution
As Z3 shows (by failing to find a counter-example), for any initial capacity between 1 and 256, the approximation in estimate_log2_initial_group_count
is correct, and not too wasteful.
Find Free Slot
During insertion, we want to be able to find a free slot in the hash group, given the metadata byte.
>>> cs = []
>>> def clz(x):
... size = x.size()
... expr = size
... for i in reversed(range(x.size())):
... j = size - i - 1
... bit = Extract(j, j, x)
... expr = If(bit == 1, i, expr)
... return expr
>>> def index(l, n, canary):
... cs.append(n >= 0)
... cs.append(n < len(l))
... expr = canary
... for i in range(len(l)):
... expr = If(i == n, l[i], expr)
... return expr
>>> def find_free_slot(n):
... return 63 - clz(~(n | 0xffffffffffffff01))
>>> def find_free_slot_correct(n):
... p = [Bool(f"p{i}") for i in range(8)]
... cs.append(Not(p[0]))
... for i in range(1, 8):
... cs.append(p[i] == (Extract(i, i, n) == 1))
... cs.append(Extract(63, 8, n) == 0)
... i = Int("i")
... cs.append(i == find_free_slot(n))
... return If(i == 0,
... reduce(And, (Not(p[i]) for i in range(1, 8))),
... Not(index(p, i, False)))
>>> n = BitVec("n", 64)
>>> solve(Not(find_free_slot_correct(n)), *cs)
no solution
Since Z3 couldn't find examples of incorrectness, find_free_slot
must be a correct implementation.
StahlOS Forth Reference
This section is a reference for the words defined in StahlOS Forth.
TODO: Document notation
Language Words
TODO: Fix autogenerations.
These are the standard words provided as builtins, pseudobuiltins, or in a Forth file with priority <10.
Stack Manipulation
DEPTH
Signature: ( -- u )
Pushes the depth of the stack to the stack.
Example:
1 2 3 DEPTH
.S \ 1 2 3 3
DROP
Signature: ( x -- )
Pops an item from the top of the stack and discards it.
Example:
1 2 3 DROP
.S \ 1 2
DUP
Signature: ( x -- x x )
Duplicates the top item of the stack.
Example:
1 2 3 DUP
.S \ 1 2 3 3
Arithmetic, Comparison, and Logic
Strings
ADJUST
Signature: ( addr len n -- addr+n len-n )
Advances n
bytes in a string.
Examples:
$1000 16 4 ADJUST
.S \ $1004 12
S"Foo Bar" 4 ADJUST
TYPE \ Bar
Control Flow
Compilation
Dictionary
Memory Access
Return Stack Manipulation
AArch64 Assembler
These words are for writing new code words. (At some point, these will probably be put into their own wordlist, and something like the typical ;CODE
and END-CODE
will be added.)
Constants
Registers
The constants X0
, X1
, ..., X29
, X30
, XZR
, XSP
exist, and are used to specify registers to instructions.
Other Constants
The shifts LSL
, LSR
, ASR
, ROR
are provided for instructions that take shifted registers as arguments.
Instructions
All of these words consume all of their arguments and leave nothing on the stack.
Word | Arguments | Instruction |
---|---|---|
INSN/ADD-IMM64! | insn-addr dst-reg src-reg imm | add dst-reg, src-reg, #imm |
INSN/ADR! | insn-addr reg value | adr reg, value |
INSN/B! | insn-addr dest-addr | b dest-addr |
INSN/BL! | insn-addr dest-addr | bl dest-addr |
INSN/ERET! | insn-addr | eret |
INSN/HVC! | insn-addr imm | hvc #imm |
INSN/LDP-POST64! | insn-addr dst1-reg dst2-reg addr-reg imm | ldp dst1-reg, dst2-reg, [addr-reg], #imm |
INSN/LDR-IMM-POST64! | insn-addr dst-reg addr-reg imm | ldr dst-reg, [addr-reg], #imm |
INSN/LDR-LIT64! | insn-addr dst-reg mem-addr | ldr dst-reg, [mem-addr] |
INSN/MOV-REG64! | insn-addr dst-reg src-reg | mov dst-reg, src-reg |
INSN/NOP! | insn-addr | nop |
INSN/ORR-SREG64! | insn-addr dst-reg src1-reg src2-reg shift-kind shift-amount | orr dst-reg, src1-reg, src2-reg, shift-kind shift-amount |
INSN/RET! | insn-addr | ret |
INSN/STP-PRE64! | insn-addr src1-reg src2-reg addr-reg imm | stp src1-reg, src2-reg, [addr-reg, #imm]! |
INSN/STR-IMM-PRE64! | insn-addr src-reg addr-reg imm | str src-reg, [addr-reg, #imm]! |
INSN/SMC! | insn-addr imm | smc #imm |
INSN/SVC! | insn-addr imm | svc #imm |
AArch64 Intrinsics
These words do AArch64-specific things.
System Register Access
All of these words are of the form AARCH64/basename@
or AARCH64/basename!
. The @
variety has a signature of ( -- u )
, the !
variety has a signature of ( u -- )
. They correspond to system registers.
Word Base Name | Register |
---|---|
DAIF | DAIF |
ELR-EL1 | ELR_EL1 |
ESR-EL1 | ESR_EL1 |
FAR-EL1 | FAR_EL1 |
MAIR-EL1 | MAIR_EL1 |
SCTLR-EL1 | SCTLR_EL1 |
SPSEL | SPSel |
SPSR-EL1 | SPSR_EL1 |
TCR-EL1 | TCR_EL1 |
TTBR0-EL1 | TTBR0_EL1 |
TTBR1-EL1 | TTBR0_EL1 |
VBAR-EL1 | VBAR_EL1 |
Other Words
AARCH64/EXCEPTION-VECTOR-TABLE
Signature: ( -- addr )
Pushes the address of the kernel-allocated exception vector table to the stack. This is allocated in the kernel so it is guaranteed to have proper alignment, since it needs a rather large one. (In the future, this might change, and it may instead be allocated inside a frame allocated by the standard frame allocator.)
StahlOS Forth Internals
Some internal details of the Forth implementation used in the StahlOS kernel is briefly described here. This ABI is tentative and may change.
Word Structure
StahlOS Forth is a DTC Forth.
Words must be aligned to a multiple of 8 bytes.
0x00 +-------------------+ --\
| Previous Word | |
| Header Address | |
0x08 +-------------------+ |
| Name Length | |
0x09 +-------------------+ |
| Word Flags | |
| (see below) | |
0x0c +-------------------+ |
| Word Name | +-- Word Header
+-------------------+ |
| Padding | | /- Multiple of 4 bytes
| (0 to 3 bytes) | | |
0x?? +-------------------+ - | -+
| Name Length | | |
0x?? +-------------------+ | |
| Padding (3 bytes) | | |
0x?? +-------------------+ --/--/
| Code Field | \
+-------------------+ +-- Word Body
| Parameter Field | |
+-------------------+ --/
Word Flags are:
0x08 0x09 0x0a 0x0b
LSB MSB LSB MSB LSB MSB LSB MSB
+---------------+-+-+-+-+-+-+-+-+---------------+---------------+
| Name Length |S|I|0|0|0|0|0|0| Reserved | Reserved |
+---------------+-+-+-+-+-+-+-+-+---------------+---------------+
0
: Reserved, must be zero.S
: Smudge.0
for visible,1
for hidden.I
: Immediate.0
for non-immediate,1
for immediate.
Registers
x0-x15
: Temporariesx16-x19
: Reserved for processesx20
: The Forth instruction pointerx21
: The process context pointerx22
: The stack base pointerx23
: The stack top pointerx24
: The top stack itemx25
: The return stack base pointerx26
: The return stack top pointerx27
: The top return stack itemx28
: Temporaryx29
: Temporary, used byDOES>
/(DODOES)
The x22
register should always be x21 + 0x0200
. The x23
register should be x22 + n
, where n
is the depth of the stack in bytes.
x25
and x26
have the same relationship x22
and x23
do but for the return stack, with x25
always containing x21 + 0x0100
.
Process Context
0x0000 +---------------------------+---------------------------+
| Low 64 Bits of PID | Saved Forth Insn Pointer |
0x0010 +---------------------------+---------------------------+
| Dict Pointer | Dict Remaining Length |
0x0020 +---------------------------+---------------------------+
| Last Defined Header | Process Flags |
0x0030 +---------------------------+---------------------------+
| Source Address | Source Length |
0x0040 +---------------------------+---------------------------+
| |
... Reserved ...
| |
0x00e0 +---------------------------+---------------------------+
| Saved x16 | Saved x17 |
0x00f0 +---------------------------+---------------------------+
| Saved x18 | Saved x19 |
0x0100 +---------------------------+---------------------------+
| Saved Return Stack Depth | |
0x0108 +---------------------------+ |
| |
... Return Stack ...
| |
0x0200 +---------------------------+---------------------------+
| Saved Stack Depth | |
0x0208 +---------------------------+ |
| |
... Stack ...
| |
0x0400 +-------------------------------------------------------+
Process Flags are:
0x28 0x29 0x2a 0x2b
LSB MSB LSB MSB LSB MSB LSB MSB
+-+-+-----------+---------------+---------------+---------------+
|B|S| Reserved | Reserved | Reserved | Reserved |
+-+-+-----------+---------------+---------------+---------------+
0x2c 0x2d 0x2e 0x2f
LSB MSB LSB MSB LSB MSB LSB MSB
+---------------+---------------+---------------+---------------+
| Reserved | Reserved | Reserved | Reserved |
+---------------+---------------+---------------+---------------+
B
: Base.0
for decimal,1
for hex.S
: State.0
for interpret,1
for compile.