Multi-Hart Scheduling on RISC-V
A single-core OS is a toy. FRISC OS now supports symmetric multiprocessing across all available RISC-V harts.
Hart discovery
On RISC-V, each CPU core is called a "hart" (hardware thread). At boot, only hart 0 runs. We discover additional harts through the device tree blob (DTB) and wake them via the SBI (Supervisor Binary Interface).
Per-hart state
Each hart gets its own kernel stack, trap vector, and run queue. The tp (thread pointer) register holds a pointer to the per-hart structure:
struct HartLocal {
    hart_id: UInt,
    current_task: *Task,
    run_queue: VecDeque<*Task>,
    idle_task: Task,
    tick_count: UInt,
}
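On hardware, tp is written once during hart bring-up and then dereferenced on every trap and scheduler entry. On a hosted test build, the same pattern can be modeled with thread-local storage, one OS thread per hart. A minimal sketch under that assumption (hart_init and current_hart_id are hypothetical helper names, and the struct is trimmed to a few fields):

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

// Trimmed stand-in for the kernel's per-hart structure.
struct HartLocal {
    hart_id: u64,
    run_queue: VecDeque<u64>, // task IDs stand in for *Task pointers
    tick_count: u64,
}

// On real hardware `tp` points at this structure; thread-local storage
// plays the same role when each "hart" is an OS thread.
thread_local! {
    static HART: RefCell<Option<HartLocal>> = RefCell::new(None);
}

fn hart_init(hart_id: u64) {
    HART.with(|h| {
        *h.borrow_mut() = Some(HartLocal {
            hart_id,
            run_queue: VecDeque::new(),
            tick_count: 0,
        });
    });
}

fn current_hart_id() -> u64 {
    HART.with(|h| h.borrow().as_ref().expect("hart not initialized").hart_id)
}
```

The key property is the same in both settings: looking up "my" state never takes a lock, because no other hart can reach it.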
Inter-Processor Interrupts
When hart 0 spawns a task, it may send an IPI to wake an idle hart. IPIs use the SBI send_ipi call, which triggers a software interrupt on the target hart.
Work stealing
When a hart's run queue is empty, it steals tasks from other harts. We use a lock-free deque (Chase-Lev) to minimize contention:
fn steal_work(self_hart: &HartLocal) -> Option<*Task> {
    all_harts()
        |> filter(|h| h.hart_id != self_hart.hart_id)
        |> filter(|h| h.run_queue.len() > 1)
        |> max_by(|h| h.run_queue.len())
        |> and_then(|h| h.run_queue.steal())
}
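Setting the lock-free Chase-Lev machinery aside, the victim-selection policy can be exercised on a host with plain VecDeques. This sketch (steal_from_busiest is a hypothetical name, not FRISC's API) models steal() as a pop from the back of the victim's queue, matching the stealing direction described later:

```rust
use std::collections::VecDeque;

// Pick the hart with the longest run queue (skipping ourselves and
// queues with at most one task), then steal one task from the back.
fn steal_from_busiest(queues: &mut [VecDeque<u32>], self_id: usize) -> Option<u32> {
    let victim = (0..queues.len())
        .filter(|&i| i != self_id)
        .filter(|&i| queues[i].len() > 1)
        .max_by_key(|&i| queues[i].len())?;
    queues[victim].pop_back()
}
```

The `len() > 1` filter matters: stealing a victim's only task just moves the idleness to the other hart without reducing total work queued anywhere.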
Results
With 4 harts on QEMU, we see near-linear speedup for embarrassingly parallel workloads. The scheduler overhead is under 200 cycles per context switch.
The boot protocol
On RISC-V, all harts (hardware threads) start executing at the same address. The first challenge is coordination: only one hart should initialize the kernel while the others wait. We use an atomic spinlock:
// boot.S — SMP boot sequence
_start:
        // Read hart ID from the mhartid CSR
        csrr    t0, mhartid
        // Hart 0 is the bootstrap processor
        bnez    t0, .wait_for_init
        // Hart 0: set up stack, clear BSS, jump to kernel_main
        la      sp, _stack_top
        call    clear_bss
        call    kernel_main
.wait_for_init:
        // Secondary harts: spin until the init flag is set
        la      t1, smp_ready_flag
.spin:
        lr.w    t2, (t1)        // Load-reserved (atomic read)
        beqz    t2, .spin       // Keep spinning if flag is 0
        // Flag is set — initialize this hart's stack and enter scheduler
        call    setup_hart_stack
        call    scheduler_enter
The lr.w (load-reserved) instruction acts here as an atomic read; it is the load half of RISC-V's LR/SC pair. The spin loop itself is cheap on cache-coherent hardware: each waiting hart spins on its locally cached copy of smp_ready_flag, and traffic only occurs when hart 0's store invalidates that line. The real requirement is ordering: hart 0 must execute a fence before setting the flag, so that secondary harts never observe a partially initialized kernel.
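The same handshake can be sketched in hosted Rust with an atomic flag and one thread per secondary hart (boot_handshake is a hypothetical test harness, not kernel code). The Release store and Acquire loads mirror the fence the assembly needs so that a waking hart sees a fully initialized kernel:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// One thread per secondary hart: spin until the bootstrap "hart"
// publishes the ready flag, then report in.
fn boot_handshake(n_secondary: u64) -> Vec<u64> {
    let ready = Arc::new(AtomicBool::new(false));

    // "Secondary harts" park in the lr.w/beqz-style spin loop.
    let waiters: Vec<_> = (1..=n_secondary)
        .map(|hart_id| {
            let ready = Arc::clone(&ready);
            thread::spawn(move || {
                while !ready.load(Ordering::Acquire) {
                    std::hint::spin_loop();
                }
                hart_id // "enter the scheduler"
            })
        })
        .collect();

    // "Hart 0" finishes init, then releases everyone with one store.
    ready.store(true, Ordering::Release);

    // Join in spawn order, so the result is [1, 2, ..., n].
    waiters.into_iter().map(|h| h.join().unwrap()).collect()
}
```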
Per-hart scheduling
Each hart runs its own scheduler instance. There's no global run queue — instead, each hart has a local queue and work is stolen when a hart goes idle:
struct HartScheduler {
    hart_id: u32,
    local_queue: VecDeque<Task>,
    idle: bool,
}
fn scheduler_tick(sched: &mut HartScheduler) {
    match sched.local_queue.pop_front() {
        Some(task) => run_task(task),
        None => {
            // Try to steal from other harts
            for other in all_harts().filter(|h| h.hart_id != sched.hart_id) {
                if let Some(task) = other.local_queue.steal_back() {
                    run_task(task);
                    return;
                }
            }
            // Nothing to do: enter low-power wait
            sched.idle = true;
            wfi(); // Wait For Interrupt
        }
    }
}
Work stealing from the back of another hart's queue minimizes cache invalidation. The stolen task is likely the oldest, meaning its data is less likely to be in the victim hart's L1 cache.
Inter-hart communication
Harts communicate through RISC-V's Inter-Processor Interrupt (IPI) mechanism. We use software interrupts via the CLINT (Core Local Interruptor):
- IPI for scheduling — when a new high-priority task arrives, the creating hart sends an IPI to wake idle harts
- IPI for TLB shootdown — when a page table changes, all harts must flush their TLB entries. The initiating hart sends an IPI and waits for acknowledgment from all others.
- Atomic message queues — for bulk data transfer between harts, we use lock-free MPSC (multi-producer, single-consumer) queues built on RISC-V's AMO (Atomic Memory Operation) instructions.
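The shootdown acknowledgment in the second bullet reduces to a single atomic counter: the initiator sets it to the number of remote harts, each IPI handler decrements it after flushing, and the initiator spins until it reaches zero. A hosted sketch with threads standing in for IPI handlers (tlb_shootdown here is illustrative, not the kernel's actual routine):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Initiator side of a TLB shootdown: publish the pending count,
// "send IPIs" (spawn threads), then wait for every ack.
fn tlb_shootdown(n_remote: usize) -> usize {
    let pending = Arc::new(AtomicUsize::new(n_remote));
    let flushed = Arc::new(AtomicUsize::new(0));

    for _ in 0..n_remote {
        let pending = Arc::clone(&pending);
        let flushed = Arc::clone(&flushed);
        thread::spawn(move || {
            // IPI handler on a remote hart: flush the local TLB
            // (sfence.vma on real hardware), then acknowledge.
            flushed.fetch_add(1, Ordering::Relaxed);
            pending.fetch_sub(1, Ordering::Release);
        });
    }

    // The initiator must not free or reuse the old page-table memory
    // until every remote hart has flushed, hence the wait.
    while pending.load(Ordering::Acquire) != 0 {
        std::hint::spin_loop();
    }
    flushed.load(Ordering::Acquire)
}
```

The Release decrement paired with the Acquire load is what guarantees each hart's flush is visible to the initiator before it proceeds.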
Debugging SMP
SMP bugs are the hardest bugs. Race conditions that appear once per million iterations. Deadlocks that only manifest under specific scheduling patterns. Here's what saved us:
- QEMU's -d int flag — logs every interrupt on every hart with timestamps. Invaluable for diagnosing IPI issues.
- Hart-colored logging — each hart prints in a different color. When debugging a race condition, you can visually trace which hart did what and when.
- Deterministic replay — we added a recording mode that logs every scheduling decision. Replaying the log reproduces the exact interleaving that caused the bug.
- Single-hart mode — boot with max_harts=1 to rule out SMP-specific issues. If the bug disappears, it's a concurrency bug.