Multi-Hart Scheduling on RISC-V
A single-core OS is a toy. FRISC OS now supports symmetric multiprocessing across all available RISC-V harts.
Hart discovery
On RISC-V, each CPU core is called a "hart" (hardware thread). At boot, only hart 0 runs. We discover additional harts through the device tree blob (DTB) and wake them via the SBI (Supervisor Binary Interface).
Per-hart state
Each hart gets its own kernel stack, trap vector, and run queue. The tp (thread pointer) register holds a pointer to the per-hart structure:
struct HartLocal {
    hart_id: UInt,
    current_task: *Task,
    run_queue: VecDeque<*Task>,
    idle_task: Task,
    tick_count: UInt,
}
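On hardware, tp is written once during hart bring-up and then dereferenced on every trap and scheduler entry. On a hosted test build, the same pattern can be modeled with thread-local storage, one OS thread per hart. A minimal sketch under that assumption (hart_init and current_hart_id are hypothetical helper names, and the struct is trimmed to a few fields):

```rust
use std::cell::RefCell;
use std::collections::VecDeque;

// Trimmed stand-in for the kernel's per-hart structure.
struct HartLocal {
    hart_id: u64,
    run_queue: VecDeque<u64>, // task IDs stand in for *Task pointers
    tick_count: u64,
}

// On real hardware `tp` points at this structure; thread-local storage
// plays the same role when each "hart" is an OS thread.
thread_local! {
    static HART: RefCell<Option<HartLocal>> = RefCell::new(None);
}

fn hart_init(hart_id: u64) {
    HART.with(|h| {
        *h.borrow_mut() = Some(HartLocal {
            hart_id,
            run_queue: VecDeque::new(),
            tick_count: 0,
        });
    });
}

fn current_hart_id() -> u64 {
    HART.with(|h| h.borrow().as_ref().expect("hart not initialized").hart_id)
}
```

The key property is the same in both settings: looking up "my" state never takes a lock, because no other hart can reach it.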
Inter-Processor Interrupts
When hart 0 spawns a task, it may send an IPI to wake an idle hart. IPIs use the SBI send_ipi call, which triggers a software interrupt on the target hart.
Work stealing
When a hart's run queue is empty, it steals tasks from other harts. We use a lock-free deque (Chase-Lev) to minimize contention:
fn steal_work(self_hart: &HartLocal) -> Option<*Task> {
    all_harts()
        |> filter(|h| h.hart_id != self_hart.hart_id)
        |> filter(|h| h.run_queue.len() > 1)
        |> max_by(|h| h.run_queue.len())
        |> and_then(|h| h.run_queue.steal())
}
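Setting the lock-free Chase-Lev machinery aside, the victim-selection policy can be exercised on a host with plain VecDeques. This sketch (steal_from_busiest is a hypothetical name, not FRISC's API) models steal() as a pop from the back of the victim's queue, matching the stealing direction described later:

```rust
use std::collections::VecDeque;

// Pick the hart with the longest run queue (skipping ourselves and
// queues with at most one task), then steal one task from the back.
fn steal_from_busiest(queues: &mut [VecDeque<u32>], self_id: usize) -> Option<u32> {
    let victim = (0..queues.len())
        .filter(|&i| i != self_id)
        .filter(|&i| queues[i].len() > 1)
        .max_by_key(|&i| queues[i].len())?;
    queues[victim].pop_back()
}
```

The `len() > 1` filter matters: stealing a victim's only task just moves the idleness to the other hart without reducing total work queued anywhere.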
Results
With 4 harts on QEMU, we see near-linear speedup for embarrassingly parallel workloads. The scheduler overhead is under 200 cycles per context switch.
The boot protocol
On RISC-V, all harts (hardware threads) start executing at the same address. The first challenge is coordination: only one hart should initialize the kernel while the others wait. We use an atomic spinlock:
// boot.S — SMP boot sequence
_start:
        // Read hart ID from the mhartid CSR
        csrr    t0, mhartid
        // Hart 0 is the bootstrap processor
        bnez    t0, .wait_for_init
        // Hart 0: set up stack, clear BSS, jump to kernel_main
        la      sp, _stack_top
        call    clear_bss
        call    kernel_main
.wait_for_init:
        // Secondary harts: spin until the init flag is set
        la      t1, smp_ready_flag
.spin:
        lr.w    t2, (t1)        // Load-reserved (atomic read)
        beqz    t2, .spin       // Keep spinning if flag is 0
        // Flag is set — initialize this hart's stack and enter scheduler
        call    setup_hart_stack
        call    scheduler_enter
The lr.w (load-reserved) instruction acts here as an atomic read; it is the load half of RISC-V's LR/SC pair. The spin loop itself is cheap on cache-coherent hardware: each waiting hart spins on its locally cached copy of smp_ready_flag, and traffic only occurs when hart 0's store invalidates that line. The real requirement is ordering: hart 0 must execute a fence before setting the flag, so that secondary harts never observe a partially initialized kernel.
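The same handshake can be sketched in hosted Rust with an atomic flag and one thread per secondary hart (boot_handshake is a hypothetical test harness, not kernel code). The Release store and Acquire loads mirror the fence the assembly needs so that a waking hart sees a fully initialized kernel:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// One thread per secondary hart: spin until the bootstrap "hart"
// publishes the ready flag, then report in.
fn boot_handshake(n_secondary: u64) -> Vec<u64> {
    let ready = Arc::new(AtomicBool::new(false));

    // "Secondary harts" park in the lr.w/beqz-style spin loop.
    let waiters: Vec<_> = (1..=n_secondary)
        .map(|hart_id| {
            let ready = Arc::clone(&ready);
            thread::spawn(move || {
                while !ready.load(Ordering::Acquire) {
                    std::hint::spin_loop();
                }
                hart_id // "enter the scheduler"
            })
        })
        .collect();

    // "Hart 0" finishes init, then releases everyone with one store.
    ready.store(true, Ordering::Release);

    // Join in spawn order, so the result is [1, 2, ..., n].
    waiters.into_iter().map(|h| h.join().unwrap()).collect()
}
```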
Per-hart scheduling
Each hart runs its own scheduler instance. There's no global run queue — instead, each hart has a local queue and work is stolen when a hart goes idle:
struct HartScheduler {
    hart_id: u32,
    local_queue: VecDeque<Task>,
    idle: bool,
}
fn scheduler_tick(sched: &mut HartScheduler) {
    match sched.local_queue.pop_front() {
        Some(task) => run_task(task),
        None => {
            // Try to steal from other harts
            for other in all_harts().filter(|h| h.hart_id != sched.hart_id) {
                if let Some(task) = other.local_queue.steal_back() {
                    run_task(task);
                    return;
                }
            }
            // Nothing to do: enter low-power wait
            sched.idle = true;
            wfi(); // Wait For Interrupt
        }
    }
}
Work stealing from the back of another hart's queue minimizes cache invalidation. The stolen task is likely the oldest, meaning its data is less likely to be in the victim hart's L1 cache.
Inter-hart communication
Harts communicate through RISC-V's Inter-Processor Interrupt (IPI) mechanism. We use software interrupts via the CLINT (Core Local Interruptor):
- IPI for scheduling — when a new high-priority task arrives, the creating hart sends an IPI to wake idle harts
- IPI for TLB shootdown — when a page table changes, all harts must flush their TLB entries. The initiating hart sends an IPI and waits for acknowledgment from all others.
- Atomic message queues — for bulk data transfer between harts, we use lock-free MPSC (multi-producer, single-consumer) queues built on RISC-V's AMO (Atomic Memory Operation) instructions.
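The shootdown acknowledgment in the second bullet reduces to a single atomic counter: the initiator sets it to the number of remote harts, each IPI handler decrements it after flushing, and the initiator spins until it reaches zero. A hosted sketch with threads standing in for IPI handlers (tlb_shootdown here is illustrative, not the kernel's actual routine):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Initiator side of a TLB shootdown: publish the pending count,
// "send IPIs" (spawn threads), then wait for every ack.
fn tlb_shootdown(n_remote: usize) -> usize {
    let pending = Arc::new(AtomicUsize::new(n_remote));
    let flushed = Arc::new(AtomicUsize::new(0));

    for _ in 0..n_remote {
        let pending = Arc::clone(&pending);
        let flushed = Arc::clone(&flushed);
        thread::spawn(move || {
            // IPI handler on a remote hart: flush the local TLB
            // (sfence.vma on real hardware), then acknowledge.
            flushed.fetch_add(1, Ordering::Relaxed);
            pending.fetch_sub(1, Ordering::Release);
        });
    }

    // The initiator must not free or reuse the old page-table memory
    // until every remote hart has flushed, hence the wait.
    while pending.load(Ordering::Acquire) != 0 {
        std::hint::spin_loop();
    }
    flushed.load(Ordering::Acquire)
}
```

The Release decrement paired with the Acquire load is what guarantees each hart's flush is visible to the initiator before it proceeds.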
Debugging SMP
SMP bugs are the hardest bugs. Race conditions that appear once per million iterations. Deadlocks that only manifest under specific scheduling patterns. Here's what saved us:
- QEMU's -d int flag — logs every interrupt on every hart with timestamps. Invaluable for diagnosing IPI issues.
- Hart-colored logging — each hart prints in a different color. When debugging a race condition, you can visually trace which hart did what and when.
- Deterministic replay — we added a recording mode that logs every scheduling decision. Replaying the log reproduces the exact interleaving that caused the bug.
- Single-hart mode — boot with max_harts=1 to rule out SMP-specific issues. If the bug disappears, it's a concurrency bug.