November 2023 · 10 min read

Writing the First Compiler

Every language starts somewhere. Lateralus started as a 200-line tokenizer in a Python file called pipe.py.

Day one: a tokenizer

The very first version could only recognize three things: identifiers, the pipeline operator |>, and parentheses. I wrote it in an afternoon. It couldn't do anything yet, but seeing TOKEN(PIPE_ARROW, '|>') in the output felt like magic.
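A day-one tokenizer for just those three token kinds fits in a screenful of Python. This is a reconstruction, not the original pipe.py, but the shape is the same:

```python
import re

# Day-one token kinds: identifiers, the pipeline operator, parentheses.
# Order matters: |> must be tried before any single-character fallback.
TOKEN_SPEC = [
    ("PIPE_ARROW", r"\|>"),
    ("IDENT",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("LPAREN",     r"\("),
    ("RPAREN",     r"\)"),
    ("SKIP",       r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return (kind, text) pairs, skipping whitespace."""
    return [
        (m.lastgroup, m.group())
        for m in TOKEN_RE.finditer(source)
        if m.lastgroup != "SKIP"
    ]
```

Running `tokenize("data |> clean")` produces `[("IDENT", "data"), ("PIPE_ARROW", "|>"), ("IDENT", "clean")]` — the TOKEN(PIPE_ARROW, '|>') moment.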

Week one: parsing pipelines

The parser came next. I chose a Pratt parser because pipeline operators have a natural precedence. The core insight: |> is a left-associative binary operator where the right side is always a function call, and the left side becomes its first argument.

# The first AST node for pipelines (Python)
class PipelineExpr:
    def __init__(self, left, fn_name, args):
        self.left = left
        self.fn_name = fn_name
        self.args = args
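Left-associativity means a chain like data |> clean() |> summarize(2) nests with each earlier pipeline as the left child of the next, and desugaring is then one recursive rule: x |> f(a) becomes f(x, a). A sketch using the node above (the desugar helper is mine, not from the original):

```python
class PipelineExpr:
    def __init__(self, left, fn_name, args):
        self.left = left
        self.fn_name = fn_name
        self.args = args

# data |> clean() |> summarize(2) parses left-associatively:
ast = PipelineExpr(PipelineExpr("data", "clean", []), "summarize", [2])

def desugar(expr):
    """Rewrite x |> f(a, b) into the plain call f(x, a, b), recursively."""
    if not isinstance(expr, PipelineExpr):
        return str(expr)
    call_args = [desugar(expr.left)] + [str(a) for a in expr.args]
    return f"{expr.fn_name}({', '.join(call_args)})"
```

Here desugar(ast) yields "summarize(clean(data), 2)" — the pipeline is just nested calls with the left side threaded through as the first argument.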

Month one: a tree-walk interpreter

Before generating any code, I built a tree-walk interpreter. It was slow but made iteration instant — change the grammar, rerun, see results. This is where the standard library functions like map, filter, and reduce were born.
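Evaluating a pipeline node in a tree-walk interpreter is a single recursive step: evaluate the left side, then call the named function with that value as its first argument. A minimal sketch (the builtin table is hypothetical, though map/filter/reduce match the functions mentioned above):

```python
from functools import reduce

class PipelineExpr:
    def __init__(self, left, fn_name, args):
        self.left, self.fn_name, self.args = left, fn_name, args

# A stand-in for the interpreter's builtin table.
BUILTINS = {
    "map":    lambda xs, f: [f(x) for x in xs],
    "filter": lambda xs, f: [x for x in xs if f(x)],
    "reduce": lambda xs, f, init: reduce(f, xs, init),
}

def evaluate(expr):
    if not isinstance(expr, PipelineExpr):
        return expr  # anything else is already a value in this sketch
    value = evaluate(expr.left)  # evaluate the left-hand side first
    return BUILTINS[expr.fn_name](value, *expr.args)

# [1, 2, 3] |> map(x * 2) |> filter(> 2)
ast = PipelineExpr(
    PipelineExpr([1, 2, 3], "map", [lambda x: x * 2]),
    "filter", [lambda x: x > 2],
)
```

evaluate(ast) returns [4, 6] — slow, but every grammar change is one rerun away from a result.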

The first real program

The first program that felt "real" was a pipeline that read a CSV, filtered rows, and printed a summary. It was 8 lines of Lateralus replacing 25 lines of Python. That's when I knew the core idea worked.

Lesson: start small

If you want to build a language, don't start with LLVM. Start with a tokenizer. Make it recognize your one novel operator. Then parse it. Then interpret it. The compiler can come later — and it did, six months later, when we switched to bytecode and then to C99 codegen.

The first 1,000 lines are the hardest. After that, you're just adding features to something that already works.

The lexer

The lexer (tokenizer) is the simplest phase: turn a stream of characters into a stream of tokens. Lateralus's lexer is 400 lines of Lateralus:

enum Token {
    // Literals
    IntLit(i64),
    StringLit(String),
    
    // Keywords
    Let, Fn, Match, If, Else, Return,
    
    // Operators
    Pipe,       // |>
    PipeError,  // |?>
    Arrow,      // ->
    FatArrow,   // =>
    
    // Delimiters
    LParen, RParen, LBrace, RBrace, LBracket, RBracket,
    
    // Identifiers
    Ident(String),
}

fn tokenize(source: String) -> Vec<Token> {
    let chars = source.chars() |> peekable()
    let tokens = []
    
    while chars.has_next() {
        match chars.peek() {
            ' ' | '\n' | '\t' => chars.next(),  // Skip whitespace
            '/' if chars.peek_at(1) == '/' => skip_line_comment(chars),
            '|' if chars.peek_at(1) == '>' => { chars.advance(2); tokens.push(Pipe) },
            '0'..'9' => tokens.push(read_number(chars)),
            '"' => tokens.push(read_string(chars)),
            c if c.is_alphabetic() => tokens.push(read_ident_or_keyword(chars)),
            c => tokens.push(read_symbol(chars)),
        }
    }
    tokens
}
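One detail the match above glosses over: Lateralus has both |> and |?>, so the lexer must try longer operators first (maximal munch), or |?> would be mis-read starting from the lone |. In Python terms, over a plain string cursor:

```python
# Multi-character operators, longest first, so "|?>" never lexes as "|" + "?>".
OPERATORS = ["|?>", "|>", "=>", "->"]

def read_operator(source, pos):
    """Return (lexeme, new_pos), preferring the longest matching operator."""
    for op in OPERATORS:  # kept sorted longest-first
        if source.startswith(op, pos):
            return op, pos + len(op)
    return source[pos], pos + 1  # fall back to a single character
```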

The parser

The parser turns tokens into an AST (Abstract Syntax Tree). We use a recursive descent parser, with Pratt-style precedence handling for expressions; each grammar rule is a function:

fn parse_pipeline(tokens: &mut TokenStream) -> Expr {
    let mut expr = parse_primary(tokens)
    
    while tokens.peek() == Pipe {
        tokens.consume(Pipe)
        let func = parse_call(tokens)
        expr = PipeExpr { input: expr, function: func }
    }
    
    expr
}

fn parse_match(tokens: &mut TokenStream) -> Expr {
    tokens.consume(Match)
    let scrutinee = parse_expr(tokens)
    tokens.consume(LBrace)
    
    let arms = []
    while tokens.peek() != RBrace {
        let pattern = parse_pattern(tokens)
        tokens.consume(FatArrow)
        let body = parse_expr(tokens)
        arms.push(MatchArm { pattern, body })
        tokens.consume_optional(Comma)
    }
    
    tokens.consume(RBrace)
    MatchExpr { scrutinee, arms }
}

Recursive descent is the technique most production compilers use — GCC, Clang, rustc, and the Go compiler all hand-write recursive descent parsers. It's predictable, debuggable, and produces excellent error messages, because each parsing function knows its grammatical context.
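The same parse_pipeline is easy to sketch in runnable Python over a flat token list — notice how the error message can say exactly what it expected, because the function knows it just consumed a |> (everything beyond the token names here is hypothetical):

```python
class PipeExpr:
    def __init__(self, input, function):
        self.input, self.function = input, function

class TokenStream:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "EOF"

    def next(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

def parse_primary(ts):
    kind, text = ts.next()
    if kind != "Ident":
        raise SyntaxError(f"expected an expression, found {kind}")
    return text

def parse_pipeline(ts):
    expr = parse_primary(ts)
    while ts.peek() == "Pipe":
        ts.next()  # consume |>
        if ts.peek() != "Ident":
            # Context-aware error: we know a function call must follow |>.
            raise SyntaxError("expected a function call after |>")
        expr = PipeExpr(expr, ts.next()[1])
    return expr

tokens = [("Ident", "data"), ("Pipe", "|>"), ("Ident", "clean"),
          ("Pipe", "|>"), ("Ident", "summarize")]
```

Parsing those tokens produces PipeExpr(PipeExpr("data", "clean"), "summarize") — the same left-leaning shape as the Lateralus version.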

Type checking

The type checker walks the AST and annotates every node with its type. Lateralus uses Hindley-Milner type inference, which means you rarely write type annotations:

// The compiler infers all types here:
let data = [1, 2, 3]                    // Vec<Int>
let doubled = data |> map(|x| x * 2)   // Vec<Int>
let text = doubled |> map(to_string)    // Vec<String>
let result = text |> join(", ")         // String

// Type inference across function boundaries:
fn process(items) {              // items: Vec<T> (inferred from usage)
    items |> filter(|i| i > 0)  // Constrains T to numeric types
          |> sum()               // Returns T
}
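Under the hood, Hindley-Milner inference works by unification: collecting equations between types (some containing type variables) and solving them. A minimal unifier, with type variables as strings and constructed types as tuples (the representation is hypothetical; full HM also needs an occurs check and let-generalization):

```python
# Types: variables are strings ("T0"); constructed types are tuples whose
# first element is the constructor name: ("Int",), ("Vec", ("Int",)).
def is_var(t):
    return isinstance(t, str)

def resolve(t, subst):
    """Chase substitutions until t is a constructor or an unbound variable."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    """Extend subst so that a and b become equal, or raise TypeError."""
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return subst
    if is_var(a):
        subst[a] = b
        return subst
    if is_var(b):
        subst[b] = a
        return subst
    # Both are constructors: heads and arities must match, then unify children.
    if a[0] != b[0] or len(a) != len(b):
        raise TypeError(f"cannot unify {a} with {b}")
    for x, y in zip(a[1:], b[1:]):
        subst = unify(x, y, subst)
    return subst
```

Unifying Vec<T0> with Vec<Int> binds T0 := Int — which is exactly how `let data = [1, 2, 3]` pins down the element type in the example above.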

Self-hosting milestone

The original compiler was written in Python (3,000 lines). The first self-hosted compiler was a direct port to Lateralus (2,400 lines — Lateralus is more concise). The bootstrap process: Stage 0 (the Python compiler) compiles the Lateralus compiler's source into a Stage 1 binary; Stage 1 compiles the same source to produce Stage 2; and Stage 2 compiles it once more to produce Stage 3.

Stage 2 = Stage 3 confirms the compiler correctly compiles itself. This is the gold standard for compiler correctness, used by GCC, Clang, and Rust.

Lateralus is built by bad-antics. Follow development on GitHub or try the playground.