Writing the First Compiler
Every language starts somewhere. Lateralus started as a 200-line tokenizer in a Python file called pipe.py.
Day one: a tokenizer
The very first version could only recognize three things: identifiers, the pipeline operator |>, and parentheses. I wrote it in an afternoon. It couldn't do anything yet, but seeing TOKEN(PIPE_ARROW, '|>') in the output felt like magic.
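That day-one version can be sketched in a few lines of Python (a reconstruction for illustration — the names and structure are not the original pipe.py):

```python
import re

# Toy reconstruction of the day-one tokenizer: identifiers, |>, and parens.
TOKEN_SPEC = [
    ("PIPE_ARROW", r"\|>"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("WS", r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(source):
    """Return (kind, text) pairs, skipping whitespace."""
    tokens = []
    for m in PATTERN.finditer(source):
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("data |> clean"))
# → [('IDENT', 'data'), ('PIPE_ARROW', '|>'), ('IDENT', 'clean')]
```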
Week one: parsing pipelines
The parser came next. I chose a Pratt parser because pipeline operators have a natural precedence. The core insight: |> is a left-associative binary operator where the right side is always a function call, and the left side becomes its first argument.
# The first AST node for pipelines (Python)
class PipelineExpr:
    def __init__(self, left, fn_name, args):
        self.left = left
        self.fn_name = fn_name
        self.args = args
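Seeing how the node nests makes the left-associativity concrete. A small illustrative sketch (the show helper is hypothetical, added here only for demonstration):

```python
# How pipeline chains nest using the node above.
class PipelineExpr:
    def __init__(self, left, fn_name, args):
        self.left = left
        self.fn_name = fn_name
        self.args = args

def show(node):
    # Render a node as the call it desugars to: fn_name(left, *args).
    if isinstance(node, PipelineExpr):
        return f"{node.fn_name}({', '.join([show(node.left)] + node.args)})"
    return node

# a |> f(x) |> g(y) nests leftward, desugaring to g(f(a, x), y):
chain = PipelineExpr(PipelineExpr("a", "f", ["x"]), "g", ["y"])
print(show(chain))  # → g(f(a, x), y)
```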
Month one: a tree-walk interpreter
Before generating any code, I built a tree-walk interpreter. It was slow but made iteration instant — change the grammar, rerun, see results. This is where the standard library functions like map, filter, and reduce were born.
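A tree-walk interpreter for pipelines fits in a few lines. This toy Python sketch (the node shape and stdlib entries are simplified stand-ins, not the original interpreter) shows the core dispatch:

```python
from functools import reduce

# Toy tree-walk evaluator for pipeline expressions. A pipeline node
# applies fn_name to (left, *args); anything else evaluates to itself.
class PipelineExpr:
    def __init__(self, left, fn_name, args):
        self.left, self.fn_name, self.args = left, fn_name, args

# A tiny "standard library" of pipeline-friendly functions.
STDLIB = {
    "map": lambda xs, f: [f(x) for x in xs],
    "filter": lambda xs, pred: [x for x in xs if pred(x)],
    "reduce": lambda xs, f, init: reduce(f, xs, init),
}

def evaluate(node):
    if isinstance(node, PipelineExpr):
        left = evaluate(node.left)   # evaluate the piped-in value first
        fn = STDLIB[node.fn_name]
        return fn(left, *node.args)  # left becomes the first argument
    return node                      # literals evaluate to themselves

# [1, 2, 3, 4] |> filter(is_even) |> map(double)
ast = PipelineExpr(
    PipelineExpr([1, 2, 3, 4], "filter", [lambda x: x % 2 == 0]),
    "map", [lambda x: x * 2],
)
print(evaluate(ast))  # → [4, 8]
```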
The first real program
The first program that felt "real" was a pipeline that read a CSV, filtered rows, and printed a summary. It was 8 lines of Lateralus replacing 25 lines of Python. That's when I knew the core idea worked.
Lesson: start small
If you want to build a language, don't start with LLVM. Start with a tokenizer. Make it recognize your one novel operator. Then parse it. Then interpret it. The compiler can come later — and it did, six months later, when we switched to bytecode and then to C99 codegen.
The first 1,000 lines are the hardest. After that, you're just adding features to something that already works.
The lexer
The lexer (tokenizer) is the simplest phase: turn a stream of characters into a stream of tokens. Lateralus's lexer is 400 lines of Lateralus:
enum Token {
    // Literals
    IntLit(i64),
    StringLit(String),
    // Keywords
    Let, Fn, Match, If, Else, Return,
    // Operators
    Pipe,       // |>
    PipeError,  // |?>
    Arrow,      // ->
    FatArrow,   // =>
    // Delimiters
    LParen, RParen, LBrace, RBrace, LBracket, RBracket,
    // Identifiers
    Ident(String),
}
fn tokenize(source: String) -> Vec<Token> {
    let chars = source.chars() |> peekable()
    let tokens = []
    while chars.has_next() {
        match chars.peek() {
            ' ' | '\n' | '\t' => chars.next(), // Skip whitespace
            '/' if chars.peek_at(1) == '/' => skip_line_comment(chars),
            '|' if chars.peek_at(1) == '>' => { chars.advance(2); tokens.push(Pipe) },
            '0'..'9' => tokens.push(read_number(chars)),
            '"' => tokens.push(read_string(chars)),
            c if c.is_alphabetic() => tokens.push(read_ident_or_keyword(chars)),
            c => tokens.push(read_symbol(chars)),
        }
    }
    tokens
}
The parser
The parser turns tokens into an AST (abstract syntax tree). We use a recursive descent parser, with Pratt-style precedence handling for operators — each grammar rule is a function:
fn parse_pipeline(tokens: &mut TokenStream) -> Expr {
    let mut expr = parse_primary(tokens)
    while tokens.peek() == Pipe {
        tokens.consume(Pipe)
        let func = parse_call(tokens)
        expr = PipeExpr { input: expr, function: func }
    }
    expr
}

fn parse_match(tokens: &mut TokenStream) -> Expr {
    tokens.consume(Match)
    let scrutinee = parse_expr(tokens)
    tokens.consume(LBrace)
    let arms = []
    while tokens.peek() != RBrace {
        let pattern = parse_pattern(tokens)
        tokens.consume(FatArrow)
        let body = parse_expr(tokens)
        arms.push(MatchArm { pattern, body })
        tokens.consume_optional(Comma)
    }
    tokens.consume(RBrace)
    MatchExpr { scrutinee, arms }
}
Recursive descent is the parser technique most professional compilers use (GCC, Clang, Rust, Go). It's predictable, debuggable, and produces excellent error messages because each function knows the grammatical context.
Type checking
The type checker walks the AST and annotates every node with its type. Lateralus uses Hindley-Milner type inference, which means you rarely write type annotations:
// The compiler infers all types here:
let data = [1, 2, 3] // Vec<Int>
let doubled = data |> map(|x| x * 2) // Vec<Int>
let text = doubled |> map(to_string) // Vec<String>
let result = text |> join(", ") // String
// Type inference across function boundaries:
fn process(items) {            // items: Vec<T> (inferred from usage)
    items |> filter(|i| i > 0) // Constrains T to numeric types
          |> sum()             // Returns T
}
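The machinery behind this is unification: solving equations between types until every type variable is pinned down. A generic toy unifier in Python (a sketch of the algorithm, not the Lateralus checker's actual representation — here variables are strings starting with "T", and `("Vec", "Int")` stands for Vec<Int>):

```python
# Toy unification, the core of Hindley-Milner inference.
def is_var(t):
    return isinstance(t, str) and t.startswith("T")

def apply_subst(subst, t):
    # Resolve a type under the substitution, recursing into applications.
    if is_var(t):
        return apply_subst(subst, subst[t]) if t in subst else t
    if isinstance(t, tuple):
        return tuple(apply_subst(subst, x) for x in t)
    return t

def unify(a, b, subst):
    a, b = apply_subst(subst, a), apply_subst(subst, b)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
        return subst
    raise TypeError(f"cannot unify {a} with {b}")

# Infer the element type in: data: Vec<T0> unified with Vec<Int>
s = unify(("Vec", "T0"), ("Vec", "Int"), {})
print(apply_subst(s, "T0"))  # → Int
```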
Self-hosting milestone
The original compiler was written in Python (3,000 lines). The first self-hosted compiler was a direct port to Lateralus (2,400 lines — Lateralus is more concise). The bootstrap process:
- Stage 0 — Python compiler compiles Lateralus compiler source to C99
- Stage 1 — GCC compiles the C99 to a native binary (the stage-1 compiler)
- Stage 2 — Stage-1 compiler compiles its own source code to C99, then to native binary
- Stage 3 — Stage-2 compiler compiles its own source. The output must be bit-identical to Stage 2.
Stage 2 = Stage 3 confirms the compiler correctly compiles itself. This is the gold standard for compiler correctness, used by GCC, Clang, and Rust.
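Expressed as a build script, the stages above might look like this (every path, file name, and flag here is hypothetical — a sketch of the process, not the actual build):

```shell
#!/bin/sh
set -e

# Stage 0: the Python compiler emits C99 from the Lateralus compiler source.
python3 bootstrap/compile.py compiler/main.lat > stage1.c

# Stage 1: GCC turns that C99 into the first native compiler.
gcc -std=c99 -O2 -o stage1 stage1.c

# Stage 2: the stage-1 binary compiles its own source.
./stage1 compiler/main.lat > stage2.c
gcc -std=c99 -O2 -o stage2 stage2.c

# Stage 3: the stage-2 binary compiles the same source again.
./stage2 compiler/main.lat > stage3.c

# Fixpoint check: stage-2 and stage-3 output must be bit-identical.
cmp stage2.c stage3.c && echo "bootstrap fixpoint reached"
```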