User
So I have a compiler with these phases: Lexer -> Parser (CST) -> AST (view into CST) -> HIR -> THIR -> Codegen (VM bytecode). Now I'm running into a problem. I want to desugar constructs like `for x in array` loops into something more codegen-friendly, and I also have match statements that need some desugaring. I have added the for loop desugaring into while loops at the AST -> HIR transformation. But for match, I need the typechecker to know about all match arms to do exhaustiveness analysis and such, so I can't desugar into if statements. I was basically trying to mimic Rust, which in the playground has a "Show HIR" button where I can see desugared for loops. Match statements are not desugared there, only at the MIR level. But even at MIR, Rust still does analysis like borrow checking. The problem with analyzing desugared code is that when reporting errors you have to show the user his source code, not your compiler's desugared code. So if I desugar at the HIR layer, my only option is to carry both the source nodes and the desugared nodes: I use the source for typechecking and errors, and the desugared form for codegen. If there's an error in desugared code, I guess I can use synthetic spans that refer back to the user's code so errors point at it, not at my desugared code. Another solution I thought about is to add another layer after THIR called DHIR (Desugared HIR) or something like that, where errors can no longer happen. But if I ever need a CFG-based MIR where errors can be found, I'm going to run into the same problem. How do other compilers deal with this? What would be the best approach?
Model
This is a classic compiler-architecture dilemma: the trade-off between IR simplicity (desugaring early) and diagnostic fidelity (keeping the original structure).
You have correctly identified the standard solution (synthetic spans) and the standard frustration (showing errors on code that "doesn't exist" in the user's file).
Here is how production compilers (like Rust, Swift, and Roslyn) typically solve this, and how you should fit it into your pipeline.
1. The Golden Rule: Span Propagation
Your intuition about "synthetic spans" is correct. This is not just a backup option; it is the industry standard way to handle this.
You do not need to carry both the source node and the desugared node in the AST/HIR. You carry the desugared node, but you carefully assign it the Span (file location) of the original source construct.
Example: Desugaring a for loop
Suppose the user writes:
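For instance, in a Rust-like surface syntax (the exact form depends on your language):

```
for x in arr {
    process(x)
}
```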
When you desugar this into a while loop in your AST->HIR phase:
1. The while node inherits the span of the for keyword (or the whole loop).
2. The iterator variable (hidden) inherits the span of arr.
3. The .next() call (hidden) inherits the span of arr or in.
If a Type Error occurs in the desugared code:
If arr is not an iterator, the type checker (running on HIR/THIR) will complain that "type Array does not have method .next()". Because the method call node has the span of arr, the error highlights arr in the user's source code. The user sees: "Error: arr is not iterable"—which is perfectly intelligible, even though the user never wrote .next().
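A minimal sketch of this bookkeeping, using a hypothetical dict-shaped HIR (a real for-desugaring would produce a loop plus a match on `next()`, but the span rule is the same):

```python
def desugar_for(iter_expr, body, for_span):
    """Lower `for x in <iter_expr> { <body> }` into a while-style loop.
    Every synthetic node reuses a span taken from code the user wrote."""
    next_call = {
        "kind": "method_call",
        "recv": iter_expr,
        "method": "next",            # synthetic: the user never typed this...
        "span": iter_expr["span"],   # ...so it borrows the span of `arr`
    }
    return {
        "kind": "while",
        "cond": next_call,
        "body": body,
        "span": for_span,            # the whole loop keeps the `for` span
    }

# The synthetic `.next()` call now points back at `arr` in the source:
arr = {"kind": "var", "name": "arr", "span": (10, 13)}
loop = desugar_for(arr, {"kind": "block", "stmts": [], "span": (15, 20)}, (6, 20))
assert loop["cond"]["span"] == (10, 13)
```

So when the type checker rejects the `.next()` call, the diagnostic underlines `arr`, exactly as described above.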
2. The Desugaring Hierarchy
You should split your desugaring into two distinct buckets. You are currently trying to do everything at once, which is causing the friction with match.
A. Syntactic Sugar (Early Desugaring)
Phase: AST $\to$ HIR
Strategy: Desugar things that map purely structurally and don't require type information.
* Constructs: for loops, while let, if let, unless, self-assign operators (+=).
* Reason: These can be lowered to loop, match, and assignment immediately. This simplifies your HIR so your type checker has fewer node types to handle.
* Diagnostics: As mentioned above, rely on span propagation.
B. Semantic Constructs (Late Lowering)
Phase: THIR $\to$ MIR (or THIR $\to$ Bytecode)
Strategy: Keep these high-level during type checking, then lower them once you have types.
* Constructs: match expressions, closures (sometimes), async/await.
* Reason: As you noted, you need the "matrix" of patterns to check exhaustiveness. You cannot easily check exhaustiveness on a series of nested if/else statements.
* Workflow:
1. HIR: match exists as a tree node.
2. THIR: Type checker validates types of arms and performs Exhaustiveness Checking on the high-level match node.
3. Lowering: Once THIR is validated, you compile the match into a Decision Tree or a CFG (Control Flow Graph) of jumps/branches.
3. Do you need a "DHIR" or "MIR"?
You mentioned: THIR $\to$ Codegen (VM Bytecode).
The jump from a Typed Tree (THIR) directly to VM Bytecode is often too large if you have complex control flow like match. This is where MIR (Mid-level IR) comes in. In your nomenclature, "DHIR" is essentially MIR.
I strongly recommend inserting a CFG-based IR (Control Flow Graph) after THIR but before Bytecode.
The Pipeline:
1. AST: Raw parse tree.
2. HIR: Desugared for loops. Everything is a tree. Names are resolved.
3. THIR: Fully typed. match is still a high-level node.
* Action: Perform exhaustiveness checks here.
* Action: Report "missing match arm" errors here.
4. MIR (Your DHIR): Lower match into jumps/gotos. Flatten nested expressions.
* Action: Borrow checking (if you have it) usually happens here because it needs a Control Flow Graph to understand lifetimes across jumps.
* Note: Even though the code looks nothing like the source, the nodes still hold the original Spans.
5. Codegen: Emit bytecode.
4. How to show errors on Desugared MIR
You asked: "If I desugar... and find errors, I have to show the user his source code."
This is the hardest part of compiler UX. Here is how you handle it:
Scenario: You have lowered match into a complex web of GOTO statements in MIR. You run a borrow checker on that MIR and find a use-after-free error.
- The Span: The specific MIR instruction that causes the error (e.g., `move _1`) holds the span of the variable usage in the source code. You highlight that.
- The Context: If the error is specific to the desugaring logic (e.g., a variable created by the desugaring is invalid), you use a "Note":
  - Primary Error: "Borrow of moved value: `x`" (points to user code).
  - Note: "This error occurs in the desugaring of a `for` loop." (You can track a "Desugaring Origin" in your span data structure.)
Summary of Recommendation
- Desugar `for` loops in AST $\to$ HIR. Ensure the generated nodes point to the source text spans. This keeps your type checker simple (it doesn't need to know what a `for` loop is).
- Keep `match` in HIR and THIR. Do your exhaustiveness checking on the high-level representation.
- Create a Lowering Phase (THIR $\to$ MIR). In this phase:
  - Convert the `match` into a Decision Tree or If-Else chain.
  - Convert expressions into a Control Flow Graph (linear basic blocks).
- Codegen from MIR.
Next Step:
Would you like to see a concrete example of how to implement the Span propagation for a for loop desugaring to ensure error messages point to the right place?
User
Okay, I want to understand exactly what an MIR looks like. So, for Rust, I remember reading a couple of things:
- This blog post: https://blog.rust-lang.org/2016/04/19/MIR/
- And the RFC: https://github.com/rust-lang/rfcs/blob/master/text/1211-mir.md
Also in the Rust playground I can see the MIR.
Given this program:
enum Status {
Active(i32),
Idle
}
fn print(s: &Status) {
for _ in 0..10 {
let s = match s {
Status::Active(n) => String::from("active"),
Status::Idle => String::from("idle")
};
}
}
It shows me this MIR:
// WARNING: This output format is intended for human consumers only
// and is subject to change without notice. Knock yourself out.
// HINT: See also -Z dump-mir for MIR at specific points during compilation.
fn print(_1: &Status) -> () {
debug s => _1;
let mut _0: ();
let mut _2: std::ops::Range<i32>;
let mut _3: std::ops::Range<i32>;
let mut _4: std::ops::Range<i32>;
let mut _5: std::option::Option<i32>;
let mut _6: &mut std::ops::Range<i32>;
let mut _7: isize;
let mut _9: isize;
scope 1 {
debug iter => _4;
let _8: std::string::String;
let _10: &i32;
scope 2 {
debug s => _8;
}
scope 3 {
debug n => _10;
}
}
bb0: {
_3 = std::ops::Range::<i32> { start: const 0_i32, end: const 10_i32 };
_2 = <std::ops::Range<i32> as IntoIterator>::into_iter(move _3) -> [return: bb1, unwind continue];
}
bb1: {
_4 = move _2;
goto -> bb2;
}
bb2: {
_6 = &mut _4;
_5 = <std::ops::Range<i32> as Iterator>::next(copy _6) -> [return: bb3, unwind continue];
}
bb3: {
_7 = discriminant(_5);
switchInt(move _7) -> [0: bb6, 1: bb5, otherwise: bb4];
}
bb4: {
unreachable;
}
bb5: {
_9 = discriminant((*_1));
switchInt(move _9) -> [0: bb8, 1: bb7, otherwise: bb4];
}
bb6: {
return;
}
bb7: {
_8 = <String as From<&str>>::from(const "idle") -> [return: bb9, unwind continue];
}
bb8: {
_10 = &(((*_1) as Active).0: i32);
_8 = <String as From<&str>>::from(const "active") -> [return: bb9, unwind continue];
}
bb9: {
drop(_8) -> [return: bb2, unwind continue];
}
}
alloc2 (size: 4, align: 1) {
69 64 6c 65 │ idle
}
alloc1 (size: 6, align: 1) {
61 63 74 69 76 65 │ active
}
fn Status::Active(_1: i32) -> Status {
let mut _0: Status;
bb0: {
_0 = Status::Active(move _1);
return;
}
}
// MIR FOR CTFE
fn Status::Active(_1: i32) -> Status {
let mut _0: Status;
bb0: {
_0 = Status::Active(move _1);
return;
}
}
Model
This is an excellent example of MIR. You’ve struck gold with this snippet because it perfectly demonstrates control flow flattening, which is exactly what you need to solve your match and for loop desugaring problems.
Here is the breakdown of exactly what you are looking at and how it solves the architectural problems you described.
1. The Anatomy of MIR
Unlike your AST or HIR (which are trees), MIR is a Control Flow Graph (CFG). This means the concept of "nesting" (nested ifs, nested loops) is completely gone.
- Locals (`_0`, `_1`, ...): All variables are declared at the top.
  - `_0`: Always the return value.
  - `_1`...`_n`: Arguments to the function.
  - `_n+1`...: Temporary variables generated by the compiler. Notice how many there are (`_2` to `_10`). MIR is very explicit about memory.
- Basic Blocks (`bb0`, `bb1`, ...): The code is sliced into chunks. A "Basic Block" is a sequence of straight-line code that always ends with a "Terminator" (a jump, a switch, or a return).
- Terminators: These are the traffic cops: `goto`, `switchInt`, `return`, `unreachable`.
2. How match is Desugared (The switchInt)
You mentioned you were stuck on how to desugar match while keeping semantics. Look at bb5 in your output. This is the heart of your match s { ... }.
bb5: {
// 1. Read the "tag" or variant index of the enum
_9 = discriminant((*_1));
// 2. Branch based on that integer
switchInt(move _9) -> [0: bb8, 1: bb7, otherwise: bb4];
}
- `discriminant`: This is a special MIR instruction that extracts the integer value of the enum variant (e.g., `Active` = 0, `Idle` = 1).
- `switchInt`: This is the low-level version of `match`. It takes an integer and jumps to a specific block.
  - If `0` (`Active`) $\to$ Go to `bb8` (where the code for `Active` lives).
  - If `1` (`Idle`) $\to$ Go to `bb7` (where the code for `Idle` lives).
  - `otherwise` $\to$ Go to `bb4` (which contains `unreachable`, handling the "impossible" case).
Why this helps you: By the time you get to MIR, you don't need to know about "patterns" or "exhaustiveness." You already checked that in THIR. Now you just need raw integer comparisons and jumps. This is incredibly friendly for Codegen.
3. How for is Desugared (The Loop)
You can see the for loop has been completely exploded into the Iterator protocol.
- Setup (`bb0`): Creates the range `0..10` (`_3`) and converts it into an iterator (`_2`).
- Loop Header (`bb2`): This is where each iteration "starts". It takes `&mut` to the iterator and calls `.next()`, landing in `bb3` with the resulting `Option`.
- The Check (`bb3`): This is the desugared condition. It reads `discriminant(_5)` to see whether `.next()` returned `Some` or `None`, then branches with `switchInt`:
  - If `0` (`None`) $\to$ Go to `bb6` (Exit loop / Return).
  - If `1` (`Some`) $\to$ Go to `bb5` (Execute loop body / The Match).
4. How to Apply This to Your Compiler
You are currently struggling because you are trying to do two incompatible things in one IR:
1. High-Level Analysis: Type checking, exhaustiveness (Requires match to be a tree).
2. Low-Level Execution: Bytecode generation (Requires match to be jumps).
The Solution: Add the MIR Phase
You should insert a phase between THIR and Codegen.
Phase 1: THIR (Tree-based, Typed)
* Keep match as a complex node: Match { scrutinee, arms: [...] }.
* Error Checking: Run your exhaustiveness check here. If a case is missing, you have the tree structure to easily report "Missing match arm for Status::Idle".
Phase 2: MIR Construction (The "Lowering" step)
* This is where you convert the Tree into the Graph.
* Algorithm for match:
1. Create a "start" block (e.g., bb5).
2. Emit discriminant instruction.
3. Create a block for each arm (bb7, bb8).
4. Emit a switchInt in the start block pointing to those arm blocks.
5. Recursively lower the code inside the arms.
6. Make all arms jump to a common "merge" block (end of match) or loop header.
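As a sketch, those steps can be written down in a few lines (Python for illustration; the block/terminator shapes here are hypothetical, and the arm bodies are assumed to be already-lowered instruction lists):

```python
def lower_match(blocks, scrutinee_local, arms):
    """Append the blocks for one `match` to `blocks`.
    arms: list of (variant_index, lowered_instructions).
    Returns (start_block_id, merge_block_id)."""
    start = len(blocks)                                    # 1. the "start" block
    blocks.append({"instrs": [("discriminant", "_tag", scrutinee_local)],  # 2.
                   "terminator": None})                    # wired up below

    arm_ids = []
    for _, instrs in arms:                                 # 3. one block per arm
        arm_ids.append(len(blocks))
        blocks.append({"instrs": list(instrs), "terminator": None})

    merge = len(blocks)                                    # 6. common merge block
    blocks.append({"instrs": [], "terminator": None})      # wired to what follows

    targets = {variant: bid for (variant, _), bid in zip(arms, arm_ids)}
    blocks[start]["terminator"] = ("switchInt", "_tag", targets)   # 4.
    for bid in arm_ids:                                    # 6. arms jump to merge
        blocks[bid]["terminator"] = ("goto", merge)
    return start, merge
```

Exhaustiveness never comes up here: by this point THIR has already guaranteed that `targets` covers every variant.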
Phase 3: Codegen
* Now Codegen is trivial! You just iterate over the list of Basic Blocks.
* switchInt becomes a jump_table or cmp/jmp instructions in your VM.
* goto becomes a jmp.
* You don't need to manage recursion or stack depth in your codegen logic because the MIR is already flat.
5. Handling Errors in Desugared Code
Revisiting your worry about errors:
"If there's error in desugared code... errors point at it, not at my desugared code."
In MIR, every statement carries a Span.
- Example: The call to `next()` in the loop header (`bb2`) is generated by the compiler; the user never wrote it.
- The Span: You attach the span of the `for` keyword (or the `in` keyword) from the original AST to this `next()` call instruction in the MIR.
- The Error: If `next()` fails (e.g., the borrow checker says the iterator is moved), the error engine looks at the MIR instruction, grabs the span (which points to `for`), and prints: "Error at line X: Borrow of moved value in for loop".
The user never sees `bb2` or `next()`. They just see their loop highlighted.
Summary Recommendation
Do not try to desugar match inside your AST or HIR. It is too early.
- Keep `match` structured until THIR is finished.
- Build a MIR (Control Flow Graph) specifically for lowering `match` into switches and loops into gotos.
- Codegen from MIR. It will make your VM bytecode generation incredibly simple (almost a 1-to-1 mapping).
User
Interesting. I know CPython uses CFG as well, can you show me an example
Model
CPython absolutely uses a CFG, though it exposes it a bit differently than Rust. In Python, the CFG generation happens deep inside the compiler (in compile.c) just before the final bytecode is emitted.
While Rust's MIR is a distinct intermediate language you can dump and read, Python's CFG is usually invisible unless you disassemble the final bytecode. However, the structure is exactly the same: Basic Blocks connected by Jumps.
Here is an example using Python's dis (disassembler) module, annotated to show you the hidden Control Flow Graph.
The Python Source
Let's use a loop with a conditional, similar to your Rust example, to see how the control flow splits and merges.
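For instance, a minimal function of that shape (a plausible reconstruction; any while-plus-if/else body works the same way):

```python
def generic_example(x):
    while x > 0:
        if x % 2 == 0:
            print("even")
        else:
            print("odd")
        x -= 1
```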
The "CFG" (Disassembled Bytecode)
If you run import dis; dis.dis(generic_example), you get the linear bytecode. I have grouped the instructions below into Basic Blocks (BB) so you can see the graph structure that the compiler built. (Opcode names and offsets are illustrative, from the Python 3.7 era; SETUP_LOOP was removed in 3.8, but the block structure is the same.)
Note: The >> arrows indicate jump targets (entry points to a basic block).
BB0: Loop Setup & Header
This is the entry point. It immediately checks the loop condition.
  2           0 SETUP_LOOP              50 (to 52)   # Pushes a block for 'break' handling
        >>    2 LOAD_FAST                0 (x)
              4 LOAD_CONST               1 (0)
              6 COMPARE_OP               4 (>)
              8 POP_JUMP_IF_FALSE       50           # TERMINATOR: If False, jump to BB4 (Exit)
BB1: The if Condition
We entered the loop. Now we check the if.
  3          10 LOAD_FAST                0 (x)
             12 LOAD_CONST               2 (2)
             14 BINARY_MODULO
             16 LOAD_CONST               1 (0)
             18 COMPARE_OP               2 (==)
             20 POP_JUMP_IF_FALSE       32           # TERMINATOR: If False, jump to BB3 (Else)
BB2: The then Branch
The "True" path of the if statement.
  4          22 LOAD_GLOBAL              0 (print)
             24 LOAD_CONST               3 ('even')
             26 CALL_FUNCTION            1
             28 POP_TOP
             30 JUMP_ABSOLUTE           40           # TERMINATOR: Jump to the merge point
BB3: The else Branch & Merge
Notice this block handles the else logic and the code that happens after the if/else (the decrement).
  6    >>    32 LOAD_GLOBAL              0 (print)   # Target of the "else" jump
             34 LOAD_CONST               4 ('odd')
             36 CALL_FUNCTION            1
             38 POP_TOP
  7    >>    40 LOAD_FAST                0 (x)       # Target of the "then" jump (merge point)
             42 LOAD_CONST               5 (1)
             44 INPLACE_SUBTRACT
             46 STORE_FAST               0 (x)
             48 JUMP_ABSOLUTE            2           # TERMINATOR: Jump back to BB0 (loop header)
BB4: Exit
Where POP_JUMP_IF_FALSE sends us when x > 0 is finally false.
       >>    50 POP_BLOCK
             52 LOAD_CONST               0 (None)
             54 RETURN_VALUE
Key Takeaways for Your Compiler
- Linearization: CPython's compiler builds the graph in memory (objects pointing to other objects), but the final output is a flat array of bytes. The "jumps" are just integer offsets (e.g., `POP_JUMP_IF_FALSE 30` means "goto index 30").
- Stack Depth Calculation: One of the main reasons CPython builds a CFG is to calculate the maximum stack depth required for the frame. It traverses the graph to see how many items are pushed/popped on every possible path.
- Basic Block structure:
  - Rust MIR: Explicit `bb0`, `bb1` labels.
  - CPython: Implicit blocks defined by jump targets. Any instruction that is the target of a jump starts a new Basic Block. Any instruction that jumps ends a Basic Block.
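You can see this "implicit block" rule in action with a small sketch (the helper name is hypothetical) that recovers basic-block leaders from any function via the stdlib `dis` module:

```python
import dis

def block_leaders(func):
    """Offsets that start a basic block: the entry point, every jump
    target, and the fall-through instruction after every jump."""
    instrs = list(dis.get_instructions(func))
    leaders = {instrs[0].offset}
    for ins, nxt in zip(instrs, instrs[1:] + [None]):
        if ins.opcode in dis.hasjrel or ins.opcode in dis.hasjabs:
            leaders.add(ins.argval)          # the jump target starts a block
            if nxt is not None:
                leaders.add(nxt.offset)      # so does the fall-through successor
    return sorted(leaders)

def f(x):
    if x > 0:
        return 1
    return 2
```

`block_leaders(f)` returns at least three offsets: the entry, the branch target, and the fall-through (the exact numbers vary by Python version).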
How this helps your "DHIR/MIR" design
You don't need to output fancy graphviz diagrams or text like Rust does. For your compiler, your "MIR" can simply be a list of BasicBlock objects:
struct BasicBlock {
instructions: Vec<Instruction>,
terminator: Terminator, // Jump, Branch, Return
}
struct MirFunction {
blocks: Vec<BasicBlock>, // Just a flat list!
}
When you generate code (VM bytecode), you just:
1. Iterate through the blocks.
2. Keep a map of BlockID -> BytecodeOffset.
3. Emit instructions.
4. When you hit a Terminator (Jump), emit a placeholder.
5. Once done, go back and patch the placeholders with the real BytecodeOffset you calculated.
This separates the complex logic of "how flow works" (MIR) from the boring logic of "counting bytes" (Codegen).
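Steps 1–5 fit in a few lines. A toy sketch (tuple "instructions" and a single `JUMP` terminator; your real encoding will differ):

```python
def codegen(blocks):
    """blocks: list of (instructions, terminator) pairs, where terminator
    is ("jump", target_block_id) or ("return",)."""
    bytecode = []          # flat output; one entry per instruction
    offsets = {}           # BlockID -> BytecodeOffset
    patches = []           # (position_of_placeholder, target_block_id)

    for block_id, (instrs, term) in enumerate(blocks):
        offsets[block_id] = len(bytecode)      # 2. record where this block starts
        bytecode.extend(instrs)                # 3. emit its instructions
        if term[0] == "jump":
            patches.append((len(bytecode), term[1]))
            bytecode.append(("JUMP", None))    # 4. placeholder operand
        else:
            bytecode.append(("RETURN",))

    for pos, target in patches:                # 5. every offset is now known
        bytecode[pos] = ("JUMP", offsets[target])
    return bytecode
```

For example, `codegen([([("NOP",)], ("jump", 1)), ([], ("return",))])` yields `[("NOP",), ("JUMP", 2), ("RETURN",)]`: the placeholder was patched to block 1's offset.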
User
Additional question: Can this be used for try-catch or that's handled elsewhere?
Model
Yes, absolutely. In fact, MIR (or any CFG-based IR) is the best place to model exception handling.
Handling try-catch in a tree-based structure (AST/HIR) is deceptively simple: it looks like just another block. But in terms of control flow, exceptions are effectively "goto-from-anywhere." Modeling this in a CFG is standard practice.
There are two main ways compilers handle this in the CFG. You should choose the one that fits your VM architecture.
1. The "Unwind Edge" Approach (Rust, LLVM, C++)
In this model, every Basic Block that contains a function call (or an instruction that might throw) has two exit paths: 1. Normal Control Flow: Where to go if the function returns successfully. 2. Unwind Control Flow: Where to go if the function throws an exception.
If you look back at the Rust MIR output I showed you earlier, notice the `unwind continue` annotations? A call terminator like `-> [return: bb1, unwind continue]` reads: "If this call returns normally, go to `bb1`. If it panics/throws, continue bubbling the exception up the stack."
If this code were inside a try block (or catch_unwind in Rust), the unwind target would point to a specific Cleanup Block (your catch handler) instead of just "continue."
How it looks in your MIR
You would modify your Terminator or Call instruction to hold two targets:
enum Terminator {
Goto(BlockId),
Return,
// A call that might crash/throw
Call {
func: String,
args: Vec<Operand>,
destination: BlockId, // Normal path
cleanup: Option<BlockId>, // "Catch" path (if inside a try block)
}
}
2. The Table-Based Approach (Java JVM, CPython, C# CLR)
This is likely the approach you want if you are building a VM.
Instead of drawing explicit arrows in the graph for every single instruction that might fail, you generate "ranges" of bytecode that are protected.
- MIR Phase: You designate specific Basic Blocks as "protected." You associate them with a "Handler Block."
- Codegen Phase: You generate an Exception Table.
The Exception Table usually looks like this:
| Start PC | End PC | Handler PC | Exception Type |
| :--- | :--- | :--- | :--- |
| 0x0010 | 0x0050 | 0x0090 | ValueError |
| 0x0010 | 0x0050 | 0x00A5 | IOError |
How the VM uses it:
When an exception is thrown at PC 0x0020:
1. The VM pauses execution.
2. It scans the Exception Table.
3. "Is 0x0020 inside 0x0010...0x0050?" -> Yes.
4. "Is the exception a ValueError?" -> Yes.
5. Set PC to 0x0090 (The Handler).
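That lookup is itself only a few lines of VM code. A sketch using the table rows above (exception types as plain strings for illustration; real VMs also match subtypes):

```python
# (start_pc, end_pc, handler_pc, exception_type), in source order
TABLE = [
    (0x0010, 0x0050, 0x0090, "ValueError"),
    (0x0010, 0x0050, 0x00A5, "IOError"),
]

def find_handler(table, pc, exc_type):
    """First entry whose [start, end) range covers `pc` and whose type
    matches wins; None means keep unwinding to the caller's frame."""
    for start, end, handler, ty in table:
        if start <= pc < end and ty == exc_type:
            return handler
    return None
```

`find_handler(TABLE, 0x0020, "ValueError")` returns `0x0090`; a PC outside every range returns `None`, i.e. the exception propagates.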
Example: Desugaring try-catch to MIR
Let's assume a simplified Python-like syntax:
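In runnable Python it would look like this (`danger` and `handle_error` are placeholder functions invented for the example):

```python
def danger():
    raise ValueError("boom")   # placeholder: a call that can throw

def handle_error():
    print("Handled")

def example():
    try:
        danger()
        print("Success")       # skipped when danger() throws
    except ValueError:
        handle_error()
    print("Done")              # the join point: runs either way
```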
Here is how this looks as a Control Flow Graph:
BB0 (Start of Try):
* This block is marked as "Protected" (conceptually).
* Instruction: Call danger().
* Note: If danger() throws, the VM/Runtime looks up the table and jumps to BB2.
* Instruction: Goto BB1.
BB1 (Rest of Try - "Success"):
* Instruction: Print "Success".
* Instruction: Goto BB3 (Skip the catch block).
BB2 (The Catch Block):
* Entry: This is where the VM jumps if an error happened in BB0.
* Instruction: Call handle_error().
* Instruction: Goto BB3.
BB3 (Join/Finally):
* Instruction: Print "Done".
* Instruction: Return.
Summary: Where does this logic live?
It lives in MIR, but the specific implementation details affect Codegen.
- In AST/HIR: `TryCatch` is just a node with a `body` and a `handler`.
- In MIR Construction (Lowering):
  - You create a new Basic Block for the `catch` body.
  - You emit the code for the `try` body.
  - Crucial Step: You register the `try` blocks in a list of "Landing Pads" or "Exception Scopes."
- In Codegen:
  - You compute the exact Bytecode Offsets (Start/End) for the blocks inside the `try`.
  - You write those offsets into the Exception Table at the end of your bytecode file.
Recommendation for your compiler: Use the Table-Based Approach. It keeps your MIR graph cleaner (you don't need double edges on every call) and it is the standard for virtually all bytecode interpreters (JVM, CLR, Python). The VM does the heavy lifting of checking the table when an exception actually occurs.