The parser transforms a flat token stream from the lexer into an abstract syntax tree (AST) representing a complete C translation unit. It implements a hand-written recursive descent parser for the C language, covering C99/C11 semantics plus a broad set of GCC and Clang extensions required for compatibility with real-world system headers and codebases.
- Why Recursive Descent
- Module Organization
- The Parser Struct
- Expression Parsing and Precedence Climbing
- The Typedef/Identifier Ambiguity
- Declarator Parsing and the Inside-Out Rule
- Type Specifier Parsing
- Declaration Parsing
- Statement Parsing
- Inline Assembly (GCC Extended Syntax)
- AST Node Types
- Packed Bitfield Attributes
- Error Recovery Strategy
- Known Limitations
The parser is written entirely by hand rather than generated by a tool like yacc, bison, or LALR table generators. This is a deliberate choice driven by several properties of the C grammar:
-
Context sensitivity. C's grammar is not context-free. The meaning of an identifier depends on whether it has been previously declared as a
typedefname. A hand-written parser can consult a live typedef table during parsing, which is awkward or impossible to express in a declarative grammar file. -
Ambiguity resolution. Many C constructs are syntactically ambiguous. A
(token could begin a cast expression, a compound literal, a parenthesized expression, or a parenthesized declarator. The parser resolves these with speculative parsing (save position, try one interpretation, backtrack on failure) -- a technique that maps naturally to imperative Rust code. -
GCC extension coverage. Real-world C code (particularly Linux kernel headers and glibc) uses many GCC extensions:
__attribute__, statement expressions, computed gotos, inline assembly,typeof,__auto_type, case ranges, label addresses, and more. These extensions are easier to add incrementally to a hand-written parser than to graft onto a generated grammar. -
Error reporting. A recursive descent parser knows exactly what construct it was attempting to parse when something goes wrong, enabling precise diagnostic messages ("expected
;after return statement") with fix-it hints and notes pointing back to matching delimiters.
The Parser struct is defined once in parse.rs, but its methods are spread
across six files via separate impl Parser blocks. Each file is a private
module that adds domain-specific parsing methods:
parser/
mod.rs -- Module declarations; re-exports Parser
parse.rs -- Parser struct, constructor, token helpers, entry point
expressions.rs -- Expression parsing (precedence climbing)
types.rs -- Type specifier collection and resolution
declarations.rs -- External and local declarations, initializers, function defs
declarators.rs -- C declarator syntax (pointers, arrays, function pointers)
statements.rs -- All statement types + inline assembly
ast.rs -- AST type definitions (pure data, no parsing logic)
All parsing methods are pub(super) so they can call each other across module
boundaries within the parser crate, while remaining invisible outside it. The
public API consists of Parser::new(), Parser::parse(),
Parser::set_diagnostics(), Parser::take_diagnostics(), the error_count
field, and the AST types exported from ast.rs.
This split keeps each file focused on one aspect of the grammar. The type
specifier parser (types.rs) handles the combinatorial complexity of C's
type keywords; the declarator parser (declarators.rs) handles the recursive
inside-out syntax; the expression parser (expressions.rs) handles operator
precedence; and so on. Despite spanning six files, there is a single Parser
struct with a single token stream and a single parse position -- no separate
sub-parsers or intermediate representations.
The core state of the parser is:
pub struct Parser {
tokens: Vec<Token>, // Full token stream from the lexer
pos: usize, // Current read position
typedefs: FxHashSet<String>, // Known typedef names
shadowed_typedefs: FxHashSet<String>, // Typedefs shadowed by local vars
attrs: ParsedDeclAttrs, // Accumulated decl attributes (reset per decl)
pragma_pack_stack: Vec<Option<usize>>, // #pragma pack push/pop stack
pragma_pack_align: Option<usize>, // Current #pragma pack alignment
pragma_visibility_stack: Vec<String>, // #pragma GCC visibility push/pop
pragma_default_visibility: Option<String>,
error_count: usize,
diagnostics: DiagnosticEngine,
enum_constants: FxHashMap<String, i64>, // For constant-expr eval
unevaluable_enum_constants: FxHashSet<String>, // Enum consts with unknown values
struct_tag_alignments: FxHashMap<String, usize>, // For __alignof__ lookups
}Key design points:
-
Token array + position index. The parser operates on a pre-lexed
Vec<Token>rather than a streaming lexer. This enables O(1) lookahead at any depth and trivial backtracking by saving and restoringpos. -
Typedef tracking. The
typedefsset is pre-seeded with ~90 common standard library type names (size_t,uint32_t,FILE,va_list, etc.) and grows astypedefdeclarations are parsed. Theshadowed_typedefsset tracks typedef names that have been redeclared as local variables in the current scope, which is necessary for correct disambiguation. -
Attribute accumulator.
ParsedDeclAttrsis a temporary buffer that collects storage-class specifiers, type qualifiers, and GCC attributes as they are encountered during type specifier parsing. It is consumed and reset at each declaration boundary. Its lifecycle is "set duringparse_type_specifier, consumed by declaration builders indeclarations.rs." -
Pragma state. The parser maintains push/pop stacks for
#pragma pack(struct field alignment) and#pragma GCC visibility(ELF symbol visibility). These are emitted by the preprocessor as synthetic tokens and consumed during parsing. -
Semantic caches.
enum_constantsmaps enum constant names to their integer values, enabling the parser to evaluate constant expressions in contexts like__attribute__((aligned(1 << ENUM_CONST))).unevaluable_enum_constantstracks enum constants whose values couldn't be computed at parse time (e.g.,SIZE = sizeof(some_typedef)); these are recognized as constants by_Static_asserteven though their values are unknown.struct_tag_alignmentsstores computed struct/union alignments for__alignof__lookups on tag-only references.
The main entry point is Parser::parse(), which loops over
parse_external_decl() until EOF, collecting ExternalDecl nodes into a
TranslationUnit.
The expression parser uses operator precedence climbing via a table-driven
design. Rather than writing a separate function for each of C's ten binary
precedence levels, a single parse_binary_expr method is parameterized by a
PrecedenceLevel enum:
enum PrecedenceLevel {
LogicalOr, // ||
LogicalAnd, // &&
BitwiseOr, // |
BitwiseXor, // ^
BitwiseAnd, // &
Equality, // == !=
Relational, // < <= > >=
Shift, // << >>
Additive, // + -
Multiplicative, // * / %
}The token_to_binop method on Parser maps a (&TokenKind, PrecedenceLevel)
pair to an optional BinOp. The shared loop in parse_binary_expr repeatedly:
- Parses the next-tighter level via
parse_next_tighter. - Checks if the current token maps to an operator at this level.
- If so, consumes the operator, parses the right-hand operand (at the
next-tighter level for left-associativity), and wraps both sides in a
BinaryOpnode. - Repeats until no operator matches.
This eliminates what would otherwise be ten nearly identical recursive functions while preserving correct associativity and precedence.
The full call hierarchy from loosest to tightest binding is:
parse_expr -- comma operator (,)
parse_assignment_expr -- = += -= *= /= %= <<= >>= &= ^= |=
parse_conditional_expr -- ? : (ternary, including GNU omitted-middle)
parse_binary_expr(LogicalOr)
...10 levels...
parse_binary_expr(Multiplicative)
parse_cast_expr -- (type)expr, compound literals
parse_unary_expr -- prefix ++/-- +/- ~ ! & * sizeof _Alignof
parse_postfix_expr -- postfix ++/-- [] . -> fn()
parse_primary_expr -- literals, identifiers, parens, builtins
Assignment is parsed as right-associative by having parse_assignment_expr
recurse to itself for the right operand. The ternary operator similarly
recurses for both the then-branch (via parse_expr, allowing comma
expressions) and the else-branch (via parse_conditional_expr, disallowing
commas).
parse_cast_expr demonstrates speculative parsing. When it sees (, it:
- Saves the current position and attribute state.
- Advances past
(and checksis_type_specifier(). - If a type specifier is found, parses it through abstract declarator suffixes.
- If
)follows, checks for{(compound literal) or parses the operand as a cast. - If any step fails, restores position and falls through to
parse_unary_expr(treating the(as a parenthesized expression).
This same speculative pattern is used by parse_sizeof_expr, where
sizeof(type) and sizeof expr must be disambiguated.
C's grammar is famously context-sensitive because an identifier can be either a
type name (if previously declared as a typedef) or a variable/function name.
This creates genuine ambiguity:
typedef int T;
void f(void) {
T * x; // declaration: pointer to T named x? or expression: T times x?
T(y); // declaration: T y? or function call: T(y)?
}The parser resolves this with four mechanisms:
-
Typedef set.
self.typedefstracks all names introduced bytypedefdeclarations.is_type_specifier()returnstruefor identifiers in this set, causing the parser to enter the declaration path rather than the expression path. -
Typedef shadowing. When a local variable declaration uses a name that shadows a typedef (e.g.,
int size_t = 42;), the name is added toshadowed_typedefs. Theis_type_specifier()check excludes shadowed names, so subsequent uses parse as identifiers. The shadowing set is saved/restored at compound statement boundaries to handle scope correctly. -
Label vs. declaration. A typedef name followed by
:is a label, not a declaration (e.g.,size_t: stmt;). Theis_typedef_label()check prevents the parser from misinterpreting labeled statements as declarations. -
Builtin typedefs. The parser pre-seeds
typedefswith approximately 90 common standard library type names (size_t,int32_t,FILE,va_list,pthread_t, GCC vector types, etc.). These built-in entries ensure that code using standard types parses correctly even when compiling without real system headers (e.g., cross-compiling without a sysroot).
C declarators are notoriously difficult to read because they follow an inside-out rule: you start at the declared name, then alternate between suffixes (arrays, function parameters) and prefixes (pointers), reading outward through parentheses. For example:
int (*fp)(int) // fp is a pointer to a function(int) returning int
int (*arr)[3][6] // arr is a pointer to an array[3] of array[6] of int
int *(*fps[10])(int) // fps is array[10] of pointer to function(int) returning int*
int (**fpp)(int) // fpp is a pointer to a pointer to a function(int) returning intThe declarator parser in declarators.rs handles this by decomposing the
parse into three parts:
-
Outer pointers. Leading
*tokens (with optional cv-qualifiers) before the name or parenthesized group. -
Inner derived declarators. If a
(is found that starts a parenthesized declarator (not a parameter list), the parser recurses intoparse_declarator()to parse the inner declarator, then expects). -
Outer suffixes. After the name (or closing paren), the parser collects array dimensions
[expr]and function parameter lists(params).
These three parts are then combined by combine_declarator_parts, which
applies C's inside-out reading to produce a linear Vec<DerivedDeclarator>.
The method handles several cases:
- Function pointers:
int (*fp)(int)produces[Pointer, FunctionPointer([int])]. - Pointer-to-array:
int (*p)[3]produces[Array(3), Pointer]. - Array of function pointers:
int (*fps[10])(int)produces[Pointer, FunctionPointer([int]), Array(10)]. - Double pointers to functions:
int (**fpp)(int)produces[Pointer, FunctionPointer([int]), Pointer]. - Nested function pointers returning function pointers: handled via
ParenAbstractDecl::NestedFnPtrfor types likevoid(*(*)(void*))(void).
The is_paren_declarator() helper distinguishes a parenthesized declarator
(like (*fp)) from a function parameter list (like (int x, int y)). It
peeks at the token after (:
Token after ( |
Interpretation |
|---|---|
*, ^, (, [ |
Declarator grouping |
| Non-typedef identifier | Named declarator |
__attribute__, __extension__ |
Declarator grouping |
Type keyword (int, void, struct, ...) |
Parameter list |
), ... |
Parameter list (empty or variadic) |
| Typedef name | Parameter list (since (size_t) is a parameter) |
parse_declarator_with_attrs also collects GCC __attribute__ annotations
that may appear before and after the declarator proper. These include
mode(QI/HI/SI/DI/TI) for integer width overrides, common for tentative
definitions, and aligned(N) for alignment. Pre-declarator and
post-declarator alignment values are merged by taking the maximum, per C11
section 6.7.5.
C allows type specifier keywords in any order: unsigned long int,
long unsigned int, and int long unsigned are all the same type. The
parser in types.rs handles this by collecting boolean flags into a
TypeSpecFlags struct:
struct TypeSpecFlags {
has_void: bool,
has_bool: bool,
has_float: bool,
has_double: bool,
has_complex: bool,
has_char: bool,
has_short: bool,
has_int: bool,
has_unsigned: bool,
has_signed: bool,
has_struct: bool,
has_union: bool,
has_enum: bool,
has_typeof: bool,
long_count: u32, // tracks "long" vs "long long"
typedef_name: Option<String>,
}The parse_type_specifier() method loops, consuming tokens and setting
flags, until it hits a token that is not a type specifier, qualifier, or
storage class. Then resolve_type_flags() maps the flag combination to a
concrete TypeSpecifier variant:
| Flags | Result |
|---|---|
unsigned + long_count=2 |
UnsignedLongLong |
signed + char |
Char |
long + double |
LongDouble |
complex + float |
ComplexFloat |
complex + double |
ComplexDouble |
just unsigned |
UnsignedInt |
just long_count=1 |
Long |
none of the above + has_int |
Int |
Storage-class specifiers (static, extern, typedef, inline,
_Thread_local) and type qualifiers (const, volatile, restrict) are
accumulated into the ParsedDeclAttrs buffer during the same loop, rather
than being part of the returned TypeSpecifier.
The method also handles:
struct/union/enumdefinitions and forward declarations, including packed structs and#pragma packalignment propagation.typeof(expr)andtypeof(type)-- GCC extensions for computed types.__auto_type-- GCC type inference from initializer._Atomic(type)-- C11 atomic type specifier (parsed but atomic qualifier not tracked)._Alignas(N)and_Alignas(type)-- C11 alignment specifiers.__attribute__((mode(...)))-- GCC integer width override that transforms the base type to a specific bit-width (QI=8,HI=16,SI=32,DI=64,TI=128) while preserving signedness.__int128and__uint128_t-- 128-bit integer types (rejected on 32-bit targets with a diagnostic).
Declarations are parsed in declarations.rs. The top-level entry point
parse_external_decl() handles:
-
Top-level
asmdirectives --asm("...")outside any function, emitted verbatim asExternalDecl::TopLevelAsm. -
_Static_assert-- the assertion expression is evaluated at parse time; if it evaluates to zero, a compile error is emitted with the assertion message. No meaningful AST node is produced (at file scope, an emptyDeclarationsentinel is returned to satisfy theExternalDeclreturn type). -
Implicit
int-- C89 compatibility: if no type specifier is found but the token sequence looks likeidentifier(, it is treated as a function with implicitintreturn type. -
Type specifier + declarator -- The method parses a type specifier, then a declarator (with GCC attributes), then inspects what follows:
- If the declarator ends with
Function(...)and the next token is{(or a type specifier for K&R style), it is a function definition. - Otherwise, it is a declaration (variable, typedef, function prototype).
- If the declarator ends with
parse_function_def() handles both modern and K&R-style parameter
declarations:
// Modern (prototype) style:
int add(int a, int b) { return a + b; }
// K&R (identifier list) style:
int add(a, b)
int a;
int b;
{ return a + b; }For K&R functions, parse_kr_params() matches parameter names from the
identifier list with type declarations that follow.
The method also shadows typedef names that collide with parameter names for
the duration of the function body, restoring the shadowing set afterward.
The return type is computed from the derived declarator chain by
build_return_type, which applies post-Function and pre-Function derivations
(handling cases like int (*func())[3] where the return type is a pointer to
an array).
parse_declaration_rest() handles multi-declarator declarations
(int x = 1, y = 2, *z;), initializer parsing (including nested designated
initializers with C99 .field and [index] syntax plus GCC range
designators [lo ... hi]), and the transfer of accumulated GCC attributes
to per-declarator DeclAttributes.
A DeclContext struct groups per-declaration attribute state (alignment,
_Alignas type, common flag, per-declarator attributes) to avoid threading
many individual parameters between parse_external_decl and
parse_declaration_rest.
When a typedef declaration is parsed, the declared names are added to the
typedefs set, making them available for disambiguation in subsequent parsing.
GCC attributes can appear in multiple positions on a declaration:
__attribute__((aligned(16))) int __attribute__((weak)) x __attribute__((section(".data")));The parser collects attributes from each position (type-level, declarator-level,
post-declarator-level) and merges them. For alignment, the maximum value is
taken per C11 section 6.7.5. For visibility, an explicit
__attribute__((visibility(...))) takes priority over the current
#pragma GCC visibility default.
The statement parser in statements.rs handles all C statement forms:
| Statement | AST variant |
|---|---|
expr; (expression statement) |
Stmt::Expr |
return expr; |
Stmt::Return |
if (cond) then else |
Stmt::If |
while (cond) body |
Stmt::While |
do body while (cond); |
Stmt::DoWhile |
for (init; cond; inc) body |
Stmt::For |
{ items } |
Stmt::Compound |
break; / continue; |
Stmt::Break / Stmt::Continue |
switch (expr) body |
Stmt::Switch |
case expr: / default: |
Stmt::Case / Stmt::Default |
case lo ... hi: |
Stmt::CaseRange (GCC extension) |
goto label; |
Stmt::Goto |
goto *expr; |
Stmt::GotoIndirect (GCC computed goto) |
label: stmt |
Stmt::Label |
asm(...) |
Stmt::InlineAsm |
| declaration in statement position | Stmt::Declaration (C23/GNU) |
parse_compound_stmt() implements scope management. At entry it saves the
typedef shadowing set and the declaration attribute flags; at exit it restores
both. This prevents storage-class specifiers from declarations inside the
block from leaking into the enclosing context -- critical for GCC statement
expressions used inside typeof():
typeof(({ extern void f(void); 42; })) x = 10; // 'extern' must not leakThe method also handles GNU __label__ declarations at the start of blocks,
which declare local labels scoped to the compound statement.
Within a compound statement, the parser distinguishes declarations from
statements by calling is_type_specifier(). If the current token begins a
type specifier (and is not a typedef-name used as a label), it takes the
declaration path; otherwise it takes the statement path.
for statements receive special handling because C99 allows a declaration in
the initializer position:
for (int i = 0; i < n; i++) { ... }The parser checks is_type_specifier() in the initializer to distinguish
ForInit::Declaration from ForInit::Expr.
The parser supports GCC extended inline assembly with the full four-colon syntax:
asm [volatile] [goto] [inline] (
"template string" // template with %0, %1, %[name] placeholders
: [output operands] // first colon
: [input operands] // second colon
: [clobber list] // third colon
: [goto labels] // fourth colon (asm goto)
);Each operand is parsed as:
[symbolic_name] "constraint" (c_expression)
The symbolic name (in brackets) is optional. Constraints are string literals
like "=r", "+m", "i". The constraint string may be formed from
concatenated string literals. Clobbers are bare string literals ("memory",
"cc", register names). Goto labels are comma-separated identifiers.
All four operand sections are optional; parsing stops at each colon boundary if there are no more sections. Template strings support concatenation of adjacent string literals.
Top-level asm("...") directives (outside any function) are also supported
and produce ExternalDecl::TopLevelAsm nodes for verbatim emission into the
assembly output.
The AST is defined as a collection of enums and structs in ast.rs. The
root is TranslationUnit, which contains a Vec<ExternalDecl>.
TranslationUnit
Vec<ExternalDecl>
FunctionDef(FunctionDef) -- function definition with body
Declaration(Declaration) -- variable/typedef/prototype declaration
TopLevelAsm(String) -- asm("...") directive at file scope
Represents a function definition: return type, name, parameter list, variadic flag, compound statement body, function attributes, K&R flag, and source span.
Represents variable declarations, typedef declarations, and function
prototypes. Contains a base TypeSpecifier, a list of InitDeclarators (each
with a name, derived declarator chain, optional initializer, and per-declarator
DeclAttributes), and packed boolean flags for storage class and qualifiers.
Also carries alignment overrides, address space qualifiers (__seg_gs,
__seg_fs), and vector size attributes.
A large enum covering all C type forms:
- Primitive types:
Void,Char,Short,Int,Long,LongLong,Float,Double,LongDouble,Signed,Unsigned,Bool,Int128 - Unsigned variants:
UnsignedCharthroughUnsignedInt128 - Complex types:
ComplexFloat,ComplexDouble,ComplexLongDouble - Aggregate types:
Struct(name, fields, packed, pragma_pack, aligned),Union(...),Enum(name, variants, packed) - Derived types:
Pointer(inner, address_space),Array(inner, size),FunctionPointer(ret, params, variadic),BareFunction(ret, params, variadic) - Name-based:
TypedefName(String) - Deferred:
Typeof(Expr),TypeofType(TypeSpecifier),AutoType - Vector:
Vector(inner, total_bytes)— wraps a base element type with__attribute__((vector_size(N)))in cast/compound-literal/sizeof contexts
Note: Signed and Unsigned are standalone variants that exist for
completeness in type resolution but are not currently emitted by the parser
(they are resolved into their concrete forms like Int or UnsignedInt
during resolve_type_flags).
The expression enum has 43 variants covering:
- Literals:
IntLiteral,UIntLiteral,LongLiteral,ULongLiteral,LongLongLiteral,ULongLongLiteral,FloatLiteral,FloatLiteralF32,FloatLiteralLongDouble(with full 128-bit precision bytes),StringLiteral,WideStringLiteral,Char16StringLiteral,CharLiteral,ImaginaryLiteral,ImaginaryLiteralF32,ImaginaryLiteralLongDouble - Names:
Identifier - Operators:
BinaryOp,UnaryOp,PostfixOp,Assign,CompoundAssign - Access:
ArraySubscript,MemberAccess,PointerMemberAccess,AddressOf,Deref - Control flow:
Conditional(ternary),GnuConditional(omitted-middlecond ?: else),Comma - Type operations:
Cast,Sizeof,Alignof,GnuAlignof(preferred alignment),AlignofExpr,GnuAlignofExpr,VaArg - Compound constructs:
FunctionCall,CompoundLiteral,StmtExpr(GCC statement expression({ ... })) - Generic selection:
GenericSelection(C11_Generic) - GCC extensions:
LabelAddr(&&labelfor computed goto),BuiltinTypesCompatibleP
Every Expr variant carries a Span for source location tracking. The
ExprId newtype wraps a stable pointer address for use as a cache key in
downstream type-checking and constant-evaluation phases.
Covers all C statement forms as described in the Statement Parsing section. The full set of variants:
Expr(Option<Expr>)-- expression statement (the most common statement type)Return,If,While,DoWhile,For,CompoundBreak,Continue,Switch,Case,DefaultCaseRange(GCC extension),Goto,GotoIndirect(GCC computed goto)Label,Declaration,InlineAsm
Each statement variant that represents a control-flow construct carries a Span
for source location, accessible via Stmt::span().
Describes type modifications applied by a declarator:
enum DerivedDeclarator {
Pointer, // *
Array(Option<Box<Expr>>), // [expr] or []
Function(Vec<ParamDecl>, bool), // (params) with variadic flag
FunctionPointer(Vec<ParamDecl>, bool), // (*)(params) -- pointer to function
}The distinction between Function and FunctionPointer is essential:
Function appears in function definitions and prototypes, while
FunctionPointer appears in declarators like void (*fp)(int) where the
(*...) syntax introduces a pointer-to-function.
enum Initializer {
Expr(Expr), // simple: int x = 42;
List(Vec<InitializerItem>), // braced: int a[] = {1, 2, 3};
}
struct InitializerItem {
designators: Vec<Designator>, // .field, [index], [lo...hi]
init: Initializer, // recursively nested
}
enum Designator {
Index(Expr), // [expr]
Range(Expr, Expr), // [lo ... hi] (GCC extension)
Field(String), // .fieldname
}Four AST and parser-internal types use packed bitfield integers to store
boolean flags instead of individual bool fields:
| Struct | Flag type | Booleans stored |
|---|---|---|
FunctionAttributes |
u16 (13 bits) |
static, inline, extern, gnu_inline, always_inline, noinline, constructor, destructor, weak, used, fastcall, naked, noreturn |
Declaration |
u16 (9 bits) |
static, extern, typedef, const, volatile, common, thread_local, transparent_union, inline |
DeclAttributes |
u16 (8 bits) |
constructor, destructor, weak, error_attr, noreturn, used, fastcall, naked |
ParsedDeclAttrs |
u32 (19 bits) |
typedef, static, extern, thread_local, inline, const, volatile, constructor, destructor, weak, used, gnu_inline, always_inline, noinline, noreturn, error_attr, transparent_union, fastcall, naked |
Each flag is a named constant (power of two) in a companion module
(e.g., func_attr_flag::STATIC = 1 << 0). Getters use flags & MASK != 0
and setters use flags |= MASK / flags &= !MASK.
This design achieves three goals:
-
Memory savings. 13 booleans in
FunctionAttributescollapse from 13 bytes to 2 bytes. ForParsedDeclAttrs, 19 booleans collapse from 19 bytes to 4 bytes. Since AST nodes are allocated in large numbers, these savings are meaningful. -
Bulk save/restore.
ParsedDeclAttrs::save_flags()andrestore_flags()capture and restore all 19 boolean flags as a singleu32, used to prevent storage-class specifiers from leaking across compound statement scope boundaries. -
API ergonomics. Callers use named getter/setter methods (
is_static(),set_inline(true)) rather than raw bit manipulation. TheDebugimplementations expand flags into named booleans for readable output.
Non-boolean attributes (section name, visibility string, alias target, asm
register, cleanup function name, vector size, alignment overrides) remain as
Option<String> or Option<usize> fields alongside the packed flags.
The parser employs several error recovery strategies to continue parsing after encountering invalid input:
-
Emit and continue. Most
expect*methods report an error but do not halt. If the expected token is missing, the parser records the error and continues from the current position, producing a potentially malformed but complete AST. -
Contextual error messages. Four
expectvariants produce increasingly specific diagnostics:expect(token)-- generic: "expected;before}"expect_after(token, context)-- positional: "expected;after return statement" with a fix-it hintexpect_context(token, context)-- contextual: "expected(afterif" with a fix-it hintexpect_closing(token, open_span)-- matching: "expected)before;" with a note "to match this(at file.c:10:5" and a fix-it hint
-
Fix-it hints. Error diagnostics include insertion suggestions (e.g., "insert
;") that an IDE or error display could use to suggest corrections. -
Speculative backtracking. For ambiguous constructs (casts vs. parenthesized expressions, type specifiers vs. identifiers), the parser saves its position, attempts one interpretation, and backtracks on failure. This avoids committing to a parse that might be wrong.
-
Top-level skip. In
parse(), ifparse_external_decl()returnsNonefor a non-EOF, non-semicolon token, the parser reports an error and advances past the unrecognized token to resynchronize. -
Error counting.
error_counttracks the total number of parse errors, allowing the driver to decide whether to abort compilation after parsing.
-
Parser does not invoke the preprocessor. The compiler has a fully integrated C preprocessor (see
frontend/preprocessor/), but the driver orchestrates the pipeline: preprocessing runs first, then lexing, then parsing. From the parser's perspective, input tokens are already macro-expanded and include-resolved. The parser also pre-seeds a set of builtin typedef names (see below) so that standard types likesize_tparse correctly even when compiling without real system headers. -
No type checking during parsing. The parser produces an untyped AST. Type resolution, implicit conversions, and semantic checks happen in later lowering phases. This means some constructs that are syntactically valid but semantically illegal will parse successfully.
-
Typedef scope is approximated. Typedef names are tracked globally with a shadowing set for local scopes. This is correct for the vast majority of code but does not model C's full block-scope typedef rules with complete precision (e.g., a typedef introduced inside a
forloop initializer may remain visible longer than the standard specifies). -
_Atomicis parsed but not fully modeled._Atomic(T)is treated as equivalent toT-- the atomic qualification is syntactically consumed but not preserved in the AST. -
restrictis consumed but ignored. Therestrictqualifier is recognized and skipped but does not appear in the AST, since it is an optimization hint that does not affect code generation in this compiler. -
Constant expression evaluation is limited. The parser can evaluate simple constant expressions (for enum values and alignment attributes) using a lookup table of previously parsed enum constants, but cannot evaluate arbitrary compile-time expressions involving sizeof of composite types.
-
_Genericconst tracking is partial. The parser tracksis_constflags on_Genericassociations and parameter declarations to distinguish const-qualified pointer types, but CType itself does not carry qualifiers. This means const-sensitivity works for the common cases but not for complex expressions like casts or subscripts. -
No C++ support. This is a C parser only. C++ constructs (classes, templates, namespaces, references, overloading) are not recognized.