Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

README.md

Parser Subsystem

The parser transforms a flat token stream from the lexer into an abstract syntax tree (AST) representing a complete C translation unit. It implements a hand-written recursive descent parser for the C language, covering C99/C11 semantics plus a broad set of GCC and Clang extensions required for compatibility with real-world system headers and codebases.


Table of Contents

  1. Why Recursive Descent
  2. Module Organization
  3. The Parser Struct
  4. Expression Parsing and Precedence Climbing
  5. The Typedef/Identifier Ambiguity
  6. Declarator Parsing and the Inside-Out Rule
  7. Type Specifier Parsing
  8. Declaration Parsing
  9. Statement Parsing
  10. Inline Assembly (GCC Extended Syntax)
  11. AST Node Types
  12. Packed Bitfield Attributes
  13. Error Recovery Strategy
  14. Known Limitations

Why Recursive Descent

The parser is written entirely by hand rather than generated by a tool like yacc, bison, or LALR table generators. This is a deliberate choice driven by several properties of the C grammar:

  • Context sensitivity. C's grammar is not context-free. The meaning of an identifier depends on whether it has been previously declared as a typedef name. A hand-written parser can consult a live typedef table during parsing, which is awkward or impossible to express in a declarative grammar file.

  • Ambiguity resolution. Many C constructs are syntactically ambiguous. A ( token could begin a cast expression, a compound literal, a parenthesized expression, or a parenthesized declarator. The parser resolves these with speculative parsing (save position, try one interpretation, backtrack on failure) -- a technique that maps naturally to imperative Rust code.

  • GCC extension coverage. Real-world C code (particularly Linux kernel headers and glibc) uses many GCC extensions: __attribute__, statement expressions, computed gotos, inline assembly, typeof, __auto_type, case ranges, label addresses, and more. These extensions are easier to add incrementally to a hand-written parser than to graft onto a generated grammar.

  • Error reporting. A recursive descent parser knows exactly what construct it was attempting to parse when something goes wrong, enabling precise diagnostic messages ("expected ; after return statement") with fix-it hints and notes pointing back to matching delimiters.


Module Organization

The Parser struct is defined once in parse.rs, but its methods are spread across six files via separate impl Parser blocks. Each file is a private module that adds domain-specific parsing methods:

parser/
  mod.rs            -- Module declarations; re-exports Parser
  parse.rs          -- Parser struct, constructor, token helpers, entry point
  expressions.rs    -- Expression parsing (precedence climbing)
  types.rs          -- Type specifier collection and resolution
  declarations.rs   -- External and local declarations, initializers, function defs
  declarators.rs    -- C declarator syntax (pointers, arrays, function pointers)
  statements.rs     -- All statement types + inline assembly
  ast.rs            -- AST type definitions (pure data, no parsing logic)

All parsing methods are pub(super) so they can call each other across module boundaries within the parser crate, while remaining invisible outside it. The public API consists of Parser::new(), Parser::parse(), Parser::set_diagnostics(), Parser::take_diagnostics(), the error_count field, and the AST types exported from ast.rs.

This split keeps each file focused on one aspect of the grammar. The type specifier parser (types.rs) handles the combinatorial complexity of C's type keywords; the declarator parser (declarators.rs) handles the recursive inside-out syntax; the expression parser (expressions.rs) handles operator precedence; and so on. Despite spanning six files, there is a single Parser struct with a single token stream and a single parse position -- no separate sub-parsers or intermediate representations.


The Parser Struct

The core state of the parser is:

pub struct Parser {
    tokens: Vec<Token>,           // Full token stream from the lexer
    pos: usize,                   // Current read position

    typedefs: FxHashSet<String>,           // Known typedef names
    shadowed_typedefs: FxHashSet<String>,  // Typedefs shadowed by local vars

    attrs: ParsedDeclAttrs,       // Accumulated decl attributes (reset per decl)

    pragma_pack_stack: Vec<Option<usize>>,  // #pragma pack push/pop stack
    pragma_pack_align: Option<usize>,       // Current #pragma pack alignment
    pragma_visibility_stack: Vec<String>,   // #pragma GCC visibility push/pop
    pragma_default_visibility: Option<String>,

    error_count: usize,
    diagnostics: DiagnosticEngine,

    enum_constants: FxHashMap<String, i64>,          // For constant-expr eval
    unevaluable_enum_constants: FxHashSet<String>,   // Enum consts with unknown values
    struct_tag_alignments: FxHashMap<String, usize>, // For __alignof__ lookups
}

Key design points:

  • Token array + position index. The parser operates on a pre-lexed Vec<Token> rather than a streaming lexer. This enables O(1) lookahead at any depth and trivial backtracking by saving and restoring pos.

  • Typedef tracking. The typedefs set is pre-seeded with ~90 common standard library type names (size_t, uint32_t, FILE, va_list, etc.) and grows as typedef declarations are parsed. The shadowed_typedefs set tracks typedef names that have been redeclared as local variables in the current scope, which is necessary for correct disambiguation.

  • Attribute accumulator. ParsedDeclAttrs is a temporary buffer that collects storage-class specifiers, type qualifiers, and GCC attributes as they are encountered during type specifier parsing. It is consumed and reset at each declaration boundary. Its lifecycle is "set during parse_type_specifier, consumed by declaration builders in declarations.rs."

  • Pragma state. The parser maintains push/pop stacks for #pragma pack (struct field alignment) and #pragma GCC visibility (ELF symbol visibility). These are emitted by the preprocessor as synthetic tokens and consumed during parsing.

  • Semantic caches. enum_constants maps enum constant names to their integer values, enabling the parser to evaluate constant expressions in contexts like __attribute__((aligned(1 << ENUM_CONST))). unevaluable_enum_constants tracks enum constants whose values couldn't be computed at parse time (e.g., SIZE = sizeof(some_typedef)); these are recognized as constants by _Static_assert even though their values are unknown. struct_tag_alignments stores computed struct/union alignments for __alignof__ lookups on tag-only references.

The main entry point is Parser::parse(), which loops over parse_external_decl() until EOF, collecting ExternalDecl nodes into a TranslationUnit.


Expression Parsing and Precedence Climbing

The expression parser uses operator precedence climbing via a table-driven design. Rather than writing a separate function for each of C's ten binary precedence levels, a single parse_binary_expr method is parameterized by a PrecedenceLevel enum:

enum PrecedenceLevel {
    LogicalOr,       // ||
    LogicalAnd,      // &&
    BitwiseOr,       // |
    BitwiseXor,      // ^
    BitwiseAnd,      // &
    Equality,        // == !=
    Relational,      // < <= > >=
    Shift,           // << >>
    Additive,        // + -
    Multiplicative,  // * / %
}

The token_to_binop method on Parser maps a (&TokenKind, PrecedenceLevel) pair to an optional BinOp. The shared loop in parse_binary_expr repeatedly:

  1. Parses the next-tighter level via parse_next_tighter.
  2. Checks if the current token maps to an operator at this level.
  3. If so, consumes the operator, parses the right-hand operand (at the next-tighter level for left-associativity), and wraps both sides in a BinaryOp node.
  4. Repeats until no operator matches.

This eliminates what would otherwise be ten nearly identical recursive functions while preserving correct associativity and precedence.

The full call hierarchy from loosest to tightest binding is:

parse_expr                 -- comma operator (,)
  parse_assignment_expr    -- = += -= *= /= %= <<= >>= &= ^= |=
    parse_conditional_expr -- ? : (ternary, including GNU omitted-middle)
      parse_binary_expr(LogicalOr)
        ...10 levels...
          parse_binary_expr(Multiplicative)
            parse_cast_expr    -- (type)expr, compound literals
              parse_unary_expr -- prefix ++/-- +/- ~ ! & * sizeof _Alignof
                parse_postfix_expr -- postfix ++/-- [] . -> fn()
                  parse_primary_expr -- literals, identifiers, parens, builtins

Assignment is parsed as right-associative by having parse_assignment_expr recurse to itself for the right operand. The ternary operator similarly recurses for both the then-branch (via parse_expr, allowing comma expressions) and the else-branch (via parse_conditional_expr, disallowing commas).

Cast Expression Ambiguity

parse_cast_expr demonstrates speculative parsing. When it sees (, it:

  1. Saves the current position and attribute state.
  2. Advances past ( and checks is_type_specifier().
  3. If a type specifier is found, parses it through abstract declarator suffixes.
  4. If ) follows, checks for { (compound literal) or parses the operand as a cast.
  5. If any step fails, restores position and falls through to parse_unary_expr (treating the ( as a parenthesized expression).

This same speculative pattern is used by parse_sizeof_expr, where sizeof(type) and sizeof expr must be disambiguated.


The Typedef/Identifier Ambiguity

C's grammar is famously context-sensitive because an identifier can be either a type name (if previously declared as a typedef) or a variable/function name. This creates genuine ambiguity:

typedef int T;
void f(void) {
    T * x;    // declaration: pointer to T named x? or expression: T times x?
    T(y);     // declaration: T y? or function call: T(y)?
}

The parser resolves this with four mechanisms:

  1. Typedef set. self.typedefs tracks all names introduced by typedef declarations. is_type_specifier() returns true for identifiers in this set, causing the parser to enter the declaration path rather than the expression path.

  2. Typedef shadowing. When a local variable declaration uses a name that shadows a typedef (e.g., int size_t = 42;), the name is added to shadowed_typedefs. The is_type_specifier() check excludes shadowed names, so subsequent uses parse as identifiers. The shadowing set is saved/restored at compound statement boundaries to handle scope correctly.

  3. Label vs. declaration. A typedef name followed by : is a label, not a declaration (e.g., size_t: stmt;). The is_typedef_label() check prevents the parser from misinterpreting labeled statements as declarations.

  4. Builtin typedefs. The parser pre-seeds typedefs with approximately 90 common standard library type names (size_t, int32_t, FILE, va_list, pthread_t, GCC vector types, etc.). These built-in entries ensure that code using standard types parses correctly even when compiling without real system headers (e.g., cross-compiling without a sysroot).


Declarator Parsing and the Inside-Out Rule

C declarators are notoriously difficult to read because they follow an inside-out rule: you start at the declared name, then alternate between suffixes (arrays, function parameters) and prefixes (pointers), reading outward through parentheses. For example:

int (*fp)(int)        // fp is a pointer to a function(int) returning int
int (*arr)[3][6]      // arr is a pointer to an array[3] of array[6] of int
int *(*fps[10])(int)  // fps is array[10] of pointer to function(int) returning int*
int (**fpp)(int)      // fpp is a pointer to a pointer to a function(int) returning int

The declarator parser in declarators.rs handles this by decomposing the parse into three parts:

  1. Outer pointers. Leading * tokens (with optional cv-qualifiers) before the name or parenthesized group.

  2. Inner derived declarators. If a ( is found that starts a parenthesized declarator (not a parameter list), the parser recurses into parse_declarator() to parse the inner declarator, then expects ).

  3. Outer suffixes. After the name (or closing paren), the parser collects array dimensions [expr] and function parameter lists (params).

These three parts are then combined by combine_declarator_parts, which applies C's inside-out reading to produce a linear Vec<DerivedDeclarator>. The method handles several cases:

  • Function pointers: int (*fp)(int) produces [Pointer, FunctionPointer([int])].
  • Pointer-to-array: int (*p)[3] produces [Array(3), Pointer].
  • Array of function pointers: int (*fps[10])(int) produces [Pointer, FunctionPointer([int]), Array(10)].
  • Double pointers to functions: int (**fpp)(int) produces [Pointer, FunctionPointer([int]), Pointer].
  • Nested function pointers returning function pointers: handled via ParenAbstractDecl::NestedFnPtr for types like void(*(*)(void*))(void).

Disambiguating ( in Declarators

The is_paren_declarator() helper distinguishes a parenthesized declarator (like (*fp)) from a function parameter list (like (int x, int y)). It peeks at the token after (:

Token after ( Interpretation
*, ^, (, [ Declarator grouping
Non-typedef identifier Named declarator
__attribute__, __extension__ Declarator grouping
Type keyword (int, void, struct, ...) Parameter list
), ... Parameter list (empty or variadic)
Typedef name Parameter list (since (size_t) is a parameter)

GCC Attributes on Declarators

parse_declarator_with_attrs also collects GCC __attribute__ annotations that may appear before and after the declarator proper. These include mode(QI/HI/SI/DI/TI) for integer width overrides, common for tentative definitions, and aligned(N) for alignment. Pre-declarator and post-declarator alignment values are merged by taking the maximum, per C11 section 6.7.5.


Type Specifier Parsing

C allows type specifier keywords in any order: unsigned long int, long unsigned int, and int long unsigned are all the same type. The parser in types.rs handles this by collecting boolean flags into a TypeSpecFlags struct:

struct TypeSpecFlags {
    has_void: bool,
    has_bool: bool,
    has_float: bool,
    has_double: bool,
    has_complex: bool,
    has_char: bool,
    has_short: bool,
    has_int: bool,
    has_unsigned: bool,
    has_signed: bool,
    has_struct: bool,
    has_union: bool,
    has_enum: bool,
    has_typeof: bool,
    long_count: u32,         // tracks "long" vs "long long"
    typedef_name: Option<String>,
}

The parse_type_specifier() method loops, consuming tokens and setting flags, until it hits a token that is not a type specifier, qualifier, or storage class. Then resolve_type_flags() maps the flag combination to a concrete TypeSpecifier variant:

Flags Result
unsigned + long_count=2 UnsignedLongLong
signed + char Char
long + double LongDouble
complex + float ComplexFloat
complex + double ComplexDouble
just unsigned UnsignedInt
just long_count=1 Long
none of the above + has_int Int

Storage-class specifiers (static, extern, typedef, inline, _Thread_local) and type qualifiers (const, volatile, restrict) are accumulated into the ParsedDeclAttrs buffer during the same loop, rather than being part of the returned TypeSpecifier.

The method also handles:

  • struct/union/enum definitions and forward declarations, including packed structs and #pragma pack alignment propagation.
  • typeof(expr) and typeof(type) -- GCC extensions for computed types.
  • __auto_type -- GCC type inference from initializer.
  • _Atomic(type) -- C11 atomic type specifier (parsed but atomic qualifier not tracked).
  • _Alignas(N) and _Alignas(type) -- C11 alignment specifiers.
  • __attribute__((mode(...))) -- GCC integer width override that transforms the base type to a specific bit-width (QI=8, HI=16, SI=32, DI=64, TI=128) while preserving signedness.
  • __int128 and __uint128_t -- 128-bit integer types (rejected on 32-bit targets with a diagnostic).

Declaration Parsing

Declarations are parsed in declarations.rs. The top-level entry point parse_external_decl() handles:

  1. Top-level asm directives -- asm("...") outside any function, emitted verbatim as ExternalDecl::TopLevelAsm.

  2. _Static_assert -- the assertion expression is evaluated at parse time; if it evaluates to zero, a compile error is emitted with the assertion message. No meaningful AST node is produced (at file scope, an empty Declaration sentinel is returned to satisfy the ExternalDecl return type).

  3. Implicit int -- C89 compatibility: if no type specifier is found but the token sequence looks like identifier(, it is treated as a function with implicit int return type.

  4. Type specifier + declarator -- The method parses a type specifier, then a declarator (with GCC attributes), then inspects what follows:

    • If the declarator ends with Function(...) and the next token is { (or a type specifier for K&R style), it is a function definition.
    • Otherwise, it is a declaration (variable, typedef, function prototype).

Function Definitions

parse_function_def() handles both modern and K&R-style parameter declarations:

// Modern (prototype) style:
int add(int a, int b) { return a + b; }

// K&R (identifier list) style:
int add(a, b)
    int a;
    int b;
{ return a + b; }

For K&R functions, parse_kr_params() matches parameter names from the identifier list with type declarations that follow.

The method also shadows typedef names that collide with parameter names for the duration of the function body, restoring the shadowing set afterward. The return type is computed from the derived declarator chain by build_return_type, which applies post-Function and pre-Function derivations (handling cases like int (*func())[3] where the return type is a pointer to an array).

Variable Declarations

parse_declaration_rest() handles multi-declarator declarations (int x = 1, y = 2, *z;), initializer parsing (including nested designated initializers with C99 .field and [index] syntax plus GCC range designators [lo ... hi]), and the transfer of accumulated GCC attributes to per-declarator DeclAttributes.

A DeclContext struct groups per-declaration attribute state (alignment, _Alignas type, common flag, per-declarator attributes) to avoid threading many individual parameters between parse_external_decl and parse_declaration_rest.

When a typedef declaration is parsed, the declared names are added to the typedefs set, making them available for disambiguation in subsequent parsing.

Attribute Merging

GCC attributes can appear in multiple positions on a declaration:

__attribute__((aligned(16))) int __attribute__((weak)) x __attribute__((section(".data")));

The parser collects attributes from each position (type-level, declarator-level, post-declarator-level) and merges them. For alignment, the maximum value is taken per C11 section 6.7.5. For visibility, an explicit __attribute__((visibility(...))) takes priority over the current #pragma GCC visibility default.


Statement Parsing

The statement parser in statements.rs handles all C statement forms:

Statement AST variant
expr; (expression statement) Stmt::Expr
return expr; Stmt::Return
if (cond) then else Stmt::If
while (cond) body Stmt::While
do body while (cond); Stmt::DoWhile
for (init; cond; inc) body Stmt::For
{ items } Stmt::Compound
break; / continue; Stmt::Break / Stmt::Continue
switch (expr) body Stmt::Switch
case expr: / default: Stmt::Case / Stmt::Default
case lo ... hi: Stmt::CaseRange (GCC extension)
goto label; Stmt::Goto
goto *expr; Stmt::GotoIndirect (GCC computed goto)
label: stmt Stmt::Label
asm(...) Stmt::InlineAsm
declaration in statement position Stmt::Declaration (C23/GNU)

Compound Statements and Scope Management

parse_compound_stmt() implements scope management. At entry it saves the typedef shadowing set and the declaration attribute flags; at exit it restores both. This prevents storage-class specifiers from declarations inside the block from leaking into the enclosing context -- critical for GCC statement expressions used inside typeof():

typeof(({ extern void f(void); 42; })) x = 10;  // 'extern' must not leak

The method also handles GNU __label__ declarations at the start of blocks, which declare local labels scoped to the compound statement.

Within a compound statement, the parser distinguishes declarations from statements by calling is_type_specifier(). If the current token begins a type specifier (and is not a typedef-name used as a label), it takes the declaration path; otherwise it takes the statement path.

For Statements

for statements receive special handling because C99 allows a declaration in the initializer position:

for (int i = 0; i < n; i++) { ... }

The parser checks is_type_specifier() in the initializer to distinguish ForInit::Declaration from ForInit::Expr.


Inline Assembly (GCC Extended Syntax)

The parser supports GCC extended inline assembly with the full four-colon syntax:

asm [volatile] [goto] [inline] (
    "template string"           // template with %0, %1, %[name] placeholders
    : [output operands]         // first colon
    : [input operands]          // second colon
    : [clobber list]            // third colon
    : [goto labels]             // fourth colon (asm goto)
);

Each operand is parsed as:

[symbolic_name] "constraint" (c_expression)

The symbolic name (in brackets) is optional. Constraints are string literals like "=r", "+m", "i". The constraint string may be formed from concatenated string literals. Clobbers are bare string literals ("memory", "cc", register names). Goto labels are comma-separated identifiers.

All four operand sections are optional; parsing stops at each colon boundary if there are no more sections. Template strings support concatenation of adjacent string literals.

Top-level asm("...") directives (outside any function) are also supported and produce ExternalDecl::TopLevelAsm nodes for verbatim emission into the assembly output.


AST Node Types

The AST is defined as a collection of enums and structs in ast.rs. The root is TranslationUnit, which contains a Vec<ExternalDecl>.

Top-Level Structure

TranslationUnit
  Vec<ExternalDecl>
    FunctionDef(FunctionDef)     -- function definition with body
    Declaration(Declaration)     -- variable/typedef/prototype declaration
    TopLevelAsm(String)          -- asm("...") directive at file scope

FunctionDef

Represents a function definition: return type, name, parameter list, variadic flag, compound statement body, function attributes, K&R flag, and source span.

Declaration

Represents variable declarations, typedef declarations, and function prototypes. Contains a base TypeSpecifier, a list of InitDeclarators (each with a name, derived declarator chain, optional initializer, and per-declarator DeclAttributes), and packed boolean flags for storage class and qualifiers. Also carries alignment overrides, address space qualifiers (__seg_gs, __seg_fs), and vector size attributes.

TypeSpecifier

A large enum covering all C type forms:

  • Primitive types: Void, Char, Short, Int, Long, LongLong, Float, Double, LongDouble, Signed, Unsigned, Bool, Int128
  • Unsigned variants: UnsignedChar through UnsignedInt128
  • Complex types: ComplexFloat, ComplexDouble, ComplexLongDouble
  • Aggregate types: Struct(name, fields, packed, pragma_pack, aligned), Union(...), Enum(name, variants, packed)
  • Derived types: Pointer(inner, address_space), Array(inner, size), FunctionPointer(ret, params, variadic), BareFunction(ret, params, variadic)
  • Name-based: TypedefName(String)
  • Deferred: Typeof(Expr), TypeofType(TypeSpecifier), AutoType
  • Vector: Vector(inner, total_bytes) — wraps a base element type with __attribute__((vector_size(N))) in cast/compound-literal/sizeof contexts

Note: Signed and Unsigned are standalone variants that exist for completeness in type resolution but are not currently emitted by the parser (they are resolved into their concrete forms like Int or UnsignedInt during resolve_type_flags).

Expr

The expression enum has 43 variants covering:

  • Literals: IntLiteral, UIntLiteral, LongLiteral, ULongLiteral, LongLongLiteral, ULongLongLiteral, FloatLiteral, FloatLiteralF32, FloatLiteralLongDouble (with full 128-bit precision bytes), StringLiteral, WideStringLiteral, Char16StringLiteral, CharLiteral, ImaginaryLiteral, ImaginaryLiteralF32, ImaginaryLiteralLongDouble
  • Names: Identifier
  • Operators: BinaryOp, UnaryOp, PostfixOp, Assign, CompoundAssign
  • Access: ArraySubscript, MemberAccess, PointerMemberAccess, AddressOf, Deref
  • Control flow: Conditional (ternary), GnuConditional (omitted-middle cond ?: else), Comma
  • Type operations: Cast, Sizeof, Alignof, GnuAlignof (preferred alignment), AlignofExpr, GnuAlignofExpr, VaArg
  • Compound constructs: FunctionCall, CompoundLiteral, StmtExpr (GCC statement expression ({ ... }))
  • Generic selection: GenericSelection (C11 _Generic)
  • GCC extensions: LabelAddr (&&label for computed goto), BuiltinTypesCompatibleP

Every Expr variant carries a Span for source location tracking. The ExprId newtype wraps a stable pointer address for use as a cache key in downstream type-checking and constant-evaluation phases.

Stmt

Covers all C statement forms as described in the Statement Parsing section. The full set of variants:

  • Expr(Option<Expr>) -- expression statement (the most common statement type)
  • Return, If, While, DoWhile, For, Compound
  • Break, Continue, Switch, Case, Default
  • CaseRange (GCC extension), Goto, GotoIndirect (GCC computed goto)
  • Label, Declaration, InlineAsm

Each statement variant that represents a control-flow construct carries a Span for source location, accessible via Stmt::span().

DerivedDeclarator

Describes type modifications applied by a declarator:

enum DerivedDeclarator {
    Pointer,                                // *
    Array(Option<Box<Expr>>),               // [expr] or []
    Function(Vec<ParamDecl>, bool),         // (params) with variadic flag
    FunctionPointer(Vec<ParamDecl>, bool),  // (*)(params) -- pointer to function
}

The distinction between Function and FunctionPointer is essential: Function appears in function definitions and prototypes, while FunctionPointer appears in declarators like void (*fp)(int) where the (*...) syntax introduces a pointer-to-function.

Initializers

enum Initializer {
    Expr(Expr),                         // simple: int x = 42;
    List(Vec<InitializerItem>),         // braced: int a[] = {1, 2, 3};
}

struct InitializerItem {
    designators: Vec<Designator>,       // .field, [index], [lo...hi]
    init: Initializer,                  // recursively nested
}

enum Designator {
    Index(Expr),          // [expr]
    Range(Expr, Expr),    // [lo ... hi]  (GCC extension)
    Field(String),        // .fieldname
}

Packed Bitfield Attributes

Four AST and parser-internal types use packed bitfield integers to store boolean flags instead of individual bool fields:

Struct Flag type Booleans stored
FunctionAttributes u16 (13 bits) static, inline, extern, gnu_inline, always_inline, noinline, constructor, destructor, weak, used, fastcall, naked, noreturn
Declaration u16 (9 bits) static, extern, typedef, const, volatile, common, thread_local, transparent_union, inline
DeclAttributes u16 (8 bits) constructor, destructor, weak, error_attr, noreturn, used, fastcall, naked
ParsedDeclAttrs u32 (19 bits) typedef, static, extern, thread_local, inline, const, volatile, constructor, destructor, weak, used, gnu_inline, always_inline, noinline, noreturn, error_attr, transparent_union, fastcall, naked

Each flag is a named constant (power of two) in a companion module (e.g., func_attr_flag::STATIC = 1 << 0). Getters use flags & MASK != 0 and setters use flags |= MASK / flags &= !MASK.

This design achieves three goals:

  • Memory savings. 13 booleans in FunctionAttributes collapse from 13 bytes to 2 bytes. For ParsedDeclAttrs, 19 booleans collapse from 19 bytes to 4 bytes. Since AST nodes are allocated in large numbers, these savings are meaningful.

  • Bulk save/restore. ParsedDeclAttrs::save_flags() and restore_flags() capture and restore all 19 boolean flags as a single u32, used to prevent storage-class specifiers from leaking across compound statement scope boundaries.

  • API ergonomics. Callers use named getter/setter methods (is_static(), set_inline(true)) rather than raw bit manipulation. The Debug implementations expand flags into named booleans for readable output.

Non-boolean attributes (section name, visibility string, alias target, asm register, cleanup function name, vector size, alignment overrides) remain as Option<String> or Option<usize> fields alongside the packed flags.


Error Recovery Strategy

The parser employs several error recovery strategies to continue parsing after encountering invalid input:

  1. Emit and continue. Most expect* methods report an error but do not halt. If the expected token is missing, the parser records the error and continues from the current position, producing a potentially malformed but complete AST.

  2. Contextual error messages. Four expect variants produce increasingly specific diagnostics:

    • expect(token) -- generic: "expected ; before }"
    • expect_after(token, context) -- positional: "expected ; after return statement" with a fix-it hint
    • expect_context(token, context) -- contextual: "expected ( after if" with a fix-it hint
    • expect_closing(token, open_span) -- matching: "expected ) before ;" with a note "to match this ( at file.c:10:5" and a fix-it hint
  3. Fix-it hints. Error diagnostics include insertion suggestions (e.g., "insert ;") that an IDE or error display could use to suggest corrections.

  4. Speculative backtracking. For ambiguous constructs (casts vs. parenthesized expressions, type specifiers vs. identifiers), the parser saves its position, attempts one interpretation, and backtracks on failure. This avoids committing to a parse that might be wrong.

  5. Top-level skip. In parse(), if parse_external_decl() returns None for a non-EOF, non-semicolon token, the parser reports an error and advances past the unrecognized token to resynchronize.

  6. Error counting. error_count tracks the total number of parse errors, allowing the driver to decide whether to abort compilation after parsing.


Known Limitations

  • Parser does not invoke the preprocessor. The compiler has a fully integrated C preprocessor (see frontend/preprocessor/), but the driver orchestrates the pipeline: preprocessing runs first, then lexing, then parsing. From the parser's perspective, input tokens are already macro-expanded and include-resolved. The parser also pre-seeds a set of builtin typedef names (see below) so that standard types like size_t parse correctly even when compiling without real system headers.

  • No type checking during parsing. The parser produces an untyped AST. Type resolution, implicit conversions, and semantic checks happen in later lowering phases. This means some constructs that are syntactically valid but semantically illegal will parse successfully.

  • Typedef scope is approximated. Typedef names are tracked globally with a shadowing set for local scopes. This is correct for the vast majority of code but does not model C's full block-scope typedef rules with complete precision (e.g., a typedef introduced inside a for loop initializer may remain visible longer than the standard specifies).

  • _Atomic is parsed but not fully modeled. _Atomic(T) is treated as equivalent to T -- the atomic qualification is syntactically consumed but not preserved in the AST.

  • restrict is consumed but ignored. The restrict qualifier is recognized and skipped but does not appear in the AST, since it is an optimization hint that does not affect code generation in this compiler.

  • Constant expression evaluation is limited. The parser can evaluate simple constant expressions (for enum values and alignment attributes) using a lookup table of previously parsed enum constants, but cannot evaluate arbitrary compile-time expressions involving sizeof of composite types.

  • _Generic const tracking is partial. The parser tracks is_const flags on _Generic associations and parameter declarations to distinguish const-qualified pointer types, but CType itself does not carry qualifiers. This means const-sensitivity works for the common cases but not for complex expressions like casts or subscripts.

  • No C++ support. This is a C parser only. C++ constructs (classes, templates, namespaces, references, overloading) are not recognized.