GoLexer

A comprehensive lexical analyzer (tokenizer) for Go that transforms source code into structured tokens. Built for creating compilers, interpreters, DSLs, configuration parsers, and code analysis tools.

What is GoLexer?

A lexical analyzer breaks down source code into meaningful units called tokens. For example, let x = 42 + y; becomes tokens: LET, IDENT(x), ASSIGN(=), NUMBER(42), PLUS(+), IDENT(y), SEMICOLON(;). This is the foundation for building programming language tools.

Features

  • Rich Token Set: 50+ built-in token types covering modern programming constructs
  • Multiple Number Formats: Decimal, hex, binary, octal, scientific notation with full validation
  • Advanced String Processing: Regular strings, raw backtick strings, character literals with complete escape sequences
  • JSON Configuration System: Extend the lexer with custom keywords, operators, and punctuation without code changes
  • Robust Error Recovery: Continues processing after errors, collecting all issues with precise position tracking
  • UTF-8 Unicode Support: Full support for international identifiers and multibyte characters
  • High Performance: Single-pass tokenization with minimal memory allocations
  • Comprehensive Testing: Validated with 1700+ tokens across complex real-world code patterns

Installation

go get github.com/codetesla51/golexer

Requirements: Go 1.21 or later

Project Structure

golexer/
├── cmd/
│   └── main.go          # CLI demonstration tool
├── examples/
│   ├── config.json      # Example configuration for custom tokens
│   └── test.lang        # Comprehensive test file (400+ lines)
├── golexer/
│   ├── config.go        # Configuration loading and merging
│   ├── errors.go        # Error types and handling
│   ├── lexer.go         # Core lexical analyzer
│   ├── lexer_test.go    # Unit tests
│   └── token.go         # Token types and keywords
├── go.mod
├── LICENSE
└── README.md

Quick Start

Basic Tokenization

package main

import (
    "fmt"
    "github.com/codetesla51/golexer/golexer"
)

func main() {
    source := `let total = 42 + 3.14 * count;`
    lexer := golexer.NewLexer(source)
    
    for {
        token := lexer.NextToken()
        if token.Type == golexer.EOF {
            break
        }
        fmt.Printf("%-12s %-10s (Line %d, Col %d)\n", 
            token.Type, token.Literal, token.Line, token.Column)
    }
}

Output:

LET          let        (Line 1, Col 1)
IDENT        total      (Line 1, Col 5)
=            =          (Line 1, Col 11)
NUMBER       42         (Line 1, Col 13)
+            +          (Line 1, Col 16)
NUMBER       3.14       (Line 1, Col 18)
*            *          (Line 1, Col 23)
IDENT        count      (Line 1, Col 25)
;            ;          (Line 1, Col 30)

Batch Processing with Error Handling

lexer := golexer.NewLexer(source)
tokens, errors := lexer.TokenizeAll()

fmt.Printf("Generated %d tokens\n", len(tokens))

if len(errors) > 0 {
    fmt.Printf("Found %d errors:\n", len(errors))
    for _, err := range errors {
        fmt.Printf("  %s\n", err.Error())
    }
} else {
    fmt.Println("No lexical errors found")
}

Configuration System

Extend the lexer functionality using JSON configuration files. This allows you to add domain-specific keywords and operators without modifying the source code.

Creating Configuration

Create examples/config.json (the token type names on the right-hand side are labels you choose):

{
  "additionalKeywords": {
    "unless": "UNLESS",
    "until": "UNTIL",
    "async": "ASYNC",
    "await": "AWAIT"
  },
  "additionalOperators": {
    //single char operators
  },
  "additionalPunctuation": {
    "@": "AT_SYMBOL",
    "#": "HASH",
    "$": "DOLLAR"
  }
}

Using Configuration

source := `
unless error {
    result = value ** 2 ?? fallback
    data = object ?. property
    user = @currentUser
}
`

// With the keywords, operators, and punctuation defined above,
// the lexer recognizes the extended syntax:
lexer := golexer.NewLexerWithConfig(source, "examples/config.json")

Graceful Error Handling

If the config file is missing or invalid, the lexer shows a warning and continues with defaults:

Warning: failed to load config file 'config.json': no such file or directory
Continuing with default configuration...
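
For example, pointing the lexer at a nonexistent config file still yields a working lexer. A minimal sketch:

lexer := golexer.NewLexerWithConfig(`let x = 1;`, "missing.json")
tokens, errs := lexer.TokenizeAll()
fmt.Printf("%d tokens, %d lexical errors\n", len(tokens), len(errs))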

Supported Tokens

Number Formats

GoLexer supports all modern number formats with comprehensive validation:

Format           Examples                  Description
Decimal          42, 0, 1000               Standard integers
Float            3.14, 0.5, 42.0           Decimal points
Scientific       1e10, 2.5e-3, 1E+5        Exponential notation
Hexadecimal      0xFF, 0x1a2b, 0X1A2B      Base-16 with 0x prefix
Binary           0b1010, 0B1111            Base-2 with 0b prefix
Octal (modern)   0o777, 0O123              Base-8 with 0o prefix
Octal (legacy)   0755, 0123                Traditional leading-zero format
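
A quick way to see these formats tokenize, using the streaming API from Quick Start:

lexer := golexer.NewLexer(`42 3.14 1e10 0xFF 0b1010 0o777 0755`)
for {
    tok := lexer.NextToken()
    if tok.Type == golexer.EOF {
        break
    }
    fmt.Printf("%-8s %s\n", tok.Literal, tok.Type)
}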

String and Character Literals

Regular Strings

Complete escape sequence support:

"Hello, World!"           // Simple string
"Line 1\nLine 2"         // Newline
"Quote: \"Hello\""       // Escaped quote
"Tab\tSeparated"         // Tab character
"Hex: \x41\x42"          // Hex escapes (AB)

Raw Strings

No escape processing - literal text including backslashes:

`Raw string with \n literal backslashes`
`File path: C:\Users\Name\file.txt`
`Multi
line
string`

Character Literals

'a', 'Z', '0'            // Regular characters
'\n', '\t', '\r'         // Control characters
'\'', '\\'               // Escaped quotes
'\x41'                   // Hex escape for 'A'

Escape Sequences

Escape   Result      Description
\n       newline     Line feed
\t       tab         Horizontal tab
\r       return      Carriage return
\\       backslash   Literal backslash
\"       quote       Double quote
\'       apostrophe  Single quote
\0       null        Null character
\xNN     hex char    Character by hex code
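
A short sketch showing string and character literals surfacing as tokens (STRING and CHAR are the token types referenced later in this README):

lexer := golexer.NewLexer(`"Hex: \x41\x42" '\n'`)
tokens, _ := lexer.TokenizeAll()
for _, tok := range tokens {
    fmt.Printf("%-8s %q\n", tok.Type, tok.Literal)
}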

Keywords and Identifiers

Built-in Keywords

let const fn if else while for return break continue true false null
int float string bool char

Valid Identifiers

  • Must start with letter or underscore
  • Can contain letters, digits, underscores
  • Unicode support: café, résumé, 变量
  • Examples: variable1, _private, camelCase, snake_case
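
Unicode identifiers in action, a minimal sketch:

lexer := golexer.NewLexer(`let café = 42;`)
for {
    tok := lexer.NextToken()
    if tok.Type == golexer.EOF {
        break
    }
    fmt.Printf("%-8s %s\n", tok.Type, tok.Literal)
}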

Operators

Arithmetic

+    -    *    /    %     // Basic operations
+=   -=   *=   /=   %=    // Compound assignment

Comparison

==   !=                   // Equality
<    <=   >    >=        // Relational

Logical

&&   ||   !              // AND, OR, NOT

Note: Single & and | produce helpful error messages suggesting the compound forms.
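
A sketch of how a stray & surfaces through the error API (the exact message wording appears in the error table further down):

lexer := golexer.NewLexer(`a & b`)
_, errs := lexer.TokenizeAll()
for _, err := range errs {
    fmt.Println(err.Error()) // unexpected character '&' - did you mean '&&'?
}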

Punctuation and Delimiters

Grouping

( ) { } [ ]              // Parentheses, braces, brackets

Separators

, ; : .                  // Comma, semicolon, colon, dot

Comments

// Line comments - rest of line ignored
let x = 42; // End of line comment

/* Block comments - can span multiple lines */
let y = /* inline */ 10;
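
Since comments are ignored rather than emitted, tokenizing the line above should yield only the surrounding tokens. A quick check, assuming that behavior:

lexer := golexer.NewLexer(`let y = /* inline */ 10;`)
tokens, _ := lexer.TokenizeAll()
for _, tok := range tokens {
    if tok.Type == golexer.EOF {
        break
    }
    fmt.Printf("%s ", tok.Literal)
}
// Expected output: let y = 10 ;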

API Reference

Core Functions

// Create basic lexer
func NewLexer(input string) *Lexer

// Create lexer with JSON configuration
func NewLexerWithConfig(input, configFile string) *Lexer

Tokenization Methods

// Get next token (streaming)
func (l *Lexer) NextToken() Token

// Get all tokens at once (batch)
func (l *Lexer) TokenizeAll() ([]Token, []*LexError)

Error Handling

// Check for errors
func (l *Lexer) HasErrors() bool

// Get error details
func (l *Lexer) GetErrors() []*LexError
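
HasErrors and GetErrors pair naturally with the streaming API. A minimal sketch:

lexer := golexer.NewLexer(source)
for tok := lexer.NextToken(); tok.Type != golexer.EOF; tok = lexer.NextToken() {
    // consume tokens here
}
if lexer.HasErrors() {
    for _, err := range lexer.GetErrors() {
        fmt.Printf("%d:%d %s\n", err.Line, err.Column, err.Message)
    }
}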

Data Structures

Token

type Token struct {
    Type    TokenType  // Token classification
    Literal string     // Original text
    Line    int        // Line number (1-indexed)
    Column  int        // Column position (1-indexed)
}

Error

type LexError struct {
    Message string     // Error description
    Line    int        // Error line
    Column  int        // Error column
}

Error Handling and Recovery

The lexer provides comprehensive error detection while continuing to process input, finding all problems in a single pass.

Common Error Types

Input      Error Message                                              Explanation
123abc     invalid number: numbers cannot be followed by letters      Invalid number format
0xGHI      invalid hexadecimal number: contains non-hex characters    Bad hex digits
0b123      invalid binary number: contains non-binary characters      Invalid binary digits
"hello     unterminated string literal                                Missing closing quote
"test\q"   unknown escape sequence '\q'                               Invalid escape sequence
&          unexpected character '&' - did you mean '&&'?              Helpful suggestion

Error Recovery Example

problemSource := `
let x = 123abc;      // Error: invalid number
let y = "valid";     // Continues processing
let z = 0xGHI;       // Error: invalid hex
let a = 42;          // Still processes correctly
`

lexer := golexer.NewLexer(problemSource)
tokens, errors := lexer.TokenizeAll()

// Result: 
// - tokens contains all valid tokens including "valid" and 42
// - errors contains detailed info about both problems
// - Processing never stops due to errors

Advanced Usage Examples

Building a Compiler Frontend

type Compiler struct {
    lexer *golexer.Lexer
}

func (c *Compiler) CompileFile(filename string) error {
    source, err := os.ReadFile(filename)
    if err != nil {
        return err
    }
    
    c.lexer = golexer.NewLexer(string(source))
    tokens, errors := c.lexer.TokenizeAll()
    
    if len(errors) > 0 {
        return fmt.Errorf("lexical errors: %v", errors)
    }
    
    // Pass tokens to parser
    return c.parse(tokens)
}

Configuration File Parser

func ParseConfig(configFile string) (*AppConfig, error) {
    content, err := os.ReadFile(configFile)
    if err != nil {
        return nil, err
    }
    
    lexer := golexer.NewLexerWithConfig(string(content), "config-lang.json")
    config := &AppConfig{}
    
    for {
        tok := lexer.NextToken()
        if tok.Type == golexer.EOF {
            break
        }
        
        // Parse key: value pairs
        if tok.Type == golexer.IDENT {
            key := tok.Literal
            if colon := lexer.NextToken(); colon.Type == golexer.COLON {
                value := lexer.NextToken()
                config.Set(key, parseValue(value))
            }
        }
    }
    
    return config, nil
}

Code Analysis Tool

func AnalyzeCode(source string) {
    lexer := golexer.NewLexer(source)
    tokens, errors := lexer.TokenizeAll()
    
    // Generate statistics
    tokenCounts := make(map[golexer.TokenType]int)
    for _, token := range tokens {
        tokenCounts[token.Type]++
    }
    
    fmt.Printf("Analysis Results:\n")
    fmt.Printf("Total tokens: %d\n", len(tokens))
    fmt.Printf("Unique types: %d\n", len(tokenCounts))
    fmt.Printf("Errors found: %d\n", len(errors))
    
    // Show token distribution
    for tokenType, count := range tokenCounts {
        percentage := float64(count) / float64(len(tokens)) * 100
        fmt.Printf("  %-15s: %4d (%5.1f%%)\n", tokenType, count, percentage)
    }
}

Syntax Highlighter

func GenerateHighlighting(source string) []HighlightToken {
    lexer := golexer.NewLexer(source)
    var highlights []HighlightToken
    
    for {
        tok := lexer.NextToken()
        if tok.Type == golexer.EOF {
            break
        }
        
        highlights = append(highlights, HighlightToken{
            Text:   tok.Literal,
            Type:   mapToHighlightType(tok.Type),
            Line:   tok.Line,
            Column: tok.Column,
        })
    }
    
    return highlights
}

func mapToHighlightType(tokenType golexer.TokenType) string {
    switch tokenType {
    case golexer.LET, golexer.IF, golexer.WHILE:
        return "keyword"
    case golexer.STRING, golexer.CHAR:
        return "string"
    case golexer.NUMBER:
        return "number"
    default:
        return "default"
    }
}

Performance

  • Time Complexity: O(n) linear processing
  • Memory Usage: O(1) streaming, O(n) batch processing
  • UTF-8 Handling: Proper multibyte character support
  • Benchmark: tokenizes the 400+ line test file (1700+ tokens) in a single pass

When to Use Each Method

  • Streaming (NextToken): large files, memory constraints, real-time processing
  • Batch (TokenizeAll): complete analysis, small to medium files, when you need all tokens upfront
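
For the streaming case, a sketch that counts tokens in a large file without holding them in memory (the filename is hypothetical):

source, err := os.ReadFile("large_input.lang") // hypothetical input file
if err != nil {
    log.Fatal(err)
}

lexer := golexer.NewLexer(string(source))
count := 0
for tok := lexer.NextToken(); tok.Type != golexer.EOF; tok = lexer.NextToken() {
    count++ // each token is inspected and discarded, keeping memory flat
}
fmt.Printf("Processed %d tokens\n", count)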

Testing and Validation

Running Tests

# Clone and test
git clone https://github.com/codetesla51/golexer.git
cd golexer

# Unit tests
go test ./golexer -v

# Comprehensive integration test
go run cmd/main.go examples/test.lang

Expected Results

The test suite processes a comprehensive example file with:

  • 1700+ tokens across 400+ lines
  • All number formats and string types
  • Complete operator and keyword coverage
  • Complex nested structures
  • Zero lexical errors

Expected output:

=== Summary ===
File: test.lang
Lines processed: 400+
Tokens generated: 1700+
Unique token types: 45+
Lexical errors: 0
Status: ✓ PASSED

Command Line Interface

The included CLI demonstrates all lexer capabilities:

# Analyze any file
go run cmd/main.go yourfile.txt

# Test with comprehensive example  
go run cmd/main.go examples/test.lang

The CLI provides:

  1. Token-by-token output: Each token with position
  2. Batch statistics: Token counts and distribution
  3. Error reporting: Detailed error messages with locations
  4. Summary: Overall processing results

Contributing

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/amazing-feature
  3. Add tests for new functionality
  4. Ensure go run cmd/main.go examples/test.lang passes with 0 errors
  5. Submit pull request

Development Guidelines

  • Follow Go conventions and use gofmt
  • Add comprehensive comments for new features
  • Include both positive and negative test cases
  • Update documentation for significant changes
  • Test with various input types and edge cases

License

MIT License - see LICENSE file for details.

You can use GoLexer in commercial projects, modify it, and distribute your changes. The only requirement is including the original license notice.

Acknowledgments

This project was inspired by the Monkey lexer from the excellent book "Writing An Interpreter In Go" by Thorsten Ball.

The core concepts of lexical analysis and token processing from that work served as the starting point for this more comprehensive, production-ready lexer with extended number format support, configurable extensions, robust error recovery, and full Unicode handling.


Author: Uthman Dev
Repository: https://github.com/codetesla51/golexer
