A comprehensive lexical analyzer (tokenizer) for Go that transforms source code into structured tokens. Built for creating compilers, interpreters, DSLs, configuration parsers, and code analysis tools.
A lexical analyzer breaks source code down into meaningful units called tokens. For example, `let x = 42 + y;` becomes the tokens `LET`, `IDENT(x)`, `ASSIGN(=)`, `NUMBER(42)`, `PLUS(+)`, `IDENT(y)`, `SEMICOLON(;)`. This is the foundation for building programming language tools.
- Rich Token Set: 50+ built-in token types covering modern programming constructs
- Multiple Number Formats: Decimal, hex, binary, octal, scientific notation with full validation
- Advanced String Processing: Regular strings, raw backtick strings, character literals with complete escape sequences
- JSON Configuration System: Extend the lexer with custom keywords, operators, and punctuation without code changes
- Robust Error Recovery: Continues processing after errors, collecting all issues with precise position tracking
- UTF-8 Unicode Support: Full support for international identifiers and multibyte characters
- High Performance: Single-pass tokenization with minimal memory allocations
- Comprehensive Testing: Validated with 1700+ tokens across complex real-world code patterns
```bash
go get github.com/codetesla51/golexer
```

Requirements: Go 1.21 or later.
```
golexer/
├── cmd/
│   └── main.go          # CLI demonstration tool
├── examples/
│   ├── config.json      # Example configuration for custom tokens
│   └── test.lang        # Comprehensive test file (400+ lines)
├── golexer/
│   ├── config.go        # Configuration loading and merging
│   ├── errors.go        # Error types and handling
│   ├── lexer.go         # Core lexical analyzer
│   ├── lexer_test.go    # Unit tests
│   └── token.go         # Token types and keywords
├── go.mod
├── LICENSE
└── README.md
```
```go
package main

import (
	"fmt"

	"github.com/codetesla51/golexer/golexer"
)

func main() {
	source := `let total = 42 + 3.14 * count;`
	lexer := golexer.NewLexer(source)

	for {
		token := lexer.NextToken()
		if token.Type == golexer.EOF {
			break
		}
		fmt.Printf("%-12s %-10s (Line %d, Col %d)\n",
			token.Type, token.Literal, token.Line, token.Column)
	}
}
```

Output:
```
LET          let        (Line 1, Col 1)
IDENT        total      (Line 1, Col 5)
=            =          (Line 1, Col 11)
NUMBER       42         (Line 1, Col 13)
+            +          (Line 1, Col 16)
NUMBER       3.14       (Line 1, Col 18)
*            *          (Line 1, Col 23)
IDENT        count      (Line 1, Col 25)
;            ;          (Line 1, Col 30)
```
For complete analysis in one call, `TokenizeAll` returns every token along with any lexical errors:

```go
lexer := golexer.NewLexer(source)
tokens, errors := lexer.TokenizeAll()

fmt.Printf("Generated %d tokens\n", len(tokens))
if len(errors) > 0 {
	fmt.Printf("Found %d errors:\n", len(errors))
	for _, err := range errors {
		fmt.Printf("  %s\n", err.Error())
	}
} else {
	fmt.Println("No lexical errors found")
}
```

Extend the lexer functionality using JSON configuration files. This allows you to add domain-specific keywords and operators without modifying the source code.
Create `examples/config.json`:

```json
{
  "additionalKeywords": {
    "unless": "UNLESS",
    "until": "UNTIL",
    "async": "ASYNC",
    "await": "AWAIT"
  },
  "additionalOperators": {
    "**": "POWER",
    "??": "NULL_COALESCE",
    "?.": "OPTIONAL_CHAIN"
  },
  "additionalPunctuation": {
    "@": "AT_SYMBOL",
    "#": "HASH",
    "$": "DOLLAR"
  }
}
```

The token names in `additionalOperators` above are illustrative - choose whatever names suit your grammar. Load the configuration when constructing the lexer:

```go
source := `
unless error {
    result = value ** 2 ?? fallback
    data = object ?. property
    user = @currentUser
}
`

// The configured lexer now recognizes the extended syntax:
lexer := golexer.NewLexerWithConfig(source, "examples/config.json")
```

If the config file is missing or invalid, the lexer shows a warning and continues with defaults:
```
Warning: failed to load config file 'config.json': no such file or directory
Continuing with default configuration...
```
GoLexer supports all modern number formats with comprehensive validation:
| Format | Examples | Description |
|---|---|---|
| Decimal | `42`, `0`, `1000` | Standard integers |
| Float | `3.14`, `0.5`, `42.0` | Decimal points |
| Scientific | `1e10`, `2.5e-3`, `1E+5` | Exponential notation |
| Hexadecimal | `0xFF`, `0x1a2b`, `0X1A2B` | Base-16 with `0x` prefix |
| Binary | `0b1010`, `0B1111` | Base-2 with `0b` prefix |
| Octal (modern) | `0o777`, `0O123` | Base-8 with `0o` prefix |
| Octal (legacy) | `0755`, `0123` | Traditional leading-zero format |
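As a quick check, here is a minimal sketch (using the `NewLexer` and `NextToken` API shown above) that feeds one literal of each format through the lexer; each should come back as a single `NUMBER` token:

```go
source := `42 3.14 2.5e-3 0xFF 0b1010 0o777 0755`
lexer := golexer.NewLexer(source)
for {
	tok := lexer.NextToken()
	if tok.Type == golexer.EOF {
		break
	}
	// Each literal should print as a NUMBER token.
	fmt.Printf("%-8s %s\n", tok.Type, tok.Literal)
}
```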
Complete escape sequence support:

```go
"Hello, World!"      // Simple string
"Line 1\nLine 2"     // Newline
"Quote: \"Hello\""   // Escaped quote
"Tab\tSeparated"     // Tab character
"Hex: \x41\x42"      // Hex escapes (AB)
```

Raw backtick strings apply no escape processing - the text, including backslashes, is kept literally:
```go
`Raw string with \n literal backslashes`
`File path: C:\Users\Name\file.txt`
`Multi
line
string`
```

Character literals:

```go
'a', 'Z', '0'        // Regular characters
'\n', '\t', '\r'     // Control characters
'\'', '\\'           // Escaped quote and backslash
'\x41'               // Hex escape for 'A'
```

| Escape | Result | Description |
|---|---|---|
| `\n` | newline | Line feed |
| `\t` | tab | Horizontal tab |
| `\r` | return | Carriage return |
| `\\` | backslash | Literal backslash |
| `\"` | quote | Double quote |
| `\'` | apostrophe | Single quote |
| `\0` | null | Null character |
| `\xNN` | hex char | Character by hex code |
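A short sketch contrasting the two string styles; assuming the escape handling described above, both literals should tokenize as single `STRING` tokens with no errors:

```go
// The input holds one regular string (escapes validated)
// and one raw string (backslashes kept literally).
source := "\"Line 1\\nLine 2\" `C:\\Users\\Name`"
lexer := golexer.NewLexer(source)
tokens, errs := lexer.TokenizeAll()
for _, tok := range tokens {
	fmt.Println(tok.Type, tok.Literal)
}
fmt.Println("errors:", len(errs)) // expected: 0
```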
Language keywords:

```
let const fn if else while for return break continue true false null
```

Type keywords:

```
int float string bool char
```
Identifier rules:

- Must start with a letter or underscore
- May contain letters, digits, and underscores
- Unicode supported: `café`, `résumé`, `变量`
- Examples: `variable1`, `_private`, `camelCase`, `snake_case`
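For example (a sketch based on the documented `IDENT` token type), a multibyte identifier tokenizes like any other:

```go
lexer := golexer.NewLexer("let café = 1;")
lexer.NextToken()                  // LET
tok := lexer.NextToken()           // the multibyte identifier
fmt.Println(tok.Type, tok.Literal) // IDENT café
```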
```
+  -  *  /  %      // Basic operations
+= -= *= /= %=     // Compound assignment
== !=              // Equality
<  <=  >  >=       // Relational
&& || !            // AND, OR, NOT
```

Note: Single `&` and `|` produce helpful error messages suggesting the compound forms.
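For instance (a sketch; the message text follows the error table later in this README):

```go
lexer := golexer.NewLexer("a & b")
_, errs := lexer.TokenizeAll()
for _, e := range errs {
	// Expected message, per the error table below:
	// unexpected character '&' - did you mean '&&'?
	fmt.Println(e.Error())
}
```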
Punctuation:

```
( ) { } [ ]    // Parentheses, braces, brackets
, ; : .        // Comma, semicolon, colon, dot
```
Comments:

```go
// Line comments - rest of line ignored
let x = 42; // End of line comment

/* Block comments - can span multiple lines */
let y = /* inline */ 10;
```

Constructors:

```go
// Create basic lexer
func NewLexer(input string) *Lexer

// Create lexer with JSON configuration
func NewLexerWithConfig(input, configFile string) *Lexer
```

Tokenization:

```go
// Get next token (streaming)
func (l *Lexer) NextToken() Token

// Get all tokens at once (batch)
func (l *Lexer) TokenizeAll() ([]Token, []*LexError)
```

Error inspection:

```go
// Check for errors
func (l *Lexer) HasErrors() bool

// Get error details
func (l *Lexer) GetErrors() []*LexError
```

Types:

```go
type Token struct {
	Type    TokenType // Token classification
	Literal string    // Original text
	Line    int       // Line number (1-indexed)
	Column  int       // Column position (1-indexed)
}

type LexError struct {
	Message string // Error description
	Line    int    // Error line
	Column  int    // Error column
}
```

The lexer provides comprehensive error detection while continuing to process input, finding all problems in a single pass.
| Input | Error Message | Explanation |
|---|---|---|
| `123abc` | invalid number: numbers cannot be followed by letters | Invalid number format |
| `0xGHI` | invalid hexadecimal number: contains non-hex characters | Bad hex digits |
| `0b123` | invalid binary number: contains non-binary characters | Invalid binary digits |
| `"hello` | unterminated string literal | Missing closing quote |
| `"test\q"` | unknown escape sequence '\q' | Invalid escape sequence |
| `&` | unexpected character '&' - did you mean '&&'? | Helpful suggestion |
Error recovery in practice:

```go
problemSource := `
let x = 123abc;   // Error: invalid number
let y = "valid";  // Continues processing
let z = 0xGHI;    // Error: invalid hex
let a = 42;       // Still processes correctly
`

lexer := golexer.NewLexer(problemSource)
tokens, errors := lexer.TokenizeAll()

// Result:
// - tokens contains all valid tokens, including "valid" and 42
// - errors contains detailed info about both problems
// - Processing never stops due to errors
```

Building a compiler front end:

```go
type Compiler struct {
	lexer *golexer.Lexer
}

func (c *Compiler) CompileFile(filename string) error {
	source, err := os.ReadFile(filename)
	if err != nil {
		return err
	}

	c.lexer = golexer.NewLexer(string(source))
	tokens, errors := c.lexer.TokenizeAll()
	if len(errors) > 0 {
		return fmt.Errorf("lexical errors: %v", errors)
	}

	// Pass tokens to parser
	return c.parse(tokens)
}
```

Parsing a custom configuration language:

```go
func ParseConfig(configFile string) (*AppConfig, error) {
	content, err := os.ReadFile(configFile)
	if err != nil {
		return nil, err
	}

	lexer := golexer.NewLexerWithConfig(string(content), "config-lang.json")
	config := &AppConfig{}

	for {
		tok := lexer.NextToken()
		if tok.Type == golexer.EOF {
			break
		}
		// Parse key: value pairs
		if tok.Type == golexer.IDENT {
			key := tok.Literal
			if colon := lexer.NextToken(); colon.Type == golexer.COLON {
				value := lexer.NextToken()
				config.Set(key, parseValue(value))
			}
		}
	}
	return config, nil
}
```

Analyzing token statistics:

```go
func AnalyzeCode(source string) {
	lexer := golexer.NewLexer(source)
	tokens, errors := lexer.TokenizeAll()

	// Generate statistics
	tokenCounts := make(map[golexer.TokenType]int)
	for _, token := range tokens {
		tokenCounts[token.Type]++
	}

	fmt.Printf("Analysis Results:\n")
	fmt.Printf("Total tokens: %d\n", len(tokens))
	fmt.Printf("Unique types: %d\n", len(tokenCounts))
	fmt.Printf("Errors found: %d\n", len(errors))

	// Show token distribution
	for tokenType, count := range tokenCounts {
		percentage := float64(count) / float64(len(tokens)) * 100
		fmt.Printf("  %-15s: %4d (%5.1f%%)\n", tokenType, count, percentage)
	}
}
```

Generating syntax-highlighting data:

```go
// HighlightToken pairs a lexeme with a highlight class
// (defined here so the example is self-contained).
type HighlightToken struct {
	Text, Type   string
	Line, Column int
}

func GenerateHighlighting(source string) []HighlightToken {
	lexer := golexer.NewLexer(source)
	var highlights []HighlightToken

	for {
		tok := lexer.NextToken()
		if tok.Type == golexer.EOF {
			break
		}
		highlights = append(highlights, HighlightToken{
			Text:   tok.Literal,
			Type:   mapToHighlightType(tok.Type),
			Line:   tok.Line,
			Column: tok.Column,
		})
	}
	return highlights
}

func mapToHighlightType(tokenType golexer.TokenType) string {
	switch tokenType {
	case golexer.LET, golexer.IF, golexer.WHILE:
		return "keyword"
	case golexer.STRING, golexer.CHAR:
		return "string"
	case golexer.NUMBER:
		return "number"
	default:
		return "default"
	}
}
```

Performance characteristics:

- Time Complexity: O(n) linear processing
- Memory Usage: O(1) streaming, O(n) batch processing
- UTF-8 Handling: Proper multibyte character support
- Benchmark: Tokenizes the 400+-line, 1700+-token test file in a single pass
Choosing a mode:

- Streaming (`NextToken`): large files, memory constraints, real-time processing
- Batch (`TokenizeAll`): complete analysis, small to medium files, when you need all tokens upfront
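As a rough sketch of the streaming mode (the file path is illustrative), here is a pass that counts tokens without holding them all in memory:

```go
data, err := os.ReadFile("examples/test.lang")
if err != nil {
	log.Fatal(err)
}

lexer := golexer.NewLexer(string(data))
count := 0
for tok := lexer.NextToken(); tok.Type != golexer.EOF; tok = lexer.NextToken() {
	count++ // no token slice is retained
}
fmt.Println("tokens:", count)
```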
```bash
# Clone and test
git clone https://github.com/codetesla51/golexer.git
cd golexer

# Unit tests
go test ./golexer -v

# Comprehensive integration test
go run cmd/main.go examples/test.lang
```

The test suite processes a comprehensive example file with:
- 1700+ tokens across 400+ lines
- All number formats and string types
- Complete operator and keyword coverage
- Complex nested structures
- Zero lexical errors
Expected output:

```
=== Summary ===
File: test.lang
Lines processed: 400+
Tokens generated: 1700+
Unique token types: 45+
Lexical errors: 0
Status: ✓ PASSED
```
The included CLI demonstrates all lexer capabilities:
```bash
# Analyze any file
go run cmd/main.go yourfile.txt

# Test with comprehensive example
go run cmd/main.go examples/test.lang
```

The CLI provides:
- Token-by-token output: Each token with position
- Batch statistics: Token counts and distribution
- Error reporting: Detailed error messages with locations
- Summary: Overall processing results
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Add tests for new functionality
- Ensure `go run cmd/main.go examples/test.lang` passes with 0 errors
- Submit a pull request

Guidelines:

- Follow Go conventions and use `gofmt`
- Add comprehensive comments for new features
- Include both positive and negative test cases
- Update documentation for significant changes
- Test with various input types and edge cases
MIT License - see LICENSE file for details.
You can use GoLexer in commercial projects, modify it, and distribute your changes. The only requirement is including the original license notice.
This project was inspired by the Monkey lexer from the excellent book "Writing An Interpreter In Go" by Thorsten Ball.
The foundational concepts of lexical analysis and token processing from that work served as the starting point for this more comprehensive, production-ready lexer, which adds extended number format support, configurable extensions, robust error recovery, and full Unicode handling.
Author: Uthman Dev
Repository: https://github.com/codetesla51/golexer