ilvcyy/metaparse

metaparse

This is a tool for instant parsing with full power in a native Python environment[1].

Moreover, you might be amazed that defining a Python class[2] suffices to define a language, including its

  • lexical
  • syntactic
  • semantic

definitions altogether, as well as the parser for it.

Parsing is then done by simply calling its parse method.

[1]. This module is motivated by instaparse in Clojure, but takes another direction, more like PLY.
[2]. Python 3 preferred.

Table of Contents

  1. Quick Example
  2. Design and Usage
  3. Generalized LALR and Dealing with Ambiguity
  4. API
  5. Limitations

Quick Example

In metaparse, language syntax and semantics can be simply defined with class methods. To illustrate this, we create a tiny calculator which can read basic arithmetic expressions and register variable bindings in a table (a.k.a. the context).

Firstly, we design the grammar on paper, as in textbooks,

assign → ID = expr
expr → NUM
expr → ID
expr → expr₁ + expr₂
expr → expr₁ * expr₂
expr → expr₁ ** expr₂

then notice its similarity to def signatures in Python:

def assign(ID, EQ, expr): ...
def expr(NUM): ...
def expr(ID): ...
def expr(expr_1, ADD, expr_2): ...
def expr(expr_1, MUL, expr_2): ...
def expr(expr_1, POW, expr_2): ...

and finally write down the semantics in SDT style (cf. Yacc).

from metaparse import LALR

# Global context/environment for language semantics.
context = {}

class pCalc(metaclass=LALR.meta):

    "A language for calculating expressions."

    # ===== Lexical patterns / Terminals =====
    # - Patterns specified with regular expressions
    # - Patterns will be tested in declaration order during tokenizing

    IGNORED = r'\s+'             # Special pattern to be ignored.

    EQ  = r'='
    POW = r'\*\*', 3             # Can specify precedence of token (for LALR conflict resolution)
    POW = r'\^'  , 3             # Alternative patterns can share the same name
    MUL = r'\*'  , 2
    ADD = r'\+'  , 1

    ID  = r'[_a-zA-Z]\w*'
    NUM = r'[1-9][0-9]*'
    def NUM(value):              # Can specify handler for lexical pattern!
        return int(value)

    # ===== Syntactic/Semantic rules in SDT-style =====

    def assign(ID, EQ, expr):        # May access global context.
        context[ID] = expr
        return expr

    def expr(NUM):                   # May compute result purely.
        return NUM                   # NUM is passed as (int) due to the handler!

    def expr(ID):
        return context[ID]

    def expr(expr_1, ADD, expr_2):   # With TeX-subscripts, meaning (expr → expr₁ + expr₂)
        return expr_1 + expr_2

    def expr(expr, MUL, expr_1):     # Can ignore one of the subscripts.
        return expr * expr_1

    def expr(expr, POW, expr_1):
        return expr ** expr_1

Then we get an LALR parser object:

>>> type(pCalc)
<class 'metaparse.LALR'>

Now we are done and it's quite easy to try it out.

>>> pCalc.interpret("x = 1 + 4 * 3 ** 2 + 5")
42
>>> pCalc.interpret("y = 5 + x * 2")
89
>>> pCalc.interpret("z = 9 ^ 2")
81

>>> context
{'y': 89, 'x': 42, 'z': 81}

Arguably, even state-of-the-art tools could hardly get handier than this.

Note that metaclass=LALR.meta only works in Python 3. There is an alternative form which also works in Python 2 and exposes the API of this module more explicitly.

Design and Usage

The design of this module targets "native parsing" (like instaparse and Parsec). Users might find metaparse remarkable due to its

  • native structure representing grammar rules
    • like def E(E, plus, T), def T(F) ...
    • rather than literal string notations like "E = E + T", "T = F" ...
  • language semantics in pure Python,
  • easy to play with (e.g. REPL),
  • no DSL feeling[3],
  • dump/load utilities,
  • no dependencies,
  • no helper/intermediate files generated,
  • optional precedence specification (for LALR),
  • nice error reporting,
  • and more.

[3]. may be untrue.

Though this slim module does not intend to replace more extensive tools like Bison and ANTLR, it is extremely handy and its integration in Python projects can be seamless.

The following contents are not necessary for just using this module hands-on, but give a better explanation of its core utilities.

Retrieving the Parse Tree

Continuing the first example, if merely the parse tree is needed rather than the semantic result, use method parse instead of interpret:

tr = pCalc.parse(" w  = 1 + 2 * 3 ** 4 + 5 ")

>>> tr
('assign',
 [('ID', 'w'),
  ('EQ', '='),
  ('expr',
   [('expr',
     [('expr', [('NUM', '1')]),
      ('ADD', '+'),
      ('expr',
       [('expr', [('NUM', '2')]),
        ('MUL', '*'),
        ('expr',
         [('expr', [('NUM', '3')]),
          ('POW', '**'),
          ('expr', [('NUM', '4')])])])]),
    ('ADD', '+'),
    ('expr', [('NUM', '5')])])])

The result is a ParseTree object with tuple representation. A parse leaf is just a Token object represented as (<token-name>, '<lexeme>').
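Since the tree is represented with plain nested tuples, it can be processed with ordinary Python code. Below is a minimal sketch (the `collect_leaves` helper is hypothetical, not part of metaparse) that flattens a tree back into its token sequence, using a shortened version of the tree shown above:

```python
# Walk a metaparse-style parse tree given as nested tuples:
# interior nodes are (symbol, [children...]), leaves are (token-name, lexeme).
def collect_leaves(node):
    symbol, rest = node
    if isinstance(rest, list):           # interior node: recurse into children
        leaves = []
        for child in rest:
            leaves.extend(collect_leaves(child))
        return leaves
    return [(symbol, rest)]              # leaf token

tree = ('assign',
        [('ID', 'w'),
         ('EQ', '='),
         ('expr',
          [('expr', [('NUM', '1')]),
           ('ADD', '+'),
           ('expr', [('NUM', '2')])])])

print(collect_leaves(tree))
# [('ID', 'w'), ('EQ', '='), ('NUM', '1'), ('ADD', '+'), ('NUM', '2')]
```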

Save with Persistence

It can be useful to save a created parser persistently for future use (since creating a new LALR parser instance may be time-consuming in some cases). With dumps/loads or dump/load, the parser instance (more precisely, the underlying automaton) is encoded into a Python structure form.

pCalc.dumps('./eg_demo_dump.py')

Since our parser was created with access to a global variable named context, we should provide the runtime environment of the using module, i.e. globals(), when loading the parser instance:

# Another file using the parser

from metaparse import LALR

# Let loaded parser be able to access current runtime env `globals()`.
qCalc = LALR.load('./eg_demo_dump.py', globals())

# Context instance to be accessed by the loaded parser
context = {}

qCalc.interpret('foo = 1 + 9')

print (context)
# {'foo': 10}

Since a runtime environment in Python is simply a dict, more tricks are possible by passing a user-defined runtime environment instance to the load method - it is treated as the context for executing the semantic __code__ objects (for more basic details see the documentation for exec and code objects).
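The underlying mechanism can be illustrated with plain Python, independent of metaparse: code can be executed against any dict serving as its global environment, so a function defined that way reads and writes the caller-supplied context. A minimal sketch (the `bump` function is a made-up example):

```python
# Illustration of executing code against a user-supplied globals dict,
# the same kind of mechanism a loaded parser's semantics rely on.
src = """
def bump(key):
    context[key] = context.get(key, 0) + 1
"""

env = {'context': {}}        # user-defined runtime environment
exec(src, env)               # defines `bump` with `env` as its globals

env['bump']('foo')
env['bump']('foo')
print(env['context'])
# {'foo': 2}
```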

Error Reporting

While designing a language, various errors may occur. metaparse provides nice error reporting - for example, executing the following

from metaparse import LALR

class pExpr(metaclass=LALR.meta):

    NUM = r'\d+'
    PLUS = r'\+'

    def expr(expr, PLUS, term):
        return expr + term

    def expr(expr, TIMES, term):
        return expr * term

    def expr(term):
        return term

    def term(NUM):
        return int(NUM)

    def factor(NUM):
        return int(NUM)

results in the error report:

metaparse.LanguageError: No lexical pattern provided for terminal symbol: TIMES
- in 2th rule (expr = expr TIMES term)
- with helping traceback (if available): 
  File "c:/Users/Shellay/Documents/GitHub/metaparse/tests/test_make_error.py", line 11, in expr
- declared lexes: Lexer{
[('NUM', re.compile('\\d+')),
 ('PLUS', re.compile('\\+')),
 ('IGNORED', re.compile('\\s+'))]}

After supplying the missing terminal symbol TIMES and re-running, another error appears:

metaparse.LanguageError: There are unreachable nonterminal at 5th rule: {'factor'}.
- with helping traceback: 
  File "c:/Users/Shellay/Documents/GitHub/metaparse/tests/test_make_error.py", line 21, in factor

The precise error information in Python traceback format can conveniently guide humans or editors to the exact erroneous place.

Generalized LALR and Dealing with Ambiguity

metaparse supplies an interesting extension: a GLR parser with lookahead, which can cope with many ambiguous grammars.

Consider the famous ambiguous dangling-else grammar. Trying to build it using LALR:

from metaparse import GLR, LALR

class pIfThenElse(metaclass=LALR.meta):

    IF     = r'if'
    THEN   = r'then'
    ELSE   = r'else'
    EXPR   = r'\d+'
    SINGLE = r'[_a-zA-Z]+'

    def stmt(ifstmt):
        return ifstmt 

    def stmt(SINGLE):
        return SINGLE 

    def ifstmt(IF, EXPR, THEN, stmt_1, ELSE, stmt_2):
        return ('ite', EXPR, stmt_1, stmt_2) 

    def ifstmt(IF, EXPR, THEN, stmt):
        return ('it', EXPR, stmt)

would result in a shift/reduce conflict on the token ELSE with error hints:

metaparse.Error: 
Handling item set: 
['(ifstmt = IF EXPR THEN stmt.ELSE stmt)', '(ifstmt = IF EXPR THEN stmt.)']
Conflict on lookahead: ELSE 
- ('reduce', (ifstmt = IF EXPR THEN stmt))
- ('shift', ['(ifstmt = IF EXPR THEN stmt ELSE.stmt)'])

Using GLR.meta instead of LALR.meta, and calling interpret_generalized instead:

>>> pIfThenElse.interpret_generalized('if 1 then if 2 then if 3 then a else b else c')
[('ite', '1', ('ite', '2', ('it', '3', 'a'), 'b'), 'c'),
 ('ite', '1', ('it', '2', ('ite', '3', 'a', 'b')), 'c'),
 ('it', '1', ('ite', '2', ('ite', '3', 'a', 'b'), 'c'))]

the parser delivers all ambiguous parse results properly. Note that on-the-fly interpretation during ambiguous parsing is dangerous, since parses rejected later may already have been interpreted and produced side effects (so it is advised to use side-effect-free semantics with GLR parsers!).
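The danger can be seen in a plain-Python sketch, independent of metaparse: when two speculative branches share a mutable context and one branch is rejected afterwards, its mutation has already happened (the `branch_a`/`branch_b` names are illustrative only):

```python
# Two speculative interpretations sharing one mutable context.
context = {}

def branch_a():
    context['x'] = 'from_a'   # side effect happens immediately when run
    return 'a-result'

def branch_b():
    return 'b-result'         # side-effect-free semantics

results = [branch_a(), branch_b()]   # both branches interpreted speculatively
chosen = results[1]                  # suppose branch_a is rejected afterwards
print(chosen, context)
# b-result {'x': 'from_a'}   <- the rejected branch already polluted context
```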

Using Precedence to Resolve Conflicts

Though GLR is quite powerful, we may not want to deal with ambiguity in practical cases and prefer LALR for its clarity and performance.

By giving ELSE a higher precedence than THEN (just as the calculator example treats operators), meaning ELSE prefers to combine with the nearest THEN to its left:

class pIfThenElse(metaclass=LALR.meta):
    ...
    THEN = r'then', 1
    ELSE = r'else', 2
    ...

we then get rid of ambiguity. The successful LALR parser delivers

>>> pIfThenElse.interpret('if 1 then if 2 then if 3 then a else b else c')
('it', '1', ('ite', '2', ('ite', '3', 'a', 'b'), 'c'))

However, beyond the examples here, precedence specification can become highly complex and involved in practical cases.

API

The following contents give more details about the underlying utilities.

Explicitly Registering Lexical Patterns and Syntactic Rules

The following version of declaring the language from the first example works in both Python 2 and Python 3. It is more verbose but more explicit, relying heavily on decorators.

from metaparse import LALR

pCalc = LALR()

lex  = pCalc.lexer
rule = pCalc.rule

# lex(<terminal-symbol> = <pattern>)
lex(IGNORED = r'\s+')
lex(NUM = r'[0-9]+')
lex(EQ  = r'=')
lex(ID  = r'[_a-zA-Z]\w*')

# lex(... , p = <precedence>)
lex(POW = r'\*\*', p=3)
lex(POW = r'\^')                # No need to give the precedence twice for POW.
lex(MUL = r'\*'  , p=2)
lex(ADD = r'\+'  , p=1)

# @rule
# def <lhs> ( <rhs> ):
#     <semantics>
@rule
def assign(ID, EQ, expr):
    context[ID] = expr
    return expr

@rule
def expr(ID):
    return context[ID]

@rule
def expr(NUM):
    return int(NUM)

@rule
def expr(expr_1, ADD, expr_2):
    return expr_1 + expr_2

@rule
def expr(expr, MUL, expr_1):
    return expr * expr_1

@rule
def expr(expr, POW, expr_1):
    return expr ** expr_1

# Complete making the parser after collecting things!
pCalc.make()

Explanation in short:

  • lex is the Lexer instance associated with pCalc, which also collects definitions of lexical patterns.

  • rule is a decorator which extracts the syntactic rule from the function signature and registers it, as well as registering the function as the semantics associated with that rule.
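The signature-to-rule extraction can be sketched with the standard inspect module. This is a simplified illustration of what such a decorator might do internally, not metaparse's actual implementation; the `rhs_symbols` helper is hypothetical:

```python
import inspect

def rhs_symbols(fn):
    """Extract (lhs, rhs) of a grammar rule from a semantic function,
    stripping TeX-style `_n` subscripts from repeated symbols."""
    params = list(inspect.signature(fn).parameters)
    rhs = [p.rsplit('_', 1)[0] if p.rsplit('_', 1)[-1].isdigit() else p
           for p in params]
    return fn.__name__, rhs

def expr(expr_1, ADD, expr_2):
    return expr_1 + expr_2

print(rhs_symbols(expr))
# ('expr', ['expr', 'ADD', 'expr'])
```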

The Underlying Lexical Analyzer

After declaring the language as above, a lexical analyzer is created as a utility for the parser. It maintains a list of the terminal symbols of the defined language, preserving their declaration order.

>>> pCalc.lexer
Lexer{
[('IGNORED', re.compile('\\s+')),
 ('EQ', re.compile('=')),
 ('NUM', re.compile('[1-9][0-9]*')),
 ('ID', re.compile('[_a-zA-Z]\\w*')),
 ('POW', re.compile('\\*\\*')),
 ('MUL', re.compile('\\*')),
 ('ADD', re.compile('\\+'))]}

It works by calling tokenize, generating tokens with informative attributes. During tokenizing, the patterns are tried for a match in list order.

Note there is a special lexical element IGNORED:

  • When the Lexer reads a string prefix matching the pattern associated with IGNORED, no token is generated for that prefix;

  • If IGNORED is not given explicitly in the user's language declaration, it is given the default pattern r'\s+'.

We can print a trace of the lexical analysis process:

>>> for token in pCalc.lexer.tokenize(" foo  = 1 + bar * 2"):
...     print(token.pos,
...           token.end,
...           token.symbol,
...           repr(token.lexeme),   # (lexeme) is something literal.
...           repr(token.value))    # (value) is something computed by handler, if exists.

1 4 ID 'foo' 'foo'
6 7 EQ '=' '='
8 9 NUM '1' 1
10 11 ADD '+' '+'
12 15 ID 'bar' 'bar'
16 17 MUL '*' '*'
18 19 NUM '2' 2

Moreover, it is fine to declare multiple lexical patterns with the same name, which may occasionally be useful:

class pCalc(metaclass=LALR.meta):
    ...
    IGNORED = r' '
    IGNORED = r'\t'
    IGNORED = r'#'
    ...
    POW     = r'\*\*'
    POW     = r'\^'
    ...

This avoids cramming alternative sub-patterns into one re pattern.

You generally need no access to the Lexer object at all to use the parser; this section is only informative.

Online-Parsing behind the Scene

The parse and interpret methods for outermost usage are based on a more subtle parsing coroutine, which is designed with online behavior, i.e.

<get-token> —→ <process-actions> —→ <wait-for-next-token>

The following block of code calls the routine directly, starts it, and traces the intermediate states:

# Prepare a parsing routine
p = pCalc.prepare()

# Start this routine
next(p)

# Send tokens one-by-one
for token in pCalc.lexer.tokenize('bar = 1 + 2 + + 3', with_end=True):
    print("Sends: ", token)
    r = p.send(token)
    print("Got:   ", r)
    print()

we get the following output, where each successful intermediate result is wrapped in Just and failures are reported by a ParseError containing tracing information (returned rather than thrown).

Sends:  ('ID', 'bar')
Got:    Just(result=('ID', 'bar'))

Sends:  ('EQ', '=')
Got:    Just(result=('EQ', '='))

Sends:  ('NUM', '1')
Got:    Just(result=('NUM', '1'))

Sends:  ('ADD', '+')
Got:    Just(result=('ADD', '+'))

Sends:  ('NUM', '2')
Got:    Just(result=('NUM', '2'))

Sends:  ('ADD', '+')
Got:    Just(result=('ADD', '+'))

Sends:  ('ADD', '+')
Got:    Unexpected token ('ADD', '+') at (14:15)
while expecting actions 
{'ID': ('shift', 5), 'NUM': ('shift', 6)}
with state stack 
[['(assign^ = .assign)'],
 ['(assign = ID.EQ expr)'],
 ['(assign = ID EQ.expr)'],
 ['(assign = ID EQ expr.)',
  '(expr = expr.ADD expr)',
  '(expr = expr.MUL expr)',
  '(expr = expr.POW expr)'],
 ['(expr = expr ADD.expr)']]
and subtree stack 
['bar', '=', 3, '+']


Sends:  ('NUM', '3')
Got:    Just(result=('NUM', '3'))

Sends:  ('\x03', None)
Got:    Just(result=6)

The structured and detailed error reporting should be useful for applications potentially built on this API.
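The send-based protocol with results returned rather than thrown can be sketched in plain Python. The `Just`/`Failure` names below mirror the reporting style shown above, but the routine itself is a hypothetical toy, not metaparse's parser:

```python
from collections import namedtuple

Just = namedtuple('Just', 'result')
Failure = namedtuple('Failure', 'message')

def adder():
    """A toy online routine: accepts NUM tokens and accumulates them;
    on any other token it yields a Failure object instead of raising."""
    total = 0
    while True:
        symbol, value = yield Just(total)
        if symbol == 'NUM':
            total += value
        else:
            yield Failure('unexpected token %r' % (symbol,))

p = adder()
next(p)                        # start the routine
print(p.send(('NUM', 1)))      # Just(result=1)
print(p.send(('NUM', 2)))      # Just(result=3)
print(p.send(('ADD', '+')))    # Failure(message="unexpected token 'ADD'")
```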

Limitations

Though this module provides advantageous features, there are also limitations:

  • Parsing grammars with loops is yet to be supported. For example, the grammar

    P → Q | a
    Q → P
    

    is infinitely ambiguous: it has an infinite number of derivations even when processing only finite input, e.g. "a":

    P ⇒ a
    P ⇒ Q ⇒ P ⇒ a
    ...
    P ⇒ Q ⇒ ... ⇒ P ⇒ a
    

    where each derivation corresponds to a parse tree. Eager generation of these trees leads to non-termination during parsing.

  • Only legal Python identifiers, rather than non-alphabetic symbols (like <fo#o>, ==, raise, etc.), can be used as grammar symbols (which seems not serious).

  • Algorithms in pure Python lower performance, but speed-ups are possible.

About

A powerful but handy tool for instant parsing/compiling with Python
