Usage

API Support

python-hyperscan currently exposes most of the C API, with the following caveats or exceptions:

  • Chimera is supported by instantiating hyperscan.Database(chimera=True); see the Chimera documentation for the feature matrix.
  • No stream compression support.
  • No custom allocator support.
  • hs_expression_info, hs_expression_ext_info, hs_populate_platform, and hs_serialized_database_info are not exposed yet.

!!! tip

Refer to the [Hyperscan documentation][4] to gain an understanding of
how the Hyperscan compiler C API works, including supported pattern
constructs and matching modes.

The packaged wheels vendor Vectorscan 5.4.12 (Linux/macOS) or Hyperscan 5.4.2 (Windows). When targeting a system-provided engine, ensure it is Hyperscan/Vectorscan 5.4 or newer.

Please create an issue to request prioritization of certain C API features, report inconsistencies between the C API and this Python wrapper, and of course, report any bugs.

Building a Database

The only required parameter to hyperscan.Database.compile is expressions, which should be a sequence of regular expressions. The remaining parameters, including ids, elements, and flags, are optional.

import hyperscan

db = hyperscan.Database()
patterns = (
    # expression,  id, flags
    (br'fo+',      0,  0),
    (br'^foobar$', 1,  hyperscan.HS_FLAG_CASELESS),
    (br'BAR',      2,  hyperscan.HS_FLAG_CASELESS
                       | hyperscan.HS_FLAG_SOM_LEFTMOST),
)
expressions, ids, flags = zip(*patterns)
db.compile(
    expressions=expressions, ids=ids, elements=len(patterns), flags=flags
)
print(db.info().decode())
# Version: 5.4.12 Features: AVX2 Mode: BLOCK
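Because ids and flags are paired with expressions by position, a malformed row in the pattern table can silently shift or truncate the unzipped sequences. The helper below is a minimal sketch in plain Python that checks the table's shape first; `build_compile_kwargs` is a hypothetical name, not part of python-hyperscan, and it assumes only the `expressions`, `ids`, `elements`, and `flags` keyword arguments shown above.

```python
def build_compile_kwargs(patterns):
    """Reshape (expression, id, flags) rows into keyword arguments.

    Hypothetical helper, not part of python-hyperscan. It only prepares
    the keyword arguments used by the db.compile() call shown above.
    """
    patterns = list(patterns)
    if not patterns:
        raise ValueError("at least one pattern is required")
    for row in patterns:
        # zip(*patterns) would silently truncate rows of unequal length.
        if len(row) != 3:
            raise ValueError(f"expected (expression, id, flags), got {row!r}")
    expressions, ids, flags = zip(*patterns)
    return {
        "expressions": expressions,
        "ids": ids,
        "elements": len(patterns),
        "flags": flags,
    }


# Plain integers stand in for the hyperscan.HS_FLAG_* constants here.
kwargs = build_compile_kwargs([(rb"fo+", 0, 0), (rb"^foobar$", 1, 0)])
```

With a Database instance at hand, the result could then be passed along as `db.compile(**kwargs)`.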

Match Event Handling

Match handler callbacks will be invoked with parameters mirroring the Hyperscan C API.

The C API's from argument (the start offset of the match) is exposed as from_ in Python, because from is a reserved keyword:

from typing import Any, Optional

# Type annotated Hyperscan match handler signature
def on_match(
    id: int,
    from_: int,
    to: int,
    flags: int,
    context: Optional[Any] = None
) -> Optional[bool]:
    ...

Refer to the Hyperscan documentation for match_event_handler for details about each parameter. Note that context in this case is any Python object passed to a scan method.

The return value determines whether Hyperscan should halt scanning. If the match handler returns a truthy value, scanning is halted, and any subsequent calls to Database.scan or Stream.scan will raise a hyperscan.error.
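As a sketch of a handler that uses this return value, the callable below records each match and asks for scanning to stop once a limit is reached. `MatchCollector` and its `limit` parameter are hypothetical names used for illustration; only the handler signature comes from the section above. The direct calls at the end stand in for the invocations a real scan would make.

```python
from typing import Any, List, Optional, Tuple


class MatchCollector:
    """Callable with the match handler signature (hypothetical helper)."""

    def __init__(self, limit: Optional[int] = None) -> None:
        self.limit = limit
        self.matches: List[Tuple[int, int, int]] = []

    def __call__(
        self,
        id: int,
        from_: int,
        to: int,
        flags: int,
        context: Optional[Any] = None,
    ) -> Optional[bool]:
        self.matches.append((id, from_, to))
        # A truthy return value requests that scanning be halted.
        if self.limit is not None and len(self.matches) >= self.limit:
            return True
        return None


collector = MatchCollector(limit=2)
# Simulated invocations; during a scan the library would make these calls.
first = collector(0, 0, 2, 0)
second = collector(0, 0, 3, 0)
```

Assuming any callable with this signature is accepted, an instance could be passed as `match_event_handler=collector` to a scan method, and `collector.matches` inspected afterwards.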

Pattern Scanning

python-hyperscan manages Hyperscan's scratch spaces behind the scenes, so performing the actual scanning is straightforward.

!!! note

Mirroring the behavior of the Hyperscan C API, both block and
stream mode ``scan`` methods do not require a
**match_event_handler** callback function to be provided. Not
passing a match callback will suppress match production entirely.

One possible use case for this behavior is error checking or
performing a dry run before performing a scan with a registered
match handler.

Block Mode

db.scan(b'foobar', match_event_handler=on_match)
# Or, to provide a context object:
db.scan(b'foobar', match_event_handler=on_match, context='foo')

Streaming Mode

First, ensure the Database object was created with streaming mode enabled.

db = hyperscan.Database(mode=hyperscan.HS_MODE_STREAM)

Next, call the Database.stream method, which returns a Stream context manager. Database.stream accepts a match_event_handler and a context object, which are used for all invocations of Stream.scan unless overridden.

with db.stream(match_event_handler=on_match, context=2345) as stream:
    stream.scan(b'foobar')
    # Override context only for one chunk
    stream.scan(b'barfoofoobarbarfoobar', context=1234)
    # Override match handler only for one chunk
    stream.scan(b'qux', match_event_handler=on_qux_match)

Vectored Mode

db = hyperscan.Database(mode=hyperscan.HS_MODE_VECTORED)
buffers = [
    bytearray(b'xxxfooxxx'),
    bytearray(b'xxfoxbarx'),
    bytearray(b'barxxxxxx'),
]
db.scan(buffers, match_event_handler=on_match)

Extended Parameters

Refer to the Hyperscan documentation for a list of parameter names and behaviours. python-hyperscan provides a helper named tuple, ExpressionExt, which is used to construct an hs_expr_ext_t structure. Only the fields that correspond to the given flag(s) need to be provided; all other parameters default to 0.

db.compile(
    expressions=[b'foobar'],
    flags=hyperscan.HS_FLAG_SOM_LEFTMOST,
    ext=[
        hyperscan.ExpressionExt(
            flags=hyperscan.HS_EXT_FLAG_MIN_OFFSET, min_offset=12
        )
    ],
)
# Matches only the second `foobar`, which ends at offset 12
db.scan(b'foobarfoobar', match_event_handler=on_match)
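To make the effect of `min_offset` concrete, the sketch below emulates it with the standard `re` module rather than with Hyperscan. It assumes, per the Hyperscan documentation, that `min_offset` is the minimum end offset at which a match may be reported. `matches_with_min_offset` is a hypothetical helper.

```python
import re
from typing import List, Tuple


def matches_with_min_offset(
    pattern: bytes, data: bytes, min_offset: int
) -> List[Tuple[int, int]]:
    """Return (start, end) spans whose end offset is at least min_offset.

    Emulation only: re reports non-overlapping matches, so results can
    differ from Hyperscan's for patterns that overlap themselves.
    """
    return [
        match.span()
        for match in re.finditer(pattern, data)
        if match.end() >= min_offset
    ]


spans = matches_with_min_offset(b"foobar", b"foobarfoobar", 12)
print(spans)  # [(6, 12)]
```

The first `foobar` ends at offset 6 and is filtered out; the second ends at offset 12 and is kept.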

Serialization

Refer to the Hyperscan documentation for more information on serialization, its use cases, and caveats. Usage is simple:

# Serializing (dumping to bytes)
serialized = hyperscan.dumpb(db)
with open('hs.db', 'wb') as f:
    f.write(serialized)

# Deserializing (loading from bytes):
db = hyperscan.loadb(serialized)

Chimera Mode

chimera_db = hyperscan.Database(chimera=True)
chimera_db.compile(expressions=[br'(foo)+', br'b(ar|az)'])
chimera_db.scan(b'foobaz', match_event_handler=on_match)

Chimera combines PCRE with Hyperscan's multi-pattern engine. When using it, reuse one Scratch object per thread to avoid the reallocations caused by its larger databases.