python-hyperscan currently exposes most of the C API, with the
following caveats or exceptions:
- Chimera is supported by instantiating
  `hyperscan.Database(chimera=True)`; see the Chimera documentation for the feature matrix.
- No stream compression support.
- No custom allocator support.
- `hs_expression_info`, `hs_expression_ext_info`, `hs_populate_platform`, and `hs_serialized_database_info` are not exposed yet.

!!! tip
    Refer to the [Hyperscan documentation][4] to gain an understanding of
    how the Hyperscan compiler C API works, including supported pattern
    constructs and matching modes.

The packaged wheels vendor Vectorscan 5.4.12 (Linux/macOS) or Hyperscan
5.4.2 (Windows). When targeting a system-provided engine, ensure it is
Hyperscan/Vectorscan 5.4 or newer.
Please create an issue to request prioritization of certain C API features, report inconsistencies between the C API and this Python wrapper, and of course, report any bugs.

The only required parameter to `hyperscan.Database.compile` is
`expressions`, which should be a sequence of regular expressions. The
rest of the parameters, including `ids`, `elements`, and `flags`,
are optional.

```python
import hyperscan

db = hyperscan.Database()
patterns = (
    # expression, id, flags
    (br'fo+', 0, 0),
    (br'^foobar$', 1, hyperscan.HS_FLAG_CASELESS),
    (br'BAR', 2, hyperscan.HS_FLAG_CASELESS
        | hyperscan.HS_FLAG_SOM_LEFTMOST),
)
expressions, ids, flags = zip(*patterns)
db.compile(
    expressions=expressions, ids=ids, elements=len(patterns), flags=flags
)
print(db.info().decode())
# Version: 5.4.12 Features: AVX2 Mode: BLOCK
```

Match handler callbacks will be invoked with parameters mirroring the Hyperscan C API.
The match offset argument is exposed as `from_` in Python to avoid the
reserved keyword `from`:

```python
from typing import Any, Optional


# Type annotated Hyperscan match handler signature
def on_match(
    id: int,
    from_: int,
    to: int,
    flags: int,
    context: Optional[Any] = None
) -> Optional[bool]:
    ...
```

Refer to the Hyperscan documentation for `match_event_handler`
for details about each parameter. Note that `context` in this case is
any Python object passed to a scan method.

The return value determines whether Hyperscan should halt
scanning. If the match handler returns a truthy value (that is, anything
other than `None` or another falsy value), scanning is halted and any
subsequent calls to `Database.scan` or `Stream.scan` will raise a `hyperscan.error`.
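
The halting rule and the `context` argument can be combined. Below is a minimal sketch of a handler that records each match in its `context` object and returns a truthy value once a limit is reached. The factory name `make_limited_handler` and its `limit` parameter are hypothetical, introduced only for illustration. The handler is called directly here, so the snippet runs without a compiled database.

```python
from typing import Any, Callable, List, Optional, Tuple

Match = Tuple[int, int, int]


def make_limited_handler(limit: int) -> Callable[..., Optional[bool]]:
    """Build a match handler that asks for a halt after `limit` matches.

    Hypothetical helper, not part of python-hyperscan.
    """

    def on_match(
        id: int,
        from_: int,
        to: int,
        flags: int,
        context: Optional[Any] = None,
    ) -> Optional[bool]:
        # `context` is assumed to be a list supplied by the caller.
        if context is not None:
            context.append((id, from_, to))
            if len(context) >= limit:
                return True  # truthy: request that scanning halt
        return None  # None: keep scanning

    return on_match


# Simulate two handler invocations without a compiled database.
found: List[Match] = []
handler = make_limited_handler(limit=2)
results = [
    handler(0, 0, 2, 0, found),
    handler(0, 0, 3, 0, found),
]
```

With a compiled database, the same handler and list would be passed as `db.scan(data, match_event_handler=handler, context=found)`.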

python-hyperscan manages Hyperscan's scratch spaces behind the
scenes, so performing the actual scanning is straightforward.

!!! note
    Mirroring the behavior of the Hyperscan C API, both block and
    stream mode ``scan`` methods do not require a
    **match_event_handler** callback function to be provided. Not
    passing a match callback will suppress match production entirely.
    One possible use case for this behavior is error checking or
    performing a dry run before performing a scan with a registered
    match handler.


```python
db.scan(b'foobar', match_event_handler=on_match)

# Or, to provide a context object:
db.scan(b'foobar', match_event_handler=on_match, context='foo')
```

First, ensure the `Database` object was created with streaming mode
enabled.

```python
db = hyperscan.Database(mode=hyperscan.HS_MODE_STREAM)
```

Next, use the `Database.stream` method, which provides the
`Stream` context manager. `Database.stream` can be passed a
`match_event_handler` and `context` object, which will be used for
all invocations of `Stream.scan` unless overridden.
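
Streaming mode is typically fed from data that arrives in pieces. The sketch below shows one way to split a payload into fixed-size chunks and hand each one to a stream's `scan` method. The names `feed_in_chunks`, `chunk_size`, and the `RecordingStream` stand-in are hypothetical, used so the snippet runs without a compiled database. Only the `scan(chunk)` call is taken from the API described above.

```python
from typing import Any, List


def feed_in_chunks(stream: Any, data: bytes, chunk_size: int = 4096) -> int:
    """Pass `data` to `stream.scan` in pieces of at most `chunk_size` bytes.

    Returns the number of chunks written. Hypothetical helper.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    count = 0
    for start in range(0, len(data), chunk_size):
        stream.scan(data[start:start + chunk_size])
        count += 1
    return count


class RecordingStream:
    """Stand-in for a Stream that only records what it was given."""

    def __init__(self) -> None:
        self.chunks: List[bytes] = []

    def scan(self, data: bytes) -> None:
        self.chunks.append(data)


fake = RecordingStream()
written = feed_in_chunks(fake, b"foobarfoobar", chunk_size=5)
```

In real use, `stream` would be the object yielded by `with db.stream(match_event_handler=on_match) as stream:`. Because a stream carries state between calls, a pattern that straddles a chunk boundary can still match.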

```python
with db.stream(match_event_handler=on_match, context=2345) as stream:
    stream.scan(b'foobar')
    # Override context only for one chunk
    stream.scan(b'barfoofoobarbarfoobar', context=1234)
    # Override match handler only for one chunk
    stream.scan(b'qux', match_event_handler=on_qux_match)
```

```python
db = hyperscan.Database(mode=hyperscan.HS_MODE_VECTORED)
buffers = [
    bytearray(b'xxxfooxxx'),
    bytearray(b'xxfoxbarx'),
    bytearray(b'barxxxxxx'),
]
db.scan(buffers, match_event_handler=on_match)
```

Refer to the Hyperscan documentation for a list of parameter names
and behaviours. python-hyperscan provides a helper named tuple,
`ExpressionExt`, which is used to construct an `hs_expr_ext_t`
structure. Only the fields relevant to the given flag(s) need
to be provided; all other parameters default to 0.

```python
db.compile(
    expressions=[b'foobar'],
    flags=hyperscan.HS_FLAG_SOM_LEFTMOST,
    ext=[
        hyperscan.ExpressionExt(
            flags=hyperscan.HS_EXT_FLAG_MIN_OFFSET, min_offset=12
        )
    ],
)

# Matches the second `foobar`
db.scan(b'foobarfoobar', match_event_handler=on_match)
```

Refer to the Hyperscan documentation for more information on serialization, its use cases, and caveats. Usage is simple:

```python
# Serializing (dumping to bytes)
serialized = hyperscan.dumpb(db)
with open('hs.db', 'wb') as f:
    f.write(serialized)

# Deserializing (loading from bytes)
db = hyperscan.loadb(serialized)
```

```python
chimera_db = hyperscan.Database(chimera=True)
chimera_db.compile(expressions=[br'(foo)+', br'b(ar|az)'])
chimera_db.scan(b'foobaz', match_event_handler=on_match)
```

Chimera is a hybrid of PCRE and Hyperscan's multi-pattern engine. When
using it, reuse a `Scratch` object per thread to avoid reallocations
caused by the larger databases.
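
One way to arrange a per-thread scratch is a `threading.local` cache, sketched below. The helper `scratch_for_thread` and its `factory` argument are hypothetical. The factory stands for whatever callable builds a scratch for a database (assumed here to be `hyperscan.Scratch`, whose exact signature is not shown in this section). Placeholder objects are used so the snippet runs without Hyperscan.

```python
import threading
from typing import Any, Callable, List

_local = threading.local()


def scratch_for_thread(db: Any, factory: Callable[[Any], Any]) -> Any:
    """Return the calling thread's scratch for `db`, creating it on first use.

    Hypothetical helper. The cache is keyed by id(db), so it assumes the
    database object outlives the threads that scan with it.
    """
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    key = id(db)
    if key not in cache:
        cache[key] = factory(db)
    return cache[key]


# Demonstration with placeholder objects instead of a real database.
created: List[Any] = []


def fake_factory(db: Any) -> Any:
    scratch = object()
    created.append(scratch)
    return scratch


fake_db = object()
first = scratch_for_thread(fake_db, fake_factory)
second = scratch_for_thread(fake_db, fake_factory)  # reused, not rebuilt

from_worker: List[Any] = []
worker = threading.Thread(
    target=lambda: from_worker.append(scratch_for_thread(fake_db, fake_factory))
)
worker.start()
worker.join()
```

Each thread builds its scratch once and reuses it afterwards. How the scratch is then attached to a database or passed to a scan call should be checked against the python-hyperscan API reference.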