Commit 59f0224

Merge pull request hed-standard#1299 from VisLab/enhance_search
Added a string search option for HED (experimental)
2 parents 892bad7 + 126aab7 commit 59f0224

8 files changed

Lines changed: 2225 additions & 0 deletions

File tree

docs/api/models.rst

Lines changed: 45 additions & 0 deletions
@@ -177,6 +177,51 @@ search_hed_objs
177177

178178
.. autofunction:: hed.models.query_service.search_hed_objs
179179

180+
String-based search
181+
-------------------
182+
183+
Search functions that operate on raw HED strings without requiring pre-parsed ``HedString`` objects
184+
or a loaded schema. See also :doc:`/search_implementation` for a full comparison of all three
185+
search implementations.
186+
187+
StringQueryHandler
188+
~~~~~~~~~~~~~~~~~~
189+
190+
.. autoclass:: hed.models.string_search.StringQueryHandler
191+
:members:
192+
:undoc-members:
193+
:show-inheritance:
194+
195+
StringNode
196+
~~~~~~~~~~
197+
198+
.. autoclass:: hed.models.string_search.StringNode
199+
:members:
200+
:undoc-members:
201+
:show-inheritance:
202+
203+
parse_hed_string
204+
~~~~~~~~~~~~~~~~
205+
206+
.. autofunction:: hed.models.string_search.parse_hed_string
207+
208+
search_series
209+
~~~~~~~~~~~~~
210+
211+
.. autofunction:: hed.models.string_search.search_series
212+
213+
Schema lookup utilities
214+
~~~~~~~~~~~~~~~~~~~~~~~
215+
216+
Pre-generate and persist a tag-ancestor lookup dictionary from a :class:`~hed.schema.HedSchema`
217+
for use with :class:`~hed.models.string_search.StringQueryHandler`.
218+
219+
.. autofunction:: hed.models.schema_lookup.generate_schema_lookup
220+
221+
.. autofunction:: hed.models.schema_lookup.save_schema_lookup
222+
223+
.. autofunction:: hed.models.schema_lookup.load_schema_lookup
224+
180225
DataFrame utilities
181226
-------------------
182227

docs/index.rst

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ in various formats:
2727
:maxdepth: 2
2828

2929
User guide <user_guide>
30+
Search implementations <search_implementation>
3031
API <api/index>
3132

3233
* :ref:`genindex`

docs/search_implementation.md

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
1+
---
2+
html_meta:
3+
description: Comparison of the three HED string search implementations in hedtools - basic_search, QueryHandler, and StringQueryHandler
4+
keywords: HED search, string search, query handler, basic search, performance, hedtools, pattern matching
5+
---
6+
7+
```{index} search, string search, query, QueryHandler, StringQueryHandler, basic_search
8+
```
9+
10+
# HED search implementations
11+
12+
HEDtools provides three distinct mechanisms for searching HED-annotated data. They share a common goal ("does this HED string match this query?") but differ substantially in their inputs, capabilities, schema requirements, and performance characteristics. Choosing the right implementation depends on whether you need schema-aware ancestor matching, full group-structural queries, or raw throughput on unparsed strings.
13+
14+
## Overview of the three implementations
15+
16+
### `basic_search` — regex-based flat matching
17+
18+
Located in {mod}`hed.models.basic_search`, the `find_matching()` function operates directly on a `pd.Series` of raw HED strings using compiled regular expressions. It requires no schema and no parsing step, making it the fastest option for bulk row filtering.
19+
20+
Key characteristics:
21+
22+
- Input is a `pd.Series` of raw strings; output is a `pd.Series[bool]` mask.
23+
- The query is compiled once into a regex and applied with `Series.str.contains`.
24+
- Matches are purely literal — `Event` does not match `Sensory-event`.
25+
- `@A` in a basic-search query means A **must be present** anywhere in the string (note: this is the **opposite** of what `@A` means in `QueryHandler`/`StringQueryHandler`).
26+
- `~A` means A must not appear anywhere (global negation).
27+
- `(A, B)` syntax checks that A and B appear at the same nesting level.
28+
- Wildcard `A*` expands to the regex `A.*?`, which can span `/` and match mid-token substrings.
29+
30+
Use `basic_search` when you are working with a large series of raw strings, don't need ancestor matching, and want maximum throughput. See {func}`hed.models.basic_search.find_matching`.
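
The wildcard behaviour described above can be illustrated with the standard `re` module alone. This is a minimal sketch, not the code inside `find_matching()`: the helper names `wildcard_to_regex` and `prefix_on_short_form` are hypothetical, and the second helper only approximates the "prefix on short form" behaviour that the comparison tables attribute to the query handlers.

```python
import re


def wildcard_to_regex(term: str) -> str:
    """Hypothetical helper: expand a trailing '*' into a non-greedy '.*?'."""
    if term.endswith("*"):
        return re.escape(term[:-1]) + ".*?"
    return re.escape(term)


def prefix_on_short_form(term: str, tag: str) -> bool:
    """Hypothetical contrast: anchor the prefix to the start of the short form."""
    short_form = tag.rsplit("/", 1)[-1]
    return short_form.startswith(term.rstrip("*"))


tags = ["Sensory-event", "Event/Sensory-event", "Agent-action", "Non-Sensory"]

# Unanchored regex search: the pattern may match anywhere in the string.
pattern = re.compile(wildcard_to_regex("Sens*"))
unanchored = [bool(pattern.search(tag)) for tag in tags]

# Anchored prefix on the last slash-separated component.
anchored = [prefix_on_short_form("Sens*", tag) for tag in tags]

print(unanchored)
print(anchored)
```

The two results differ only for `Non-Sensory`, where the unanchored pattern finds `Sens` in the middle of a token.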
31+
32+
### `QueryHandler` — schema-backed object search
33+
34+
Located in {mod}`hed.models.query_handler`, `QueryHandler` is the full-featured search engine. It compiles a query string into an expression tree once, then evaluates that tree against `HedString` objects that have already been parsed against a loaded `HedSchema`.
35+
36+
Key characteristics:
37+
38+
- Input is a `HedString` object; a full `HedSchema` is required.
39+
- Output is a `list[SearchResult]` containing `HedTag` / `HedGroup` object references, useful for tag-level introspection (not just row filtering).
40+
- Supports the complete query language: `&&`, `||`, `~`, `@`, `{}`, `[]`, `{:}`, `?`, `??`, `???`.
41+
- `@A` means A must **not** appear anywhere in the string.
42+
- Ancestor matching is exact — the schema normalises both query and string tags to short form, so `Event` matches `Sensory-event` because the schema knows `Sensory-event` descends from `Event`.
43+
- Per-string cost includes a full HedString parse and schema tag resolution.
44+
45+
Use `QueryHandler` when you need schema-aware ancestor matching, or when you want object references (e.g., to retrieve the matched group for further processing). See {class}`hed.models.query_handler.QueryHandler`.
46+
47+
### `StringQueryHandler` — tree-based schema-optional search
48+
49+
Located in {mod}`hed.models.string_search`, `StringQueryHandler` is a new middle-ground implementation that inherits from `QueryHandler` and reuses the full expression-tree compiler, but operates on raw strings rather than pre-parsed `HedString` objects.
50+
51+
It parses each raw HED string into a lightweight {class}`~hed.models.string_search.StringNode` tree that duck-types the `HedGroup`/`HedTag` interfaces expected by the existing expression evaluators — so all `QueryHandler` query syntax works unchanged.
52+
53+
Key characteristics:
54+
55+
- Input is a raw string (or a `pd.Series` via {func}`~hed.models.string_search.search_series`).
56+
- Schema is **optional**: pass a `schema_lookup` dict (see {mod}`hed.models.schema_lookup`) to enable ancestor matching for short-form strings (e.g. `Event` matching `Sensory-event`); omit it for purely literal matching.
57+
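- Output is a list that is used as a truthy/falsy value for row filtering. Any references it holds point to lightweight `StringNode` objects, not to `HedTag`/`HedGroup` objects.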
58+
- Supports the same full query syntax as `QueryHandler` (`&&`, `||`, `~`, `@`, `{}`, etc.).
59+
- `@A` carries the same semantics as `QueryHandler` — A must **not** be present.
60+
- Long-form strings (`Event/Sensory-event`) support ancestor matching via slash-splitting even without a lookup. Short-form strings (`Sensory-event`) require a `schema_lookup` for ancestor matching; without one, matching is purely literal.
61+
- Parse cost is a lightweight recursive split — much cheaper than a full HedString + schema parse.
62+
63+
Use `StringQueryHandler` when you have raw strings (not `HedString` objects), need the full `QueryHandler` query syntax, and either don't have a schema available or want faster processing at the cost of losing full schema-aware ancestor matching. See {class}`hed.models.string_search.StringQueryHandler`.
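
To make the "lightweight split" idea concrete, here is a minimal sketch of turning a raw HED string into a nested structure and deriving slash-split terms. It is an illustration under stated assumptions, not the library's `parse_hed_string` or `StringNode`: the names `split_hed_string` and `slash_terms` are hypothetical, groups are plain Python lists, and a single pass with a stack stands in for the recursive split.

```python
def split_hed_string(text: str) -> list:
    """Parse 'A, (B, (C, D))' into ['A', ['B', ['C', 'D']]] (hypothetical helper)."""
    root: list = []
    stack = [root]          # stack[-1] is the group currently being filled
    token: list = []        # characters of the tag currently being read

    def flush() -> None:
        tag = "".join(token).strip()
        if tag:
            stack[-1].append(tag)
        token.clear()

    for ch in text:
        if ch == "(":
            flush()
            group: list = []
            stack[-1].append(group)
            stack.append(group)
        elif ch == ")":
            flush()
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            stack.pop()
        elif ch == ",":
            flush()
        else:
            token.append(ch)
    flush()
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return root


def slash_terms(tag: str) -> tuple:
    """Casefolded slash-separated components of a tag (hypothetical helper)."""
    return tuple(part.casefold() for part in tag.split("/"))


tree = split_hed_string("Event/Sensory-event, (Red, (Blue, Green))")
print(tree)
print(slash_terms("Event/Sensory-event"))
print(slash_terms("Sensory-event"))
```

With terms derived this way, a query for `Event` finds `event` among the terms of the long-form tag but not among those of the short-form tag, which is why short-form strings need a `schema_lookup` for ancestor matching.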
64+
65+
### Generating a schema lookup
66+
67+
If you want `StringQueryHandler` to resolve ancestors for short-form strings (e.g. query `Event` matching `Sensory-event`) without a full schema parse per row, you can pre-generate a lookup dictionary from a `HedSchema`:
68+
69+
```python
from hed import load_schema_version
from hed import generate_schema_lookup, save_schema_lookup, load_schema_lookup

# Load the schema once and derive the lookup from it
schema = load_schema_version("8.4.0")
lookup = generate_schema_lookup(schema)  # {short_name_casefold: tag_terms_tuple}

# Persist for reuse, then reload it in a later session
save_schema_lookup(lookup, "hed840_lookup.json")
lookup = load_schema_lookup("hed840_lookup.json")
```
80+
81+
See {func}`hed.models.schema_lookup.generate_schema_lookup`.
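
The following sketch shows how a lookup in the `{short_name_casefold: tag_terms_tuple}` shape could resolve ancestors for short-form tags. It uses a hand-written two-entry toy dictionary rather than output from `generate_schema_lookup`, and the helpers `resolve_terms` and `matches_ancestor` are hypothetical names, not part of hedtools. It also shows a general JSON property worth keeping in mind when persisting such a dictionary yourself: tuples come back as lists. How `save_schema_lookup` and `load_schema_lookup` handle this is not covered here.

```python
import json

# Hand-written toy lookup for illustration only (not generated from a schema).
toy_lookup = {
    "event": ("event",),
    "sensory-event": ("event", "sensory-event"),
}


def resolve_terms(tag, lookup=None):
    """Return ancestor terms from the lookup, else fall back to slash-splitting."""
    key = tag.casefold()
    if lookup and key in lookup:
        return tuple(lookup[key])
    return tuple(part.casefold() for part in tag.split("/"))


def matches_ancestor(query, tag, lookup=None):
    """True if the casefolded query is one of the tag's resolved terms."""
    return query.casefold() in resolve_terms(tag, lookup)


# Without a lookup, a short-form tag only matches itself.
print(matches_ancestor("Event", "Sensory-event"))
# With the lookup, the ancestor is found.
print(matches_ancestor("Event", "Sensory-event", toy_lookup))
# Long-form tags carry their ancestors in the string itself.
print(matches_ancestor("Event", "Event/Sensory-event"))

# A plain JSON round trip turns tuples into lists; resolve_terms normalises them.
restored = json.loads(json.dumps(toy_lookup))
print(type(restored["event"]).__name__)
print(matches_ancestor("Event", "Sensory-event", restored))
```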
82+
83+
______________________________________________________________________
84+
85+
## Comparison tables
86+
87+
### Core characteristics
88+
89+
| Property | `basic_search` | `QueryHandler` | `StringQueryHandler` |
90+
| --------------------- | -------------------------- | -------------------------------------------------- | ----------------------------------------------- |
91+
| **Input** | `pd.Series` of raw strings | `HedString` object | Raw string or `pd.Series` (via `search_series`) |
92+
| **Schema required** | No | Yes — full `HedSchema` for tag parsing | No; optional `schema_lookup` dict |
93+
| **Output** | `pd.Series[bool]` mask | `list[SearchResult]` with `HedTag`/`HedGroup` refs | `list` (truthy/falsy); `StringNode` refs |
94+
| **Result usable for** | Row filtering | Row filtering + tag/group introspection | Row filtering only |
95+
| **Batch API** | Native (`series`) | Manual loop | `search_series(series, query)` |
96+
| **Parse cost** | Regex compilation once | Full `HedString` + schema parse per string | Lightweight tree parse per string |
97+
| **Unrecognised tags** | Matched literally | Silent match failure (`tag_terms = ()`) | Matched literally |
98+
99+
### Query syntax
100+
101+
| Feature | `basic_search` query syntax | `QueryHandler` / `StringQueryHandler` query syntax |
102+
| ---------------------------- | --------------------------------------------------- | -------------------------------------------------- |
103+
| **AND** | Space or comma between terms (context-dependent) | `A && B` or `A, B` |
104+
| **OR** | Not supported | `A \|\| B` |
105+
| **Absent from string (`@`)** | ⚠️ `@A` means A **must be present** anywhere | `@A` means A must **not** appear anywhere |
106+
| **Must-not-appear (`~`)** | `~A` — A must not appear anywhere (global) | `~A` — negation within group context (local) |
107+
| **Prefix wildcard** | `A*` → regex `A.*?` (spans `/`, matches substrings) | `A*` → prefix on short form only |
108+
| **Full regex per term** | Yes (`regex=True` mode) | No |
109+
| **Quoted exact match** | No | `"A"` — exact match, no ancestor search |
110+
| **Implicit default** | If no `(` or `@`: all terms become "anywhere" | No implicit conversion — must be explicit |
111+
112+
### Group / structural operators
113+
114+
| Feature | `basic_search` | `QueryHandler` | `StringQueryHandler` |
115+
| --------------------------------- | ----------------------------------------- | -------------------------------------------- | ---------------------- |
116+
| **Same nesting level** | `(A, B)` — A and B at same relative level | N/A — use `{A, B}` | N/A — use `{A, B}` |
117+
| **Same parenthesised group `{}`** | No | `{A, B}` — must share a direct parent group | Same as `QueryHandler` |
118+
| **Exact group `{:}`** | No | `{A, B:}` — same group, no other children | Same |
119+
| **Optional exact group** | No | `{A, B: C}` — A and B required, C optional | Same |
120+
| **Descendant group `[]`** | No | `[A, B]` — both in same subtree at any depth | Same |
121+
| **Any child `?`** | No | `?` — any tag or group child | Same |
122+
| **Any tag child `??`** | No | `??` — any leaf (non-group) child | Same |
123+
| **Any group child `???`** | No | `???` — any parenthesised group child | Same |
124+
| **Nested query operators** | No | Yes — full recursive composition | Same |
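
The difference between `{A, B}` (same direct parent group) and `[A, B]` (same subtree at any depth) can be sketched on a nested-list model of a HED string. This is one reading of the descriptions in the table above, under the assumption that only parenthesised groups count and the top level does not. The function names are hypothetical and this is not how the expression evaluators are implemented.

```python
def iter_groups(node):
    """Yield every parenthesised group (nested list) below node."""
    for child in node:
        if isinstance(child, list):
            yield child
            yield from iter_groups(child)


def leaves(node):
    """Yield every tag (non-list item) at any depth below node."""
    for child in node:
        if isinstance(child, list):
            yield from leaves(child)
        else:
            yield child


def share_direct_group(tree, a, b):
    """Sketch of '{A, B}': both tags are direct children of one group."""
    return any(a in group and b in group for group in iter_groups(tree))


def share_subtree(tree, a, b):
    """Sketch of '[A, B]': both tags occur somewhere inside one group."""
    return any({a, b} <= set(leaves(group)) for group in iter_groups(tree))


# Nested-list model of the string "Event, (Red, (Blue, Green))".
tree = ["Event", ["Red", ["Blue", "Green"]]]

print(share_direct_group(tree, "Blue", "Green"))
print(share_direct_group(tree, "Red", "Blue"))
print(share_subtree(tree, "Red", "Blue"))
print(share_subtree(tree, "Event", "Red"))
```

`Red` and `Blue` sit at different depths, so they satisfy the subtree check but not the direct-group check.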
125+
126+
### Ancestor / cross-form search
127+
128+
| Scenario | `basic_search` | `QueryHandler` | `StringQueryHandler` |
129+
| ------------------------------------------------------- | ------------------------------------------------- | --------------------------------------- | ---------------------------------------------------------------- |
130+
| Query `Event`, string `Sensory-event` (short form)      | ❌ literal only                                   | ✅ `tag_terms` from schema              | ✅ with `schema_lookup`; ❌ without                              |
131+
| Query `Event`, string `Event/Sensory-event` (long form) | ✅ `Event` is a literal substring of `Event/Sensory-event` | ✅ schema normalises | ✅ slash-split produces `tag_terms = ("event", "sensory-event")` |
132+
| Query `Event/Sensory-event`, string `Sensory-event`     | ❌                                                | ✅ schema normalises both to short form | ❌ no schema to normalise                                        |
133+
| Schema-free ancestor search | `convert_query()` + long-form series (workaround) | N/A — schema always required | ✅ works natively for long-form strings |
134+
| Tag `Def/Name` matched by query `Def`                   | ❌ literal prefix mismatch                        | ✅ `short_base_tag = "Def"`             | ✅ `tag_terms` contains `"def"`                                  |
135+
136+
### Critical semantic traps
137+
138+
These differences are silent — no error, just wrong answers if you mix up query strings across implementations:
139+
140+
| Operator | `basic_search` | `QueryHandler` / `StringQueryHandler` |
141+
| ----------------- | -------------------------------------------------------- | ----------------------------------------------------------------------------------- |
142+
| `@A` | A **must** appear anywhere in the string | A must **not** appear anywhere in the string |
143+
| `~A` | A must not appear **anywhere** (global) | A must not appear in any group that also matches the rest of the expression (local) |
144+
| `*` wildcard | Regex `.*?` — spans `/` and matches mid-token substrings | Strict prefix on the tag's short form — anchored to start |
145+
| No-operator `A B` | Both present anywhere (implicit `@` on each term)        | Parse error: an explicit `&&` (or `,`) is required                                  |
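
Because these differences are silent, one defensive option is to flag the affected operators before reusing a query string with a different implementation. The helper below is a hypothetical sketch, not part of hedtools. It only checks whether the characters `@`, `~`, and `*` occur in the query and repeats the distinctions from the table above.

```python
# Operators whose meaning differs between basic_search and the query handlers.
RISKY_OPERATORS = {
    "@": "presence in basic_search, absence in QueryHandler/StringQueryHandler",
    "~": "global negation in basic_search, group-local negation in the handlers",
    "*": "unanchored regex in basic_search, anchored short-form prefix in the handlers",
}


def portability_warnings(query: str) -> list[str]:
    """Return one warning per risky operator that occurs in the query string."""
    return [f"{op}: {why}" for op, why in RISKY_OPERATORS.items() if op in query]


for warning in portability_warnings("@Event, Sens*"):
    print(warning)
```

A query that uses none of these characters, such as `Event && Red`, produces no warnings, although it may still need rewriting because `&&` is handler syntax.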
146+
147+
______________________________________________________________________
148+
149+
## Performance
150+
151+
*Performance benchmarks will be added here.*
152+
153+
Preliminary guidance:
154+
155+
- For large-scale row filtering on raw strings where schema awareness is not needed, `basic_search` is likely fastest due to vectorised regex on the full series with no per-row parsing.
156+
- `StringQueryHandler` trades some throughput for full query-language support and optional ancestor matching; parse cost per row is a lightweight recursive split.
157+
- `QueryHandler` has the highest per-string cost because it requires a pre-parsed `HedString` (including schema tag resolution), but provides the richest result objects.

hed/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -9,6 +9,8 @@
99
from hed.models.definition_dict import DefinitionDict
1010
from hed.models.query_handler import QueryHandler
1111
from hed.models.query_service import get_query_handlers, search_hed_objs
12+
from hed.models.string_search import StringQueryHandler, parse_hed_string, search_series
13+
from hed.models.schema_lookup import generate_schema_lookup, load_schema_lookup, save_schema_lookup
1214

1315
from hed.schema.hed_schema import HedSchema
1416
from hed.schema.hed_schema_group import HedSchemaGroup
