| 1 | +--- |
| 2 | +html_meta: |
| 3 | + description: Comparison of the three HED string search implementations in hedtools - basic_search, QueryHandler, and StringQueryHandler |
| 4 | + keywords: HED search, string search, query handler, basic search, performance, hedtools, pattern matching |
| 5 | +--- |
| 6 | + |
| 7 | +```{index} search, string search, query, QueryHandler, StringQueryHandler, basic_search |
| 8 | +``` |
| 9 | + |
| 10 | +# HED search implementations |
| 11 | + |
| 12 | +HEDtools provides three distinct mechanisms for searching HED-annotated data. They share a common goal ("does this HED string match this query?") but differ substantially in their inputs, capabilities, schema requirements, and performance characteristics. Choosing the right implementation depends on whether you need schema-aware ancestor matching, full group-structural queries, or raw throughput on unparsed strings.
| 13 | + |
| 14 | +## Overview of the three implementations |
| 15 | + |
| 16 | +### `basic_search` — regex-based flat matching |
| 17 | + |
| 18 | +Located in {mod}`hed.models.basic_search`, the `find_matching()` function operates directly on a `pd.Series` of raw HED strings using compiled regular expressions. It requires no schema and no parsing step, making it the fastest option for bulk row filtering. |
| 19 | + |
| 20 | +Key characteristics: |
| 21 | + |
| 22 | +- Input is a `pd.Series` of raw strings; output is a `pd.Series[bool]` mask. |
| 23 | +- The query is compiled once into a regex and applied with `Series.str.contains`. |
| 24 | +- Matches are purely literal — `Event` does not match `Sensory-event`. |
| 25 | +- `@A` in a basic-search query means A **must be present** anywhere in the string (note: this is the **opposite** of what `@A` means in `QueryHandler`/`StringQueryHandler`). |
| 26 | +- `~A` means A must not appear anywhere (global negation). |
| 27 | +- `(A, B)` syntax checks that A and B appear at the same nesting level. |
| 28 | +- Wildcard `A*` expands to the regex `A.*?`, which can span `/` and match mid-token substrings. |
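The wildcard behaviour in the last bullet can be reproduced with the standard `re` module alone. This is a minimal sketch that assumes only the `A*` to `A.*?` expansion stated above; it does not call `find_matching()`, and the `strict_prefix` helper is a hypothetical stand-in for an anchored prefix test, included only for contrast.

```python
import re

# Assumption: the query term `Event*` expands to the lazy regex `Event.*?`.
wildcard = re.compile(r"Event.*?")

# `.` also matches `/`, so the pattern can run across path separators.
spans_slash = wildcard.fullmatch("Event/Sensory-event") is not None

# An unanchored search finds the term in the middle of a longer token.
mid_token = re.compile(r"event.*?").search("Sensory-event") is not None


def strict_prefix(term: str, short_tag: str) -> bool:
    """Hypothetical anchored prefix test on a tag's short form."""
    return short_tag.casefold().startswith(term.casefold())


anchored = strict_prefix("event", "Sensory-event")
print(spans_slash, mid_token, anchored)
```

The first two checks succeed while the anchored prefix test fails, which is why a wildcard query can select more rows under a regex expansion than under a prefix rule.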
| 29 | + |
| 30 | +Use `basic_search` when you are working with a large series of raw strings, don't need ancestor matching, and want maximum throughput. See {func}`hed.models.basic_search.find_matching`. |
| 31 | + |
| 32 | +### `QueryHandler` — schema-backed object search |
| 33 | + |
| 34 | +Located in {mod}`hed.models.query_handler`, `QueryHandler` is the full-featured search engine. It compiles a query string into an expression tree once, then evaluates that tree against `HedString` objects that have already been parsed against a loaded `HedSchema`. |
| 35 | + |
| 36 | +Key characteristics: |
| 37 | + |
| 38 | +- Input is a `HedString` object; a full `HedSchema` is required. |
| 39 | +- Output is a `list[SearchResult]` containing `HedTag` / `HedGroup` object references, useful for tag-level introspection (not just row filtering). |
| 40 | +- Supports the complete query language: `&&`, `||`, `~`, `@`, `{}`, `[]`, `{:}`, `?`, `??`, `???`. |
| 41 | +- `@A` means A must **not** appear anywhere in the string. |
| 42 | +- Ancestor matching is exact — the schema normalises both query and string tags to short form, so `Event` matches `Sensory-event` because the schema knows `Sensory-event` descends from `Event`. |
| 43 | +- Per-string cost includes a full `HedString` parse and schema tag resolution.
| 44 | + |
| 45 | +Use `QueryHandler` when you need schema-aware ancestor matching, or when you want object references (e.g., to retrieve the matched group for further processing). See {class}`hed.models.query_handler.QueryHandler`. |
| 46 | + |
| 47 | +### `StringQueryHandler` — tree-based schema-optional search |
| 48 | + |
| 49 | +Located in {mod}`hed.models.string_search`, `StringQueryHandler` is a new middle-ground implementation that inherits from `QueryHandler` and reuses the full expression-tree compiler, but operates on raw strings rather than pre-parsed `HedString` objects. |
| 50 | + |
| 51 | +It parses each raw HED string into a lightweight {class}`~hed.models.string_search.StringNode` tree that duck-types the `HedGroup`/`HedTag` interfaces expected by the existing expression evaluators — so all `QueryHandler` query syntax works unchanged. |
| 52 | + |
| 53 | +Key characteristics: |
| 54 | + |
| 55 | +- Input is a raw string (or a `pd.Series` via {func}`~hed.models.string_search.search_series`). |
| 56 | +- Schema is **optional**: pass a `schema_lookup` dict (see {mod}`hed.models.schema_lookup`) to enable ancestor matching for short-form strings (e.g. `Event` matching `Sensory-event`); omit it for purely literal matching. |
| 57 | +- Output is a list (truthy/falsy) of `StringNode` references, suitable for row filtering only; it carries no `HedTag`/`HedGroup` object references.
| 58 | +- Supports the same full query syntax as `QueryHandler` (`&&`, `||`, `~`, `@`, `{}`, etc.). |
| 59 | +- `@A` carries the same semantics as `QueryHandler` — A must **not** be present. |
| 60 | +- Long-form strings (`Event/Sensory-event`) support ancestor matching via slash-splitting even without a lookup. Short-form strings (`Sensory-event`) require a `schema_lookup` for ancestor matching; without one, matching is purely literal. |
| 61 | +- Parse cost is a lightweight recursive split, much cheaper than a full `HedString` + schema parse.
| 62 | + |
| 63 | +Use `StringQueryHandler` when you have raw strings (not `HedString` objects), need the full `QueryHandler` query syntax, and either don't have a schema available or want faster processing at the cost of losing full schema-aware ancestor matching. See {class}`hed.models.string_search.StringQueryHandler`. |
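To make the lightweight split and the slash-splitting rule concrete, here is a minimal stdlib-only sketch. It is not the `StringNode` implementation: `parse_hed`, `tag_terms`, `matches_ancestor`, and the hand-built `lookup` dict are hypothetical names used for illustration, and the lookup shape (short name, casefolded, mapped to a tuple of path terms) is an assumption based on how this page describes `schema_lookup`.

```python
from typing import Optional


def parse_hed(text: str) -> list:
    """Split a raw HED string into nested lists; each list is one group.

    Assumes balanced parentheses; no validation is attempted.
    """
    root: list = []
    stack = [root]
    token = ""
    for char in text:
        if char in "(),":
            if token.strip():
                stack[-1].append(token.strip())
            token = ""
            if char == "(":
                group: list = []
                stack[-1].append(group)
                stack.append(group)
            elif char == ")":
                stack.pop()
        else:
            token += char
    if token.strip():
        root.append(token.strip())
    return root


def tag_terms(tag: str, lookup: Optional[dict] = None) -> tuple:
    """Casefolded path terms for a tag, using the lookup for short forms."""
    terms = tuple(part.casefold() for part in tag.split("/"))
    if lookup and len(terms) == 1:
        return lookup.get(terms[0], terms)
    return terms


def matches_ancestor(query: str, tag: str, lookup: Optional[dict] = None) -> bool:
    """True if the query names the tag itself or one of its path terms."""
    return query.casefold() in tag_terms(tag, lookup)


# Hand-built, hypothetical lookup entry; a real one would come from a schema.
lookup = {"sensory-event": ("event", "sensory-event")}

tree = parse_hed("Sensory-event, (Red, (Event/Sensory-event, Blue))")
long_form = matches_ancestor("Event", "Event/Sensory-event")
short_no_lookup = matches_ancestor("Event", "Sensory-event")
short_with_lookup = matches_ancestor("Event", "Sensory-event", lookup)
```

In this sketch the long-form tag matches without any lookup, while the short-form tag matches only once the lookup supplies its ancestors, mirroring the bullet on long-form versus short-form strings.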
| 64 | + |
| 65 | +### Generating a schema lookup |
| 66 | + |
| 67 | +If you want `StringQueryHandler` to resolve ancestors for short-form strings (e.g. query `Event` matching `Sensory-event`) without a full schema parse per row, you can pre-generate a lookup dictionary from a `HedSchema`: |
| 68 | + |
| 69 | +```python |
| 70 | +from hed import load_schema_version |
| 71 | +from hed import generate_schema_lookup, save_schema_lookup, load_schema_lookup |
| 72 | + |
| 73 | +schema = load_schema_version("8.4.0") |
| 74 | +lookup = generate_schema_lookup(schema) # {short_name_casefold: tag_terms_tuple} |
| 75 | + |
| 76 | +# Persist for reuse |
| 77 | +save_schema_lookup(lookup, "hed840_lookup.json") |
| 78 | +lookup = load_schema_lookup("hed840_lookup.json") |
| 79 | +``` |
| 80 | + |
| 81 | +See {func}`hed.models.schema_lookup.generate_schema_lookup`. |
| 82 | + |
| 83 | +______________________________________________________________________ |
| 84 | + |
| 85 | +## Comparison tables |
| 86 | + |
| 87 | +### Core characteristics |
| 88 | + |
| 89 | +| Property | `basic_search` | `QueryHandler` | `StringQueryHandler` | |
| 90 | +| --------------------- | -------------------------- | -------------------------------------------------- | ----------------------------------------------- | |
| 91 | +| **Input** | `pd.Series` of raw strings | `HedString` object | Raw string or `pd.Series` (via `search_series`) | |
| 92 | +| **Schema required** | No | Yes — full `HedSchema` for tag parsing | No; optional `schema_lookup` dict | |
| 93 | +| **Output** | `pd.Series[bool]` mask | `list[SearchResult]` with `HedTag`/`HedGroup` refs | `list` (truthy/falsy); `StringNode` refs | |
| 94 | +| **Result usable for** | Row filtering | Row filtering + tag/group introspection | Row filtering only | |
| 95 | +| **Batch API** | Native (`series`) | Manual loop | `search_series(series, query)` | |
| 96 | +| **Parse cost** | Regex compilation once | Full `HedString` + schema parse per string | Lightweight tree parse per string | |
| 97 | +| **Unrecognised tags** | Matched literally | Silent match failure (`tag_terms = ()`) | Matched literally | |
| 98 | + |
| 99 | +### Query syntax |
| 100 | + |
| 101 | +| Feature | `basic_search` query syntax | `QueryHandler` / `StringQueryHandler` query syntax | |
| 102 | +| ---------------------------- | --------------------------------------------------- | -------------------------------------------------- | |
| 103 | +| **AND** | Space or comma between terms (context-dependent) | `A && B` or `A, B` | |
| 104 | +| **OR** | Not supported | `A \|\| B` | |
| 105 | +| **Absent from string (`@`)** | ⚠️ `@A` means A **must be present** anywhere | `@A` means A must **not** appear anywhere | |
| 106 | +| **Must-not-appear (`~`)** | `~A` — A must not appear anywhere (global) | `~A` — negation within group context (local) | |
| 107 | +| **Prefix wildcard** | `A*` → regex `A.*?` (spans `/`, matches substrings) | `A*` → prefix on short form only | |
| 108 | +| **Full regex per term** | Yes (`regex=True` mode) | No | |
| 109 | +| **Quoted exact match** | No | `"A"` — exact match, no ancestor search | |
| 110 | +| **Implicit default** | If no `(` or `@`: all terms become "anywhere" | No implicit conversion — must be explicit | |
| 111 | + |
| 112 | +### Group / structural operators |
| 113 | + |
| 114 | +| Feature | `basic_search` | `QueryHandler` | `StringQueryHandler` | |
| 115 | +| --------------------------------- | ----------------------------------------- | -------------------------------------------- | ---------------------- | |
| 116 | +| **Same nesting level** | `(A, B)` — A and B at same relative level | N/A — use `{A, B}` | N/A — use `{A, B}` | |
| 117 | +| **Same parenthesised group `{}`** | No | `{A, B}` — must share a direct parent group | Same as `QueryHandler` | |
| 118 | +| **Exact group `{:}`** | No | `{A, B:}` — same group, no other children | Same | |
| 119 | +| **Optional exact group** | No | `{A, B: C}` — A and B required, C optional | Same | |
| 120 | +| **Descendant group `[]`** | No | `[A, B]` — both in same subtree at any depth | Same | |
| 121 | +| **Any child `?`** | No | `?` — any tag or group child | Same | |
| 122 | +| **Any tag child `??`** | No | `??` — any leaf (non-group) child | Same | |
| 123 | +| **Any group child `???`** | No | `???` — any parenthesised group child | Same | |
| 124 | +| **Nested query operators** | No | Yes — full recursive composition | Same | |
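The difference between `{A, B}` and `[A, B]` in the table above can be modelled on plain nested lists. This is a toy sketch of the table's wording only, not the `QueryHandler` evaluator: the function names are hypothetical, tags are compared literally, and the tree is written by hand rather than parsed.

```python
def groups(node: list):
    """Yield every nested (parenthesised) group below node, at any depth."""
    for child in node:
        if isinstance(child, list):
            yield child
            yield from groups(child)


def direct_tags(group: list) -> set:
    """Tags that are immediate children of the group."""
    return {child.casefold() for child in group if isinstance(child, str)}


def all_tags(group: list) -> set:
    """Tags anywhere in the group's subtree."""
    found = set()
    for child in group:
        found |= all_tags(child) if isinstance(child, list) else {child.casefold()}
    return found


def same_group(top: list, a: str, b: str) -> bool:
    """Toy model of `{A, B}`: both tags share a direct parent group."""
    want = {a.casefold(), b.casefold()}
    return any(want <= direct_tags(g) for g in groups(top))


def same_subtree(top: list, a: str, b: str) -> bool:
    """Toy model of `[A, B]`: both tags sit in one group's subtree."""
    want = {a.casefold(), b.casefold()}
    return any(want <= all_tags(g) for g in groups(top))


# Hand-written tree for the string "(A, (B, C)), D".
top = [["A", ["B", "C"]], "D"]

braces_a_b = same_group(top, "A", "B")      # A and B have different parents
braces_b_c = same_group(top, "B", "C")      # B and C share the inner group
brackets_a_b = same_subtree(top, "A", "B")  # the outer group contains both
```

Under these assumptions `A` and `B` satisfy the descendant form but not the same-group form, because `B` sits one level deeper than `A`.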
| 125 | + |
| 126 | +### Ancestor / cross-form search |
| 127 | + |
| 128 | +| Scenario | `basic_search` | `QueryHandler` | `StringQueryHandler` | |
| 129 | +| ------------------------------------------------------- | ------------------------------------------------- | --------------------------------------- | ---------------------------------------------------------------- | |
| 130 | +| Query `Event`, string `Sensory-event` (short form) | ❌ literal only | ✅ `tag_terms` from schema | ✅ with `schema_lookup`; ❌ without | |
| 131 | +| Query `Event`, string `Event/Sensory-event` (long form) | ❌ `Event` ≠ `Event/Sensory-event` | ✅ schema normalises | ✅ slash-split produces `tag_terms = ("event", "sensory-event")` | |
| 132 | +| Query `Event/Sensory-event`, string `Sensory-event` | ❌ | ✅ schema normalises both to short form | ❌ no schema to normalise | |
| 133 | +| Schema-free ancestor search | `convert_query()` + long-form series (workaround) | N/A — schema always required | ✅ works natively for long-form strings | |
| 134 | +| Tag `Def/Name` matched by query `Def` | ❌ literal prefix mismatch | ✅ `short_base_tag = "Def"` | ✅ `tag_terms` contains `"def"` | |
| 135 | + |
| 136 | +### Critical semantic traps |
| 137 | + |
| 138 | +These differences are silent — no error, just wrong answers if you mix up query strings across implementations: |
| 139 | + |
| 140 | +| Operator | `basic_search` | `QueryHandler` / `StringQueryHandler` | |
| 141 | +| ----------------- | -------------------------------------------------------- | ----------------------------------------------------------------------------------- | |
| 142 | +| `@A` | A **must** appear anywhere in the string | A must **not** appear anywhere in the string | |
| 143 | +| `~A` | A must not appear **anywhere** (global) | A must not appear in any group that also matches the rest of the expression (local) | |
| 144 | +| `*` wildcard | Regex `.*?` — spans `/` and matches mid-token substrings | Strict prefix on the tag's short form — anchored to start | |
| 145 | +| No-operator `A B` | Both present anywhere (implicit `@` on each term)        | Parse error: an explicit `&&` or `,` is required                                    |
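The `@A` row is the easiest trap to fall into, so a toy model may help. The two functions below are hypothetical and only restate the table's semantics on a flattened, literally compared tag set; they do not call either implementation, and case handling is simplified.

```python
def flat_tags(hed: str) -> set:
    """Flatten a raw HED string into a set of casefolded tags."""
    cleaned = hed.replace("(", ",").replace(")", ",")
    return {part.strip().casefold() for part in cleaned.split(",") if part.strip()}


def at_basic_search(tag: str, hed: str) -> bool:
    """Toy model of `@A` in basic_search: A must be present anywhere."""
    return tag.casefold() in flat_tags(hed)


def at_query_handler(tag: str, hed: str) -> bool:
    """Toy model of `@A` in QueryHandler/StringQueryHandler: A must be absent."""
    return tag.casefold() not in flat_tags(hed)


hed = "Sensory-event, (Red, Square)"

# The same query text, `@Red`, gives opposite answers under the two readings.
basic_result = at_basic_search("Red", hed)
handler_result = at_query_handler("Red", hed)
```

Because neither reading raises an error, reusing a query string across implementations without translating `@` inverts the row mask without any warning.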
| 146 | + |
| 147 | +______________________________________________________________________ |
| 148 | + |
| 149 | +## Performance |
| 150 | + |
| 151 | +*Performance benchmarks will be added here.* |
| 152 | + |
| 153 | +Preliminary guidance: |
| 154 | + |
| 155 | +- For large-scale row filtering on raw strings where schema awareness is not needed, `basic_search` is likely fastest due to vectorised regex on the full series with no per-row parsing. |
| 156 | +- `StringQueryHandler` trades some throughput for full query-language support and optional ancestor matching; parse cost per row is a lightweight recursive split. |
| 157 | +- `QueryHandler` has the highest per-string cost because it requires a pre-parsed `HedString` (including schema tag resolution), but provides the richest result objects. |