[Feature] Add validation hardening, sort/limit restrictions, and integration tests for vectorSearch()#5324
Conversation
- Enforce exactly one of k, max_distance, or min_score - Validate k is in [1, 10000] range - Add 6 tests: mutual exclusivity (3 combos), k too small, k too large, k boundary values (1 and 10000) Signed-off-by: Eric Wei <[email protected]>
VectorSearchQueryBuilder now accepts options map and rejects pushDownLimit when LIMIT exceeds k. Radial modes (max_distance, min_score) have no LIMIT restriction. Signed-off-by: Eric Wei <[email protected]>
- Create VectorSearchIndexTest: 7 tests covering buildKnnQueryJson() for top-k, max_distance, min_score, nested fields, multi-element and single-element vectors, numeric option rendering - Add edge case tests to VectorSearchTableFunctionImplementationTest: NaN vector component, empty option key/value, negative k, NaN for max_distance and min_score (6 new tests) - Add VectorSearchQueryBuilderTest: min_score radial mode LIMIT, pushDownSort delegation to parent (2 new tests) - Extract buildKnnQueryJson() as package-private for direct testing Signed-off-by: Eric Wei <[email protected]>
Test too-many (5) and zero arguments paths in VectorSearchTableFunctionResolver to complement existing too-few (2) test. Signed-off-by: Eric Wei <[email protected]>
- Cap radial mode (max_distance/min_score) results at maxResultWindow to prevent unbounded result sets - Reject ORDER BY on non-_score fields and _score ASC in vectorSearch since knn results are naturally sorted by _score DESC - Add 12 integration tests: 4 _explain DSL shape verification tests and 8 validation error path tests Signed-off-by: Eric Wei <[email protected]>
- Add multi-sort expression test: ORDER BY _score DESC, name ASC correctly rejects the non-_score field (VectorSearchQueryBuilderTest) - Add case-insensitive argument name lookup test to verify TABLE='x' resolves same as table='x' (Implementation test) - Add non-numeric option fallback test: verifies string options are quoted in JSON output (VectorSearchIndexTest) - Add 4 integration tests: ORDER BY _score DESC succeeds, ORDER BY non-score rejects, ORDER BY _score ASC rejects, LIMIT within k succeeds (VectorSearchIT, now 16 tests) Signed-off-by: Eric Wei <[email protected]>
The base OpenSearchIndexScanQueryBuilder.pushDownSort() pushes sort.getCount() as a limit when non-zero. Our override validated _score DESC and returned true, but did not preserve this contract. SQL always sets count=0, so this was not reachable today, but PPL or future callers may set a non-zero count to combine sort+limit in one LogicalSort node. Preserve the behavior defensively. Add focused test: LogicalSort(count=7) with _score DESC verifies the count is pushed down as request size. Signed-off-by: Eric Wei <[email protected]>
08ab0c6 to
534b5f4
Compare
| if (hasK) { | ||
| parseIntOption(options, "k"); | ||
| int k = Integer.parseInt(options.get("k")); | ||
| if (k < 1 || k > 10000) { |
There was a problem hiding this comment.
- Unit test: compound AND predicate survives pushdown into bool.filter - Integration test: compound WHERE (term + range) produces bool query - Integration test: radial max_distance with WHERE produces bool query Signed-off-by: Eric Wei <[email protected]>
5b0373c to
0e02488
Compare
RyanL1997
left a comment
There was a problem hiding this comment.
Hi @mengweieric , thanks for the change and I just left a quesiton.
| // Preserve the parent's sort.getCount() → limit pushdown contract: SQL always sets count=0, | ||
| // but PPL or future callers may set a non-zero count to combine sort+limit in one node. | ||
| if (sort.getCount() != 0) { | ||
| requestBuilder.pushDownLimit(sort.getCount(), 0); |
There was a problem hiding this comment.
nit:
pushDownSort calls requestBuilder.pushDownLimit(sort.getCount(), 0) directly, bypassing this.pushDownLimit() (L59-67) which has the LIMIT > k validation.
I understand vectorSearch is SQL-only for P0, and in the SQL path sort.getCount() is always 0 so this is a no-op today. But the comment on L90-91 explicitly anticipates the PPL path ("PPL or future callers may set a non-zero count") — so just want to confirm: if/when PPL does use this path with a non-zero count, should the LIMIT > k guard still apply? Or is it intentionally unchecked here?
There was a problem hiding this comment.
Good catch. The sort.getCount() path was not preserving the same LIMIT <= k invariant as pushDownLimit(). I fixed that by routing both paths through the same validation helper and added tests for the sort-count case. Thanks for calling it out.
pushDownSort() called requestBuilder.pushDownLimit() directly, bypassing the LIMIT > k guard in pushDownLimit(). Extract validateLimitWithinK() helper and call it from both paths so the invariant holds when PPL or future callers set a non-zero sort count. Signed-off-by: Eric Wei <[email protected]>
|
|
||
| @Test | ||
| public void testExplainTopKProducesKnnQuery() throws IOException { | ||
| String explain = |
There was a problem hiding this comment.
Can we move the Explain test to a separate ExplainIT suite so that we can view the results in a yaml or json file directly
ref: CalciteExplainIT
There was a problem hiding this comment.
Thanks, done. The _explain cases now live in VectorSearchExplainIT, and VectorSearchIT keeps only behavior/validation tests. I kept the current SQL explain assertions for now; happy to follow up with file-backed expected outputs if we want to standardize that too.
Move all explainQuery()-based DSL shape tests into a dedicated VectorSearchExplainIT suite. VectorSearchIT now contains only validation and error-path tests. Signed-off-by: Eric Wei <[email protected]>
b7b37d1
into
opensearch-project:feature/vector-search-p0
Summary
Adds validation hardening, query restrictions, and integration tests on top of the
core vectorSearch() execution pipeline in #5320.
What this PR adds
Validation hardening
k,max_distance,min_scorekrange validation:[1, 10000]LIMIT > krejection in top-k modeORDER BY _score DESCallowed; rejects non-_scorefields and
_score ASCmax_distance/min_scorequeries atmaxResultWindowto prevent unbounded result sets
pushDownSort() contract preservation
OpenSearchIndexScanQueryBuilder.pushDownSort()contract forsort.getCount()→ limit pushdown. SQL always setscount=0, but PPL or futurecallers may combine sort+limit in one
LogicalSortnode. Focused test verifiesLogicalSort(count=7)is correctly pushed down as request size.Test coverage
VectorSearchIndexTest(9 tests): knn JSON generation, options rendering,non-numeric option quoting, nested fields
VectorSearchIT(16 integration tests): 4_explainDSL shape verification +8 validation error tests + 3 ORDER BY restriction tests + 1 LIMIT success test
case-insensitive arg lookup, sort count preservation, radial LIMIT modes,
resolver argument count edge cases, NaN/Infinity edge cases
Benchmark results (against live OpenSearch + k-NN)
Test plan
./gradlew spotlessCheck— PASS./gradlew :opensearch:test :core:test :sql:test— PASS./gradlew :integ-test:integTest -Dtests.class="*VectorSearchIT"— 16/16 PASS