feat: Add non-entity retrieval support for ClickHouse offline store#6066
class TestNonEntityRetrieval:
You're right, removed the heavy mocking. For proper coverage, an integration test against a real ClickHouse instance would be more valuable than over-mocked unit tests — the unit tests here may not be necessary at all.
Why not have an integration test?
    # Handle non-entity retrieval mode
    if entity_df is None:
        end_date = kwargs.get("end_date", None)
        if end_date is None:
            end_date = _utc_now()
        else:
            end_date = make_tzaware(end_date)

        entity_df = pd.DataFrame({"event_timestamp": [end_date]})
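The branch above can be tried in isolation. The following is a minimal sketch using only the standard library, assuming that `_utc_now` returns the current UTC time and that `make_tzaware` treats naive datetimes as UTC. The functions `utc_now`, `make_tzaware`, and `synthetic_entity_rows` are hypothetical stand-ins written for illustration, not Feast's own helpers, and a plain list of dicts replaces the pandas DataFrame.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def utc_now() -> datetime:
    """Hypothetical stand-in for the _utc_now helper used in the diff."""
    return datetime.now(tz=timezone.utc)


def make_tzaware(ts: datetime) -> datetime:
    """Hypothetical stand-in: assume naive datetimes are UTC, keep aware ones."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts


def synthetic_entity_rows(kwargs: Dict[str, Any]) -> List[Dict[str, datetime]]:
    """Build the single synthetic row used when entity_df is None."""
    end_date: Optional[datetime] = kwargs.get("end_date")
    end_date = utc_now() if end_date is None else make_tzaware(end_date)
    # One row whose event_timestamp acts as the upper bound of the PIT join.
    return [{"event_timestamp": end_date}]


rows = synthetic_entity_rows({"end_date": datetime(2024, 1, 31)})
default_rows = synthetic_entity_rows({})
```

With an explicit `end_date` the row carries that timestamp made timezone-aware; with no kwargs it falls back to the current UTC time.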
🔴 Non-entity retrieval silently ignores start_date kwarg unlike Postgres counterpart
When entity_df is None, the Clickhouse get_historical_features only reads end_date from kwargs (clickhouse.py:59) and completely ignores start_date. The caller feature_store.py:1369-1370 passes start_date as a kwarg when the user provides it. The Postgres implementation (postgres.py:132-160), which this code is modeled after, uses start_date to compute the entity_df timestamp and to bound the TTL-based data scan window. In the Clickhouse version, a user-provided start_date is silently dropped, meaning the point-in-time join will use end_date as the sole entity timestamp regardless of the user's intent — potentially returning different (and unexpected) feature data compared to the Postgres offline store for the same inputs.
Prompt for agents
In sdk/python/feast/infra/offline_stores/contrib/clickhouse_offline_store/clickhouse.py, lines 57-65, add handling for the start_date kwarg to match the Postgres implementation at sdk/python/feast/infra/offline_stores/contrib/postgres_offline_store/postgres.py lines 132-168. Specifically:
1. Before the `if entity_df is None:` block, extract start_date from kwargs: `start_date = kwargs.get("start_date", None)`
2. Inside the block, after computing end_date, add logic to compute start_date from TTL if not provided (matching postgres.py lines 145-160):
- If start_date is None, find the max TTL across feature_views and set start_date = end_date - max_ttl (or default to 30 days)
- If start_date is provided, make it tz-aware with make_tzaware(start_date)
3. You will also need to import timedelta from datetime at the top of the file.
4. Consider whether the entity_df should use start_date or end_date as the event_timestamp (the Postgres version uses start_date via pd.date_range[:1], while the current Clickhouse version uses end_date).
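Steps 1 to 3 of the suggestion can be sketched as below. This is an illustration of the proposed logic only, not code from either offline store: `derive_start_date` and `DEFAULT_LOOKBACK` are hypothetical names, TTLs are passed in as plain `timedelta` values rather than read from feature views, and naive datetimes are assumed to be UTC.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

# Fallback window named in the suggestion (30 days); the constant name is hypothetical.
DEFAULT_LOOKBACK = timedelta(days=30)


def derive_start_date(
    end_date: datetime,
    start_date: Optional[datetime],
    ttls: Iterable[Optional[timedelta]],
) -> datetime:
    """Return the user's start_date if given, else end_date minus the max TTL."""
    if start_date is not None:
        # Assumption: naive datetimes are interpreted as UTC.
        if start_date.tzinfo is None:
            return start_date.replace(tzinfo=timezone.utc)
        return start_date
    # Ignore feature views without a TTL (None) or with a zero TTL.
    known = [t for t in ttls if t is not None and t > timedelta(0)]
    lookback = max(known) if known else DEFAULT_LOOKBACK
    return end_date - lookback


end = datetime(2024, 1, 31, tzinfo=timezone.utc)
from_ttl = derive_start_date(end, None, [timedelta(days=1), timedelta(days=7), None])
from_default = derive_start_date(end, None, [])
```

Here `from_ttl` is seven days before `end` (the largest TTL wins), and `from_default` is thirty days before `end` because no TTL is available.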
Not a bug — this is intentional. The PIT join uses MAX(entity_timestamp) as the upper bound, so the timestamp in the synthetic entity_df IS the query upper bound. Using [end_date] gives the window [end_date - TTL, end_date], which is correct. The Postgres implementation using pd.date_range(start=start_date, ...)[:1] actually has the bug — it takes start_date as the sole timestamp, making end_date unreachable. Our implementation matches Dask and is the correct behavior.
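The argument can be checked with a small model of the point-in-time window. This is a simplified sketch, assuming the join keeps feature rows whose timestamp lies in [entity_timestamp - TTL, entity_timestamp]. The function `visible_rows` and the sample dates are made up for illustration, and the real join additionally picks the latest row per entity.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def visible_rows(
    feature_timestamps: List[datetime],
    entity_timestamp: datetime,
    ttl: timedelta,
) -> List[datetime]:
    """Keep rows inside [entity_timestamp - ttl, entity_timestamp]."""
    lower = entity_timestamp - ttl
    return [ts for ts in feature_timestamps if lower <= ts <= entity_timestamp]


utc = timezone.utc
start_date = datetime(2024, 1, 1, tzinfo=utc)
end_date = datetime(2024, 1, 31, tzinfo=utc)
ttl = timedelta(days=7)

feature_timestamps = [
    datetime(2023, 12, 28, tzinfo=utc),
    datetime(2024, 1, 10, tzinfo=utc),
    datetime(2024, 1, 27, tzinfo=utc),
    datetime(2024, 1, 30, tzinfo=utc),
]

# Synthetic row at end_date: window is [Jan 24, Jan 31].
with_end = visible_rows(feature_timestamps, end_date, ttl)
# Synthetic row at start_date: window is [Dec 25, Jan 1], so nothing later is reachable.
with_start = visible_rows(feature_timestamps, start_date, ttl)
```

With the synthetic row at `end_date` the two late-January rows are visible; with it at `start_date` only the December row is, which is the "end_date unreachable" behavior described above.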
@ntkathole
Yes I noticed that
@franciscojavierarceo @ntkathole take another look!
Enable get_historical_features() to be called with entity_df=None by passing end_date kwarg instead. When entity_df is None, a synthetic single-row DataFrame is created using end_date (defaults to now). The PIT join window is controlled by end_date and TTL.

Includes integration test against a real ClickHouse container.

Fixes feast-dev#5835

Signed-off-by: yassinnouh21 <[email protected]>
…_df for non-entity retrieval The non-entity retrieval path created a synthetic entity_df using pd.date_range(start=start_date, ...)[:1], which placed start_date as the event_timestamp. Since PIT joins use MAX(entity_timestamp) as the upper bound for feature data filtering, using start_date made end_date unreachable — no features after start_date would be returned. Fix: use [end_date] directly, matching the ClickHouse implementation (PR feast-dev#6066) and the Dask offline store behavior. Signed-off-by: yassinnouh21 <[email protected]>
…rieval (#6110)

* fix(postgres): Use end_date instead of start_date in synthetic entity_df for non-entity retrieval

  The non-entity retrieval path created a synthetic entity_df using pd.date_range(start=start_date, ...)[:1], which placed start_date as the event_timestamp. Since PIT joins use MAX(entity_timestamp) as the upper bound for feature data filtering, using start_date made end_date unreachable: no features after start_date would be returned.

  Fix: use [end_date] directly, matching the ClickHouse implementation (PR #6066) and the Dask offline store behavior.

  Signed-off-by: yassinnouh21 <[email protected]>

* fix: preserve timestamp range for min_event_timestamp and fix formatting

  The entity_df fix alone would cause min_event_timestamp to be computed as end_date - TTL (instead of start_date - TTL), clipping valid data from the query window. Override entity_df_event_timestamp_range to (start_date, end_date) in non-entity mode so the full range is used.

  Also fix ruff formatting in the test file.

  Signed-off-by: yassinnouh21 <[email protected]>

* test: add integration test for non-entity retrieval

  Signed-off-by: yassinnouh21 <[email protected]>

---------

Signed-off-by: yassinnouh21 <[email protected]>
Co-authored-by: Francisco Javier Arceo <[email protected]>
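The second fix in this commit message is easier to see with numbers. The sketch below assumes the scan's lower bound is the minimum entity timestamp minus the TTL and its upper bound is the maximum entity timestamp. The function `scan_bounds` and the sample dates are hypothetical and only model the arithmetic described in the message, not the Postgres store's actual code.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def scan_bounds(
    timestamp_range: Tuple[datetime, datetime],
    ttl: timedelta,
) -> Tuple[datetime, datetime]:
    """Return (min_event_timestamp, max_event_timestamp) for the data scan."""
    range_min, range_max = timestamp_range
    return (range_min - ttl, range_max)


utc = timezone.utc
start_date = datetime(2024, 1, 1, tzinfo=utc)
end_date = datetime(2024, 1, 31, tzinfo=utc)
ttl = timedelta(days=7)

# Range inferred from the single synthetic row: both ends equal end_date,
# so the scan starts at end_date - TTL and clips everything earlier.
inferred = scan_bounds((end_date, end_date), ttl)

# Range overridden to (start_date, end_date): the scan starts at start_date - TTL.
overridden = scan_bounds((start_date, end_date), ttl)
```

With the inferred range the scan starts on 2024-01-24; with the override it starts on 2023-12-25, which keeps the user's full `start_date` to `end_date` window in the query.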
# [0.61.0](v0.60.0...v0.61.0) (2026-04-07) ### Bug Fixes * Add grpcio dependency group to transformation server Dockerfile ([2c2150a](2c2150a)) * Add https readiness check for rest-registry tests ([ea85e63](ea85e63)) * Add website build check for PRs and fix blog frontmatter YAML error ([#6079](#6079)) ([30a3a43](30a3a43)) * Added missing jackc/pgx/v5 entries ([94ad0e7](94ad0e7)) * Added MLflow metric charts across feature selection ([#6080](#6080)) ([a403361](a403361)) * Check duplicate names for feature view across types ([#5999](#5999)) ([95b9af8](95b9af8)) * Fix integration tests ([#6046](#6046)) ([02d5548](02d5548)) * Fix missing error handling for resource_counts endpoint ([d9706ce](d9706ce)) * Fix non-specific label selector on metrics service ([a1a160d](a1a160d)) * fix path feature_definitions.py ([7d7df68](7d7df68)) * Fix regstry Rest API tests intermittent failure ([d53a339](d53a339)) * Fixed IntegrityError on SqlRegistry ([#6047](#6047)) ([325e148](325e148)) * Fixed intermittent failures in get_historical_features ([c335ec7](c335ec7)) * Fixed pre-commit check ([114b7db](114b7db)) * Fixed the intermittent FeatureViewNotFoundException ([661ecc7](661ecc7)) * Fixed uv cache permission error for docker build on mac ([ad807be](ad807be)) * Fixes a `PydanticDeprecatedSince20` warning for trino_offline_store ([#5991](#5991)) ([abfd18a](abfd18a)) * Handle existing RBAC role gracefully in namespace registry ([b46a62b](b46a62b)) * Ignore ipynb files during apply ([#6151](#6151)) ([4ea123d](4ea123d)) * Integration test failures ([#6040](#6040)) ([9165870](9165870)) * Mount TLS volumes for init container ([080a9b5](080a9b5)) * **postgres:** Use end_date in synthetic entity_df for non-entity retrieval ([#6110](#6110)) ([088a802](088a802)), closes [#6066](#6066) * Ray offline store tests are duplicated across 3 workflows ([54f705a](54f705a)) * Reenable tests ([#6036](#6036)) ([82ee7f8](82ee7f8)) * SSL/TLS mode by default for postgres connection ([4844488](4844488)) * 
Use commitlint pre-commit hook instead of a separate action ([35a81e7](35a81e7)) ### Features * Add Claude Code agent skills for Feast ([#6081](#6081)) ([1e5b60f](1e5b60f)), closes [#5976](#5976) [#6007](#6007) * Add complex type support (Map, JSON, Struct) with schema validation ([#5974](#5974)) ([1200dbf](1200dbf)) * Add decimal to supported feature types ([#6029](#6029)) ([#6226](#6226)) ([cff6fbf](cff6fbf)) * Add feast apply init container to automate registry population on pod start ([#6106](#6106)) ([6b31a43](6b31a43)) * Add feature view versioning support to PostgreSQL and MySQL online stores ([#6193](#6193)) ([940e0f0](940e0f0)), closes [#6168](#6168) [#6169](#6169) [#2728](#2728) * Add materialization, feature freshness, request latency, and push metrics to feature server ([2c6be18](2c6be18)) * Add metadata statistics to registry api ([ef1d4fc](ef1d4fc)) * Add non-entity retrieval support for ClickHouse offline store ([4d08ddc](4d08ddc)), closes [#5835](#5835) * Add OnlineStore for MongoDB ([#6025](#6025)) ([bf4e3fa](bf4e3fa)), closes [golang/go#74462](golang/go#74462) * Add Oracle DB as Offline store in python sdk & operator ([#6017](#6017)) ([9d35368](9d35368)) * Add RBAC aggregation labels to FeatureStore ClusterRoles ([daf77c6](daf77c6)) * Add ServiceMonitor auto-generation for Prometheus discovery ([#6126](#6126)) ([56e6d21](56e6d21)) * Add typed_features field to grpc write request (([#6117](#6117)) ([#6118](#6118)) ([eeaa6db](eeaa6db)), closes [#6116](#6116) * Add UUID and TIME_UUID as feature types ([#5885](#5885)) ([#5951](#5951)) ([5d6e311](5d6e311)) * Add version indicators to lineage graph nodes ([#6187](#6187)) ([73805d3](73805d3)) * Add version tracking to FeatureView ([#6101](#6101)) ([ed4a4f2](ed4a4f2)) * Added Agent skills for AI Agents ([#6007](#6007)) ([99008c8](99008c8)) * Added CodeQL SAST scanning and detect-secrets pre-commit hook ([547b516](547b516)) * Added odfv transformations metrics ([8b5a526](8b5a526)) * Adding optional name to 
Aggregation (feast-dev[#5994](#5994)) ([#6083](#6083)) ([56469f7](56469f7)) * Created DocEmbedder class ([#5973](#5973)) ([0719c06](0719c06)) * Extended OIDC support to extract groups & namespaces and token injection with multiple methods ([#6089](#6089)) ([7c04026](7c04026)) * Feature Server High-Availability on Kubernetes ([#6028](#6028)) ([9c07b4c](9c07b4c)), closes [Hi#Availability](https://github.com/Hi/issues/Availability) [Hi#Availability](https://github.com/Hi/issues/Availability) * **go:** Implement metrics and tracing for http and grpc servers ([#5925](#5925)) ([2b4ec9a](2b4ec9a)) * Horizontal scaling support to the Feast operator ([#6000](#6000)) ([3ec13e6](3ec13e6)) * Making feature view source optional (feast-dev[#6074](#6074)) ([#6075](#6075)) ([76917b7](76917b7)) * Replace ORJSONResponse with Pydantic response models for faster JSON serialization ([65cf03c](65cf03c)) * Support arm docker build ([#6061](#6061)) ([1e1f5d9](1e1f5d9)) * Support distinct count aggregation [[#6116](#6116)] ([3639570](3639570)) * Support HTTP in MCP ([#6109](#6109)) ([e72b983](e72b983)) * Support nested collection types (Array/Set of Array/Set) ([#5947](#5947)) ([#6132](#6132)) ([ab61642](ab61642)) * Support podAnnotations on Deployment pod template ([1b3cdc1](1b3cdc1)) * Use orjson for faster JSON serialization in feature server ([6f5203a](6f5203a)) * Utilize date partition column in BigQuery ([#6076](#6076)) ([4ea9b32](4ea9b32)) ### Performance Improvements * Online feature response construction in a single pass over read rows ([113fb04](113fb04)) * Optimize protobuf parsing in Redis online store ([#6023](#6023)) ([59dfdb8](59dfdb8)) * Optimize timestamp conversion in _convert_rows_to_protobuf ([33a2e95](33a2e95)) * Parallelize DynamoDB batch reads in sync online_read ([#6024](#6024)) ([9699944](9699944)) * Remove redundant entity key serialization in online_read ([d87283f](d87283f))
# [0.62.0](v0.61.0...v0.62.0) (2026-04-08) ### Bug Fixes * Added missing jackc/pgx/v5 entries ([94ad0e7](94ad0e7)) * Fix missing error handling for resource_counts endpoint ([d9706ce](d9706ce)) * fix path feature_definitions.py ([7d7df68](7d7df68)) * Fix regstry Rest API tests intermittent failure ([d53a339](d53a339)) * Fixed intermittent failures in get_historical_features ([c335ec7](c335ec7)) * Fixed the intermittent FeatureViewNotFoundException ([661ecc7](661ecc7)) * Handle existing RBAC role gracefully in namespace registry ([b46a62b](b46a62b)) * Ignore ipynb files during apply ([#6151](#6151)) ([4ea123d](4ea123d)) * Mount TLS volumes for init container ([080a9b5](080a9b5)) * **postgres:** Use end_date in synthetic entity_df for non-entity retrieval ([#6110](#6110)) ([088a802](088a802)), closes [#6066](#6066) * SSL/TLS mode by default for postgres connection ([4844488](4844488)) * Sync v0.61-branch so v0.61.0 tag is reachable from master ([af66878](af66878)) ### Features * Add Claude Code agent skills for Feast ([#6081](#6081)) ([1e5b60f](1e5b60f)), closes [#5976](#5976) [#6007](#6007) * Add decimal to supported feature types ([#6029](#6029)) ([#6226](#6226)) ([cff6fbf](cff6fbf)) * Add feast apply init container to automate registry population on pod start ([#6106](#6106)) ([6b31a43](6b31a43)) * Add feature view versioning support to PostgreSQL and MySQL online stores ([#6193](#6193)) ([940e0f0](940e0f0)), closes [#6168](#6168) [#6169](#6169) [#2728](#2728) * Add metadata statistics to registry api ([ef1d4fc](ef1d4fc)) * Add Oracle DB as Offline store in python sdk & operator ([#6017](#6017)) ([9d35368](9d35368)) * Add RBAC aggregation labels to FeatureStore ClusterRoles ([daf77c6](daf77c6)) * Add ServiceMonitor auto-generation for Prometheus discovery ([#6126](#6126)) ([56e6d21](56e6d21)) * Add typed_features field to grpc write request (([#6117](#6117)) ([#6118](#6118)) ([eeaa6db](eeaa6db)), closes [#6116](#6116) * Add UUID and TIME_UUID as feature types 
([#5885](#5885)) ([#5951](#5951)) ([5d6e311](5d6e311)) * Add version indicators to lineage graph nodes ([#6187](#6187)) ([73805d3](73805d3)) * Add version tracking to FeatureView ([#6101](#6101)) ([ed4a4f2](ed4a4f2)) * Added Agent skills for AI Agents ([#6007](#6007)) ([99008c8](99008c8)) * Added odfv transformations metrics ([8b5a526](8b5a526)) * Created DocEmbedder class ([#5973](#5973)) ([0719c06](0719c06)) * Extended OIDC support to extract groups & namespaces and token injection with multiple methods ([#6089](#6089)) ([7c04026](7c04026)) * Replace ORJSONResponse with Pydantic response models for faster JSON serialization ([65cf03c](65cf03c)) * Support distinct count aggregation [[#6116](#6116)] ([3639570](3639570)) * Support HTTP in MCP ([#6109](#6109)) ([e72b983](e72b983)) * Support nested collection types (Array/Set of Array/Set) ([#5947](#5947)) ([#6132](#6132)) ([ab61642](ab61642)) * Support podAnnotations on Deployment pod template ([1b3cdc1](1b3cdc1)) * Utilize date partition column in BigQuery ([#6076](#6076)) ([4ea9b32](4ea9b32)) ### Performance Improvements * Online feature response construction in a single pass over read rows ([113fb04](113fb04))
What this PR does / why we need it:
Adds support for non-entity historical retrieval (`entity_df=None`) in the ClickHouse offline store, bringing it to parity with the PostgreSQL offline store.

Changes:
* Updated `ClickhouseOfflineStore.get_historical_features()` to accept `entity_df=None` with optional `start_date`/`end_date` kwargs
* When `entity_df` is `None`, a synthetic single-row DataFrame is created using the provided date range (or sensible defaults: `end_date=now`, `start_date` derived from max TTL or 30 days)

Usage:

Which issue(s) this PR fixes:
Fixes #5835

Test plan
* `pytest tests/unit/infra/offline_stores/test_clickhouse.py`