Skip to content

perf(postgres): Optimizing feast offline Store for date-range multi-FV retrieval#6057

Open
Vperiodt wants to merge 15 commits intofeast-dev:masterfrom
Vperiodt:patch-query
Open

perf(postgres): Optimizing feast offline Store for date-range multi-FV retrieval#6057
Vperiodt wants to merge 15 commits intofeast-dev:masterfrom
Vperiodt:patch-query

Conversation

@Vperiodt
Copy link

@Vperiodt Vperiodt commented Mar 4, 2026

What this PR does / why we need it:

  1. Replaces the date-range multi-FV path in the Postgres offline store (previously base_entities + one LEFT JOIN LATERAL per feature view) with a set-based LOCF (Last Observation Carried Forward) implementation

  2. Uses a single timeline: stack (UNION ALL of spine + feature rows), COUNT for group boundaries, FIRST_VALUE for forward-fill, then filter to spine and apply per-FV TTL.

Which issue(s) this PR fixes:

Fixes slow get_historical_features on the Postgres offline store for date-range retrieval with multiple feature views. The LATERAL/inequality joins had O(N×M) cost. This PR switches to a LOCF path: one stacked timeline of size L, sort O(L log L), and O(L) window passes, which makes total cost as O(L log L).

Misc

Scenario: get_historical_features(features=[...], start_date=2023-01-01, end_date=2023-01-07) with no entity DataFrame and two feature views:

  • driver_fv: entity driver_id, TTL 1 day, feature score
  • customer_fv: entity customer_id, TTL 0, feature amount

Inputs

Parameter Value
start_date 2023-01-01
end_date 2023-01-07
entity_df None (non-entity mode)
FVs driver_fv (driver_id, score, TTL 1d), customer_fv (customer_id, amount, TTL 0)

1. Feature data window (__data_raw)

Feature data is explicitly pulled from lookback to end_date so LOCF can carry the last observation before start_date.

Time window used for feature data

FV TTL Time window
driver_fv 1 day [lookback_start_date, end_date]
customer_fv 0 [start_date, end_date]

With lookback_start_date = start_date - max_ttl (e.g. 2022-12-31 for 1-day TTL), driver_fv rows before 2023-01-01 are now included so LOCF can fill correctly.

-- Per-FV __data_raw: feature rows from lookback_start_date..end_date
"driver_fv__data_raw" AS (
  SELECT "ts" AS event_timestamp, "driver_id", "score" AS "score"
  FROM "driver_table" AS sub
  WHERE "ts" BETWEEN '2023-01-01 00:00:00+00' - interval '1 day' AND '2023-01-07 00:00:00+00'
),
"customer_fv__data_raw" AS (
  SELECT "ts" AS event_timestamp, "customer_id", "amount" AS "amount"
  FROM "customer_table" AS sub
  WHERE "ts" BETWEEN '2023-01-01 00:00:00+00' AND '2023-01-07 00:00:00+00'
),

2. Spine: unified (entity, timestamp) grid

Spine is built from each FV’s __data / __data_raw over [start_date, end_date]. When FVs have different entity sets, the template emits NULL AS "entity" for the FV that doesn’t have that entity:

spine AS (
  SELECT DISTINCT d.event_timestamp, d."driver_id", NULL AS "customer_id"
  FROM "driver_fv__data_raw" d
  WHERE d.event_timestamp BETWEEN '2023-01-01 00:00:00+00' AND '2023-01-07 00:00:00+00'
  UNION
  SELECT DISTINCT d.event_timestamp, NULL AS "driver_id", d."customer_id"
  FROM "customer_fv__data_raw" d
  WHERE d.event_timestamp BETWEEN '2023-01-01 00:00:00+00' AND '2023-01-07 00:00:00+00'
),

Example spine shape (

Spine = distinct (event_timestamp, driver_id, customer_id) from both FVs; missing entity is NULL.

event_timestamp driver_id customer_id
2023-01-01 00:00:00+00 101 NULL
2023-01-01 00:00:00+00 NULL 201
2023-01-02 00:00:00+00 101 NULL

3. TTL in the final SELECT

-- driver_fv has TTL 86400 (1 day)
CASE WHEN (spine.event_timestamp - "driver_fv__f"."driver_fv__filled_ts") <= make_interval(secs => 86400)
     THEN "driver_fv__f"."score" ELSE NULL END AS "score",
-- customer_fv has TTL 0: no CASE
"customer_fv__f"."amount" AS "amount"

Example final output shape

After LOCF and TTL: one row per (event_timestamp, driver_id, customer_id); score is NULL when outside TTL.

event_timestamp driver_id customer_id score amount
2023-01-01 00:00:00+00 101 NULL 0.8 NULL
2023-01-01 00:00:00+00 NULL 201 NULL 99.5
2023-01-02 00:00:00+00 101 NULL 0.9 NULL
Open with Devin

Signed-off-by: Vanshika Vanshika <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@Vperiodt Vperiodt marked this pull request as ready for review March 4, 2026 07:06
@Vperiodt Vperiodt requested a review from a team as a code owner March 4, 2026 07:06
devin-ai-integration[bot]

This comment was marked as resolved.

@tokoko
Copy link
Collaborator

tokoko commented Mar 4, 2026

@Vperiodt thanks for the PR. if it's not too much trouble, can you give a small example of result queries for say.. 2 fv scenario and highlight differences this PR introduces? there jinja templates are hard enough to read through even when one knows what the intent is 😄

@Vperiodt
Copy link
Author

Vperiodt commented Mar 4, 2026

@Vperiodt thanks for the PR. if it's not too much trouble, can you give a small example of result queries for say.. 2 fv scenario and highlight differences this PR introduces? there jinja templates are hard enough to read through even when one knows what the intent is 😄

sure ! will add it in the description itself

Vperiodt and others added 3 commits March 4, 2026 16:38
Signed-off-by: Vanshika Vanshika <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
devin-ai-integration[bot]

This comment was marked as resolved.

@Vperiodt Vperiodt marked this pull request as draft March 5, 2026 17:18
Vperiodt and others added 3 commits March 9, 2026 10:43
Signed-off-by: Vanshika Vanshika <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@Vperiodt Vperiodt marked this pull request as ready for review March 9, 2026 21:20
devin-ai-integration[bot]

This comment was marked as resolved.

Signed-off-by: Vanshika Vanshika <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Copy link
Collaborator

@YassinNouh21 YassinNouh21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The LOCF approach is a solid improvement over the LATERAL join path — O(L log L) window passes vs O(N×M) inequality joins is a real win for large date-range queries. The per-FV independent pipeline is clean and avoids cross-FV interference.

A few issues to address before merge — one pre-existing bug that this PR should fix since it's modifying the non-entity path, dead code in the template, and some test coverage gaps.

else start_date
)

entity_df = pd.DataFrame(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug (pre-existing, should fix in this PR): This still uses pd.date_range(start=start_date, ...)[:1] which takes start_date as the entity timestamp, not end_date. This was identified in the PR #6066 review — the ClickHouse implementation correctly uses [end_date].

Since this PR is already modifying the non-entity retrieval path, it's a good opportunity to fix this:

entity_df = pd.DataFrame({"event_timestamp": [end_date]})

In the current code, the entity_df isn't used by the LOCF query path (the spine replaces it), but it's still uploaded to a temp table via _upload_entity_df — so at minimum it wastes a round-trip to Postgres. Even better: skip creating/uploading the entity_df entirely when start_date and end_date are set, since the LOCF path never references left_table_query_string.

# Compute lookback_start_date for LOCF: pull feature data
# from (start_date - max_ttl) so window functions can
# forward-fill the last observation before start_date.
max_ttl_seconds = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate computation: When start_date is None, max_ttl_seconds is already computed in the block above (lines 147-151) to derive start_date. This block recomputes the exact same value. Consider hoisting max_ttl_seconds to a single computation before both uses:

max_ttl_seconds = max(
    (int(fv.ttl.total_seconds()) for fv in feature_views
     if fv.ttl and isinstance(fv.ttl, timedelta)),
    default=0,
)

{% endfor %}

{# Build list of output feature names per FV for consistent column ordering #}
{% set all_feature_cols = [] %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead code: all_feature_cols is computed here but never referenced in the rest of the template. The final SELECT iterates over featureview.features directly and computes col_name inline. This entire block (lines 569-579) can be removed.

"{{ featureview.name }}__grouped" AS (
SELECT *,
COUNT(feature_anchor) OVER (
PARTITION BY {% if featureview.entities %}{% for entity in featureview.entities %}"{{ entity }}"{% if not loop.last %}, {% endif %}{% endfor %}{% else %}1{% endif %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entityless FV edge case: PARTITION BY 1 works in Postgres but is non-standard and some query planners may not optimize it well. An alternative is PARTITION BY (SELECT NULL) or simply omitting the PARTITION clause (which gives one partition over the whole result set).

Also worth adding a test case for entityless feature views to verify the PARTITION BY 1 and {% else %}1{% endif %} paths generate valid SQL.

LEFT JOIN "{{ featureview.name }}__filled" AS "{{ featureview.name }}__f"
ON spine.event_timestamp = "{{ featureview.name }}__f".event_timestamp
{% for entity in all_entities %}
AND spine."{{ entity }}" IS NOT DISTINCT FROM "{{ featureview.name }}__f"."{{ entity }}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good use of IS NOT DISTINCT FROM here. This fixes a bug in the old LATERAL join path where NULL = NULL would fail the join for FVs with different entity sets. Worth calling this out in the PR description as a bug fix.

{% set col_name = featureview.field_mapping.get(feature, feature) %}
{% endif %}
{% if featureview.ttl != 0 %}
CASE WHEN (spine.event_timestamp - "{{ featureview.name }}__f"."{{ featureview.name }}__filled_ts") <= {{ featureview.ttl }} * interval '1' second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TTL enforcement via CASE is correct — when filled_ts is NULL (no prior observation), the subtraction yields NULL, the comparison is UNKNOWN, and CASE falls through to ELSE NULL.

One edge case: if featureview.ttl is a very large number, {{ featureview.ttl }} * interval '1' second could overflow on some Postgres versions. Consider using make_interval(secs => {{ featureview.ttl }}) for safety, though in practice TTLs are unlikely to overflow.

WHERE "{{ featureview.timestamp_field }}" BETWEEN '{{ lookback_start_date or start_date }}' AND '{{ end_date }}'
),
{% if featureview.created_timestamp_column %}
"{{ featureview.name }}__data" AS (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: When there's no created_timestamp_column, this CTE is a no-op passthrough (SELECT * FROM __data_raw). Postgres will likely optimize it away, but you could simplify by using __data_raw as the alias directly:

{% if featureview.created_timestamp_column %}
"{{ featureview.name }}__data" AS (
    SELECT * FROM (
        ... dedup logic ...
    ) __dedup WHERE __rn = 1
),
{% endif %}

Then reference {{ featureview.name }}__data{% if not featureview.created_timestamp_column %}_raw{% endif %} downstream. Though I understand the current approach is simpler to maintain.

def test_lateral_join_ttl_constraints(self):
"""Test that LATERAL JOINs include proper TTL constraints"""
from jinja2 import BaseLoader, Environment
def test_locf_template_multi_fv_date_range(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good tests for the happy path and TTL=0 case. Consider adding coverage for:

  1. Different entity sets across FVs — e.g., fv1 has driver_id, fv2 has customer_id. This exercises the NULL AS "{{ entity }}" and IS NOT DISTINCT FROM paths which are the trickiest part of the LOCF approach.
  2. Entityless feature view — verifies the PARTITION BY 1 fallback generates valid SQL.
  3. created_timestamp_column set — verifies the dedup CTE (__data_raw__data via ROW_NUMBER) is generated correctly.
  4. full_feature_names=True — verifies the featureview.name ~ '__' ~ feature column naming.

These are the main code paths that differ from the happy path and are most likely to produce malformed SQL.

Vperiodt and others added 7 commits March 16, 2026 11:52
Signed-off-by: Vanshika Vanshika <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Signed-off-by: Vanshika Vanshika <[email protected]>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants