Summary
When both table and query are provided to BigQuerySource, get_table_query_string() always returns table, silently ignoring query. This makes it impossible to use a custom query (e.g., for deduplication) on a PushSource batch source, since PushSource requires table for offline writes via offline_write_batch().
Expected Behavior
When both table and query are set on a BigQuerySource:
- Reads (get_table_query_string()) should use query, since it is more specific and was provided intentionally
- Writes (offline_write_batch()) should continue using .table directly as the write destination
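The read/write split described above could look roughly like the sketch below. It uses a hypothetical stand-in class, `BigQuerySourceSketch`, rather than Feast's real `BigQuerySource`, and it only assumes that the write path keeps reading `.table` directly, as this issue reports for `offline_write_batch()`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BigQuerySourceSketch:
    """Hypothetical stand-in for BigQuerySource; not Feast's actual class."""

    table: Optional[str] = None
    query: Optional[str] = None

    def get_table_query_string(self) -> str:
        # Read path: prefer the custom query when one is provided.
        if self.query:
            return f"({self.query})"
        # Otherwise fall back to the backtick-quoted table reference.
        return f"`{self.table}`"


source = BigQuerySourceSketch(
    table="project.dataset.my_table",
    query="SELECT * FROM `project.dataset.my_table` WHERE TRUE",
)

# Reads resolve to the parenthesized query.
read_target = source.get_table_query_string()
# Writes are unaffected because they use .table, not this method.
write_target = source.table
```

Sources that set only table keep their current behavior under this ordering, so the change would affect only sources that set both fields.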
Current Behavior
get_table_query_string() in bigquery_source.py always prefers table:
    def get_table_query_string(self) -> str:
        if self.table:
            return f"`{self.table}`"
        return f"({self.query})"
This means any custom query (e.g., deduplication logic) is silently ignored when table is also present.
Use Case
Streaming (push) sources often produce duplicate rows in BigQuery. The natural solution is:
    batch_source = BigQuerySource(
        name="my_batch_source",
        table="project.dataset.my_table",  # needed for push writes
        query="""
            SELECT * FROM `project.dataset.my_table`
            QUALIFY ROW_NUMBER() OVER (PARTITION BY entity_id, event_time) = 1
        """,  # needed for deduplicated reads
        timestamp_field="event_time",
    )
    push_source = PushSource(name="my_source", batch_source=batch_source)
But because get_table_query_string() ignores query when table is set, reads return duplicates. And removing table to force query usage breaks offline_write_batch(), which accesses .table directly (bigquery.py:449).
Environment
- Feast version: 0.58.0 (also confirmed unresolved on 0.61.0 / current main)
- Offline store: BigQuery