refactor: New read node that defers ibis table instantiation#709
TrevorBergeron merged 12 commits into main
Conversation
tswast left a comment

Nowhere near finished reviewing, but sending some early comments so it doesn't get stuck for too long.
bigframes/core/__init__.py
    session: Session,
    *,
    predicate: Optional[str] = None,
    snapshot_time: Optional[datetime.datetime] = None,
Let's make sure we reconcile this with the changes from #712
bigframes/core/compile/compiler.py
    # These parameters should not be used
    index_cols=(),
Is this because we're in ArrayValue, which doesn't have a concept of "index"? Let's clarify in the comment.
Alternatively, any chance you could make these parameters optional in to_query() and omit them?

Yeah, index is really only a concept at higher layers; managing it is pulled out to the caller.
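A sketch of what optional index handling in to_query() might look like (the signature, parameter defaults, and SQL shape here are assumptions for illustration, not the actual API):

```python
from typing import Optional, Sequence


def to_query(
    table_name: str,
    columns: Sequence[str],
    index_cols: Sequence[str] = (),  # optional: empty when the caller manages the index
    sql_predicate: Optional[str] = None,
) -> str:
    """Build a SELECT over the table; index handling stays with the caller."""
    select_cols = [*index_cols, *columns]
    sql = f"SELECT {', '.join(select_cols)} FROM `{table_name}`"
    if sql_predicate:
        sql += f" WHERE {sql_predicate}"
    return sql
```

With a default of `()`, callers that don't care about an index can simply omit the parameter.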
bigframes/core/compile/compiler.py
    ibis_table = ibis.table(physical_schema, full_table_name)

    if ordered:
        if node.primary_key:
Seems a bit odd to me to put ordering generation here, but I guess this is just for total ordering, right? We still generate separate order by when we add the index, right?
Yes, read table nodes should be able to establish their own total ordering either with provided uniqueness metadata (primary_key field) or by generating a hash-based key. Just like before, we do a .sort_index() on top of the read operation if the user provided index columns
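A hedged sketch of that ordering choice (the helper name and the FARM_FINGERPRINT fallback are illustrative, not the actual compiler code; hashing the whole row is a common BigQuery idiom for a synthetic ordering key):

```python
from typing import List, Sequence


def total_ordering_keys(
    primary_key: Sequence[str], all_columns: Sequence[str]
) -> List[str]:
    """Choose ORDER BY expressions that give a total ordering over rows.

    A primary key is unique by definition, so ascending order over its
    columns is already total. Without one, hash every column into a
    synthetic key (fully duplicate rows would still tie, so a real
    implementation would also need a row-number style tiebreaker).
    """
    if primary_key:
        return [f"{col} ASC" for col in primary_key]
    struct_args = ", ".join(all_columns)
    # FARM_FINGERPRINT over the JSON encoding of the row hashes all columns.
    return [f"FARM_FINGERPRINT(TO_JSON_STRING(STRUCT({struct_args}))) ASC"]
```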
bigframes/core/compile/compiler.py
    ordering_value_columns = tuple(
        bf_ordering.ascending_over(col) for col in node.primary_key
    )
    if node.primary_key_sequential:
Where do we have primary keys that we know are sequential integers?
Caching, which doesn't use this new node yet. Also, uploading local data could provide this.
bigframes/core/nodes.py
    columns: schemata.ArraySchema = field()

    table_session: bigframes.session.Session = field()
    # Should this even be stored here?
Would "native ordering column" or something be more appropriate? Such a name might allow us to use a row ID pseudocolumn as a fallback if one becomes available.

Renamed to total_order_cols.
bigframes/core/nodes.py
    primary_key: Tuple[str, ...] = field()  # subset of schema
    # indicates a primary key that is exactly offsets 0, 1, 2, ..., N-2, N-1
    primary_key_sequential: bool = False
    snapshot_time: typing.Optional[datetime.datetime] = None
Technically "time travel", which is different from a snapshot in BQ. https://cloud.google.com/bigquery/docs/access-historical-data
Although looking at that, even the backend messages conflate the two.

Renamed symbols to not say "snapshot".
    # Added for backwards compatibility, not validated
    sql_predicate: typing.Optional[str] = None
Fascinating. This implies some level of SQL compilation outside of this node. Should this be a structured "filters" object, instead?

The original filters type is a bit too flexible, allowing potentially non-hashable tuples. I could convert the whole thing to tuples, I guess. Would there be a benefit to that approach?

Hmm... Forcing compilation to a string doesn't seem like the right choice to me. Some namedtuple or frozen dataclass would make the most sense to me.

If eq and frozen are both true, by default @dataclass will generate a __hash__() method for you.
https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass
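A minimal sketch of the frozen-dataclass approach being discussed (Filter and ReadTable are hypothetical names, not bigframes classes):

```python
import dataclasses
from typing import Tuple


# Hypothetical structured filter, replacing a stringly-typed sql_predicate.
@dataclasses.dataclass(frozen=True)
class Filter:
    column: str
    op: str        # e.g. "==", ">=", "in"
    values: Tuple  # a tuple (not a list) keeps the node hashable


@dataclasses.dataclass(frozen=True)
class ReadTable:
    table: str
    filters: Tuple[Filter, ...] = ()


# frozen=True with the default eq=True auto-generates __hash__, so
# structurally equal nodes hash equally and can be used as dict/set keys.
a = ReadTable("proj.ds.tbl", (Filter("x", ">=", (0,)),))
b = ReadTable("proj.ds.tbl", (Filter("x", ">=", (0,)),))
assert a == b and hash(a) == hash(b)
```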
bigframes/core/nodes.py
    def __post_init__(self):
        # enforce invariants
        physical_names = set(map(lambda i: i.name, self.physical_schema))
        assert len(self.columns.names) > 0
Why this assertion? It is possible to create a completely empty table in BQ. Why one would want to do so, I'm not certain, but it is possible.

Yeah, I guess we should allow empty tables; removed this constraint.
bigframes/core/nodes.py
        # enforce invariants
        physical_names = set(map(lambda i: i.name, self.physical_schema))
        assert len(self.columns.names) > 0
        assert set(self.primary_key).issubset(physical_names)
This assertion might be false in future if "primary key" contains pseudo columns.

Removed this constraint, though we might need a bit more work to support pseudo columns anyway.
bigframes/core/nodes.py
        physical_names = set(map(lambda i: i.name, self.physical_schema))
        assert len(self.columns.names) > 0
        assert set(self.primary_key).issubset(physical_names)
        assert set(self.columns.names).issubset(physical_names)
If we ever reach this line of code, it would likely be an indication that we should have a ValueError further up the call stack. It would at least be helpful to have a custom error message here, as in other assertions, so the user knows to file a bug that we missed a validation check somewhere.

Added some error messages.
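A hedged sketch of such an assertion message (the fields on this ReadTableNode are illustrative, not the real node's):

```python
import dataclasses
from typing import Tuple


@dataclasses.dataclass(frozen=True)
class ReadTableNode:
    physical_names: Tuple[str, ...]  # columns present in the table schema
    column_names: Tuple[str, ...]    # columns this node selects

    def __post_init__(self):
        # A message makes the internal invariant failure actionable:
        # it tells the user this is our bug, not a usage error.
        missing = set(self.column_names) - set(self.physical_names)
        assert not missing, (
            f"Requested columns {sorted(missing)} not present in table schema. "
            "This is likely a bug; a validation check was missed upstream."
        )
```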
    if isinstance(ibis_dtype, ibis_dtypes.Integer):
        return pd.Int64Dtype()

    # Temporary: Will eventually support an explicit json type instead of casting to string.
We should probably raise a warning (PreviewWarning?) in this case to make sure folks know that depending on any JSON functionality may break in future.

Added a preview warning.
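A minimal sketch of what such a warning could look like (the PreviewWarning class and helper name here are illustrative; bigframes' actual names may differ):

```python
import warnings


class PreviewWarning(Warning):
    """Behavior is in preview and may change without notice."""


def json_to_bigframes_dtype() -> str:
    # Sketch: until an explicit JSON dtype exists, JSON columns become STRING,
    # and we warn so users know this mapping may change.
    warnings.warn(
        "JSON columns are currently cast to STRING; code relying on this "
        "behavior may break once an explicit JSON dtype is supported.",
        PreviewWarning,
    )
    return "string"
```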
    )

    try:
        print(sql)
bigframes/session/__init__.py
        # have executed a query with a LIMIT clause.
        max_results=None,
    )
    bf_read_gbq_table.validate_sql_through_ibis(sql, self.ibis_client)
We'll need Henry's logic to dry run with and without the time_travel_timestamp so we can continue to support tables that don't support time travel.

Merged in his logic.
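A hedged sketch of that fallback approach (function and parameter names are assumptions, not the actual implementation; `dry_run(sql)` stands in for a BigQuery dry-run job that raises on an invalid query):

```python
from typing import Callable, Optional


def compile_read_sql(
    table: str,
    time_travel_timestamp: Optional[str],
    dry_run: Callable[[str], None],
) -> str:
    """Prefer a time-travel read, falling back for unsupported tables.

    Some objects (e.g. views, external tables) reject FOR SYSTEM_TIME AS OF,
    so if the dry run of the time-travel query fails, retry without it.
    """
    base_sql = f"SELECT * FROM `{table}`"
    if time_travel_timestamp is not None:
        tt_sql = (
            f"{base_sql} FOR SYSTEM_TIME AS OF TIMESTAMP '{time_travel_timestamp}'"
        )
        try:
            dry_run(tt_sql)
            return tt_sql
        except Exception:
            pass  # table doesn't support time travel; use the plain read
    dry_run(base_sql)
    return base_sql
```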