refactor: New read node that defers ibis table instantiation#709
TrevorBergeron merged 12 commits into main
Conversation
tswast left a comment

Nowhere near finished reviewing, but sending some early comments so it doesn't get stuck for too long.
bigframes/core/__init__.py
    session: Session,
    *,
    predicate: Optional[str] = None,
    snapshot_time: Optional[datetime.datetime] = None,
Let's make sure we reconcile this with the changes from #712
bigframes/core/compile/compiler.py
    # These parameters should not be used
    index_cols=(),
Is this because we're in ArrayValue, which doesn't have a concept of "index"? Let's clarify in the comment.
Alternatively, any chance you could make these parameters optional in to_query() and omit them?

Yeah, index is really only a concept at higher layers; managing it is pulled out to the caller.
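A sketch of what optional index handling in to_query() might look like (the signature, parameter defaults, and SQL shape here are assumptions for illustration, not the actual API):

```python
from typing import Optional, Sequence


def to_query(
    table_name: str,
    columns: Sequence[str],
    index_cols: Sequence[str] = (),  # optional: empty when the caller manages the index
    sql_predicate: Optional[str] = None,
) -> str:
    """Build a SELECT over the table; index handling stays with the caller."""
    select_cols = [*index_cols, *columns]
    sql = f"SELECT {', '.join(select_cols)} FROM `{table_name}`"
    if sql_predicate:
        sql += f" WHERE {sql_predicate}"
    return sql
```

With a default of `()`, callers that don't care about an index can simply omit the parameter.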
bigframes/core/compile/compiler.py
    ibis_table = ibis.table(physical_schema, full_table_name)

    if ordered:
        if node.primary_key:
Seems a bit odd to me to put ordering generation here, but I guess this is just for total ordering, right? We still generate separate order by when we add the index, right?
Yes, read table nodes should be able to establish their own total ordering either with provided uniqueness metadata (primary_key field) or by generating a hash-based key. Just like before, we do a .sort_index() on top of the read operation if the user provided index columns
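A hedged sketch of that ordering choice (the helper name and the FARM_FINGERPRINT fallback are illustrative, not the actual compiler code; hashing the whole row is a common BigQuery idiom for a synthetic ordering key):

```python
from typing import List, Sequence


def total_ordering_keys(
    primary_key: Sequence[str], all_columns: Sequence[str]
) -> List[str]:
    """Choose ORDER BY expressions that give a total ordering over rows.

    A primary key is unique by definition, so ascending order over its
    columns is already total. Without one, hash every column into a
    synthetic key (fully duplicate rows would still tie, so a real
    implementation would also need a row-number style tiebreaker).
    """
    if primary_key:
        return [f"{col} ASC" for col in primary_key]
    struct_args = ", ".join(all_columns)
    # FARM_FINGERPRINT over the JSON encoding of the row hashes all columns.
    return [f"FARM_FINGERPRINT(TO_JSON_STRING(STRUCT({struct_args}))) ASC"]
```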
bigframes/core/compile/compiler.py
    ordering_value_columns = tuple(
        bf_ordering.ascending_over(col) for col in node.primary_key
    )
    if node.primary_key_sequential:
Where do we have primary keys that we know are sequential integers?
Caching, which doesn't use this new node yet. Also, uploading local data could provide this.
bigframes/core/nodes.py
    columns: schemata.ArraySchema = field()

    table_session: bigframes.session.Session = field()
    # Should this even be stored here?
Would "native ordering column" or something be more appropriate? Such a name might allow us to use a row ID pseudocolumn as a fallback if one becomes available.

Renamed to total_order_cols.
bigframes/core/nodes.py
    primary_key: Tuple[str, ...] = field()  # subset of schema
    # indicates a primary key that is exactly offsets 0, 1, 2, ..., N-2, N-1
    primary_key_sequential: bool = False
    snapshot_time: typing.Optional[datetime.datetime] = None
Technically "time travel", which is different from a snapshot in BQ. https://cloud.google.com/bigquery/docs/access-historical-data
Although looking at that, even the backend messages conflate the two.

Renamed symbols to not say "snapshot".
    # Added for backwards compatibility, not validated
    sql_predicate: typing.Optional[str] = None
Fascinating. This implies some level of SQL compilation outside of this node. Should this be a structured "filters" object, instead?

The original filters type is a bit too flexible, allowing potentially non-hashable tuples. I could convert the whole thing to tuples, I guess. Would there be a benefit to that approach?

Hmm... Forcing compilation to a string doesn't seem like the right choice to me. Some namedtuple or frozen dataclass would make the most sense to me.

If eq and frozen are both true, by default @dataclass will generate a __hash__() method for you.
https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass
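A minimal sketch of the frozen-dataclass approach being discussed (Filter and ReadTable are hypothetical names, not bigframes classes):

```python
import dataclasses
from typing import Tuple


# Hypothetical structured filter, replacing a stringly-typed sql_predicate.
@dataclasses.dataclass(frozen=True)
class Filter:
    column: str
    op: str        # e.g. "==", ">=", "in"
    values: Tuple  # a tuple (not a list) keeps the node hashable


@dataclasses.dataclass(frozen=True)
class ReadTable:
    table: str
    filters: Tuple[Filter, ...] = ()


# frozen=True with the default eq=True auto-generates __hash__, so
# structurally equal nodes hash equally and can be used as dict/set keys.
a = ReadTable("proj.ds.tbl", (Filter("x", ">=", (0,)),))
b = ReadTable("proj.ds.tbl", (Filter("x", ">=", (0,)),))
assert a == b and hash(a) == hash(b)
```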
bigframes/core/nodes.py
    def __post_init__(self):
        # enforce invariants
        physical_names = set(map(lambda i: i.name, self.physical_schema))
        assert len(self.columns.names) > 0
Why this assertion? It is possible to create a completely empty table in BQ. Why one would want to do so, I'm not certain, but it is possible.

Yeah, I guess we should allow empty tables; removed this constraint.
bigframes/core/nodes.py
        # enforce invariants
        physical_names = set(map(lambda i: i.name, self.physical_schema))
        assert len(self.columns.names) > 0
        assert set(self.primary_key).issubset(physical_names)
This assertion might be false in future if "primary key" contains pseudo columns.

Removed this constraint, though we might need a bit more work to support pseudo columns anyway.
bigframes/core/nodes.py
        physical_names = set(map(lambda i: i.name, self.physical_schema))
        assert len(self.columns.names) > 0
        assert set(self.primary_key).issubset(physical_names)
        assert set(self.columns.names).issubset(physical_names)
If we ever reach this line of code, it would likely be an indication that we should have a ValueError further up the call stack. It would at least be helpful to have a custom error message here, as in other assertions, so the user knows to file a bug that we missed a validation check somewhere.

Added some error messages.
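A hedged sketch of such an assertion message (the fields on this ReadTableNode are illustrative, not the real node's):

```python
import dataclasses
from typing import Tuple


@dataclasses.dataclass(frozen=True)
class ReadTableNode:
    physical_names: Tuple[str, ...]  # columns present in the table schema
    column_names: Tuple[str, ...]    # columns this node selects

    def __post_init__(self):
        # A message makes the internal invariant failure actionable:
        # it tells the user this is our bug, not a usage error.
        missing = set(self.column_names) - set(self.physical_names)
        assert not missing, (
            f"Requested columns {sorted(missing)} not present in table schema. "
            "This is likely a bug; a validation check was missed upstream."
        )
```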
    if isinstance(ibis_dtype, ibis_dtypes.Integer):
        return pd.Int64Dtype()

    # Temporary: Will eventually support an explicit json type instead of casting to string.
We should probably raise a warning (PreviewWarning?) in this case to make sure folks know that depending on any JSON functionality may break in future.

Added a preview warning.
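A minimal sketch of what such a warning could look like (the PreviewWarning class and helper name here are illustrative; bigframes' actual names may differ):

```python
import warnings


class PreviewWarning(Warning):
    """Behavior is in preview and may change without notice."""


def json_to_bigframes_dtype() -> str:
    # Sketch: until an explicit JSON dtype exists, JSON columns become STRING,
    # and we warn so users know this mapping may change.
    warnings.warn(
        "JSON columns are currently cast to STRING; code relying on this "
        "behavior may break once an explicit JSON dtype is supported.",
        PreviewWarning,
    )
    return "string"
```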
    )

    try:
        print(sql)
bigframes/session/__init__.py
        # have executed a query with a LIMIT clause.
        max_results=None,
    )
    bf_read_gbq_table.validate_sql_through_ibis(sql, self.ibis_client)
We'll need Henry's logic to dry run with and without the time_travel_timestamp so we can continue to support tables that don't support time travel.

Merged in his logic.
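A hedged sketch of that fallback approach (function and parameter names are assumptions, not the actual implementation; `dry_run(sql)` stands in for a BigQuery dry-run job that raises on an invalid query):

```python
from typing import Callable, Optional


def compile_read_sql(
    table: str,
    time_travel_timestamp: Optional[str],
    dry_run: Callable[[str], None],
) -> str:
    """Prefer a time-travel read, falling back for unsupported tables.

    Some objects (e.g. views, external tables) reject FOR SYSTEM_TIME AS OF,
    so if the dry run of the time-travel query fails, retry without it.
    """
    base_sql = f"SELECT * FROM `{table}`"
    if time_travel_timestamp is not None:
        tt_sql = (
            f"{base_sql} FOR SYSTEM_TIME AS OF TIMESTAMP '{time_travel_timestamp}'"
        )
        try:
            dry_run(tt_sql)
            return tt_sql
        except Exception:
            pass  # table doesn't support time travel; use the plain read
    dry_run(base_sql)
    return base_sql
```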