fix: Update pyarrow to the latest v14.0.1 regarding CVE-2023-47248. #3835
shuchu wants to merge 14 commits into feast-dev:master from
Conversation
Signed-off-by: Shuchu Han <[email protected]>
A little bit worried about the unit test coverage. Please be aware that I unpinned the pyarrow version. py3.8-requirements.txt and py3.8-ci-requirements.txt were updated manually (due to the Dask version issue for Python 3.8).
```diff
@@ -1,5 +1,5 @@
 #
-# This file is autogenerated by pip-compile with Python 3.10
+# This file is autogenerated by pip-compile with Python 3.9
```
You are right, I need to create a Python 3.10 venv and run the command from the Makefile. Let me fix this.
Fixed, let's see the test results.
Seems the integration tests failed...
21 failures; the most frequent error is about a wrong Timestamp format: google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Invalid timestamp microseconds value 1700011424237000000 of logical type NONE; in column 'created'
Let me dig into it and find the root cause.
1. Google's BigQuery API only accepts "ms" resolution for timestamps, while pyarrow.parquet.write_table() keeps the exact original resolution, which is "ns" by default.
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
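The resolution mismatch can be seen with plain arithmetic on the rejected value from the error above. This is just an illustrative sketch (interpreting the raw integer at the two resolutions is my assumption about what happened):

```python
from datetime import datetime, timezone

# The raw value BigQuery rejected in the 'created' column.
raw = 1700011424237000000

# Read as nanoseconds since the epoch (pyarrow/pandas default resolution),
# it is an ordinary recent timestamp:
print(datetime.fromtimestamp(raw / 1_000_000_000, tz=timezone.utc))

# Read as microseconds (the resolution BigQuery expects from the Parquet
# file), the same integer lands tens of thousands of years in the future,
# outside any valid timestamp range:
try:
    datetime.fromtimestamp(raw / 1_000_000, tz=timezone.utc)
except (OverflowError, OSError, ValueError) as exc:
    print("invalid as microseconds:", exc)
```

So a nanosecond value written as-is and then read at a coarser resolution is guaranteed to be out of range, which matches the BadRequest above.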
…le write to temporary parquet file. Signed-off-by: Shuchu Han <[email protected]>
…ng pyarrow v10.0.1 Signed-off-by: Shuchu Han <[email protected]>
I met a very interesting problem. I only updated the pyarrow version and the Snowflake API, yet the integration test results show that the timestamp range is wrong while running the Redshift SQL query. It happens while running "get_historical_features()", and the timestamp range was inferred from the "entity_df":
Please do not merge this PR. @sudohainguyen
No worries, looking forward to seeing this work.
Finally, I found the fix. It's about the setting of the "coerce_timestamps" parameter of "pyarrow.parquet.write_table". Let me close this PR and create a clean new one.
Great @shuchu!!
What this PR does / why we need it:
Update pyarrow to the latest version, v14.0.1, which has the fix for CVE-2023-47248.
Fixes #3832