Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We are adapting DataFusion Comet (a Spark accelerator) to use DataFusion's native Parquet scan, which is backed by arrow-rs. Spark defaults to writing timestamps in Parquet as INT96 (as do Hive, Impala, and other systems), and most systems infer this physical type as a timestamp even though the Parquet spec defines a separate timestamp type. arrow-rs converts INT96 to `Timestamp(TimeUnit::Nanosecond, None)`. The nanosecond precision makes the resulting data type unable to represent the full range of dates that Spark originally wrote to the file.
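To illustrate the range problem, here is a minimal sketch (plain Rust, not an arrow-rs API) of how an INT96 value encodes a timestamp and where the i64-nanosecond conversion overflows. It assumes the standard INT96 layout: the first 8 bytes are little-endian nanoseconds within the day, the last 4 bytes are the little-endian Julian day number.

```rust
// Assumed INT96 layout: bytes[0..8] = nanos of day (LE), bytes[8..12] = Julian day (LE).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day number of 1970-01-01
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// Interpret raw INT96 bytes as nanoseconds since the Unix epoch.
/// Returns None when the value cannot fit in i64 nanoseconds
/// (i.e. dates outside roughly 1677-09-21 .. 2262-04-11).
fn int96_to_nanos(bytes: [u8; 12]) -> Option<i64> {
    let nanos_of_day = i64::from_le_bytes(bytes[0..8].try_into().unwrap());
    let julian_day = i32::from_le_bytes(bytes[8..12].try_into().unwrap()) as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH)
        .checked_mul(NANOS_PER_DAY)?
        .checked_add(nanos_of_day)
}
```

A date that Spark can store in INT96 (e.g. far in the future) makes `checked_mul` overflow here, which is exactly the lossiness described above.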
Describe the solution you'd like
An opt-in feature that passes each INT96 value through as its unmodified bytes, perhaps as `FixedSizeBinary(12)`.
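With the raw bytes preserved, downstream code could apply Spark's own semantics. The sketch below (hypothetical, plain Rust; the function name and layout assumptions are illustrative, not an arrow-rs or DataFusion API) converts the 12 raw bytes to microseconds since the Unix epoch, matching Spark's timestamp precision and its wider date range:

```rust
// Assumed INT96 layout: bytes[0..8] = nanos of day (LE), bytes[8..12] = Julian day (LE).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day number of 1970-01-01
const MICROS_PER_DAY: i64 = 86_400 * 1_000_000;

/// Hypothetical downstream conversion: raw INT96 bytes (surfaced as
/// FixedSizeBinary(12)) to microseconds since the Unix epoch.
/// Microsecond precision covers dates that overflow i64 nanoseconds.
fn int96_bytes_to_micros(bytes: &[u8; 12]) -> Option<i64> {
    let nanos_of_day = i64::from_le_bytes(bytes[0..8].try_into().unwrap());
    let julian_day = i32::from_le_bytes(bytes[8..12].try_into().unwrap()) as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH)
        .checked_mul(MICROS_PER_DAY)?
        .checked_add(nanos_of_day / 1_000)
}
```

A value near year 9999, which overflows an i64 nanosecond timestamp, converts cleanly at microsecond precision.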
Describe alternatives you've considered
- An option to choose the precision used when inferring INT96 as a timestamp. For example, Spark uses microsecond precision, so reading into `Timestamp(TimeUnit::Microsecond, None)` would support a larger range of dates. However, I do not think it's reasonable to push Spark-specific options into arrow-rs.
- An option to pass INT96 through as a struct of `Time64` and `Date32` Arrow types, which is essentially what an INT96 timestamp represents. However, I take the same issue with this as with the previous point.
- Bring the existing code from Comet's Parquet reader, which handles some of these quirks, along with all of its Spark-specific configs, over to arrow-rs. Same issue as above.
Additional context
- Please see "Inconsistent Signedness Of Legacy Parquet Timestamps Written By Spark" (datafusion#7958) for relevant discussion from 2023.
- Interpreting INT96 as a timestamp can be tough: it depends on the Spark config and the Spark version, and there still seems to be debate on whether arithmetic during the conversion should wrap on overflow or not.
- DataFusion's `SchemaAdapter` gives us a lot of control over how to adjust data coming out of its Parquet scan. However, because this lossy conversion from INT96 to an Arrow type happens in arrow-rs, it's too late for us to fix it in a custom `SchemaAdapter`. If we implement this feature, we will be able to handle all of the Spark-specific quirks in a `SchemaAdapter`.
- Comet already has a native Parquet scan operator that handles these timestamp quirks, but it does not support complex types. In order to support complex types and share code with arrow-rs, we want to use DataFusion's Parquet scan instead.