Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We are adapting DataFusion Comet (a Spark accelerator) to use DataFusion's native Parquet scan, which is backed by arrow-rs. Spark defaults to writing timestamps in Parquet as INT96 (as do Hive, Impala, and other systems), and most systems infer this physical type as a timestamp even though the Parquet spec defines a separate timestamp type. arrow-rs converts INT96 to `Timestamp(TimeUnit::Nanosecond, None)`. The nanosecond precision makes the resulting data type unable to represent the full range of dates that Spark originally wrote to the file.
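To illustrate the range problem, here is a minimal sketch (plain Rust, not an arrow-rs API) of how an INT96 value encodes a timestamp and where the i64-nanosecond conversion overflows. It assumes the standard INT96 layout: the first 8 bytes are little-endian nanoseconds within the day, the last 4 bytes are the little-endian Julian day number.

```rust
// Assumed INT96 layout: bytes[0..8] = nanos of day (LE), bytes[8..12] = Julian day (LE).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day number of 1970-01-01
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// Interpret raw INT96 bytes as nanoseconds since the Unix epoch.
/// Returns None when the value cannot fit in i64 nanoseconds
/// (i.e. dates outside roughly 1677-09-21 .. 2262-04-11).
fn int96_to_nanos(bytes: [u8; 12]) -> Option<i64> {
    let nanos_of_day = i64::from_le_bytes(bytes[0..8].try_into().unwrap());
    let julian_day = i32::from_le_bytes(bytes[8..12].try_into().unwrap()) as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH)
        .checked_mul(NANOS_PER_DAY)?
        .checked_add(nanos_of_day)
}
```

A date that Spark can store in INT96 (e.g. far in the future) makes `checked_mul` overflow here, which is exactly the lossiness described above.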
Describe the solution you'd like
An opt-in feature that passes each INT96 value through as its unmodified bytes, perhaps as `FixedSizeBinary(12)`.
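With the raw bytes preserved, downstream code could apply Spark's own semantics. The sketch below (hypothetical, plain Rust; the function name and layout assumptions are illustrative, not an arrow-rs or DataFusion API) converts the 12 raw bytes to microseconds since the Unix epoch, matching Spark's timestamp precision and its wider date range:

```rust
// Assumed INT96 layout: bytes[0..8] = nanos of day (LE), bytes[8..12] = Julian day (LE).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day number of 1970-01-01
const MICROS_PER_DAY: i64 = 86_400 * 1_000_000;

/// Hypothetical downstream conversion: raw INT96 bytes (surfaced as
/// FixedSizeBinary(12)) to microseconds since the Unix epoch.
/// Microsecond precision covers dates that overflow i64 nanoseconds.
fn int96_bytes_to_micros(bytes: &[u8; 12]) -> Option<i64> {
    let nanos_of_day = i64::from_le_bytes(bytes[0..8].try_into().unwrap());
    let julian_day = i32::from_le_bytes(bytes[8..12].try_into().unwrap()) as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH)
        .checked_mul(MICROS_PER_DAY)?
        .checked_add(nanos_of_day / 1_000)
}
```

A value near year 9999, which overflows an i64 nanosecond timestamp, converts cleanly at microsecond precision.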
Describe alternatives you've considered
- An option to choose the precision used when inferring INT96 as a timestamp. For example, Spark uses microsecond precision, so reading into `Timestamp(TimeUnit::Microsecond, None)` would support a larger range of dates. However, I do not think it's reasonable to push Spark-specific options into arrow-rs.
- An option to pass INT96 through as a struct of `Time64` and `Date32` Arrow types, which is essentially what an INT96 timestamp represents. However, I take the same issue with this as with the previous point.
- Bring the existing code from Comet's Parquet reader, which handles some of these quirks, along with all of its Spark-specific configs, over to arrow-rs. Same issue as above.
Additional context
- Please see "Inconsistent Signedness Of Legacy Parquet Timestamps Written By Spark" (datafusion#7958) for relevant discussion from 2023.
- Interpreting INT96 as a timestamp can be tough: it depends on the Spark config and the Spark version, and there still seems to be debate on whether arithmetic during the conversion should wrap on overflow or not.
- DataFusion's `SchemaAdapter` gives us a lot of control over how to adjust data coming out of its Parquet scan. However, because this lossy conversion from INT96 to an Arrow type happens in arrow-rs, it's too late for us to fix it in a custom `SchemaAdapter`. If we implement this feature, we will be able to handle all of the Spark-specific quirks in a `SchemaAdapter`.
- Comet already has a native Parquet scan operator that handles these timestamp quirks, but it does not support complex types. In order to support complex types and share code with arrow-rs, we want to use DataFusion's Parquet scan instead.