Skip to content

Support different TimeUnits and timezones when reading Timestamps from INT96 #7220

@mbutrovich

Description

@mbutrovich

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We are adapting DataFusion Comet (Spark accelerator) to use DataFusion's native Parquet scan backed by arrow-rs. Spark defaults to writing timestamps in Parquet as INT96 (a la Hive, Impala, and other systems), which most systems infer as a timestamp despite the Parquet spec having a separate timestamp type. In arrow-rs's case, it converts to a Timestamp(TimeUnit::Nanoseconds, None). The nanosecond-precision renders the data type unable to represent the same range of dates as what Spark wrote to the file originally.

Describe the solution you'd like

An opt-in feature that allows INT96 to pass unmodified bytes for each value, perhaps as FixedSizedBinary(12).

Describe alternatives you've considered

  • An option to choose the precision for inferring INT96 as Timestamps. For example, Spark uses microsecond precision, so going to Timestamp(TimeUnit::Microsecond, None) would support a larger range of dates. However, I do not think it's reasonable to push Spark-specific options into arrow-rs.
  • An option to pass INT96 as a struct of Time64 and Date32 Arrow types, which is essentially what an INT96 timestamp represents, however I take the same issue with the previous point.
  • Bring over existing code from Comet's Parquet reader to arrow-rs which handles some of these quirks, and all of their respective Spark-specific configs. Same issue as above.

Additional context

  • Please see Inconsistent Signedness Of Legacy Parquet Timestamps Written By Spark datafusion#7958 for relevant discussion from 2023.
  • Interpreting INT96 as a timestamp can be tough: it depends on the Spark config, the Spark version, and there still seems to be debate on whether arithmetic during conversion should wrap on overflow or not.
  • DataFusion's SchemaAdapter gives us a lot of control over how to adjust data coming out of its Parquet scan. However, because this "lossy" conversion from INT96 to an Arrow type happens in arrow-rs, it's too late for us to fix it in a custom SchemaAdapter. If we implement this feature, we will be able to handle all of the Spark-specific quirks in a SchemaAdapter.
  • Comet already has a native Parquet scan operator that handles these types of timestamp quirks, but it does not support complex types. In order to support complex types and share code with arrow-rs we want to use DataFusion's Parquet scan instead.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementAny new improvement worthy of a entry in the changelogparquetChanges to the parquet crate

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions