Apache Arrow feed (https://arrow.apache.org/feed.xml), generated 2026-03-10

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics. It specifies a standardized language-independent column-oriented memory format for flat and nested data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.

Apache Arrow Go 18.5.2 Release (2026-03-04)
https://arrow.apache.org/blog/2026/03/04/arrow-go-18.5.2

The Apache Arrow team is pleased to announce the v18.5.2 release of Apache Arrow Go. This patch release covers 16 commits from 6 distinct contributors.

Contributors

$ git shortlog -sn v18.5.1..v18.5.2
    11	Matt Topol
     2	daniel-adam-tfs
     1	Evan Todd
     1	Rusty Conover
     1	Stas Spiridonov
     1	William

Changelog

What's Changed

  • chore: bump parquet-testing submodule by @zeroshade in #633
  • fix(arrow/array): handle empty binary values correctly in BinaryBuilder by @zeroshade in #634
  • test(arrow/array): add test to binary builder by @zeroshade in #636
  • fix(parquet): decryption of V2 data pages by @daniel-adam-tfs in #596
  • perf(arrow): Reduce the amount of allocated objects by @spiridonov in #645
  • fix(parquet/file): regression with decompressing data by @zeroshade in #652
  • fix(arrow/compute): take on record/array with nested struct by @zeroshade in #653
  • fix(parquet/file): write large string values by @zeroshade in #655
  • ci: ensure extra GC cycle for flaky tests by @zeroshade in #661
  • fix(arrow/array): handle exponent notation for unmarshal int by @zeroshade in #662
  • fix(flight/flightsql/driver): fix time.Time params by @etodd in #666
  • fix(parquet): bss encoding and tests on big endian systems by @daniel-adam-tfs in #663
  • fix(parquet/pqarrow): selective column reading of complex map column by @zeroshade in #668
  • feat(arrow/ipc): support custom_metadata on RecordBatch messages by @rustyconover in #669
  • feat: Support setting IPC options in FlightSQL call options by @peasee in #674
  • chore(dev/release): embed hash of source tarball into email by @zeroshade in #675
  • chore(arrow): bump PkgVersion to 18.5.2 by @zeroshade in #676

New Contributors

  • @spiridonov made their first contribution in #645
  • @etodd made their first contribution in #666
  • @rustyconover made their first contribution in #669
  • @peasee made their first contribution in #674

Full Changelog: https://github.com/apache/arrow-go/compare/v18.5.1...v18.5.2

Apache Arrow nanoarrow 0.8.0 Release (2026-02-24)
https://arrow.apache.org/blog/2026/02/24/nanoarrow-0.8.0-release

The Apache Arrow team is pleased to announce the 0.8.0 release of Apache Arrow nanoarrow. This release consists of 28 resolved GitHub issues from 10 contributors.

Release Highlights

  • Support for building String View arrays by buffer
  • LZ4 decompression support in IPC reader
  • Support for Conan
  • Support for Homebrew

See the Changelog for a detailed list of contributions to this release.

Features

String Views By Buffer

The C library in general supports two methods for producing or consuming arrays: most users use the builder pattern (e.g., ArrowArrayAppendString()); however, the "build by buffer" pattern can be effective when using nanoarrow with a higher level runtime like C++, Rust, Python, or R, all of which have mechanisms to build buffers already. The C library supports this with ArrowArraySetBuffer(); however, there was no way to reserve and/or set variadic buffers for string view arrays. In nanoarrow 0.8.0, the array builder API fully supports both mechanisms for building string view arrays.

LZ4 Decompression Support

The Arrow IPC reader included in the nanoarrow C library supports most features of the Arrow IPC format; however, decompression support for the LZ4 codec was missing which made the library and its bindings unusable for some common use cases. In 0.8.0, decompression for the LZ4 codec was added to the C library.

Users of the C library will need to configure CMake with -DNANOARROW_IPC_WITH_LZ4=ON and -DNANOARROW_IPC=ON to use CMake-resolved LZ4; however, client libraries can also use an existing ZSTD or LZ4 implementation using callbacks just like in 0.7.0.

nanoarrow on Conan

The nanoarrow C library can now be installed using the Conan C/C++ Package Manager! CMake projects can now use find_package(nanoarrow) when using a Conan-enabled toolchain after adding the nanoarrow dependency to conanfile.txt.
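As a sketch of the setup (the recipe reference nanoarrow/0.8.0 and the imported target name nanoarrow::nanoarrow are assumptions based on common Conan conventions, not taken from the recipe itself):

```
# --- conanfile.txt ---------------------------------------------------
# "nanoarrow/0.8.0" assumes the Conan Center recipe name and version.
[requires]
nanoarrow/0.8.0

[generators]
CMakeDeps
CMakeToolchain

# --- CMakeLists.txt --------------------------------------------------
cmake_minimum_required(VERSION 3.18)
project(nanoarrow_example C)

find_package(nanoarrow REQUIRED)
add_executable(example example.c)
# The imported target name nanoarrow::nanoarrow is an assumption here.
target_link_libraries(example PRIVATE nanoarrow::nanoarrow)
```

Configuring with the Conan-generated toolchain file (e.g. `cmake --preset conan-release` with recent Conan versions) then resolves the dependency.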

Thanks to @wgtmac for contributing the recipe!

nanoarrow on Homebrew

The nanoarrow C library can now be installed using Homebrew!

brew install nanoarrow

CMake projects can then use find_package(nanoarrow) when using a Homebrew-provided CMake, and other Homebrew-based projects can depend on nanoarrow.

Thanks to @ankane for contributing the formula!

Contributors

This release consists of contributions from 12 contributors in addition to the invaluable advice and support of the Apache Arrow community.

$ git shortlog -sn apache-arrow-nanoarrow-0.8.0.dev..apache-arrow-nanoarrow-0.8.0-rc0
    23  Dewey Dunnington
     2  Bryce Mecum
     2  Dirk Eddelbuettel
     1  Even Rouault
     1  Kevin Liu
     1  Michael Chirico
     1  Namit Kewat
     1  Nyall Dawson
     1  Sutou Kouhei
     1  William Ayd
Apache Arrow 23.0.1 Release (2026-02-16)
https://arrow.apache.org/blog/2026/02/16/23.0.1-release

The Apache Arrow team is pleased to announce the 23.0.1 release. It includes a security fix for the C++ IPC file reader, so be sure to read the relevant details below to see if you are affected.

Apart from that, 23.0.1 is mostly a bugfix release that includes 28 resolved issues on 29 distinct commits from 12 distinct contributors.

See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

C++ notes

  • Fix possible OOB write in buffered IO (GH-48311).

IPC

CVE-2026-25087: Use After Free vulnerability in IPC file reader

Fixed a security issue that can be triggered when reading an Arrow IPC file (but not an IPC stream) with pre-buffering enabled, if the IPC file contains data with variadic buffers (such as Binary View and String View data).

Pre-buffering is disabled by default, so your code is vulnerable only if it enables it explicitly by calling RecordBatchFileReader::PreBufferMetadata. Affected Arrow C++ versions are 15.0.0 through 23.0.0. The fix integrated in 23.0.1 can also be separately viewed at GH-48925.

See our separate announcement for further detail.

Other fixes

  • Avoid memory blowup with excessive variadic buffer count in IPC (GH-48900).

Gandiva

  • Fix passing CPU attributes to LLVM (GH-48160).
  • Detect overflow in repeat() (GH-49159).

Parquet

  • Avoid re-serializing footer for signature verification (GH-48858).

Python notes

  • Added missing NOTICE.txt and LICENSE.txt to wheels (GH-48983).
  • Some fixes for compatibility with newer Cython versions (GH-48965, GH-49156 and GH-49138).

Ruby notes

  • Fix a bug where Arrow::ExecutePlan nodes may be Garbage Collected (GH-48880).

R notes

  • Bumped the R build infrastructure to C++ 20 (GH-48817) and fixed some C++ 20 related compilation issues (GH-48973).

Other modules and languages

No general changes were made to the other libraries or languages.

Apache Arrow is 10 years old 🎉 (2026-02-12)
https://arrow.apache.org/blog/2026/02/12/arrow-anniversary

The Apache Arrow project was officially established and had its first git commit on February 5th 2016, and we are therefore enthusiastic to announce its 10-year anniversary!

Looking back over these 10 years, the project has developed in many unforeseen ways, and we believe we have delivered on our objective of providing agnostic, efficient, durable standards for the exchange of columnar data.

How it started

From the start, Arrow has been a joint effort between practitioners of various horizons looking to build common grounds to efficiently exchange columnar data between different libraries and systems. In this blog post, Julien Le Dem recalls how some of the founders of the Apache Parquet project participated in the early days of the Arrow design phase. The idea of Arrow as an in-memory format was meant to address the other half of the interoperability problem, the natural complement to Parquet as a persistent storage format.

Apache Arrow 0.1.0

The first Arrow release, numbered 0.1.0, was tagged on October 7th 2016. It already featured the main data types that are still the bread-and-butter of most Arrow datasets, as evidenced in this Flatbuffers declaration:


/// ----------------------------------------------------------------------
/// Top-level Type value, enabling extensible type-specific metadata. We can
/// add new logical types to Type without breaking backwards compatibility

union Type {
  Null,
  Int,
  FloatingPoint,
  Binary,
  Utf8,
  Bool,
  Decimal,
  Date,
  Time,
  Timestamp,
  Interval,
  List,
  Struct_,
  Union
}

The release announcement made the bold claim that "the metadata and physical data representation should be fairly stable as we have spent time finalizing the details". Does that promise hold? The short answer is: yes, almost! But let us analyse that in a bit more detail:

  • the Columnar format, for the most part, has only seen additions of new datatypes since 2016. One single breaking change occurred: Union types cannot have a top-level validity bitmap anymore.

  • the IPC format has seen several minor evolutions of its framing and metadata format; these evolutions are encoded in the MetadataVersion field which ensures that new readers can read data produced by old writers. The single breaking change is related to the same Union validity change mentioned above.

First cross-language integration tests

Arrow 0.1.0 had two implementations: C++ and Java, with bindings of the former to Python. There were also no integration tests to speak of, that is, no automated assessment that the two implementations were in sync (what could go wrong?).

Integration tests had to wait for November 2016 to be designed, and the first automated CI run probably occurred in December of the same year. Its results cannot be fetched anymore, so we can only assume the tests passed successfully. 🙂

From that moment, integration tests have grown to follow additions to the Arrow format, while ensuring that older data can still be read successfully. For example, the integration tests that are routinely checked against multiple implementations of Arrow have data files generated in 2019 by Arrow 0.14.1.

No breaking changes... almost

As mentioned above, at some point the Union type lost its top-level validity bitmap, breaking compatibility for the workloads that made use of this feature.

This change was proposed back in June 2020 and enacted shortly thereafter. It elicited no controversy and doesn't seem to have caused any significant discontent among users, signaling that the feature was probably not widely used (if at all).

Since then, there has been precisely zero breaking change in the Arrow Columnar and IPC formats.

Apache Arrow 1.0.0

We have been extremely cautious with version numbering and waited until July 2020 before finally switching away from 0.x version numbers. This was signalling to the world that Arrow had reached its "adult phase" of making formal compatibility promises, and that the Arrow formats were ready for wide consumption amongst the data ecosystem.

Apache Arrow, today

Describing the breadth of the Arrow ecosystem today would take a full-fledged article of its own, or perhaps even multiple Wikipedia pages. Our "powered by" page can give a small taste.

As for the Arrow project, we will merely refer you to our official documentation:

  1. The various specifications that cater to multiple aspects of sharing Arrow data, such as in-process zero-copy sharing between producers and consumers that know nothing about each other, or executing database queries that efficiently return their results in the Arrow format.

  2. The implementation status page that lists the implementations developed officially under the Apache Arrow umbrella (native software libraries for C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust). But keep in mind that multiple third-party implementations exist in non-Apache projects, either open source or proprietary.

However, that is only a small part of the landscape. The Arrow project hosts several official subprojects, such as ADBC and nanoarrow. A notable success story is Apache DataFusion, which began as an Arrow subproject and later graduated to become an independent top-level project in the Apache Software Foundation, reflecting the maturity and impact of the technology.

Beyond these subprojects, many third-party efforts have adopted the Arrow formats for efficient interoperability. GeoArrow is an impressive example of how building on top of existing Arrow formats and implementations can enable groundbreaking efficiency improvements in a very non-trivial problem space.

It should also be noted that Arrow, as an in-memory columnar format, is often used hand in hand with Parquet for persistent storage; as a matter of fact, most official Parquet implementations are nowadays being developed within Arrow repositories (C++, Rust, Go).

Tomorrow

The Apache Arrow community is primarily driven by consensus, and the project does not have a formal roadmap. We will continue to welcome everyone who wishes to participate constructively. While the specifications are stable, they still welcome additions to cater for new use cases, as they have done in the past.

The Arrow implementations are actively maintained, gaining new features, bug fixes, and performance improvements. We encourage people to contribute to their implementation of choice, and to engage with us and the community.

Now and going forward, a large amount of Arrow-related progress is happening in the broader ecosystem of third-party tools and libraries. It is no longer possible for us to keep track of all the work being done in those areas, but we are proud to see that they are building on the same stable foundations that have been laid 10 years ago.

Introducing a Security Model for Arrow (2026-02-09)
https://arrow.apache.org/blog/2026/02/09/arrow-security-model

We are thrilled to announce the official publication of a Security Model for Apache Arrow.

The Arrow security model covers a core subset of the Arrow specifications: the Arrow Columnar Format, the Arrow C Data Interface and the Arrow IPC Format. It sets expectations and gives guidelines for handling data coming from untrusted sources.

The specifications covered by the Arrow security model are building blocks for all the other Arrow specifications, such as Flight and ADBC.

The ideas underlying the Arrow security model were informally shared between Arrow maintainers and have informed decisions for years, but they were left undocumented until now.

Implementation-specific security considerations, such as proper API usage and runtime safety guarantees, will later be covered in the documentation of the respective implementations.

Apache Arrow Go 18.5.1 Release (2026-01-26)
https://arrow.apache.org/blog/2026/01/26/arrow-go-18.5.1

The Apache Arrow team is pleased to announce the v18.5.1 release of Apache Arrow Go. This patch release covers 10 commits from 6 distinct contributors.

Contributors

$ git shortlog -sn v18.5.0..v18.5.1
     6	Matt Topol
     1	Alfonso Subiotto Marqués
     1	Arnold Wakim
     1	Bryce Mecum
     1	Rok Mihevc
     1	cai.zhang

Changelog

What's Changed

  • fix(internal): fix assertion on undefined behavior by @amoeba in #602
  • ci(benchmark): switch to new conbench instance by @rok in #593
  • fix(flight): make StreamChunksFromReader ctx aware and cancellation-safe by @arnoldwakim in #615
  • fix(parquet/variant): fix basic stringify by @zeroshade in #624
  • fix(parquet/pqarrow): fix partial struct panic by @zeroshade in #630
  • Flaky test fixes by @zeroshade in #629
  • ipc: clear variadicCounts in recordEncoder.reset() by @asubiotto in #631
  • fix(arrow/cdata): Handle errors to prevent panic by @xiaocai2333 in #614

New Contributors

  • @rok made their first contribution in #593
  • @asubiotto made their first contribution in #631
  • @xiaocai2333 made their first contribution in #614

Full Changelog: https://github.com/apache/arrow-go/compare/v18.5.0...v18.5.1

Apache Arrow 23.0.0 Release (2026-01-18)
https://arrow.apache.org/blog/2026/01/18/23.0.0-release

The Apache Arrow team is pleased to announce the 23.0.0 release. This release covers over 3 months of development work and includes 336 resolved issues on 417 distinct commits from 71 distinct contributors. See the Install Page to learn how to get the libraries for your platform.

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Community

As per our newly started tradition of rotating the PMC chair once a year, Antoine Pitrou was elected as the new PMC chair and VP, succeeding Neal Richardson.

Thanks for your contributions and participation in the project!

Arrow Flight RPC Notes

An ODBC driver for Apache Arrow Flight SQL has been completed. Currently it is not packaged for release, but can be built from source.

C++ Notes

The C++ standard has been updated to C++ 20 (GH-45885), and the minimum supported GCC version is now 8.

Some improvements were made to leverage C++ 20 (GH-48592).

Compute

  • Graceful error handling for decimal binary arithmetic and comparison instead of firing confusing assertions. GH-35957
  • Fixed an issue where the MinMax kernel was emitting -inf/inf for all-NaN input. GH-46063
  • Avoid ZeroCopyCastExec when casting between Binary offset types to avoid high overheads. GH-43660
  • Enhanced type checking for hash join residual filter in Acero. GH-48268

Format

  • Clarified that empty compressed buffers can omit the length header. GH-47918

Parquet

  • A new setting to limit the number of rows written per page has been added. GH-47030
  • An arrow::Result version of parquet::arrow::FileReader::Make() has been added. GH-44810
  • Support for reading INT-encoded Decimal statistics as Arrow scalars. GH-47955

Several bug fixes including:

  • Fixed invalid Parquet files written when dictionary encoded pages are large. GH-47973
  • Fixed pre-1970 INT96 timestamps roundtrip. GH-48246
  • Fixed potential crash when reading invalid Parquet data. GH-48308
  • Added compatibility with non-compliant RLE streams. GH-47981
  • Fixed Util & Level Conversion logic on big-endian systems. GH-48218

Encryption

  • Simplified nested field encryption configuration. GH-41246
  • Improved column encryption API. GH-48337
  • Better fuzzing support for encrypted files. GH-48335

Miscellaneous C++ changes

Linux Packaging Notes

Fixed a bug where the parquet-devel RPM package depended on parquet-glib-devel.

See also: GH-48044

CentOS 7 support has been dropped.

See also: GH-40735

MATLAB Notes

Added support for building against MATLAB R2025b GH-48154.

Python Notes

Compatibility notes

  • Deprecated Array.format is removed GH-48102.
  • Experimental tag has been removed for Arrow PyCapsule Interface GH-47975.
  • PyWeakref_GetRef has replaced the use of PyWeakref_GET_OBJECT to support Python 3.15 GH-47823.

New features

  • Bindings for scatter and inverse_permutation are added GH-48167.
  • max_rows_per_page argument is now exposed in parquet.WriterProperties GH-48096.
  • External key material and rotation is enabled for individual Parquet files GH-31869.

Other improvements

  • Nested field encryption configuration has been simplified GH-41246.
  • Reading INT-encoded Decimal statistics with StatisticsAsScalars is now supported GH-47955.
  • Unsigned dictionary indices are now supported in pandas conversion GH-47022.
  • Added code examples for compute functions min, max and min_max GH-48668.
  • Added temporal unit checking in NumPyDtypeUnifier GH-48625.
  • Error message is improved when mixing numpy.datetime64 values with different units (e.g., datetime64[s] and datetime64[ms]) in a single array GH-48463.
  • The source argument is now checked in pyarrow.parquet.read_table GH-47728.

Relevant bug fixes

  • ipc.Message __repr__ has been corrected to use f-string GH-48608.
  • Failures when reading Parquet files written with non-compliant RLE encoders have been fixed in C++ by adding compatibility GH-47981.
  • Memory usage is now reduced when using to_pandas() with many extension array columns GH-47861.
  • Missing required argument error in FSSpecHandler delete_root_dir_contents has been fixed GH-47559.
  • Invalid RecordBatch.from_struct_array batch for sliced arrays with offset zero has been fixed in C++ GH-44318.

R Notes

Compatibility notes

  • GCS support has been turned off by default GH-48342.
  • OpenSSL 1.x builds have been removed GH-45449.

Relevant bug fixes

  • Fixed a segfault that could be raised when concatenating tables GH-47000.

Several continuous integration fixes and minor bug fixes are also included in the release; for a full list, check the release notes.

Ruby and C GLib Notes

All missing compute function options have been added, so all compute functions can now be used from Ruby and C GLib. This work was done by Sten Larsson.

Fixed size list array support has been added.

See also: GH-48362

Support for changing the thread pool configuration in Acero has been added. This was also done by Sten Larsson.

Duration support has been added.

CSV writer support has been added.

See also: GH-48680

Ruby

An experimental pure-Ruby Apache Arrow reader implementation has been added as the red-arrow-format gem.

See also: GH-48132

We'll add an experimental writer implementation in the next release.

Arrow::Column#to_arrow{,_array,_chunked_array} have been added. They are provided for convenience.

See also: GH-48292

Automatic Apache Arrow type detection in Arrow::Array.new has been improved for the nested integer list case.

See also:

Arrow::FixedSizeListArray.new(data_type, values) support has been added.

See also: GH-48610

C GLib

We now use Arrow-${MAJOR}.${MINOR}.{gir,typelib} instead of Arrow-1.0.{gir,typelib} for .gir and .typelib file names. This allows multiple C GLib versions to coexist on the same system.

See also: GH-48616

Java, JavaScript, Go, .NET, Swift and Rust Notes

The Java, JavaScript, Go, .NET, Swift and Rust projects have moved to separate repositories outside the main Arrow monorepo.

Apache Arrow ADBC 22 (Libraries) Release (2026-01-09)
https://arrow.apache.org/blog/2026/01/09/adbc-22-release

The Apache Arrow team is pleased to announce the version 22 release of the Apache Arrow ADBC libraries. This release includes 14 resolved issues from 16 distinct contributors.

This is a release of the libraries, which are at version 22. The API specification is versioned separately and is at version 1.1.0.

The subcomponents are versioned independently:

  • C/C++/GLib/Go/Python/Ruby: 1.10.0
  • C#: 0.22.0
  • Java: 0.22.0
  • R: 0.22.0
  • Rust: 0.22.0

The release notes below are not exhaustive and only expose selected highlights of the release. Many other bugfixes and improvements have been made: we refer you to the complete changelog.

Release Highlights

In the C++ driver manager and packages using it (such as the driver managers for Go and Python), it is now possible to open a connection with only a URI (the driver name will be assumed to be the URI scheme) (#3694, #3790).

For Windows users, C++ builds now generate import LIBs so DLLs can be properly linked to (#2858).

The C# Databricks driver has reduced memory usage (#3654, #3683). Also, token exchange was fixed (#3715), and a deadlock was fixed (#3756).

The DataFusion driver allows setting the async runtime (#3712).

The Arrow Flight SQL driver supports bulk ingestion with compatible servers (#3808). The Python package also exposes constants for OAuth (#3849).

The Go database/sql adapter now closes resources properly (#3731). All drivers built with Go were updated to a newer Go and arrow-go to resolve CVEs reported against Go itself (although we believe they do not affect the drivers) and to resolve a crash when used with Polars (#3758).

The PostgreSQL driver supports setting the transaction isolation level (#3760), and a bug with the GetObjects filter options was fixed (#3855).

The Python driver manager will now close unclosed cursors for you when a connection is closed (#3810).

Driver shared libraries built with Rust will now catch panics and error gracefully instead of terminating the entire process (although we do not currently distribute any driver libraries built this way) (#3819). Also, the Rust driver manager will no longer dlclose driver libraries as drivers built with Go would hang if this was done on some platforms (#3844).

The SQLite driver supports more info keys (#3843).

Contributors

$ git shortlog --perl-regexp --author='^((?!dependabot\[bot\]).*)$' -sn apache-arrow-adbc-21..apache-arrow-adbc-22
    19	David Li
     7	eric-wang-1990
     6	eitsupi
     5	Mandukhai Alimaa
     4	Bryce Mecum
     3	davidhcoe
     2	Matt Topol
     2	Pavel Agafonov
     2	Philip Moore
     1	Ali Alamiri
     1	Curt Hagenlocher
     1	Hélder Gregório
     1	Matt Corley
     1	Pranav Joglekar
     1	Sudhir Reddy Emmadi
     1	msrathore-db

Roadmap

We are working on the next revision of the API standard. After some discussion, we will likely put aside async APIs to focus on addressing other API gaps that users have reported. However, we may still define language-specific async APIs for ecosystems that expect them (like Rust).

Getting Involved

We welcome questions and contributions from all interested. Issues can be filed on GitHub, and questions can be directed to GitHub or the Arrow mailing lists.

Apache Arrow Go 18.5.0 Release (2025-12-12)
https://arrow.apache.org/blog/2025/12/12/arrow-go-18.5.0

The Apache Arrow team is pleased to announce the v18.5.0 release of Apache Arrow Go. This minor release covers 38 commits from 17 distinct contributors.

Contributors

$ git shortlog -sn v18.4.1..v18.5.0
    11	Matt Topol
     5	Alex
     5	pixelherodev
     2	Mandukhai Alimaa
     2	Rick  Morgans
     2	Sutou Kouhei
     1	Ahmed Mezghani
     1	Bryce Mecum
     1	Dhruvit Maniya
     1	Erez Rokah
     1	Jacob Romero
     1	James Guthrie
     1	Orson Peters
     1	Pierre Lacave
     1	Ruihao Chen
     1	Travis Patterson
     1	andyfan

Highlights

  • Bumped substrait-go dependency to v7 #526

Arrow

  • Support customizing ValueStr output for Timestamp types #510
  • Series of fixes for the cdata integration including fixing some memory leaks #513 #603
  • Optimizations for comparisons and Take kernels #574 #556 #563 #573 #557
  • New temporal rounding methods added to compute #572
  • Fix concatenating out of order REE slices #587

Parquet

  • Fix nullable elements for FixedSizeList values #585
  • Add a build tag pqarrow_read_only for optimized builds that don't need to write #569
  • Better tracking of Total bytes and Total Compressed bytes #548
  • Dramatic performance gains for writing Parquet files (~70%-90% reduced memory, up to 10x faster in some cases) #595

Changelog

What's Changed

  • fix(parquet/pqarrow): Fix null_count column stats by @MasslessParticle in #489
  • chore: Use apache/arrow-dotnet for integration test by @kou in #495
  • feat(parquet): utilize memory allocator in serializedPageReader by @joechenrh in #485
  • chore: Automate GitHub Releases creation and site redirect by @kou in #497
  • use xnor for boolean equals function by @Dhruvit96 in #505
  • chore(ci): fix verify_rc finding latest go by @zeroshade in #512
  • feat: Add support for specifying Timestamp ValueStr output layout by @erezrokah in #510
  • fix(arrow/cdata): Avoid calling unsafe.Slice on zero-length pointers by @orlp in #513
  • fix(arrow/compute): fix scalar comparison panic by @zeroshade in #518
  • fix(arrow/array): fix panic in dictionary builders by @zeroshade in #517
  • fix(parquet/pqarrow): unsupported dictionary types in pqarrow by @zeroshade in #520
  • chore(parquet/metadata): use constant time compare for signature verify by @zeroshade in #528
  • build(deps): update substrait to v7 by @zeroshade in #526
  • fix(parquet): fix adaptive bloom filter duplicate hash counting, comparison logic, and GC safety by @Mandukhai-Alimaa in #527
  • refactor(arrow): last increment of the Record -> RecordBatch migration by @Mandukhai-Alimaa in #522
  • fix: update iceberg substrait URN by @zeroshade in #541
  • optimization: comparison: when DataType is static, skip reflection by @pixelherodev in #542
  • fix(parquet/pqarrow): decoding Parquet with Arrow dict in schema by @freakyzoidberg in #551
  • feat: support conversion of chunked arrays by @ahmed-mez in #553
  • format: regenerate internal/flatbuf from arrow repo and newer flatc by @pixelherodev in #555
  • Batch of small optimizations by @pixelherodev in #556
  • perf: optimize compute.Take for fewer memory allocations by @hamilton-earthscope in #557
  • optimization: compare: avoid initializing config when it's not needed by @pixelherodev in #563
  • optimization: schema: use slices.Sort instead of sort.Slice by @pixelherodev in #564
  • doc(parquet): document arrow parquet mappings by @amoeba in #561
  • fix: Metadata.Equal comparison with keys in different order by @zeroshade in #571
  • perf(compute): optimize Take kernel for list types by @hamilton-earthscope in #573
  • build: add pqarrow_read_only build tags to avoid building write related code by @jacobromero in #569
  • [Go] [Parquet] pqarrow file-writer & row-group-writer tracking total & total compressed bytes by @DuanWeiFan in #548
  • [Go][Parquet] Fix FixedSizeList nullable elements read as NULL by @rmorgans in #585
  • [Go][Parquet] Refactor: extract visitListLike helper for list-like types by @rmorgans in #586
  • feat(compute): Take kernel for Map type by @hamilton-earthscope in #574
  • fix: correctly initialize SchemaField.ColIndex by @JamesGuthrie in #591
  • fix(arrow/array): fix concat for out of order REE slices by @zeroshade in #587
  • new(arrow/compute): temporal rounding methods by @hamilton-earthscope in #572
  • chore(arrow): Bump package version to 18.5.0 by @zeroshade in #594
  • perf(parquet): minor tweaks for iceberg write improvement by @hamilton-earthscope in #595
  • fix(arrow/cdata): fix leaks identified by leak-sanitizer by @zeroshade in #603

New Contributors

  • @Dhruvit96 made their first contribution in #505
  • @erezrokah made their first contribution in #510
  • @orlp made their first contribution in #513
  • @pixelherodev made their first contribution in #542
  • @freakyzoidberg made their first contribution in #551
  • @ahmed-mez made their first contribution in #553
  • @hamilton-earthscope made their first contribution in #557
  • @jacobromero made their first contribution in #569
  • @DuanWeiFan made their first contribution in #548
  • @rmorgans made their first contribution in #585
  • @JamesGuthrie made their first contribution in #591

Full Changelog: https://github.com/apache/arrow-go/compare/v18.4.1...v18.5.0

A Practical Deep Dive into Late Materialization in Arrow-rs Parquet Reading2025-12-11T00:00:00-05:002025-12-11T00:00:00-05:00https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive-zh

This post takes a deep look at the decisions and pitfalls behind implementing late materialization in the Apache Parquet reader in arrow-rs (the reader that powers projects such as Apache DataFusion). We will see how a seemingly simple file reader evaluates predicates through surprisingly sophisticated logic, effectively turning itself into a miniature query engine.

1. Why Late Materialization?

Columnar reading is a long-running tug-of-war between I/O bandwidth and CPU decoding cost. Skipping data is usually a win, but skipping itself has a computational cost. The goal of the Parquet reader in arrow-rs is pipelined late materialization: evaluate predicates first, then access the projected columns. For predicates that filter out many rows, materializing after evaluation minimizes both the reading and the decoding work.

This approach closely resembles the LM-pipelined strategy from the paper Materialization Strategies in a Column-Oriented DBMS by Abadi et al.: interleave predicate evaluation with data column access, rather than reading all columns at once and trying to stitch them back into rows.

LM-pipelined late materialization pipeline

To evaluate a query such as SELECT B, C FROM table WHERE A > 10 AND B < 5 with late materialization, the reader follows these steps:

  1. Read column A and evaluate A > 10 to build a RowSelection (a sparse mask) representing the initially surviving set of rows.
  2. Using that RowSelection, read the surviving values of column B and evaluate B < 5, updating the RowSelection to be even sparser.
  3. Using the refined RowSelection, read column C (the projected column), decoding only the rows that ultimately survive.

The rest of this post walks through how the code implements this path.
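The three steps above can be sketched with plain vectors standing in for real Parquet pages. Everything here (`eval_predicate`, `and_then`, `take`, boolean masks instead of RLE selectors) is an illustrative simplification, not the arrow-rs API:

```rust
// Predicate-first pipeline over toy columns; boolean Vecs play the role
// of RowSelection. All names are illustrative stand-ins.

fn eval_predicate<T: Copy>(col: &[T], keep: impl Fn(T) -> bool) -> Vec<bool> {
    col.iter().map(|&v| keep(v)).collect()
}

// Narrow an existing selection: `inner` has one entry per still-selected row.
fn and_then(selection: &[bool], inner: &[bool]) -> Vec<bool> {
    let mut it = inner.iter();
    selection
        .iter()
        .map(|&s| if s { *it.next().unwrap() } else { false })
        .collect()
}

// Materialize only the selected rows of a column.
fn take<T: Copy>(col: &[T], selection: &[bool]) -> Vec<T> {
    col.iter()
        .zip(selection)
        .filter_map(|(&v, &s)| s.then_some(v))
        .collect()
}

fn main() {
    let a = [5, 20, 30, 7];
    let b = [1, 9, 3, 2];
    let c = ["w", "x", "y", "z"];

    // 1) Evaluate A > 10: rows 1 and 2 survive.
    let sel = eval_predicate(&a, |v| v > 10);
    // 2) Evaluate B < 5 only on the surviving rows (B values 9 and 3).
    let sel_b = eval_predicate(&take(&b, &sel), |v| v < 5);
    let sel = and_then(&sel, &sel_b);
    // 3) Decode only the final survivors of the projected column C.
    assert_eq!(take(&c, &sel), vec!["y"]);
}
```

The real reader does the same narrowing, but over RLE row selectors and page-aware readers rather than dense boolean vectors.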


2. Late Materialization in the Rust Parquet Reader

2.1 LM-pipelined (pipelined late materialization)

"LM-pipelined" sounds like a textbook term. In arrow-rs it simply refers to a pipeline that runs sequentially: "read predicate columns → produce a row selection → read data columns". This contrasts with a parallel strategy that reads all predicate columns simultaneously. Although the parallel approach can maximize multi-core CPU utilization, the pipelined approach is usually superior for columnar storage, because each filtering step drastically reduces the amount of data that subsequent steps need to read and parse.

The code is organized around a few core players:

RowSelection can switch dynamically between RLE and a bitmask. When the gaps are small and the pattern is highly fragmented, the bitmask is faster; RLE is friendlier to large page-level skips. The details of this trade-off are covered in Section 3.1.

Consider the query again: SELECT B, C FROM table WHERE A > 10 AND B < 5

  1. 初始selection = None(相当于“全选”)。
  2. 读取 AArrayReader 分批解码列 A;谓词构建一个布尔掩码;RowSelection::from_filters 将其转换为稀疏选择。
  3. 收紧ReadPlanBuilder::with_predicate 通过 RowSelection::and_then 链接新的掩码。
  4. 读取 B:使用当前的 selection 构建列 B 的读取器;读取器仅对选定的行执行 I/O 和解码,产生一个更稀疏的掩码。
  5. 合并selection = selection.and_then(selection_b);投影列现在只解码极小的行集。

Code location and sketch

// Close to the flow in read_plan.rs (simplified)
let mut builder = ReadPlanBuilder::new(batch_size);

// 1) Inject external pruning (e.g., Page Index):
builder = builder.with_selection(page_index_selection);

// 2) Append predicates serially:
for predicate in predicates {
    builder = builder.with_predicate(predicate); // internally uses RowSelection::and_then
}

// 3) Build readers; all ArrayReaders share the final selection strategy
let plan = builder.build();
let reader = ParquetRecordBatchReader::new(array_reader, plan);

I drew a simple flow diagram to illustrate the process and help you follow along:

Predicate-first pipeline flow

Now that you have seen how the pipeline works, the next question is how to represent and combine these sparse selections (the Row Mask in the diagram). This is where RowSelection comes in.

2.2 Combining row selections (RowSelection::and_then)

RowSelection represents the set of rows that will ultimately be produced. It currently uses RLE (RowSelector::select/skip(len)) to describe sparse ranges. RowSelection::and_then is the core operation for "applying one selection to another": the left-hand argument is "the rows that have already passed", and the right-hand argument is "among the passing rows, which ones also pass the second filter". The output is their boolean AND.

Worked example

  • Input selection A (already filtered): [Skip 100, Select 50, Skip 50] (physical rows 100-150 are selected)
  • Selection B (filtering within A): [Select 10, Skip 40] (of the 50 selected rows, only the first 10 pass B)
  • Result: [Skip 100, Select 10, Skip 90]

How it runs: picture a zipper closing: we walk both lists simultaneously, as follows:

  1. First 100 rows: A is Skip → the result is Skip 100.
  2. Next 50 rows: A is Select. Look at B:
    • B's first 10 are Select → emit Select 10.
    • B's remaining 40 are Skip → emit Skip 40.
  3. Last 50 rows: A is Skip → emit Skip 50.

结果[Skip 100, Select 10, Skip 90]

Here is a code example:

// Example: Skip 100 rows, then take the next 10
let a: RowSelection = vec![RowSelector::skip(100), RowSelector::select(50)].into();
let b: RowSelection = vec![RowSelector::select(10), RowSelector::skip(40)].into();
let result = a.and_then(&b);
// Result should be: Skip 100, Select 10, Skip 40
assert_eq!(
    Vec::<RowSelector>::from(result),
    vec![RowSelector::skip(100), RowSelector::select(10), RowSelector::skip(40)]
);
RowSelection logical AND walkthrough

This keeps narrowing the filter while touching only lightweight metadata, with no data copies. The current and_then implementation is a two-pointer linear scan; its complexity is linear in the number of selector runs. The more a predicate shrinks the selection, the cheaper every subsequent scan becomes.
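To make the two-pointer scan concrete, here is a self-contained re-implementation of the merge. `Sel` is a simplified stand-in for `RowSelector`; the real arrow-rs code differs in detail:

```rust
// Illustrative two-pointer merge behind an and_then-style operation.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Sel {
    Select(usize),
    Skip(usize),
}

// Append a run, coalescing it with a previous run of the same kind.
fn push_run(out: &mut Vec<Sel>, s: Sel) {
    let merged = match (out.last_mut(), s) {
        (Some(Sel::Skip(n)), Sel::Skip(m)) => { *n += m; true }
        (Some(Sel::Select(n)), Sel::Select(m)) => { *n += m; true }
        _ => false,
    };
    if !merged {
        out.push(s);
    }
}

// `left` ranges over all rows; `right` ranges only over the rows that
// `left` selects. Runs are assumed to have non-zero length.
fn and_then(left: &[Sel], right: &[Sel]) -> Vec<Sel> {
    let mut out = Vec::new();
    let mut right_iter = right.iter().copied();
    let mut cur = right_iter.next();
    for &l in left {
        match l {
            Sel::Skip(n) => push_run(&mut out, Sel::Skip(n)),
            Sel::Select(mut n) => {
                // Consume `n` selected rows from the right-hand selection.
                while n > 0 {
                    let run = cur.expect("right selection shorter than left");
                    let (len, is_select) = match run {
                        Sel::Select(m) => (m, true),
                        Sel::Skip(m) => (m, false),
                    };
                    let used = len.min(n);
                    push_run(&mut out, if is_select { Sel::Select(used) } else { Sel::Skip(used) });
                    n -= used;
                    cur = if len > used {
                        Some(if is_select { Sel::Select(len - used) } else { Sel::Skip(len - used) })
                    } else {
                        right_iter.next()
                    };
                }
            }
        }
    }
    out
}

fn main() {
    use Sel::*;
    // The walkthrough from the text: rows 100..150 pass A, then only the
    // first 10 of those pass B; trailing skips coalesce into Skip(90).
    let a = [Skip(100), Select(50), Skip(50)];
    let b = [Select(10), Skip(40)];
    assert_eq!(and_then(&a, &b), vec![Skip(100), Select(10), Skip(90)]);
}
```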

3. Engineering Challenges

Late materialization sounds simple in theory, but implementing it in a production-grade system like arrow-rs is an outright engineering slog. Historically these techniques were notoriously tricky and locked away inside proprietary engines. In the open-source world we have been polishing this for years (just look at this DataFusion ticket), and at last we can go toe-to-toe with full materialization. Getting there required solving several serious engineering challenges.

3.1 Adaptive RowSelection strategy (bitmask vs. RLE)

A major hurdle was picking the right internal representation for RowSelection, because the best choice depends on the sparsity pattern. This paper exposed a key obstacle: there is no one-size-fits-all format for RowSelection. The researchers found that the optimal internal representation is a moving target that shifts with how "dense" or "sparse" the data is.

  • Extremely sparse (e.g., 1 row out of every 10,000): a bitmask is wasteful here (one bit per row adds up), while RLE is beautifully compact: just a few selectors and you are done.
  • Sparse with tiny gaps (e.g., "read 1, skip 1"): RLE produces a fragmented mess that overloads the decoder; a bitmask is far more efficient here.

Since each has its strengths and weaknesses, we settled on an adaptive strategy to get the best of both (details in #arrow-rs/8733):

  • We look at the average run length of the selectors and compare it against a threshold (currently 32). If the average is too small, we switch to a bitmask; otherwise we stick with selectors (RLE).
  • Safety net: a bitmask looks great until Page Pruning enters the picture, which can trigger a nasty "missing page" panic, because the mask may blindly try to filter rows from pages that were never read. The RowSelection logic watches for this recipe for disaster and forces a fallback to RLE to prevent the crash (see 3.1.2).
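The policy in these bullets can be sketched in a few lines. `choose_repr`, `Repr`, and the run-length input are illustrative simplifications, not the arrow-rs API:

```rust
// Sketch of the average-run-length heuristic; the real policy lives in
// arrow-rs and is more nuanced.

const BITMASK_THRESHOLD: usize = 32;

#[derive(Debug, PartialEq)]
enum Repr {
    Bitmask,
    Rle,
}

// `runs` holds the lengths of the select/skip runs in a selection.
// `page_pruning` forces RLE, since a bitmask may index into pages that
// were never read (see 3.1.2).
fn choose_repr(runs: &[usize], page_pruning: bool) -> Repr {
    if page_pruning || runs.is_empty() {
        return Repr::Rle;
    }
    let total: usize = runs.iter().sum();
    let avg = total / runs.len();
    if avg < BITMASK_THRESHOLD { Repr::Bitmask } else { Repr::Rle }
}

fn main() {
    // "Read 1, skip 1" fragmentation → bitmask wins.
    assert_eq!(choose_repr(&[1; 1000], false), Repr::Bitmask);
    // Long runs → RLE wins.
    assert_eq!(choose_repr(&[5000, 5000], false), Repr::Rle);
    // Page pruning active → always RLE, regardless of run lengths.
    assert_eq!(choose_repr(&[1; 1000], true), Repr::Rle);
}
```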

3.1.1 Where did the threshold of 32 come from?

The number 32 was not pulled out of thin air. It came from a data-driven "shoot-out" across a variety of distributions (uniform spacing, exponential sparsity, random noise). It does a good job of separating "fragmented but dense" regions from "long-skip" regions. In the future we may adopt more sophisticated heuristics based on the data type.

The chart below shows one run from the shoot-out. The blue line is read_selector (RLE) and the orange line is read_mask (bitmask). The vertical axis is time (lower is better) and the horizontal axis is the average run length. You can see the performance curves cross right around 32.

Bitmask vs RLE benchmark threshold

3.1.2 The bitmask trap: missing pages

While implementing the adaptive strategy, the bitmask looked perfect on paper but hid a nasty trap when combined with Page Pruning.

Before diving into the details, a quick refresher on pages (more in Section 3.2): a Parquet file is sliced into pages. If we know a page has no rows in the selection, we never touch it at all: no decompression, no decoding. The ArrayReader does not even know it exists.

The scene of the crime:

Imagine reading a chunk of data [0,1,2,3,4,5,6] where the four middle rows [1,2,3,4] are filtered out. It so happens that two of those rows, [2,3], sit in a page of their own, so that page is pruned entirely.

Page pruning example with only first and last rows kept

If we use RLE (RowSelector), executing Skip(4) is smooth sailing: we simply jump over the gap.

RLE skipping pruned pages safely

The problem:

With a bitmask, however, the reader first decodes all of the rows, intending to filter them later. But the middle page does not exist! As soon as the decoder hits that gap, it panics. The ArrayReader is a streaming unit: it does no I/O and therefore has no idea that the layer above decided to prune a page, so it cannot see the cliff ahead.

Bitmask hitting a missing page panic

The fix:

Our current solution is conservative and robust: if we detect page pruning, we disable the bitmask and force a fallback to RLE. In the future we would like to extend the bitmask logic to be aware of page pruning (see #arrow-rs/8845).

// Auto prefers bitmask, but... wait, offset_index says page pruning is on.
let policy = RowSelectionPolicy::Auto { threshold: 32 };
let plan_builder = ReadPlanBuilder::new(1024).with_row_selection_policy(policy);
let plan_builder = override_selector_strategy_if_needed(
    plan_builder,
    &projection_mask,
    Some(offset_index), // page index enables page pruning
);
// ...so we play it safe and switch to Selectors (RLE).
assert_eq!(plan_builder.row_selection_policy(), &RowSelectionPolicy::Selectors);

3.2 Page Pruning

The ultimate performance win is to do no I/O or decoding at all. But in the real world (especially on object storage), issuing a million tiny read requests is a performance killer. arrow-rs uses the Parquet PageIndex to compute exactly which pages contain the data we actually need. For highly selective predicates, skipping pages saves a great deal of I/O, even when the underlying storage client coalesces adjacent range requests. The other major win is reduced CPU: we skip the heavy lifting of decompressing and decoding fully pruned pages entirely.

  • Caveat: if the RowSelection picks even a single row from a page, the entire page must be decompressed. The efficiency of this step therefore depends heavily on the correlation between data clustering and the predicate.
  • Implementation: RowSelection::scan_ranges uses per-page metadata (first_row_index, compressed_page_size) to work out which ranges are skipped entirely, returning only the required (offset, length) list.

The following code example illustrates page skipping:

// Example: two pages; page0 covers 0..100, page1 covers 100..200
let locations = vec![
    PageLocation { offset: 0, compressed_page_size: 10, first_row_index: 0 },
    PageLocation { offset: 10, compressed_page_size: 10, first_row_index: 100 },
];
// RowSelection wants 150..160; page0 is total junk, only read page1
let sel: RowSelection = vec![
    RowSelector::skip(150),
    RowSelector::select(10),
    RowSelector::skip(40),
].into();
let ranges = sel.scan_ranges(&locations);
assert_eq!(ranges.len(), 1); // Only request page1

The figure below illustrates page skipping with an RLE selection. The first page is neither read nor decoded, because none of its rows are selected. The second page is read and fully decompressed (e.g., zstd), and then only the required rows are decoded. The third page is fully decompressed and decoded, because all of its rows are selected.

Page-level scan range calculation

This mechanism acts as the bridge between logical row filtering and physical byte fetching. While we cannot slice a file any finer than a single page (because of compression boundaries), page pruning guarantees that we never pay the decompression cost for a page unless it contributes at least one row to the result. It strikes a pragmatic balance: use the coarse-grained Page Index to skip large swaths of data, and leave the fine-grained RowSelection to handle the specific rows inside surviving pages.
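As a self-contained illustration of the scan-range idea, here is a simplified stand-in that emits byte ranges only for pages contributing at least one selected row. All types and names (`Page`, `Sel`, `scan_ranges`) are hypothetical, not the parquet crate's API:

```rust
// Given page boundaries (first row + byte extent) and an RLE selection,
// return the (offset, length) byte ranges of pages that must be read.

struct Page {
    first_row: usize,
    offset: usize,
    len: usize,
}

#[derive(Clone, Copy)]
enum Sel {
    Select(usize),
    Skip(usize),
}

fn scan_ranges(pages: &[Page], sel: &[Sel], total_rows: usize) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    for (i, p) in pages.iter().enumerate() {
        // Page i covers rows [first_row, next page's first_row).
        let end_row = pages.get(i + 1).map_or(total_rows, |n| n.first_row);
        // Does any Select run overlap that row span?
        let mut row = 0;
        let mut needed = false;
        for &r in sel {
            match r {
                Sel::Select(n) => {
                    if row < end_row && row + n > p.first_row {
                        needed = true;
                    }
                    row += n;
                }
                Sel::Skip(n) => row += n,
            }
        }
        if needed {
            out.push((p.offset, p.len));
        }
    }
    out
}

fn main() {
    // Two pages: rows 0..100 and 100..200; the selection wants rows 150..160.
    let pages = [
        Page { first_row: 0, offset: 0, len: 10 },
        Page { first_row: 100, offset: 10, len: 10 },
    ];
    let sel = [Sel::Skip(150), Sel::Select(10), Sel::Skip(40)];
    // Only the second page's bytes are requested.
    assert_eq!(scan_ranges(&pages, &sel, 200), vec![(10, 10)]);
}
```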

3.3 Smart Caching

Late materialization introduces a structural Catch-22: to skip data effectively, we must first read it. Consider a query like SELECT A FROM table WHERE A > 10. The reader must decode column A to evaluate the filter. In the traditional "read everything" approach this is a non-issue: column A simply stays in memory waiting for projection. In a strict pipeline, however, the "predicate" phase and the "projection" phase are decoupled. Once the filter has produced a RowSelection and the projection phase discovers it needs column A, a second read of the same data is triggered.

Without intervention we would pay a "double tax": one decode to decide what to keep, and another to actually keep it. The CachedArrayReader introduced in #arrow-rs/7850 solves this with a two-tier cache architecture. It lets us store a decoded batch the first time we see it (during filtering) and reuse it later (during projection).

But why two tiers? Why not just one big cache?

  • Shared cache (optimistic reuse): a global cache shared across all columns and readers, with a user-configurable memory limit (capacity). When a page is decoded for a predicate, it lands here. If the projection step runs soon after, it can hit this cache and avoid I/O. But because memory is bounded, eviction can strike at any moment. If we relied on this alone, a heavy workload could evict the data before we need it again.
  • Local cache (deterministic guarantee): a private cache specific to a single column reader. It acts as a safety net. While a column is being actively read, its data is pinned in the local cache. This guarantees the data remains available for the duration of the current operation, immune to evictions from the global shared cache.

The reader follows a strict hierarchy when fetching a page:

  1. Check local: have I already pinned it?
  2. Check shared: did another part of the pipeline decode it recently? If so, promote it to local (pin it).
  3. Read from source: perform the I/O and decoding, then insert into both the local and shared caches.

This dual strategy gives us the best of both worlds: the efficiency of sharing data between the filter and projection steps, and the stability of knowing that essential data will not vanish mid-query under memory pressure.
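The lookup order above can be sketched with HashMaps standing in for the two tiers. `TieredCache` and its fields are toy stand-ins; the real CachedArrayReader is considerably more involved:

```rust
use std::collections::HashMap;

type PageId = usize;
type Decoded = Vec<i64>; // stand-in for a decoded Arrow array

struct TieredCache {
    local: HashMap<PageId, Decoded>,  // pinned for the current reader
    shared: HashMap<PageId, Decoded>, // global, subject to eviction
}

impl TieredCache {
    fn get(&mut self, page: PageId, decode: impl FnOnce() -> Decoded) -> &Decoded {
        // 1) Local hit: already pinned, nothing to do.
        if !self.local.contains_key(&page) {
            // 2) Shared hit: reuse it. 3) Miss: decode and populate shared.
            let value = self.shared.entry(page).or_insert_with(decode).clone();
            // Promote (pin) into the local tier either way.
            self.local.insert(page, value);
        }
        &self.local[&page]
    }
}

fn main() {
    let mut cache = TieredCache { local: HashMap::new(), shared: HashMap::new() };
    // First access decodes; later accesses are served from the local tier
    // even if the shared tier was evicted in between.
    assert_eq!(cache.get(7, || vec![1, 2, 3]), &vec![1, 2, 3]);
    cache.shared.clear(); // simulate shared-cache eviction
    assert_eq!(cache.get(7, || panic!("should not re-decode")), &vec![1, 2, 3]);
}
```

The key property is step 2's promotion: once a page is pinned locally, shared-cache eviction can no longer force a re-decode mid-operation.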

3.4 Minimizing copies and allocations

Another area where arrow-rs invests heavily is avoiding unnecessary copies. Rust's memory-safe design makes copying easy, yet every extra allocation and copy burns CPU cycles and memory bandwidth. A naive implementation often pays a needless tax by decompressing data into a temporary Vec and then memcpy-ing it into an Arrow Buffer.

For fixed-width types (such as integers or floats) this is entirely redundant, because the memory layouts are identical. PrimitiveArrayReader eliminates the overhead with a zero-copy conversion: instead of copying bytes, it simply transfers ownership of the decoded Vec<T> directly to the underlying Arrow Buffer.
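A minimal demonstration of the idea: handing a Vec's allocation to a new owner leaves the backing pointer untouched, so no bytes move. `OwnedBuffer` is a toy stand-in, not the arrow-rs Buffer type:

```rust
// Zero-copy ownership transfer: the Vec's heap allocation is moved, not
// copied, in the spirit of moving a decoded Vec<T> into an Arrow Buffer.

struct OwnedBuffer {
    data: Vec<i32>,
}

impl OwnedBuffer {
    // Takes ownership: no new allocation, no memcpy.
    fn from_vec(data: Vec<i32>) -> Self {
        OwnedBuffer { data }
    }

    fn as_ptr(&self) -> *const i32 {
        self.data.as_ptr()
    }
}

fn main() {
    let decoded: Vec<i32> = (0..1024).collect();
    let before = decoded.as_ptr();
    let buffer = OwnedBuffer::from_vec(decoded); // move, not copy
    // The backing allocation is unchanged: same pointer, zero bytes copied.
    assert_eq!(before, buffer.as_ptr());
}
```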

3.5 Alignment challenges

Chained filtering is a maddening exercise in coordinate systems. "Row 1" in filter N may actually be "row 10,001" in the file, thanks to the filters that ran before it.

  • How do we stay on track?: we fuzz test every RowSelection operation (split_off, and_then, trim). We need absolute certainty that conversions between relative and absolute offsets are exact. That correctness is the bedrock that keeps the reader stable under the triple threat of batch boundaries, sparse selections, and page pruning.
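The kind of invariant those tests pin down can be sketched as follows: however a selection is split at a batch boundary, no rows may be lost or duplicated. `Sel` and `split_off` are simplified stand-ins for the arrow-rs RowSelection API:

```rust
// Invariant check: splitting a selection at any absolute row preserves
// both the total row span and the selected-row count.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Sel {
    Select(usize),
    Skip(usize),
}

// Returns (total rows spanned, rows selected).
fn row_count(sel: &[Sel]) -> (usize, usize) {
    sel.iter().fold((0, 0), |(t, s), r| match r {
        Sel::Select(n) => (t + n, s + n),
        Sel::Skip(n) => (t + n, s),
    })
}

// Split at absolute row `at`: front covers rows [0, at), back covers the rest.
fn split_off(sel: &[Sel], at: usize) -> (Vec<Sel>, Vec<Sel>) {
    let (mut front, mut back) = (Vec::new(), Vec::new());
    let mut remaining = at;
    for &r in sel {
        let (len, select) = match r {
            Sel::Select(n) => (n, true),
            Sel::Skip(n) => (n, false),
        };
        let used = len.min(remaining);
        let mk = |n| if select { Sel::Select(n) } else { Sel::Skip(n) };
        if used > 0 {
            front.push(mk(used));
        }
        if len > used {
            back.push(mk(len - used));
        }
        remaining -= used;
    }
    (front, back)
}

fn main() {
    let sel = vec![Sel::Skip(100), Sel::Select(50), Sel::Skip(50)];
    let (total, selected) = row_count(&sel);
    assert_eq!((total, selected), (200, 50));
    // Splitting anywhere never loses or duplicates rows.
    for at in 0..=total {
        let (front, back) = split_off(&sel, at);
        let (ft, fs) = row_count(&front);
        let (bt, bs) = row_count(&back);
        assert_eq!(ft + bt, total);
        assert_eq!(fs + bs, selected);
        assert_eq!(ft, at); // front spans exactly `at` rows
    }
}
```

Real fuzz tests generate random selections and split points rather than a fixed example, but they assert the same conservation properties.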

4. Conclusion

The Parquet reader in arrow-rs is more than a simple file reader: it is a miniature query engine in disguise. We have folded in high-end features such as predicate pushdown and late materialization. The reader reads only what it needs and decodes only what is necessary, conserving resources while preserving correctness. These capabilities used to be confined to proprietary or tightly integrated systems. Now, thanks to the community's efforts, arrow-rs brings the benefits of advanced query-processing techniques to even lightweight applications.

We invite you to join the community, explore the code, experiment, and contribute to its ongoing evolution. The journey of optimizing data access never ends, and together we can push the boundaries of what open-source data processing can do.

]]>
<a href="https://github.com/hhhizzz">Qiwei Huang</a> and <a href="https://github.com/alamb">Andrew Lamb</a>