feat/batch creation by d-v-b · Pull Request #2665 · zarr-developers/zarr-python

d-v-b · 2025-01-07T13:27:35Z

This PR adds a few routines for creating a collection of arrays and groups (i.e., a dict with path-like keys and ArrayMetadata / GroupMetadata values) in storage concurrently.

create_hierarchy takes a dict representation of a hierarchy, parses that dict to ensure that there are no implicit groups (creating group metadata documents as needed), then invokes create_nodes and yields the results.
create_nodes concurrently writes metadata documents to storage, and yields the created AsyncArray / AsyncGroup instances.

I still need to wire up concurrency limits, and test them.
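For reference, one common way to cap concurrent writes is an asyncio.Semaphore. The snippet below is only a sketch of that idea, not what this PR does: `write_all`, the in-memory dict store, and the `limit` parameter are all hypothetical names, and `asyncio.sleep(0)` stands in for real storage IO.

```python
import asyncio


async def write_all(docs: dict[str, bytes], store: dict[str, bytes], limit: int = 4) -> int:
    """Write every document, with at most `limit` writes in flight at once.

    Returns the peak number of simultaneous writes that was observed.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def write_one(key: str, value: bytes) -> None:
        nonlocal in_flight, peak
        async with semaphore:  # blocks while `limit` writes are already running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for real storage IO
            store[key] = value
            in_flight -= 1

    await asyncio.gather(*(write_one(k, v) for k, v in docs.items()))
    return peak
```

All writes are still scheduled up front; the semaphore only bounds how many of them touch storage at the same time.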

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/tutorial.rst
Changes documented in docs/release.rst
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)


d-v-b · 2025-01-10T14:40:15Z

this is now working, so I would appreciate some feedback on the design.

The basic design is the same as what I outlined earlier in this PR: there are two new functions that take a dict[path, GroupMetadata | ArrayMetadata] like {'a': GroupMetadata(zarr_format=3), 'a/b': ArrayMetadata(...)} and concurrently persist those metadata documents to storage, resulting in a hierarchy on disk that looks like the dict.

approach

basically the same as concurrent group members listing, except we don't need any recursion. I'm scheduling writes and using as_completed to yield Arrays / Groups when they are available.
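A rough sketch of the "schedule writes, then yield with as_completed" pattern described above. This is not the code in the PR: `write_metadata`, `create_nodes_sketch`, and the dict-backed store are hypothetical stand-ins, and metadata documents are plain strings here rather than ArrayMetadata / GroupMetadata objects.

```python
import asyncio
from collections.abc import AsyncIterator


async def write_metadata(store: dict[str, str], key: str, document: str) -> str:
    """Pretend to persist one metadata document, then report which node it was."""
    await asyncio.sleep(0)  # stand-in for real storage IO
    store[f"{key}/zarr.json"] = document
    return key


async def create_nodes_sketch(
    store: dict[str, str], nodes: dict[str, str]
) -> AsyncIterator[tuple[str, str]]:
    """Schedule every write up front, then yield each node as its write finishes."""
    tasks = [asyncio.create_task(write_metadata(store, k, doc)) for k, doc in nodes.items()]
    for finished in asyncio.as_completed(tasks):
        key = await finished
        yield key, nodes[key]


async def main() -> tuple[dict[str, str], list[str]]:
    store: dict[str, str] = {}
    nodes = {"a": '{"node_type": "group"}', "a/b": '{"node_type": "array"}'}
    created = [key async for key, _ in create_nodes_sketch(store, nodes)]
    return store, created
```

Because the input is already flat, a single pass over the dict is enough; there is no need to walk the tree recursively. The order in which nodes are yielded depends on which write completes first.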

new functions

create_nodes is low-level and doesn't do any checking of its input, so it will happily create invalid hierarchies, e.g. nesting groups inside arrays, or mixing v2 and v3 metadata, and it won't create intermediate groups, either.
create_hierarchy is higher level: it parses the input, checks it for invalid hierarchies, and inserts implicit groups as needed.
Group.create_hierarchy is a new method on the Group / AsyncGroup classes that takes a hierarchy dict and creates the nodes specified in that dict at locations relative to the path of the group instance. The return value is dict[str, AsyncGroup | AsyncArray], but I guess it also doesn't have to return anything, or it could be an async iterator, so that you can interact with the nodes as they are formed. This is flexible right now, but I think the iterator idea is nice.
_from_flat (names welcome) is a new function that creates a group entirely from a hierarchy dict + a store. that dict must specify a root group, otherwise an exception is raised. We could revise this to create a root group if one is not specified. Open to suggestions here.
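To make the kind of checking that separates create_hierarchy from create_nodes concrete, here is a minimal sketch of a validator that rejects the two invalid cases mentioned above (nodes nested inside arrays, and mixed v2 / v3 metadata). `check_hierarchy` and the plain-dict metadata with `node_type` / `zarr_format` keys are assumptions made for illustration; the real functions operate on ArrayMetadata / GroupMetadata objects.

```python
from pathlib import PurePosixPath


def check_hierarchy(nodes: dict[str, dict]) -> None:
    """Raise ValueError if the flat hierarchy is invalid.

    Each value is a plain dict with "node_type" ("array" or "group") and
    "zarr_format" (2 or 3). The key "" denotes the root node.
    """
    formats = {meta["zarr_format"] for meta in nodes.values()}
    if len(formats) > 1:
        raise ValueError(f"mixed zarr formats: {sorted(formats)}")

    for key in nodes:
        for parent in PurePosixPath(key).parents:
            # PurePosixPath spells the root as "."; this sketch spells it "".
            parent_key = "" if str(parent) == "." else str(parent)
            parent_meta = nodes.get(parent_key)
            if parent_meta is not None and parent_meta["node_type"] == "array":
                raise ValueError(f"{key!r} is nested inside the array at {parent_key!r}")
```

A low-level writer that skips this step would accept both kinds of invalid input without complaint, which is the trade-off described for create_nodes.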

Implicit groups

Partial hierarchies like {'a': GroupMetadata(), 'a/b/c': ArrayMetadata(...)} implicitly denote a group at a/b. When creating such a hierarchy, if we find an existing group at a/b, then we don't need to create a new one. So in the context of modeling a hierarchy, implicit groups are a little special -- by not specifying the properties of the group, the user / application is tolerant of any group being there. So I introduced a subclass of GroupMetadata called _ImplicitGroupMetadata, which can be inserted into a hierarchy dict to explicitly denote groups that don't need to be written if one already exists. _ImplicitGroupMetadata is just like GroupMetadata except it errors if you try to set any parameter except zarr_format.
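A small sketch of the implicit-group idea, under simplifying assumptions: metadata is a plain dict, an `"implicit": True` flag plays the role of _ImplicitGroupMetadata, and `add_implicit_groups` / `nodes_to_write` are hypothetical helper names. The store root is left out to keep the example close to the `a/b` case above.

```python
from pathlib import PurePosixPath


def add_implicit_groups(nodes: dict[str, dict]) -> dict[str, dict]:
    """Return a copy of `nodes` with a marked group for every missing parent path."""
    out = dict(nodes)
    for key in nodes:
        for parent in PurePosixPath(key).parents:
            if str(parent) == ".":
                continue  # root handling is omitted from this sketch
            # setdefault keeps any explicit node the caller already supplied
            out.setdefault(str(parent), {"node_type": "group", "implicit": True})
    return out


def nodes_to_write(nodes: dict[str, dict], existing_groups: set[str]) -> dict[str, dict]:
    """Drop implicit groups whose path already holds a group in storage."""
    return {
        key: meta
        for key, meta in nodes.items()
        if not (meta.get("implicit") and key in existing_groups)
    }
```

Explicit groups are always written, even if a group already exists at that path; only the implicit ones are skipped.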

streaming v2 vs v3 node creation

creating v3 arrays / groups requires writing 1 metadata document, but v2 requires 2. To get the most concurrency I await the write of each metadata document separately, which means that foo/.zattrs might resolve before foo/.zarray. So in the v2 case I only yield up an array / group when both documents were written.
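The v2 bookkeeping can be sketched roughly like this: each document write is its own task, and a node is only yielded once its set of outstanding documents is empty. `write_doc`, `create_v2_arrays`, and the dict store are hypothetical, and the example covers arrays only (.zarray plus .zattrs).

```python
import asyncio
from collections.abc import AsyncIterator

V2_ARRAY_DOCS = (".zarray", ".zattrs")


async def write_doc(store: dict[str, str], node: str, name: str) -> tuple[str, str]:
    """Pretend to write one v2 metadata document and report which one it was."""
    await asyncio.sleep(0)  # stand-in for real storage IO
    store[f"{node}/{name}"] = "{}"
    return node, name


async def create_v2_arrays(store: dict[str, str], node_keys: list[str]) -> AsyncIterator[str]:
    """Write both documents of every node concurrently; yield a node once both exist."""
    pending = {key: set(V2_ARRAY_DOCS) for key in node_keys}
    tasks = [
        asyncio.create_task(write_doc(store, key, name))
        for key in node_keys
        for name in V2_ARRAY_DOCS
    ]
    for finished in asyncio.as_completed(tasks):
        node, name = await finished
        pending[node].discard(name)
        if not pending[node]:  # both .zarray and .zattrs have been written
            yield node
```

With v3 the pending set would hold a single zarr.json per node, so every completed write could be yielded immediately.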

Overlap with metadata consolidation logic

there's a lot of similarity between the stuff in this PR and routines used for consolidated metadata. it would be great to find ways to factor out some of the overlap areas

still to do:

write some more tests (checking that implicit groups don't get written if a group already exists)
handle overwriting. I think the plan here is, if overwrite is False, then we do a check before any writing to ensure that there are no conflicts between the proposed hierarchy and the stuff that actually exists in storage. this check will involve more IO.
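A sketch of what the pre-write check could look like when overwrite is False. This is only one possible shape, not a committed design: `find_conflicts` and `check_before_write` are hypothetical names, the store is modeled as a set of existing keys, and the special tolerance for implicit groups is not handled here.

```python
METADATA_NAMES = ("zarr.json", ".zarray", ".zgroup")


def find_conflicts(store_keys: set[str], proposed: list[str]) -> list[str]:
    """Return the proposed node paths that already hold a metadata document."""
    conflicts = []
    for node in proposed:
        prefix = f"{node}/" if node else ""  # "" is the root node
        if any(f"{prefix}{name}" in store_keys for name in METADATA_NAMES):
            conflicts.append(node)
    return conflicts


def check_before_write(store_keys: set[str], proposed: list[str], overwrite: bool) -> None:
    """Fail before any write happens if overwrite is False and something is in the way."""
    if overwrite:
        return
    conflicts = find_conflicts(store_keys, proposed)
    if conflicts:
        raise FileExistsError(f"nodes already exist at: {conflicts}")
```

Against a real store, each membership test would be an existence check, which is the extra IO mentioned above.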

TomNicholas · 2025-02-13T20:46:35Z

IMO they should be relative to the path of the group

That sounds fine as it's clear that the group.create_hierarchy API is already looking at one specific group.


d-v-b · 2025-02-18T14:14:15Z

removed the path and allow_root keyword arguments to create_hierarchy. the keys are now paths relative to the root of the store. for the Group.create_hierarchy method, the keys are relative to the path to the group, which ensures that only sub-arrays and sub-groups of that group can be created.
created a new core/sync_group.py module, which contains the synchronous wrappers around async core/group.py functions. this allows us to have two functions named create_nodes, where one is async, and the other is sync, but neither is public. We will eventually want a more structured layout but that's for a later PR.
usage examples added to docstrings for create_hierarchy
examples added to docs, in both the quickstart and group pages
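To illustrate the relative-key behaviour of Group.create_hierarchy described in the first item, here is a minimal sketch of the key prefixing. `prefix_keys` is a hypothetical helper, not the function used in the PR, and the values are placeholders for metadata objects.

```python
def prefix_keys(group_path: str, nodes: dict[str, object]) -> dict[str, object]:
    """Rewrite group-relative keys into store-relative keys.

    Every output key starts with the path of the group, so only nodes
    below that group can be produced.
    """
    base = group_path.strip("/")
    out = {}
    for key, meta in nodes.items():
        relative = key.strip("/")
        # Skip empty parts so a root group ("") or an empty key does not add stray slashes.
        out["/".join(part for part in (base, relative) if part)] = meta
    return out
```

For example, keys `x` and `x/y` passed to a group at `root/g` become `root/g/x` and `root/g/x/y`, while the top-level function would use the same keys unchanged, relative to the root of the store.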

in the interest of a narrow scope, I've limited the public api to just create_hierarchy.

dcherian

Nice. The public API create_hierarchy looks nice to me.

d-v-b · 2025-02-21T15:39:33Z

test failure is unrelated to this PR (looks like an fsspec thing)

d-v-b added 8 commits December 11, 2024 15:38

sketch out batch creation routine

8faf994

scratch state of easy batch creation

8952911

Merge branch 'main' of https://github.com/d-v-b/zarr-python into feat…

de3c594

…/batch-creation

rename tupleize keys

c700e39

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

986d68b

…into feat/batch-creation

Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…

97b768f

…at/batch-creation

Merge branch 'feat/batch-creation' of github.com:d-v-b/zarr-python in…

b6bf2dd

…to feat/batch-creation

tests and proper implementation for create_nodes and create_hierarchy

57ceb64

d-v-b requested review from dcherian and jhamman January 7, 2025 13:27

d-v-b added 7 commits January 7, 2025 14:28

privatize

181d3d0

use Posixpath instead of Path in tests; avoid redundant cast

e8e6107

restore cast

4f2c954

Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…

dd4174c

…at/batch-creation

pureposixpath instead of posixpath

cf72834

group-level create_hierarchy

e2cff8c

docstring

0912ecb

normanrz added this to the After 3.0.0 milestone Jan 7, 2025

d-v-b added 3 commits January 8, 2025 19:12

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

04f7922

…into feat/batch-creation

sketch out from_flat for groups

089feef

better concurrency for v2

116ab87

dstansby added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 9, 2025

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

246f862

…into feat/batch-creation

github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Jan 9, 2025

d-v-b added 2 commits January 9, 2025 21:42

revert change to default concurrency

e38c1ca

create root correctly

2fb9083

d-v-b mentioned this pull request Jan 10, 2025

creating groups from dicts #2685

Open

d-v-b added 2 commits January 10, 2025 15:03

working _from_flat

b099fba

Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…

64b54bf

…at/batch-creation

add function signature tests

afe47cd

d-v-b added 4 commits February 13, 2025 21:47

update exception name

a2547b3

refactor: remove path kwarg, bring back ImplicitGroupMetadata

9f0ccfa

prune top-level synchronous API

42b9804

more api pruning

d7d0070

d-v-b mentioned this pull request Feb 14, 2025

create_array creates explicit groups #2795

Merged

d-v-b added 6 commits February 14, 2025 17:49

put sync wrappers in sync_group module, move utils to utils

afdc320

Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…

e74445b

…at/batch-creation

ensure we always have a root group

50b02b4

Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…

fdc1c8f

…at/batch-creation

docs

7c56b87

fix group.create_hierarchy to properly prefix keys with the name of t…

8245e80

…he group

d-v-b added 3 commits February 18, 2025 15:24

docstrings

df2bdc6

docstrings

35afe7f

docstring examples

77264e4

d-v-b requested review from TomAugspurger, TomNicholas and jhamman February 18, 2025 18:14

TomNicholas approved these changes Feb 19, 2025

View reviewed changes

Merge branch 'main' into feat/batch-creation

3bf83ad

dcherian approved these changes Feb 21, 2025

View reviewed changes

d-v-b enabled auto-merge (squash) February 21, 2025 15:30

d-v-b disabled auto-merge February 21, 2025 15:39

Merge branch 'main' into feat/batch-creation

11e3fa1

d-v-b enabled auto-merge (squash) February 23, 2025 17:30

d-v-b merged commit 8d2fb47 into zarr-developers:main Feb 23, 2025

jhamman mentioned this pull request Mar 20, 2025

DataTree.to_zarr() is very slow writing to high latency store pydata/xarray#9455

Open

d-v-b merged 84 commits into zarr-developers:main from d-v-b:feat/batch-creation

7 participants
