
Add zarr #574

Closed
gzuidhof wants to merge 1 commit into Kaggle:main from gzuidhof:master

Conversation

@gzuidhof

@gzuidhof gzuidhof commented Jun 30, 2019

Add zarr, a package commonly used for handling large datasets in a multithreaded fashion. I added a test for it too.

I was unable to verify locally due to unrelated issues, created issue #575 for that.

@LorePep

LorePep commented Jul 3, 2019

Super useful!! 🥇

@rosbo
Contributor

rosbo commented Jul 4, 2019

Hi @gzuidhof,

Have you tried following these steps to install it in your kernel?
https://github.com/Kaggle/docker-python/wiki/Missing-Packages

We are trying to limit the number of packages we add to our base image to keep it manageable.

Thank you

@rosbo rosbo self-requested a review July 4, 2019 16:28
@gzuidhof
Author

gzuidhof commented Jul 4, 2019

Thanks for taking the time to reply.
It is possible to install it through that method, but that is not allowed in offline-only kernels (e.g. for competition submissions).

To make a case for this package: it is in the same space as HDF5 and bcolz, and in my opinion it is better than both (better performance, multiprocessing friendliness, and ease of use), so I think it deserves a spot.

@rosbo
Contributor

rosbo commented Jul 4, 2019

In an effort to prevent an explosion in the # of packages, we are usually only adding very popular packages to our base image (1000+ stars on GitHub). Every new package we add makes our image harder to maintain (+ size, + build time, +dependency conflicts, etc.).

better performance, multiprocessing friendliness and ease of use

Do you have benchmark results or a blog post to support this claim? I would be happy to look at them.

Thank you

@gzuidhof
Author

gzuidhof commented Jul 4, 2019

This blog post (by the author of Zarr) explains the why of Zarr and includes benchmarks. It's three years old, and I think in that time Zarr has only improved. It is in my opinion incredibly simple to use and one of the best things to have been released the past few years in this space.

Maybe I should write a blog post about why everybody should consider using zarr 🤔


A few random thoughts that are not mentioned clearly in the blog post:

  • HDF5 is not thread safe out of the box. Parallel access seems to be possible with MPI, but not every environment allows one to use that (I don't expect it is even possible in a Kaggle kernel). The top voted answer in this question explains the state better than I could. I've also been bitten by HDF5 dependency hell many times in the past; it can be difficult for novice engineers to get working.

  • Bcolz is in my opinion more difficult to use than Zarr. It consists of two primitives: carray, which is basically a one-dimensional array of data, and ctable, which is a columnar store (with most sample-level prediction problems you don't want to query by column, so it is not relevant for many use cases, but it has its place).
    For usage examples you can see this tutorial. The documentation is quite sparse (that tutorial is basically all there is on carrays), and the interfaces for even simple tasks can be complicated.

  • Chunking the data in a custom way (e.g. one image per chunk, so that I can read only that image) is not straightforward there (you would need to work out the chunk size in bytes or in element count).

  • Bcolz thread/multiprocess safety is unknown. Googling it turns up people claiming either way about thread safety, although it seems that it is definitely not multiprocess safe.

  • In zarr, having a dataset on disk and loading a part of it is trivially easy, which saves you from writing so much code. This is one of its absolute killer features; here is an example:

import os
from multiprocessing.pool import ThreadPool

import numpy as np
import zarr
from tqdm import tqdm

def prepare_zarr_dataset(images, output_filepath):
    num_images = len(images)
    z = zarr.open(
        output_filepath,
        mode="w",
        shape=(num_images, PIXELS, PIXELS, 3),
        chunks=(1, None, None, None),  # One image per chunk! Really straightforward :)
        dtype=np.uint8,
    )

    def process_image(enumerated_filepath):
        index, filepath = enumerated_filepath
        image = read_and_preprocess_image(filepath)
        z[index] = image

    with ThreadPool(processes=os.cpu_count()) as p:
        with tqdm(total=num_images) as pbar:
            for _ in p.imap_unordered(process_image, enumerate(images)):
                pbar.update()

Note how I just create the Zarr array, but it only actually writes to disk when I write into a slice. And I don't have to care at all that this happens from different threads (or even processes) at the same time!

And now reading this is equally trivial, here's an example data loader:

from torch.utils.data import Dataset

class ZarrDataset(Dataset):
    # Reads lazily from z; only the indexed datapoints are loaded (useful for cross-validation)
    def __init__(self, z, labels):
        super().__init__()
        self.z = z
        self.labels = labels

    def __len__(self):
        return len(self.z)

    def __getitem__(self, index):
        image = self.z[index]
        label = self.labels[index]
        # maybe do some preprocessing
        return image, label

Neither of these things would have been nearly as simple with HDF5 or bcolz. This will work with any crazy multithreading/multiprocessing scheme too. Did I mention it also supports advanced indexing?

Apologies for the short novel, but I feel strongly about the utility and greatness of Zarr, and would love to see it added 👍

@rosbo rosbo removed their request for review October 15, 2019 21:18
@rosbo rosbo added the new-package Requests for installing new packages label Oct 15, 2019
@Philmod Philmod closed this Jun 2, 2022