
Add zarr #574

Closed
gzuidhof wants to merge 1 commit into Kaggle:main from gzuidhof:master

Conversation

@gzuidhof

@gzuidhof gzuidhof commented Jun 30, 2019

Add zarr, a package commonly used for handling large datasets in a multithreaded fashion. I added a test for it too.

I was unable to verify locally due to unrelated issues, created issue #575 for that.

@LorePep

LorePep commented Jul 3, 2019

Super useful!! 🥇

@rosbo
Contributor

rosbo commented Jul 4, 2019

Hi @gzuidhof,

Have you tried following these steps to install it in your kernel?
https://github.com/Kaggle/docker-python/wiki/Missing-Packages

We are trying to limit the number of packages we add to our base image to keep it manageable.

Thank you

@rosbo rosbo self-requested a review July 4, 2019 16:28
@gzuidhof
Author

gzuidhof commented Jul 4, 2019

Thanks for taking the time to reply.
It is possible to install it through that method, but that is not allowed in offline-only kernels (e.g. for competition submissions).

To make a case for this package: it is in the same space as HDF5 and bcolz, and in my opinion it is better than both (better performance, multiprocessing friendliness, and ease of use), so I think it deserves a spot.

@rosbo
Contributor

rosbo commented Jul 4, 2019

In an effort to prevent an explosion in the # of packages, we are usually only adding very popular packages to our base image (1000+ stars on GitHub). Every new package we add makes our image harder to maintain (+ size, + build time, +dependency conflicts, etc.).

better performance, multiprocessing friendliness and ease of use

Do you have benchmark results or a blog post to support this claim? I would be happy to look at them.

Thank you

@gzuidhof
Author

gzuidhof commented Jul 4, 2019

This blog post (by the author of Zarr) explains the why of Zarr and includes benchmarks. It's three years old, and I think in that time Zarr has only improved. It is in my opinion incredibly simple to use and one of the best things to have been released the past few years in this space.

Maybe I should write a blog post about why everybody should consider using zarr 🤔


A few random thoughts that are not mentioned clearly in the blog post:

  • HDF5 is not thread safe out of the box. Parallel access seems to be possible with MPI, but not every environment allows one to use that (I don't expect it is even possible in a Kaggle kernel). The top voted answer in this question explains the state better than I could. I've also been bitten by HDF5 dependency hell many times in the past; it can be difficult for novice engineers to get working.

  • Bcolz is in my opinion more difficult to use than Zarr. It consists of two primitives: carray, which is basically a one-dimensional array of data, and ctable, which is a columnar store (with most sample-level prediction problems you don't want to query by column, so it is not relevant for many use cases, but it has its place).
    For usage examples you can see this tutorial. The documentation is quite sparse (that tutorial is basically all there is on carrays), and the interfaces for even simple tasks can be complicated.

  • Chunking the data in a custom way (e.g. one image per chunk, so that I can read only that image) is not straightforward there (you would need to work out the chunk size in bytes or in element count).

  • Bcolz thread/multiprocess safety is unknown. Googling it turns up people claiming either way about thread safety, although it seems that it is definitely not multiprocess safe.

  • In zarr, having a dataset on disk and loading a part of it is trivially easy, which saves you from writing so much code. This is one of its absolute killer features; here is an example:

import os
from multiprocessing.pool import ThreadPool

import numpy as np
import zarr
from tqdm import tqdm

def prepare_zarr_dataset(images, output_filepath):
    num_images = len(images)
    z = zarr.open(
        output_filepath,
        mode="w",
        shape=(num_images, PIXELS, PIXELS, 3),
        chunks=(1, None, None, None),  # One image per chunk! Really straightforward :)
        dtype=np.uint8,
    )

    def process_image(enumerated_filepath):
        index, filepath = enumerated_filepath
        image = read_and_preprocess_image(filepath)
        z[index] = image

    with ThreadPool(processes=os.cpu_count()) as p:
        with tqdm(total=num_images) as pbar:
            for _ in p.imap_unordered(process_image, enumerate(images)):
                pbar.update()

Note how I just create the Zarr array, but it only actually writes to disk when I write into a slice. And I don't have to care at all that this happens from different threads (or even processes) at the same time!

And now reading this is equally trivial, here's an example data loader:

from torch.utils.data import Dataset

class ZarrDataset(Dataset):
    # Reads lazily from z; only the indexed datapoints are loaded (useful for cross-validation)
    def __init__(self, z, labels):
        super().__init__()
        self.z = z
        self.labels = labels

    def __len__(self):
        return len(self.z)

    def __getitem__(self, index):
        image = self.z[index]
        label = self.labels[index]
        # maybe do some preprocessing
        return image, label

Neither of these things would have been nearly as simple with HDF5 or bcolz. This will work with any crazy multithreading/multiprocessing scheme too. Did I mention it also supports advanced indexing?

Apologies for the short novel, but I feel strongly about the utility and greatness of Zarr, and would love to see it added 👍

@rosbo rosbo removed their request for review October 15, 2019 21:18
@rosbo rosbo added the new-package Requests for installing new packages label Oct 15, 2019
@Philmod Philmod closed this Jun 2, 2022