Conversation
Super useful!! 🥇

Hi @gzuidhof, have you tried following these steps to install it in your kernel? We are trying to limit the number of packages we add to our base image to keep it manageable. Thank you

Thanks for taking the time to reply. To make a case for this package: this package is up there with
In an effort to prevent an explosion in the number of packages, we usually only add very popular packages to our base image (1000+ stars on GitHub). Every new package we add makes our image harder to maintain (+size, +build time, +dependency conflicts, etc.).
Do you have benchmark results or a blog post to support this claim? I would be happy to look at it. Thank you
This blog post (by the author of Zarr) explains the why of Zarr and includes benchmarks. It's three years old, and I think Zarr has only improved in that time. In my opinion it is incredibly simple to use and one of the best things to have been released in this space in the past few years. Maybe I should write a blog post about why everybody should consider using zarr 🤔

Four random thoughts that are not mentioned clearly in the blog post:
import os
from multiprocessing.pool import ThreadPool

import numpy as np
import zarr
from tqdm import tqdm

# PIXELS and read_and_preprocess_image are defined elsewhere
def prepare_zarr_dataset(images, output_filepath):
    num_images = len(images)
    z = zarr.open(
        output_filepath,
        mode="w",
        shape=(num_images, PIXELS, PIXELS, 3),
        chunks=(1, None, None, None),  # One image per chunk! Really straightforward :)
        dtype=np.uint8,
    )

    def process_image(enumerated_filepath):
        index, filepath = enumerated_filepath
        image = read_and_preprocess_image(filepath)
        z[index] = image

    with ThreadPool(processes=os.cpu_count()) as p:
        with tqdm(total=num_images) as pbar:
            for _ in p.imap_unordered(process_image, enumerate(images)):
                pbar.update()

Note how I just create the Zarr array, but it actually only writes to disk when I write into a slice. And I don't have to care at all that this happens from different threads (or even processes) at the same time! And now reading this is equally trivial, here's an example data loader:

from torch.utils.data import Dataset

class ZarrDataset(Dataset):
    # Indices are the only datapoints that will be read from z and labels (useful for cross-validation)
    def __init__(self, z, labels):
        super().__init__()
        self.z = z
        self.labels = labels

    def __len__(self):
        return len(self.z)

    def __getitem__(self, index):
        image = self.z[index]
        label = self.labels[index]
        # maybe do some preprocessing
        return image, label

Neither of these things would have been nearly as simple with hdf5 or bcolz. This will work with any crazy multithreading/multiprocessing scheme too. Did I mention it also supports advanced indexing?

Apologies for the short novel, but I feel strongly about the utility and greatness of Zarr, and would love to see it added 👍
Add zarr, a package that is useful and commonly used for handling large datasets in a multithreaded fashion.
Added a test for it too.
I was unable to verify locally due to unrelated issues; created issue #575 for that.