
Feature caching proposal: CachedDynamicItem#2985

Merged
Adel-Moumen merged 12 commits into speechbrain:develop from pplantinga:caching-dynamic-item-dataset
Oct 30, 2025

Conversation

@pplantinga
Collaborator

Do we really need yet another proposal for feature caching? We already have:

and possibly others. However, each of them changes 10+ files and adds new recipes just for feature extraction. This proposal is simpler and, I think, fits the design philosophy of SpeechBrain better.

The core idea: just add @cache(directory) to data items that should have their result cached. That's it. The result of each dynamic item is stored in a file, computed lazily, and can even be computed on GPU -- although there are some limitations here that could be discussed.
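As a rough illustration of that core idea, a per-item caching decorator could look like the sketch below. All names and the uid-based signature are assumptions for illustration only, and plain pickle stands in for the torch.save backend the proposal actually uses:

```python
import os
import pickle
import tempfile

def cache(directory):
    # Illustrative sketch only -- names and signature are assumptions,
    # not the merged SpeechBrain API. The PR stores results with
    # torch.save; stdlib pickle keeps this sketch dependency-free.
    os.makedirs(directory, exist_ok=True)

    def decorator(fn):
        def wrapper(uid, *args):
            path = os.path.join(directory, f"{uid}.pkl")
            if os.path.exists(path):            # cache hit: load from disk
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = fn(uid, *args)             # cache miss: compute lazily
            with open(path, "wb") as f:         # persist for later epochs
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

cache_dir = tempfile.mkdtemp()
calls = []

@cache(cache_dir)
def features(uid, signal):
    calls.append(uid)                # track how often we actually compute
    return [x * 2 for x in signal]

print(features("utt1", [1, 2, 3]))   # computed and written to disk
print(features("utt1", [1, 2, 3]))   # served from the cache file
print(len(calls))                    # 1: computation ran only once
```

The second call never touches the wrapped function, which is the whole point: the expensive dynamic item runs once per utterance across all epochs.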

Let's weigh this proposal against the others and decide which offers the best functionality vs. ease-of-use trade-off.

@pplantinga pplantinga added this to the v1.1.0 milestone Oct 20, 2025
@pplantinga pplantinga self-assigned this Oct 20, 2025
@pplantinga pplantinga added the enhancement New feature or request label Oct 20, 2025
@TParcollet
Collaborator

I like it! @Adel-Moumen what do you think?

Do you know how fast it is? I'm thinking about extraction of very large datasets.

@Adel-Moumen
Collaborator

> I like it! @Adel-Moumen what do you think?
>
> Do you know how fast it is? I'm thinking about extraction of very large datasets.

I think, so far, this is the 'cleanest' proposed implementation and the one most aligned with SpeechBrain principles.

@pplantinga, I think it would be nice to extend this proposal by supporting different storage backends. In this proposal you've used torch and save one file per item; however, some compute facilities like Compute Canada limit the number of files you can create. I would therefore suggest making the backend configurable (e.g. numpy or h5) so that everything can be saved within the same file. Some of the proposed PRs supported this feature (mine, I think).

Additionally, if we want to push this PR into our recipes, I would say that in practice we should do it in two stages: 1. data extraction and saving, 2. training. This way, the second stage can increase the batch size thanks to the VRAM freed by removing the feature extractor from the GPU (e.g. an SSL encoder like HuBERT). Of course, this will depend on the recipe (e.g. FBanks do not need two stages). Maybe we can discuss how we want to implement this feature in recipes? cc @TParcollet

Finally, I propose that we list all the features supported by each proposed PR (e.g. multiple storage backends), pick the ones we think should be implemented in SB, close those PRs, and implement these features within this one. I don't think we should keep them alive, as yours is much better (congrats!).

@Adel-Moumen
Collaborator

After looking at the 3 different PRs, I think we can close them all. What I would just keep in mind is that it would be better if we could:

  • define the saving/loading backend (e.g. numpy vs. torch vs. h5)
  • a la webdataset, allow buckets of saved elements (e.g. 1000 elements per sharded file)
  • allow defining the dtype of the saved features (e.g. uint16); for loading, we could just write a YAML file in the folder describing how to read the saved features (e.g. the specific dtype)
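The sharding idea in the second bullet could be sketched as a simple index-to-shard mapping; the bucket size and file naming below are illustrative, not a committed design:

```python
# Toy sketch of webdataset-style sharding: group N items per shard file
# so that filesystems with file-count quotas see few files. The naming
# scheme and default bucket size here are assumptions for illustration.
def shard_index(item_index, items_per_shard=1000):
    shard = item_index // items_per_shard    # which shard file to open
    offset = item_index % items_per_shard    # position inside that shard
    return f"shard_{shard:05d}", offset

print(shard_index(0))      # ('shard_00000', 0)
print(shard_index(2345))   # ('shard_00002', 345)
```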

@pplantinga
Collaborator Author

> Do you know how fast it is? I'm thinking about extraction of very large datasets.

I have not tested this, but the speed is potentially a limitation of this dataset-based approach, as the samples cannot be batched together (caching is done in data pipeline before the batches are created). But the dataset only has to be iterated slow-ish once and never again, so perhaps this is an acceptable cost for the ease of use? Also, there may be a clever way to get batching here that I haven't yet thought of.

> I think it would be nice to extend this proposal by supporting different storage backends.

I've added an h5py backend; it was quite straightforward to add, take a look! Along with this, a question: should h5py be added to SpeechBrain's dependencies, or should this be migrated to integrations?

> I would say that in practice, we should do it in two stages: 1. data extraction and saving, 2. training.

I have updated the speakerid template to do this in two stages which allows num_workers to be > 0. Basically, the recipe just iterates the dataset once, then converts to read-only. Could even add a tqdm here if we wanted to show progress of caching. Question: do we want to add some sort of function that simplifies this further?
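A minimal sketch of that warm-the-cache pass, with hypothetical names (the real recipe code may differ):

```python
# Sketch of a cache-warming helper -- the function name and dataset
# handling are assumptions, not the merged recipe code.
def warm_cache(dataset, progressbar=None):
    """Iterate every item once so each cached dynamic item writes its
    result to disk; training afterwards reads only from the cache.
    `progressbar` can be e.g. tqdm.tqdm to show progress."""
    iterator = dataset if progressbar is None else progressbar(dataset)
    for _ in iterator:   # touching each item triggers (and caches) it
        pass

# Usage with a toy "dataset" that records which items were computed:
computed = []

def toy_dataset():
    for uid in ["utt1", "utt2", "utt3"]:
        computed.append(uid)   # stands in for the cached computation
        yield uid

warm_cache(toy_dataset())
print(computed)  # ['utt1', 'utt2', 'utt3']
```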

@Adel-Moumen
Collaborator

> I have not tested this, but the speed is potentially a limitation of this dataset-based approach, as the samples cannot be batched together (caching is done in data pipeline before the batches are created). But the dataset only has to be iterated slow-ish once and never again, so perhaps this is an acceptable cost for the ease of use? Also, there may be a clever way to get batching here that I haven't yet thought of.

Since we are not claiming to be a SOTA toolkit for production-ready systems (e.g. trained on 1M hours of speech), I think it's OK not to have batching support. Of course, this can be a bit painful on some datasets, e.g. LibriLight, but I don't think it should be a major blocker. I also tend to think that for speech tokens (e.g. extracted from codecs), the dataset iteration won't be optimal, since you are using a large VQ-VAE system to extract the representations and therefore want a batch size > 1 to speed up the process. But this is not necessarily a major blocker either: I have been extracting speech tokens with batch size 1 and it was OK (you just need to scale up the GPU grid a bit).

So I do think what you are proposing is the best trade-off we can make between SB identity and efficiency.

> I've added an h5py backend, was quite straightforward to add, take a look! But along with this, a question: should h5py be added to SpeechBrain dependencies, or should this be migrated to integrations?

I would say integrations, as it is not a major dependency: we can just use torch instead.

One note: I really do like the design! I would say we can already start adding tests within this PR, as I don't think we should expect a major re-design.

> Could even add a tqdm here if we wanted to show progress of caching.

Yup.

> Question: do we want to add some sort of function that simplifies this further?

What are you thinking about?

@pplantinga
Collaborator Author

Okay, the hdf5 backend is in the integrations folder and there is a function to warm the cache with optional progressbar.

> a la webdataset, having the possibility to have buckets of saved elements (e.g. 1000 elements per sharded file)

Do we need this on top of hdf5 which already provides bucketing?

> having the possibility to define the dtype of the saved features e.g. uint16

Can't this just be done by converting the features to the desired type within the DynamicItem function?

> I would say we can already start integrating tests within this PR, as I don't think we should expect major re-design.

The doctest already gives 84% coverage (see the README) but do you think we need unit tests here as well?

All my questions are just trying to determine what is really needed here.
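For the dtype question above, casting inside the dynamic item could be as simple as the sketch below; `compute_feats` is an illustrative name, and stdlib int16 arrays stand in for the `tensor.to(torch.float16)` casting mentioned in the discussion:

```python
from array import array

# Sketch: converting inside the dynamic item itself, so the cached file
# already holds the compact type. `compute_feats` is an illustrative
# name; a real pipeline would cast a torch tensor instead.
def compute_feats(signal):
    # quantise floats in [-1, 1] to 16-bit integers before caching
    return array("h", (int(x * 32767) for x in signal))

feats = compute_feats([0.0, 0.5, -0.5])
print(feats.typecode, list(feats))  # h [0, 16383, -16383]
```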

@Adel-Moumen
Collaborator


Btw, I was thinking, and I actually don't think we are constrained by the batching issue. The current caching decorator checks if the uid is part of the hdf5 file (or self._uid2path). Therefore, what we can do is just have, let's say, a feat_extractor.py file that batchifies compute_forward and saves the output in the same format as required by the dataloader!
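A toy sketch of that batched pre-extraction idea, where every name (`extract_all`, `uid_is_cached`, `save_to_cache`) is hypothetical and an in-memory dict stands in for the hdf5/_uid2path backend:

```python
# Hypothetical sketch of batched pre-extraction: skip uids that are
# already cached, then compute the rest in batches and store each
# result under its uid, as the dataloader would later read it.
def extract_all(uids, signals, batch_size=4):
    cache = {}

    def uid_is_cached(uid):
        return uid in cache        # stands in for the hdf5/uid2path check

    def save_to_cache(uid, feats):
        cache[uid] = feats         # stands in for writing the backend file

    pending = [(u, s) for u, s in zip(uids, signals) if not uid_is_cached(u)]
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        # batched "compute_forward": here just a toy doubling of samples
        feats = [[x * 2 for x in sig] for _, sig in batch]
        for (uid, _), f in zip(batch, feats):
            save_to_cache(uid, f)
    return cache

out = extract_all(["a", "b", "c"], [[1], [2], [3]], batch_size=2)
print(out)  # {'a': [2], 'b': [4], 'c': [6]}
```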

> Do we need this on top of hdf5 which already provides bucketing?

Not necessarily!

> Can't this just be done by converting the features to the desired type within the DynamicItem function?

Yes, but I believe it's less clear unless we force the casting, e.g. return tensor.to(torch.float16).

@Adel-Moumen
Collaborator

Hey @pplantinga, do you know what is missing from this PR? Thanks!

@pplantinga
Collaborator Author

@Adel-Moumen Do we need a tutorial here?

@pplantinga
Collaborator Author

Also, possibly some testing with DDP

@pplantinga
Collaborator Author

Check here to see the section I added about the cached pipeline:

https://speechbrain--2985.org.readthedocs.build/en/2985/tutorials/basics/data-loading-pipeline.html#cached-pipeline

@pplantinga
Collaborator Author

I ran a very basic test on DDP and didn't see any immediate issues. I guess since each process is using different data, this should be fine. For the hdf5 integration, the cache should be built ahead of time, so no worries there on the DDP front.

I'd say this is good to go, unless @Adel-Moumen has other comments / suggestions.

Collaborator

@Adel-Moumen Adel-Moumen left a comment


LGTM!

@Adel-Moumen
Collaborator

Thanks @pplantinga ! This is a great new feature in SB.

@Adel-Moumen Adel-Moumen merged commit dadc1d3 into speechbrain:develop Oct 30, 2025
5 checks passed
@pplantinga pplantinga deleted the caching-dynamic-item-dataset branch October 31, 2025 00:23

Labels

enhancement New feature or request


3 participants