
Feature caching proposal: CachedDynamicItem#2985

Merged
Adel-Moumen merged 12 commits into speechbrain:develop from pplantinga:caching-dynamic-item-dataset
Oct 30, 2025

Conversation

@pplantinga
Collaborator

Do we really need yet another proposal for feature caching? We already have:

and possibly others. However, each of them changes 10+ files and adds new recipes just for feature extraction. This proposal is simpler and, I think, fits the design philosophy of SpeechBrain better.

The core idea: just add @cache(directory) to data items that should have their result cached. That's it. The result of each dynamic item is stored in a file, computed lazily, and can even be computed on GPU -- although there are some limitations here that could be discussed.
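As a rough illustration of that core idea, a per-item caching decorator could look like the sketch below. All names and the uid-based signature are assumptions for illustration only, and plain pickle stands in for the torch.save backend the proposal actually uses:

```python
import os
import pickle
import tempfile

def cache(directory):
    # Illustrative sketch only -- names and signature are assumptions,
    # not the merged SpeechBrain API. The PR stores results with
    # torch.save; stdlib pickle keeps this sketch dependency-free.
    os.makedirs(directory, exist_ok=True)

    def decorator(fn):
        def wrapper(uid, *args):
            path = os.path.join(directory, f"{uid}.pkl")
            if os.path.exists(path):            # cache hit: load from disk
                with open(path, "rb") as f:
                    return pickle.load(f)
            result = fn(uid, *args)             # cache miss: compute lazily
            with open(path, "wb") as f:         # persist for later epochs
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

cache_dir = tempfile.mkdtemp()
calls = []

@cache(cache_dir)
def features(uid, signal):
    calls.append(uid)                # track how often we actually compute
    return [x * 2 for x in signal]

print(features("utt1", [1, 2, 3]))   # computed and written to disk
print(features("utt1", [1, 2, 3]))   # served from the cache file
print(len(calls))                    # 1: computation ran only once
```

The second call never touches the wrapped function, which is the whole point: the expensive dynamic item runs once per utterance across all epochs.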

Let's weigh this proposal against the others and decide which offers the best functionality vs. ease-of-use trade-off.

@pplantinga pplantinga added this to the v1.1.0 milestone Oct 20, 2025
@pplantinga pplantinga self-assigned this Oct 20, 2025
@pplantinga pplantinga added the enhancement New feature or request label Oct 20, 2025
@TParcollet
Collaborator

I like it! @Adel-Moumen what do you think?

Do you know how fast it is? I'm thinking about extraction of very large datasets.

@Adel-Moumen
Collaborator

> I like it! @Adel-Moumen what do you think?
>
> Do you know how fast it is? I'm thinking about extraction of very large datasets.

I think, so far, this is the 'cleanest' proposed implementation and the one most aligned with SpeechBrain principles.

@pplantinga, I think it would be nice to extend this proposal by supporting different storage backends. In this proposal you've used torch and save one file per item; however, some compute facilities like Compute Canada limit the number of files you can create. I would therefore suggest making the backend configurable (e.g. numpy or h5) so that everything can be saved within the same file. Some of the proposed PRs supported this feature (mine, I think).

Additionally, if we want to push this PR into our recipes, I would say that in practice we should do it in two stages: 1. data extraction and saving, 2. training. This way, the second stage can increase the batch size thanks to the VRAM freed by removing the feature extractor from the GPU (e.g. an SSL encoder like HuBERT). Of course, this will depend on the recipe (e.g. FBanks do not need two stages). Maybe we can discuss how we want to implement this feature in recipes? cc @TParcollet

Finally, I propose that we list all the features supported by each proposed PR (e.g. multiple storage backends), pick the ones we think should be implemented in SB, close those PRs, and implement these features within this one. I don't think we should keep them alive, as yours is much better (congrats!).

@Adel-Moumen
Collaborator

After looking at the 3 different PRs, I think we can close them all. What I would just keep in mind is that it would be better if we could:

  • define the saving/loading backend (e.g. numpy vs. torch vs. h5)
  • a la webdataset, allow buckets of saved elements (e.g. 1000 elements per sharded file)
  • allow defining the dtype of the saved features (e.g. uint16); for loading, we could just write a YAML file in the folder describing how to read the saved features (e.g. the specific dtype)
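The sharding idea in the second bullet could be sketched as a simple index-to-shard mapping; the bucket size and file naming below are illustrative, not a committed design:

```python
# Toy sketch of webdataset-style sharding: group N items per shard file
# so that filesystems with file-count quotas see few files. The naming
# scheme and default bucket size here are assumptions for illustration.
def shard_index(item_index, items_per_shard=1000):
    shard = item_index // items_per_shard    # which shard file to open
    offset = item_index % items_per_shard    # position inside that shard
    return f"shard_{shard:05d}", offset

print(shard_index(0))      # ('shard_00000', 0)
print(shard_index(2345))   # ('shard_00002', 345)
```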

@pplantinga
Collaborator Author

> Do you know how fast it is? I'm thinking about extraction of very large datasets.

I have not tested this, but the speed is potentially a limitation of this dataset-based approach, as the samples cannot be batched together (caching is done in data pipeline before the batches are created). But the dataset only has to be iterated slow-ish once and never again, so perhaps this is an acceptable cost for the ease of use? Also, there may be a clever way to get batching here that I haven't yet thought of.

> I think it would be nice to extend this proposal by supporting different storage backends.

I've added an h5py backend; it was quite straightforward to add, take a look! Along with this, a question: should h5py be added to SpeechBrain's dependencies, or should this be migrated to integrations?

> I would say that in practice, we should do it in two stages: 1. data extraction and saving, 2. training.

I have updated the speakerid template to do this in two stages which allows num_workers to be > 0. Basically, the recipe just iterates the dataset once, then converts to read-only. Could even add a tqdm here if we wanted to show progress of caching. Question: do we want to add some sort of function that simplifies this further?
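A minimal sketch of that warm-the-cache pass, with hypothetical names (the real recipe code may differ):

```python
# Sketch of a cache-warming helper -- the function name and dataset
# handling are assumptions, not the merged recipe code.
def warm_cache(dataset, progressbar=None):
    """Iterate every item once so each cached dynamic item writes its
    result to disk; training afterwards reads only from the cache.
    `progressbar` can be e.g. tqdm.tqdm to show progress."""
    iterator = dataset if progressbar is None else progressbar(dataset)
    for _ in iterator:   # touching each item triggers (and caches) it
        pass

# Usage with a toy "dataset" that records which items were computed:
computed = []

def toy_dataset():
    for uid in ["utt1", "utt2", "utt3"]:
        computed.append(uid)   # stands in for the cached computation
        yield uid

warm_cache(toy_dataset())
print(computed)  # ['utt1', 'utt2', 'utt3']
```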

@Adel-Moumen
Collaborator

> I have not tested this, but the speed is potentially a limitation of this dataset-based approach, as the samples cannot be batched together (caching is done in data pipeline before the batches are created). But the dataset only has to be iterated slow-ish once and never again, so perhaps this is an acceptable cost for the ease of use? Also, there may be a clever way to get batching here that I haven't yet thought of.

Since we are not claiming to be a SOTA toolkit for production-ready systems (e.g. trained on 1M hours of speech), I think it's OK not to have batching support. Of course, this can be a bit painful on some datasets, e.g. LibriLight, but I don't think it should be a major blocker. I also tend to think that for speech tokens (e.g. extracted from codecs), the dataset iteration won't be optimal, since you are using a large VQ-VAE system to extract the representations and therefore want a batch size > 1 to speed up the process. But this is not necessarily a major blocker either: I have been extracting speech tokens with batch size 1 and it was OK (you just need to scale up the GPU grid a bit).

So I do think what you are proposing is the best trade-off we can make between SB identity and efficiency.

> I've added an h5py backend, was quite straightforward to add, take a look! But along with this, a question: should h5py be added to SpeechBrain dependencies, or should this be migrated to integrations?

I would say integrations, as it is not a major dependency: we can just use torch instead.

One note: I really do like the design! I would say we can already start adding tests within this PR, as I don't think we should expect a major re-design.

> Could even add a tqdm here if we wanted to show progress of caching.

Yup.

> Question: do we want to add some sort of function that simplifies this further?

What are you thinking about?

@pplantinga
Collaborator Author

Okay, the hdf5 backend is in the integrations folder and there is a function to warm the cache with optional progressbar.

> a la webdataset, having the possibility to have buckets of saved elements (e.g. 1000 elements per sharded file)

Do we need this on top of hdf5 which already provides bucketing?

> having the possibility to define the dtype of the saved features e.g. uint16

Can't this just be done by converting the features to the desired type within the DynamicItem function?

> I would say we can already start integrating tests within this PR, as I don't think we should expect major re-design.

The doctest already gives 84% coverage (see the README) but do you think we need unit tests here as well?

All my questions are just trying to determine what is really needed here.
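For the dtype question above, casting inside the dynamic item could be as simple as the sketch below; `compute_feats` is an illustrative name, and stdlib int16 arrays stand in for the `tensor.to(torch.float16)` casting mentioned in the discussion:

```python
from array import array

# Sketch: converting inside the dynamic item itself, so the cached file
# already holds the compact type. `compute_feats` is an illustrative
# name; a real pipeline would cast a torch tensor instead.
def compute_feats(signal):
    # quantise floats in [-1, 1] to 16-bit integers before caching
    return array("h", (int(x * 32767) for x in signal))

feats = compute_feats([0.0, 0.5, -0.5])
print(feats.typecode, list(feats))  # h [0, 16383, -16383]
```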

@Adel-Moumen
Collaborator


Btw, I was thinking, and I actually don't think we are constrained by the batching issue. The current caching decorator checks if the uid is part of the hdf5 file (or self._uid2path). Therefore, what we can do is just have, let's say, a feat_extractor.py file that batchifies compute_forward and saves the output in the same format as required by the dataloader!
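A toy sketch of that batched pre-extraction idea, where every name (`extract_all`, `uid_is_cached`, `save_to_cache`) is hypothetical and an in-memory dict stands in for the hdf5/_uid2path backend:

```python
# Hypothetical sketch of batched pre-extraction: skip uids that are
# already cached, then compute the rest in batches and store each
# result under its uid, as the dataloader would later read it.
def extract_all(uids, signals, batch_size=4):
    cache = {}

    def uid_is_cached(uid):
        return uid in cache        # stands in for the hdf5/uid2path check

    def save_to_cache(uid, feats):
        cache[uid] = feats         # stands in for writing the backend file

    pending = [(u, s) for u, s in zip(uids, signals) if not uid_is_cached(u)]
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        # batched "compute_forward": here just a toy doubling of samples
        feats = [[x * 2 for x in sig] for _, sig in batch]
        for (uid, _), f in zip(batch, feats):
            save_to_cache(uid, f)
    return cache

out = extract_all(["a", "b", "c"], [[1], [2], [3]], batch_size=2)
print(out)  # {'a': [2], 'b': [4], 'c': [6]}
```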

> Do we need this on top of hdf5 which already provides bucketing?

Not necessarily!

> Can't this just be done by converting the features to the desired type within the DynamicItem function?

Yes, but I believe it's less clear unless we force the casting, e.g. return tensor.to(torch.float16).

@Adel-Moumen
Collaborator

Hey @pplantinga, do you know what is missing from this PR? Thanks!

@pplantinga
Collaborator Author

@Adel-Moumen Do we need a tutorial here?

@pplantinga
Collaborator Author

Also, possibly some testing with DDP

@pplantinga
Collaborator Author

Check here to see the section I added about the cached pipeline:

https://speechbrain--2985.org.readthedocs.build/en/2985/tutorials/basics/data-loading-pipeline.html#cached-pipeline

@pplantinga
Collaborator Author

I ran a very basic test on DDP and didn't see any immediate issues. I guess since each process is using different data, this should be fine. For the hdf5 integration, the cache should be built ahead of time, so no worries there on the DDP front.

I'd say this is good to go, unless @Adel-Moumen has other comments / suggestions.

Collaborator

@Adel-Moumen Adel-Moumen left a comment


LGTM!

@Adel-Moumen
Collaborator

Thanks @pplantinga ! This is a great new feature in SB.

@Adel-Moumen Adel-Moumen merged commit dadc1d3 into speechbrain:develop Oct 30, 2025
5 checks passed
@pplantinga pplantinga deleted the caching-dynamic-item-dataset branch October 31, 2025 00:23

Labels

enhancement New feature or request


3 participants