Feature caching proposal: CachedDynamicItem#2985
Adel-Moumen merged 12 commits into speechbrain:develop
Conversation
I like it! @Adel-Moumen what do you think? Do you know how fast it is? I'm thinking about extraction of very large datasets.
I think, so far, this is the 'cleanest' proposed implementation and the one most aligned with SpeechBrain principles. @pplantinga, I think it would be nice to extend this proposal by supporting different storage backends. In this proposal, you've used

Additionally, if we want to push this PR into our recipes, I would say that in practice we should do it in two stages: 1. data extraction and saving, 2. training. This way, the second stage can increase the batch size thanks to the VRAM gained by removing the feature extractor from the GPU (e.g. an SSL encoder like HuBERT). Of course, this will depend on the recipe (e.g. FBanks do not need to be done in two stages). But maybe we can discuss how we want to implement this feature in recipes? cc @TParcollet

Finally, I propose that for each of the proposed PRs we list all the supported features (e.g. multiple backends for storing), pick the ones we think should be implemented in SB, close the PRs, and implement these features within this PR. I don't think we should keep the others alive, as I think yours is much better (congrats!).
After looking at the 3 different PRs, I think we can close them all. What I would just keep in mind is that it would be better if we could:
I have not tested this, but the speed is potentially a limitation of this dataset-based approach, as the samples cannot be batched together (caching is done in data pipeline before the batches are created). But the dataset only has to be iterated slow-ish once and never again, so perhaps this is an acceptable cost for the ease of use? Also, there may be a clever way to get batching here that I haven't yet thought of.
I've added an h5py backend, was quite straightforward to add, take a look! But along with this, a question: should h5py be added to SpeechBrain dependencies, or should this be migrated to
I have updated the speakerid template to do this in two stages, which allows
Since we are not claiming to be a SOTA toolkit for production-ready systems (e.g. trained on 1M hours of speech), I think it's ok not to have batching support. Of course, this can be a bit painful on some datasets, e.g. LibriLight, but I don't think this should be a major blocker. I also tend to think that for speech tokens (e.g. extracted from codecs), the dataset iteration won't be optimal, since you are using a large VQ-VAE system to extract the representations and therefore want bsize>1 to speed up the process. But this is not necessarily a major blocker either: I have been extracting speech tokens with bsize=1 and it was ok (you just need to scale up the grid of GPUs a bit). So I do think what you are proposing is the best trade-off we can make between SB identity and efficiency.
One note: I really do like the design! I would say we can already start integrating tests within this PR, as I don't think we should expect a major re-design.
Yup.
What are you thinking about?
Okay, the hdf5 backend is in the integrations folder and there is a function to warm the cache with optional progressbar.
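Cache warming is essentially one pass over the dataset, since iterating triggers the lazy computation and storage of every cached item. A hypothetical sketch (`warm_cache` is an illustrative name, not necessarily the function added in this PR):

```python
def warm_cache(dataset, progressbar=True):
    """Iterate the dataset once so every cached dynamic item is
    computed and stored before training starts."""
    iterator = dataset
    if progressbar:
        try:
            from tqdm import tqdm
            iterator = tqdm(dataset)
        except ImportError:
            pass  # tqdm is optional; fall back to a plain loop
    for _ in iterator:
        pass  # iterating is enough: it triggers compute-then-cache
```

After this pass, training epochs only ever hit the cache.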
Do we need this on top of hdf5 which already provides bucketing?
Can't this just be done by converting the features to the desired type within the DynamicItem function?
The doctest already gives 84% coverage (see the README), but do you think we need unit tests here as well? All my questions are just trying to determine what is really needed here.
Btw, I was thinking, and I actually don't think we are constrained by the batching issue. The current caching decorator checks if the
Not necessarily!
Yes, but it's less clear, I believe, unless we force the casting, e.g.
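A forced cast inside the dynamic-item function itself would look roughly like this (toy feature extractor; `float16` chosen only for illustration):

```python
import numpy as np


def extract_feats(wav):
    """Toy dynamic-item function: compute features, then force the dtype
    so the cached result always has the same precision, regardless of
    the extractor's native output type."""
    feats = np.abs(np.fft.rfft(wav))  # stand-in for a real feature extractor
    return feats.astype(np.float16)   # explicit cast before caching
```

The downside raised above: nothing in the pipeline makes this cast visible, so the reader has to inspect the function body to know what dtype ends up in the cache.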
Hey @pplantinga, do you know what is missing from this PR? Thanks!
@Adel-Moumen Do we need a tutorial here?
Also, possibly some testing with DDP.
Check here to see the section I added about the cached pipeline: |
I ran a very basic test on DDP and didn't see any immediate issues. I guess since each process is using different data, this should be fine. For the hdf5 integration, the cache should be built ahead of time, so no worries there on the DDP front. I'd say this is good to go, unless @Adel-Moumen has other comments / suggestions.
Thanks @pplantinga! This is a great new feature in SB.
Do we really need yet another proposal for feature caching? We already have:
and possibly others. However, each of these changes 10+ files, adding new recipes just for feature extraction. Instead, this proposal is simpler and I think fits the design philosophy of SpeechBrain better.
The core idea: just add `@cache(directory)` to data items that should have their result cached. That's it. The result of each dynamic item is stored in a file, computed lazily, and can even be computed on GPU -- although there are some limitations here that could be discussed.

Let's weigh this proposal against the others and decide which offers the best functionality vs. ease-of-use trade-off.
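To make the idea concrete, here is a simplified stand-in for the proposed decorator, not the PR's actual implementation: results are pickled per item id and reloaded on later epochs instead of being recomputed (the key derivation and storage format in the real code may differ).

```python
import pickle
from pathlib import Path


def cache(directory):
    """Sketch of a per-item cache decorator: the wrapped dynamic-item
    function only runs when no cached result exists for that item id."""
    root = Path(directory)
    root.mkdir(parents=True, exist_ok=True)

    def decorator(fn):
        def wrapper(item_id, *args):
            path = root / f"{item_id}.pkl"
            if path.exists():                       # cache hit: load from disk
                return pickle.loads(path.read_bytes())
            result = fn(item_id, *args)             # cache miss: compute...
            path.write_bytes(pickle.dumps(result))  # ...and store for next time
            return result
        return wrapper
    return decorator


CALLS = []  # tracks real computations, to show the cache skips repeats


@cache("feature_cache")
def ssl_features(item_id, wav):
    # An expensive extractor (e.g. an SSL encoder) would run here;
    # with the cache it only runs once per item. Toy computation below.
    CALLS.append(item_id)
    return sum(wav) / len(wav)
```

Calling `ssl_features("utt1", wav)` twice computes once and reads the cached pickle the second time, which is the whole trick: no new recipes, no extra extraction scripts, just a decorator on the dynamic item.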