PoC Offline Extraction with SB #2942
Closed
Adel-Moumen wants to merge 6 commits into speechbrain:develop from
Closed in favour of #2985.
What does this PR do?
This PR provides an alternative proof of concept for saving and loading features in SpeechBrain.
Background
SpeechBrain has primarily focused on extracting features on the fly—FBanks, SSL representations, etc.—as part of its philosophy of doing everything in a single `train.py` file driven by a YAML configuration. This enables rapid prototyping with tight feedback loops.

However, in recent months and years we’ve seen a trend toward ever-larger datasets (e.g., GigaSpeech’s 10k hours, LibriHeavy’s 50k hours) becoming the de facto benchmarks for training models (farewell to our old LibriSpeech on V100). The cost of on-the-fly feature extraction grows with multiple epochs over such large corpora.

Moreover, a new form of representation has emerged: speech tokens. These discrete representations—often extracted from SSL encoders or VQ-VAE models—are fixed once extracted and are used by medium- to large-scale autoregressive models. Because these encoders are heavy, extracting tokens on the fly is prohibitively expensive. Instead, tokens are typically extracted offline (much like SentencePiece tokens) and then loaded at training time, so that only the decoder (and the tokens) reside in VRAM.
This growing use of frozen features and discrete representations renders SpeechBrain’s current workflow impractical at scale. The community’s embrace of SpeechLMs and SpeechLLMs marks a paradigm shift in which on-the-fly feature extraction is no longer feasible. This PR addresses that challenge by providing a proof of concept for saving and loading pre-extracted features in SpeechBrain.
Description of the Prototype
I extended the `Brain` class with two new methods: `compute_features` and `cache_features`.

`compute_features(batch, stage)`

Similar to `fit_batch`, this method takes a `batch` and a `stage`, extracts the required features, and returns a list of dictionaries. Each dictionary must include the utterance `id` plus any feature key/value pairs you want to save. For example, to save `id`, `ssl_feats`, and `tokens`, return:

```
[
    {"id": "utt1", "ssl_feats": <tensor>, "tokens": <ndarray>},
    {"id": "utt2", "ssl_feats": <tensor>, "tokens": <ndarray>},
    ...
]
```

`cache_features(...)`

Analogous to `fit()` or `evaluate()`, this method iterates over a dataset (or dataloader), calls `compute_features` on each batch, and writes the returned feature dictionaries to disk.

I/O Backends & Configuration
Inspired by [lhotse’s I/O module](https://github.com/lhotse-speech/lhotse/blob/fda1a986e5e1e72a14c82049b4ee709fc09a81e6/lhotse/features/io.py#L494), I added a `feature_io.py` file defining reader and writer classes, plus a simple factory. Key points:

- `np.memmap`-style access to avoid loading everything into RAM.
- All configuration lives in YAML via a `FeatureStorageConfig` section that specifies, for each feature:
  - `name`: the key under which to store it (e.g., `ssl_feats`)
  - `dtype`: e.g., `float32`
  - `writer_class`: e.g., `NumpyHdf5Writer`

YAML Example
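The YAML block itself is not reproduced here; based on the fields just listed, a hypothetical sketch of a `FeatureStorageConfig` section might look like the following (the exact schema is illustrative, not the PR’s actual syntax):

```yaml
# Hypothetical sketch -- field names follow the description above
# (name, dtype, writer_class); the surrounding schema is illustrative.
FeatureStorageConfig:
    - name: ssl_feats
      dtype: float32
      writer_class: NumpyHdf5Writer
    - name: tokens
      dtype: int32
      writer_class: NumpyHdf5Writer
```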
Usage Example
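The original usage snippet is not shown here; as a stand-in, the following self-contained sketch mimics the `compute_features` / `cache_features` pattern described above (the class and its writer interface are illustrative stand-ins, not SpeechBrain’s actual API):

```python
# Self-contained sketch of the compute_features / cache_features
# pattern. This is NOT the PR's implementation: the class below and
# its writer interface are illustrative stand-ins.

class FeatureCacherSketch:
    def __init__(self, writers):
        # writers: dict mapping feature name -> callable(utt_id, value)
        self.writers = writers

    def compute_features(self, batch, stage):
        # Extract features for each utterance in the batch and return a
        # list of dicts, each holding the utterance "id" plus the
        # feature key/value pairs to save (here a dummy feature).
        return [{"id": utt_id, "ssl_feats": [float(len(utt_id))]}
                for utt_id in batch]

    def cache_features(self, dataloader, stage="train"):
        # Iterate over batches, compute features, and hand each feature
        # to its configured writer (which would persist it to disk).
        for batch in dataloader:
            for feats in self.compute_features(batch, stage):
                utt_id = feats.pop("id")
                for name, value in feats.items():
                    self.writers[name](utt_id, value)
```

In the real prototype, the writers would be the YAML-configured backend instances rather than plain callables.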
Reading Cached Features in `train.py`

Define your readers in YAML:
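The reader definition itself is not reproduced; a hypothetical sketch, mirroring the writer classes mentioned earlier (the `NumpyHdf5Reader` name and keys are assumptions), might look like:

```yaml
# Hypothetical sketch -- class path and keys are illustrative.
ssl_feats_reader: !new:feature_io.NumpyHdf5Reader
    storage_path: !ref <save_folder>/train_ssl_feats
```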
And use them in your data pipeline:
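The pipeline snippet is likewise not reproduced; below is a self-contained sketch of the lazy, offset-indexed read pattern such a reader could use (plain `struct`-based file I/O standing in for the `np.memmap`-style access mentioned earlier; all class and method names are illustrative):

```python
import struct

# Hypothetical sketch -- NOT the PR's feature_io.py. Features are
# appended to one flat binary file; an index maps each utterance id
# to (byte offset, length) so reads can seek lazily instead of
# loading the whole file into RAM.

class FlatFileWriter:
    def __init__(self, path):
        self.path = path
        self.index = {}          # utt_id -> (byte offset, n floats)
        self._f = open(path, "wb")

    def write(self, utt_id, values):
        self.index[utt_id] = (self._f.tell(), len(values))
        self._f.write(struct.pack(f"{len(values)}f", *values))

    def close(self):
        self._f.close()

class FlatFileReader:
    def __init__(self, path, index):
        self.path = path
        self.index = index

    def read(self, utt_id):
        # Seek straight to this utterance's features; nothing else
        # is read from disk.
        offset, n = self.index[utt_id]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return list(struct.unpack(f"{n}f", f.read(4 * n)))
```

In a real `train.py`, a dynamic-item pipeline function would then simply call `reader.read(utt_id)` for each utterance id instead of recomputing the features.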
That’s all! Implement `compute_features`, configure your writers and readers in YAML, and call `cache_features`.

Room for Improvements
When handling multiple dataset splits, the `dataio_prepare` stage can become verbose: each split needs its own set of writer and reader entries. One way to simplify this would be to move the writer (and reader) instantiations into the `Brain` class itself, rather than defining them in YAML. That way, you wouldn’t need to clutter your config with one writer/reader entry per split — the `Brain` subclass could automatically create and expose `feature_writers` and `feature_readers` for each split based on a single `feature_configs` entry.

NOTE: Please don't ask me about solving tests etc. The intended goal of this PR so far is to provide a PoC. I will make things cleaner once we are converging towards a general design.
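To make the trade-off above concrete, a hypothetical sketch of the per-split duplication, followed by the single `feature_configs` entry that could replace it (all keys and class paths are illustrative, using HyperPyYAML’s `!new:` syntax):

```yaml
# Hypothetical sketch -- NOT the PR's actual schema.
# Today: one writer per feature per split clutters the YAML...
train_ssl_feats_writer: !new:feature_io.NumpyHdf5Writer
    storage_path: !ref <save_folder>/train_ssl_feats
valid_ssl_feats_writer: !new:feature_io.NumpyHdf5Writer
    storage_path: !ref <save_folder>/valid_ssl_feats
test_ssl_feats_writer: !new:feature_io.NumpyHdf5Writer
    storage_path: !ref <save_folder>/test_ssl_feats

# ...whereas a single entry could let the Brain subclass create
# feature_writers / feature_readers for every split automatically:
feature_configs:
    - name: ssl_feats
      dtype: float32
      writer_class: NumpyHdf5Writer
```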