Generic feature extraction POC#2876
Conversation
Adel-Moumen left a comment
Do you have an example of a train.py integration of your new tokens loader?
I don't think that this script should be here. I think it should be dataset-dependent, similar to what we are doing for, say, librispeech_preparation.py.
This was borrowed from DASB; my older approach integrated it with preparation.
Maybe you should move this to a unit test. I think the extraction will require extensive tests to make sure the loading/saving process is correct.
I will create unit tests
@Adel-Moumen: Unit tests created in #2938
I have some private examples, but they are part of new work in progress that is not ready to be merged yet, as well as older incarnations of Tokotron. I would suggest choosing one existing recipe and integrating it.
Also, a quick question for @pplantinga: don't you think we should aim for a single backend? Given that we are trying to minimise the number of dependencies, I would find it better to stick to the best and most general-purpose solution (instead of having something too general). I believe most of them share similar pros and cons. In our context, I am not sure we really need something very state-of-the-art; I would prefer something easy to use, where we need only a low effort to maintain the integration. So maybe something like numpy or h5 is enough.
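To make the numpy-backend suggestion concrete, here is a minimal sketch of what a single-backend feature cache could look like, assuming a compressed `.npz` archive keyed by utterance ID. The function names (`save_features`, `load_features`) and the archive layout are illustrative assumptions, not the PR's actual API:

```python
# Minimal sketch (illustrative, not the PR's API) of a numpy-backed
# feature cache: one compressed .npz archive, one array per utterance ID.
import os
import tempfile

import numpy as np


def save_features(path, features):
    """Save a dict of {utterance_id: np.ndarray} as a compressed archive."""
    np.savez_compressed(path, **features)


def load_features(path):
    """Load the archive back into a {utterance_id: np.ndarray} dict."""
    with np.load(path) as data:
        return {key: data[key] for key in data.files}


# Round-trip check with fake features of different kinds and lengths.
feats = {
    "utt1": np.arange(120),            # stand-in for discrete tokens
    "utt2": np.random.randn(80, 256),  # stand-in for continuous embeddings
}
path = os.path.join(tempfile.mkdtemp(), "features.npz")
save_features(path, feats)
restored = load_features(path)
assert all(np.array_equal(feats[k], restored[k]) for k in feats)
```

A single-file archive like this keeps the dependency surface small (numpy only), at the cost of not supporting partial rewrites the way HDF5 does.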
See #2938 for a simplified H5-only version.
Closed in favour of #2985.
What does this PR do?
(Work in progress) A universal feature extractor that extracts arbitrary features from a dataset (discrete tokens, continuous representations, etc.) and saves them in arbitrary formats.
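The idea of decoupling "what to extract" from "how to save" can be sketched as follows. This is a hypothetical illustration of the design, not the PR's actual interface; `extract_features`, `extractor`, and `saver` are made-up names, and the mean/variance "features" are placeholders for real token or embedding extraction:

```python
# Hypothetical sketch of a universal extractor: any callable produces the
# features, and any callable persists them, so backends are swappable.
import numpy as np


def extract_features(dataset, extractor, saver):
    """Run `extractor` on each (uid, item) pair and persist via `saver`."""
    for uid, item in dataset:
        saver(uid, extractor(item))


# Example run with an in-memory "backend" and trivial stand-in features.
storage = {}
dataset = [("utt1", np.ones(16000)), ("utt2", np.zeros(8000))]
extract_features(
    dataset,
    extractor=lambda wav: np.array([wav.mean(), wav.var()]),
    saver=lambda uid, feats: storage.__setitem__(uid, feats),
)
# storage now maps each utterance ID to its feature array
```

Swapping the `saver` for an h5py- or numpy-backed writer would change the storage format without touching the extraction logic, which is the property the PR description aims for.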
Before submitting
PR review
Reviewer checklist