Training Set Selection - Speech
Please do not inspect or use the provided evaluation sets for any purpose other than offline evaluation of your submissions to Dynabench (e.g., do not optimize on the evaluation data).
Each training set in the final submission will be capped at either 25 or 60 samples, depending on the leaderboard. Training sets with more than the maximum number of selected samples for that leaderboard will be rejected.
For this challenge, the submitted train.json file can be unbalanced, so an optimal solution may use an unbalanced training set.
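The sample cap and the class balance of a selection can be checked locally before submitting. The following is a minimal sketch, not the official validator. It assumes that train.json is a JSON object mapping each class label to a list of selected sample IDs; that layout and all function names here are hypothetical, so adapt them to the actual file format. The caller supplies the limit (25 or 60, depending on the leaderboard).

```python
import json
from typing import Dict, List

# ASSUMPTION: train.json maps each class label to a list of sample IDs.
# The challenge text does not specify the layout; adjust as needed.
Selection = Dict[str, List[str]]


def load_selection(path: str) -> Selection:
    """Read a selection file from disk (assumed layout, see above)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_selected(selection: Selection) -> int:
    """Total number of selected samples across all classes."""
    return sum(len(ids) for ids in selection.values())


def class_counts(selection: Selection) -> Dict[str, int]:
    """Per-class sample counts; an unbalanced result is permitted."""
    return {label: len(ids) for label, ids in selection.items()}


def check_cap(selection: Selection, max_samples: int) -> None:
    """Raise ValueError if the selection exceeds the leaderboard cap."""
    total = count_selected(selection)
    if total > max_samples:
        raise ValueError(
            f"{total} samples selected, but the cap is {max_samples}"
        )
```

For example, `check_cap(load_selection("train.json"), 25)` would raise an error for a selection with more than 25 samples, which is the condition under which a submission is rejected.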
The provided candidate pool of training samples is a custom subset of the Multilingual Spoken Words Corpus (MSWC). You may analyze other languages in MSWC, but please do not use English, Portuguese, or Indonesian MSWC data outside of the samples specified in allowed_training_set.yaml for each respective language.
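It can also help to confirm that every selected sample comes from the allowed pool for that language. The sketch below assumes the allowed sample IDs have already been read from allowed_training_set.yaml into a flat collection of strings (the YAML structure is not described here, so loading is left out), and that the selection maps class labels to lists of sample IDs. The function name and both layouts are hypothetical.

```python
from typing import Dict, Iterable, List


def find_disallowed(
    selection: Dict[str, List[str]], allowed: Iterable[str]
) -> List[str]:
    """Return the selected sample IDs that are not in the allowed pool.

    ASSUMPTION: `selection` maps class labels to lists of sample IDs and
    `allowed` is a flat collection of IDs taken from
    allowed_training_set.yaml for the same language.
    """
    allowed_set = set(allowed)
    return sorted(
        sample_id
        for ids in selection.values()
        for sample_id in ids
        if sample_id not in allowed_set
    )
```

An empty result means that every selected sample is in the allowed pool; any returned ID should be removed from the selection before submitting.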