Training Set Cleaning

Why is this challenge important?

When dealing with massive datasets, noises in the datasets become inevitable. This is increasingly the problem for ML training and noises in the dataset can come from many places:

If trained over these noisy datasets, ML models might suffer not only from lower quality, but also potential risks on other quality dimensions such as fairness. Careful data cleaning can often accommodate this, however, it can be a very expensive process if we need to investigate and clean all examples. By using a more data-centric approach we hope to direct human attention and the cleaning efforts toward data examples that matter more to the improvement of ML models.

In this data cleaning challenge, we invite participants to design and experiment data-centric approaches towards strategic data cleaning for training sets of an image classification model. As a participant, you will be asked to rank the samples in the entire training set, and then we will clean them one by one and evaluate the performance of the model after each fix. The earlier it reached a high enough accuracy, the better your rank is.

Similar to other dataperf challenges, the cleaning challenge comes with two flavors: the open division and the closed division.

Contact

In case you have any questions, please do not hesitate to contact Xiaozhe Yao, we also have an online discussion forum at https://github.com/DS3Lab/dataperf-vision-debugging/discussions.