Training Set Selection 




Big data has been critical to many of the successes in machine learning, but it brings its own problems. Working with massive datasets is cumbersome and expensive, especially when the data is unstructured, as with images, video, and speech. Careful data selection can reduce these costs by focusing computational and labeling resources on the most valuable examples. A data-centric approach that prioritizes data quality over quantity can significantly lower the barrier to training ML models.

Challenge summary

This challenge invites participants to design novel data-centric approaches to data selection for training image classifiers. The task is binary image classification: for each visual concept (e.g., “Monster truck”, “Jean jacket”), a classifier decides whether an unlabeled image shows that concept. Familiar examples of similar models in production include automated labeling services such as Amazon Rekognition, Google Cloud Vision API, and Azure Cognitive Services.
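To make the idea of a selection strategy concrete, here is a minimal sketch of one possible baseline; it is not the challenge's reference method. It assumes each image is already represented by an embedding vector and that a few seed examples of the concept are available. The function name `select_training_set`, the `pool` dictionary format, and the half-positive, half-negative split are all hypothetical choices made for illustration. The labels it produces are guesses based on similarity, not ground truth.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_training_set(seed_positives, pool, budget):
    """Pick up to `budget` images from `pool` for a single binary concept.

    seed_positives: embedding vectors assumed to show the concept (hypothetical input)
    pool: dict mapping image id -> embedding vector (hypothetical format)
    Returns a list of (image_id, guessed_label) pairs.
    """
    budget = min(budget, len(pool))
    # Score each candidate by its similarity to the closest seed example.
    ranked = sorted(
        pool,
        key=lambda image_id: max(cosine(pool[image_id], s) for s in seed_positives),
        reverse=True,
    )
    n_pos = budget // 2
    n_neg = budget - n_pos
    # Most similar candidates become guessed positives, least similar guessed negatives.
    positives = [(image_id, 1) for image_id in ranked[:n_pos]]
    negatives = [(image_id, 0) for image_id in ranked[len(ranked) - n_neg:]]
    return positives + negatives


if __name__ == "__main__":
    toy_pool = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
    print(select_training_set([[1.0, 0.0]], toy_pool, budget=2))
```

A stronger strategy would likely spend part of the budget on candidates near the decision boundary rather than only on the extremes; this sketch only shows the overall shape of the problem: a pool goes in, and a small labeled subset comes out.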

In this challenge, your task is to design a data selection strategy that chooses the best training examples from a candidate pool of training images (a custom subset of the Open Images Dataset V6 train set) so as to maximize the mean average precision (mAP) across a set of visual concepts (e.g., “Cupcake”, “Hawk”, “Sushi”).
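For readers less familiar with the metric, the following is a minimal sketch of how mAP can be computed for a set of binary concepts. It uses the common non-interpolated definition of average precision (the mean of the precision values at each rank where a positive appears). The challenge's official evaluation may differ in details such as tie handling or interpolation, and the function names and the `per_concept` input format are assumptions made for illustration.

```python
def average_precision(labels, scores):
    """Non-interpolated AP for one concept.

    labels: list of 0/1 ground-truth labels
    scores: list of classifier scores, higher meaning more likely positive
    """
    # Rank examples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive's rank
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_concept):
    """Average the per-concept APs.

    per_concept: dict mapping concept name -> (labels, scores)  (hypothetical format)
    """
    aps = [average_precision(labels, scores) for labels, scores in per_concept.values()]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    toy = {
        "Cupcake": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
        "Hawk": ([0, 1], [0.9, 0.2]),
    }
    print(mean_average_precision(toy))
```

Because mAP weights every concept equally, a selection strategy that neglects a rare concept is penalized as much as one that neglects a common concept.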

Successful approaches will help enable image classification of long-tail concepts at scale, where discovering high-value data points is critical, a major step towards democratizing computer vision applications. This challenge is part of a larger effort to emphasize data-centric approaches to machine learning. It is the first challenge for visual data in a series of challenges on improving training and testing datasets.