Training Set Cleaning

How are you evaluated?

The evaluation code for the challenge is fully open source at dataperf-vision-debugging, where you can run the baselines and evaluate your algorithms locally. Below are instructions on how to set up the environment and run the evaluation.


Offline evaluation (MLCube)

To perform offline evaluation with MLCube, you will need Docker installed on your system. Please follow the steps here: Get Docker.


# Fetch the vision debugging repo
git clone https://github.com/DS3Lab/dataperf-vision-debugging && cd ./dataperf-vision-debugging

# Create a Python environment and install the MLCube Docker runner
python3 -m venv ./venv && source ./venv/bin/activate && pip install mlcube-docker

# Download and extract the dataset
mlcube run --task=download -Pdocker.build_strategy=always

# Create the baseline submissions (skip this if you have already done so, or if you only want to evaluate your own submissions)
mlcube run --task=create_baselines -Pdocker.build_strategy=always

# Run the evaluation
mlcube run --task=evaluate -Pdocker.build_strategy=always

# Run the plotter
mlcube run --task=plot -Pdocker.build_strategy=always
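
For intuition, here is a minimal Python sketch of the kind of measurement the evaluate task performs: train a simple classifier on a (partially cleaned) training set and report its accuracy on a held-out validation set. The file names, the column layout ("embedding_*" features plus a "label" column), and the choice of logistic regression are assumptions made for this illustration; the actual pipeline is defined in the repository.

# Illustrative only: a rough sketch of what the evaluate task measures.
# File names, column layout, and the classifier are assumptions, not the real pipeline.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def accuracy_after_cleaning(train_csv: str, val_csv: str) -> float:
    """Train a simple classifier on a (partially cleaned) training set
    and report its accuracy on a held-out validation set."""
    train = pd.read_csv(train_csv)   # assumed columns: embedding_0..embedding_k, label
    val = pd.read_csv(val_csv)

    feature_cols = [c for c in train.columns if c.startswith("embedding_")]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train[feature_cols].to_numpy(), train["label"].to_numpy())

    preds = clf.predict(val[feature_cols].to_numpy())
    return accuracy_score(val["label"].to_numpy(), preds)

if __name__ == "__main__":
    # hypothetical paths for illustration
    acc = accuracy_after_cleaning("workspace/train_cleaned.csv", "workspace/validation.csv")
    print(f"Validation accuracy after cleaning: {acc:.3f}")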


In order to evaluate your own algorithms, you can either:


Online evaluation (Dynabench)

As stated before, for the open division we ask that you submit multiple files, each being the output of the cleaning algorithm you developed. The only limitations on your submission are:

To participate in the online evaluation, first create an account on dynabench.org, log in, and click the “Debugging DataPerf” card. Then click the Submit Train Files button; you will be asked to upload the 8 files and give your method a name. Once you have submitted, your submission will be evaluated on our server (please give it some time to run, but if it takes more than half a day, please let us know), and you will be able to see it on the "your models" page.
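
For illustration, here is a minimal Python sketch of one way such a submission file could be produced, assuming the expected output is a CSV of training sample IDs ordered by how suspicious your algorithm finds them. Both this format and the file and column names are assumptions made for the example; use the baseline files generated by the create_baselines task as the reference for the exact format.

# Illustrative only: one way to turn a cleaning algorithm's decisions into a
# submission file. The assumed layout (sample IDs ranked by suspicion score) and
# all names are hypothetical -- compare with the baseline files for the real format.
import csv

def write_submission(priorities, out_path):
    """priorities: iterable of (sample_id, suspicion_score), higher = more suspicious."""
    ranked = sorted(priorities, key=lambda p: p[1], reverse=True)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sample_id", "suspicion_score"])
        writer.writerows(ranked)

if __name__ == "__main__":
    # Placeholder outputs of your own algorithm: flag samples whose label the model doubts.
    labels = {"img_001": 0, "img_002": 1, "img_003": 1}
    predicted_prob_of_label = {"img_001": 0.95, "img_002": 0.10, "img_003": 0.60}
    scores = [(sid, 1.0 - predicted_prob_of_label[sid]) for sid in labels]
    write_submission(scores, "my_method_task1.csv")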

Note: there are some restrictions on the platform:


Evaluation Metric

Your submission will be evaluated based on how many samples it needs to fix to reach a high enough accuracy. This imitates real use cases of data cleaning algorithms, where we want to inspect as few samples as possible while keeping the data quality good enough. For example, if the accuracy of the model trained on a perfectly clean dataset is 0.9, then we define the high enough accuracy to be 0.9 * 95% = 0.855. Suppose the training set contains 300 samples, algorithm A reaches an accuracy of 0.855 after fixing 100 samples, and algorithm B reaches 0.855 after fixing 200 samples; then score(A) = 100/300 = 1/3 while score(B) = 200/300 = 2/3. In other words, the lower the score, the better the cleaning algorithm.
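
To make the rule concrete, here is a short Python sketch that computes this score from a hypothetical accuracy curve. The helper and the numbers are made up for illustration; the official scoring code lives in the evaluation repository.

# Illustrative only: compute the score described above from an accuracy curve.
# acc_after_fixes[k] is the accuracy after fixing the k most suspicious samples.
def cleaning_score(acc_after_fixes, clean_accuracy, total_samples, threshold_frac=0.95):
    """Return (#samples fixed to reach the threshold) / total_samples; lower is better."""
    threshold = clean_accuracy * threshold_frac
    for fixed, acc in enumerate(acc_after_fixes):
        if acc >= threshold:
            return fixed / total_samples
    return 1.0  # never reached the threshold: worst possible score

# Toy example matching the numbers above: clean accuracy 0.9, threshold 0.855,
# a training set of 300 samples, and an algorithm that reaches 0.86 after 100 fixes.
curve = [0.70] * 100 + [0.86] * 201   # hypothetical accuracies for 0..300 fixes
print(cleaning_score(curve, clean_accuracy=0.9, total_samples=300))  # -> 0.333...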

Contact

If you have any questions, please do not hesitate to contact Xiaozhe Yao. We also have an online discussion forum at https://github.com/DS3Lab/dataperf-vision-debugging/discussions.