Run CleanVision on Torchvision dataset#
After you install these packages, you may need to restart your notebook runtime before running the rest of this notebook.
[2]:
from torchvision.datasets import CIFAR10
from torch.utils.data import ConcatDataset
from cleanvision import Imagelab
1. Download dataset and concatenate all splits#
Since we’re interested in generally understanding what issues plague our data, we merge the training and test sets into one larger dataset before running CleanVision. You could alternatively just run the package on these two sets of data separately to obtain two different reports.
CIFAR10 is a classification dataset, but CleanVision can be used to audit images from any type of dataset (supervised or unsupervised).
Load all splits of the CIFAR10 dataset
[3]:
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./cifar-10-python.tar.gz
100%|██████████| 170498071/170498071 [00:02<00:00, 57027649.85it/s]
Extracting ./cifar-10-python.tar.gz to ./
Files already downloaded and verified
Concatenate train and test splits
[4]:
dataset = ConcatDataset([train_set, test_set])
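Because the training images come first in the concatenated dataset, an index in the merged dataset can be traced back to its original split. The following is a minimal sketch in plain Python. The helper name locate is hypothetical (it is not part of torch or CleanVision), and the split sizes are hard-coded to the standard CIFAR10 sizes (50,000 train, 10,000 test) rather than read from the dataset objects.

```python
from bisect import bisect_right
from itertools import accumulate

# Sizes of the splits in the order they were concatenated (train, then test).
# Assumption: the standard CIFAR10 split sizes, hard-coded for illustration.
SPLIT_SIZES = [50_000, 10_000]
SPLIT_NAMES = ["train", "test"]


def locate(index, sizes=SPLIT_SIZES, names=SPLIT_NAMES):
    """Map an index in the concatenated dataset to (split name, index within split)."""
    boundaries = list(accumulate(sizes))  # cumulative sizes: [50000, 60000]
    if not 0 <= index < boundaries[-1]:
        raise IndexError(f"index {index} out of range for {boundaries[-1]} images")
    # The first boundary strictly greater than index identifies the split.
    split = bisect_right(boundaries, index)
    offset = index - (boundaries[split - 1] if split > 0 else 0)
    return names[split], offset


print(locate(0))       # ('train', 0)
print(locate(49_999))  # ('train', 49999)
print(locate(50_000))  # ('test', 0)
```

This can help when an image flagged later in the notebook needs to be traced back to the split it came from.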
A sample from the dataset
[5]:
dataset[0]
[5]:
(<PIL.Image.Image image mode=RGB size=32x32>, 6)
Let’s look at the first image in this dataset
[6]:
dataset[0][0]
[6]:

2. Run CleanVision#
[7]:
imagelab = Imagelab(torchvision_dataset=dataset)
[8]:
imagelab.find_issues()
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...
Issue checks completed. 173 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
3. View Results#
Get a report of all the issues found
[9]:
imagelab.report()
Issues found in images in order of severity in the dataset
| | issue_type | num_images |
|---:|:-----------------|-------------:|
| 0 | blurry | 118 |
| 1 | near_duplicates | 40 |
| 2 | dark | 11 |
| 3 | light | 3 |
| 4 | low_information | 1 |
| 5 | grayscale | 0 |
| 6 | odd_aspect_ratio | 0 |
| 7 | odd_size | 0 |
| 8 | exact_duplicates | 0 |
---------------------- blurry images -----------------------
Number of examples with this issue: 118
Examples representing most severe instances of this issue:
------------------ near_duplicates images ------------------
Number of examples with this issue: 40
Examples representing most severe instances of this issue:
Set: 0
Set: 1
Set: 2
Set: 3
----------------------- dark images ------------------------
Number of examples with this issue: 11
Examples representing most severe instances of this issue:
----------------------- light images -----------------------
Number of examples with this issue: 3
Examples representing most severe instances of this issue:
------------------ low_information images ------------------
Number of examples with this issue: 1
Examples representing most severe instances of this issue:
View more information about each image, such as what types of issues it exhibits and its quality score with respect to each type of issue.
[10]:
imagelab.issues
[10]:
| | odd_size_score | is_odd_size_issue | odd_aspect_ratio_score | is_odd_aspect_ratio_issue | low_information_score | is_low_information_issue | light_score | is_light_issue | grayscale_score | is_grayscale_issue | dark_score | is_dark_issue | blurry_score | is_blurry_issue | exact_duplicates_score | is_exact_duplicates_issue | near_duplicates_score | is_near_duplicates_issue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | False | 1.0 | False | 0.854301 | False | 0.797130 | False | 1 | False | 0.915605 | False | 0.484206 | False | 1.0 | False | 1.0 | False |
| 1 | 1.0 | False | 1.0 | False | 0.926223 | False | 0.850131 | False | 1 | False | 0.976886 | False | 0.571952 | False | 1.0 | False | 1.0 | False |
| 2 | 1.0 | False | 1.0 | False | 0.818548 | False | 0.852996 | False | 1 | False | 1.000000 | False | 0.532279 | False | 1.0 | False | 1.0 | False |
| 3 | 1.0 | False | 1.0 | False | 0.813629 | False | 0.844033 | False | 1 | False | 0.775992 | False | 0.401225 | False | 1.0 | False | 1.0 | False |
| 4 | 1.0 | False | 1.0 | False | 0.898399 | False | 0.958406 | False | 1 | False | 0.919564 | False | 0.503716 | False | 1.0 | False | 1.0 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 59995 | 1.0 | False | 1.0 | False | 0.860407 | False | 0.794629 | False | 1 | False | 0.996078 | False | 0.523458 | False | 1.0 | False | 1.0 | False |
| 59996 | 1.0 | False | 1.0 | False | 0.888932 | False | 0.939203 | False | 1 | False | 0.843293 | False | 0.498186 | False | 1.0 | False | 1.0 | False |
| 59997 | 1.0 | False | 1.0 | False | 0.818150 | False | 0.960275 | False | 1 | False | 0.865067 | False | 0.444907 | False | 1.0 | False | 1.0 | False |
| 59998 | 1.0 | False | 1.0 | False | 0.900018 | False | 0.892104 | False | 1 | False | 0.952069 | False | 0.528622 | False | 1.0 | False | 1.0 | False |
| 59999 | 1.0 | False | 1.0 | False | 0.858985 | False | 0.809504 | False | 1 | False | 0.932046 | False | 0.501550 | False | 1.0 | False | 1.0 | False |
60000 rows × 18 columns
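Each issue type contributes a score column and a boolean flag column, so ordinary pandas filtering can select the images that exhibit at least one issue. The sketch below runs on a small toy DataFrame that stands in for imagelab.issues. Its values are invented for illustration, and only two of the nine issue types are included.

```python
import pandas as pd

# Toy stand-in for imagelab.issues (values invented for illustration only).
issues = pd.DataFrame(
    {
        "dark_score": [0.92, 0.10, 0.85, 0.15],
        "is_dark_issue": [False, True, False, True],
        "blurry_score": [0.48, 0.55, 0.20, 0.25],
        "is_blurry_issue": [False, False, True, True],
    }
)

# Boolean flag columns follow the is_<issue_type>_issue naming pattern.
flag_columns = [
    c for c in issues.columns if c.startswith("is_") and c.endswith("_issue")
]

# Keep the rows where at least one flag is True.
flagged = issues[issues[flag_columns].any(axis=1)]

# Count how many issue types each flagged image exhibits.
issues_per_image = flagged[flag_columns].sum(axis=1)

print(flagged.index.tolist())     # [1, 2, 3]
print(issues_per_image.tolist())  # [1, 1, 2]
```

The same two filtering lines should apply unchanged to the real imagelab.issues DataFrame, since they depend only on the column naming pattern shown in the table above.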
Get indices of all dark images in the dataset, sorted by their dark score in ascending order.
[11]:
indices = imagelab.issues.query('is_dark_issue').sort_values(by='dark_score').index.tolist()
View the sixth darkest image in the dataset (list positions are zero-based, so indices[5] is the sixth entry)
[12]:
dataset[indices[5]][0]
[12]:

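The query-and-sort expression used above can be tried on a small example to see how positions in the resulting list relate to rank. This is a minimal sketch on a toy DataFrame with invented values. It assumes, as the scores in the table above suggest, that a lower dark_score means a darker image, so an ascending sort puts the darkest image first.

```python
import pandas as pd

# Toy stand-in for imagelab.issues (values invented for illustration only).
issues = pd.DataFrame(
    {
        "dark_score": [0.90, 0.05, 0.30, 0.12, 0.95, 0.20],
        "is_dark_issue": [False, True, True, True, False, True],
    }
)

# Same expression as in the notebook: keep flagged rows, sort ascending by score.
indices = issues.query("is_dark_issue").sort_values(by="dark_score").index.tolist()
print(indices)  # [1, 3, 5, 2]

# List positions are zero-based: position k - 1 holds the k-th darkest image.
darkest = indices[0]
third_darkest = indices[2]
print(darkest, third_darkest)  # 1 5
```

The values in indices are row labels of the DataFrame, which is why they can be passed directly to dataset[...] in the cell above.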
View global information about each issue, such as how many images in the dataset suffer from this issue.
[13]:
imagelab.issue_summary
[13]:
| | issue_type | num_images |
|---|---|---|
| 0 | blurry | 118 |
| 1 | near_duplicates | 40 |
| 2 | dark | 11 |
| 3 | light | 3 |
| 4 | low_information | 1 |
| 5 | grayscale | 0 |
| 6 | odd_aspect_ratio | 0 |
| 7 | odd_size | 0 |
| 8 | exact_duplicates | 0 |
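To put these counts in perspective, they can be expressed as a share of the 60,000 images. The short sketch below copies the non-zero counts from the summary above into a plain dictionary and does the arithmetic. Note that the total is a sum of per-type counts, so an image flagged for two issue types would be counted twice.

```python
# Non-zero counts copied from the issue summary above.
issue_counts = {
    "blurry": 118,
    "near_duplicates": 40,
    "dark": 11,
    "light": 3,
    "low_information": 1,
}
total_images = 60_000  # number of rows in imagelab.issues

# Sum of per-type counts; matches the total reported by find_issues().
total_flags = sum(issue_counts.values())
print(total_flags)  # 173

# Share of the dataset affected by each issue type, in percent.
for issue_type, count in issue_counts.items():
    share = 100 * count / total_images
    print(f"{issue_type:<16} {count:>4}  {share:.3f}%")
```

Even the most common issue type, blurry, affects well under 1% of the images.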
For a more detailed guide on how to use CleanVision, check the tutorial notebook.