Run CleanVision on Torchvision dataset

After installing the required packages, you may need to restart your notebook runtime before running the rest of this notebook.

[2]:

from torchvision.datasets import CIFAR10
from torch.utils.data import ConcatDataset
from cleanvision import Imagelab

1. Download dataset and concatenate all splits

Because we want a general picture of the issues in our data, we merge the training and test sets into one larger dataset before running CleanVision. Alternatively, you could run the package on the two sets separately to obtain two different reports.

CIFAR10 is a classification dataset, but CleanVision can be used to audit images from any type of dataset (including supervised or unsupervised learning).

Load all splits of the CIFAR10 dataset

[3]:

%%capture
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)

Concatenate train and test splits

[4]:

dataset = ConcatDataset([train_set, test_set])
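
Because the two splits are merged, every index that CleanVision reports later refers to the concatenated dataset rather than to the original train or test set. The sketch below shows one way to map such an index back to its split. It assumes the standard CIFAR10 split sizes (50,000 training and 10,000 test images) and the order used above (train first); `locate` and `SPLIT_SIZES` are hypothetical names chosen for illustration and are not part of torch or CleanVision.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(global_index, split_sizes):
    """Map an index in the concatenated dataset to (split_number, local_index)."""
    if not 0 <= global_index < sum(split_sizes):
        raise IndexError(f"index {global_index} is out of range")
    # Running totals mark where each split ends, e.g. [50000, 60000].
    cumulative = list(accumulate(split_sizes))
    split = bisect_right(cumulative, global_index)
    offset = cumulative[split - 1] if split > 0 else 0
    return split, global_index - offset


# Assumed sizes: CIFAR10 train split first, then the test split.
SPLIT_SIZES = [50_000, 10_000]

print(locate(0, SPLIT_SIZES))       # first training image
print(locate(50_000, SPLIT_SIZES))  # first test image
```

With these sizes, any reported index below 50,000 belongs to the training split and the rest belong to the test split.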

A sample from the dataset

[5]:

dataset[0]

[5]:

(<PIL.Image.Image image mode=RGB size=32x32>, 6)

Let’s look at the first image in this dataset

[6]:

dataset[0][0]

[6]:

../_images/tutorials_torchvision_dataset_12_0.png

2. Run CleanVision

[7]:

imagelab = Imagelab(torchvision_dataset=dataset)

[8]:

imagelab.find_issues()

Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...

Issue checks completed. 173 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().

3. View Results

Get a report of all the issues found

[9]:

imagelab.report()

Issues found in images in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | blurry           |          118 |
|  1 | near_duplicates  |           40 |
|  2 | dark             |           11 |
|  3 | light            |            3 |
|  4 | low_information  |            1 |
|  5 | odd_aspect_ratio |            0 |
|  6 | grayscale        |            0 |
|  7 | odd_size         |            0 |
|  8 | exact_duplicates |            0 |

---------------------- blurry images -----------------------

Number of examples with this issue: 118
Examples representing most severe instances of this issue:

../_images/tutorials_torchvision_dataset_17_1.svg

------------------ near_duplicates images ------------------

Number of examples with this issue: 40
Examples representing most severe instances of this issue:

Set: 0

../_images/tutorials_torchvision_dataset_17_3.svg

Set: 1

../_images/tutorials_torchvision_dataset_17_5.svg

Set: 2

../_images/tutorials_torchvision_dataset_17_7.svg

Set: 3

../_images/tutorials_torchvision_dataset_17_9.svg

----------------------- dark images ------------------------

Number of examples with this issue: 11
Examples representing most severe instances of this issue:

../_images/tutorials_torchvision_dataset_17_11.svg

----------------------- light images -----------------------

Number of examples with this issue: 3
Examples representing most severe instances of this issue:

../_images/tutorials_torchvision_dataset_17_13.svg

------------------ low_information images ------------------

Number of examples with this issue: 1
Examples representing most severe instances of this issue:

../_images/tutorials_torchvision_dataset_17_15.svg

View more information about each image, such as what types of issues it exhibits and its quality score with respect to each type of issue.

[10]:

imagelab.issues.head()

[10]:

|    | odd_size_score | is_odd_size_issue | odd_aspect_ratio_score | is_odd_aspect_ratio_issue | low_information_score | is_low_information_issue | light_score | is_light_issue | grayscale_score | is_grayscale_issue | dark_score | is_dark_issue | blurry_score | is_blurry_issue | exact_duplicates_score | is_exact_duplicates_issue | near_duplicates_score | is_near_duplicates_issue |
|---:|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|
|  0 | 1.0 | False | 1.0 | False | 0.854301 | False | 0.797130 | False | 1 | False | 0.915605 | False | 0.484206 | False | 1.0 | False | 1.0 | False |
|  1 | 1.0 | False | 1.0 | False | 0.926223 | False | 0.850131 | False | 1 | False | 0.976886 | False | 0.571952 | False | 1.0 | False | 1.0 | False |
|  2 | 1.0 | False | 1.0 | False | 0.818548 | False | 0.852996 | False | 1 | False | 1.000000 | False | 0.532279 | False | 1.0 | False | 1.0 | False |
|  3 | 1.0 | False | 1.0 | False | 0.813629 | False | 0.844033 | False | 1 | False | 0.775992 | False | 0.401225 | False | 1.0 | False | 1.0 | False |
|  4 | 1.0 | False | 1.0 | False | 0.898399 | False | 0.958406 | False | 1 | False | 0.919564 | False | 0.503716 | False | 1.0 | False | 1.0 | False |
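
Each issue type contributes a pair of columns, `<issue>_score` and `is_<issue>_issue`, so the issues flagged for a single image can be read off the boolean columns. The following is a minimal sketch of that idea. It works on a plain dictionary of column names to values (for example, one row of the table converted with pandas' `Series.to_dict()`); `issues_for_image` is a hypothetical helper, and the example row is hand-written for illustration rather than taken from the CIFAR10 results.

```python
def issues_for_image(row):
    """Return the names of the issue types flagged for one image.

    `row` maps column names to values, e.g. imagelab.issues.iloc[i].to_dict().
    """
    prefix, suffix = "is_", "_issue"
    return sorted(
        name[len(prefix):-len(suffix)]
        for name, flagged in row.items()
        if name.startswith(prefix) and name.endswith(suffix) and flagged
    )


# Hand-written row for illustration only (not real CIFAR10 output).
example_row = {
    "dark_score": 0.21, "is_dark_issue": True,
    "blurry_score": 0.18, "is_blurry_issue": True,
    "light_score": 0.93, "is_light_issue": False,
}

print(issues_for_image(example_row))  # ['blurry', 'dark']
```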

Get the indices of all dark images in the dataset, sorted by dark score in ascending order (a lower score indicates a more severe issue, so the darkest images come first).

[11]:

indices = (
    imagelab.issues.query("is_dark_issue").sort_values(by="dark_score").index.tolist()
)
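
The same filter-then-sort pattern applies to any issue type, not only dark images. The sketch below restates it in plain Python so the logic is easy to follow; in the notebook itself the pandas chain above, with the column names changed, does the same job. It assumes the rows are given as a mapping from image index to column values (the shape produced by pandas' `DataFrame.to_dict("index")`). `indices_by_severity` and the example rows are hypothetical and for illustration only.

```python
def indices_by_severity(rows, issue_type):
    """Indices of images flagged with `issue_type`, lowest score (most severe) first.

    `rows` maps image index -> {column name: value},
    e.g. imagelab.issues.to_dict("index").
    """
    flag_col = f"is_{issue_type}_issue"
    score_col = f"{issue_type}_score"
    flagged = [idx for idx, row in rows.items() if row[flag_col]]
    return sorted(flagged, key=lambda idx: rows[idx][score_col])


# Hand-written rows for illustration only (not real CIFAR10 output).
example_rows = {
    0: {"dark_score": 0.92, "is_dark_issue": False},
    1: {"dark_score": 0.25, "is_dark_issue": True},
    2: {"dark_score": 0.10, "is_dark_issue": True},
}

print(indices_by_severity(example_rows, "dark"))  # [2, 1]
```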

View the 6th darkest image in the dataset (indices[5], since list indexing starts at 0)

[12]:

dataset[indices[5]][0]

[12]:

../_images/tutorials_torchvision_dataset_23_0.png

View global information about each issue, such as how many images in the dataset suffer from this issue.

[13]:

imagelab.issue_summary

[13]:

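|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | blurry           |          118 |
|  1 | near_duplicates  |           40 |
|  2 | dark             |           11 |
|  3 | light            |            3 |
|  4 | low_information  |            1 |
|  5 | odd_aspect_ratio |            0 |
|  6 | grayscale        |            0 |
|  7 | odd_size         |            0 |
|  8 | exact_duplicates |            0 |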
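
To put these counts in perspective, it can help to express them as a share of the whole dataset. The sketch below copies the counts from the summary above into a plain dictionary and assumes the concatenated CIFAR10 dataset holds 60,000 images (50,000 train plus 10,000 test); the variable names are illustrative only.

```python
DATASET_SIZE = 60_000  # assumed: 50,000 train + 10,000 test images in CIFAR10

# Counts copied from the issue summary above.
issue_counts = {
    "blurry": 118,
    "near_duplicates": 40,
    "dark": 11,
    "light": 3,
    "low_information": 1,
    "odd_aspect_ratio": 0,
    "grayscale": 0,
    "odd_size": 0,
    "exact_duplicates": 0,
}

total_flags = sum(issue_counts.values())
shares = {name: 100 * count / DATASET_SIZE for name, count in issue_counts.items()}

print(f"total flags: {total_flags}")
for name, pct in shares.items():
    if pct > 0:
        print(f"{name:<16} {pct:.3f}% of images")
```

The total matches the 173 issues reported by find_issues(). An image may be flagged for more than one issue type, so this total counts flags and may exceed the number of distinct affected images.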

For a more detailed guide on how to use CleanVision, check the tutorial notebook.