Run CleanVision on a Torchvision dataset#
After you install these packages, you may need to restart your notebook runtime before running the rest of this notebook.
[2]:
from torchvision.datasets import CIFAR10
from torch.utils.data import ConcatDataset
from cleanvision import Imagelab
1. Download dataset and concatenate all splits#
Since we’re interested in generally understanding what issues plague our data, we merge the training and test sets into one larger dataset before running CleanVision. You could alternatively just run the package on these two sets of data separately to obtain two different reports.
CIFAR10 is a classification dataset, but CleanVision can audit images from any type of dataset, whether it is intended for supervised or unsupervised learning.
Load all splits of the CIFAR10 dataset
[3]:
train_set = CIFAR10(root="./", download=True)
test_set = CIFAR10(root="./", train=False, download=True)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./cifar-10-python.tar.gz
100%|██████████| 170498071/170498071 [00:02<00:00, 84465939.54it/s]
Extracting ./cifar-10-python.tar.gz to ./
Files already downloaded and verified
Concatenate train and test splits
[4]:
dataset = ConcatDataset([train_set, test_set])
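ConcatDataset keeps a single running index across both splits. CIFAR10 has 50,000 training and 10,000 test images, so indices 0 to 49999 refer to the training set and 50000 to 59999 to the test set. The following is a minimal pure-Python sketch of mapping a concatenated index back to its split. The helper `locate` is hypothetical (it is not part of torchvision or CleanVision), and the split sizes are hard-coded rather than read from the datasets.

```python
from bisect import bisect_right
from itertools import accumulate

# Splits in the order they were passed to ConcatDataset.
split_names = ["train", "test"]
split_sizes = [50_000, 10_000]

# Running totals of the split sizes: [50000, 60000].
cumulative_sizes = list(accumulate(split_sizes))


def locate(global_index):
    """Return (split name, index within that split) for a concatenated index.

    Hypothetical helper for illustration only.
    """
    if not 0 <= global_index < cumulative_sizes[-1]:
        raise IndexError(f"index {global_index} is out of range")
    # Position of the first running total that is greater than the index.
    split = bisect_right(cumulative_sizes, global_index)
    offset = cumulative_sizes[split - 1] if split > 0 else 0
    return split_names[split], global_index - offset


print(locate(0))       # ('train', 0)
print(locate(50_000))  # ('test', 0)
```

This is useful later on, when an index reported by CleanVision needs to be traced back to the split it came from.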
A sample from the dataset
[5]:
dataset[0]
[5]:
(<PIL.Image.Image image mode=RGB size=32x32>, 6)
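Each sample is an (image, label) pair, and the label is an integer class index. As a small sketch, the class names below are hard-coded in CIFAR10's label order so the example runs without downloading anything. With the real data, the same list should be available from the `classes` attribute of the torchvision dataset, for example `train_set.classes`.

```python
# CIFAR10 class names in label order (index 0 through 9).
cifar10_classes = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]

sample_label = 6  # label of dataset[0] shown above
print(cifar10_classes[sample_label])  # frog
```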
Let’s look at the first image in this dataset
[6]:
dataset[0][0]
[6]:
2. Run CleanVision#
[7]:
imagelab = Imagelab(torchvision_dataset=dataset)
[8]:
imagelab.find_issues()
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale images ...
Issue checks completed. To see a detailed report of issues found, use imagelab.report().
3. View Results#
Get a report of all the issues found
[9]:
imagelab.report()
Issues found in order of severity in the dataset
| | issue_type | num_images |
|---:|:----------------|-------------:|
| 0 | blurry | 118 |
| 1 | near_duplicates | 40 |
| 2 | dark | 11 |
| 3 | light | 3 |
| 4 | low_information | 1 |
Top 4 examples with blurry issue in the dataset.
Top 4 sets of images with near_duplicates issue
Set: 0
Set: 1
Set: 2
Set: 3
Top 4 examples with dark issue in the dataset.
Found 3 examples with light issue in the dataset.
Found 1 example with low_information issue in the dataset.
View more information about each image, such as what types of issues it exhibits and its quality score with respect to each type of issue.
[10]:
imagelab.issues
[10]:
| | odd_aspect_ratio_score | is_odd_aspect_ratio_issue | low_information_score | is_low_information_issue | light_score | is_light_issue | grayscale_score | is_grayscale_issue | dark_score | is_dark_issue | blurry_score | is_blurry_issue | exact_duplicates_score | is_exact_duplicates_issue | near_duplicates_score | is_near_duplicates_issue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | False | 0.854301 | False | 0.797130 | False | 1 | False | 0.915605 | False | 0.484206 | False | 1.0 | False | 1.0 | False |
1 | 1.0 | False | 0.926223 | False | 0.850131 | False | 1 | False | 0.976886 | False | 0.571952 | False | 1.0 | False | 1.0 | False |
2 | 1.0 | False | 0.818548 | False | 0.852996 | False | 1 | False | 1.000000 | False | 0.532279 | False | 1.0 | False | 1.0 | False |
3 | 1.0 | False | 0.813629 | False | 0.844033 | False | 1 | False | 0.775992 | False | 0.401225 | False | 1.0 | False | 1.0 | False |
4 | 1.0 | False | 0.898399 | False | 0.958406 | False | 1 | False | 0.919564 | False | 0.503716 | False | 1.0 | False | 1.0 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
59995 | 1.0 | False | 0.860407 | False | 0.794629 | False | 1 | False | 0.996078 | False | 0.523458 | False | 1.0 | False | 1.0 | False |
59996 | 1.0 | False | 0.888932 | False | 0.939203 | False | 1 | False | 0.843293 | False | 0.498186 | False | 1.0 | False | 1.0 | False |
59997 | 1.0 | False | 0.818150 | False | 0.960275 | False | 1 | False | 0.865067 | False | 0.444907 | False | 1.0 | False | 1.0 | False |
59998 | 1.0 | False | 0.900018 | False | 0.892104 | False | 1 | False | 0.952069 | False | 0.528622 | False | 1.0 | False | 1.0 | False |
59999 | 1.0 | False | 0.858985 | False | 0.809504 | False | 1 | False | 0.932046 | False | 0.501550 | False | 1.0 | False | 1.0 | False |
60000 rows × 16 columns
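Because every issue type contributes a boolean `is_<issue>_issue` column, the issue types flagged for each image can be collected by scanning those columns. The sketch below uses a small made-up DataFrame as a stand-in for `imagelab.issues`. The values are illustrative only, and the column naming pattern is taken from the table above.

```python
import pandas as pd

# Toy stand-in for imagelab.issues (values are made up for illustration).
issues = pd.DataFrame(
    {
        "dark_score": [0.92, 0.10, 0.98],
        "is_dark_issue": [False, True, False],
        "blurry_score": [0.48, 0.20, 0.25],
        "is_blurry_issue": [False, True, True],
    }
)

# Boolean flag columns follow the pattern is_<issue>_issue.
flag_columns = [
    c for c in issues.columns if c.startswith("is_") and c.endswith("_issue")
]
# Map each flag column to its bare issue name, e.g. is_dark_issue -> dark.
issue_name = {c: c[len("is_"):-len("_issue")] for c in flag_columns}

# For every image index, list the issue types whose flag is True.
flagged = {
    idx: [issue_name[c] for c in flag_columns if row[c]]
    for idx, row in issues[flag_columns].iterrows()
}
print(flagged)  # {0: [], 1: ['dark', 'blurry'], 2: ['blurry']}
```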
Get the indices of all dark images in the dataset, sorted by their dark score in ascending order (lowest score first).
[11]:
indices = imagelab.issues.query('is_dark_issue').sort_values(by='dark_score').index.tolist()
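To make the query and sort pattern concrete, here is the same chain applied to a small made-up DataFrame standing in for `imagelab.issues`. The scores are invented for illustration, and the sketch assumes, as the sort order in this notebook implies, that a lower `dark_score` means a darker image.

```python
import pandas as pd

# Toy stand-in for imagelab.issues (values are made up for illustration).
issues = pd.DataFrame(
    {
        "dark_score": [0.92, 0.10, 0.98, 0.25, 0.05],
        "is_dark_issue": [False, True, False, True, True],
    }
)

# Keep only flagged rows, then sort ascending so the lowest scores come first.
dark = issues.query("is_dark_issue").sort_values(by="dark_score")
indices = dark.index.tolist()
print(indices)  # [4, 1, 3]
```

The resulting list holds the original row labels, so each entry can be used directly to index the concatenated dataset. Positions in the list are zero-based: `indices[0]` is the image with the lowest score.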
View the sixth darkest image in the dataset (list positions are zero-based, so this is `indices[5]`)
[12]:
dataset[indices[5]][0]
[12]:
View global information about each issue, such as how many images in the dataset suffer from this issue.
[13]:
imagelab.issue_summary
[13]:
| | issue_type | num_images |
---|---|---|
0 | blurry | 118 |
1 | near_duplicates | 40 |
2 | dark | 11 |
3 | light | 3 |
4 | low_information | 1 |
5 | grayscale | 0 |
6 | odd_aspect_ratio | 0 |
7 | exact_duplicates | 0 |
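The summary is easier to interpret relative to the dataset size. The sketch below rebuilds the summary shown above as a plain DataFrame (so it runs without CleanVision), keeps only the issue types that were actually detected, and expresses each count as a percentage of the 60,000 images. The `percent_of_dataset` column is an addition for illustration, not something CleanVision produces.

```python
import pandas as pd

# Counts copied from the issue summary shown above.
issue_summary = pd.DataFrame(
    {
        "issue_type": [
            "blurry", "near_duplicates", "dark", "light",
            "low_information", "grayscale", "odd_aspect_ratio",
            "exact_duplicates",
        ],
        "num_images": [118, 40, 11, 3, 1, 0, 0, 0],
    }
)
total_images = 60_000  # 50,000 training + 10,000 test images

# Keep only issue types with at least one flagged image.
detected = issue_summary[issue_summary["num_images"] > 0].copy()
# Illustrative extra column: share of the whole dataset, in percent.
detected["percent_of_dataset"] = 100 * detected["num_images"] / total_images
print(detected.to_string(index=False))
```

Even the most common issue, blurry images, affects well under 1% of the dataset.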
For a more detailed guide on how to use CleanVision, check the tutorial notebook.