Run CleanVision on a Hugging Face dataset

After you install these packages, you may need to restart your notebook runtime before running the rest of this notebook.

[2]:
from datasets import load_dataset
from cleanvision import Imagelab

1. Download dataset

cats_vs_dogs is a subset of the Asirra dataset, which contains millions of images of pets labeled as cats or dogs.

Note that although this is a classification dataset, CleanVision can audit images from any type of dataset, whether it is used for supervised or unsupervised learning.

Load the train split of the dataset.

[4]:
dataset = load_dataset("cats_vs_dogs", split="train")

Inspect the dataset to see more information, such as its features and the number of examples.

[5]:
dataset
[5]:
Dataset({
    features: ['image', 'labels'],
    num_rows: 23410
})

dataset.features is a dict[column_name, column_type] that contains information about the different columns in the dataset and the type of each column. Use dataset.features to find the key that contains the Image feature.

[6]:
dataset.features
[6]:
{'image': Image(decode=True, id=None),
 'labels': ClassLabel(names=['cat', 'dog'], id=None)}
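When the column name is not known in advance, the image key can be located programmatically. The sketch below is a minimal illustration, assuming only that the features object behaves like a mapping from column names to feature objects. `find_image_keys` and the stand-in classes are hypothetical names, used so that the example runs without downloading anything. With the real library, an `isinstance(feature, datasets.Image)` check would serve the same purpose.

```python
def find_image_keys(features):
    """Return the column names whose feature type is named 'Image'."""
    return [
        name
        for name, feature in features.items()
        if type(feature).__name__ == "Image"
    ]


# Stand-in feature classes (hypothetical), mirroring the output above.
class Image:
    pass


class ClassLabel:
    pass


features = {"image": Image(), "labels": ClassLabel()}
image_keys = find_image_keys(features)
print(image_keys)
```

The resulting key is the value to pass as image_key when initializing Imagelab.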

2. View sample images in the dataset

Initialize Imagelab

[7]:
imagelab = Imagelab(hf_dataset=dataset, image_key="image")
[8]:
imagelab.visualize()
Sample images from the dataset
../_images/tutorials_huggingface_dataset_15_1.svg

3. Run CleanVision

[9]:
imagelab.find_issues()
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...
Issue checks completed. 146 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().

4. View Results

Get a report of all the issues found

[10]:
imagelab.report()
Issues found in images in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | blurry           |           49 |
|  1 | exact_duplicates |           48 |
|  2 | odd_size         |           20 |
|  3 | near_duplicates  |           12 |
|  4 | odd_aspect_ratio |            5 |
|  5 | grayscale        |            5 |
|  6 | low_information  |            5 |
|  7 | dark             |            1 |
|  8 | light            |            1 |

---------------------- blurry images -----------------------

Number of examples with this issue: 49
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_1.svg
----------------- exact_duplicates images ------------------

Number of examples with this issue: 48
Examples representing most severe instances of this issue:

Set: 0
../_images/tutorials_huggingface_dataset_19_3.svg
Set: 1
../_images/tutorials_huggingface_dataset_19_5.svg
Set: 2
../_images/tutorials_huggingface_dataset_19_7.svg
Set: 3
../_images/tutorials_huggingface_dataset_19_9.svg
--------------------- odd_size images ----------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_11.svg
------------------ near_duplicates images ------------------

Number of examples with this issue: 12
Examples representing most severe instances of this issue:

Set: 0
../_images/tutorials_huggingface_dataset_19_13.svg
Set: 1
../_images/tutorials_huggingface_dataset_19_15.svg
Set: 2
../_images/tutorials_huggingface_dataset_19_17.svg
Set: 3
../_images/tutorials_huggingface_dataset_19_19.svg
----------------- odd_aspect_ratio images ------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_21.svg
--------------------- grayscale images ---------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_23.svg
------------------ low_information images ------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_25.svg
----------------------- dark images ------------------------

Number of examples with this issue: 1
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_27.svg
----------------------- light images -----------------------

Number of examples with this issue: 1
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_29.svg
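The report groups exact duplicates into sets, where each set contains images that are identical to one another. One common way to form such sets is to hash the raw bytes of every file and group files that share a digest. The sketch below illustrates that general idea only; it is not taken from CleanVision's implementation, and `group_exact_duplicates` and the byte strings are hypothetical stand-ins for real image files.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(files):
    """Group names whose byte content is identical; keep groups of two or more."""
    by_digest = defaultdict(list)
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    return [names for names in by_digest.values() if len(names) > 1]


# Made-up byte strings standing in for image files.
files = {"a.jpg": b"\x01\x02", "b.jpg": b"\x03", "c.jpg": b"\x01\x02"}
duplicate_sets = group_exact_duplicates(files)
print(duplicate_sets)
```

Byte-level hashing only catches identical files. Near duplicates, which differ slightly, need a similarity-tolerant comparison instead.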

View more information about each image, such as what types of issues it exhibits and its quality score with respect to each type of issue.

[11]:
imagelab.issues.head()
[11]:
|    | odd_size_score | is_odd_size_issue | odd_aspect_ratio_score | is_odd_aspect_ratio_issue | low_information_score | is_low_information_issue | light_score | is_light_issue | grayscale_score | is_grayscale_issue | dark_score | is_dark_issue | blurry_score | is_blurry_issue | exact_duplicates_score | is_exact_duplicates_issue | near_duplicates_score | is_near_duplicates_issue |
|---:|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|
|  0 | 0.928571 | False | 0.750000 | False | 0.886329 | False | 0.916612 | False | 1 | False | 0.915625 | False | 0.419102 | False | 1.0 | False | 1.0 | False |
|  1 | 0.853296 | False | 0.936667 | False | 0.870913 | False | 0.920491 | False | 1 | False | 0.679669 | False | 0.389699 | False | 1.0 | False | 1.0 | False |
|  2 | 0.834607 | False | 0.978000 | False | 0.926407 | False | 0.947426 | False | 1 | False | 0.916317 | False | 0.478126 | False | 1.0 | False | 1.0 | False |
|  3 | 0.904300 | False | 0.806000 | False | 0.910463 | False | 0.741769 | False | 1 | False | 0.995705 | False | 0.507646 | False | 1.0 | False | 1.0 | False |
|  4 | 0.638716 | False | 1.000000 | False | 0.936971 | False | 0.944575 | False | 1 | False | 0.968940 | False | 0.492275 | False | 1.0 | False | 1.0 | False |
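Each issue type contributes two columns to this table: a numeric score and a boolean flag named is_<issue_type>_issue. The sketch below shows one way to use that naming pattern to count how many issue types each image is flagged for. It builds a small stand-in DataFrame with made-up values rather than using the real imagelab.issues, so the numbers are illustrative only.

```python
import pandas as pd

# Small stand-in for imagelab.issues (values are made up for illustration).
issues = pd.DataFrame(
    {
        "blurry_score": [0.42, 0.05, 0.48],
        "is_blurry_issue": [False, True, False],
        "dark_score": [0.92, 0.10, 0.97],
        "is_dark_issue": [False, True, False],
    }
)

# Select the boolean flag columns by their naming pattern.
flag_columns = [
    col for col in issues.columns
    if col.startswith("is_") and col.endswith("_issue")
]

# Number of issue types flagged for each image.
issues_per_image = issues[flag_columns].sum(axis=1)

# Indices of images flagged for at least one issue type.
flagged_indices = issues.index[issues_per_image > 0].tolist()
print(flagged_indices)
```

The same selection applied to the real table would give the indices of every image that has at least one issue, which is a natural starting point for manual review.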

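Get the indices of all blurry images in the dataset, sorted by blurry score in ascending order so that the most severely blurred images come first (a lower score indicates a more severe issue).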

[12]:
indices = (
    imagelab.issues.query("is_blurry_issue")
    .sort_values(by="blurry_score")
    .index.tolist()
)

View the 9th blurriest image in the dataset. Python lists are zero-indexed, so indices[8] is the ninth entry.

[13]:
dataset[indices[8]]["image"]
[13]:
../_images/tutorials_huggingface_dataset_25_0.png
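Because sort_values sorts in ascending order by default and list positions start at zero, position k in the sorted list holds the (k+1)-th lowest score. The sketch below makes that off-by-one explicit using plain Python, assuming that a lower score means a more severe issue. The (index, score) pairs and the helper `kth_most_severe` are hypothetical.

```python
# Hypothetical (index, blurry_score) pairs for images flagged as blurry.
flagged = [(17, 0.31), (4, 0.12), (23, 0.27), (8, 0.05)]

# Ascending sort by score, mirroring sort_values(by="blurry_score").
indices = [idx for idx, score in sorted(flagged, key=lambda pair: pair[1])]


def kth_most_severe(sorted_indices, k):
    """Return the dataset index of the k-th most severe image (k starts at 1)."""
    return sorted_indices[k - 1]


print(indices)
print(kth_most_severe(indices, 1))
```

Wrapping the lookup in a one-based helper keeps the ranking in prose and the position in code from drifting apart.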

View global information about each issue, such as how many images in the dataset suffer from this issue.

[14]:
imagelab.issue_summary
[14]:
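|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | blurry           |           49 |
|  1 | exact_duplicates |           48 |
|  2 | odd_size         |           20 |
|  3 | near_duplicates  |           12 |
|  4 | odd_aspect_ratio |            5 |
|  5 | grayscale        |            5 |
|  6 | low_information  |            5 |
|  7 | dark             |            1 |
|  8 | light            |            1 |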
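The per-type counts in this summary add up to the 146 issues reported by find_issues(). This suggests the total is a sum over issue types, in which case an image flagged for two issue types would be counted twice. The sketch below checks that arithmetic and computes the share of the dataset affected by each issue type, using the counts and the row count shown earlier in this tutorial.

```python
# Per-type counts copied from the issue summary above.
issue_counts = {
    "blurry": 49,
    "exact_duplicates": 48,
    "odd_size": 20,
    "near_duplicates": 12,
    "odd_aspect_ratio": 5,
    "grayscale": 5,
    "low_information": 5,
    "dark": 1,
    "light": 1,
}

total_issues = sum(issue_counts.values())
num_rows = 23410  # from the dataset summary

# Share of the dataset flagged per issue type, in percent.
share = {name: 100 * count / num_rows for name, count in issue_counts.items()}

print(total_issues)
for name, pct in share.items():
    print(f"{name}: {pct:.2f}%")
```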

For a more detailed guide on how to use CleanVision, see the tutorial notebook.