Run CleanVision on a Hugging Face dataset

After you install the required packages (cleanvision and datasets), you may need to restart your notebook runtime before running the rest of this notebook.

[2]:

from datasets import load_dataset
from cleanvision import Imagelab

1. Download dataset

cats_vs_dogs is a subset of the Asirra dataset, which contains millions of images of pets labeled as cats or dogs.

Note that although this is a classification dataset, CleanVision can audit images from any type of dataset, whether it is intended for supervised or unsupervised learning.

Load the train split of the dataset.

[4]:

dataset = load_dataset("cats_vs_dogs", split="train")

View more information about the dataset, such as its features and number of examples.

[5]:

dataset

[5]:

Dataset({
    features: ['image', 'labels'],
    num_rows: 23410
})

dataset.features is a dict[column_name, column_type] that contains information about the different columns in the dataset and the type of each column. Use dataset.features to find the key that contains the Image feature.

[6]:

dataset.features

[6]:

{'image': Image(mode=None, decode=True),
 'labels': ClassLabel(names=['cat', 'dog'])}
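The image column can also be located programmatically instead of by reading the output. The sketch below uses a hypothetical helper, find_image_keys, that scans a features mapping for entries whose type is named Image. The stand-in classes exist only so the example runs without downloading a dataset; with datasets installed, an isinstance check against datasets.Image would be the more direct test.

```python
def find_image_keys(features):
    """Return column names whose feature type is named 'Image'.

    Hypothetical helper: it matches on the class name, so it works both
    with datasets.Image objects and with the stand-ins defined below.
    """
    return [
        name
        for name, feature in features.items()
        if type(feature).__name__ == "Image"
    ]


# Stand-ins for datasets.Image and datasets.ClassLabel (assumptions for
# illustration only; pass dataset.features in a real notebook).
class Image:
    pass


class ClassLabel:
    pass


features = {"image": Image(), "labels": ClassLabel()}
image_keys = find_image_keys(features)
print(image_keys)
```

The first entry of the returned list is the value to pass as image_key when only one image column exists.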

2. View sample images in the dataset

Initialize Imagelab with the dataset and the key of its image column.

[7]:

imagelab = Imagelab(hf_dataset=dataset, image_key="image")

[8]:

imagelab.visualize()

Sample images from the dataset

../_images/tutorials_huggingface_dataset_15_1.svg

3. Run CleanVision

[9]:

imagelab.find_issues()

Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...

Issue checks completed. 146 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().

4. View Results

Get a report of all the issues found

[10]:

imagelab.report()

Issues found in images in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | blurry           |           49 |
|  1 | exact_duplicates |           48 |
|  2 | odd_size         |           20 |
|  3 | near_duplicates  |           12 |
|  4 | odd_aspect_ratio |            5 |
|  5 | grayscale        |            5 |
|  6 | low_information  |            5 |
|  7 | dark             |            1 |
|  8 | light            |            1 |

---------------------- blurry images -----------------------

Number of examples with this issue: 49
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_1.svg

----------------- exact_duplicates images ------------------

Number of examples with this issue: 48
Examples representing most severe instances of this issue:

Set: 0

../_images/tutorials_huggingface_dataset_19_3.svg

Set: 1

../_images/tutorials_huggingface_dataset_19_5.svg

Set: 2

../_images/tutorials_huggingface_dataset_19_7.svg

Set: 3

../_images/tutorials_huggingface_dataset_19_9.svg

--------------------- odd_size images ----------------------

Number of examples with this issue: 20
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_11.svg

------------------ near_duplicates images ------------------

Number of examples with this issue: 12
Examples representing most severe instances of this issue:

Set: 0

../_images/tutorials_huggingface_dataset_19_13.svg

Set: 1

../_images/tutorials_huggingface_dataset_19_15.svg

Set: 2

../_images/tutorials_huggingface_dataset_19_17.svg

Set: 3

../_images/tutorials_huggingface_dataset_19_19.svg

----------------- odd_aspect_ratio images ------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_21.svg

--------------------- grayscale images ---------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_23.svg

------------------ low_information images ------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_25.svg

----------------------- dark images ------------------------

Number of examples with this issue: 1
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_27.svg

----------------------- light images -----------------------

Number of examples with this issue: 1
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_19_29.svg
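Duplicate issues are reported as sets of images rather than as single images, so a common follow-up is to keep one image per set and drop the rest. The sketch below assumes the sets are available as lists of dataset indices. The duplicate_sets values are made up for illustration and are not taken from this run; the helper keeps the lowest index in each set.

```python
def indices_to_drop(duplicate_sets):
    """Keep the smallest index in each duplicate set; return the others, sorted."""
    drop = set()
    for group in duplicate_sets:
        keep = min(group)
        drop.update(i for i in group if i != keep)
    return sorted(drop)


# Hypothetical duplicate sets; the indices are illustrative only.
duplicate_sets = [[12, 530], [47, 48, 901], [7, 3]]
to_drop = indices_to_drop(duplicate_sets)
print(to_drop)
```

Keeping the lowest index is an arbitrary choice; any rule that keeps exactly one member per set would serve the same purpose.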

View more information about each image, such as what types of issues it exhibits and its quality score with respect to each type of issue.

[11]:

imagelab.issues.head()

[11]:

|    | odd_size_score | is_odd_size_issue | odd_aspect_ratio_score | is_odd_aspect_ratio_issue | low_information_score | is_low_information_issue | light_score | is_light_issue | grayscale_score | is_grayscale_issue | dark_score | is_dark_issue | blurry_score | is_blurry_issue | exact_duplicates_score | is_exact_duplicates_issue | near_duplicates_score | is_near_duplicates_issue |
|---:|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|
|  0 | 0.928571 | False | 0.750000 | False | 0.886329 | False | 0.916612 | False | 1 | False | 0.915625 | False | 0.419102 | False | 1.0 | False | 1.0 | False |
|  1 | 0.853296 | False | 0.936667 | False | 0.870913 | False | 0.920491 | False | 1 | False | 0.679669 | False | 0.389699 | False | 1.0 | False | 1.0 | False |
|  2 | 0.834607 | False | 0.978000 | False | 0.926407 | False | 0.947426 | False | 1 | False | 0.916317 | False | 0.478126 | False | 1.0 | False | 1.0 | False |
|  3 | 0.904300 | False | 0.806000 | False | 0.910463 | False | 0.741769 | False | 1 | False | 0.995705 | False | 0.507646 | False | 1.0 | False | 1.0 | False |
|  4 | 0.638716 | False | 1.000000 | False | 0.936971 | False | 0.944575 | False | 1 | False | 0.968940 | False | 0.492275 | False | 1.0 | False | 1.0 | False |
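Because every issue type has its own boolean is_<issue>_issue column, the per-image table can be reduced to a single flag meaning "this image has at least one issue". The sketch below uses a small made-up DataFrame that follows the column naming of imagelab.issues; the values are illustrative and not from this dataset.

```python
import pandas as pd

# Toy stand-in for imagelab.issues; values are made up for illustration.
issues = pd.DataFrame(
    {
        "blurry_score": [0.42, 0.10, 0.48, 0.05],
        "is_blurry_issue": [False, True, False, True],
        "dark_score": [0.92, 0.68, 0.20, 0.99],
        "is_dark_issue": [False, False, True, False],
    }
)

# Select every boolean flag column by its naming pattern.
flag_columns = [
    col
    for col in issues.columns
    if col.startswith("is_") and col.endswith("_issue")
]

# True for rows where at least one issue was flagged.
has_any_issue = issues[flag_columns].any(axis=1)
flagged_indices = issues.index[has_any_issue].tolist()

# Number of distinct issue types flagged per image.
issues_per_image = issues[flag_columns].sum(axis=1)

print(flagged_indices)
print(issues_per_image.tolist())
```

In the notebook, replacing the toy DataFrame with imagelab.issues should give the same kind of result across all nine issue types.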

Get the indices of all blurry images in the dataset, sorted by blurry score in ascending order so that the blurriest images come first.

[12]:

indices = (
    imagelab.issues.query("is_blurry_issue")
    .sort_values(by="blurry_score")
    .index.tolist()
)
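The same filter-and-sort pattern applies to any issue type. As a sketch, the hypothetical helper indices_for_issue below builds the two column names from the issue name. It is demonstrated on a made-up DataFrame shaped like imagelab.issues, and it assumes that lower scores indicate more severe issues.

```python
import pandas as pd


def indices_for_issue(issues, issue_type):
    """Return indices flagged for issue_type, most severe (lowest score) first."""
    flag_col = f"is_{issue_type}_issue"
    score_col = f"{issue_type}_score"
    flagged = issues[issues[flag_col]]
    return flagged.sort_values(by=score_col).index.tolist()


# Toy stand-in for imagelab.issues; values are made up for illustration.
issues = pd.DataFrame(
    {
        "blurry_score": [0.42, 0.10, 0.48, 0.05],
        "is_blurry_issue": [False, True, False, True],
    }
)

blurry_indices = indices_for_issue(issues, "blurry")
print(blurry_indices)
```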

View the 9th blurriest image in the dataset (indices[8], since list indexing starts at 0).

[13]:

dataset[indices[8]]["image"]

[13]:

../_images/tutorials_huggingface_dataset_25_0.png
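Once the flagged indices are known, one option is to drop those images before training. The sketch below only computes which row indices to keep, using made-up numbers. In a notebook, the resulting list could be passed to dataset.select(...) from the datasets library to build the filtered dataset.

```python
def indices_to_keep(num_rows, flagged):
    """Return all row indices in range(num_rows) that are not flagged."""
    flagged_set = set(flagged)
    return [i for i in range(num_rows) if i not in flagged_set]


# Illustrative values only; in the tutorial, num_rows would be len(dataset)
# and flagged would be the indices list computed earlier.
num_rows = 10
flagged = [3, 7, 8]
keep = indices_to_keep(num_rows, flagged)
print(keep)
```

Whether flagged images should be removed, relabeled, or left in place depends on the task, so this is one possible follow-up rather than a required step.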

View global information about each issue, such as how many images in the dataset suffer from this issue.

[14]:

imagelab.issue_summary

[14]:

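|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | blurry           |           49 |
|  1 | exact_duplicates |           48 |
|  2 | odd_size         |           20 |
|  3 | near_duplicates  |           12 |
|  4 | odd_aspect_ratio |            5 |
|  5 | grayscale        |            5 |
|  6 | low_information  |            5 |
|  7 | dark             |            1 |
|  8 | light            |            1 |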
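Raw counts are easier to interpret relative to the dataset size. The sketch below re-enters the counts from the summary together with the 23410 rows reported for the train split, then computes the share of images affected by each issue type. The percent column name is an addition for illustration and is not part of the CleanVision output.

```python
import pandas as pd

# Counts copied from the issue summary shown in this tutorial.
issue_summary = pd.DataFrame(
    {
        "issue_type": [
            "blurry", "exact_duplicates", "odd_size", "near_duplicates",
            "odd_aspect_ratio", "grayscale", "low_information", "dark", "light",
        ],
        "num_images": [49, 48, 20, 12, 5, 5, 5, 1, 1],
    }
)

num_rows = 23410  # size of the train split reported when printing the dataset

# Percentage of the dataset affected by each issue type.
issue_summary["percent"] = 100 * issue_summary["num_images"] / num_rows

total_issues = int(issue_summary["num_images"].sum())
print(issue_summary.round({"percent": 3}))
print(total_issues)
```

The total matches the 146 issues reported by find_issues, although an image flagged for several issue types is counted once per type.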

For a more detailed guide on how to use CleanVision, check the tutorial notebook.