Run CleanVision on a Hugging Face dataset

After you install these packages, you may need to restart your notebook runtime before running the rest of this notebook.

[2]:

from datasets import load_dataset
from cleanvision import Imagelab

1. Download dataset

cats_vs_dogs is a subset of the Asirra dataset, which contains millions of images of pets labeled as cats or dogs.

Note that although this is a classification dataset, CleanVision can audit images from any type of dataset, whether it is used for supervised or unsupervised learning.

Load the train split of the dataset.

[3]:

dataset = load_dataset("cats_vs_dogs", split="train")

Downloading and preparing dataset cats_vs_dogs/default to /home/docs/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb...

Dataset cats_vs_dogs downloaded and prepared to /home/docs/.cache/huggingface/datasets/cats_vs_dogs/default/1.0.0/d4fe9cf31b294ed8639aa58f7d8ee13fe189011837038ed9a774fde19a911fcb. Subsequent calls will reuse this data.

View more information about the dataset, such as its features and number of examples.

[4]:

dataset

[4]:

Dataset({
    features: ['image', 'labels'],
    num_rows: 23410
})

dataset.features is a dict[column_name, column_type] that describes each column in the dataset and its type. Use dataset.features to find the key that holds the Image feature.

[5]:

dataset.features

[5]:

{'image': Image(decode=True, id=None),
 'labels': ClassLabel(names=['cat', 'dog'], id=None)}
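The lookup described above can also be done programmatically. The following is a minimal sketch: `find_image_keys` is a hypothetical helper (not part of CleanVision or `datasets`), and it matches on the feature's class name so that it runs without importing `datasets`. With the library available, `isinstance(feature, datasets.Image)` would be the more direct check.

```python
def find_image_keys(features):
    """Return the names of columns whose feature type is an Image.

    `features` is any mapping of column name -> feature object,
    such as `dataset.features`.
    """
    return [
        name
        for name, feature in features.items()
        # Compare by class name; this is an assumption made to keep the
        # sketch free of a hard dependency on the `datasets` package.
        if type(feature).__name__ == "Image"
    ]
```

For the features shown above, `find_image_keys(dataset.features)` should return a list containing only `'image'`, which is the value to pass as `image_key`.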

2. View sample images in the dataset

Initialize Imagelab with the dataset and the key of its Image feature.

[6]:

imagelab = Imagelab(hf_dataset=dataset, image_key="image")

[7]:

imagelab.visualize()

Sample images from the dataset

../_images/tutorials_huggingface_dataset_14_1.svg

3. Run CleanVision

[8]:

imagelab.find_issues()

Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...

Issue checks completed. 127 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().

4. View Results

Get a report of all the issues found

[9]:

imagelab.report()

Issues found in images in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | blurry           |           49 |
|  1 | exact_duplicates |           48 |
|  2 | near_duplicates  |           12 |
|  3 | grayscale        |            5 |
|  4 | low_information  |            5 |
|  5 | odd_aspect_ratio |            5 |
|  6 | dark             |            1 |
|  7 | light            |            1 |
|  8 | odd_size         |            1 |

---------------------- blurry images -----------------------

Number of examples with this issue: 49
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_18_1.svg

----------------- exact_duplicates images ------------------

Number of examples with this issue: 48
Examples representing most severe instances of this issue:

Set: 0

../_images/tutorials_huggingface_dataset_18_3.svg

Set: 1

../_images/tutorials_huggingface_dataset_18_5.svg

Set: 2

../_images/tutorials_huggingface_dataset_18_7.svg

Set: 3

../_images/tutorials_huggingface_dataset_18_9.svg

------------------ near_duplicates images ------------------

Number of examples with this issue: 12
Examples representing most severe instances of this issue:

Set: 0

../_images/tutorials_huggingface_dataset_18_11.svg

Set: 1

../_images/tutorials_huggingface_dataset_18_13.svg

Set: 2

../_images/tutorials_huggingface_dataset_18_15.svg

Set: 3

../_images/tutorials_huggingface_dataset_18_17.svg

--------------------- grayscale images ---------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_18_19.svg

------------------ low_information images ------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_18_21.svg

----------------- odd_aspect_ratio images ------------------

Number of examples with this issue: 5
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_18_23.svg

----------------------- dark images ------------------------

Number of examples with this issue: 1
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_18_25.svg

----------------------- light images -----------------------

Number of examples with this issue: 1
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_18_27.svg

--------------------- odd_size images ----------------------

Number of examples with this issue: 1
Examples representing most severe instances of this issue:

../_images/tutorials_huggingface_dataset_18_29.svg

View more information about each image, such as what types of issues it exhibits and its quality score with respect to each type of issue.

[10]:

imagelab.issues

[10]:

	odd_size_score	is_odd_size_issue	odd_aspect_ratio_score	is_odd_aspect_ratio_issue	low_information_score	is_low_information_issue	light_score	is_light_issue	grayscale_score	is_grayscale_issue	dark_score	is_dark_issue	blurry_score	is_blurry_issue	exact_duplicates_score	is_exact_duplicates_issue	near_duplicates_score	is_near_duplicates_issue
0	0.987927	False	0.750000	False	0.886329	False	0.916612	False	1	False	0.915625	False	0.419102	False	1.0	False	1.0	False
1	0.678716	False	0.936667	False	0.870913	False	0.920491	False	1	False	0.679669	False	0.389699	False	1.0	False	1.0	False
2	0.865139	False	0.978000	False	0.926407	False	0.947426	False	1	False	0.916317	False	0.478126	False	1.0	False	1.0	False
3	0.952989	False	0.806000	False	0.910463	False	0.741769	False	1	False	0.995705	False	0.507646	False	1.0	False	1.0	False
4	0.350643	False	1.000000	False	0.936971	False	0.944575	False	1	False	0.968940	False	0.492275	False	1.0	False	1.0	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
23405	0.607332	False	0.750000	False	0.812455	False	0.956905	False	1	False	0.755658	False	0.410711	False	1.0	False	1.0	False
23406	0.707879	False	0.748571	False	0.856650	False	0.916063	False	1	False	0.661844	False	0.366442	False	1.0	False	1.0	False
23407	0.987927	False	0.750000	False	0.902067	False	0.870589	False	1	False	0.907316	False	0.522120	False	1.0	False	1.0	False
23408	0.950985	False	0.662000	False	0.903922	False	0.792583	False	1	False	0.957101	False	0.438148	False	1.0	False	1.0	False
23409	0.281972	False	0.646667	False	0.863999	False	0.803935	False	1	False	1.000000	False	0.562462	False	1.0	False	1.0	False

23410 rows × 18 columns
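Because `imagelab.issues` is a pandas DataFrame with one boolean `is_<issue>_issue` column per check, the images flagged by at least one check can be selected with ordinary pandas operations. This is a minimal sketch, assuming the column naming pattern seen in the output above. `images_with_any_issue` is a hypothetical helper, and `toy_issues` is a made-up stand-in for `imagelab.issues`.

```python
import pandas as pd


def images_with_any_issue(issues: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows flagged by at least one issue check."""
    # Collect every boolean flag column, e.g. "is_blurry_issue".
    flag_cols = [
        c for c in issues.columns
        if c.startswith("is_") and c.endswith("_issue")
    ]
    # Keep rows where any of the flags is True.
    return issues[issues[flag_cols].any(axis=1)]


# Toy stand-in for imagelab.issues (values are invented for illustration).
toy_issues = pd.DataFrame(
    {
        "blurry_score": [0.42, 0.10, 0.51],
        "is_blurry_issue": [False, True, False],
        "dark_score": [0.92, 0.68, 0.05],
        "is_dark_issue": [False, False, True],
    }
)
flagged = images_with_any_issue(toy_issues)
```

With the tutorial's objects, `images_with_any_issue(imagelab.issues)` would give the subset of images worth reviewing by hand.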

Get indices of all blurry images in the dataset sorted by their blurry score.

[11]:

indices = imagelab.issues.query('is_blurry_issue').sort_values(by='blurry_score').index.tolist()

View the 9th blurriest image in the dataset (list indices start at 0, so indices[8] is the ninth entry).

[12]:

dataset[indices[8]]['image']

[12]:

../_images/tutorials_huggingface_dataset_24_0.png
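The same pattern works for any issue type. The sketch below assumes that lower scores mean more severe instances, as the ascending sort on `blurry_score` above implies. `most_severe_indices` is a hypothetical helper, and `toy_issues` is a made-up stand-in for `imagelab.issues`.

```python
import pandas as pd


def most_severe_indices(issues: pd.DataFrame, issue_name: str, k: int = 5) -> list:
    """Return the indices of the k most severe flagged images for one issue."""
    # Keep only the images flagged for this issue type.
    flagged = issues[issues[f"is_{issue_name}_issue"]]
    # Assumption: a lower score means a more severe instance.
    return flagged.nsmallest(k, f"{issue_name}_score").index.tolist()


# Toy stand-in for imagelab.issues (values are invented for illustration).
toy_issues = pd.DataFrame(
    {
        "blurry_score": [0.30, 0.10, 0.50, 0.20],
        "is_blurry_issue": [True, True, False, True],
    }
)
top_two = most_severe_indices(toy_issues, "blurry", k=2)
```

With the tutorial's objects, an index returned by `most_severe_indices(imagelab.issues, "blurry")` can be used to look up the image, for example `dataset[idx]['image']`.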

View global information about each issue, such as how many images in the dataset suffer from this issue.

[13]:

imagelab.issue_summary

[13]:

	issue_type	num_images
0	blurry	49
1	exact_duplicates	48
2	near_duplicates	12
3	grayscale	5
4	low_information	5
5	odd_aspect_ratio	5
6	dark	1
7	light	1
8	odd_size	1
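The per-issue counts can be cross-checked against the per-image flags in `imagelab.issues`. This is a minimal sketch, assuming each `is_<issue>_issue` column marks the images affected by that issue. `count_flagged` is a hypothetical helper, and `toy_issues` is a made-up stand-in for `imagelab.issues`.

```python
import pandas as pd


def count_flagged(issues: pd.DataFrame) -> pd.Series:
    """Count flagged images per issue type, most frequent first."""
    flag_cols = [
        c for c in issues.columns
        if c.startswith("is_") and c.endswith("_issue")
    ]
    # Summing a boolean column counts its True values.
    counts = issues[flag_cols].sum()
    # Strip the "is_" prefix and "_issue" suffix to recover the issue name.
    counts.index = [c[len("is_"):-len("_issue")] for c in flag_cols]
    return counts.sort_values(ascending=False)


# Toy stand-in for imagelab.issues (values are invented for illustration).
toy_issues = pd.DataFrame(
    {
        "is_blurry_issue": [True, True, False],
        "is_dark_issue": [False, False, True],
    }
)
counts = count_flagged(toy_issues)
```

Applied to `imagelab.issues`, the result should agree with the `num_images` column above if the flags and the summary are computed from the same checks.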

For a more detailed guide on how to use CleanVision, check the tutorial notebook.