Run CleanVision on a Hugging Face dataset

After you install these packages, you may need to restart your notebook runtime before running the rest of this notebook.

[2]:

from datasets import load_dataset, concatenate_datasets
from cleanvision.imagelab import Imagelab

1. Download dataset and concatenate all splits

To get a general picture of the issues in our data, we merge the training and test sets into one larger dataset before running CleanVision. Alternatively, you can run the package on each split separately to obtain two different reports.

CIFAR10 is a classification dataset, but CleanVision can be used to audit images from any type of dataset (for supervised or unsupervised learning).

Load all splits of the CIFAR10 dataset

[3]:

dataset_dict = load_dataset("cifar10")

Downloading builder script: 100%|██████████| 3.61k/3.61k [00:00<00:00, 1.46MB/s]
Downloading metadata: 100%|██████████| 1.66k/1.66k [00:00<00:00, 1.11MB/s]
Downloading readme: 100%|██████████| 5.00k/5.00k [00:00<00:00, 2.51MB/s]

Downloading and preparing dataset cifar10/plain_text to /home/docs/.cache/huggingface/datasets/cifar10/plain_text/1.0.0/447d6ec4733dddd1ce3bb577c7166b986eaa4c538dcd9e805ba61f35674a9de4...

Downloading data: 100%|██████████| 170M/170M [00:02<00:00, 66.3MB/s]

Dataset cifar10 downloaded and prepared to /home/docs/.cache/huggingface/datasets/cifar10/plain_text/1.0.0/447d6ec4733dddd1ce3bb577c7166b986eaa4c538dcd9e805ba61f35674a9de4. Subsequent calls will reuse this data.

100%|██████████| 2/2 [00:00<00:00, 533.69it/s]

Inspect the dataset, including its features and the number of examples in each split

[4]:

dataset_dict

[4]:

DatasetDict({
    train: Dataset({
        features: ['img', 'label'],
        num_rows: 50000
    })
    test: Dataset({
        features: ['img', 'label'],
        num_rows: 10000
    })
})
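If separate reports per split are more useful than one merged report, the Imagelab constructor used later in this tutorial can be applied to each split in turn. The following is a minimal sketch, assuming dataset_dict is the DatasetDict shown above and that the image column is named img in every split. audit_each_split is a hypothetical helper name, not part of CleanVision.

```python
def audit_each_split(dataset_dict, image_key="img"):
    """Run CleanVision once per split and collect each split's issue summary.

    Hypothetical helper: `dataset_dict` maps split names to Hugging Face
    datasets, for example the object returned by load_dataset("cifar10").
    """
    # Imported lazily so that defining the helper does not require CleanVision.
    from cleanvision.imagelab import Imagelab

    summaries = {}
    for split_name, split in dataset_dict.items():
        imagelab = Imagelab(hf_dataset=split, image_key=image_key)
        imagelab.find_issues()
        summaries[split_name] = imagelab.issue_summary
    return summaries
```

Calling audit_each_split(dataset_dict) would then give one summary table per split, at the cost of missing near duplicates that span the train and test sets.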

Concatenate train and test splits

[5]:

dataset = concatenate_datasets(list(dataset_dict.values()))

Dataset after concatenating

[6]:

dataset

[6]:

Dataset({
    features: ['img', 'label'],
    num_rows: 60000
})
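CleanVision reports issues by row index in the concatenated dataset, so it can help to map such an index back to its original split. The sketch below uses only the standard library and assumes that the concatenated rows keep the order of the splits passed in (train first, then test, as in the DatasetDict printed earlier). locate is a hypothetical helper name.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(index, split_sizes):
    """Map a row index in the concatenated dataset to (split_name, local_index).

    `split_sizes` maps split names to row counts, in concatenation order.
    """
    names = list(split_sizes)
    # Cumulative end offsets, e.g. [50000, 60000] for train followed by test.
    ends = list(accumulate(split_sizes.values()))
    if not 0 <= index < ends[-1]:
        raise IndexError(f"index {index} out of range for {ends[-1]} rows")
    position = bisect_right(ends, index)
    start = ends[position - 1] if position else 0
    return names[position], index - start


# Sizes taken from the DatasetDict printed earlier.
split_sizes = {"train": 50000, "test": 10000}
print(locate(49999, split_sizes))  # ('train', 49999)
print(locate(50000, split_sizes))  # ('test', 0)
```

With the real data, the sizes could be built as {name: split.num_rows for name, split in dataset_dict.items()} instead of being typed by hand.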

dataset.features is a dict[column_name, column_type] that contains information about the different columns in the dataset and the type of each column. Use dataset.features to find the key that contains the Image feature.

[7]:

dataset.features

[7]:

{'img': Image(decode=True, id=None),
 'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'], id=None)}

Let’s look at the first image in this dataset

[8]:

dataset[0]["img"]

[8]:

../_images/tutorials_huggingface_dataset_16_0.png

2. Run CleanVision

[9]:

imagelab = Imagelab(hf_dataset=dataset, image_key="img")
imagelab.find_issues()

Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale images ...

100%|██████████| 60000/60000 [00:57<00:00, 1043.14it/s]
100%|██████████| 60000/60000 [00:33<00:00, 1791.74it/s]

Issue checks completed. To see a detailed report of issues found, use imagelab.report().
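By default, find_issues() runs all eight checks listed in the output above. If only some of them are of interest, find_issues also accepts an issue_types dictionary keyed by issue name. The sketch below assumes that an empty dictionary as a value means "use the default settings for this check". check_selected_issues is a hypothetical helper name, not part of CleanVision.

```python
def check_selected_issues(dataset, image_key="img"):
    """Run only the dark and near_duplicates checks on a Hugging Face dataset.

    Hypothetical helper; returns the Imagelab instance for further inspection.
    """
    # Imported lazily so that defining the helper does not require CleanVision.
    from cleanvision.imagelab import Imagelab

    # Keys are issue names as printed by find_issues();
    # empty dicts are assumed to select the default settings.
    issue_types = {"dark": {}, "near_duplicates": {}}

    imagelab = Imagelab(hf_dataset=dataset, image_key=image_key)
    imagelab.find_issues(issue_types=issue_types)
    return imagelab
```

Restricting the checks this way can shorten the run on large datasets, since each issue type adds its own pass over the images.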

3. View Results

Get a report of all the issues found

[10]:

imagelab.report()

Issues found in order of severity in the dataset

|    | issue_type      |   num_images |
|---:|:----------------|-------------:|
|  0 | near_duplicates |           40 |
|  1 | dark            |           29 |
|  2 | light           |            3 |
|  3 | low_information |            1 |


Top 4 sets of images with near_duplicates issue
Set: 0

../_images/tutorials_huggingface_dataset_20_1.svg

Set: 1

../_images/tutorials_huggingface_dataset_20_3.svg

Set: 2

../_images/tutorials_huggingface_dataset_20_5.svg

Set: 3

../_images/tutorials_huggingface_dataset_20_7.svg


Top 4 examples with dark issue in the dataset.

../_images/tutorials_huggingface_dataset_20_9.svg

Found 3 examples with light issue in the dataset.

../_images/tutorials_huggingface_dataset_20_11.svg

Found 1 example with low_information issue in the dataset.

../_images/tutorials_huggingface_dataset_20_13.svg

View more information about each image, such as what types of issues it exhibits and its quality score with respect to each type of issue.

[11]:

imagelab.issues

[11]:

	odd_aspect_ratio_score	is_odd_aspect_ratio_issue	low_information_score	is_low_information_issue	light_score	is_light_issue	grayscale_score	is_grayscale_issue	dark_score	is_dark_issue	blurry_score	is_blurry_issue	is_exact_duplicates_issue	is_near_duplicates_issue
0	1.0	False	0.813863	False	0.670485	False	1	False	0.761960	False	0.447264	False	False	False
1	1.0	False	0.889314	False	0.928179	False	1	False	0.870204	False	0.497561	False	False	False
2	1.0	False	0.868758	False	0.799635	False	1	False	0.752100	False	0.507733	False	False	False
3	1.0	False	0.883888	False	0.992232	False	1	False	0.872505	False	0.530581	False	False	False
4	1.0	False	0.902695	False	0.911035	False	1	False	0.897581	False	0.530771	False	False	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
59995	1.0	False	0.860407	False	0.794629	False	1	False	0.996078	False	0.523458	False	False	False
59996	1.0	False	0.888932	False	0.939203	False	1	False	0.843293	False	0.498186	False	False	False
59997	1.0	False	0.818150	False	0.960275	False	1	False	0.865067	False	0.444907	False	False	False
59998	1.0	False	0.900018	False	0.892104	False	1	False	0.952069	False	0.528622	False	False	False
59999	1.0	False	0.858985	False	0.809504	False	1	False	0.932046	False	0.501550	False	False	False

60000 rows × 14 columns
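Because imagelab.issues is a pandas DataFrame with one boolean is_<issue_type>_issue column per check, ordinary pandas operations can summarize it. The sketch below uses a small toy DataFrame with made-up values in place of the real table (only two issue types are included) to show how to count issues per image and list the images flagged at least once.

```python
import pandas as pd

# Toy stand-in for imagelab.issues with two of the issue types (made-up values).
issues = pd.DataFrame(
    {
        "dark_score": [0.76, 0.21, 0.87, 0.90],
        "is_dark_issue": [False, True, False, False],
        "light_score": [0.67, 0.93, 0.04, 0.91],
        "is_light_issue": [False, False, True, False],
    }
)

# Boolean flag columns follow the is_<issue_type>_issue naming pattern.
flag_columns = [
    col for col in issues.columns if col.startswith("is_") and col.endswith("_issue")
]

# Number of issue types each image was flagged for.
issues_per_image = issues[flag_columns].sum(axis=1)

# Row labels of images flagged for at least one issue.
flagged = issues.index[issues_per_image > 0].tolist()
print(flagged)  # [1, 2]
```

The same lines should work unchanged on the real table by replacing the toy DataFrame with issues = imagelab.issues.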

Get the indices of all dark images in the dataset, sorted by dark score in ascending order. A lower score indicates a more severe issue, so the darkest images come first.

[12]:

indices = imagelab.issues.query('is_dark_issue').sort_values(by='dark_score').index.tolist()

View the 6th darkest image in the dataset (list positions are zero-based, so indices[5] is the sixth entry)

[13]:

dataset[indices[5]]['img']

[13]:

../_images/tutorials_huggingface_dataset_26_0.png
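To make the ordering concrete, here is the same query-and-sort chain applied to a toy DataFrame with made-up scores (not CIFAR10 results). It assumes, as in CleanVision's scoring convention, that lower scores mean a more severe issue.

```python
import pandas as pd

# Toy stand-in for imagelab.issues (made-up scores, not CIFAR10 results).
issues = pd.DataFrame(
    {
        "dark_score": [0.90, 0.12, 0.30, 0.05, 0.85],
        "is_dark_issue": [False, True, True, True, False],
    }
)

# Keep flagged rows only, then sort ascending so the darkest image comes first.
indices = issues.query("is_dark_issue").sort_values(by="dark_score").index.tolist()
print(indices)     # [3, 1, 2]
print(indices[0])  # 3, the darkest flagged image
```

Since list positions are zero-based, indices[0] is the darkest flagged image and indices[k] is the (k+1)-th darkest.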

View global information about each issue, such as how many images in the dataset suffer from this issue.

[14]:

imagelab.issue_summary

[14]:

	issue_type	num_images
0	near_duplicates	40
1	dark	29
2	light	3
3	low_information	1
4	blurry	0
5	grayscale	0
6	odd_aspect_ratio	0
7	exact_duplicates	0

For a more detailed guide on how to use CleanVision, check the tutorial notebook.