Run CleanVision on a Hugging Face dataset#
After you install these packages, you may need to restart your notebook runtime before running the rest of this tutorial.
[2]:
from datasets import load_dataset, concatenate_datasets
from cleanvision import Imagelab
1. Download dataset and concatenate all splits#
Since we’re interested in generally understanding what issues plague our data, we merge the training and test sets into one larger dataset before running CleanVision. You could alternatively just run the package on these two sets of data separately to obtain two different reports.
CIFAR10 is a classification dataset, but CleanVision can audit images from any type of dataset, whether intended for supervised or unsupervised learning.
Load all splits of the CIFAR10 dataset
[3]:
dataset_dict = load_dataset("cifar10")
Downloading and preparing dataset cifar10/plain_text to /home/docs/.cache/huggingface/datasets/cifar10/plain_text/1.0.0/447d6ec4733dddd1ce3bb577c7166b986eaa4c538dcd9e805ba61f35674a9de4...
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/envs/v0.3.0/lib/python3.11/site-packages/datasets/features/image.py:325: UserWarning: Downcasting array dtype uint8 to uint8 to be compatible with 'Pillow'
warnings.warn(f"Downcasting array dtype {dtype} to {dest_dtype} to be compatible with 'Pillow'")
Dataset cifar10 downloaded and prepared to /home/docs/.cache/huggingface/datasets/cifar10/plain_text/1.0.0/447d6ec4733dddd1ce3bb577c7166b986eaa4c538dcd9e805ba61f35674a9de4. Subsequent calls will reuse this data.
View more information about the dataset, such as its features and the number of examples in each split
[4]:
dataset_dict
[4]:
DatasetDict({
train: Dataset({
features: ['img', 'label'],
num_rows: 50000
})
test: Dataset({
features: ['img', 'label'],
num_rows: 10000
})
})
Concatenate train and test splits
[5]:
dataset = concatenate_datasets([d for d in dataset_dict.values()])
Dataset after concatenating
[6]:
dataset
[6]:
Dataset({
features: ['img', 'label'],
num_rows: 60000
})
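Because the train split comes first in the concatenation, any row index that CleanVision reports later can be mapped back to its original split. A small helper (hypothetical, not part of CleanVision or `datasets`) illustrates the arithmetic:

```python
def split_of(index, train_size=50_000):
    """Map a row index in the concatenated dataset back to (split, row-within-split)."""
    if index < train_size:
        return ("train", index)
    return ("test", index - train_size)

print(split_of(59_999))  # -> ('test', 9999)
```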
dataset.features
is a dict[column_name, column_type]
that describes the different columns in the dataset and the type of each column. Use dataset.features
to find the key of the column that holds the Image feature.
[7]:
dataset.features
[7]:
{'img': Image(decode=True, id=None),
'label': ClassLabel(names=['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'], id=None)}
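To locate the image column programmatically rather than by eye, you could scan `dataset.features` for the entry whose type is the image feature. The sketch below stands in the feature types as plain strings so it runs without loading a dataset; with a real Hugging Face dataset you would test `isinstance(feature, datasets.Image)` instead:

```python
# Stand-in for dataset.features: column name -> feature type name.
features = {"img": "Image", "label": "ClassLabel"}

# Pick the first column whose feature type is Image.
image_key = next(name for name, ftype in features.items() if ftype == "Image")
print(image_key)  # -> img
```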
Let’s look at the first image in this dataset
[8]:
dataset[0]["img"]
[8]:
2. Run CleanVision#
[9]:
imagelab = Imagelab(hf_dataset=dataset, image_key="img")
imagelab.find_issues()
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale images ...
Issue checks completed. To see a detailed report of issues found, use imagelab.report().
3. View Results#
Get a report of all the issues found
[10]:
imagelab.report()
Issues found in order of severity in the dataset
| | issue_type | num_images |
|---:|:----------------|-------------:|
| 0 | blurry | 118 |
| 1 | near_duplicates | 40 |
| 2 | dark | 11 |
| 3 | light | 3 |
| 4 | low_information | 1 |
Top 4 examples with blurry issue in the dataset.
Top 4 sets of images with near_duplicates issue
Set: 0
Set: 1
Set: 2
Set: 3
Top 4 examples with dark issue in the dataset.
Found 3 examples with light issue in the dataset.
Found 1 example with low_information issue in the dataset.
View more information about each image, such as what types of issues it exhibits and its quality score with respect to each type of issue.
[11]:
imagelab.issues
[11]:
| odd_aspect_ratio_score | is_odd_aspect_ratio_issue | low_information_score | is_low_information_issue | light_score | is_light_issue | grayscale_score | is_grayscale_issue | dark_score | is_dark_issue | blurry_score | is_blurry_issue | exact_duplicates_score | is_exact_duplicates_issue | near_duplicates_score | is_near_duplicates_issue |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | False | 0.813863 | False | 0.670485 | False | 1 | False | 0.761960 | False | 0.447264 | False | 1.0 | False | 1.0 | False |
1 | 1.0 | False | 0.889314 | False | 0.928179 | False | 1 | False | 0.870204 | False | 0.497561 | False | 1.0 | False | 1.0 | False |
2 | 1.0 | False | 0.868758 | False | 0.799635 | False | 1 | False | 0.752100 | False | 0.507733 | False | 1.0 | False | 1.0 | False |
3 | 1.0 | False | 0.883888 | False | 0.992232 | False | 1 | False | 0.872505 | False | 0.530581 | False | 1.0 | False | 1.0 | False |
4 | 1.0 | False | 0.902695 | False | 0.911035 | False | 1 | False | 0.897581 | False | 0.530771 | False | 1.0 | False | 1.0 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
59995 | 1.0 | False | 0.860407 | False | 0.794629 | False | 1 | False | 0.996078 | False | 0.523458 | False | 1.0 | False | 1.0 | False |
59996 | 1.0 | False | 0.888932 | False | 0.939203 | False | 1 | False | 0.843293 | False | 0.498186 | False | 1.0 | False | 1.0 | False |
59997 | 1.0 | False | 0.818150 | False | 0.960275 | False | 1 | False | 0.865067 | False | 0.444907 | False | 1.0 | False | 1.0 | False |
59998 | 1.0 | False | 0.900018 | False | 0.892104 | False | 1 | False | 0.952069 | False | 0.528622 | False | 1.0 | False | 1.0 | False |
59999 | 1.0 | False | 0.858985 | False | 0.809504 | False | 1 | False | 0.932046 | False | 0.501550 | False | 1.0 | False | 1.0 | False |
60000 rows × 16 columns
Get indices of all dark images in the dataset sorted by their dark score.
[12]:
indices = imagelab.issues.query('is_dark_issue').sort_values(by='dark_score').index.tolist()
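`imagelab.issues` is a regular pandas DataFrame, so the `query`/`sort_values` pattern above is plain pandas. A toy reproduction with made-up scores (not CleanVision output) shows what it does:

```python
import pandas as pd

# Made-up scores: a lower dark_score means a darker image.
issues = pd.DataFrame(
    {"dark_score": [0.9, 0.1, 0.4], "is_dark_issue": [False, True, True]}
)

# Keep flagged rows, darkest first.
indices = issues.query("is_dark_issue").sort_values(by="dark_score").index.tolist()
print(indices)  # -> [1, 2]
```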
View the 6th darkest image in the dataset (the indices are sorted in ascending order of dark score, so indices[0] is the darkest)
[13]:
dataset[indices[5]]['img']
[13]:
View global information about each issue, such as how many images in the dataset suffer from it.
[14]:
imagelab.issue_summary
[14]:
issue_type | num_images | |
---|---|---|
0 | blurry | 118 |
1 | near_duplicates | 40 |
2 | dark | 11 |
3 | light | 3 |
4 | low_information | 1 |
5 | grayscale | 0 |
6 | odd_aspect_ratio | 0 |
7 | exact_duplicates | 0 |
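`imagelab.issue_summary` is also a plain pandas DataFrame, so it can be filtered like any other, for example to keep only issue types that actually occur. A toy reproduction (made-up counts, not CleanVision output):

```python
import pandas as pd

issue_summary = pd.DataFrame(
    {"issue_type": ["blurry", "near_duplicates", "grayscale"],
     "num_images": [118, 40, 0]}
)

# Drop issue types with zero affected images.
found = issue_summary[issue_summary["num_images"] > 0]
print(found["issue_type"].tolist())  # -> ['blurry', 'near_duplicates']
```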
For a more detailed guide on how to use CleanVision, check the tutorial notebook.