Run CleanVision on a Hugging Face dataset
After installing the required packages (cleanvision and datasets), you may need to restart your notebook runtime before running the rest of this notebook.
[2]:
from datasets import load_dataset
from cleanvision import Imagelab
1. Download dataset
cats_vs_dogs is a subset of the Asirra dataset, which contains millions of images of pets labeled as cats or dogs.
Note that although this is a classification dataset, CleanVision can audit images from any type of dataset, whether it is used for supervised or unsupervised learning.
Load the train split of the dataset.
[4]:
dataset = load_dataset("cats_vs_dogs", split="train")
Inspect the dataset to see its features and number of examples.
[5]:
dataset
[5]:
Dataset({
    features: ['image', 'labels'],
    num_rows: 23410
})
dataset.features is a dict[column_name, column_type] that contains information about the columns in the dataset and the type of each column. Use dataset.features to find the key that contains the Image feature.
[6]:
dataset.features
[6]:
{'image': Image(decode=True, id=None),
'labels': ClassLabel(names=['cat', 'dog'], id=None)}
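When a dataset has many columns, the lookup described above can be automated. The following is a minimal sketch and not part of CleanVision: find_image_keys is a hypothetical helper, and the two stand-in classes only mimic feature types so that the snippet runs without downloading anything. With a real dataset you would pass dataset.features together with the Image class from the datasets package.

```python
from typing import Any, List, Mapping


def find_image_keys(features: Mapping[str, Any], image_type: type) -> List[str]:
    """Return the column names whose feature type is an instance of image_type."""
    return [name for name, ftype in features.items() if isinstance(ftype, image_type)]


# Stand-in feature types (hypothetical), used only to keep this sketch self-contained.
class FakeImage:
    pass


class FakeClassLabel:
    pass


features = {"image": FakeImage(), "labels": FakeClassLabel()}
image_keys = find_image_keys(features, FakeImage)
print(image_keys)

# With the real dataset (assumes the datasets package is installed):
#   from datasets import Image
#   find_image_keys(dataset.features, Image)
```

The resulting key is the value to pass as image_key when initializing Imagelab.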
2. View sample images in the dataset
Initialize Imagelab
[7]:
imagelab = Imagelab(hf_dataset=dataset, image_key="image")
[8]:
imagelab.visualize()
Sample images from the dataset
3. Run CleanVision
[9]:
imagelab.find_issues()
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale, odd_size images ...
Issue checks completed. 146 issues found in the dataset. To see a detailed report of issues found, use imagelab.report().
4. View Results
Get a report of all the issues found
[10]:
imagelab.report()
Issues found in images in order of severity in the dataset
| | issue_type | num_images |
|---:|:-----------------|-------------:|
| 0 | blurry | 49 |
| 1 | exact_duplicates | 48 |
| 2 | odd_size | 20 |
| 3 | near_duplicates | 12 |
| 4 | odd_aspect_ratio | 5 |
| 5 | grayscale | 5 |
| 6 | low_information | 5 |
| 7 | dark | 1 |
| 8 | light | 1 |
---------------------- blurry images -----------------------
Number of examples with this issue: 49
Examples representing most severe instances of this issue:
----------------- exact_duplicates images ------------------
Number of examples with this issue: 48
Examples representing most severe instances of this issue:
Set: 0
Set: 1
Set: 2
Set: 3
--------------------- odd_size images ----------------------
Number of examples with this issue: 20
Examples representing most severe instances of this issue:
------------------ near_duplicates images ------------------
Number of examples with this issue: 12
Examples representing most severe instances of this issue:
Set: 0
Set: 1
Set: 2
Set: 3
----------------- odd_aspect_ratio images ------------------
Number of examples with this issue: 5
Examples representing most severe instances of this issue:
--------------------- grayscale images ---------------------
Number of examples with this issue: 5
Examples representing most severe instances of this issue:
------------------ low_information images ------------------
Number of examples with this issue: 5
Examples representing most severe instances of this issue:
----------------------- dark images ------------------------
Number of examples with this issue: 1
Examples representing most severe instances of this issue:
----------------------- light images -----------------------
Number of examples with this issue: 1
Examples representing most severe instances of this issue:
View more information about each image, such as what types of issues it exhibits and its quality score with respect to each type of issue.
[11]:
imagelab.issues.head()
[11]:
| | odd_size_score | is_odd_size_issue | odd_aspect_ratio_score | is_odd_aspect_ratio_issue | low_information_score | is_low_information_issue | light_score | is_light_issue | grayscale_score | is_grayscale_issue | dark_score | is_dark_issue | blurry_score | is_blurry_issue | exact_duplicates_score | is_exact_duplicates_issue | near_duplicates_score | is_near_duplicates_issue |
|---:|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|---:|:---|
| 0 | 0.928571 | False | 0.750000 | False | 0.886329 | False | 0.916612 | False | 1 | False | 0.915625 | False | 0.419102 | False | 1.0 | False | 1.0 | False |
| 1 | 0.853296 | False | 0.936667 | False | 0.870913 | False | 0.920491 | False | 1 | False | 0.679669 | False | 0.389699 | False | 1.0 | False | 1.0 | False |
| 2 | 0.834607 | False | 0.978000 | False | 0.926407 | False | 0.947426 | False | 1 | False | 0.916317 | False | 0.478126 | False | 1.0 | False | 1.0 | False |
| 3 | 0.904300 | False | 0.806000 | False | 0.910463 | False | 0.741769 | False | 1 | False | 0.995705 | False | 0.507646 | False | 1.0 | False | 1.0 | False |
| 4 | 0.638716 | False | 1.000000 | False | 0.936971 | False | 0.944575 | False | 1 | False | 0.968940 | False | 0.492275 | False | 1.0 | False | 1.0 | False |
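Get the indices of all blurry images in the dataset, sorted by blurry score in ascending order so that the blurriest images come first (lower scores indicate more severe issues).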
[12]:
indices = (
    imagelab.issues.query("is_blurry_issue")
    .sort_values(by="blurry_score")
    .index.tolist()
)
View the ninth-blurriest image in the dataset (indices[8], since list indexing starts at 0).
[13]:
dataset[indices[8]]["image"]
[13]:
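The same pattern applies to any other issue type, because the columns of imagelab.issues follow the is_<issue>_issue and <issue>_score naming shown in the table above. Below is a minimal sketch, assuming pandas is installed: most_severe_indices is a hypothetical helper, and the small DataFrame holds made-up scores standing in for imagelab.issues. It also assumes that lower scores mark more severe instances, which is why an ascending sort puts the worst images first.

```python
import pandas as pd


def most_severe_indices(issues: pd.DataFrame, issue_type: str) -> list:
    """Indices of images flagged with issue_type, lowest (most severe) score first."""
    flagged = issues[issues[f"is_{issue_type}_issue"]]
    return flagged.sort_values(by=f"{issue_type}_score").index.tolist()


# Made-up values standing in for imagelab.issues.
issues = pd.DataFrame(
    {
        "blurry_score": [0.42, 0.10, 0.25, 0.05],
        "is_blurry_issue": [False, True, True, True],
        "dark_score": [0.92, 0.30, 0.97, 0.99],
        "is_dark_issue": [False, True, False, False],
    }
)

blurry_indices = most_severe_indices(issues, "blurry")
dark_indices = most_severe_indices(issues, "dark")
print(blurry_indices)  # [3, 1, 2]
print(dark_indices)    # [1]
```

With the real results, a call such as most_severe_indices(imagelab.issues, "odd_size") should return the flagged images in the same most-severe-first order.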
View global information about each issue, such as how many images in the dataset suffer from this issue.
[14]:
imagelab.issue_summary
[14]:
| | issue_type | num_images |
|---:|:-----------------|-------------:|
| 0 | blurry | 49 |
| 1 | exact_duplicates | 48 |
| 2 | odd_size | 20 |
| 3 | near_duplicates | 12 |
| 4 | odd_aspect_ratio | 5 |
| 5 | grayscale | 5 |
| 6 | low_information | 5 |
| 7 | dark | 1 |
| 8 | light | 1 |
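To put these counts in perspective, it can help to express each one as a share of the whole dataset. Below is a minimal sketch that uses the numbers printed above and the 23410 rows reported for the train split; the variable names are illustrative. The total counts issues rather than distinct images, since a single image may be flagged for more than one issue type.

```python
# Counts copied from the issue summary above.
issue_counts = {
    "blurry": 49,
    "exact_duplicates": 48,
    "odd_size": 20,
    "near_duplicates": 12,
    "odd_aspect_ratio": 5,
    "grayscale": 5,
    "low_information": 5,
    "dark": 1,
    "light": 1,
}
num_rows = 23410  # size of the train split reported by the dataset

# Summing the per-type counts reproduces the total reported by find_issues().
total_issues = sum(issue_counts.values())
print(f"total issues: {total_issues}")

# Share of the dataset affected by each issue type, in percent.
for issue_type, count in issue_counts.items():
    share = 100 * count / num_rows
    print(f"{issue_type:<18}{count:>4}  {share:.3f}%")
```

On the live objects, the same ratio could presumably be computed as imagelab.issue_summary["num_images"] / len(dataset), assuming issue_summary is the DataFrame displayed above.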
For a more detailed guide on how to use CleanVision, see the tutorial notebook.