Overview#

What is CleanVision?#

CleanVision automatically detects various issues in image datasets, such as images that are (near) duplicates, blurry, over/under-exposed, etc. This data-centric AI package is designed as a quick first step for any computer vision project: it finds problems in your dataset that you may want to address before applying machine learning.

|    | Issue Type       | Description                                                                                      | Issue Key        |
|---:|:-----------------|:-------------------------------------------------------------------------------------------------|:-----------------|
|  1 | Light            | Images that are too bright/washed out in the dataset                                             | light            |
|  2 | Dark             | Images that are irregularly dark                                                                 | dark             |
|  3 | Odd Aspect Ratio | Images with an unusual aspect ratio (i.e. overly skinny/wide)                                    | odd_aspect_ratio |
|  4 | Exact Duplicates | Images that are exact duplicates of each other                                                   | exact_duplicates |
|  5 | Near Duplicates  | Images that are almost visually identical to each other (e.g. same image with different filters) | near_duplicates  |
|  6 | Blurry           | Images that are blurry or out of focus                                                           | blurry           |
|  7 | Grayscale        | Images that are grayscale (lacking color)                                                        | grayscale        |
|  8 | Low Information  | Images that lack much information (e.g. a completely black image with a few white dots)          | low_information  |

The Issue Key column specifies the name for each type of issue in CleanVision code. See our examples which use these keys to detect only particular issue types and specify nondefault parameter settings to use when checking for certain issues.

This notebook uses an example dataset that you can download using these commands:

wget -nc 'https://cleanlab-public.s3.amazonaws.com/CleanVision/image_files.zip'

unzip -q image_files.zip

Examples#

1. Using CleanVision to detect default issue types#

[4]:
from cleanvision.imagelab import Imagelab

# Path to your dataset, you can specify your own dataset path
dataset_path = "./image_files/"

# Initialize imagelab with your dataset
imagelab = Imagelab(data_path=dataset_path)

# Visualize a few sample images from the dataset
imagelab.visualize(num_images=8)

# Find issues
imagelab.find_issues()
Reading images from /home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files
Sample images from the dataset
../_images/tutorials_tutorial_11_1.svg
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale images ...
100%|██████████| 595/595 [00:05<00:00, 115.49it/s]
100%|██████████| 595/595 [00:02<00:00, 210.45it/s]
Issue checks completed. To see a detailed report of issues found, use imagelab.report().

The report() method helps you quickly understand the major issues detected in the dataset. It reports the number of images in the dataset that exhibit each type of issue, and shows example images corresponding to the most severe instances of each issue.

[5]:
imagelab.report()
Issues found in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | grayscale        |           20 |
|  1 | near_duplicates  |           20 |
|  2 | exact_duplicates |           19 |
|  3 | dark             |           13 |
|  4 | blurry           |           10 |
|  5 | odd_aspect_ratio |            8 |
|  6 | light            |            5 |
|  7 | low_information  |            4 |


Top 4 examples with grayscale issue in the dataset.
../_images/tutorials_tutorial_13_1.svg

Top 4 sets of images with near_duplicates issue
Set: 0
../_images/tutorials_tutorial_13_3.svg
Set: 1
../_images/tutorials_tutorial_13_5.svg
Set: 2
../_images/tutorials_tutorial_13_7.svg
Set: 3
../_images/tutorials_tutorial_13_9.svg

Top 4 sets of images with exact_duplicates issue
Set: 0
../_images/tutorials_tutorial_13_11.svg
Set: 1
../_images/tutorials_tutorial_13_13.svg
Set: 2
../_images/tutorials_tutorial_13_15.svg
Set: 3
../_images/tutorials_tutorial_13_17.svg

Top 4 examples with dark issue in the dataset.
../_images/tutorials_tutorial_13_19.svg

Top 4 examples with blurry issue in the dataset.
../_images/tutorials_tutorial_13_21.svg

Top 4 examples with odd_aspect_ratio issue in the dataset.
../_images/tutorials_tutorial_13_23.svg

Top 4 examples with light issue in the dataset.
../_images/tutorials_tutorial_13_25.svg

Top 4 examples with low_information issue in the dataset.
../_images/tutorials_tutorial_13_27.svg

The main way to interface with your data is via the Imagelab class. This class can be used to understand the issues in your dataset at a high level (global overview) and low level (issues and quality scores for each image), as well as additional information about the dataset. It has three main attributes:

- Imagelab.issue_summary
- Imagelab.issues
- Imagelab.info

imagelab.issue_summary#

DataFrame with a global summary of all issue types detected in your dataset and the overall prevalence of each type.

In each row:
issue_type - name of the issue
num_images - number of images of that issue type found in the dataset
[6]:
imagelab.issue_summary
[6]:
issue_type num_images
0 grayscale 20
1 near_duplicates 20
2 exact_duplicates 19
3 dark 13
4 blurry 10
5 odd_aspect_ratio 8
6 light 5
7 low_information 4

imagelab.issues#

DataFrame assessing each image in your dataset, reporting which issues each image exhibits and a quality score for each type of issue.

[7]:
imagelab.issues.head()
[7]:
odd_aspect_ratio_score is_odd_aspect_ratio_issue low_information_score is_low_information_issue light_score is_light_issue grayscale_score is_grayscale_issue dark_score is_dark_issue blurry_score is_blurry_issue is_exact_duplicates_issue is_near_duplicates_issue
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_0.png 1.0 False 0.806332 False 0.925490 False 1 False 1.000000 False 0.373038 False False False
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_1.png 1.0 False 0.923116 False 0.906609 False 1 False 0.990676 False 0.345064 False False False
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_10.png 1.0 False 0.875129 False 0.995127 False 1 False 0.795937 False 0.534317 False False False
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_100.png 1.0 False 0.916140 False 0.889762 False 1 False 0.827587 False 0.494283 False False False
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_101.png 1.0 False 0.779338 False 0.960784 False 0 True 0.992157 False 0.471333 False False False

There is a Boolean column for each issue type, showing whether each image exhibits that type of issue. For example, rows where the is_dark_issue column contains True correspond to images that appear too dark.

For the dark issue type (and more generally for other types of issues), there is a numeric column dark_score, which assesses how severe this issue is in each image. These quality scores lie between 0 and 1, where lower values indicate more severe instances of the issue (images which are darker in this example).

One use case for imagelab.issues is to select all images exhibiting one particular type of issue and rank them by their quality score. Here’s how to get all blurry images ranked by their blurry_score (lower scores indicate higher severity):

[8]:
blurry_images = imagelab.issues[imagelab.issues["is_blurry_issue"]].sort_values(by="blurry_score")
blurry_image_files = blurry_images.index.tolist()

Visualize the four most severely blurry images:

[9]:
imagelab.visualize(image_files=blurry_image_files[:4])
../_images/tutorials_tutorial_21_0.svg

A shorter way to accomplish the above task is to specify an issue type in imagelab.visualize(). This will show images ordered by the severity of this issue within them.

[10]:
imagelab.visualize(issue_types=['blurry'])

Top 4 examples with blurry issue in the dataset.
../_images/tutorials_tutorial_23_1.svg
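
The Boolean columns can also be combined, for example to list every image flagged for at least one issue. The following is a minimal sketch on a small hand-built DataFrame that only mimics the layout of imagelab.issues (the file names and values are made up); with a real Imagelab instance you would use imagelab.issues in its place.

```python
import pandas as pd

# Toy stand-in for imagelab.issues: one row per image, made-up values.
issues = pd.DataFrame(
    {
        "dark_score": [0.95, 0.10, 0.80],
        "is_dark_issue": [False, True, False],
        "blurry_score": [0.20, 0.50, 0.60],
        "is_blurry_issue": [True, False, False],
    },
    index=["a.png", "b.png", "c.png"],
)

# Boolean flag columns follow the is_<issue key>_issue naming pattern.
flag_columns = [
    c for c in issues.columns if c.startswith("is_") and c.endswith("_issue")
]

# Keep the rows where at least one flag is True.
flagged = issues[issues[flag_columns].any(axis=1)]
flagged_files = flagged.index.tolist()
print(flagged_files)
```

Here flagged_files contains a.png (blurry) and b.png (dark), while c.png has no flag set and is left out.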

imagelab.info#

This is a nested dictionary containing statistics about the images and other miscellaneous information stored while checking for issues in the dataset. Beware: this dictionary may be large and poorly organized (it is only intended for advanced users).

The possible keys in this dict are statistics plus one key for each issue type.

[11]:
imagelab.info.keys()
[11]:
dict_keys(['statistics', 'dark', 'light', 'odd_aspect_ratio', 'low_information', 'blurry', 'grayscale', 'exact_duplicates', 'near_duplicates'])

imagelab.info['statistics'] is also a dict containing statistics calculated on images while checking for issues in the dataset.

[12]:
imagelab.info['statistics'].keys()
[12]:
dict_keys(['brightness', 'aspect_ratio', 'entropy', 'blurriness', 'color_space'])

You can see entropy values for each image in the dataset as shown below.

[13]:
imagelab.info['statistics']['entropy']
[13]:
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_0.png      8.063322
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_1.png      9.231165
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_10.png     8.751287
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_100.png    9.161396
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_101.png    7.793376
                                                                                                                                   ...
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_95.png     8.296915
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_96.png     9.155042
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_97.png     9.159282
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_98.png     8.961497
/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_99.png     8.136469
Name: entropy, Length: 595, dtype: float64
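
These raw statistics appear to feed the quality scores. In the output shown on this page, each low_information_score in imagelab.issues equals the corresponding entropy value divided by 10. The sketch below checks this against the five rows printed in both outputs; it is an observation about this particular output, not a documented guarantee of CleanVision.

```python
# Values copied from the imagelab.issues and entropy outputs on this page.
entropy = {
    "image_0.png": 8.063322,
    "image_1.png": 9.231165,
    "image_10.png": 8.751287,
    "image_100.png": 9.161396,
    "image_101.png": 7.793376,
}
low_information_score = {
    "image_0.png": 0.806332,
    "image_1.png": 0.923116,
    "image_10.png": 0.875129,
    "image_100.png": 0.916140,
    "image_101.png": 0.779338,
}

# Assumption inferred from the printed values: score = entropy / 10.
for name, value in entropy.items():
    inferred = value / 10
    matches = abs(inferred - low_information_score[name]) < 1e-6
    print(f"{name}: entropy={value} inferred_score={inferred:.6f} matches={matches}")
```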

Duplicate sets#

imagelab.info can also be used to retrieve which images are near or exact duplicates of each other.

imagelab.issue_summary shows the number of exact duplicate images but does not show how many sets of duplicate images exist in the dataset. To see the number of exact duplicate sets, you can use imagelab.info:

[14]:
imagelab.info['exact_duplicates']['num_sets']
[14]:
9

You can also see exactly which images belong to each (exact/near) duplicate set using imagelab.info.

[15]:
imagelab.info['exact_duplicates']['sets']
[15]:
[['/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_142.png',
  '/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_236.png'],
 ['/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_170.png',
  '/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_299.png'],
 ['/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_190.png',
  '/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_197.png'],
 ['/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_288.png',
  '/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_289.png'],
 ['/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_292.png',
  '/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_348.png',
  '/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_492.png'],
 ['/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_30.png',
  '/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_55.png'],
 ['/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_351.png',
  '/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_372.png'],
 ['/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_379.png',
  '/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_579.png'],
 ['/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_550.png',
  '/home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files/image_7.png']]
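
A common follow-up is to keep one image from each duplicate set and drop the rest. The sketch below does this for a list shaped like imagelab.info['exact_duplicates']['sets']; the paths are shortened to file names for readability, and keeping the first file of each set is an arbitrary choice made for this example, not a CleanVision recommendation.

```python
# Same grouping as the output above, with paths shortened to file names.
duplicate_sets = [
    ["image_142.png", "image_236.png"],
    ["image_170.png", "image_299.png"],
    ["image_190.png", "image_197.png"],
    ["image_288.png", "image_289.png"],
    ["image_292.png", "image_348.png", "image_492.png"],
    ["image_30.png", "image_55.png"],
    ["image_351.png", "image_372.png"],
    ["image_379.png", "image_579.png"],
    ["image_550.png", "image_7.png"],
]

# Keep the first file of every set and mark the others as redundant.
files_to_keep = [files[0] for files in duplicate_sets]
files_to_drop = [f for files in duplicate_sets for f in files[1:]]

num_duplicate_images = sum(len(files) for files in duplicate_sets)
print(len(duplicate_sets), "sets containing", num_duplicate_images, "images")
print(len(files_to_drop), "redundant copies:", files_to_drop)
```

The 9 sets contain 19 images in total, which matches the exact_duplicates count in imagelab.issue_summary; dropping the redundant copies would remove 10 files.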

The rest of this notebook demonstrates more advanced/customized workflows you can do with CleanVision.

2. Using CleanVision to detect specific issues#

If only a few issue types are relevant for your dataset, you can save time by not running every check. To do so, specify issue_types as an argument to find_issues().

issue_types is a dict whose keys are the issue types you want to detect and whose values are dicts of hyperparameters. This example uses the default hyperparameters, in which case you can leave the hyperparameter dict empty. To find the keys for issue types, check the table above that lists all issue types supported by CleanVision.

[16]:
# Initialize imagelab with your dataset
imagelab = Imagelab(data_path=dataset_path)

# specify issue types to detect
issue_types = {"dark": {}}

# Find issues
imagelab.find_issues(issue_types)

# Show a report of the issues found
imagelab.report()
Reading images from /home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files
Checking for dark images ...
100%|██████████| 595/595 [00:03<00:00, 155.96it/s]
Issue checks completed. To see a detailed report of issues found, use imagelab.report().
Issues found in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           13 |


Top 4 examples with dark issue in the dataset.
../_images/tutorials_tutorial_38_3.svg
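
When several issue types are needed, the issue_types dict can be built from a list of keys. The helper below is a hypothetical convenience function (build_issue_types is not part of CleanVision); it only validates keys against the issue key table at the top of this tutorial and fills in empty hyperparameter dicts where no override is given.

```python
from typing import Any, Dict, Iterable, Optional

# Issue keys listed in the table at the top of this tutorial.
KNOWN_ISSUE_KEYS = {
    "light", "dark", "odd_aspect_ratio", "exact_duplicates",
    "near_duplicates", "blurry", "grayscale", "low_information",
}


def build_issue_types(
    keys: Iterable[str],
    overrides: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Return a dict shaped like issue_types (hypothetical helper)."""
    overrides = overrides or {}
    issue_types: Dict[str, Dict[str, Any]] = {}
    for key in keys:
        if key not in KNOWN_ISSUE_KEYS:
            raise ValueError(f"Unknown issue key: {key!r}")
        # An empty dict means "use the default hyperparameters".
        issue_types[key] = dict(overrides.get(key, {}))
    return issue_types


issue_types = build_issue_types(["dark", "blurry"], {"dark": {"threshold": 0.2}})
print(issue_types)
```

The resulting dict has the same shape as the one passed to imagelab.find_issues(issue_types) in the cell above.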

3. Check for additional types of issues using the same instance#

Suppose you also want to check for blurry images after having already detected dark images in the dataset. You can use the same Imagelab instance to incrementally check for another type of issue like blurry images.

[17]:
issue_types = {"blurry": {}}

imagelab.find_issues(issue_types)

imagelab.report()
Checking for blurry images ...
100%|██████████| 595/595 [00:02<00:00, 228.10it/s]
Issue checks completed. To see a detailed report of issues found, use imagelab.report().
Issues found in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | dark         |           13 |
|  1 | blurry       |           10 |


Top 4 examples with dark issue in the dataset.
../_images/tutorials_tutorial_41_3.svg

Top 4 examples with blurry issue in the dataset.
../_images/tutorials_tutorial_41_5.svg

4. Save and load#

CleanVision also provides save and load functionality, which you can use to save the results and load them later to view them or run more checks.

For saving, specify force=True to overwrite existing files.

[18]:
save_path = "./results"
imagelab.save(save_path)
Saved Imagelab to folder: ./results
The data path and dataset must be not be changed to maintain consistent state when loading this Imagelab

For loading a saved instance, specify dataset_path to help check for any inconsistencies between dataset paths in the previous and current run.

[19]:
imagelab = Imagelab.load(save_path, dataset_path)
Successfully loaded Imagelab

5. Check for an issue with a different threshold#

You can use the loaded imagelab instance to check for an issue type with a custom hyperparameter. The hyperparameters are described below, followed by a table of the hyperparameters that each issue type supports and their permissible values.

threshold - All images with scores below this threshold will be flagged as an issue.

hash_size - Controls how much detail of an image is kept when computing its perceptual hash. Larger sizes retain more detail.

hash_type - Type of perceptual hash to use. Currently whash and phash are the supported hash types. Check here for more details on these hash types.

|    | Issue Key        | Hyperparameters                                    |
|---:|:-----------------|:---------------------------------------------------|
|  1 | light            | threshold (between 0 and 1)                        |
|  2 | dark             | threshold (between 0 and 1)                        |
|  3 | odd_aspect_ratio | threshold (between 0 and 1)                        |
|  4 | exact_duplicates | N/A                                                |
|  5 | near_duplicates  | hash_size (power of 2), hash_types (whash, phash)  |
|  6 | blurry           | threshold (between 0 and 1)                        |
|  7 | grayscale        | threshold (between 0 and 1)                        |
|  8 | low_information  | threshold (between 0 and 1)                        |

[20]:
issue_types = {"dark": {"threshold": 0.2}}
imagelab.find_issues(issue_types)

imagelab.report()
Checking for dark images ...
Issue checks completed. To see a detailed report of issues found, use imagelab.report().
Issues found in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | blurry       |           10 |
|  1 | dark         |            8 |


Top 4 examples with blurry issue in the dataset.
../_images/tutorials_tutorial_50_1.svg

Top 4 examples with dark issue in the dataset.
../_images/tutorials_tutorial_50_3.svg

Note that the number of images with the dark issue has dropped from 13 in the previous run to 8 with the threshold of 0.2.
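
The effect of the threshold can be sketched in a few lines. The scores below are made up for illustration; the only rule taken from this tutorial is that an image is flagged when its score is below the threshold. The value 0.4 is an arbitrary comparison point and is not claimed to be CleanVision's default for the dark issue.

```python
# Hypothetical dark scores (lower means darker); not taken from the dataset.
dark_scores = {
    "a.png": 0.05,
    "b.png": 0.15,
    "c.png": 0.30,
    "d.png": 0.85,
}


def flag_issues(scores, threshold):
    """Return the files whose score falls below the threshold."""
    return [name for name, score in scores.items() if score < threshold]


flagged_loose = flag_issues(dark_scores, threshold=0.4)
flagged_strict = flag_issues(dark_scores, threshold=0.2)
print(flagged_loose)   # ['a.png', 'b.png', 'c.png']
print(flagged_strict)  # ['a.png', 'b.png']
```

Lowering the threshold can only shrink the set of flagged images, since every score below the lower threshold is also below the higher one.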

6. Run CleanVision for default issue types, but override hyperparameters for one or more issues#

[21]:
imagelab = Imagelab(data_path=dataset_path)

# Check for all default issue types
imagelab.find_issues()

# Specify an issue with custom hyperparameters
issue_types = {"odd_aspect_ratio": {"threshold": 0.2}}

# Run find issues again with specified issue types
imagelab.find_issues(issue_types)


# Pass list of issue_types to imagelab.report() to report only those issue_types
imagelab.report(["odd_aspect_ratio", "low_information"])
Reading images from /home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files
Checking for dark, light, odd_aspect_ratio, low_information, exact_duplicates, near_duplicates, blurry, grayscale images ...
100%|██████████| 595/595 [00:04<00:00, 123.59it/s]
100%|██████████| 595/595 [00:02<00:00, 218.30it/s]
Issue checks completed. To see a detailed report of issues found, use imagelab.report().
Checking for odd_aspect_ratio images ...
Issue checks completed. To see a detailed report of issues found, use imagelab.report().
Issues found in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  6 | low_information  |            4 |
|  7 | odd_aspect_ratio |            1 |


Top 4 examples with low_information issue in the dataset.
../_images/tutorials_tutorial_53_3.svg
Found 1 example with odd_aspect_ratio issue in the dataset.
../_images/tutorials_tutorial_53_5.svg

7. Customize report#

The report can also be customized in various ways to help with the analysis. For example, you can change the verbosity to show more or less information on the issues found; the default is verbosity=1.

[22]:
# Change verbosity
imagelab.report(verbosity=2)
Issues found in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  0 | grayscale        |           20 |
|  1 | near_duplicates  |           20 |
|  2 | exact_duplicates |           19 |
|  3 | dark             |           13 |
|  4 | blurry           |           10 |
|  5 | light            |            5 |
|  6 | low_information  |            4 |
|  7 | odd_aspect_ratio |            1 |


Top 8 examples with grayscale issue in the dataset.
../_images/tutorials_tutorial_56_1.svg

Top 8 sets of images with near_duplicates issue
Set: 0
../_images/tutorials_tutorial_56_3.svg
Set: 1
../_images/tutorials_tutorial_56_5.svg
Set: 2
../_images/tutorials_tutorial_56_7.svg
Set: 3
../_images/tutorials_tutorial_56_9.svg
Set: 4
../_images/tutorials_tutorial_56_11.svg
Set: 5
../_images/tutorials_tutorial_56_13.svg
Set: 6
../_images/tutorials_tutorial_56_15.svg
Set: 7
../_images/tutorials_tutorial_56_17.svg

Top 8 sets of images with exact_duplicates issue
Set: 0
../_images/tutorials_tutorial_56_19.svg
Set: 1
../_images/tutorials_tutorial_56_21.svg
Set: 2
../_images/tutorials_tutorial_56_23.svg
Set: 3
../_images/tutorials_tutorial_56_25.svg
Set: 4
../_images/tutorials_tutorial_56_27.svg
Set: 5
../_images/tutorials_tutorial_56_29.svg
Set: 6
../_images/tutorials_tutorial_56_31.svg
Set: 7
../_images/tutorials_tutorial_56_33.svg

Top 8 examples with dark issue in the dataset.
../_images/tutorials_tutorial_56_35.svg

Top 8 examples with blurry issue in the dataset.
../_images/tutorials_tutorial_56_37.svg
Found 5 examples with light issue in the dataset.
../_images/tutorials_tutorial_56_39.svg
Found 4 examples with low_information issue in the dataset.
../_images/tutorials_tutorial_56_41.svg
Found 1 example with odd_aspect_ratio issue in the dataset.
../_images/tutorials_tutorial_56_43.svg

You may want to exclude issues from your report that are prevalent in, say, more than 50% of the dataset and are not real issues but simply a property of the data. For example, dark images in an astronomy dataset may not be an issue. You can use the max_prevalence parameter of report() to exclude such issues. In this example, all issues present in more than 3% of the dataset are excluded.

[23]:
imagelab.report(max_prevalence=0.03)
Removing grayscale from potential issues in the dataset as it exceeds max_prevalence=0.03
Removing near_duplicates from potential issues in the dataset as it exceeds max_prevalence=0.03
Removing exact_duplicates from potential issues in the dataset as it exceeds max_prevalence=0.03
Issues found in order of severity in the dataset

|    | issue_type       |   num_images |
|---:|:-----------------|-------------:|
|  3 | dark             |           13 |
|  4 | blurry           |           10 |
|  5 | light            |            5 |
|  6 | low_information  |            4 |
|  7 | odd_aspect_ratio |            1 |


Top 4 examples with dark issue in the dataset.
../_images/tutorials_tutorial_58_1.svg

Top 4 examples with blurry issue in the dataset.
../_images/tutorials_tutorial_58_3.svg

Top 4 examples with light issue in the dataset.
../_images/tutorials_tutorial_58_5.svg

Top 4 examples with low_information issue in the dataset.
../_images/tutorials_tutorial_58_7.svg
Found 1 example with odd_aspect_ratio issue in the dataset.
../_images/tutorials_tutorial_58_9.svg
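
The exclusions above follow from simple arithmetic: an issue is left out when its share of the dataset exceeds max_prevalence. The sketch below reproduces that filter from the counts in the report and the dataset size of 595 images; it is a plain-Python approximation of the behavior seen in the output, not CleanVision's actual implementation.

```python
# Counts from the report above; the dataset contains 595 images.
num_images_total = 595
issue_counts = {
    "grayscale": 20,
    "near_duplicates": 20,
    "exact_duplicates": 19,
    "dark": 13,
    "blurry": 10,
    "light": 5,
    "low_information": 4,
    "odd_aspect_ratio": 1,
}
max_prevalence = 0.03

excluded = []
reported = []
for issue, count in issue_counts.items():
    prevalence = count / num_images_total
    # Issues affecting more than max_prevalence of the dataset are left out.
    if prevalence > max_prevalence:
        excluded.append(issue)
    else:
        reported.append(issue)

print("excluded:", excluded)
print("reported:", reported)
```

With 595 images, 3% corresponds to 17.85 images, so the three issue types with 19 or 20 flagged images are excluded and dark, with 13, is still reported.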

8. Visualize specific issues#

Imagelab provides imagelab.visualize(), which you can use to see examples of specific issues in your dataset.

num_images and cell_size are optional arguments that control the number of examples shown for each issue type and the size of each image in the grid, respectively.

[24]:
issue_types = ["grayscale"]
imagelab.visualize(issue_types=issue_types, num_images=8, cell_size=(3, 3))

Top 8 examples with grayscale issue in the dataset.
../_images/tutorials_tutorial_61_1.svg

Advanced: Create your own issue type#

You can also create a custom issue type by extending the base class IssueManager. CleanVision can then detect your custom issue along with other pre-defined issues in any image dataset! Here’s an example of a custom issue manager, which can also be found in the examples/ folder of the source code.

[25]:
from typing import Any, Dict, List, Optional

import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm

from cleanvision.dataset.base_dataset import Dataset
from cleanvision.issue_managers import register_issue_manager
from cleanvision.utils.base_issue_manager import IssueManager
from cleanvision.utils.utils import get_is_issue_colname, get_score_colname

ISSUE_NAME = "custom"


@register_issue_manager(ISSUE_NAME)
class CustomIssueManager(IssueManager):
    """
    Example class showing how you can self-define a custom type of issue that
    CleanVision can simultaneously check your data for alongside its built-in issue types.
    """

    issue_name: str = ISSUE_NAME
    visualization: str = "individual_images"

    def __init__(self) -> None:
        super().__init__()
        self.params = self.get_default_params()

    def get_default_params(self) -> Dict[str, Any]:
        return {"threshold": 0.4}

    def update_params(self, params: Dict[str, Any]) -> None:
        self.params = self.get_default_params()
        non_none_params = {k: v for k, v in params.items() if v is not None}
        self.params = {**self.params, **non_none_params}

    @staticmethod
    def calculate_mean_pixel_value(image: Image.Image) -> float:
        gray_image = image.convert("L")
        return np.mean(np.array(gray_image))

    def get_scores(self, raw_scores: List[float]) -> "np.ndarray[Any, Any]":
        scores = np.array(raw_scores)
        return scores / 255.0

    def mark_issue(self, scores: pd.Series, threshold: float) -> pd.Series:
        return scores < threshold

    def update_summary(self, summary_dict: Dict[str, Any]) -> None:
        self.summary = pd.DataFrame({"issue_type": [self.issue_name]})
        for column_name, value in summary_dict.items():
            self.summary[column_name] = [value]

    def find_issues(
        self,
        *,
        params: Optional[Dict[str, Any]] = None,
        dataset: Optional[Dataset] = None,
        imagelab_info: Optional[Dict[str, Any]] = None,
        **kwargs: Any,
    ) -> None:
        super().find_issues(**kwargs)
        assert params is not None
        assert imagelab_info is not None
        assert dataset is not None

        self.update_params(params)

        raw_scores = []
        for idx in tqdm(dataset.index):
            image = dataset[idx]
            raw_scores.append(self.calculate_mean_pixel_value(image))

        score_colname = get_score_colname(self.issue_name)
        is_issue_colname = get_is_issue_colname(self.issue_name)

        scores = pd.DataFrame(index=dataset.index)
        scores[score_colname] = self.get_scores(raw_scores)

        is_issue = pd.DataFrame(index=dataset.index)
        is_issue[is_issue_colname] = self.mark_issue(
            scores[score_colname], self.params["threshold"]
        )

        self.issues = pd.DataFrame(index=dataset.index)
        self.issues = self.issues.join(scores)
        self.issues = self.issues.join(is_issue)

        self.info[self.issue_name] = {"PixelValue": raw_scores}
        summary_dict = self._compute_summary(
            self.issues[get_is_issue_colname(self.issue_name)]
        )

        self.update_summary(summary_dict)
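
Stripped of the bookkeeping, the scoring in this example is small: the raw score is the mean grayscale pixel value, the score is that mean divided by 255, and an image is flagged when the score is below the threshold of 0.4. The sketch below replays that logic on made-up pixel values, using plain Python lists as a stand-in for PIL images.

```python
from statistics import mean

THRESHOLD = 0.4  # default returned by get_default_params() in the class above

# Hypothetical grayscale pixel values (0-255) for three tiny "images".
images = {
    "mostly_dark.png": [10, 20, 30, 40],
    "mid_gray.png": [120, 130, 125, 135],
    "bright.png": [240, 250, 245, 255],
}

results = {}
for name, pixels in images.items():
    raw_score = mean(pixels)      # mirrors calculate_mean_pixel_value
    score = raw_score / 255.0     # mirrors get_scores
    is_issue = score < THRESHOLD  # mirrors mark_issue
    results[name] = (score, is_issue)
    print(f"{name}: score={score:.3f}, is_custom_issue={is_issue}")
```

Only mostly_dark.png falls below the threshold, so the custom issue in this example effectively flags images with low average brightness.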

9. Run CleanVision with a custom issue#

[26]:
imagelab = Imagelab(data_path=dataset_path)

issue_name = CustomIssueManager.issue_name


# To ensure your issue manager is registered, check list of possible issue types
# issue_name should be present in this list
imagelab.list_possible_issue_types()
Reading images from /home/docs/checkouts/readthedocs.org/user_builds/cleanvision/checkouts/v0.2.1/docs/source/tutorials/image_files
All possible issues checked by Imagelab:

image_property
exact_duplicates
near_duplicates
blurry
low_information
grayscale
custom
light
odd_aspect_ratio
duplicate
dark


[27]:
issue_types = {issue_name: {}}
imagelab.find_issues(issue_types)
imagelab.report()
Checking for custom images ...
100%|██████████| 595/595 [00:01<00:00, 311.03it/s]
Issue checks completed. To see a detailed report of issues found, use imagelab.report().
Issues found in order of severity in the dataset

|    | issue_type   |   num_images |
|---:|:-------------|-------------:|
|  0 | custom       |          204 |


Top 4 examples with custom issue in the dataset.
../_images/tutorials_tutorial_67_3.svg

Beyond the collection of image files demonstrated here, you can alternatively run CleanVision on Hugging Face datasets and torchvision datasets.