Imagelab

Imagelab is the core class in CleanVision for finding all types of issues in an image dataset. The methods in this module should suffice for most use-cases, but advanced users can get extra flexibility via the code in other CleanVision modules.

Classes:

Imagelab([data_path, filepaths, hf_dataset, ...])

A single class to find all types of issues in image datasets.

class cleanvision.imagelab.Imagelab(data_path=None, filepaths=None, hf_dataset=None, image_key=None, torchvision_dataset=None, storage_opts={})

Bases: object

A single class to find all types of issues in image datasets. Imagelab detects issues in any image dataset, so it is useful in most computer vision tasks, including supervised and unsupervised training. It supports several dataset formats: a local folder containing images, a list of image filepaths, a Hugging Face dataset, and a torchvision dataset. Specify only one of these arguments: data_path, filepaths, (hf_dataset, image_key), or torchvision_dataset.

Parameters:
  • data_path (str) – Path to image files. Imagelab will recursively retrieve all image files from the specified path.

  • filepaths (List[str], optional) – List of image paths on which the issue checks will be run.

  • hf_dataset (datasets.Dataset) – Hugging Face dataset with images in PIL format accessible via some key in hf_dataset.features.

  • image_key (str) – Key used to access images within the Hugging Face dataset.features object. For many datasets, this key is just called “image”. This argument must be specified if you provide a Hugging Face dataset; for other types of dataset this argument has no effect.

  • torchvision_dataset (torchvision.datasets.vision.VisionDataset) – torchvision dataset where each individual example is a tuple containing exactly one image in PIL format.
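The one-source rule above can be checked before constructing Imagelab. The following is a minimal sketch, not part of CleanVision: pick_source is a hypothetical helper that returns the name of the single dataset source supplied and raises ValueError otherwise. It assumes, as the parameter list states, that image_key is required only together with hf_dataset.

```python
from typing import Any, Optional


def pick_source(
    data_path: Optional[str] = None,
    filepaths: Optional[list] = None,
    hf_dataset: Any = None,
    image_key: Optional[str] = None,
    torchvision_dataset: Any = None,
) -> str:
    """Return the name of the one dataset source that was supplied.

    Hypothetical pre-check (not a CleanVision function) mirroring the rule
    that only one of the four sources may be given.
    """
    sources = {
        "data_path": data_path,
        "filepaths": filepaths,
        "hf_dataset": hf_dataset,
        "torchvision_dataset": torchvision_dataset,
    }
    given = [name for name, value in sources.items() if value is not None]
    if len(given) != 1:
        raise ValueError(f"Specify exactly one dataset source, got: {given}")
    # image_key only matters for Hugging Face datasets.
    if given[0] == "hf_dataset" and image_key is None:
        raise ValueError("image_key must be specified with hf_dataset")
    return given[0]
```

Calling pick_source with the same keyword arguments you intend to pass to Imagelab gives an early, readable error when two sources are mixed by mistake.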

Variables:
  • issues (pd.DataFrame) –

    Dataframe where each row corresponds to an image and columns specify which issues were detected in this image. It has two types of columns for each issue type:

    1. <issue_type>_score - This column contains a quality score for each image for a particular type of issue. Scores are between 0 and 1; lower values indicate images exhibiting more severe instances of this issue.

    2. is_<issue_type>_issue - This column indicates whether or not the issue_type is detected in each image (a binary decision rather than a numeric score).

  • issue_summary (pd.DataFrame) – Dataframe where each row corresponds to a type of issue and columns summarize the overall prevalence of this issue in the dataset. Specifically, it shows the number of images detected with the issue.

  • info (Dict) – Nested dictionary that contains statistics and other useful information about the dataset. Also contains additional information saved while checking for issues in the dataset.
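To make the two column types of the issues dataframe concrete, here is a small sketch that ranks the flagged images for one issue type. The DataFrame below is hand-made toy data that only mimics the documented column layout, and worst_images is a hypothetical helper, not a CleanVision method. In practice you would pass imagelab.issues instead.

```python
import pandas as pd


def worst_images(issues: pd.DataFrame, issue_type: str, n: int = 5) -> pd.DataFrame:
    """Return up to n images flagged for issue_type, most severe first."""
    flagged = issues[issues[f"is_{issue_type}_issue"]]
    # Lower scores indicate more severe instances, so sort ascending.
    return flagged.sort_values(f"{issue_type}_score").head(n)


# Toy stand-in for imagelab.issues (values are made up for illustration).
toy_issues = pd.DataFrame(
    {
        "dark_score": [0.05, 0.90, 0.20],
        "is_dark_issue": [True, False, True],
    },
    index=["a.png", "b.png", "c.png"],
)
worst = worst_images(toy_issues, "dark")
```

The boolean column does the filtering and the score column does the ranking, which is why each issue type has both.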

Raises:

ValueError – If no images are found in the specified paths, if both data_path and filepaths are given, or if neither of them is specified.

Examples

Basic usage of Imagelab class

from cleanvision import Imagelab
imagelab = Imagelab(data_path="FOLDER_WITH_IMAGES/")
imagelab.find_issues()
imagelab.report()

Methods:

list_default_issue_types()

Returns a list of the issue types that are run by default in Imagelab.find_issues()

list_possible_issue_types()

Returns a list of all the possible issue types that can be run in Imagelab.find_issues(). This list will also include custom issue types if properly added.

find_issues([issue_types, n_jobs, verbose])

Finds issues in the dataset.

report([issue_types, max_prevalence, ...])

Prints summary of the issues found in your dataset.

visualize([image_files, indices, ...])

Show specific images.

get_stats()

Returns dict of statistics computed from images when auditing the data such as: brightness, color space, aspect ratio, etc.

save(path[, force])

Saves this Imagelab instance, issues and issue_summary into a folder at the given path.

load(path[, data_path])

Loads Imagelab from given path.

static list_default_issue_types()

Returns a list of the issue types that are run by default in Imagelab.find_issues()

Return type:

List[str]

static list_possible_issue_types()

Returns a list of all the possible issue types that can be run in Imagelab.find_issues(). This list will also include custom issue types if properly added.

Return type:

List[str]

find_issues(issue_types=None, n_jobs=None, verbose=True)

Finds issues in the dataset. If issue_types is not provided, the dataset is checked for a default set of issue types. To see the default set, call Imagelab.list_default_issue_types().

Parameters:
  • issue_types (Dict[str, Any], optional) – Dict with issue types to check as keys. Each value is a dict containing the hyperparameters for that issue type.

  • n_jobs (int, default=None) – Number of processing threads used by multiprocessing. If None (the default), this is set to the number of cores on your CPU (physical cores if you have the psutil package installed, otherwise logical cores). Set this to 1 to disable parallel processing (if it's causing issues). Windows users may see a speed-up with n_jobs=1.

  • verbose (bool, default=True) – If True, prints helpful information while checking for issues.
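As a rough illustration of the n_jobs advice, the sketch below picks a value based on the operating system. pick_n_jobs is a hypothetical helper, not part of CleanVision, and the Windows choice simply follows the note above that n_jobs=1 may be faster there; whether it actually helps depends on your machine.

```python
import platform
from typing import Optional


def pick_n_jobs(system: Optional[str] = None) -> Optional[int]:
    """Return 1 on Windows, otherwise None so Imagelab picks the core count.

    Hypothetical helper; `system` defaults to the current platform name
    and can be overridden, for example in tests.
    """
    system = system or platform.system()
    return 1 if system == "Windows" else None
```

The result can be passed straight through, for example imagelab.find_issues(n_jobs=pick_n_jobs()).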

Return type:

None

Examples

To check for all default issue types use

imagelab.find_issues()

To check for specific issue types with default settings

issue_types = {
    "dark": {},
    "blurry": {}
}
imagelab.find_issues(issue_types)

To check for issue types with custom hyperparameters (different issue types can have different hyperparameters)

issue_types = {
    "dark": {"threshold": 0.1},
    "blurry": {}
}
imagelab.find_issues(issue_types)
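
When the set of checks is assembled programmatically, the issue_types dict can be built from a list of names plus a few overrides. This is a minimal sketch under that assumption: build_issue_types is a hypothetical helper, not a CleanVision function, and it does not verify that the names or hyperparameters are ones CleanVision accepts.

```python
from typing import Any, Dict, Iterable, Optional


def build_issue_types(
    names: Iterable[str],
    overrides: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Map each issue type name to its hyperparameter dict.

    Names without an override get an empty dict, matching the
    "default settings" example above.
    """
    names = list(names)
    overrides = overrides or {}
    unknown = set(overrides) - set(names)
    if unknown:
        raise ValueError(f"Overrides given for unselected issue types: {sorted(unknown)}")
    # Copy each override so later edits do not mutate the caller's dicts.
    return {name: dict(overrides.get(name, {})) for name in names}


issue_types = build_issue_types(["dark", "blurry"], {"dark": {"threshold": 0.1}})
```

The resulting dict has the same shape as the hand-written examples and can be passed as imagelab.find_issues(issue_types).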

report(issue_types=None, max_prevalence=None, num_images=None, verbosity=1, print_summary=True, show_id=False)

Prints a summary of the issues found in your dataset. By default, this method shows the images representing the most severe instances of each issue type.

Parameters:
  • issue_types (List[str], optional) – List of issue types to consider in report. This must be a subset of the issue types specified in Imagelab.find_issues().

  • max_prevalence (float, default=0.5) – Value between 0 and 1. Issue types that are detected in more than max_prevalence fraction of the images in the dataset will be omitted from the report. You are presumably already aware of these in your dataset.

  • num_images (int, default=4) – Maximum number of images to show for each issue type reported. These are examples of the most severe instances of the issue in your dataset.

  • verbosity (int, {1, 2, 3, 4}) – Increasing verbosity increases the detail of the report. Set this to 1 to report less information, or to 4 to report the most information.

  • print_summary (bool, default=True) – If True, prints the summary of issues found in the dataset.

  • show_id (bool, default=False) – If True, prints the dataset ID of each image shown in the report.
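The max_prevalence rule can be illustrated with a few lines of plain Python. This is only a sketch of the rule as described above, not CleanVision's implementation: reportable_issue_types is a hypothetical helper, and the counts in the example are made up.

```python
from typing import Dict, List


def reportable_issue_types(
    num_flagged: Dict[str, int], num_images: int, max_prevalence: float = 0.5
) -> List[str]:
    """Keep issue types flagged in at most max_prevalence of the images."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    # Types detected in MORE than max_prevalence of the images are omitted.
    return [
        issue
        for issue, count in num_flagged.items()
        if count / num_images <= max_prevalence
    ]


# Made-up counts for a dataset of 100 images.
kept = reportable_issue_types(
    {"dark": 10, "near_duplicates": 80, "blurry": 50}, num_images=100
)
```

Here near_duplicates is dropped because it affects 80% of the images, while blurry sits exactly at the 0.5 limit and is kept.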

Return type:

None

Examples

Default usage

imagelab.report()

Report specific issue types

issue_types = ["dark", "near_duplicates"]
imagelab.report(issue_types=issue_types)

visualize(image_files=None, indices=None, issue_types=None, num_images=4, cell_size=(2, 2), show_id=False)

Show specific images.

Can be used to visualize:

1. Particular images with paths given in image_files.

2. Images representing the most severe instances of the given issue_types detected in the dataset.

3. Random images from the dataset, if no image_files or issue_types are given.

If image_files is given, this overrides the argument issue_types.

Parameters:
  • image_files (List[str], optional) – List of filepaths for images to visualize.

  • indices (List[str|int], optional) – List of indices of images in the dataset to visualize. If the dataset is a local data_path, the indices are filepaths, which are also the index of the imagelab.issues dataframe. If the dataset is a Hugging Face or torchvision dataset, indices are of type int and correspond to the indices in the dataset object.

  • issue_types (List[str], optional) – List of issue types to visualize. For each type of issue, will show a few images representing the top-most severe instances of this issue in the dataset.

  • num_images (int, optional) – Number of images to visualize from the dataset. These images are randomly selected if issue_types is None. If issue_types is given, then this is the number of images for each issue type to visualize (images representing top-most severe instances of this issue will be shown). If image_files is given, this argument is ignored.

  • cell_size (Tuple[int, int], optional) – Dimensions controlling the size of each image in the depicted image grid.

Return type:

None

Examples

To visualize random images from the dataset

imagelab.visualize()
imagelab.visualize(num_images=8)

To visualize specific images from the dataset

image_files = ["./dataset/cat.png", "./dataset/dog.png", "./dataset/mouse.png"]
imagelab.visualize(image_files=image_files)

To visualize top examples of specific issue types from the dataset

issue_types = ["dark", "odd_aspect_ratio"]
imagelab.visualize(issue_types=issue_types)

get_stats()

Returns dict of statistics computed from images when auditing the data such as: brightness, color space, aspect ratio, etc. If statistics have not been computed yet, then returns None.

Return type:

Any

save(path, force=False)

Saves this Imagelab instance, issues and issue_summary into a folder at the given path. Your saved Imagelab should be loaded from the same version of the CleanVision package to avoid inconsistencies. This method does not save your image files.

Parameters:
  • path (str) – Path to folder where this Imagelab instance will be saved on disk.

  • force (bool, default=False) – If set to True, any existing files at path will be overwritten.

Return type:

None

classmethod load(path, data_path=None)

Loads Imagelab from given path.

Parameters:
  • path (str) – Path to the saved Imagelab folder previously specified in Imagelab.save() (not the individual pickle file).

  • data_path (str) – Path to the image dataset previously used in Imagelab, if your data exists locally as images in a folder. If data_path has changed, the code will break because Imagelab functionality depends on it. You should be using the same version of the CleanVision package that was used when saving Imagelab.

Returns:

Returns a saved instance of Imagelab

Return type:

Imagelab
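
A save and load round trip might look like the sketch below. save_and_reload is a hypothetical convenience wrapper, not part of CleanVision; it only relies on the save(path, force) and load(path, data_path) signatures documented above, and it assumes overwriting any existing files at path is acceptable.

```python
from typing import Any, Optional


def save_and_reload(imagelab: Any, path: str, data_path: Optional[str] = None) -> Any:
    """Save an Imagelab to `path`, then load it back from the same folder.

    force=True overwrites existing files at `path`, so use this only
    where that is acceptable. Image files themselves are not saved.
    """
    imagelab.save(path, force=True)
    # load() is a classmethod, so call it on the class, not the instance.
    return type(imagelab).load(path, data_path)
```

For a dataset given as a local folder, pass the same data_path that was used when the Imagelab was created, for example save_and_reload(imagelab, "./results", data_path="FOLDER_WITH_IMAGES/").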