stable_datasets package

Subpackages

Submodules

stable_datasets.cassava module

class cassava[source]

Bases: object

Plant images classification.

The data consists of two folders: a training folder with 5 subfolders, one per class, each holding the images for that class, and a test folder containing the test images.

Participants train their models on the images in the training folder and provide a submission file, like the sample provided, that lists each image name exactly as it appears in the test folder together with the corresponding class prediction. Labels correspond to the disease categories: cmd, healthy, cgm, cbsd, cbb.

Please cite this paper if you use the dataset for your project: https://arxiv.org/pdf/1908.02900.pdf

classes = ['cbb', 'cmd', 'cbsd', 'cgm', 'healthy']
static download(path)[source]

Download the cassava dataset and store the result in the given path.

Parameters:

path (str) – the path where the downloaded files will be stored. If the directory does not exist, it is created.

static load(path=None)[source]
Parameters:

path (str, optional) – the path in which to look for the data, and to which the data is downloaded if not present. Defaults to $DATASET_PATH.

Returns:

  • train_images (array)

  • train_labels (array)

  • valid_images (array)

  • valid_labels (array)

  • test_images (array)

  • test_labels (array)
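The class list and the submission format described for this dataset can be sketched as follows. This is a minimal illustration, not part of the package: it assumes that integer predictions index into `classes` in the order listed, and the helper `write_submission`, the column names, and the image file names are all hypothetical.

```python
import csv
import io

# Class names copied from the `classes` attribute documented above.
CLASSES = ["cbb", "cmd", "cbsd", "cgm", "healthy"]


def write_submission(image_names, predicted_indices, out):
    """Write one (image name, class name) row per test image.

    Hypothetical helper: the real submission layout may differ.
    """
    if len(image_names) != len(predicted_indices):
        raise ValueError("need exactly one prediction per image")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["image_name", "label"])  # column names are an assumption
    for name, index in zip(image_names, predicted_indices):
        # Assumption: integer labels index into CLASSES in the listed order.
        writer.writerow([name, CLASSES[index]])


buffer = io.StringIO()
write_submission(["test-img-0.jpg", "test-img-1.jpg"], [4, 1], buffer)
submission_text = buffer.getvalue()
```

Writing to a file-like object keeps the sketch independent of any particular output path; pass an open file instead of `io.StringIO` to produce a file on disk.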

stable_datasets.utils module

class BaseDatasetBuilder(*args, split=None, processed_cache_dir=None, download_dir=None, **kwargs)[source]

Bases: GeneratorBasedBuilder

Base class for stable-datasets that enables direct dataset loading.

SOURCE: Mapping
bulk_download(urls: Iterable[str], dest_folder: str | Path, backend: str = 'filesystem', cache_dir: str = '~/.stable_datasets/') list[Path][source]

Download multiple files concurrently and return their local paths.

Parameters:
  • urls – Iterable of URL strings to download.

  • dest_folder – Destination folder for downloads.

  • backend – requests_cache backend (e.g. “filesystem”).

  • cache_dir – Cache directory for requests_cache.

Returns:

Local file paths in the same order as the input URLs.

Return type:

list[Path]
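The ordering guarantee described above (paths returned in the same order as the input URLs, even though downloads run concurrently) can be illustrated with a thread pool. This is a sketch under stated assumptions, not the implementation of `bulk_download`: `fetch_all`, `fake_fetch`, and the example URLs are hypothetical, and the stand-in fetcher performs no network access.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_all(urls, dest_folder, fetch_one, max_workers=4):
    """Run fetch_one(url, dest) for every URL concurrently.

    Hypothetical helper: results come back in input order.
    """
    dest = Path(dest_folder)
    urls = list(urls)  # accept any iterable, as the documented signature does
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs,
        # regardless of which task finishes first.
        return list(pool.map(lambda url: fetch_one(url, dest), urls))


def fake_fetch(url, dest):
    # Stand-in for a real download: only computes the target path.
    return dest / url.rsplit("/", 1)[-1]


paths = fetch_all(
    ["https://example.com/a.zip", "https://example.com/b.zip"],
    "downloads",
    fake_fetch,
)
```

Because `Executor.map` preserves input order, callers can pair each returned path with its URL by position, for example with `zip(urls, paths)`.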

download(url: str, dest_folder: str | Path | None = None, backend: str = 'filesystem', cache_dir: str = '~/.stable_datasets/', progress_bar: bool = True, _progress_dict=None, _task_id=None) Path[source]

Download a single file from a URL with caching and optional progress tracking.

Parameters:
  • url – URL to download from.

  • dest_folder – Destination folder for the downloaded file. If None, defaults to ~/.stable_datasets/downloads/.

  • backend – requests_cache backend (e.g. “filesystem”).

  • cache_dir – Cache directory for requests_cache.

  • progress_bar – Whether to show a tqdm progress bar (for standalone use).

  • _progress_dict – Internal shared dict for bulk_download progress reporting.

  • _task_id – Internal task ID key for bulk_download progress reporting.

Returns:

Local path to the downloaded file.

Return type:

Path

Raises:

Exception – Any exception from network/file operations is logged and re-raised.
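The `dest_folder` default described above can be sketched as a small path-resolution step. This is an illustration only: `resolve_destination` is a hypothetical helper, and naming the local file after the last segment of the URL path is an assumption, since the documentation does not say how the file name is chosen.

```python
from pathlib import Path
from urllib.parse import urlparse

# Default taken from the parameter description above.
DEFAULT_DOWNLOAD_DIR = Path("~/.stable_datasets/downloads/")


def resolve_destination(url, dest_folder=None):
    """Return the local path a download of `url` would be written to."""
    folder = Path(dest_folder) if dest_folder is not None else DEFAULT_DOWNLOAD_DIR
    folder = folder.expanduser()  # turn "~" into the user's home directory
    # Assumption: the file is named after the last segment of the URL path.
    filename = Path(urlparse(url).path).name
    if not filename:
        raise ValueError(f"cannot derive a file name from {url!r}")
    return folder / filename


explicit = resolve_destination("https://example.com/data/train.zip", "/tmp/cache")
default = resolve_destination("https://example.com/data/train.zip")
```

Resolving the path before any network call makes it cheap to check whether a file is already present locally.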

load_from_tsfile_to_dataframe(full_file_path_and_name, return_separate_X_and_y=True, replace_missing_vals_with='NaN')[source]

Load data from a .ts file into a Pandas DataFrame. Credit to https://github.com/sktime/sktime/blob/7d572796ec519c35d30f482f2020c3e0256dd451/sktime/datasets/_data_io.py#L379

Parameters:
  • full_file_path_and_name (str) – The full pathname of the .ts file to read.

  • return_separate_X_and_y – true if X and Y values should be returned as separate Data Frames (X) and a numpy array (y), false otherwise. This is only relevant for data that

  • replace_missing_vals_with (str) – The value that missing values in the text file should be replaced with prior to parsing.

Returns:

  • tuple of DataFrame and ndarray (default) – If return_separate_X_and_y, a tuple containing a DataFrame and a numpy array holding the relevant time series and the corresponding class values.

  • DataFrame – If not return_separate_X_and_y, a single DataFrame containing all time series and (if relevant) a column “class_vals” holding the associated class values.
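The role of `replace_missing_vals_with` can be illustrated on a single data line. This is a heavily simplified sketch, not the loader itself: `parse_ts_line` is a hypothetical helper, and it assumes a univariate line with comma-separated values, "?" as the missing-value marker, and a class label after the final colon. The real .ts format also carries header metadata and multiple dimensions, which this ignores.

```python
import math


def parse_ts_line(line, replace_missing_vals_with="NaN"):
    """Parse one simplified .ts data line into (values, class_label)."""
    # Assumption: "?" marks a missing value; substitute before parsing,
    # as the parameter description above says.
    line = line.strip().replace("?", replace_missing_vals_with)
    # Assumption: the class label follows the last colon on the line.
    series_part, class_label = line.rsplit(":", 1)
    values = [float(v) for v in series_part.split(",")]
    return values, class_label


values, label = parse_ts_line("1.0,?,3.5:healthy")
# float("NaN") parses to a float NaN, so missing entries can be counted.
missing_count = sum(math.isnan(v) for v in values)
```

With the default replacement, missing entries become floating-point NaN values, so the series keeps its length and downstream code can detect the gaps with `math.isnan`.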

Module contents