wildboar.datasets#

Submodules#

Package Contents#

Classes#

ArffBundle

bundle of .arff-files

NpyBundle

bundle of numpy binary files

Bundle

Base class for handling dataset bundles

Repository

A repository is a collection of bundles

JSONRepository

A repository is a collection of bundles

Functions#

set_cache_dir(cache_dir)

Change the global cache directory

load_synthetic_control([merge_train_test])

Load the Synthetic_Control dataset

load_two_lead_ecg([merge_train_test])

Load the TwoLeadECG dataset

load_gun_point([merge_train_test])

Load the GunPoint dataset

load_datasets([repository, cache_dir, ...])

Load all datasets as a generator

load_all_datasets([repository, cache_dir, ...])

load_dataset(name, *[, repository, dtype, contiguous, ...])

Load a dataset from a repository

list_datasets([repository, cache_dir, ...])

List the datasets in the repository

clear_cache([repository, cache_dir, keep_last_version])

Clear the cache by deleting cached datasets

get_repository(repository)

Get repository by name

install_repository(repository)

Install repository

get_bundles(repository)

Get all bundles in the repository

list_bundles(repository)

Get a list of all bundle names in the specified repository.

list_repositories()

List the key of all installed repositories

class wildboar.datasets.ArffBundle(*, key, version, name, description=None, class_index=-1, encoding='utf-8')#

Bases: Bundle

bundle of .arff-files

class wildboar.datasets.NpyBundle(*, key, version, name, description=None, class_index=-1)#

Bases: Bundle

bundle of numpy binary files

class wildboar.datasets.Bundle(*, key, version, name, description=None, class_index=-1)#

Base class for handling dataset bundles

name#

Human-readable name of the bundle

Type:

str

description#

Description of the bundle

Type:

str

class_index#

Index of the class label(s)

Type:

int or array-like

get_filename(version=None, tag=None, ext=None)#
list(archive)#

List all datasets in this bundle

Parameters:

archive (ZipFile) – The bundle file

Returns:

dataset_names – A sorted list of datasets in the bundle

Return type:

list

load(name, archive, *, dtype=None)#

Load a dataset from the bundle

Parameters:
  • name (str) – Name of the dataset

  • archive (ZipFile) – The zip-file bundle

  • dtype (object, optional) – Cast the data and label matrix to a specific type

Returns:

  • x (ndarray) – Data samples

  • y (ndarray) – Data labels

  • n_training_samples (int) – Number of samples that are for training. The value is <= x.shape[0]

class wildboar.datasets.Repository#

A repository is a collection of bundles

abstract property name#

Name of the repository

Returns:

the name of the repository

Return type:

str

abstract property version#

The repository version

Returns:

the version of the repository

Return type:

str

abstract property download_url#

The url template for downloading bundles

Returns:

the download url

Return type:

str

abstract property wildboar_requires#

The minimum required wildboar version

Returns:

the min version

Return type:

str

abstract get_bundles()#

Get all bundles

Returns:

a dictionary of key and bundle

Return type:

dict

get_bundle(key)#

Get a bundle with the specified key

Parameters:

key (str) – Key of the bundle

Returns:

bundle – A bundle or None

Return type:

Bundle, optional

load_dataset(bundle, dataset, *, cache_dir, version=None, tag=None, create_cache_dir=True, progress=True, dtype=None, force=False)#
list_datasets(bundle, *, cache_dir, version=None, tag=None, create_cache_dir=True, progress=True, force=False)#
clear_cache(cache_dir, keep_last_version=True)#
refresh()#

Refresh the repository

class wildboar.datasets.JSONRepository(url)#

Bases: Repository

A repository is a collection of bundles

property wildboar_requires#

The minimum required wildboar version

Returns:

the min version

Return type:

str

property name#

Name of the repository

Returns:

the name of the repository

Return type:

str

property version#

The repository version

Returns:

the version of the repository

Return type:

str

property download_url#

The url template for downloading bundles

Returns:

the download url

Return type:

str

get_bundles()#

Get all bundles

Returns:

a dictionary of key and bundle

Return type:

dict

refresh()#

Refresh the repository

wildboar.datasets.set_cache_dir(cache_dir)#

Change the global cache directory

Parameters:

cache_dir (str) – The cache directory root

wildboar.datasets.load_synthetic_control(merge_train_test=True)#

Load the Synthetic_Control dataset

See also

load_dataset

load a named dataset

wildboar.datasets.load_two_lead_ecg(merge_train_test=True)#

Load the TwoLeadECG dataset

See also

load_dataset

load a named dataset

wildboar.datasets.load_gun_point(merge_train_test=True)#

Load the GunPoint dataset

See also

load_dataset

load a named dataset

wildboar.datasets.load_datasets(repository='wildboar/ucr', *, cache_dir=None, create_cache_dir=True, progress=True, force=False, filter=None, **kwargs)#

Load all datasets as a generator

Parameters:
  • repository (str) – The repository string

  • progress (bool, optional) – If progress indicator is shown while downloading the repository.

  • cache_dir (str, optional) – The cache directory for downloaded dataset repositories.

  • create_cache_dir (bool, optional) – Create the cache directory if it does not exist.

  • force (bool, optional) – Force re-download of cached repository

  • filter (dict or callable, optional) –

    Filter the datasets

    • if callable, only yield those datasets for which the callable returns True. f(dataset, x, y) -> bool

    • if dict, filter based on the keys and values
      • "dataset": regex matching dataset name

      • "n_samples": comparison spec

      • "n_timestep": comparison spec

    Comparison spec

    str of two parts, comparison operator (<, <=, >, >= or =) and a number, e.g., "<100", "<= 200", or ">300"

  • kwargs (dict) – Optional arguments to load_dataset

Yields:
  • x (array-like) – Data samples

  • y (array-like) – Data labels

Examples

>>> from wildboar.datasets import load_datasets
>>> for dataset, (x, y) in load_datasets(repository='wildboar/ucr'):
...     print(dataset, x.shape, y.shape)

Print the names of datasets with more than 200 samples

>>> for dataset, (x, y) in load_datasets(repository='wildboar/ucr', filter={"n_samples": ">200"}):
...     print(dataset)
wildboar.datasets.load_all_datasets(repository='wildboar/ucr', *, cache_dir=None, create_cache_dir=True, progress=True, force=False, **kwargs)#
wildboar.datasets.load_dataset(name, *, repository='wildboar/ucr', dtype=None, contiguous=True, merge_train_test=True, cache_dir=None, create_cache_dir=True, progress=True, force=False)#

Load a dataset from a repository

Parameters:
  • name (str) – The name of the dataset to load.

  • repository (str, optional) – The data repository formatted as {repository}/{bundle}[:{version}][:{tag}]

  • dtype (dtype, optional) – The data type of the returned data

  • contiguous (bool, optional) – Ensure that the returned dataset is memory contiguous.

  • merge_train_test (bool, optional) – Merge the existing training and testing partitions.

  • progress (bool, optional) – Show a progress bar while downloading a bundle.

  • cache_dir (str, optional) – The directory where downloaded files are cached

  • create_cache_dir (bool, optional) – Create cache directory if missing (default=True)

  • force (bool, optional) –

    Force re-download of already cached bundle

    .. versionadded:: 1.0.4

Returns:

  • x (ndarray) – The data samples

  • y (ndarray) – The labels

  • x_train (ndarray, optional) – The training samples if merge_train_test=False

  • x_test (ndarray, optional) – The testing samples if merge_train_test=False

  • y_train (ndarray, optional) – The training labels if merge_train_test=False

  • y_test (ndarray, optional) – The testing labels if merge_train_test=False

Examples

Load a dataset from the default repository

>>> x, y = load_dataset("SyntheticControl")

or if original training and testing splits are to be preserved

>>> x_train, x_test, y_train, y_test = load_dataset("SyntheticControl", merge_train_test=False)

or for a specific version of the dataset

>>> x_train, x_test, y_train, y_test = load_dataset("Wafer", repository='wildboar/ucr-tiny:1.0')
wildboar.datasets.list_datasets(repository='wildboar/ucr', *, cache_dir=None, create_cache_dir=True, progress=True, force=False)#

List the datasets in the repository

Parameters:
  • repository (str or Bundle, optional) –

    The data repository

    • if str load a named bundle, format {repository}/{bundle}

  • progress (bool, optional) – Show a progress bar while downloading a bundle.

  • cache_dir (str, optional) – The directory where downloaded files are cached (default='wildboar_cache')

  • create_cache_dir (bool, optional) – Create cache directory if missing (default=True)

  • force (bool, optional) – Force re-download of cached bundle

Returns:

dataset – A set of dataset names

Return type:

set

wildboar.datasets.clear_cache(repository=None, *, cache_dir=None, keep_last_version=True)#

Clear the cache by deleting cached datasets

Parameters:
  • repository (str, optional) –

    The name of the repository to clear cache.

    • if None, clear cache of all repositories

  • cache_dir (str, optional) – The cache directory

  • keep_last_version (bool, optional) – If true, keep the latest version of each repository.

wildboar.datasets.get_repository(repository)#

Get repository by name

Parameters:

repository (str) – Repository name

Returns:

repository – A repository

Return type:

Repository

wildboar.datasets.install_repository(repository)#

Install repository

Parameters:

repository (str or Repository) – A repository

wildboar.datasets.get_bundles(repository)#

Get all bundles in the repository

Parameters:

repository (str) – Name of the repository

Returns:

A dict of key Bundle pairs

Return type:

dict

wildboar.datasets.list_bundles(repository)#

Get a list of all bundle names in the specified repository.

Parameters:

repository (str) – The name of the repository

Returns:

bundle_names – A list of the bundle names in the repository

Return type:

list

wildboar.datasets.list_repositories()#

List the key of all installed repositories