wildboar.datasets#
Submodules#
Package Contents#
Classes#
- ArffBundle – Bundle of .arff-files
- NpyBundle – Bundle of numpy binary files
- Bundle – Base class for handling dataset bundles
- Repository – A repository is a collection of bundles
- JSONRepository – A repository is a collection of bundles
Functions#
- set_cache_dir – Change the global cache directory
- load_synthetic_control – Load the Synthetic_Control dataset
- load_two_lead_ecg – Load the TwoLeadECG dataset
- load_gun_point – Load the GunPoint dataset
- load_datasets – Load all datasets as a generator
- load_all_datasets
- load_dataset – Load a dataset from a repository
- list_datasets – List the datasets in the repository
- clear_cache – Clear the cache by deleting cached datasets
- get_repository – Get repository by name
- install_repository – Install repository
- get_bundles – Get all bundles in the repository
- list_bundles – Get a list of all bundle names in the specified repository
- list_repositories – List the keys of all installed repositories
- class wildboar.datasets.ArffBundle(*, key, version, name, description=None, class_index=-1, encoding='utf-8')#
Bases: Bundle

Bundle of .arff-files
- class wildboar.datasets.NpyBundle(*, key, version, name, description=None, class_index=-1)#
Bases: Bundle

Bundle of numpy binary files
- class wildboar.datasets.Bundle(*, key, version, name, description=None, class_index=-1)#
Base class for handling dataset bundles
- name#
Human-readable name of the bundle
- Type:
str
- description#
Description of the bundle
- Type:
str
- class_index#
Index of the class label(s)
- Type:
int or array-like
- get_filename(version=None, tag=None, ext=None)#
- list(archive)#
List all datasets in this bundle
- Parameters:
archive (ZipFile) – The bundle file
- Returns:
dataset_names – A sorted list of datasets in the bundle
- Return type:
list
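list operates on an already open ZipFile. As an illustration of the idea (a sketch, not wildboar's actual implementation), collecting the sorted dataset names from a zip archive might look like:

```python
import io
import os
import zipfile

def list_dataset_names(archive: zipfile.ZipFile) -> list:
    """Collect the sorted stem names of all files in a zip archive."""
    names = set()
    for info in archive.infolist():
        if not info.is_dir():
            # Strip directories and the file extension, keeping the dataset name.
            stem, _ = os.path.splitext(os.path.basename(info.filename))
            names.add(stem)
    return sorted(names)

# Build a small in-memory bundle to demonstrate.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("TwoLeadECG.arff", "...")
    zf.writestr("GunPoint.arff", "...")

with zipfile.ZipFile(buf) as zf:
    print(list_dataset_names(zf))  # ['GunPoint', 'TwoLeadECG']
```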
- load(name, archive, *, dtype=None)#
Load a dataset from the bundle
- Parameters:
name (str) – Name of the dataset
archive (ZipFile) – The zip-file bundle
dtype (object, optional) – Cast the data and label matrix to a specific type
- Returns:
x (ndarray) – Data samples
y (ndarray) – Data labels
n_training_samples (int) – Number of samples that are for training. The value is <= x.shape[0]
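Because the training samples come first in the merged arrays, a caller can recover the original partition from n_training_samples. A minimal sketch of that bookkeeping (plain slicing, which applies equally to ndarrays):

```python
def split_merged(x, y, n_training_samples):
    """Recover the original train/test partition from merged data,
    assuming the first n_training_samples entries are the training set."""
    return (x[:n_training_samples], x[n_training_samples:],
            y[:n_training_samples], y[n_training_samples:])

x = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]  # 5 samples
y = [0, 0, 1, 1, 1]
x_train, x_test, y_train, y_test = split_merged(x, y, 3)
print(len(x_train), len(x_test))  # 3 2
```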
- class wildboar.datasets.Repository#
A repository is a collection of bundles
- abstract property name#
Name of the repository
- Returns:
the name of the repository
- Return type:
str
- abstract property version#
The repository version
- Returns:
the version of the repository
- Return type:
str
- abstract property download_url#
The URL template for downloading bundles
- Returns:
the download URL
- Return type:
str
- abstract property wildboar_requires#
The minimum required wildboar version
- Returns:
the minimum version
- Return type:
str
- abstract get_bundles()#
Get all bundles
- Returns:
a dictionary of keys and bundles
- Return type:
dict
- get_bundle(key)#
Get a bundle with the specified key
- Parameters:
key (str) – Key of the bundle
- Returns:
bundle – A bundle or None
- Return type:
Bundle, optional
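Since get_bundle returns None for an unknown key, callers should guard the result. Conceptually (a sketch, not the actual method body) it behaves like a lookup in the mapping returned by get_bundles:

```python
def get_bundle(bundles: dict, key: str):
    """Return the bundle registered under key, or None when it is unknown."""
    return bundles.get(key)

# Hypothetical bundle mapping, for illustration only.
bundles = {"ucr": "ucr-bundle", "ucr-tiny": "ucr-tiny-bundle"}
bundle = get_bundle(bundles, "outlier")
if bundle is None:
    print("no such bundle")  # prints "no such bundle"
```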
- load_dataset(bundle, dataset, *, cache_dir, version=None, tag=None, create_cache_dir=True, progress=True, dtype=None, force=False)#
- list_datasets(bundle, *, cache_dir, version=None, tag=None, create_cache_dir=True, progress=True, force=False)#
- clear_cache(cache_dir, keep_last_version=True)#
- refresh()#
Refresh the repository
- class wildboar.datasets.JSONRepository(url)#
Bases: Repository

A repository is a collection of bundles
- property wildboar_requires#
The minimum required wildboar version
- Returns:
the minimum version
- Return type:
str
- property name#
Name of the repository
- Returns:
the name of the repository
- Return type:
str
- property version#
The repository version
- Returns:
the version of the repository
- Return type:
str
- property download_url#
The URL template for downloading bundles
- Returns:
the download URL
- Return type:
str
- get_bundles()#
Get all bundles
- Returns:
a dictionary of keys and bundles
- Return type:
dict
- refresh()#
Refresh the repository
- wildboar.datasets.set_cache_dir(cache_dir)#
Change the global cache directory
- Parameters:
cache_dir (str) – The cache directory root
- wildboar.datasets.load_synthetic_control(merge_train_test=True)#
Load the Synthetic_Control dataset
See also
load_dataset : Load a named dataset
- wildboar.datasets.load_two_lead_ecg(merge_train_test=True)#
Load the TwoLeadECG dataset
See also
load_dataset : Load a named dataset
- wildboar.datasets.load_gun_point(merge_train_test=True)#
Load the GunPoint dataset
See also
load_dataset : Load a named dataset
- wildboar.datasets.load_datasets(repository='wildboar/ucr', *, cache_dir=None, create_cache_dir=True, progress=True, force=False, filter=None, **kwargs)#
Load all datasets as a generator
- Parameters:
repository (str) – The repository string
progress (bool, optional) – If progress indicator is shown while downloading the repository.
cache_dir (str, optional) – The cache directory for downloaded dataset repositories.
create_cache_dir (bool, optional) – Create the cache directory if it does not exist.
force (bool, optional) – Force re-download of cached repository
filter (dict or callable, optional) – Filter the datasets.
If callable, only yield those datasets for which the callable returns True: f(dataset, x, y) -> bool.
If dict, filter on the keys and values, where the keys are:
"dataset": regex matching the dataset name
"n_samples": comparison spec
"n_timestep": comparison spec
A comparison spec is a str of two parts: a comparison operator (<, <=, >, >= or =) and a number, e.g., "<100", "<=200" or ">300".
kwargs (dict) – Optional arguments to load_dataset
- Yields:
x (array-like) – Data samples
y (array-like) – Data labels
Examples
>>> from wildboar.datasets import load_datasets
>>> for dataset, (x, y) in load_datasets(repository='wildboar/ucr'):
...     print(dataset, x.shape, y.shape)
Print the names of datasets with more than 200 samples
>>> for dataset, (x, y) in load_datasets(repository='wildboar/ucr', filter={"n_samples": ">200"}):
...     print(dataset)
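The comparison spec above pairs an operator with a threshold. A sketch of how such a spec string might be parsed into a predicate (not wildboar's actual implementation):

```python
import operator
import re

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
        ">=": operator.ge, "=": operator.eq}

def parse_comparison_spec(spec):
    """Turn a spec such as "<100" or ">= 200" into a predicate on a number."""
    match = re.fullmatch(r"\s*(<=|>=|<|>|=)\s*(\d+)\s*", spec)
    if match is None:
        raise ValueError(f"invalid comparison spec: {spec!r}")
    op, threshold = _OPS[match.group(1)], int(match.group(2))
    return lambda value: op(value, threshold)

print(parse_comparison_spec("<100")(50))     # True
print(parse_comparison_spec(">= 200")(150))  # False
```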
- wildboar.datasets.load_all_datasets(repository='wildboar/ucr', *, cache_dir=None, create_cache_dir=True, progress=True, force=False, **kwargs)#
- wildboar.datasets.load_dataset(name, *, repository='wildboar/ucr', dtype=None, contiguous=True, merge_train_test=True, cache_dir=None, create_cache_dir=True, progress=True, force=False)#
Load a dataset from a repository
- Parameters:
name (str) – The name of the dataset to load.
repository (str, optional) – The data repository formatted as {repository}/{bundle}[:{version}][:{tag}]
dtype (dtype, optional) – The data type of the returned data
contiguous (bool, optional) – Ensure that the returned dataset is memory contiguous.
merge_train_test (bool, optional) – Merge the existing training and testing partitions.
progress (bool, optional) – Show a progress bar while downloading a bundle.
cache_dir (str, optional) – The directory where downloaded files are cached
create_cache_dir (bool, optional) – Create cache directory if missing (default=True)
force (bool, optional) – Force re-download of an already cached bundle.
.. versionadded:: 1.0.4
- Returns:
x (ndarray) – The data samples
y (ndarray) – The labels
x_train (ndarray, optional) – The training samples if merge_train_test=False
x_test (ndarray, optional) – The testing samples if merge_train_test=False
y_train (ndarray, optional) – The training labels if merge_train_test=False
y_test (ndarray, optional) – The testing labels if merge_train_test=False
Examples
Load a dataset from the default repository
>>> x, y = load_dataset("SyntheticControl")
or if original training and testing splits are to be preserved
>>> x_train, x_test, y_train, y_test = load_dataset("SyntheticControl", merge_train_test=False)
or for a specific version of the dataset
>>> x_train, x_test, y_train, y_test = load_dataset("Wafer", repository='wildboar/ucr-tiny:1.0')
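The repository string packs several components into one argument. A sketch of how the {repository}/{bundle}[:{version}][:{tag}] format can be decomposed (a hypothetical helper for illustration, not part of wildboar's API):

```python
def parse_repository_string(repository):
    """Split '{repository}/{bundle}[:{version}][:{tag}]' into its parts."""
    repo, _, rest = repository.partition("/")
    bundle, *extras = rest.split(":")
    version = extras[0] if len(extras) > 0 else None
    tag = extras[1] if len(extras) > 1 else None
    return repo, bundle, version, tag

print(parse_repository_string("wildboar/ucr-tiny:1.0"))
# ('wildboar', 'ucr-tiny', '1.0', None)
```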
- wildboar.datasets.list_datasets(repository='wildboar/ucr', *, cache_dir=None, create_cache_dir=True, progress=True, force=False)#
List the datasets in the repository
- Parameters:
repository (str or Bundle, optional) – The data repository. If str, load a named bundle, formatted as {repository}/{bundle}.
progress (bool, optional) – Show a progress bar while downloading a bundle.
cache_dir (str, optional) – The directory where downloaded files are cached (default='wildboar_cache')
create_cache_dir (bool, optional) – Create cache directory if missing (default=True)
force (bool, optional) – Force re-download of cached bundle
- Returns:
dataset – A set of dataset names
- Return type:
set
- wildboar.datasets.clear_cache(repository=None, *, cache_dir=None, keep_last_version=True)#
Clear the cache by deleting cached datasets
- Parameters:
repository (str, optional) – The name of the repository whose cache to clear. If None, clear the cache of all repositories.
cache_dir (str, optional) – The cache directory
keep_last_version (bool, optional) – If true, keep the latest version of each repository.
- wildboar.datasets.get_repository(repository)#
Get repository by name
- Parameters:
repository (str) – Repository name
- Returns:
repository – A repository
- Return type:
Repository
- wildboar.datasets.install_repository(repository)#
Install repository
- Parameters:
repository (str or Repository) – A repository
- wildboar.datasets.get_bundles(repository)#
Get all bundles in the repository
- Parameters:
repository (str) – Name of the repository
- Returns:
A dict of key, Bundle pairs
- Return type:
dict
- wildboar.datasets.list_bundles(repository)#
Get a list of all bundle names in the specified repository.
- Parameters:
repository (str) – The name of the repository
- Returns:
bundles – The names of the bundles in the repository
- Return type:
list
- wildboar.datasets.list_repositories()#
List the keys of all installed repositories