Datasets#
Wildboar is distributed with an advanced system for handling dataset repositories. A dataset repository can be used to load benchmark datasets or to distribute or store datasets.
In its simplest form, we can use the function `datasets.load_dataset`:
from wildboar.datasets import load_dataset
x, y = load_dataset('GunPoint', repository='wildboar/ucr')
Loading datasets#
As described previously, `load_dataset` is the main entry point for easy loading of datasets, but we can also iteratively load multiple datasets using `load_datasets`. Currently, Wildboar only installs one repository by default, the wildboar repository. We hope that others will find the feature useful and will distribute their datasets as Wildboar repositories.
Note
One drawback of the current distribution approach is that we have to download the full bundle to load a single dataset. We hope to improve this in the future and download assets on-demand.
For small experiments, we can load a small selection of datasets from the `wildboar/ucr-tiny` bundle, either using `load_dataset` or using one of the named functions, e.g., `load_gun_point` (browse `wildboar.datasets` for all such functions).
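As a minimal sketch, assuming the named loader mirrors `load_dataset` and returns the merged arrays by default:

from wildboar.datasets import load_gun_point

# Assumed to behave like load_dataset("GunPoint", repository="wildboar/ucr-tiny"),
# returning the time series and labels.
x, y = load_gun_point()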
Loading a single dataset#
We can load a single dataset as follows:
>>> from wildboar.datasets import load_dataset
>>> x, y = load_dataset("GunPoint", repository="wildboar/ucr-tiny")
Downloading ucr-tiny-v1.0.2-default.zip (688.43 KB)
|██████████████████████████████████████████████----| 668.43/688.43 KB
>>> x.shape
(200, 150)
Wildboar offers additional operations that we can perform while loading datasets. For example, we can preprocess the time series or return the predefined training/testing splits by setting `merge_train_test` to `False`.
>>> x_train, x_test, y_train, y_test = load_dataset("GunPoint", merge_train_test=False)
>>> x_train.shape, x_test.shape
((50, 150), (150, 150))
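We can also preprocess the time series while loading; a small sketch using the `minmax_scale` preprocessor that also appears further down in this section:

>>> # Scale each time series while loading.
>>> x, y = load_dataset("GunPoint", repository="wildboar/ucr-tiny", preprocess="minmax_scale")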
We can also force a re-download of an already cached bundle by setting `force` to `True`, and change the `dtype` of the returned time series:
>>> x, y = load_dataset("GunPoint", dtype=float, force=True)
# ... re-download dataset
Note
To reduce the download size, the datasets downloaded from the wildboar repository are stored as 32-bit floating point values. However, `load_dataset` converts the values to 64-bit when loading the data, to conform with the default value conventions of Wildboar.
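A quick check of this behaviour (a sketch; any download output is omitted since the bundle is already cached):

>>> x, y = load_dataset("GunPoint", repository="wildboar/ucr-tiny")
>>> x.dtype  # converted to 64-bit on load
dtype('float64')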
Loading multiple datasets#
When running experiments, a common workflow is to load multiple datasets, then fit and evaluate some estimator on each. In Wildboar, we can repeatedly load datasets from a bundle using the `load_datasets` function:
>>> from wildboar.datasets import load_datasets
>>> for name, (x, y) in load_datasets("wildboar/ucr-tiny"):
... print(name, x.shape)
...
Beef (60, 470)
Coffee (56, 286)
GunPoint (200, 150)
SyntheticControl (600, 60)
TwoLeadECG (1162, 82)
Loading multiple datasets also supports setting `merge_train_test` to `False`:
>>> for name, (x_train, x_test, y_train, y_test) in load_datasets("wildboar/ucr-tiny"):
... print(name, x_train.shape)
...
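Putting the pieces together, a minimal experiment loop might look like the sketch below. It assumes any scikit-learn-compatible classifier (here a 1-nearest-neighbour classifier) can be fitted on the 2-D arrays returned by the loaders:

from sklearn.neighbors import KNeighborsClassifier

from wildboar.datasets import load_datasets

# Fit and evaluate one estimator per dataset using the predefined
# training/testing splits of the bundle.
for name, (x_train, x_test, y_train, y_test) in load_datasets(
    "wildboar/ucr-tiny", merge_train_test=False
):
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(x_train, y_train)
    print(name, clf.score(x_test, y_test))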
Filters#
We can also specify filters to filter the datasets on the number of dimensions, samples, timesteps, labels, and dataset names. We specify filters with the `filter` parameter, which accepts a `list`, `dict`, or `str`. We express string filters as:
                   ┌── Operator specification
           ┌───────┴──────┐
(attribute)[<|<=|>|>=|=|~=](\d+|\w+)
└────┬────┘└───────┬──────┘└───┬───┘
     │             │           └── A number or (part of) a dataset name
     │             └── The comparison operator
     └── The attribute name
The attribute name is one of the following (self-explanatory) attributes:

- `n_samples` (`int`): The number of samples.
- `n_timesteps` (`int`): The number of time steps.
- `n_dims` (`int`): The number of dimensions.
- `n_labels` (`int`): The number of labels.
- `dataset` (`str`): The dataset name.
The comparison operators for `int` are `<`, `<=`, `>`, `>=`, and `=`, for less-than, less-than-or-equal, greater-than, greater-than-or-equal, and exactly-equal-to, respectively. The `str` comparison operators are `=` and `~=`, for exactly-equal-to and exists-in, respectively.
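For example, a name filter built from this syntax might look as follows (a sketch; the value `dataset~=Gun` is an illustrative assumption, matching any dataset whose name contains "Gun"):

# Keep only datasets whose name contains "Gun" (~= is the exists-in operator);
# in wildboar/ucr-tiny this would match GunPoint only.
for name, (x, y) in load_datasets("wildboar/ucr-tiny", filter="dataset~=Gun"):
    print(name, x.shape)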
Filters can be chained, so that every condition must hold, using a list or a dict:
>>> large = "n_samples>=100"
>>> large_multivariate = ["n_samples>=100", "n_dims>1"]
>>> large_multiclass = {
... "n_samples": ">=100",
... "n_labels": ">2",
... }
>>> load_datasets("wildboar/ucr-tiny", filter=large_multiclass)
<generator object load_datasets at 0x7f262ce95d00>
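Since `load_datasets` returns a generator (as the output above shows), we have to iterate over it, or materialize it, to actually load the matching datasets. A minimal sketch:

# Materialize the lazy generator into a dict mapping name -> (x, y).
filtered = dict(load_datasets("wildboar/ucr-tiny", filter=large_multiclass))
for name, (x, y) in filtered.items():
    print(name, x.shape)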
Warning
If we load multiple datasets with the parameter `merge_train_test` set to `False`, the filters are applied to the training part only.
`load_datasets` also accepts all parameters that are valid for `load_dataset`, so we can also preprocess the time series:
>>> load_datasets("wildboar/ucr-tiny", filter=large, preprocess="minmax_scale")
<generator object load_datasets at 0x7f262ce95d00>