Datasets#
Wildboar is distributed with an advanced system for handling dataset repositories. A dataset repository can be used to load benchmark datasets or to distribute or store datasets.
In its simplest form, we can use the function `datasets.load_dataset`:
from wildboar.datasets import load_dataset
x, y = load_dataset('GunPoint', repository='wildboar/ucr')
Loading datasets#
As described previously, `load_dataset` is the main entry point for easy loading of datasets, but we can also iteratively load multiple datasets using `load_datasets`. Currently, Wildboar only installs one repository by default, the wildboar repository. We hope that others will find the feature useful and will distribute their datasets as Wildboar repositories.
Note
One drawback of the current distribution approach is that we have to download the full bundle to load a single dataset. We hope to improve this in the future and download assets on-demand.
For small experiments, we can load a small selection of datasets from the `wildboar/ucr-tiny` bundle, either using `load_dataset` or using one of the named functions, e.g., `load_gun_point` (browse `wildboar.datasets` for all such functions).
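As a minimal sketch, assuming the named loader mirrors `load_dataset` and returns the merged arrays by default:

from wildboar.datasets import load_gun_point

# Assumed to behave like load_dataset("GunPoint", repository="wildboar/ucr-tiny"),
# returning the time series and labels.
x, y = load_gun_point()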
Loading a single dataset#
We can load a single dataset as follows:
>>> from wildboar.datasets import load_dataset
>>> x, y = load_dataset("GunPoint", repository="wildboar/ucr-tiny")
Downloading ucr-tiny-v1.0.2-default.zip (688.43 KB)
|██████████████████████████████████████████████----| 668.43/688.43 KB
>>> x.shape
(200, 150)
Wildboar offers additional operations that we can perform while loading datasets. For example, we can preprocess the time series or return the predefined training/testing splits by setting `merge_train_test` to `False`.
>>> x_train, x_test, y_train, y_test = load_dataset("GunPoint", merge_train_test=False)
>>> x_train.shape, x_test.shape
((50, 150), (150, 150))
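We can also preprocess the time series while loading; a small sketch using the `minmax_scale` preprocessor that also appears further down in this section:

>>> # Scale each time series while loading.
>>> x, y = load_dataset("GunPoint", repository="wildboar/ucr-tiny", preprocess="minmax_scale")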
We can also force a re-download of an already cached bundle by setting `force` to `True`, and change the `dtype` of the returned time series:
>>> x, y = load_dataset("GunPoint", dtype=float, force=True)
# ... re-download dataset
Note
To reduce the download size, the datasets downloaded from the wildboar repository are stored as 32-bit floating point values. However, `load_dataset` converts the values to 64-bit when loading the data, to conform with the default value conventions of Wildboar.
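A quick check of this behaviour (a sketch; any download output is omitted since the bundle is already cached):

>>> x, y = load_dataset("GunPoint", repository="wildboar/ucr-tiny")
>>> x.dtype  # converted to 64-bit on load
dtype('float64')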
Loading multiple datasets#
When running experiments, a common workflow is to load multiple datasets, then fit and evaluate some estimator on each. In Wildboar, we can repeatedly load datasets from a bundle using the `load_datasets` function:
>>> from wildboar.datasets import load_datasets
>>> for name, (x, y) in load_datasets("wildboar/ucr-tiny"):
... print(name, x.shape)
...
Beef (60, 470)
Coffee (56, 286)
GunPoint (200, 150)
SyntheticControl (600, 60)
TwoLeadECG (1162, 82)
Loading multiple datasets also supports setting `merge_train_test` to `False`:
>>> for name, (x_train, x_test, y_train, y_test) in load_datasets("wildboar/ucr-tiny"):
... print(name, x_train.shape)
...
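Putting the pieces together, a minimal experiment loop might look like the sketch below. It assumes any scikit-learn-compatible classifier (here a 1-nearest-neighbour classifier) can be fitted on the 2-D arrays returned by the loaders:

from sklearn.neighbors import KNeighborsClassifier

from wildboar.datasets import load_datasets

# Fit and evaluate one estimator per dataset using the predefined
# training/testing splits of the bundle.
for name, (x_train, x_test, y_train, y_test) in load_datasets(
    "wildboar/ucr-tiny", merge_train_test=False
):
    clf = KNeighborsClassifier(n_neighbors=1)
    clf.fit(x_train, y_train)
    print(name, clf.score(x_test, y_test))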
Filters#
We can also specify filters to filter the datasets on the number of dimensions, samples, timesteps, labels, and dataset names. We specify filters with the `filter` parameter, which accepts a `list`, `dict`, or `str`. We express string filters as:
                   ┌── Operator specification
           ┌───────┴──────┐
(attribute)[<|<=|>|>=|=|~=](\d+|\w+)
└────┬────┘└───────┬──────┘└───┬───┘
     │             │           └── A number or (part of) a dataset name
     │             └── The comparison operator
     └── The attribute name
The attribute name is one of the following (self-explanatory) attributes:

- `n_samples` (`int`): The number of samples.
- `n_timesteps` (`int`): The number of time steps.
- `n_dims` (`int`): The number of dimensions.
- `n_labels` (`int`): The number of labels.
- `dataset` (`str`): The dataset name.
The comparison operators for `int` are `<`, `<=`, `>`, `>=`, and `=`, for less-than, less-than-or-equal, greater-than, greater-than-or-equal, and exactly-equal-to, respectively. The `str` comparison operators are `=` and `~=`, for exactly-equal-to and exists-in, respectively.
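For example, a name filter built from this syntax might look as follows (a sketch; the value `dataset~=Gun` is an illustrative assumption, matching any dataset whose name contains "Gun"):

# Keep only datasets whose name contains "Gun" (~= is the exists-in operator);
# in wildboar/ucr-tiny this would match GunPoint only.
for name, (x, y) in load_datasets("wildboar/ucr-tiny", filter="dataset~=Gun"):
    print(name, x.shape)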
Filters can be chained, so that every condition must hold, using a list or a dict:
>>> large = "n_samples>=100"
>>> large_multivariate = ["n_samples>=100", "n_dims>1"]
>>> large_multiclass = {
... "n_samples": ">=100",
... "n_labels": ">2",
... }
>>> load_datasets("wildboar/ucr-tiny", filter=large_multiclass)
<generator object load_datasets at 0x7f262ce95d00>
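Since `load_datasets` returns a generator (as the output above shows), we have to iterate over it, or materialize it, to actually load the matching datasets. A minimal sketch:

# Materialize the lazy generator into a dict mapping name -> (x, y).
filtered = dict(load_datasets("wildboar/ucr-tiny", filter=large_multiclass))
for name, (x, y) in filtered.items():
    print(name, x.shape)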
Warning
If we load multiple datasets with the parameter `merge_train_test` set to `False`, the filters are applied to the training part only.
`load_datasets` also accepts all parameters that are valid for `load_dataset`, so we can also preprocess the time series:
>>> load_datasets("wildboar/ucr-tiny", filter=large, preprocess="minmax_scale")
<generator object load_datasets at 0x7f262ce95d00>