.. currentmodule:: wildboar

.. _guide-datasets:

########
Datasets
########

Wildboar is distributed with an advanced system for handling dataset
repositories. A dataset repository can be used to load benchmark datasets or
to distribute or store datasets. In its simplest form, we can use the function
:func:`datasets.load_dataset`:

.. code-block:: python

    from wildboar.datasets import load_dataset

    x, y = load_dataset('GunPoint', repository='wildboar/ucr')

****************
Loading datasets
****************

As described previously, :func:`~datasets.load_dataset` is the main entry
point for easy loading of datasets, but we can also iteratively load multiple
datasets using :func:`~datasets.load_datasets`. Currently, Wildboar only
installs one repository by default, the `wildboar` repository. We hope that
others will find the feature useful and will distribute their datasets as
Wildboar repositories.

.. note::

    One drawback of the current distribution approach is that we have to
    download the full bundle to load a single dataset. We hope to improve this
    in the future and download assets on-demand.

For small experiments, we can load a small selection of datasets from the
``wildboar/ucr-tiny`` bundle, either using :func:`~datasets.load_dataset` or
using one of the named functions, e.g., :func:`~datasets.load_gun_point`
(browse :mod:`wildboar.datasets` for all such functions).

Loading a single dataset
========================

We can load a single dataset as follows:

.. code-block:: python

    >>> from wildboar.datasets import load_dataset
    >>> x, y = load_dataset("GunPoint", repository="wildboar/ucr-tiny")
    Downloading ucr-tiny-v1.0.2-default.zip (688.43 KB)
       |██████████████████████████████████████████████----| 668.43/688.43 KB
    >>> x.shape
    (200, 150)

Wildboar offers additional operations that we can perform while loading
datasets. For example, we can :doc:`preprocess the time series
<datasets/preprocess>` or return the original training and testing splits by
setting ``merge_train_test`` to :python:`False`.

.. code-block:: python

    >>> x_train, x_test, y_train, y_test = load_dataset("GunPoint", merge_train_test=False)
    >>> x_train.shape, x_test.shape
    ((50, 150), (150, 150))

We can also force a re-download of an already cached bundle by setting
``force`` to :python:`True`, and change the ``dtype`` of the returned time
series:

.. code-block:: python

    >>> load_dataset("GunPoint", dtype=float, force=True)  # ... re-downloads the bundle

.. note::

    To reduce the download size, the datasets downloaded from the ``wildboar``
    repository are stored as 32-bit floating point values. However,
    :func:`~datasets.load_dataset` converts the values to 64-bit when loading
    the data to conform with the default value conventions of Wildboar.

Loading multiple datasets
=========================

When running experiments, a common workflow is to load multiple datasets and
fit and evaluate some estimator on each of them. In Wildboar, we can
repeatedly load datasets from a bundle using the
:func:`~datasets.load_datasets` function:

.. code-block:: python

    >>> from wildboar.datasets import load_datasets
    >>> for name, (x, y) in load_datasets("wildboar/ucr-tiny"):
    ...     print(name, x.shape)
    ...
    Beef (60, 470)
    Coffee (56, 286)
    GunPoint (200, 150)
    SyntheticControl (600, 60)
    TwoLeadECG (1162, 82)

Loading multiple datasets also supports setting ``merge_train_test`` to
:python:`False`:

.. code-block:: python

    >>> for name, (x_train, x_test, y_train, y_test) in load_datasets(
    ...     "wildboar/ucr-tiny", merge_train_test=False
    ... ):
    ...     print(name, x_train.shape)
    ...
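As a minimal sketch of such an experiment (the 1-nearest neighbor classifier
from scikit-learn is used purely for illustration; any estimator that accepts
arrays of shape ``(n_samples, n_timesteps)`` could be substituted), we can fit
on the training part and evaluate on the testing part of each dataset:

.. code-block:: python

    >>> from sklearn.neighbors import KNeighborsClassifier
    >>> from wildboar.datasets import load_datasets
    >>> for name, (x_train, x_test, y_train, y_test) in load_datasets(
    ...     "wildboar/ucr-tiny", merge_train_test=False
    ... ):
    ...     # Fit on the training split and report accuracy on the testing split.
    ...     clf = KNeighborsClassifier(n_neighbors=1).fit(x_train, y_train)
    ...     print(name, clf.score(x_test, y_test))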
.. _datasets_filter:

Filters
=======

We can also specify filters to filter the datasets on the number of
dimensions, samples, timesteps, labels and dataset names. We specify filters
with the ``filter`` parameter, which accepts a :python:`list`, :python:`dict`
or :python:`str`. We express string filters as::

                           ┌── Operator specification
                   ┌───────┴──────┐
        (attribute)[<|<=|>|>=|=|~=](\d+|\w+)
        └────┬────┘└───────┬──────┘└───┬───┘
             │             │           └── A number or (part of) a dataset name
             │             └── The comparison operator
             └── The attribute name

The attribute name is one of the (self-explanatory) attributes:

``n_samples`` (:python:`int`)
    The number of samples.

``n_timesteps`` (:python:`int`)
    The number of time steps.

``n_dims`` (:python:`int`)
    The number of dimensions.

``n_labels`` (:python:`int`)
    The number of labels.

``dataset`` (:python:`str`)
    The dataset name.

The comparison operators for :python:`int` are ``<``, ``<=``, ``>``, ``>=``
and ``=``, for *less-than*, *less-than-or-equal*, *greater-than*,
*greater-than-or-equal* and *exactly-equal-to*, respectively. The
:python:`str` comparison operators are ``=`` and ``~=``, for
*exactly-equal-to* and *exists-in*, respectively.

Filters can be chained to express *and-also* using a :python:`list` or a
:python:`dict`:

.. code-block:: python

    >>> large = "n_samples>=100"
    >>> large_multivariate = ["n_samples>=100", "n_dims>1"]
    >>> large_multiclass = {
    ...     "n_samples": ">=100",
    ...     "n_labels": ">2",
    ... }
    >>> load_datasets("wildboar/ucr-tiny", filter=large_multiclass)

.. warning::

    If we load multiple datasets with ``merge_train_test`` set to
    :python:`False`, the filters are applied to the **training** part only.

:func:`~datasets.load_datasets` also accepts all parameters that are valid for
:func:`~datasets.load_dataset`, so we can also preprocess the time series:

.. code-block:: python

    >>> load_datasets("wildboar/ucr-tiny", filter=large, preprocess="minmax_scale")

.. toctree::
    :maxdepth: 2
    :hidden:

    datasets/repositories
    datasets/preprocess
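To illustrate the warning above, consider a hypothetical sketch using the
``wildboar/ucr-tiny`` bundle shown earlier: because the filter is matched
against the training part when ``merge_train_test`` is :python:`False`, a
filter such as ``n_samples>=100`` excludes ``GunPoint``, whose training part
contains only 50 samples, even though the merged dataset contains 200:

.. code-block:: python

    >>> # The filter is evaluated against x_train (50 samples for GunPoint),
    >>> # not against the merged dataset (200 samples).
    >>> for name, (x_train, x_test, y_train, y_test) in load_datasets(
    ...     "wildboar/ucr-tiny", filter="n_samples>=100", merge_train_test=False
    ... ):
    ...     print(name, x_train.shape)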