.. currentmodule:: wildboar

.. _guide-datasets-repositories:

############
Repositories
############

We can either initialize repositories directly or use them together with the
:func:`~datasets.load_dataset`-function:

.. code-block:: python

   from wildboar.datasets import load_dataset

   x, y = load_dataset('GunPoint', repository='wildboar/ucr')

Installed repositories and dataset bundles can be listed using the functions
:func:`~datasets.list_repositories` and :func:`~datasets.list_bundles`,
respectively.

.. code-block:: python

   >>> from wildboar.datasets import list_repositories, list_bundles, list_datasets
   >>> list_repositories()
   ['wildboar']
   >>> list_bundles("wildboar")
   ['ucr', 'ucr-tiny', ... (and more)]
   >>> list_datasets("wildboar/ucr-tiny")
   ['Beef', 'Coffee', 'GunPoint', 'SyntheticControl', 'TwoLeadECG']

**********************
Repository definitions
**********************

A wildboar repository string is composed of two required components and one
optional component, written as:

::

    {repository}/{bundle}[:{tag}]
    └─────┬────┘ └───┬──┘└───┬──┘
          │          │       └── (optional) The tag as defined below.
          │          └── (required) The bundle as listed by list_bundles().
          └── (required) The repository as listed by list_repositories().

.. versionchanged:: 1.2
   The ``{version}`` specifier has been removed. The version is determined by
   the repository.

Each part of the repository string has the following requirements:

``{repository}``
   The repository identifier, as listed by :func:`datasets.list_repositories`.
   The identifier is composed of word characters (letters, digits and
   underscores), matching the regular expression ``\w+``.

``{bundle}``
   The bundle identifier, as listed by :func:`datasets.list_bundles`. The
   identifier is composed of alphanumeric characters and ``-``, matching the
   regular expression ``[a-zA-Z0-9\-]+``.

``{tag}``
   The bundle tag (defaults to ``default``). The bundle tag is composed of
   letters and ``-``, matching the regular expression ``[a-zA-Z-]+``.

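
The three component expressions above can be combined into a single pattern.
The following is a minimal sketch of how a repository string could be split
into its parts. It is not wildboar's own parser, and the names
``REPOSITORY_PATTERN`` and ``split_repository_string`` are hypothetical,
introduced here for illustration only.

.. code-block:: python

   import re

   # Assumption: the pattern is assembled from the three expressions listed
   # above; wildboar's internal parser may differ.
   REPOSITORY_PATTERN = re.compile(
       r"(?P<repository>\w+)/(?P<bundle>[a-zA-Z0-9\-]+)(?::(?P<tag>[a-zA-Z-]+))?"
   )


   def split_repository_string(value):
       """Split ``{repository}/{bundle}[:{tag}]`` into its three parts."""
       match = REPOSITORY_PATTERN.fullmatch(value)
       if match is None:
           raise ValueError(f"invalid repository string: {value!r}")
       # The tag is optional and falls back to "default" when omitted.
       return match["repository"], match["bundle"], match["tag"] or "default"

For instance, ``split_repository_string("wildboar/ucr")`` yields
``("wildboar", "ucr", "default")``, whereas a string without a ``/`` raises a
``ValueError``.
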
To exemplify, these are valid repository declarations:

``wildboar/ucr``
   The ``ucr`` bundle from the ``wildboar`` repository using the default tag.

``wildboar/ucr-tiny``
   The ``ucr-tiny`` bundle from the ``wildboar`` repository using the default
   tag.

``wildboar/outlier:hard``
   The ``outlier`` bundle from the ``wildboar`` repository using the tag
   ``hard``.

***********************
Installing repositories
***********************

A repository implements the interface of the class
:class:`~datasets.Repository`.

.. note::

   The default repository (``wildboar``) is loaded by the class
   :class:`~datasets.JSONRepository`, which can load datasets specified by a
   JSON endpoint.

Repositories are installed using the function
:func:`~datasets.install_repository`, which takes either a URL to a JSON-file
or an instance of (or a class implementing the interface of)
:class:`~datasets.Repository`.

.. code-block:: python

   >>> from wildboar.datasets import install_repository
   >>> install_repository("https://www.example.org/repo.json")
   >>> list_repositories("example")
   >>> load_dataset("example", repository="example/example")

Repositories can be refreshed using :func:`datasets.refresh_repositories()`,
which accepts a repository name to refresh a specific repository or
:python:`None` (default) to refresh all repositories. Additionally, we can
specify an optional refresh timeout (in seconds) and an optional cache
location.

.. versionchanged:: 1.1
   Wildboar caches the repository definition locally to allow cached datasets
   to be used while offline.

Local cache
===========

Wildboar downloads datasets on demand, the first time we request a bundle, and
caches them on disk in a directory determined by the operating system.
Wildboar caches datasets and repositories in the following directories:

**Windows**
   ``%LOCALAPPDATA%\cache\wildboar``

**GNU/Linux**
   ``$XDG_CACHE_HOME/wildboar``. If ``$XDG_CACHE_HOME`` is unset, we default
   to ``.cache``.

**macOS**
   ``~/Library/Caches/wildboar``.

**Fallback**
   ``~/.cache/wildboar``

The user can change the cache directory, either globally (for as long as the
current Python session lasts) with :func:`datasets.set_cache_dir` or locally
(for a specific operation) with the ``cache_dir``-parameter:

.. code-block:: python

   >>> from wildboar.datasets import set_cache_dir
   >>> set_cache_dir("/path/to/wildboar-cache/")  # Set the global cache
   >>> load_dataset("GunPoint", cache_dir="/path/to/another/wildboar-cache/")  # Another, local, cache here

If called without arguments, :func:`~datasets.set_cache_dir` resets the cache
to the default location based on the operating system.

*****************
JSON repositories
*****************

By default, repositories installed with :func:`datasets.install_repository`
should point to a JSON-file, which describes the available datasets and the
location where Wildboar can download them. The repository declaration is a
JSON-file:

.. code-block:: json

   {
     "name": "example", // required
     "version": "1.0", // required
     "wildboar_requires": "1.1", // required, the minimum required wildboar version
     "bundle_url": "https://example.org/download/{key}/{tag}-v{version}", // required, the data endpoint
     "bundles": [ // required
       {
         "key": "example", // required, unique key of the bundle
         "version": "1.0", // required, the default version of dataset
         "tag": "default", // optional, the default tag
         "name": "UCR Time series repository", // required
         "description": "Example dataset", // optional
         "arrays": ["x", "y"], // optional
         "collections": {"key": ["example1", "example"]} // optional
       }
     ]
   }

- The attributes ``{key}``, ``{version}`` and ``{tag}`` in the ``bundle_url``
  are replaced with the key, version and tag of the requested bundle. All
  attributes are required in the URL.

- The ``arrays`` attribute is optional. If it is not present, the dataset is
  assumed to be either a single NumPy array, where the last column contains
  the class label, or a NumPy dict with both ``x`` and ``y`` keys.

  - If any value other than ``x`` and ``y`` is present in the ``arrays``-list,
    it is loaded into an ``extras``-dictionary and only returned if requested
    by the user.

  - If ``y`` is not present in ``arrays``, :func:`~datasets.load_dataset`
    returns :python:`None` for ``y``.

- The ``bundles/version`` attribute is the version of the bundle.

- The ``bundles/tag`` attribute is the default tag of the bundle, which is
  used unless the user specifies an alternative tag. If not specified, the
  tag is ``default``.

- The ``bundles/collections`` attribute is a dictionary of named collections
  of datasets, which can be specified when using
  :python:`load_datasets(..., collection="key")`.

The ``bundle_url`` points to a remote location that, for each bundle ``key``,
contains two files with the extensions ``.zip`` and ``.sha``, respectively. In
the example, the location should contain the two files
``example/default-v1.0.zip`` and ``example/default-v1.0.sha``. The
``.sha``-file should contain the ``sha1`` hash of the ``.zip``-file to ensure
the integrity of the downloaded file. The ``zip``-file should contain the
datasets.

By default, wildboar supports dataset bundles formatted as ``zip``-files
containing ``npy`` or ``npz``-files, as created by :func:`numpy.save` and
:func:`numpy.savez`. The datasets in the ``zip``-file must be named according
to the regular expression ``{dataset_name}(_TRAIN|_TEST)?\.(npy|npz)``. That
is, the dataset name (as specified when using ``load_dataset``), optionally
followed by ``_TRAIN`` or ``_TEST``, and then the extension ``npy`` or
``npz``. If there are multiple files with the same dataset name but different
training or testing tags, they are merged by default. If both ``_TRAIN`` and
``_TEST`` files are present for the same name, ``load_dataset`` can instead
return the train and test samples separately by setting
:python:`merge_train_test=False`. For example, the ``ucr``-bundle provides the
default train/test splits from the UCR time series repository.

.. code-block:: python

   from wildboar.datasets import load_dataset

   x_train, x_test, y_train, y_test = load_dataset(
       'GunPoint', repository='wildboar/ucr', merge_train_test=False
   )
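
When preparing a bundle for a custom JSON repository, it can be useful to
check the archive against the file-name convention and to compute the digest
for the ``.sha``-file. The following is a minimal sketch using only the Python
standard library. The helper names ``sha1_of_file`` and ``dataset_splits`` are
hypothetical, and the sketch assumes that the ``.sha``-file stores the
hexadecimal SHA-1 digest, which should be verified against the repository you
target.

.. code-block:: python

   import hashlib
   import re
   import zipfile

   # File-name convention described above: dataset name, optional split tag,
   # and the extension npy or npz.
   MEMBER_PATTERN = re.compile(
       r"(?P<name>.+?)(?P<split>_TRAIN|_TEST)?\.(?:npy|npz)"
   )


   def sha1_of_file(path, chunk_size=1 << 20):
       """Return the hexadecimal SHA-1 digest of the file at ``path``."""
       digest = hashlib.sha1()
       with open(path, "rb") as f:
           # Read in chunks so that large archives are not loaded at once.
           for chunk in iter(lambda: f.read(chunk_size), b""):
               digest.update(chunk)
       return digest.hexdigest()


   def dataset_splits(zip_path):
       """Map each dataset name in the archive to the split tags found."""
       splits = {}
       with zipfile.ZipFile(zip_path) as archive:
           for member in archive.namelist():
               match = MEMBER_PATTERN.fullmatch(member)
               if match is None:
                   raise ValueError(f"unexpected file in bundle: {member!r}")
               # An empty string marks a file without a _TRAIN/_TEST tag.
               splits.setdefault(match["name"], set()).add(match["split"] or "")
       return splits

A dataset whose entry contains both ``"_TRAIN"`` and ``"_TEST"`` is one for
which :python:`merge_train_test=False` can return separate train and test
samples.
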