Repositories#
We can either initialize repositories directly or use them together with the
load_dataset function:
from wildboar.datasets import load_dataset
x, y = load_dataset('GunPoint', repository='wildboar/ucr')
Installed repositories and dataset bundles can be listed using the functions
list_repositories and list_bundles, respectively.
>>> from wildboar.datasets import list_repositories, list_bundles, list_datasets
>>> list_repositories()
['wildboar']
>>> list_bundles("wildboar")
['ucr', 'ucr-tiny', ... (and more)]
>>> list_datasets("wildboar/ucr-tiny")
['Beef', 'Coffee', 'GunPoint', 'SyntheticControl', 'TwoLeadECG']
Repository definitions#
A wildboar repository string is composed of two required components and one optional component, written as:
Changed in version 1.2: The {version} specifier has been removed. The version is determined by the repository.
{repository}/{bundle}[:{tag}]
└─────┬────┘ └───┬──┘└───┬──┘
      │          │       └── (optional) The tag as defined below.
      │          └── (required) The bundle as listed by list_bundles().
      └── (required) The repository as listed by list_repositories().
Each part of the repository has the following requirements:
{repository}
   The repository identifier, as listed by datasets.list_repositories. The identifier is composed of letters, matching the regular expression \w+.
{bundle}
   The bundle identifier, as listed by datasets.list_bundles. The identifier is composed of alphanumeric characters and -, matching the regular expression [a-zA-Z0-9\-]+.
{tag}
   The bundle tag (defaults to default). The tag is composed of letters and -, matching the regular expression [a-zA-Z-]+.
To exemplify, these are valid repository declarations:
wildboar/ucr
   The ucr bundle from the wildboar repository, using the default tag.
wildboar/ucr-tiny
   The ucr-tiny bundle from the wildboar repository, using the default tag.
wildboar/outlier:hard
   The outlier bundle from the wildboar repository, using the tag hard.
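Parsing a repository string amounts to combining the three component expressions listed above into a single regular expression. The helper below is an illustrative sketch, not wildboar's internal parser:

```python
import re

# Combines the component patterns from above: \w+ for the repository,
# [a-zA-Z0-9\-]+ for the bundle, and an optional [a-zA-Z-]+ tag.
REPO_RE = re.compile(
    r"^(?P<repository>\w+)/(?P<bundle>[a-zA-Z0-9\-]+)(?::(?P<tag>[a-zA-Z-]+))?$"
)

def parse_repository(value):
    """Split '{repository}/{bundle}[:{tag}]' into its three parts."""
    match = REPO_RE.match(value)
    if match is None:
        raise ValueError(f"invalid repository string: {value!r}")
    repository, bundle, tag = match.group("repository", "bundle", "tag")
    # When no tag is given, the tag defaults to 'default'.
    return repository, bundle, tag or "default"

print(parse_repository("wildboar/ucr"))           # ('wildboar', 'ucr', 'default')
print(parse_repository("wildboar/outlier:hard"))  # ('wildboar', 'outlier', 'hard')
```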
Installing repositories#
A repository implements the interface of the class
Repository.
Note
The default repository (wildboar) is loaded by the class
JSONRepository, which can load datasets specified by a
JSON endpoint.
Repositories are installed using the function
install_repository which takes either a URL to a JSON-file or
an instance of (or a class implementing the interface of)
Repository.
>>> from wildboar.datasets import install_repository
>>> install_repository("https://www.example.org/repo.json")
>>> list_repositories()
['wildboar', 'example']
>>> load_dataset("example", repository="example/example")
Repositories can be refreshed using datasets.refresh_repositories,
which accepts a repository name to refresh a specific repository or None
(default) to refresh all repositories. Additionally, we can specify an optional
refresh timeout (in seconds), and an optional cache location.
Changed in version 1.1: Wildboar caches the repository definition locally to allow cached datasets to be used while offline.
Local cache#
Wildboar downloads datasets on demand the first time a bundle is requested, and caches them to disk in a directory determined by the operating system. Wildboar caches datasets and repositories in the following directories:
- Windows: %LOCALAPPDATA%\cache\wildboar
- GNU/Linux: $XDG_CACHE_HOME/wildboar. If $XDG_CACHE_HOME is unset, we default to ~/.cache.
- macOS: ~/Library/Caches/wildboar
- Fallback: ~/.cache/wildboar
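The lookup above can be sketched as a small helper. This is an illustration of the rules, with hypothetical names, not wildboar's actual implementation; platform and environment are passed in explicitly so the logic is easy to follow:

```python
import os

def default_cache_dir(platform, environ):
    """Resolve a wildboar-style cache directory following the rules above.

    `platform` is a sys.platform-style string and `environ` a mapping of
    environment variables (both parameters of this hypothetical helper).
    """
    home = environ.get("HOME", os.path.expanduser("~"))
    if platform.startswith("win"):
        return os.path.join(environ["LOCALAPPDATA"], "cache", "wildboar")
    if platform.startswith("linux"):
        # Fall back to ~/.cache when $XDG_CACHE_HOME is unset.
        xdg = environ.get("XDG_CACHE_HOME", os.path.join(home, ".cache"))
        return os.path.join(xdg, "wildboar")
    if platform == "darwin":
        return os.path.join(home, "Library", "Caches", "wildboar")
    # Fallback for any other operating system.
    return os.path.join(home, ".cache", "wildboar")

print(default_cache_dir("linux", {"HOME": "/home/user"}))
# /home/user/.cache/wildboar
```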
The user can change the cache directory, either globally (for as long as the
current Python session lasts) with datasets.set_cache_dir or locally
(for a specific operation) with the cache_dir parameter:
>>> from wildboar.datasets import set_cache_dir
>>> set_cache_dir("/path/to/wildboar-cache/") # Set the global cache
>>> load_dataset("GunPoint", cache_dir="/path/to/another/wildboar-cache/") # Another, local, cache here
If called without arguments, set_cache_dir resets the cache
to the default location based on the operating system.
JSON repositories#
By default, repositories installed with install_repository should point to a JSON file, which describes the available datasets and the location where Wildboar can download them. The repository declaration is a JSON file:
{
   "name": "example", // required
   "version": "1.0",  // required
   "wildboar_requires": "1.1", // required, the minimum required wildboar version
   "bundle_url": "https://example.org/download/{key}/{tag}-v{version}", // required, the data endpoint
   "bundles": [ // required
      {
         "key": "example", // required, unique key of the bundle
         "version": "1.0", // required, the default version of the dataset
         "tag": "default", // optional, the default tag
         "name": "UCR Time series repository", // required
         "description": "Example dataset", // optional
         "arrays": ["x", "y"], // optional
         "collections": {"key": ["example1", "example"]} // optional
      }
   ]
}
- The attributes {key}, {version} and {tag} in the bundle_url are replaced with the bundle key, bundle version and bundle tag from the repository string. All attributes are required in the URL.
- The arrays attribute is optional. If it is not present, the dataset is assumed to be either a single NumPy array, where the last column contains the class label, or a NumPy dict with both x and y keys.
- If any value other than x and/or y is present in the arrays list, it is loaded into an extras dictionary, which is only returned if requested by the user.
- If y is not present in arrays, load_dataset returns None for y.
- The bundles/version attribute is the version of the bundle.
- The bundles/tag attribute is the default tag of the bundle, which is used unless the user specifies an alternative. If not specified, the tag is default.
- The bundles/collections attribute is a dictionary of named collections of datasets, which can be specified using load_datasets(..., collection="key").
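The {key}, {version} and {tag} placeholders in bundle_url follow Python's str.format conventions, so the download URL for the example declaration above can be reconstructed as follows (a sketch of the substitution rule; the exact mechanics inside wildboar may differ):

```python
# The data endpoint from the example repository declaration.
bundle_url = "https://example.org/download/{key}/{tag}-v{version}"

# Substitute the bundle key, the default tag, and the bundle version "1.0",
# as they would be resolved for the repository string "example/example".
url = bundle_url.format(key="example", tag="default", version="1.0")
print(url)  # https://example.org/download/example/default-v1.0
```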
The bundle_url points to a remote location that for each bundle key,
contains two files with extensions .zip and .sha respectively. In the
example, bundle_url should contain the two files
example/default-v1.0.zip and example/default-v1.0.sha. The .sha-file
should contain the sha1 hash of the .zip-file to ensure the integrity
of the downloaded file. The zip-file should contain the datasets.
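Checking the downloaded archive against the .sha file can be sketched with hashlib. The helper below is hypothetical, but the sha1-of-the-zip contract is as described above:

```python
import hashlib

def verify_bundle(zip_bytes, expected_sha1):
    """Return True if the .zip content matches the digest from the .sha file."""
    actual = hashlib.sha1(zip_bytes).hexdigest()
    # The .sha file may end with a trailing newline, hence strip().
    return actual == expected_sha1.strip()

payload = b"example bundle content"
digest = hashlib.sha1(payload).hexdigest()
print(verify_bundle(payload, digest))      # True
print(verify_bundle(b"tampered", digest))  # False
```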
By default, wildboar supports dataset bundles formatted as zip-files
containing npy or npz-files, as created by numpy.save and
numpy.savez. The datasets in the zip-file must be named according
to the regular expression {dataset_name}(_TRAIN|_TEST)?.(npy|npz). That is,
the dataset name (as specified when using load_dataset) and optionally
_TRAIN or _TEST followed by the extension npy or npz. If there
are multiple datasets with the same name but different training or testing
tags, they will be merged. As such, if both _TRAIN and _TEST files are
present for the same name, load_dataset can return these train and test
samples separately by setting merge_train_test=False. For example,
the ucr-bundle provides the default train/test splits from the UCR time
series repository.
from wildboar.datasets import load_dataset
x_train, x_test, y_train, y_test = load_dataset(
   'GunPoint', repository='wildboar/ucr', merge_train_test=False
)
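The naming convention can be illustrated by applying the regular expression from above to the file names in a bundle and grouping the matches by dataset name, which reveals which datasets ship separate train and test parts (an illustrative sketch, not wildboar's loader):

```python
import re
from collections import defaultdict

# {dataset_name}(_TRAIN|_TEST)?.(npy|npz) from the text above; the name is
# matched lazily so the optional _TRAIN/_TEST suffix is captured separately.
NAME_RE = re.compile(r"^(?P<name>.+?)(?P<part>_TRAIN|_TEST)?\.(?P<ext>npy|npz)$")

def group_datasets(filenames):
    """Map each dataset name to the parts found in a bundle zip."""
    groups = defaultdict(list)
    for filename in filenames:
        match = NAME_RE.match(filename)
        if match:
            # Files without a _TRAIN/_TEST suffix hold the full dataset.
            groups[match.group("name")].append(match.group("part") or "FULL")
    return dict(groups)

files = ["GunPoint_TRAIN.npy", "GunPoint_TEST.npy", "Beef.npz"]
print(group_datasets(files))
# {'GunPoint': ['_TRAIN', '_TEST'], 'Beef': ['FULL']}
```

Names that appear with both _TRAIN and _TEST parts are the ones for which merge_train_test=False can return separate splits.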