Datasets

Wildboar is distributed with an advanced system for handling dataset repositories. A dataset repository can be used to load benchmark datasets or to distribute or store datasets.

What is a repository?

In short, a repository is a collection of dataset bundles. More specifically, a repository links to bundles (zip files) containing datasets or dataset parts that can be downloaded, cached and loaded by wildboar.

How to use a repository?

Repositories are either initialized directly or used together with the load_dataset function.

>>> from wildboar.datasets import load_dataset
>>> x, y = load_dataset('GunPoint', repository='wildboar/ucr')
# ... downloading repository to cache folder...
>>> x.shape

Installed repositories and dataset bundles can be listed using the functions list_repositories and list_bundles, respectively.

>>> from wildboar.datasets import list_repositories, list_bundles
>>> list_repositories()
['wildboar']
>>> list_bundles("wildboar")
['ucr', 'ucr-tiny']

Note

Repositories are cached locally in a folder controlled by the parameter cache_dir. The default directory depends on the platform. To use a different cache directory, pass cache_dir:

>>> load_dataset("Wafer", repository="wildboar/ucr", cache_dir="/data/my_cache_drive")

Warning

The default cache location changed in version 1.0.4. To use the old location, set cache_dir to 'wildboar_cache'.

To force a re-download of an already cached repository, set the parameter force to True.

Note

A wildboar repository string is composed of two mandatory and two optional components, written as {repository}/{bundle}[:{version}][:{tag}]

{repository}: The repository identifier. The identifier is composed of word characters (letters, digits and underscores), matching \w+. List installed repositories with list_repositories() and the bundles of a repository with list_bundles(repository).
{bundle}: The bundle identifier, i.e., the dataset bundle of a repository. The available datasets can be listed with list_datasets("{repository}/{bundle}"). The identifier is composed of alphanumeric characters and -, matching [a-zA-Z0-9\-]+.
{version}: The bundle version (defaults to the version specified by the repository). The version must match {major}[.{minor}][.{revision}].
{tag}: The bundle tag (defaults to default). The bundle tag is composed of letters and -, matching [a-zA-Z-]+.

Examples

wildboar/ucr: the ucr bundle from the wildboar repository using the latest version and the default tag.
wildboar/ucr-tiny:1.0: the ucr-tiny bundle from the wildboar repository using the version 1.0 and default tag.
wildboar/outlier:1.0:hard: the outlier bundle, with version 1.0, from the wildboar repository using the tag hard.
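
The component rules above can be combined into a single regular expression. The following is a minimal sketch, assuming the patterns listed in the note; the names parse_repository_string and RepositoryString are hypothetical helpers written for illustration and are not part of wildboar's API, and wildboar's own parser may behave differently.

```python
import re
from typing import NamedTuple, Optional

# Pattern assembled from the component rules in the note above. This is an
# illustrative assumption, not wildboar's internal parser.
_REPOSITORY_STRING = re.compile(
    r"(?P<repository>\w+)"                   # {repository}: \w+
    r"/(?P<bundle>[a-zA-Z0-9\-]+)"           # {bundle}: [a-zA-Z0-9\-]+
    r"(?::(?P<version>\d+(?:\.\d+){0,2}))?"  # {major}[.{minor}][.{revision}]
    r"(?::(?P<tag>[a-zA-Z-]+))?"             # {tag}: [a-zA-Z-]+
)


class RepositoryString(NamedTuple):
    """Hypothetical container for the four components."""

    repository: str
    bundle: str
    version: Optional[str]  # None means "use the repository's default version"
    tag: str


def parse_repository_string(spec: str) -> RepositoryString:
    """Split a repository string into its components (hypothetical helper)."""
    match = _REPOSITORY_STRING.fullmatch(spec)
    if match is None:
        raise ValueError(f"invalid repository string: {spec!r}")
    return RepositoryString(
        repository=match["repository"],
        bundle=match["bundle"],
        version=match["version"],
        tag=match["tag"] or "default",  # the tag defaults to "default"
    )


print(parse_repository_string("wildboar/outlier:1.0:hard"))
# RepositoryString(repository='wildboar', bundle='outlier', version='1.0', tag='hard')
```

The sketch leaves version as None when it is omitted, since the default version is chosen by the repository and cannot be derived from the string alone.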

Installing repositories

A repository implements the interface of the class wildboar.datasets.Repository.

Note

The default wildboar repository is implemented as a JSONRepository, which specifies (versioned) datasets on a JSON endpoint.

Repositories are installed using the function install_repository, which takes either a URL to a JSON file or an instance of a Repository.

>>> from wildboar.datasets import install_repository
>>> install_repository("https://www.example.org/repo.json")
>>> list_bundles("example")
>>> load_dataset("example", repository="example/example")

Repository JSON specification

The JSONRepository expects a JSON-file following the specification below.

{
    "name": "example",
    "version": "1.0",
    "wildboar_requires": "1.0.4",
    "bundle_url": "https://example.org/download/{key}-v{version}.zip",
    "bundles": [
      {
        "key": "example",
        "version": "1.0",
        "name": "UCR Time series repository",
        "description": "Example dataset",
        "class_index": -1
      }
    ]
}
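
To illustrate how the fields fit together, the sketch below parses a specification shaped like the one above with Python's json module and expands the bundle_url template for each bundle. It assumes that {key} and {version} in bundle_url are replaced by the bundle's own key and version; the function bundle_download_urls is a hypothetical helper and not part of wildboar.

```python
import json

# A specification with the same fields as the example above.
SPEC = """
{
    "name": "example",
    "version": "1.0",
    "wildboar_requires": "1.0.4",
    "bundle_url": "https://example.org/download/{key}-v{version}.zip",
    "bundles": [
        {
            "key": "example",
            "version": "1.0",
            "name": "UCR Time series repository",
            "description": "Example dataset",
            "class_index": -1
        }
    ]
}
"""


def bundle_download_urls(spec_text):
    """Map each bundle key to its download URL (hypothetical helper).

    Assumes {key} and {version} in "bundle_url" are filled in from each
    bundle entry; wildboar's actual resolution logic may differ.
    """
    spec = json.loads(spec_text)  # raises json.JSONDecodeError on invalid JSON
    return {
        bundle["key"]: spec["bundle_url"].format(
            key=bundle["key"], version=bundle["version"]
        )
        for bundle in spec["bundles"]
    }


print(bundle_download_urls(SPEC))
# {'example': 'https://example.org/download/example-v1.0.zip'}
```

Note that json.loads is strict: a trailing comma after the last entry of "bundles" would be rejected, so the file must be valid JSON.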