Datasets#
Wildboar is distributed with an advanced system for handling dataset repositories. A dataset repository can be used to load benchmark datasets or to distribute or store datasets.
What is a repository?#
In short, a repository is a collection of dataset bundles. More specifically, a repository links to bundles (zip-files) containing datasets or dataset parts that can be downloaded, cached and loaded by wildboar.
How to use a repository?#
Repositories are either initialized directly or used together with the load_dataset function.
>>> from wildboar.datasets import load_dataset
>>> x, y = load_dataset('GunPoint', repository='wildboar/ucr')
# ... downloading repository to cache folder...
>>> x.shape
Installed repositories and dataset bundles can be listed using the functions
list_repositories and list_bundles, respectively.
>>> from wildboar.datasets import list_repositories, list_bundles
>>> list_repositories()
['wildboar']
>>> list_bundles("wildboar")
['ucr', 'ucr-tiny']
Note
Repositories are cached locally in a folder controlled by the parameter cache_dir. The default directory
depends on the platform. To use a different cache directory:
>>> load_dataset("Wafer", repository="wildboar/ucr", cache_dir="/data/my_cache_drive")
Warning
The default cache location changed in version 1.0.4. To use the old location, set cache_dir
to 'wildboar_cache'.
To force a re-download of an already cached repository, set the parameter force to True.
Note
A wildboar repository string is composed of two mandatory and two optional
components written as {repository}/{bundle}[:{version}][:{tag}]
{repository}: The repository identifier. The identifier is composed of letters and matches \w+. List repositories with list_repositories() and the bundles of a repository with list_bundles(repository).
{bundle}: The bundle identifier, i.e., the dataset bundle of a repository. The identifier is composed of alphanumeric characters and -, matching [a-zA-Z0-9\-]+. The available datasets can be listed with list_datasets("{repository}/{bundle}").
{version}: The bundle version (defaults to the version specified by the repository). The version must match {major}[.{minor}][.{revision}].
{tag}: The bundle tag (defaults to default). The bundle tag is composed of letters and -, matching [a-zA-Z-]+.
Examples
wildboar/ucr: the ucr bundle from the wildboar repository using the latest version and the default tag.
wildboar/ucr-tiny:1.0: the ucr-tiny bundle from the wildboar repository using the version 1.0 and default tag.
wildboar/outlier:1.0:hard: the outlier bundle, with version 1.0, from the wildboar repository using the tag hard.
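The format above can be sketched as a single regular expression. The helper parse_repository_string below is hypothetical and not part of wildboar; it only mirrors the patterns stated in this note, and it assumes that the version components are numeric and that a tag may follow the bundle directly when the version is omitted.

```python
import re

# Hypothetical helper (not wildboar's own parser). The character classes are
# taken from the note above; numeric version components are an assumption.
_REPOSITORY_STRING = re.compile(
    r"(?P<repository>\w+)/(?P<bundle>[a-zA-Z0-9\-]+)"
    r"(?::(?P<version>\d+(?:\.\d+){0,2}))?"  # {major}[.{minor}][.{revision}]
    r"(?::(?P<tag>[a-zA-Z-]+))?"
)


def parse_repository_string(value):
    """Split a repository string into its four components."""
    match = _REPOSITORY_STRING.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid repository string: {value!r}")
    parts = match.groupdict()
    # A missing version stays None, standing in for "the version specified
    # by the repository"; a missing tag falls back to "default".
    if parts["tag"] is None:
        parts["tag"] = "default"
    return parts


print(parse_repository_string("wildboar/outlier:1.0:hard"))
```

A string without the mandatory {repository}/{bundle} part, such as "wildboar", raises a ValueError in this sketch.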
Installing repositories#
A repository implements the interface of the class wildboar.datasets.Repository.
Note
The default wildboar-repository is implemented using a JSONRepository which
specifies (versioned) datasets on a JSON endpoint.
Repositories are installed using the function install_repository, which takes
either a URL to a JSON-file or an instance of a Repository.
>>> from wildboar.datasets import install_repository
>>> install_repository("https://www.example.org/repo.json")
>>> list_bundles("example")
>>> load_dataset("example", repository="example/example")
Repository JSON specification#
The JSONRepository expects a JSON-file following the specification below.
{
    "name": "example",
    "version": "1.0",
    "wildboar_requires": "1.0.4",
    "bundle_url": "https://example.org/download/{key}-v{version}.zip",
    "bundles": [
        {
            "key": "example",
            "version": "1.0",
            "name": "UCR Time series repository",
            "description": "Example dataset",
            "class_index": -1
        }
    ]
}
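As a rough illustration of how the bundle_url template relates to the bundle entries, the sketch below loads the specification with Python's json module and fills the {key} and {version} placeholders from each bundle. The helper bundle_urls is hypothetical, and the assumption that the placeholders are filled from the bundle's own key and version fields is ours; wildboar's JSONRepository may resolve them differently.

```python
import json

# The example specification from above, as a string.
SPEC = """
{
    "name": "example",
    "version": "1.0",
    "wildboar_requires": "1.0.4",
    "bundle_url": "https://example.org/download/{key}-v{version}.zip",
    "bundles": [
        {
            "key": "example",
            "version": "1.0",
            "name": "UCR Time series repository",
            "description": "Example dataset",
            "class_index": -1
        }
    ]
}
"""


def bundle_urls(spec):
    """Map each bundle key to its download URL (hypothetical helper)."""
    # Assumption: {key} and {version} come from the bundle entry itself.
    return {
        bundle["key"]: spec["bundle_url"].format(
            key=bundle["key"], version=bundle["version"]
        )
        for bundle in spec["bundles"]
    }


spec = json.loads(SPEC)
urls = bundle_urls(spec)
print(urls)  # {'example': 'https://example.org/download/example-v1.0.zip'}
```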