Repositories#
Repositories are either initialized directly or used together with the load_dataset function.
from wildboar.datasets import load_dataset
x, y = load_dataset('GunPoint', repository='wildboar/ucr')
# ... downloading repository to cache folder...
Installed repositories and dataset bundles can be listed using the function
list_repositories and list_bundles respectively.
from wildboar.datasets import list_repositories, list_bundles
list_repositories()
['wildboar']
list_bundles("wildboar")
['ucr', 'ucr-tiny', ... (and more)]
Note
Repositories are cached locally in a folder controlled by the parameter cache_dir. The default directory
depends on platform. To change the default cache-directory:
load_dataset("Wafer", repository="wildboar/ucr", cache_dir="/data/my_cache_drive")
To force re-download of an already cached repository set the parameter force to True.
Note
A wildboar repository string is composed of 2 mandatory and two optional
components written as {repository}/{bundle}[:{version}][:{tag}]
{repository}The repository identifier. List available bundles use
list_bundles(repository). The identifier is composed of letters and match\w+. List repositories withlist_repositories().{bundle}The bundle identifier, i.e., the dataset bundle of a repository. The available datasets can be listed with
list_datasets("{repository}/{bundle}"). The identifier is composed of alphanumeric characters and -, matching[a-zA-Z0-9\-]+.{version}The bundle version (defaults to the version specified by the repository). The version must match
{major}[.{minor}][.{revision}].{tag}The bundle tag (defaults to
default). The bundle tag is composed of letters and -, matching[a-zA-Z-]+.
Examples
wildboar/ucr: the ucr bundle from the wildboar repository using the latest version and the ´default` tag.wildboar/ucr-tiny:1.0: the ucr-tiny bundle from the wildboar repository using the version 1.0 and default tag.wildboar/outlier:1.0:hard: the outlier bundle, with version 1.0, from the wildboar repository using the tag hard.
Installing repositories#
A repository implements the interface of the class wildboar.datasets.Repository
Note
The default wildboar-repository is implemented using a JSONRepository which
specifies (versioned) datasets on a JSON endpoint.
Repositories are installed using the function install_repository which takes
either an url to a JSON-file or an instance of a Repository.
from wildboar.datasets import install_repository
install_repository("https://www.example.org/repo.json")
list_repositories("example")
load_dataset("example", repository="example/example")
Repositories can be refreshed using datasets.refresh_repositories().
Implementation details#
By default, repositories installed with install_repositoriy should point to a
JSON-file, which describes the available datasets and the location where wildboar
can download them. Below, we describe the format of the JSON-file:
{
"name": "example", // required
"version": "1.0", // required
"wildboar_requires": "1.0.4", // required, the minimum required wildboar version
"bundle_url": "https://example.org/download/{key}/{tag}-v{version}", // required, the data endpoint
"bundles": [ // required
{
"key": "example", // required, unique key of the bundle
"version": "1.0", // required, the default version of dataset
"tag": "default" // optional, the default tag
"name": "UCR Time series repository", // required
"description": "Example dataset", // optional
"arrays": ["x", "y"] // optional
"collections": {"key": ["example1", "example"]} // optional
},
]
}
The attributes
{key},{version}and{tag}in thebundle_urlare replaced with the bundle-key, bundle-version and bundle tag from the repository string. All attribute are required in the url.The
arraysattribute is optional. However, if it is not present, the dataset is assumed to be a single numpy array, where the last column contains the class label or a numpy-dict with bothx, andykeys.if any other value except
xand/oryis present in thearrays-list, it will be loaded as anextras-dictionary and only returned if requested by the user.if
yis not present in arraysload_datasetreturnNonefory
The
bundles/versionattribute, is the default version of the bundle which is used unless the user specifies an alternative version in the repository string.The
bundles/tagattribute is the default tag of the bundle which is used unless the user specifies an alternative bundle. If not specified, the tag isdefault.The
bundles/collectionsattribute is a dictionary of named collections of datasets which can be specified when usingload_datasets(..., collection="key")
The bundle_url points to a remote location that for each bundle key, contains
two files with extensions .zip and .sha respectively.
In the example, bundle_url should contain the two files example/default-v1.0.zip
and example/default-v1.0.sha The .sha-file should contain the sha1 hash of the
.zip-file to ensure the integrity of the downloaded file. The zip-file should
contain the datasets.
By default, wildboar supports dataset bundles formatted as zip-files containing
npy or npz-files, as created by numpy.save and numpy.savez.
The datasets in the zip-file must be named according to {dataset_name}(_TRAIN|_TEST)?.(npy|npz).
That is, the dataset name (as specified when using load_dataset) and optionally
_TRAIN or _TEST followed by the extension npy or npz. If there are multiple
datasets with the same name but different training or testing tags, they will be merged.
As such, if both _TRAIN and _TEST files are present for the same name, load_dataset
can return these train and test samples separately by setting merge_train_test=False.
from wildboar.datasets import load_dataset
x_train, x_test, y_train, y_test = load_dataset(
'GunPoint', repository='wildboar/ucr', merge_train_test=False
)