***************************
:py:mod:`wildboar.datasets`
***************************

.. py:module:: wildboar.datasets

.. autoapi-nested-parse::

   Dataset loading utilities.

   See the dataset section in the :ref:`User Guide ` for more details and
   examples.

   .. rubric:: Examples

   >>> from wildboar.datasets import load_dataset
   >>> X, y = load_dataset("GunPoint")
   >>> X.shape
   (200, 150)

Submodules
==========

.. toctree::
   :titlesonly:
   :maxdepth: 1

   outlier/index.rst
   preprocess/index.rst

Package Contents
----------------

Classes
-------

.. autoapisummary::

   wildboar.datasets.Bundle
   wildboar.datasets.JSONRepository
   wildboar.datasets.NpBundle
   wildboar.datasets.Repository

Functions
---------

.. autoapisummary::

   wildboar.datasets.clear_cache
   wildboar.datasets.get_bundles
   wildboar.datasets.get_repository
   wildboar.datasets.install_repository
   wildboar.datasets.list_bundles
   wildboar.datasets.list_collections
   wildboar.datasets.list_datasets
   wildboar.datasets.list_repositories
   wildboar.datasets.load_dataset
   wildboar.datasets.load_datasets
   wildboar.datasets.load_gun_point
   wildboar.datasets.load_synthetic_control
   wildboar.datasets.load_two_lead_ecg
   wildboar.datasets.refresh_repositories
   wildboar.datasets.set_cache_dir

.. py:class:: Bundle(*, key, version, name, tag=None, arrays=None, description=None, collections=None)

   Base class for handling dataset bundles.

   :Parameters:

      **key** : str
         A unique key of the bundle.

      **version** : str
         The version of the bundle.

      **name** : str
         Human-readable name of the bundle.

      **tag** : str, optional
         A bundle tag.

      **arrays** : list
         The arrays of the dataset.

      **description** : str
         Description of the bundle.

      **collections** : dict, optional
         A mapping of collection names to dataset names.

   .. py:method:: get_collection(collection)

      Get a dataset collection.

      :Parameters:

         **collection** : str, optional
            The name of the collection.

      :Returns:

         list
            List of datasets in the collection.

   .. py:method:: get_filename(version=None, tag=None, ext=None)

      Get the cache name of the bundle.

      :Parameters:

         **version** : str, optional
            The bundle version.

         **tag** : str, optional
            The tag.

         **ext** : str, optional
            The extension of the file.

      :Returns:

         str
            The filename.

   .. py:method:: list(archive, collection=None)

      List all datasets in this bundle.

      :Parameters:

         **archive** : ZipFile
            The bundle file.

         **collection** : str, optional
            The collection name.

      :Returns:

         list
            A sorted list of datasets in the bundle.

   .. py:method:: load(name, archive)

      Load a dataset from the bundle.

      :Parameters:

         **name** : str
            Name of the dataset.

         **archive** : ZipFile
            The zip-file bundle.

      :Returns:

         **x** : ndarray
            Data samples.

         **y** : ndarray
            Data labels.

         **n_training_samples** : int
            The number of training samples; the value is at most
            ``x.shape[0]``.

         **extras** : dict, optional
            Extra numpy arrays.
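   .. rubric:: Examples

   Bundles are normally obtained from a repository rather than constructed by
   hand. A minimal usage sketch, assuming the constructor parameters (for
   example ``name`` and ``version``) are exposed as attributes:

   >>> from wildboar.datasets import get_repository
   >>> bundle = get_repository("wildboar").get_bundle("ucr-tiny")  # doctest: +SKIP
   >>> print(bundle.name, bundle.version)  # doctest: +SKIP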
.. py:class:: JSONRepository(url)

   A repository of bundles described by a JSON file at the given URL.

   .. py:method:: get_bundle(key)

      Get a bundle with the specified key.

      :Parameters:

         **key** : str
            Key of the bundle.

      :Returns:

         Bundle, optional
            A bundle or None.

   .. py:method:: get_bundles()

      Get all bundles.

      :Returns:

         dict
            A dictionary of key and bundle.

   .. py:method:: refresh(timeout=None)

      Refresh the repository.

   .. py:property:: download_url

      The URL template for downloading bundles.

      :Returns:

         str
            The download URL.

   .. py:property:: name

      Name of the repository.

      :Returns:

         str
            The name of the repository.

   .. py:property:: version

      The repository version.

      :Returns:

         str
            The version of the repository.

   .. py:property:: wildboar_requires

      The minimum required wildboar version.

      :Returns:

         str
            The minimum version.

.. py:class:: NpBundle(*, key, version, name, tag=None, arrays=None, description=None, collections=None)

   Bundle of numpy binary files.

   .. py:method:: get_collection(collection)

      Get a dataset collection.

      :Parameters:

         **collection** : str, optional
            The name of the collection.

      :Returns:

         list
            List of datasets in the collection.

   .. py:method:: get_filename(version=None, tag=None, ext=None)

      Get the cache name of the bundle.

      :Parameters:

         **version** : str, optional
            The bundle version.

         **tag** : str, optional
            The tag.

         **ext** : str, optional
            The extension of the file.

      :Returns:

         str
            The filename.

   .. py:method:: list(archive, collection=None)

      List all datasets in this bundle.

      :Parameters:

         **archive** : ZipFile
            The bundle file.

         **collection** : str, optional
            The collection name.

      :Returns:

         list
            A sorted list of datasets in the bundle.

   .. py:method:: load(name, archive)

      Load a dataset from the bundle.

      :Parameters:

         **name** : str
            Name of the dataset.

         **archive** : ZipFile
            The zip-file bundle.

      :Returns:

         **x** : ndarray
            Data samples.

         **y** : ndarray
            Data labels.

         **n_training_samples** : int
            The number of training samples; the value is at most
            ``x.shape[0]``.

         **extras** : dict, optional
            Extra numpy arrays.

.. py:class:: Repository

   A repository is a collection of bundles.

   .. py:method:: get_bundle(key)

      Get a bundle with the specified key.

      :Parameters:

         **key** : str
            Key of the bundle.

      :Returns:

         Bundle, optional
            A bundle or None.

   .. py:method:: get_bundles()
      :abstractmethod:

      Get all bundles.

      :Returns:

         dict
            A dictionary of key and bundle.

   .. py:method:: refresh(timeout=None)

      Refresh the repository.

   .. py:property:: download_url
      :abstractmethod:

      The URL template for downloading bundles.

      :Returns:

         str
            The download URL.

   .. py:property:: name
      :abstractmethod:

      Name of the repository.

      :Returns:

         str
            The name of the repository.

   .. py:property:: version
      :abstractmethod:

      The repository version.

      :Returns:

         str
            The version of the repository.

   .. py:property:: wildboar_requires
      :abstractmethod:

      The minimum required wildboar version.

      :Returns:

         str
            The minimum version.

.. py:function:: clear_cache(repository=None, *, cache_dir=None, keep_last_version=True)

   Clear the cache by deleting cached datasets.

   :Parameters:

      **repository** : str, optional
         The name of the repository whose cache to clear. If None, the cache
         of every repository is cleared.

      **cache_dir** : str, optional
         The cache directory.

      **keep_last_version** : bool, optional
         If True, keep the latest version of each repository.
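   .. rubric:: Examples

   A short usage sketch, assuming the default cache directory; this deletes
   cached datasets from disk, keeping only the latest version of each bundle:

   >>> from wildboar.datasets import clear_cache
   >>> clear_cache("wildboar")  # doctest: +SKIP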
.. py:function:: get_bundles(repository, *, refresh=False, timeout=None)

   Get all bundles in the repository.

   :Parameters:

      **repository** : str
         Name of the repository.

      **refresh** : bool, optional
         Refresh the repository.

         .. versionadded:: 1.1

      **timeout** : float, optional
         Timeout for the JSON request.

         .. versionadded:: 1.1

   :Returns:

      dict
         A dict of key and Bundle pairs.

.. py:function:: get_repository(repository)

   Get repository by name.

   :Parameters:

      **repository** : str
         Repository name.

   :Returns:

      Repository
         A repository.

.. py:function:: install_repository(repository, *, refresh=True, timeout=None, cache_dir=None)

   Install repository.

   :Parameters:

      **repository** : str or Repository
         A repository.

      **refresh** : bool, optional
         Refresh the repository.

         .. versionadded:: 1.1

      **timeout** : float, optional
         Timeout for the JSON request.

         .. versionadded:: 1.1

      **cache_dir** : str, optional
         Cache directory.

         .. versionadded:: 1.1

.. py:function:: list_bundles(repository, *, refresh=False, timeout=None)

   Get a list of all bundle names in the specified repository.

   :Parameters:

      **repository** : str
         The name of the repository.

      **refresh** : bool, optional
         Refresh the repository.

         .. versionadded:: 1.1

      **timeout** : float, optional
         Timeout for the JSON request.

         .. versionadded:: 1.1

   :Returns:

      list
         A list of bundle names.

   .. rubric:: Examples

   >>> from wildboar.datasets import list_bundles
   >>> list_bundles("wildboar")
   ['ucr', 'ucr-tiny', ...]

.. py:function:: list_collections(repository)

   List the collections of the repository.

   :Parameters:

      **repository** : str or Bundle, optional
         The data repository. If str, load a named bundle, in the format
         ``{repository}/{bundle}``.

   :Returns:

      collections
         A list of collections.

   .. rubric:: Examples

   >>> from wildboar.datasets import list_collections
   >>> list_collections("wildboar/ucr")
   ['bake-off', ...]

.. py:function:: list_datasets(repository='wildboar/ucr', *, collection=None, cache_dir=None, create_cache_dir=True, progress=True, force=False, refresh=False, timeout=None)

   List the datasets in the repository.

   :Parameters:

      **repository** : str or Bundle, optional
         The data repository. If str, load a named bundle, in the format
         ``{repository}/{bundle}``.

      **collection** : str, optional
         A collection of named datasets.

      **cache_dir** : str, optional
         The directory where downloaded files are cached
         (default='wildboar_cache').

      **create_cache_dir** : bool, optional
         Create cache directory if missing (default=True).

      **progress** : bool, optional
         Show a progress bar while downloading a bundle.

      **force** : bool, optional
         Force re-download of a cached bundle.

      **refresh** : bool, optional
         Refresh the repository.

         .. versionadded:: 1.1

      **timeout** : float, optional
         Timeout for the JSON request.

         .. versionadded:: 1.1

   :Returns:

      set
         A set of dataset names.

.. py:function:: list_repositories(*, refresh=False, timeout=None, cache_dir=None)

   List the keys of all installed repositories.

   :Parameters:

      **refresh** : bool, optional
         Refresh all repositories.

         .. versionadded:: 1.1

      **timeout** : float, optional
         Timeout for the JSON request.

         .. versionadded:: 1.1

      **cache_dir** : str, optional
         Cache directory.

         .. versionadded:: 1.1

   :Returns:

      repositories
         A list of installed repositories.

   .. rubric:: Examples

   >>> from wildboar.datasets import list_repositories
   >>> list_repositories()
   ['wildboar', ...]

   We can also refresh the repositories to load any newly added but still
   pending repositories.

   >>> list_repositories(refresh=True)
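   The repository utilities compose. A sketch that walks every installed
   repository and prints the bundles it provides, assuming the ``version``
   constructor parameter is exposed as a Bundle attribute:

   >>> from wildboar.datasets import get_repository
   >>> for key in list_repositories():  # doctest: +SKIP
   ...     repository = get_repository(key)
   ...     for bundle_key, bundle in repository.get_bundles().items():
   ...         print(key, bundle_key, bundle.version)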
.. py:function:: load_dataset(name, *, repository='wildboar/ucr', dtype=float, preprocess=None, contiguous=True, merge_train_test=True, cache_dir=None, create_cache_dir=True, progress=True, return_extras=False, force=False, refresh=False, timeout=None)

   Load a dataset from a repository.

   :Parameters:

      **name** : str
         The name of the dataset to load.

      **repository** : str, optional
         The data repository formatted as
         ``{repository}/{bundle}[:{version}][:{tag}]``. Read more in the
         :ref:`User guide `.

      **dtype** : dtype, optional
         The data type of x (train and test).

      **preprocess** : str, list or callable, optional
         Preprocess the dataset.

         - if str, use the named preprocess function (see
           ``preprocess._PREPROCESS.keys()`` for valid keys).
         - if callable, a function taking a np.ndarray and returning the
           preprocessed dataset.
         - if list, a list of callables or strs.

      **contiguous** : bool, optional
         Ensure that the returned dataset is memory contiguous.

      **merge_train_test** : bool, optional
         Merge the existing training and testing partitions.

      **cache_dir** : str, optional
         The directory where downloaded files are cached.

      **create_cache_dir** : bool, optional
         Create cache directory if missing (default=True).

      **progress** : bool, optional
         Show a progress bar while downloading a bundle.

      **return_extras** : bool, optional
         Return optional extras.

         .. versionadded:: 1.1

      **force** : bool, optional
         Force re-download of an already cached bundle.

         .. versionadded:: 1.0.4

      **refresh** : bool, optional
         Refresh the repository.

         .. versionadded:: 1.1

      **timeout** : float, optional
         Timeout for the JSON request.

         .. versionadded:: 1.1

   :Returns:

      **x** : ndarray, optional
         The samples if ``merge_train_test=True``.

      **y** : ndarray, optional
         The labels if ``merge_train_test=True``.

      **x_train** : ndarray, optional
         The training samples if ``merge_train_test=False``.

      **x_test** : ndarray, optional
         The testing samples if ``merge_train_test=False``.

      **y_train** : ndarray, optional
         The training labels if ``merge_train_test=False``.

      **y_test** : ndarray, optional
         The testing labels if ``merge_train_test=False``.

      **extras** : dict, optional
         The optional extras if ``return_extras=True``.

   .. rubric:: Examples

   Load a dataset from the default repository:

   >>> x, y = load_dataset("SyntheticControl")
   >>> x.shape
   (600, 60)

   or, if the original training and testing splits are to be preserved:

   >>> x_train, x_test, y_train, y_test = load_dataset(
   ...     "SyntheticControl", merge_train_test=False
   ... )

   or, for a specific version of the dataset:

   >>> x_train, x_test, y_train, y_test = load_dataset(
   ...     "SyntheticControl",
   ...     repository='wildboar/ucr-tiny:1.0.1',
   ...     merge_train_test=False,
   ... )
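   The ``preprocess`` parameter also accepts a callable. A sketch that
   mean-centers each time series before it is returned; the lambda is an
   illustrative stand-in for any function mapping an ndarray to an ndarray:

   >>> import numpy as np
   >>> x, y = load_dataset(
   ...     "SyntheticControl",
   ...     preprocess=lambda x: x - x.mean(axis=-1, keepdims=True),
   ... )
   >>> bool(np.allclose(x.mean(axis=-1), 0))
   True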
.. py:function:: load_datasets(repository='wildboar/ucr', *, collection=None, cache_dir=None, create_cache_dir=True, progress=True, force=False, filter=None, **kwargs)

   Load all datasets as a generator.

   :Parameters:

      **repository** : str
         The repository string.

      **collection** : str, optional
         A collection of named datasets.

      **cache_dir** : str, optional
         The cache directory for downloaded dataset repositories.

      **create_cache_dir** : bool, optional
         Create the cache directory if it does not exist.

      **progress** : bool, optional
         Show a progress indicator while downloading the repository.

      **force** : bool, optional
         Force re-download of the cached repository.

      **filter** : str, dict, list or callable, optional
         Filter the datasets.

         - if callable, only yield those datasets for which the callable
           returns ``True``, given the signature ``f(dataset, x, y) -> bool``.
         - if dict, filter based on the keys and values, where keys are
           attributes and values are comparison specifications.
         - if list, filter based on a conjunction of attribute comparisons.
         - if str, filter based on an attribute comparison.

         Read more in the :ref:`User guide `.

         .. warning::

            If the parameter ``merge_train_test`` is ``False``, the filter is
            applied to the *training* part of the data.

      **kwargs** : dict
         Optional arguments to :func:`~wildboar.datasets.load_dataset`.

   :Yields:

      **name** : str
         The dataset name.

      **dataset** : list
         Depends on the ``kwargs``.

         - If ``merge_train_test=True`` (default), dataset is a tuple of
           ``(x, y)``.
         - If ``merge_train_test=False``, dataset is a tuple of
           ``(x_train, x_test, y_train, y_test)``.
         - If ``return_extras=True``, the last element of the tuple contains
           the optional extras (or ``None``).

   .. rubric:: Examples

   Load all datasets in a repository:

   >>> for dataset, (x, y) in load_datasets(repository='wildboar/ucr-tiny'):
   ...     print(dataset, x.shape, y.shape)
   ...
   Beef (60, 470) (60,)
   Coffee (56, 286) (56,)
   GunPoint (200, 150) (200,)
   SyntheticControl (600, 60) (600,)
   TwoLeadECG (1162, 82) (1162,)

   Print the names of datasets with more than 200 samples:

   >>> for dataset, (x, y) in load_datasets(
   ...     repository='wildboar/ucr-tiny', filter={"n_samples": ">200"}
   ... ):
   ...     print(dataset)
   SyntheticControl
   TwoLeadECG

   >>> for dataset, (x, y) in load_datasets(
   ...     repository='wildboar/ucr-tiny', filter="n_samples>200"
   ... ):
   ...     print(dataset)
   SyntheticControl
   TwoLeadECG

.. py:function:: load_gun_point(merge_train_test=True)

   Load the GunPoint dataset.

   :Parameters:

      **merge_train_test** : bool, optional
         Merge the existing training and testing partitions.

   :Returns:

      **x** : ndarray, optional
         The samples if ``merge_train_test=True``.

      **y** : ndarray, optional
         The labels if ``merge_train_test=True``.

      **x_train** : ndarray, optional
         The training samples if ``merge_train_test=False``.

      **x_test** : ndarray, optional
         The testing samples if ``merge_train_test=False``.

      **y_train** : ndarray, optional
         The training labels if ``merge_train_test=False``.

      **y_test** : ndarray, optional
         The testing labels if ``merge_train_test=False``.

      **extras** : dict, optional
         The optional extras if ``return_extras=True``.

   .. seealso::

      :obj:`load_dataset`
         Load a named dataset.

.. py:function:: load_synthetic_control(merge_train_test=True)

   Load the Synthetic_Control dataset.

   :Parameters:

      **merge_train_test** : bool, optional
         Merge the existing training and testing partitions.

   :Returns:

      **x** : ndarray, optional
         The samples if ``merge_train_test=True``.

      **y** : ndarray, optional
         The labels if ``merge_train_test=True``.

      **x_train** : ndarray, optional
         The training samples if ``merge_train_test=False``.

      **x_test** : ndarray, optional
         The testing samples if ``merge_train_test=False``.

      **y_train** : ndarray, optional
         The training labels if ``merge_train_test=False``.

      **y_test** : ndarray, optional
         The testing labels if ``merge_train_test=False``.

      **extras** : dict, optional
         The optional extras if ``return_extras=True``.

   .. seealso::

      :obj:`load_dataset`
         Load a named dataset.
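   .. rubric:: Examples

   A short usage sketch; the split sizes shown are an assumption, inferred
   from the merged shape ``(600, 60)`` documented above and the standard UCR
   train/test partition:

   >>> from wildboar.datasets import load_synthetic_control
   >>> x_train, x_test, y_train, y_test = load_synthetic_control(
   ...     merge_train_test=False
   ... )
   >>> x_train.shape  # doctest: +SKIP
   (300, 60)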
.. py:function:: load_two_lead_ecg(merge_train_test=True)

   Load the TwoLeadECG dataset.

   :Parameters:

      **merge_train_test** : bool, optional
         Merge the existing training and testing partitions.

   :Returns:

      **x** : ndarray, optional
         The samples if ``merge_train_test=True``.

      **y** : ndarray, optional
         The labels if ``merge_train_test=True``.

      **x_train** : ndarray, optional
         The training samples if ``merge_train_test=False``.

      **x_test** : ndarray, optional
         The testing samples if ``merge_train_test=False``.

      **y_train** : ndarray, optional
         The training labels if ``merge_train_test=False``.

      **y_test** : ndarray, optional
         The testing labels if ``merge_train_test=False``.

      **extras** : dict, optional
         The optional extras if ``return_extras=True``.

   .. seealso::

      :obj:`load_dataset`
         Load a named dataset.

.. py:function:: refresh_repositories(repository=None, *, timeout=None, cache_dir=None)

   Refresh the installed repositories.

   :Parameters:

      **repository** : str, optional
         The repository to refresh. If None, refresh all installed
         repositories.

      **timeout** : float, optional
         Timeout for the request.

         .. versionadded:: 1.1

      **cache_dir** : str, optional
         Cache directory.

         .. versionadded:: 1.1

.. py:function:: set_cache_dir(cache_dir=None)

   Change the global cache directory.

   If called without arguments, the cache directory is reset to the default
   directory.

   :Parameters:

      **cache_dir** : str, optional
         The cache directory root.
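   .. rubric:: Examples

   A minimal sketch, assuming a writable path of your choosing; calling the
   function without arguments restores the default directory:

   >>> from wildboar.datasets import set_cache_dir
   >>> set_cache_dir("/tmp/wildboar_cache")  # doctest: +SKIP
   >>> set_cache_dir()  # reset to the default cache directory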