DiskDataset#

class bayesflow.datasets.DiskDataset(root: PathLike, *, pattern: str = '*.pkl', batch_size: int, load_fn: Callable = None, adapter: Adapter | None, stage: str = 'training', augmentations: Callable | Mapping[str, Callable] | Sequence[Callable] = None, shuffle: bool = True, **kwargs)[source]#

Bases: PyDataset

A dataset that loads pre-simulated files from disk. The resulting training strategy is offline.

By default, the expected file structure is as follows:

    root
    ├── …
    ├── sample_1.[ext]
    ├── …
    └── sample_n.[ext]

where each file contains a complete sample (e.g., a dictionary of numpy arrays) or is converted into a complete sample using a custom loader function.
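
For illustration, a minimal sketch of how such a directory of pickled samples might be produced. The simulator and the dictionary keys ("parameters", "observables") are hypothetical, not part of the API:

    import pickle
    from pathlib import Path

    import numpy as np

    root = Path("simulations")
    root.mkdir(exist_ok=True)

    # Hypothetical simulator: each file holds one complete sample,
    # here a dictionary of numpy arrays.
    def simulate_sample(rng):
        theta = rng.normal(size=2)
        x = theta + rng.normal(scale=0.1, size=(10, 2))
        return {"parameters": theta, "observables": x}

    rng = np.random.default_rng(0)
    for i in range(1, 1001):
        with open(root / f"sample_{i}.pkl", "wb") as f:
            pickle.dump(simulate_sample(rng), f)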

Initialize a DiskDataset instance for offline training using a set of simulations that do not fit in memory.

Parameters:
root : os.PathLike

Root directory containing the sample files.

pattern : str, default="*.pkl"

Glob pattern to match sample files.

batch_size : int

Number of samples per batch.

load_fn : Callable, optional

Function to load a single file into a sample. Defaults to pickle_load.

adapter : Adapter or None

Optional adapter to transform the loaded batch.

stage : str, default="training"

Current stage (e.g., “training”, “validation”, etc.) used by the adapter.

augmentations : Callable or Mapping[str, Callable] or Sequence[Callable], optional

A single augmentation function, dictionary of augmentation functions, or sequence of augmentation functions to apply to the batch.

If you provide a dictionary of functions, each function should accept one element of your output batch and return the corresponding transformed element.

Otherwise, your function should accept the entire dictionary output and return a dictionary.

Note: augmentations are applied before the adapter is called and are generally transforms that you only want to apply during training (see the usage sketch after this parameter list).

shuffle : bool, optional

Whether to shuffle the dataset at initialization and at the end of each epoch. Default is True.

**kwargs

Additional keyword arguments passed to the base PyDataset.
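
As a usage sketch (assuming the directory layout shown above, and that load_fn receives the path of a single sample file; the .npz loader and the noise augmentation are illustrative, not part of the API):

    import numpy as np

    from bayesflow.datasets import DiskDataset

    # Default usage: *.pkl files are unpickled into samples via pickle_load.
    dataset = DiskDataset(root="simulations", batch_size=32, adapter=None)

    # Hypothetical custom loader for samples saved as .npz archives instead.
    def load_npz(path):
        with np.load(path) as data:
            return dict(data)

    # With a dictionary of augmentations, each function receives the batch
    # entry whose key matches and returns the transformed element.
    def add_noise(x):
        return x + np.random.normal(scale=0.05, size=x.shape)

    dataset = DiskDataset(
        root="simulations",
        pattern="*.npz",
        batch_size=32,
        load_fn=load_npz,
        adapter=None,
        augmentations={"observables": add_noise},
    )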

on_epoch_end()[source]#

Method called at the end of every epoch.

property num_batches#

Number of batches in the PyDataset.

Returns:

The number of batches in the PyDataset or None to indicate that the dataset is infinite.
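
For example, batches can be retrieved by index via the inherited keras PyDataset protocol (a sketch; the indexing behavior is assumed from the base class, not documented on this page):

    from bayesflow.datasets import DiskDataset

    dataset = DiskDataset(root="simulations", batch_size=32, adapter=None)

    for i in range(dataset.num_batches):
        batch = dataset[i]  # one batch, e.g. a dictionary of arrays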

shuffle()[source]#

Shuffle the order of the sample files in place.

property max_queue_size#

Maximum number of batches to keep in the queue when loading in parallel.

on_epoch_begin()#

Method called at the beginning of every epoch.

property use_multiprocessing#

Whether to use worker processes (rather than threads) for parallel loading.

property workers#

Number of workers used to load batches in parallel.