DiskDataset#
- class bayesflow.datasets.DiskDataset(root: PathLike, *, pattern: str = '*.pkl', batch_size: int, load_fn: Callable = None, adapter: Adapter | None, stage: str = 'training', augmentations: Callable | Mapping[str, Callable] | Sequence[Callable] = None, shuffle: bool = True, **kwargs)[source]#
Bases: PyDataset
A dataset that loads pre-simulated files from disk for offline training.
By default, the expected file structure is as follows:

root
├── …
├── sample_1.[ext]
├── …
└── sample_n.[ext]
where each file contains a complete sample (e.g., a dictionary of numpy arrays) or is converted into a complete sample using a custom loader function.
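For illustration, a minimal sketch of how such a directory might be produced. This assumes a hypothetical simulator output consisting of a dictionary of numpy arrays; the directory name `simulations` and the keys `parameters` and `observables` are placeholders, not part of the API:

```python
import pickle
from pathlib import Path

import numpy as np

# Hypothetical: write pre-simulated samples to disk, one complete sample
# (a dict of numpy arrays) per pickle file, matching the default "*.pkl"
# pattern. Directory name and sample keys are placeholders.
root = Path("simulations")
root.mkdir(exist_ok=True)

for i in range(128):
    sample = {
        "parameters": np.random.normal(size=(2,)),
        "observables": np.random.normal(size=(10,)),
    }
    with open(root / f"sample_{i}.pkl", "wb") as f:
        pickle.dump(sample, f)
```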
Initialize a DiskDataset instance for offline training using a set of simulations that do not fit in memory.
- Parameters:
- root : os.PathLike
Root directory containing the sample files.
- pattern : str, default="*.pkl"
Glob pattern to match sample files.
- batch_size : int
Number of samples per batch.
- load_fn : Callable, optional
Function to load a single file into a sample. Defaults to pickle_load.
- adapter : Adapter or None
Optional adapter to transform the loaded batch.
- stage : str, default="training"
Current stage (e.g., "training", "validation", etc.) used by the adapter.
- augmentations : Callable or Mapping[str, Callable] or Sequence[Callable], optional
A single augmentation function, a dictionary of augmentation functions, or a sequence of augmentation functions to apply to each batch; see the usage sketch after this list.
If you provide a dictionary of functions, each function should accept the corresponding element of the batch and return its transformed version.
Otherwise, each function should accept the entire batch dictionary and return a dictionary.
Note: augmentations are applied before the adapter is called and are generally transforms that you only want to apply during training.
- shuffle : bool, optional
Whether to shuffle the dataset at initialization and at the end of each epoch. Default is True.
- **kwargs
Additional keyword arguments passed to the base PyDataset.
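A minimal usage sketch, building on the directory written in the example above. The augmentation function, its noise scale, and the key names are illustrative assumptions, not part of the API:

```python
import numpy as np

from bayesflow.datasets import DiskDataset

# Hypothetical per-key augmentation: receives the "observables" element of
# each batch and returns a transformed version. It runs before the adapter
# and is typically something you only want during training.
def add_noise(observables):
    return observables + np.random.normal(scale=0.1, size=observables.shape)

dataset = DiskDataset(
    root="simulations",       # directory from the sketch above
    pattern="*.pkl",          # the default, shown here for clarity
    batch_size=16,
    adapter=None,             # or a configured bayesflow Adapter
    augmentations={"observables": add_noise},
)

batch = dataset[0]  # one batch: a dictionary of stacked arrays
```

For files in other formats, a matching pattern and a custom load_fn can be supplied together, assuming load_fn maps a file path to a complete sample, e.g. `pattern="*.npz"` with `load_fn=lambda path: dict(np.load(path))`.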
- property num_batches#
Number of batches in the PyDataset.
- Returns:
The number of batches in the PyDataset, or None to indicate that the dataset is infinite.
- property max_queue_size#
- on_epoch_begin()#
Method called at the beginning of every epoch.
- property use_multiprocessing#
- property workers#