10. Using Datasets in BayesFlow#

This notebook explains where datasets enter a BayesFlow workflow and when to use the high-level workflow.fit_* methods versus explicit dataset objects.

It assumes that you already know the BasicWorkflow pattern. The focus here is the data-loading layer:

  1. Simulator or pre-simulated data

  • The source of raw data, generated on the fly or loaded from existing simulations.

  2. Dataset

  • OnlineDataset

  • OfflineDataset

  • DiskDataset

  3. Adapter

  • Maps raw data keys to the BayesFlow training keys expected by the model.

  4. Training

  • approximator.fit(...)

The central point is:

In the high-level workflow API, the dataset is usually created for you. The fit_* method determines which dataset class is used under the hood.

10.1. Pick Your Situation#

Use this table as a shortcut. You should not need to read the full notebook to decide which path fits your case.

| Situation | Use this first | Dataset used | Practical reason |
| --- | --- | --- | --- |
| You have a simulator and generating new samples is cheap enough. | workflow.fit_online(...) | OnlineDataset | Low storage; fresh simulations; speed is limited by simulator cost. |
| You already have pre-simulated data and it fits comfortably in RAM. | workflow.fit_offline(data=...) | OfflineDataset | Usually fastest per batch; simplest validation; RAM-limited. |
| You already have pre-simulated data but it is too large for RAM. | workflow.fit_disk(root=..., load_fn=...) | DiskDataset | Loads batches from disk; scalable; adds disk I/O overhead. |
| Your files need custom pairing or preprocessing while loading. | workflow.fit_disk(..., load_fn=...) | DiskDataset | load_fn turns one matched file path into one complete sample dictionary. |
| You need disk-backed validation data too. | workflow.approximator.fit(dataset=train_dataset, validation_data=val_dataset, ...) | explicit DiskDatasets | Direct dataset objects are more flexible than the high-level validation path. |
| You train an ensemble. | EnsembleWorkflow or EnsembleDataset | EnsembleDataset wrapper | Ensemble datasets wrap another data source; they are not a separate storage mode. |

Rule of thumb:

cheap simulator + fresh data desired    -> fit_online
pre-simulated + fits in RAM             -> fit_offline
pre-simulated + too large for RAM       -> fit_disk
custom file layout                      -> fit_disk with load_fn
custom validation/control/debugging     -> create datasets explicitly

A simulator is not required for fit_offline(...) or fit_disk(...). It is only required when BayesFlow has to generate data itself, for example in fit_online(...) or when you pass an integer such as validation_data=300.
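For instance, the following sketch (assuming a workflow with a simulator, as in the recap in the next section) asks BayesFlow to simulate the validation data itself:

history = workflow.fit_online(
    epochs=2,
    batch_size=32,
    num_batches_per_epoch=200,
    validation_data=300,  # integer: BayesFlow simulates 300 validation samples
)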

10.2. Imports#

This notebook uses a tiny toy example. The data are image-shaped so that the loader resembles common real workflows, but the scientific model itself is deliberately trivial.

from pathlib import Path
import shutil

import numpy as np
import bayesflow as bf
import keras

rng = np.random.default_rng(2026)

10.3. Minimal BasicWorkflow Recap#

A typical high-level BayesFlow workflow looks like this:

workflow = bf.BasicWorkflow(
    inference_network=bf.networks.FlowMatching(),
    inference_variables=["parameters"],
    inference_conditions=["observables"],
    simulator=bf.simulators.SIR(),
)

history = workflow.fit_online(
    epochs=2,
    batch_size=32,
    num_batches_per_epoch=200,
)

The dataset object is not shown, but it is still there. Because this call uses fit_online, BayesFlow constructs an OnlineDataset internally.

The same logic applies to the other high-level fit methods:

fit_online  -> OnlineDataset   -> calls a simulator during training
fit_offline -> OfflineDataset  -> batches arrays already in RAM
fit_disk    -> DiskDataset     -> loads samples from files on disk

10.4. Dataset Classes in One Paragraph#

BayesFlow exposes three main dataset types for ordinary training:

  • OnlineDataset: generates simulations on the fly from a simulator.

  • OfflineDataset: batches a pre-simulated dictionary of arrays already loaded into memory.

  • DiskDataset: loads pre-simulated samples from files on disk.

For completeness, BayesFlow also exposes ensemble dataset classes: EnsembleDataset, EnsembleIndexedDataset, and EnsembleOnlineDataset. These are wrappers or specialized implementations for ensemble training. They are useful once the basic source of data is already clear, but they are not a fourth storage mode.

Two implementation details matter for custom loading:

  1. Dataset objects are Keras-compatible PyDatasets.

  2. Augmentations are applied before the adapter, and the adapter maps raw sample keys such as "parameters" and "observations" to BayesFlow training keys such as "inference_variables" and "summary_variables".
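As an illustration of point 2, a minimal adapter could be chained like this. This is only a sketch: the convert_dtype and rename operations are assumed from the BayesFlow adapters API, and BasicWorkflow normally builds an equivalent adapter for you from its constructor arguments.

adapter = (
    bf.Adapter()
    .convert_dtype("float64", "float32")          # assumed op: cast arrays
    .rename("parameters", "inference_variables")  # raw key -> training key
    .rename("observations", "summary_variables")  # raw key -> training key
)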

10.5. Worked Example: Separate Parameter and Observation Folders#

A common disk layout stores parameters and observations separately:

toy_sbi_dataset/
├── train/
│   ├── parameters/
│   │   ├── sample_000000.npy
│   │   ├── sample_000001.npy
│   │   └── ...
│   └── observations/
│       ├── sample_000000.npy
│       ├── sample_000001.npy
│       └── ...
├── val/
│   ├── parameters/
│   └── observations/
└── test/
    ├── parameters/
    └── observations/

In this example, DiskDataset will match files in the observations/ folder. The load_fn then uses the observation file name to find the corresponding parameter file.

def make_toy_sample(rng):
    # Create one toy parameter-observation pair.
    # parameters: shape (2,)
    # observations: shape (28, 28, 1)
    # The observation is image-like, but this is only a didactic example.
    parameters = rng.normal(size=(2,)).astype("float32")

    grid_x, grid_y = np.meshgrid(
        np.linspace(-1.0, 1.0, 28, dtype="float32"),
        np.linspace(-1.0, 1.0, 28, dtype="float32"),
        indexing="ij",
    )

    observations = (
        parameters[0] * grid_x
        + parameters[1] * grid_y
        + 0.10 * rng.normal(size=(28, 28)).astype("float32")
    )
    observations = observations[..., None].astype("float32")

    return parameters, observations


def write_split(root, split, num_samples, rng):
    parameter_dir = root / split / "parameters"
    observation_dir = root / split / "observations"
    parameter_dir.mkdir(parents=True, exist_ok=True)
    observation_dir.mkdir(parents=True, exist_ok=True)

    for i in range(num_samples):
        parameters, observations = make_toy_sample(rng)
        filename = f"sample_{i:06d}.npy"
        np.save(parameter_dir / filename, parameters)
        np.save(observation_dir / filename, observations)


root = Path("data/toy_sbi_dataset")

# Keep the cell re-runnable.
if root.exists():
    shutil.rmtree(root)

write_split(root, "train", num_samples=32, rng=rng)
write_split(root, "val", num_samples=4, rng=rng)
write_split(root, "test", num_samples=4, rng=rng)

print(root)

10.6. Write the Loader#

For DiskDataset, the loader receives one path matched by root and pattern. It must return one complete sample dictionary.

Here, the matched path points to an observation file:

.../train/observations/sample_000123.npy

The loader then finds the paired parameter file:

.../train/parameters/sample_000123.npy

def load_parameter_observation_pair(observation_path):
    observation_path = Path(observation_path)

    parameter_path = (
        observation_path.parent.parent
        / "parameters"
        / observation_path.name
    )

    return {
        "parameters": np.load(parameter_path).astype("float32"),
        "observations": np.load(observation_path).astype("float32"),
    }

Before training, inspect the loader on a single file. This checks the raw sample, before batching and before the adapter is applied.

example_path = sorted((root / "train" / "observations").glob("*.npy"))[0]
example = load_parameter_observation_pair(example_path)

for key, value in example.items():
    print(f"{key:12s}", value.shape, value.dtype)
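For this toy example, the printout should show parameters with shape (2,) and observations with shape (28, 28, 1), both float32, matching make_toy_sample above.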

10.7. Define a Small Workflow#

We use a small coupling flow for the two-dimensional parameter vector and a small convolutional summary network for the image-shaped observation.

def make_workflow():
    return bf.BasicWorkflow(
        inference_network=bf.networks.CouplingFlow(depth=2),
        summary_network=bf.networks.ConvolutionalNetwork(
            summary_dim=8,
            widths=(8, 16),
            blocks_per_stage=1,
        ),
        inference_variables=["parameters"],
        summary_variables=["observations"],
    )


workflow = make_workflow()

10.8. High-Level Path: workflow.fit_disk(...)#

This is the recommended user-facing path for disk-backed training. You provide root, pattern, and load_fn; the workflow creates the DiskDataset internally.

For validation at this high level, use an in-memory dictionary. This keeps the example simple and matches the current high-level workflow behavior.

def load_split_into_memory(root, split):
    observation_paths = sorted((root / split / "observations").glob("*.npy"))
    samples = [load_parameter_observation_pair(path) for path in observation_paths]

    return {
        "parameters": np.stack([sample["parameters"] for sample in samples]),
        "observations": np.stack([sample["observations"] for sample in samples]),
    }


validation_data = load_split_into_memory(root, "val")

for key, value in validation_data.items():
    print(f"{key:12s}", value.shape, value.dtype)

history = workflow.fit_disk(
    root=root / "train" / "observations",
    pattern="*.npy",
    load_fn=load_parameter_observation_pair,
    batch_size=8,
    epochs=2,
    validation_data=validation_data,
)

The data path in the previous cell is:

observation files on disk
    -> load_fn(path) returns {"parameters": ..., "observations": ...}
    -> DiskDataset stacks samples into batches
    -> workflow adapter maps raw keys to BayesFlow keys
    -> approximator is trained

Hint: It is sometimes more practical, and involves less loading overhead, to store each sample as a pickled Python dict containing all necessary data and metadata.
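A sketch of such a loader, assuming each file stores one pickled dict that already uses the raw sample keys:

import pickle

def load_pickled_sample(path):
    # Hypothetical loader: one pickled dict per file, already containing
    # the raw sample keys plus any metadata you want to carry along,
    # e.g. {"parameters": ..., "observations": ..., "meta": ...}.
    with open(path, "rb") as f:
        return pickle.load(f)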

10.9. Compare Directly With fit_offline(...)#

If the same data fit comfortably in RAM, the offline path is simpler and usually faster per batch because no file I/O happens during training.

The trade-off is memory: fit_offline(...) needs the full dictionary of arrays in memory.

def total_size_mb(data):
    return sum(array.nbytes for array in data.values()) / 1024**2


train_data = load_split_into_memory(root, "train")
validation_data = load_split_into_memory(root, "val")

print(f"train_data size: {total_size_mb(train_data):.2f} MB")
print(f"val_data size:   {total_size_mb(validation_data):.2f} MB")

offline_workflow = make_workflow()

history_offline = offline_workflow.fit_offline(
    data=train_data,
    validation_data=validation_data,
    batch_size=32,
    epochs=2
)

For this tiny example, fit_offline(...) is the more convenient choice. The disk version becomes useful when the dataset is too large for RAM, when the data are already stored as many files, or when loading requires custom pairing/preprocessing.

10.10. Low-Level Path: Create DiskDataset Explicitly#

Create dataset objects yourself when you need more control. Common cases:

  • inspect adapted batches;

  • use disk-backed validation data;

  • customize dataset construction beyond the high-level fit_* methods;

  • wrap the dataset for ensemble training.

The next cell creates a training and validation DiskDataset explicitly. Notice that we pass the workflow adapter to the dataset. This means the dataset returns BayesFlow-ready batch keys.

low_level_workflow = make_workflow()

train_dataset = bf.datasets.DiskDataset(
    root=root / "train" / "observations",
    pattern="*.npy",
    load_fn=load_parameter_observation_pair,
    batch_size=8,
    adapter=low_level_workflow.adapter,
)

val_dataset = bf.datasets.DiskDataset(
    root=root / "val" / "observations",
    pattern="*.npy",
    load_fn=load_parameter_observation_pair,
    batch_size=8,
    adapter=low_level_workflow.adapter,
    shuffle=False,
)

Inspect one adapted batch. This is not a training loop, just a sanity check for keys, shapes, and dtypes.

batch = train_dataset[0]

for key, value in batch.items():
    print(f"{key:20s}", value.shape, value.dtype)

Now train the approximator directly. This still uses the workflow components, but skips workflow.fit_disk(...).

low_level_workflow.approximator.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=5e-4, weight_decay=1e-4)
)

history_low_level = low_level_workflow.approximator.fit(
    dataset=train_dataset,
    validation_data=val_dataset,
    epochs=2,
)
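The same explicit pattern works for in-memory data. Here is a hedged sketch of the OfflineDataset equivalent; the argument names data, batch_size, and adapter are assumptions, so verify them against the OfflineDataset API linked below:

# Hedged sketch: explicit OfflineDataset construction for in-memory data.
# Argument names are assumptions; check the bf.datasets API reference.
offline_dataset = bf.datasets.OfflineDataset(
    data=train_data,                     # dict of stacked arrays from above
    batch_size=32,
    adapter=low_level_workflow.adapter,  # same adapter as the disk datasets
)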

10.11. On-the-Fly Simulation With OnlineDataset#

Use online training when the simulator is available and generating new simulations is cheap enough.

workflow = bf.BasicWorkflow(
    inference_network=bf.networks.CouplingFlow(depth=2),
    inference_variables=["parameters"],
    inference_conditions=["observables"],
    simulator=bf.simulators.SIR()
)

history = workflow.fit_online(
    epochs=3,
    batch_size=32,
    num_batches_per_epoch=100
)

This creates an OnlineDataset internally. The important trade-off is that online training avoids storing a full dataset, but each batch now pays the simulator cost.
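If you need the same low-level control for online training, you can construct the dataset explicitly, mirroring the DiskDataset example above. This is a hedged sketch: the argument names simulator, batch_size, num_batches, and adapter are assumptions, so verify them against the datasets API linked below.

# Hedged sketch: explicit OnlineDataset construction. Argument names are
# assumptions; check the bf.datasets API reference before relying on them.
online_dataset = bf.datasets.OnlineDataset(
    simulator=bf.simulators.SIR(),
    batch_size=32,
    num_batches=100,  # batches simulated per epoch
    adapter=workflow.adapter,
)

history = workflow.approximator.fit(dataset=online_dataset, epochs=3)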

10.12. Summary#

  • Use fit_online(...) when the simulator is cheap and you want fresh simulations.

  • Use fit_offline(...) when pre-simulated data fit in RAM.

  • Use fit_disk(...) when pre-simulated data are large or stored as files.

  • Use an explicit DiskDataset when you need direct control, inspection, or disk-backed validation.

  • Treat ensemble datasets as wrappers around the basic dataset sources.

References:

  • BayesFlow datasets API: https://bayesflow.org/v2.0.10/api/bayesflow.datasets.html

  • BasicWorkflow API: https://bayesflow.org/v2.0.10/api/bayesflow.workflows.BasicWorkflow.html

  • DiskDataset API: https://bayesflow.org/v2.0.10/api/bayesflow.datasets.DiskDataset.html

  • OfflineDataset API: https://bayesflow.org/v2.0.10/api/bayesflow.datasets.OfflineDataset.html