1. Quickstart: Amortized Posterior Estimation#
1.1. Table of Contents#
import numpy as np
import bayesflow as bf
2024-02-05 03:05:37.850165: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 03:05:38.303673: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-05 03:05:38.304006: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-05 03:05:38.402257: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-05 03:05:38.559381: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-05 03:05:38.563812: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-05 03:05:41.393576: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/pritom/miniconda3/envs/bayesflow/lib/python3.10/site-packages/bayesflow/trainers.py:27: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from tqdm.autonotebook import tqdm
1.2. Introduction #
Welcome to the very first tutorial on using BayesFlow for amortized posterior estimation! In this notebook, we will estimate the means of a multivariate Gaussian model and illustrate some features of the library along the way.
Here is a brief description of amortized posterior estimation:
In traditional posterior estimation, as in Bayesian inference, we seek to compute or approximate the posterior distribution of model parameters given observed data for each new data instance separately. This process can be computationally expensive, especially for complex models or large datasets, because it often involves iterative optimization or sampling methods. This step needs to be repeated for each new instance of data.
Amortized posterior estimation offers a solution to this problem. “Amortization” here refers to spreading out the computational cost over multiple instances. Instead of computing a new posterior from scratch for each data instance, amortized inference learns a function. This function is parameterized by a neural network, that directly maps observations to an approximation of the posterior distribution. This function is trained over the dataset to approximate the posterior for any new data instance efficiently. In this example, we will use a simple Gaussian model to illustrate the basic concepts of amortized posterior estimation.
Above, we have already imported the core entities that we will need for this notebook. In brief:
The module
simulation
contains high-level wrappers for gluing together priors, simulators, and context generators into a singleGenerativeModel
object, which will generate all quantities of interest for a modeling scenario.The module
networks
contains the core neural architectures used for various tasks, e.g., anInvariantNetwork
for realizing normalizing flows (https://paperswithcode.com/method/normalizing-flows) or aDeepSet
for learning permutation-invariant summary representations (embeddings).The module
amortizers
contains high-level wrappers which connect the various networks together and instruct them about their particular goals in the inference pipeline.The module
trainers
contains high-level wrappers for dictating the training phase of an amortized posterior. Typically, the standardTrainer
will take care of most scenarios.
The nuts and bolts of using BayesFlow for Bayesian parameter estimation have already been described in the corresponding papers:
Radev, S. T., Mertens, U. K., Voss, A., Ardizzone, L., Köthe, U. (2020). BayesFlow: Learning complex stochastic models with invertible neural networks. IEEE Transactions on Neural Networks and Learning Systems, 33(4), 1452-1466.
Radev, S. T., Graw, F., Chen, S., Mutters, N. T., Eichel, V. M., Bärnighausen, T., & Köthe, U. (2021). OutbreakFlow: Model-based Bayesian inference of disease outbreak dynamics with invertible neural networks and its application to the COVID-19 pandemics in Germany. PLoS computational biology, 17(10), e1009472.
Schmitt, M., Bürkner, P. C., Köthe, U., & Radev, S. T. (2021). Detecting model misspecification in amortized Bayesian inference with neural networks. arXiv preprint arXiv:2112.08866.
At a high level, our architecture consists of a summary network \(h\) and an inference network \(f\) which jointly amortize a generative model. The summary network transforms input data \(\boldsymbol{x}\) of potentially variable size to a fixed-length representations. The inference network generates random draws from an approximate posterior \(q\) via a conditional invertible neural network (cINN). This process is illustrated in the figure below:
The left panel illustrates the training phase. During this phase, only the model (i.e., simulator and prior) is used to jointly train the summary and inference networks. A parameter vector \(\theta\) is sampled from the prior distribution and then fed to the simulator. The simulator returns \(N\) counts of \(x\) vectors that serve as input to the summary network. The right panel illustrates the inference phase. During this phase, arbitrarily many actually observed data sets can be fed through the networks to obtain posteriors. For instance, in one recent paper (https://www.nature.com/articles/s41562-021-01282-7), the authors applied pre-trained networks to more than one million observed data sets! Now let’s get into some coding…
First and foremost, we set a local seed for reproducibility (best numpy
practice as of 2022). See (https://numpy.org/neps/nep-0019-rng-policy.html) for more details.
RNG = np.random.default_rng(2023)
1.3. Defining the Generative Model #
From the perspective of the BayesFlow framework, a generative model is more than just a prior (encoding beliefs about the parameters before observing data) and a simulator (a likelihood function, often implicit, that generates data given parameters). In addition, it consists of various implicit context assumptions, which we can make explicit at any time. Furthermore, we can also amortize over these context variables, thus making our real-world inference more flexible (i.e., applicable to more contexts). We are leveraging the concept of amortized inference and extending it to context variables as well. The figure below illustrates the skeleton of a generative model as conceptualized in the BayesFlow framework.
This conceptual model allows you to tackle very flexible model families with BayesFlow, as well as various other Bayesian tasks, such as prior sensitivity analysis or multiverse analysis.
Prior sensitivity analysis: it is a technique used in Bayesian statistics to assess how sensitive the results of a model are to the choice of the prior distribution. In Bayesian inference, the prior represents our existing knowledge or assumptions about the parameters before observing the data. However, the selection of an appropriate prior can sometimes be subjective, and different priors can lead to different posterior estimates. Prior sensitivity analysis involves systematically varying the priors and examining how these variations affect the posterior estimates.
The toy Gaussian model we will use for this tutorial takes a particularly simple form:
where \(\mathcal{N}_D\) denotes a multivariate Gaussian (normal) density with \(D\) dimensions, which we set at \(D = 4\) for the current example. For simplicity, we will also set \(\sigma_0 =1\) and \(\sigma_1 = 1\). We will now implement this model using the latest numpy interface.
1.3.1. Prior #
We first define a function generating single draws from the prior (as specified by our model formulation above), which we pass to the Prior
wrapper.
def prior_fun(D=4):
return RNG.normal(size=D)
prior = bf.simulation.Prior(prior_fun=prior_fun)
That’s it. The Prior
object is now callable with a batch_size
argument which dictates how many draws are generated from the prior. We can take a look at the outputs of the prior by doing:
prior(batch_size=10)
{'prior_draws': array([[ 0.60172129, 1.15161897, -1.35946236, 0.22205533],
[-0.77586755, 0.8087058 , -0.19862826, -1.57869386],
[-0.6292893 , -0.38775694, 0.05018619, -0.90704364],
[ 0.13213809, 1.40490249, 0.40410205, -1.03741343],
[-0.74180264, 1.26349944, -0.68932403, 0.70801477],
[-0.03504752, -0.83781649, -0.73251795, 1.04738251],
[-1.4238012 , 1.35404102, -0.39509257, -0.92921568],
[ 0.44406269, 0.14221762, 1.36874071, -0.49876242],
[-0.0742687 , 0.93937197, -0.62614275, -1.02895874],
[-2.13303915, 0.90438789, -2.08440387, 0.28253568]]),
'batchable_context': None,
'non_batchable_context': None}
Wow! The prior generated some other stuff that we never specified and packed it into a Python dict
. That definitely needs some explanation. Remember our picture above? A prior can also accept context variables which modify its behavior, whenever this is desirable. We will see this when we illustrate how to perform prior sensitivity analysis.
We also see two types of context variables. These are worth mentioning as well.
The interface distinguishes between two types of context: batchable_context
and non_batchable_context
. This distinction is a purely technical, rather then a conceptual one:
Batchable context variables differ for each simulation in each training batch of simulations;
Non-batchable context variables stay the same for each simulation in a batch, but differ across simulated batches;
Examples for batchable context variables include experimental design variables, design matrices, etc. Examples for non-batchable context variables include the number of observations in an experiment, positional encodings, time indices, etc. While the latter can also be considered batchable in principle, batching them would require non-Tensor (i.e., non-rectangular) data structures. When you have a variable that changes its structure or size across different batches (like varying the number of observations), it doesn’t fit neatly into a rectangular tensor structure. One would need to use non-tensor data structures, which are not as efficiently processed by the hardware typically used for machine learning computations (like GPUs).
1.3.2. Simulator #
In this case, our simulator function is equally simple to our prior function. We will call it a likelihood function, in correspondence with standard Bayesian terminology, and pass it to the Simulator
wrapper.
def likelihood_fun(params, n_obs=50):
return RNG.normal(loc=params, size=(n_obs, params.shape[0]))
simulator = bf.simulation.Simulator(simulator_fun=likelihood_fun)
Note, that we define our simulator function or likelihood_fun
with two arguments. A positional argument, params
which stands for a single random draw from the prior and a keyword argument n_obs
which represents the number of observations \(N\) that we will generate from the likelihood for each draw from the prior. As some point, we want to vary \(N\) during training, so that the architecture can generalize to different values of \(N\) during inference.
1.3.3. Generative Model #
We will now connect the prior with the likelihood (simulator) via the GenerativeModel
interface:
model = bf.simulation.GenerativeModel(prior=prior, simulator=simulator)
INFO:root:Performing 2 pilot runs with the anonymous model...
INFO:root:Shape of parameter batch after 2 pilot simulations: (batch_size = 2, 4)
INFO:root:Shape of simulation batch after 2 pilot simulations: (batch_size = 2, 50, 4)
INFO:root:No optional prior non-batchable context provided.
INFO:root:No optional prior batchable context provided.
INFO:root:No optional simulation non-batchable context provided.
INFO:root:No optional simulation batchable context provided.
The generative model will also provide an internal consistency check and report on the tensor shapes of the different quantities output by the model. We can also manually inspect its outputs for batch_size = 3
(i.e., three simulations):
out = model(batch_size=3)
print(list(out.keys()))
['prior_non_batchable_context', 'prior_batchable_context', 'prior_draws', 'sim_non_batchable_context', 'sim_batchable_context', 'sim_data']
print("Shape of sim_data: ", out["sim_data"].shape)
Shape of sim_data: (3, 50, 4)
The output of the GenerativeModel
is also a Python dict
with even more keys than before. You should probably have an intuition what these keys represent, namely, the different types of context variables (none in this case) for prior and simulator. With this simple set-up, we can now proceed to do some posterior estimation.
1.4. Defining the Neural Approximator #
1.4.1. Summary Network #
Since our likelihood generates data exchangeably, we need to respect the permutation invariance of the data. Exchangeability in data means that the probability distribution of a sequence of observations remains the same regardless of the order in which the observations appear. In other words, the data is permutation invariant. For that, we will use a DeepSet
which does exactly that. This network will take (at least) 3D tensors of shape (batch_size, n_obs, D)
and reduce them to 2D tensors of shape (batch_size, summary_dim)
, where summary_dim
is a hyperparameter to be set by the user (you). Heuristically, this number should not be lower than the number of parameters in a model. Below, we create a permutation-invariant network with summary_dim = 10
:
summary_net = bf.networks.DeepSet(summary_dim=10)
Note, that the hyperparameter setting for the InvariantNetwork
are all provided inside a single Python dictionary. Here, InvariantNetwork
is used to learn a summary representation of data that is invariant to the permutation of its elements. It helps to inspect the outputs of the summary network manually and confirm its operation:
test_inp = model(batch_size=3)
summary_rep = summary_net(test_inp["sim_data"]).numpy()
print("Shape of simulated data sets: ", test_inp["sim_data"].shape)
print("Shape of summary vectors: ", summary_rep.shape)
Shape of simulated data sets: (3, 50, 4)
Shape of summary vectors: (3, 10)
It is these summary vectors that will enter as conditions for the inference network. Upon convergence of the simulation-based training, we can think of them as learned summary statistics or data embeddings.
1.4.2. Inference Network #
Next, we define the main workhorse of our our framework for amortized posterior inference - the conditional invertible neural network (cINN). The only mandatory hyperparameter for the InvertibleNetwork
is the number of parameters we aim to estimate, in our case num_params = 4
. However, we can change some more, for instance set the number of coupling layers num_coupling_layers = 4
, which will make training a bit faster than using the default num_coupling_layers = 6
, but also reduce the expressiveness (performance) of our network. Naturally, we don’t need a lot of expressiveness for our trivial Gaussian model, so we can proceed with num_coupling_layers = 4
.
The invertible inference network has the following further hyperparameters:
num_params
(mandatory) - The number of model parameters (eq. the dimensionality of the latent space).num_coupling_layers
- The number of invertible layers. The more layers, the more powerful the network, but the slower and possibly less stable the training. Typically \(6 - 10\) coupling layers should be sufficient.coupling_settings
- The settings for the internal coupling layers. Typically, the defaults work well. For online training, you should switch off the default regularization as it may prevent optimal performance.coupling_design
- You could try setting the design tospline
for more superior performance on lower-dimensional problems.soft_clamping
- The soft-clamping parameter. Just use the default value.permutation
- Whether to use permutations before each coupling layer. Should be used by default.use_act_norm
- Whether to apply activation normalization after each coupling layer (https://arxiv.org/abs/1807.03039). Works well in practice and stabilizes training.act_norm_init
- In some cases, you can perform data-dependent initialization of theActNorm
layers, as in https://arxiv.org/abs/1807.03039.use_soft_flow
- Whether to use a SoftFlow architecture (https://arxiv.org/abs/2006.04604). Useful for degenerate distributions.soft_flow_bounds
- The bounds for the varying standard deviation ofSoftFlow
’s noise. Do not touch, unless you have good reasons to.
You can glean all the defaults in the default_settings
module. For most applications, you only need to define the num_params
and num_coupling_layers
hyperparameters.
Note, that we also remove the default L2
and dropout
regularization from the networks, as we need this only for offline learning with pre-simulated data. In the context of online learning, where the model is continuously updated with new data, regularization can be counterproductive. It can unnecessarily constrain the model, preventing it from fully adapting to the latest data.
inference_net = bf.networks.InvertibleNetwork(
num_params=4,
num_coupling_layers=4,
coupling_settings={"dense_args": dict(kernel_regularizer=None), "dropout": False},
)
Again, we can inspect the raw outputs of the cINN by feeding it the parameter draws and corresponding data summaries. This network is slightly more involved than the summary network, as it has two mandatory inputs: targets
and condition
. It also has two outputs: z
and log_det_J
, which represent the latent representation of the parameters and the log of the Jacobian determinant, respectively.
z, log_det_J = inference_net(test_inp["prior_draws"], summary_rep)
Note about z
: The inference network takes summary vectors as input and outputs latent vectors (z
). The latent space is a lower-dimensional space that is assumed to capture the essential information about the parameters. It is essentially transforming the parameter space to the latent space (Gaussian in this case).
We can inspect the shapes of the outputs as well:
print("Shape of latent variables:", z.numpy().shape)
print("Shape of log det Jacobian:", log_det_J.numpy().shape)
Shape of latent variables: (3, 4)
Shape of log det Jacobian: (3,)
1.4.3. Amortized Posterior #
We can now connect the summary_net
and the inference_net
via the high-level wrapper AmortizedPosterior
. This wrapper knows how to compute its loss function, draw samples from the approximate posterior given new data and also compute normalized posterior densities. Normalized posterior density refers to a posterior probability distribution that has been adjusted so that its total area (or volume, in the case of multi-dimensional distributions) sums up to one.
amortizer = bf.amortizers.AmortizedPosterior(inference_net, summary_net)
1.5. Defining the Trainer #
The Trainer
instance connects a generative model with an amortizer and enables various types of simulation-based training. Actually, it has only a single mandatory argument, amortizer
, which expect an Amortized*
instance. However, in order to be able to perform on-the-fly simulation-based training (see below), we also need to provide the generative model. Note, that the generative model does not need to use our provided wrappers, but the keys of its dictionary output should adhere to BayesFlow’s expectations.
Note: If you want to automatically save the amortizer
and related loss history, you can provide a checkpoint_path
argument indicating the folder for storing the checkpoints.
trainer = bf.trainers.Trainer(amortizer=amortizer, generative_model=model)
INFO:root:Performing a consistency check with provided components...
INFO:root:Done.
Actually, a Trainer
instance does a little more than connect a generative model to an amortizer. It does so through the help of a configurator
. In our case the configurator was inferred from the type of amortizer provided, but for more involved models, you should define the configurator explicitly.
What does a configurator do? It takes the raw outputs of the generative models and turns them into something with which neural networks can work:
Let’s see how this actually works by accessing the default (inferred) configurator from the Trainer
instance.
# Simulate some data again
out = model(3)
print("Keys of simulated dict: ", list(out.keys()))
Keys of simulated dict: ['prior_non_batchable_context', 'prior_batchable_context', 'prior_draws', 'sim_non_batchable_context', 'sim_batchable_context', 'sim_data']
conf_out = trainer.configurator(out)
print("Keys of configured dict: ", list(conf_out.keys()))
Keys of configured dict: ['summary_conditions', 'direct_conditions', 'parameters']
The default configurator for posterior inference differentiates between three types of model outputs:
parameters
- these are the quantities for which we want posteriors.summary_conditions
- these are the quantities that go through the summary network (typically the raw data).direct_conditions
– these are concatenated with the outputs of the summary network and passed directly to the inference network.
In our case, summary_conditions
simply correspond to the data, and parameters
correspond to the prior draws, but you can imagine that more complex scenarios are possible. Let’s confirm the former claims.
print(np.allclose(out["sim_data"], conf_out["summary_conditions"]))
print(np.allclose(out["prior_draws"], conf_out["parameters"]))
True
True
Here, we are not using direct equality, since the configurator converts float64
numbers to float32
so as to use GPU memory more efficiently.
1.6. Training Phase #
The following training modes are currently available:
Online training - This training regime is optimal for fast generative models which can efficiently simulate data on-the-fly. In order for this training regime to be efficient, on-the-fly batch simulations should not take longer than 2-3 seconds. The networks never see the same simulations twice, so no regularization (which are primarily used to prevent overfitting on a static dataset) is necessary. This training mode ensures that the model is constantly adapting to fresh data, enhancing its ability to generalize and learn diverse patterns.
Experience replay - This training regime is also good for fast generative models which can efficiently simulated data on-the-fly. It will use a memory replay buffer, as utilized in reinforcement learning, so the network will eventually “experience” some simulations multiple times. This allows it to reinforce its learning from these repeated experiences. Experience replay can help improve learning efficiency and stability, especially in scenarios where generating entirely new simulations on-the-fly for every training step may not be optimal.
Round-based training - This training regime is optimal for slow, but still reasonably performant generative models. In order for this training regime to be efficient, on-the-fly batch simulations should not take longer than one 2-3 minutes. In round-based training, simulations are generated in rounds, and each round of simulations is used for a phase of training before generating new simulations for the next round. This creates a balance between the need for new data and the computational cost of generating it.
Offline training - This training regime is optimal for very slow, external simulators, which take several minutes for a single simulation. It assumes that all training data has already been simulated and stored on disk. Default regularization should be used (or even increase for very small data sets). This method is less flexible but it is used in cases where the generative model is too slow to generate data on-the-fly.
Usually, domain modelers have a pretty good understanding of how fast a simulation model runs. We can also quickly measure the time taken for a given number of simulations (batch_size
) directly inside the notebook.
%%time
_ = model(32)
CPU times: user 3.35 ms, sys: 10 µs, total: 3.36 ms
Wall time: 3.67 ms
We are well below the recommended 2-3 seconds for online training, so that is what we will do. Online training has three mandatory parameters: epochs
, iterations_per_epoch
, and batch_size
. Thus, the total number of simulations that will be performed by a single call (run) will be epochs
\(\times\) iterations_per_epoch
\(\times\) batch_size
. Moreover, the networks will never experience the same simulation twice, since each batch will contain a new set of simulations.
epochs
: This parameter defines the total number of complete passes through the training dataset. Each epoch consists of several iterations, where the model is exposed to different subsets of the data.iterations_per_epoch
: This specifies the number of training iterations that occur in each epoch. During each iteration, the model is updated based on a batch of new data.batch_size
: This determines the number of simulations in each batch. It’s the size of the data subset that the model will see and learn from in each iteration.
1.6.1. Online Training #
Note how the average loss goes down, along with the learning rate (LR). The latter happens, because BayesFlow uses a cosine decay for the learning rate by default, unless a custom optimizer from tensorflow.optimizers
is provided. Thus, the learning rate will decrease atuomatically from its default value of \(0.0005\) to \(0\) over the course of the training. We will also use \(200\) simulations for tracking validation error (even though we don’t strictly need to do this for online learning, where the network sees a completely new batch of data in each iteration). This step is about assessing the model’s performance on a separate set of data that it hasn’t been trained on. It’s a way to monitor how well the model generalizes to new, unseen data.
Depending on your hardware, this training should take between \(30\) seconds and \(5\) minutes. Note, that for actual applications, we will train considerably longer than \(5\) epochs.
%%time
history = trainer.train_online(epochs=10, iterations_per_epoch=1000, batch_size=32, validation_sims=200)
INFO:root:Generated 200 simulations for validation.
Training epoch 1: 0%| | 0/1000 [00:00<?, ?it/s]
Training epoch 1: 100%|██████████| 1000/1000 [01:42<00:00, 9.72it/s, Epoch: 1, Iter: 1000,Loss: -0.002,Avg.Loss: 1.520,LR: 4.88E-04]
INFO:root:Validation, Epoch: 1, Loss: -0.310
Training epoch 2: 100%|██████████| 1000/1000 [00:53<00:00, 18.72it/s, Epoch: 2, Iter: 1000,Loss: -1.684,Avg.Loss: -1.179,LR: 4.52E-04]
INFO:root:Validation, Epoch: 2, Loss: -1.507
Training epoch 3: 100%|██████████| 1000/1000 [00:52<00:00, 18.95it/s, Epoch: 3, Iter: 1000,Loss: -2.126,Avg.Loss: -1.550,LR: 3.97E-04]
INFO:root:Validation, Epoch: 3, Loss: -1.887
Training epoch 4: 100%|██████████| 1000/1000 [00:52<00:00, 19.08it/s, Epoch: 4, Iter: 1000,Loss: -1.558,Avg.Loss: -1.746,LR: 3.27E-04]
INFO:root:Validation, Epoch: 4, Loss: -1.805
Training epoch 5: 100%|██████████| 1000/1000 [00:52<00:00, 19.06it/s, Epoch: 5, Iter: 1000,Loss: -1.588,Avg.Loss: -1.856,LR: 2.50E-04]
INFO:root:Validation, Epoch: 5, Loss: -2.132
Training epoch 6: 100%|██████████| 1000/1000 [00:53<00:00, 18.74it/s, Epoch: 6, Iter: 1000,Loss: -1.779,Avg.Loss: -1.938,LR: 1.73E-04]
INFO:root:Validation, Epoch: 6, Loss: -2.102
Training epoch 7: 100%|██████████| 1000/1000 [00:52<00:00, 19.00it/s, Epoch: 7, Iter: 1000,Loss: -2.131,Avg.Loss: -2.011,LR: 1.03E-04]
INFO:root:Validation, Epoch: 7, Loss: -2.171
Training epoch 8: 100%|██████████| 1000/1000 [00:46<00:00, 21.44it/s, Epoch: 8, Iter: 1000,Loss: -2.051,Avg.Loss: -2.069,LR: 4.78E-05]
INFO:root:Validation, Epoch: 8, Loss: -2.229
Training epoch 9: 100%|██████████| 1000/1000 [00:39<00:00, 25.57it/s, Epoch: 9, Iter: 1000,Loss: -2.375,Avg.Loss: -2.123,LR: 1.23E-05]
INFO:root:Validation, Epoch: 9, Loss: -2.249
Training epoch 10: 100%|██████████| 1000/1000 [00:39<00:00, 25.55it/s, Epoch: 10, Iter: 1000,Loss: -2.243,Avg.Loss: -2.145,LR: 1.49E-11]
INFO:root:Validation, Epoch: 10, Loss: -2.274
CPU times: user 12min 41s, sys: 44 s, total: 13min 25s
Wall time: 9min 8s
1.6.2. Inspecting the Loss #
Bayesian models can be complex and computationally intensive, and metrics like training and validation loss can provide critical insights into the model’s performance and stability. A decreasing loss over time indicates that the model is learning effectively, while fluctuations or increases in loss might suggest issues in the training process, such as overfitting, underfitting, or inappropriate learning rate settings. We can inspect the evolution of the loss via a utility function plot_losses
, for which we have imported the diagnostics
module from BayesFlow
.
f = bf.diagnostics.plot_losses(history["train_losses"], history["val_losses"], moving_average=True)
1.6.3. Validating Consistency #
Validating the consistency of our model-amortizer coupling is an important step which should be performed before any real data are presented to the networks. In other words, the model should work in the ‘’small world’’, before going out in the world of real data. This involves testing the model under known conditions and ensuring that it behaves logically and accurately. It is a critical step to avoid surprises when the model is later exposed to real and more complex data. In addition to a smooth loss reduction curve, we can use at least four handy diagnostic utilities.
For a better illustration, we will start by generating some test simulations (not seen during training) using the simulator model
. Note, that we also use the default configurator to prepare these test simulations for interacting with the networks.
test_sims = trainer.configurator(model(500))
1.6.3.1. Latent space inspection #
Since our training objective prescribes a unit Gaussian to the latent variable \(\boldsymbol{z}\) (see: https://arxiv.org/abs/2003.06281), we expect that, upon good convergence, the latent space will exhibit the prescribed probabilistic structure. Good convergence means that the model has learned an appropriate representation of the data in its latent space. We can quickly inspect this structure by calling the plot_latent_space_2d
function from the diagnostics
module.
z_samples, _ = amortizer(test_sims)
f = bf.diagnostics.plot_latent_space_2d(z_samples)
1.6.3.2. Simulation-Based Calibration #
By now a classic in Bayesian analysis. If you are not familiar with this procedure, you must read about it here: https://arxiv.org/abs/1804.06788
SBC is a technique used to assess whether a probabilistic model correctly infers parameters from data. The basic idea is to simulate a large number of datasets from the model’s prior distribution and then perform posterior inference on these simulated datasets. The goal is to check if the inferred posterior distributions are consistent with the priors. Essentially, SBC tests if the model can accurately recover the parameters it used to generate the data in the first place. This process helps identify any systematic biases or inaccuracies in the model’s inference process.
To perform SBC, we first need to obtain L
number of posterior draws from M
simulated data sets. While the procedure is typically intractable, amortized inference allows us to perform SBC instantly.
# Obtain 100 posterior samples for each simulated data set in test_sims
posterior_samples = amortizer.sample(test_sims, n_samples=100)
f = bf.diagnostics.plot_sbc_histograms(posterior_samples, test_sims["parameters"], num_bins=10)
INFO:root:The ratio of simulations / posterior draws should be > 20 for reliable variance reduction, but your ratio is 5. Confidence intervals might be unreliable!
Note, that the above function complains about the simulations-to-posterior-samples ratio, which is too low for reasonable density estimation and confidence intervals. Thus, we may want to use the more modern version of SBC which is based on empirical cumulative distribution functions (ECDFs) and does not have a num_bins
hyperparameter. You can read more about this method at https://arxiv.org/abs/2103.10522.
f = bf.diagnostics.plot_sbc_ecdf(posterior_samples, test_sims["parameters"], difference=True)
1.6.3.3. Posterior z-score and contraction #
Posterior z-score: It measures how many standard deviations away the mean of the posterior distribution is from the true value of the parameter. A z-score of 0 indicates the mean perfectly aligns with the true value (no bias) while positive/negative z-scores indicate positive/negative bias, respectively.
Posterior contraction: It measures how much the posterior distribution contracts around the true value of the parameter as more data is observed. A higher contraction indicates that the data provides strong evidence, narrowing the uncertainty range.
Ideally, we should obtain high contraction and a z-score near 0. This means the model accurately captures the true value with little bias and high confidence.
A quick and dirty way to gain an understanding of how good point estimates and uncertainty estimates capture the “true” parameters, assuming the generative model is well-specified. For this, we will draw more samples from the posteriors in order to get smaller Monte Carlo error.
post_samples = amortizer.sample(test_sims, n_samples=1000)
Note the shapes of our resulting array: (500, 1000, 4)
. The resulting array holds the \(1000\) posterior draws (axis 1) for each of the \(500\) data sets (axis 0). The final axis (axis 2) represents the number of target parameters.
print("Shape of posterior samples array:", post_samples.shape)
Shape of posterior samples array: (500, 1000, 4)
f = bf.diagnostics.plot_recovery(post_samples, test_sims["parameters"])
Even better, you might want to inspect the sensitivity of the model in terms of how good some expectation (e.g., the mean) captures the ground truth parameter and how much the posterior shrinks with regard to the prior (i.e., so called posterior contraction). For that, we can compute the prior variance analytically or simply estimate it via Monte Carlo.
f = bf.diagnostics.plot_z_score_contraction(post_samples, test_sims["parameters"])
We observe the best case of model adequacy - no bias and large contraction. You can play around with different samples sizes per simulation (i.e., the n_obs
argument in the likelihood function) and check all diagnostics again. Or, even better, you can try estimating your own models!
1.7. Inference Phase #
Once the approximator has passed all consistency checks, we can now go ahead and apply it to real data! Since the data-generating parameters of real systems are per definition unobservable, we cannot use the same methods as below for ascertaining real-world validity of our inferences. Hence, as in any modeling scenario, we would need external validation and posterior predictive checks.
# Your code here