1. Bayesian Linear Regression Starter#
Authors: Paul Bürkner, Lars Kühmichel, Stefan T. Radev
1.1. Introduction#
Welcome to the very first tutorial on using BayesFlow for amortized posterior estimation! In this notebook, we will estimate a linear regression model and illustrate some features of the library along the way. This tutorial introduces the lower-level interface, which allows full control over all network and training parameters, as well as passing all familiar keras callback functions.
In traditional Bayesian inference, we seek to approximate the posterior distribution of model parameters given observed data for each new data instance separately. This process can be computationally expensive, especially for complex models or large datasets, because it often involves iterative optimization or sampling methods. This step needs to be repeated for each new instance of data.
Amortized Bayesian inference offers a solution to this problem. “Amortization” here refers to spreading out the computational cost over multiple instances. Instead of computing a new posterior from scratch for each data instance, amortized inference learns a function, parameterized by a neural network, that directly maps observations to an approximation of the posterior distribution. This function is trained over many simulated datasets so that it can approximate the posterior for any new data instance efficiently. In this example, we will use a simple linear regression model with Gaussian noise to illustrate the basic concepts of amortized posterior estimation.
At a high level, our architecture consists of a summary network that compresses each dataset into a fixed-size representation and an inference network that approximates the posterior given that representation. Both networks are introduced in the Neural Approximator section below.
We load a bunch of libraries and choose the keras backend we want to use. Here we use JAX, but you can freely change that and the notebook will work all the same.
# ensure the backend is set
import os
if "KERAS_BACKEND" not in os.environ:
# set this to "torch", "tensorflow", or "jax"
os.environ["KERAS_BACKEND"] = "jax"
import matplotlib.pyplot as plt
import numpy as np
import keras
import bayesflow as bf
BayesFlow offers flexible modules you can adapt to different Amortized Bayesian Inference (ABI) workflows. In brief:
- The module simulators contains high-level wrappers for gluing together priors, simulators, and meta-functions, and for generating all quantities of interest for a modeling scenario.
- The module adapters contains utilities that preprocess the data generated by the simulator into a format more friendly for the neural approximators.
- The module networks contains the core neural architectures used for various tasks, e.g., a generative FlowMatching architecture for approximating distributions, or a DeepSet for learning permutation-invariant summary representations (embeddings).
- The module approximators contains high-level wrappers which connect the various networks together and instruct them about their particular goals in the inference pipeline.
In this notebook we will take components from each of these modules and show how they work together.
# avoid scientific notation for outputs
np.set_printoptions(suppress=True)
1.2. Generative Model#
From the perspective of BayesFlow, a generative model is more than just a prior (encoding beliefs about the parameters before observing data) and a data simulator (a likelihood function, often implicit, that generates data given parameters).
In addition, a model consists of various implicit context assumptions, which we can make explicit at any time. Furthermore, we can also amortize over these context variables, thus making our real-world inference more flexible (i.e., applicable to more contexts). We are leveraging the concept of amortized inference and extending it to context variables as well.
We will start with our data simulator:
def likelihood(beta, sigma, N):
    # x: predictor variable
    x = np.random.normal(0, 1, size=N)
    # y: response variable
    y = np.random.normal(beta[0] + beta[1] * x, sigma, size=N)
    return dict(y=y, x=x)
As inputs, it takes the regression coefficient vector beta (intercept and slope), the residual SD sigma, and the number of observations N to simulate both our predictor variable x and subsequently our response variable y.
data_draws = likelihood(beta = [2, 1], sigma = 1, N = 3)
print(data_draws["y"].shape)
print(data_draws["y"])
(3,)
[-0.02379952 1.47901078 3.01718255]
Next, we define our prior simulator to sample draws of the model parameters beta and sigma:
def prior():
    # beta: regression coefficients (intercept, slope)
    beta = np.random.normal([2, 0], [3, 1])
    # sigma: residual standard deviation
    sigma = np.random.gamma(1, 1)
    return dict(beta=beta, sigma=sigma)
prior_draws = prior()
print(prior_draws["beta"].shape)
print(prior_draws["beta"])
(2,)
[ 6.00418464 -0.45589576]
If we fix the number of observations N, the combination of likelihood and prior already fully defines our model simulator. However, we want to train BayesFlow to perform posterior approximation for linear regression models with a varying number of observations. As such, we also need a simulator for N:
def meta():
    # N: number of observations in a dataset
    N = np.random.randint(5, 15)
    return dict(N=N)
meta_draws = meta()
print(meta_draws["N"])
5
We can combine these three functions into a bayesflow simulator via:
simulator = bf.simulators.make_simulator([prior, likelihood], meta_fn=meta)
We passed the meta simulator separately to the meta_fn argument to make sure that the number of observations N is constant within each batch of simulated datasets. This is required because, within each batch, the generated datasets need to have the same shape so that they can be easily transformed into tensors for deep learning.
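To build intuition for why this matters, here is a tiny plain-numpy sketch (illustrative only, not part of the BayesFlow API): datasets of equal length N stack cleanly into a single batch tensor, while datasets of differing length do not.
# Datasets with the same N stack into one (batch, N) array ...
same_N = [np.random.normal(size=5) for _ in range(3)]
print(np.stack(same_N).shape)  # (3, 5)
# ... while ragged datasets with differing N raise an error.
ragged = [np.random.normal(size=n) for n in (5, 7, 9)]
try:
    np.stack(ragged)
except ValueError as err:
    print("cannot stack ragged datasets:", err)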
Let’s see how sampling from the simulator works by sampling a batch of 500 datasets:
# Generate a batch of 500 training samples
sim_draws = simulator.sample(500)
print(sim_draws["N"])
print(sim_draws["beta"].shape)
print(sim_draws["sigma"].shape)
print(sim_draws["x"].shape)
print(sim_draws["y"].shape)
5
(500, 2)
(500, 1)
(500, 5)
(500, 5)
Let’s define the parameter keys and corresponding names of individual parameters for easy filtering and plotting down the line.
par_keys = ["beta", "sigma"]
par_names = [r"$\beta_0$", r"$\beta_1$", r"$\sigma$"]
f = bf.diagnostics.plots.pairs_samples(
    samples=sim_draws,
    variable_keys=par_keys,
    variable_names=par_names
)

1.3. Adapter#
To ensure that the training data generated by the simulator can be used for deep learning, we have to apply a bunch of transformations via adapter objects. They provide multiple flexible functionalities, from standardization to renaming, and so on.
Below, we build our own adapter from scratch, but later on, BayesFlow will also provide default adapters that automate most of the commonly required steps.
adapter = (
    bf.Adapter()
    .broadcast("N", to="x")
    .as_set(["x", "y"])
    .constrain("sigma", lower=0)
    .standardize(exclude=["N"])
    .sqrt("N")
    .concatenate(["beta", "sigma"], into="inference_variables")
    .concatenate(["x", "y"], into="summary_variables")
    .rename("N", "inference_conditions")
)
Let me elaborate on a few adapter steps:
The .broadcast("N", to="x") transform will copy the value of N batch-size times to ensure that it will also have a batch_size dimension even though it was actually just a single value, constant over all datasets within a batch. The batch dimension will be inferred from x (this needs to be present during inference).
The .as_set(["x", "y"]) transform indicates that both x and y are treated as sets. That is, their values will be treated as exchangeable such that they imply the same inference regardless of the values’ order. This makes sense, since in linear regression, we can index the observations in arbitrary order and always get the same regression line.
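As a toy illustration of this property (plain numpy, not the adapter internals): a permutation-invariant summary, such as the mean of the observations, is unchanged when we shuffle the data.
# Shuffling the observations leaves a permutation-invariant summary
# (here simply the mean) unchanged.
x_toy = np.random.normal(size=10)
shuffled = np.random.permutation(x_toy)
print(np.isclose(x_toy.mean(), shuffled.mean()))  # True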
The .constrain("sigma", lower=0) transform ensures that the residual standard deviation parameter sigma will always be positive. Without this constraint, the neural networks may attempt to predict negative sigma, which of course would not make much sense.
Standardization via .standardize() is important for neural networks to learn reliably without, for example, exploding or vanishing gradients during training. However, we need to exclude the variable N from standardization via standardize(exclude=["N"]). This is because N is constant within each batch of training data and hence cannot be standardized. In the future, bayesflow will automatically detect this case so that we don’t have to manually exclude such constant variables from standardization.
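Conceptually, standardization rescales each variable to zero mean and unit standard deviation across the batch. Here is a small hand-rolled sketch using the sim_draws from above (illustrative only; the adapter handles this internally and its exact implementation may differ):
# Hand-rolled standardization of beta across the batch dimension.
beta_raw = sim_draws["beta"]  # shape (500, 2)
beta_std = (beta_raw - beta_raw.mean(axis=0)) / beta_raw.std(axis=0)
print(beta_std.mean(axis=0).round(3), beta_std.std(axis=0).round(3))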
Let’s check the shape of our processed data to be passed to the neural networks:
processed_draws = adapter(sim_draws)
print(processed_draws["summary_variables"].shape)
print(processed_draws["inference_conditions"].shape)
print(processed_draws["inference_variables"].shape)
(500, 5, 2)
(500, 1)
(500, 3)
Those shapes are as we expect them to be. The first dimension is always the batch size, which was 500 for our example data. All variables adhere to this rule since the first dimension is indeed 500.
For summary_variables, the second dimension is equal to N, which happened to be sampled as 5 for these example data. Its third dimension is 2, since we have combined x and y into summary variables, each of which is a vector of length N within each simulated dataset.
For inference_conditions, the second dimension is just 1 because we have passed only the scalar variable N there.
For inference_variables, the second dimension is 3 because it consists of beta (a vector of length 2) and sigma (a scalar).
1.4. Neural Approximator#
Below, we will define the neural networks that we will use to estimate the posterior distribution of our linear regression parameters.
1.4.1. Summary Network#
Since our likelihood generates data exchangeably, we need to respect the permutation invariance of the data. Exchangeability means that the probability distribution of a sequence of observations remains the same regardless of the order in which the observations appear. In other words, the data are permutation-invariant. For that, we will use a DeepSet network, which produces permutation-invariant summaries by design. This network will take (at least) 3D tensors of shape (batch_size, n_obs, D) and reduce them to 2D tensors of shape (batch_size, summary_dim), where summary_dim is a hyperparameter to be set by the user (you). Heuristically, this number should not be lower than the number of parameters in a model. Below, we create a permutation-invariant network with summary_dim = 10:
summary_network = bf.networks.DeepSet(summary_dim=10)
1.4.2. Inference Network#
To actually approximate the posterior distribution, we need to define a generative neural network. Here we choose a simple coupling flow network:
inference_network = bf.networks.CouplingFlow()
For more complicated models and corresponding posteriors, we may need to choose a different, more flexible generative network, for example bf.networks.FlowMatching().
We can now define our posterior approximator consisting of the two networks and our adapter from above.
approximator = bf.ContinuousApproximator(
    inference_network=inference_network,
    summary_network=summary_network,
    adapter=adapter,
)
We define some training hyperparameters, such as the learning rate and optimization algorithm, and then compile the approximator with these choices. This is made easier with Workflow objects, as demonstrated in other tutorials.
epochs = 30
num_batches = 128
batch_size = 64
learning_rate = keras.optimizers.schedules.CosineDecay(5e-4, decay_steps=epochs*num_batches, alpha=1e-6)
optimizer = keras.optimizers.Adam(learning_rate=learning_rate, clipnorm=1.0)
approximator.compile(optimizer=optimizer)
Now, we are ready to train our approximator to learn posterior distributions for linear regression models. To achieve this, we will call approximator.fit, passing the simulator and a bunch of hyperparameters that control how long we want to train.
Note: when using JAX and the shape of your data differs in every batch (e.g., when the number of observations varies), you will observe some compilation overhead during the first few steps. The total training time for this example is around 2 minutes on a standard laptop.
history = approximator.fit(
    epochs=epochs,
    num_batches=num_batches,
    batch_size=batch_size,
    simulator=simulator,
)
INFO:bayesflow:Building dataset from simulator instance of SequentialSimulator.
INFO:bayesflow:Using 20 data loading workers.
INFO:bayesflow:Building on a test batch.
Epoch 1/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 42s 298ms/step - loss: 2.7195 - loss/inference_loss: 2.7195
Epoch 2/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - loss: 1.5215 - loss/inference_loss: 1.5215
Epoch 3/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - loss: 1.1648 - loss/inference_loss: 1.1648
Epoch 4/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - loss: 0.8991 - loss/inference_loss: 0.8991
Epoch 5/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: 0.7519 - loss/inference_loss: 0.7519
Epoch 6/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: 0.5804 - loss/inference_loss: 0.5804
Epoch 7/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: 0.5204 - loss/inference_loss: 0.5204
Epoch 8/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: 0.4657 - loss/inference_loss: 0.4657
Epoch 9/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - loss: 0.3375 - loss/inference_loss: 0.3375
Epoch 10/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: 0.2869 - loss/inference_loss: 0.2869
Epoch 11/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: 0.2040 - loss/inference_loss: 0.2040
Epoch 12/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - loss: 0.2421 - loss/inference_loss: 0.2421
Epoch 13/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - loss: 0.1711 - loss/inference_loss: 0.1711
Epoch 14/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: 0.0909 - loss/inference_loss: 0.0909
Epoch 15/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - loss: 0.0994 - loss/inference_loss: 0.0994
Epoch 16/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: 0.0054 - loss/inference_loss: 0.0054
Epoch 17/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: 0.0202 - loss/inference_loss: 0.0202
Epoch 18/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: 0.0617 - loss/inference_loss: 0.0617
Epoch 19/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: -0.1003 - loss/inference_loss: -0.1003
Epoch 20/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: -0.0775 - loss/inference_loss: -0.0775
Epoch 21/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - loss: -0.0360 - loss/inference_loss: -0.0360
Epoch 22/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - loss: -0.0990 - loss/inference_loss: -0.0990
Epoch 23/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 21ms/step - loss: -0.1538 - loss/inference_loss: -0.1538
Epoch 24/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 20ms/step - loss: -0.1302 - loss/inference_loss: -0.1302
Epoch 25/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - loss: -0.2126 - loss/inference_loss: -0.2126
Epoch 26/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - loss: -0.2000 - loss/inference_loss: -0.2000
Epoch 27/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - loss: -0.1525 - loss/inference_loss: -0.1525
Epoch 28/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - loss: -0.2035 - loss/inference_loss: -0.2035
Epoch 29/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - loss: -0.1277 - loss/inference_loss: -0.1277
Epoch 30/30
128/128 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - loss: -0.1609 - loss/inference_loss: -0.1609
f = bf.diagnostics.plots.loss(history)

1.5. Diagnostics#
Let’s check out the resulting inference. Say we want to obtain 1000 posterior samples from our approximated posterior of a simulated dataset where we know the ground truth values.
# Set the number of posterior draws you want to get
num_samples = 1000
# Simulate validation data (unseen during training)
val_sims = simulator.sample(200)
# Obtain num_samples samples of the parameter posterior for every validation dataset
post_draws = approximator.sample(conditions=val_sims, num_samples=num_samples)
# post_draws is a dictionary of draws with one element per named parameter
post_draws.keys()
dict_keys(['beta', 'sigma'])
Initial sanity checks of the posterior samples look good. post_draws["beta"] has shape (200, 1000, 2), which makes sense since we asked for inference on 200 datasets (first dimension is 200), for each of which we wanted to generate 1000 posterior samples (second dimension is 1000). The third dimension is 2, since the beta variable was defined as a vector of length 2 (intercept and slope).
post_draws["sigma"].min()
8.839590354457375e-07
The minimum posterior sample of sigma is positive, indicating that our positivity-enforcing constraint in the data adapter has indeed worked.
Let’s plot the joint posterior distribution of beta (both intercept and slope). Based on how we generated this particular dataset, we would expect the posteriors of beta to vary around its true values from val_sims["beta"]. Of course, if this were real data, we wouldn’t know the ground truth values. Still, it is always a good idea to first perform inference on simulated data as a diagnostic for whether the approximator has learned to approximate the true posteriors well enough. As an example, we will check the posterior of the first dataset:
f = bf.diagnostics.plots.pairs_posterior(
    estimates=post_draws,
    targets=val_sims,
    dataset_id=0,
    variable_names=par_names,
)

The true parameter values of the first dataset are indeed well covered by the posterior. Let’s check this more systematically for all validation datasets:
Note: We are validating recovery jointly across all 200 validation datasets, each simulated with its own ground-truth parameters and dataset size N.
f = bf.diagnostics.plots.recovery(
    estimates=post_draws,
    targets=val_sims,
    variable_names=par_names
)

Accuracy looks good for most datasets, though there is somewhat more variation for some of them. If this were not the case, our posterior approximation might be too wide. Unfortunately, in many cases we don’t have access to the correct posterior, so we need a method that indicates the accuracy of the posterior approximations without it. This is where simulation-based calibration (SBC) comes into play. In short, if the true values are simulated from the prior used during inference (as is the case for our validation data above), we would expect the rank of each true parameter value among its posterior draws to be uniformly distributed from 1 to num_samples.
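To make the rank statistic concrete, here is a small hand-computed sketch for sigma (illustrative only; it assumes the shapes printed above, and the diagnostic plots below compute and visualize these ranks for us):
# For each validation dataset, the rank of the true sigma is the number of
# posterior draws that fall below it. np.squeeze drops trailing singleton axes.
true_sigma = np.squeeze(val_sims["sigma"])     # (200,)
sigma_draws = np.squeeze(post_draws["sigma"])  # (200, 1000)
ranks = (sigma_draws < true_sigma[:, None]).sum(axis=1)
print(ranks.min(), ranks.max())  # should spread roughly uniformly over 0..1000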
There are multiple graphical methods that use this property for diagnostics. For example, we can use histograms together with an uncertainty band within which we would expect the histogram bars to be if the rank statistics were indeed uniform.
f = bf.diagnostics.plots.calibration_histogram(
    estimates=post_draws,
    targets=val_sims,
    variable_names=par_names
)
WARNING:bayesflow:The ratio of simulations / posterior draws should be > 20 for reliable variance reduction, but your ratio is 0. Confidence intervals might be unreliable!

The histograms look quite good overall, but could be a bit more uniform for some parameters. As an additional graphical check, we can look at the empirical cumulative distribution functions (ECDFs) of the rank statistics:
f = bf.diagnostics.plots.calibration_ecdf(
    estimates=post_draws,
    targets=val_sims,
    variable_names=par_names,
    difference=True,
    rank_type="distance"
)

The plot confirms that the approximate posteriors are overall well calibrated, up to some small remaining issues in the posteriors of individual parameters.
After having convinced ourselves that the posterior approximations are overall reasonable, we can check how much and what kind of information in the data we encode in the posterior. Specifically, we might want to look at two interesting scores: (a) the posterior contraction, which measures how much smaller the posterior variance is relative to the prior variance (higher values indicate more contraction relative to the prior); and (b) the posterior z-score, which indicates the standardized difference between the posterior mean and the true parameter value. Since the posterior z-score requires the true parameter values, it can only be computed in simulated data settings. Here, we show the results only for beta by selecting it via the variable_keys argument.
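For intuition, both scores can also be computed by hand. The following sketch follows the definitions above for the intercept (the exact conventions used by bf.diagnostics may differ slightly); recall that the prior standard deviation of the intercept was set to 3 in our prior function.
# Hand-computed contraction and z-score for the intercept beta_0
# (illustrative only; the plot below does this for us).
post_beta0 = post_draws["beta"][:, :, 0]  # (200, 1000) posterior draws
true_beta0 = val_sims["beta"][:, 0]       # (200,) true values
prior_var = 3.0 ** 2                      # prior sd of the intercept is 3
contraction = 1 - post_beta0.var(axis=1) / prior_var
z_score = (post_beta0.mean(axis=1) - true_beta0) / post_beta0.std(axis=1)
print(contraction.mean().round(2), np.abs(z_score).mean().round(2))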
f = bf.diagnostics.plots.z_score_contraction(
    estimates=post_draws,
    targets=val_sims,
    variable_keys=["beta"],
    variable_names=par_names[0:2]
)

We clearly see strong posterior contraction in almost all posteriors of beta.
In terms of posterior z-score, most estimates are between -2 and 2, which makes sense if our posterior is approximately normal and well calibrated. However, again, there are some notable exceptions with quite large posterior z-scores, greater than 3 in absolute value. These may be cases where the learned posterior approximation was not yet fully accurate, so these extreme cases would likely vanish if we trained our approximator a little longer.
1.5.1. Saving and Loading the Trained Networks#
# Recommended - full serialization (checkpoints folder must exist)
approximator.save(filepath="checkpoints/regression.keras")
# Not recommended due to adapter mismatches - weights only
# approximator.save_weights(filepath="checkpoints/regression.h5")
# Overwrite approximator object with loaded object
approximator = keras.saving.load_model("checkpoints/regression.keras")
c:\Users\radevs\AppData\Local\anaconda3\envs\bf\Lib\site-packages\keras\src\saving\serialization_lib.py:734: UserWarning: `compile()` was not called as part of model loading because the model's `compile()` method is custom. All subclassed Models that have `compile()` overridden should also override `get_compile_config()` and `compile_from_config(config)`. Alternatively, you can call `compile()` manually after loading.
instance.compile_from_config(compile_config)
# Again, obtain num_samples samples of the parameter posterior for every validation dataset
post_draws = approximator.sample(conditions=val_sims, num_samples=num_samples)
f = bf.diagnostics.plots.recovery(
    estimates=post_draws,
    targets=val_sims,
    variable_names=par_names
)
