calibration_error

bayesflow.diagnostics.calibration_error(estimates: Mapping[str, numpy.ndarray] | numpy.ndarray, targets: Mapping[str, numpy.ndarray] | numpy.ndarray, variable_keys: Sequence[str] = None, variable_names: Sequence[str] = None, test_quantities: dict[str, Callable] = None, resolution: int = 20, aggregation: Callable = numpy.median, min_quantile: float = 0.005, max_quantile: float = 0.995) → dict[str, any]

Computes an aggregate score for the marginal calibration error over an ensemble of approximate posteriors. For each nominal level alpha in (0, 1), the error is the absolute deviation between alpha and the fraction of targets that fall inside the corresponding alpha-credible interval (CI) of the estimates. These per-alpha errors are then aggregated (e.g., by the median) for each variable.
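The computation described above can be sketched in plain NumPy. This is a minimal illustration and not the library's implementation: the function name `calibration_error_sketch`, the use of central (equal-tailed) intervals, and the evenly spaced grid of alphas between the two quantile bounds are all assumptions made for the example.

```python
import numpy as np


def calibration_error_sketch(estimates, targets, resolution=20,
                             aggregation=np.median,
                             min_quantile=0.005, max_quantile=0.995):
    """Illustrative re-implementation (assumption, not the BayesFlow code).

    estimates: (num_datasets, num_draws, num_variables)
    targets:   (num_datasets, num_variables)
    """
    # Assumed grid of nominal credibility levels.
    alphas = np.linspace(min_quantile, max_quantile, resolution)

    # Central interval for each alpha: cut (1 - alpha) / 2 from each tail.
    lower_q = (1.0 - alphas) / 2.0
    upper_q = 1.0 - lower_q

    # Quantiles over the draw axis -> (resolution, num_datasets, num_variables).
    lower = np.quantile(estimates, lower_q, axis=1)
    upper = np.quantile(estimates, upper_q, axis=1)

    # Fraction of datasets whose target lies inside each interval.
    inside = (lower <= targets[None]) & (targets[None] <= upper)
    coverage = inside.mean(axis=1)  # (resolution, num_variables)

    # Absolute deviation between empirical coverage and nominal level.
    abs_err = np.abs(coverage - alphas[:, None])
    if aggregation is None:
        return abs_err  # per-alpha errors
    return aggregation(abs_err, axis=0)  # one value per variable


# Toy check with a conjugate normal model: theta ~ N(0, 1), x ~ N(theta, 1),
# so the exact posterior is N(x / 2, 1 / 2).
rng = np.random.default_rng(0)
num_datasets, num_draws = 500, 500
theta = rng.normal(size=(num_datasets, 1))
x = theta + rng.normal(size=(num_datasets, 1))
post_mean, post_std = x / 2.0, np.sqrt(0.5)

noise = rng.normal(size=(num_datasets, num_draws, 1))
calibrated = post_mean[:, None, :] + post_std * noise
overconfident = post_mean[:, None, :] + 0.3 * post_std * noise  # too narrow

err_calibrated = calibration_error_sketch(calibrated, theta)
err_overconfident = calibration_error_sketch(overconfident, theta)
```

In this toy setup the draws from the exact posterior should yield a noticeably smaller error than the artificially narrowed draws, which is the behavior the metric is meant to capture.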

Parameters:
estimates : np.ndarray of shape (num_datasets, num_draws, num_variables)

The random draws from the approximate posteriors, one set of num_draws draws for each of the num_datasets datasets.

targets : np.ndarray of shape (num_datasets, num_variables)

The corresponding ground-truth values sampled from the prior.

variable_keys : Sequence[str], optional (default = None)

Select keys from the dictionaries provided in estimates and targets. By default, select all keys.

variable_names : Sequence[str], optional (default = None)

Optional variable names to show in the output.

test_quantities : dict or None, optional, default: None

A dict that maps plot titles to functions that compute test quantities based on estimate/target draws.

The dict keys are automatically added to variable_keys and variable_names. Each test quantity function is expected to accept a dict of draws with shape (batch_size, ...) as its first (typically only) positional argument and to return a NumPy array of shape (batch_size,). The functions do not have to handle an additional sample dimension, as the appropriate reshaping is done internally.
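As a rough illustration of the expected function shape, the following sketch defines two test quantities. The variable keys "mu" and "sigma", their shapes, and the titles used as dict keys are hypothetical and chosen only for this example.

```python
import numpy as np


def mean_of_mu(draws):
    # draws["mu"] is assumed to have shape (batch_size, num_components).
    return np.mean(draws["mu"], axis=-1)  # -> (batch_size,)


def log_sigma(draws):
    # draws["sigma"] is assumed to have shape (batch_size, 1).
    return np.log(draws["sigma"][:, 0])  # -> (batch_size,)


# Titles map to functions; a dict like this would be passed as test_quantities.
test_quantities = {
    "Mean of mu": mean_of_mu,
    "log(sigma)": log_sigma,
}

# Quick shape check on a hand-made batch of three draws.
draws = {
    "mu": np.array([[1.0, 3.0], [2.0, 4.0], [0.0, 2.0]]),
    "sigma": np.array([[1.0], [2.0], [4.0]]),
}
outputs = {title: fn(draws) for title, fn in test_quantities.items()}
```

Each function reduces every non-batch dimension, so the output has exactly one value per draw in the batch.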

resolution : int, optional, default: 20

The number of credible intervals (CIs) to consider.

aggregation : callable or None, optional, default: np.median

The function used to aggregate the marginal calibration errors. If None is provided, the per-alpha calibration errors are returned instead.

min_quantile : float in (0, 1), optional, default: 0.005

The minimum posterior quantile to consider.

max_quantile : float in (0, 1), optional, default: 0.995

The maximum posterior quantile to consider.

Returns:
result : dict

Dictionary containing:

  • “values” : float or np.ndarray

    The aggregated calibration error per variable.

  • “metric_name” : str

    The name of the metric (“Calibration Error”).

  • “variable_names” : str

    The (inferred) variable names.
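One way to consume a result of this form is to pair each variable name with its value. The sketch below uses a hand-written stand-in dict with placeholder numbers instead of real output, and the helper `summarize` is hypothetical. It assumes an aggregation function was used, so that “values” holds one number per variable.

```python
import numpy as np


def summarize(result):
    """Map each variable name to its aggregated calibration error."""
    values = np.atleast_1d(result["values"])  # also handles a scalar value
    return {name: float(v) for name, v in zip(result["variable_names"], values)}


# Stand-in with the documented keys; the numbers are placeholders, not results.
example_result = {
    "values": np.array([0.02, 0.11]),
    "metric_name": "Calibration Error",
    "variable_names": ["mu", "sigma"],
}

summary = summarize(example_result)
for name, value in summary.items():
    print(f"{example_result['metric_name']} for {name}: {value:.3f}")
```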