Analyzing Benchmarks

We will demonstrate below how to distribute our benchmark runner pipeline over multiple benchmarks, in conjunction with our suite of benchmark analysis tools, to compare and visualize the performance of different algorithms across all benchmark problems.

Installation and reference imports

!pip install google-vizier[jax,algorithms]

from vizier import benchmarks as vzb
from vizier.algorithms import designers
from vizier.benchmarks import experimenters
from vizier.benchmarks import analyzers

Algorithm and Experimenter Factories

To compare algorithms across multiple benchmarks, we first create a set of relevant benchmark experimenters. To do so, we use SerializableExperimenterFactory from our Experimenters API to modularize the construction of multiple benchmark components.

For example, here we can create a diverse set of BBOB functions with different dimensions via the BBOBExperimenterFactory. Then, we can print out the full serialization of the benchmarks that we have created.

import itertools
import numpy as np
from vizier.benchmarks import experimenters

function_names = ['Sphere', 'Discus']
dimensions = [4, 8]
product_list = list(itertools.product(function_names, dimensions))

experimenter_factories = []
for product in product_list:
  name, dim = product
  bbob_factory = experimenters.BBOBExperimenterFactory(name=name, dim=dim)
  experimenter_factories.append(bbob_factory)
  print(bbob_factory.dump())
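
The factory pattern above separates describing a benchmark from building it. The following is a minimal sketch of that idea only: `ToyFactory` is a hypothetical stand-in, not a Vizier class, and its `dump()` output is not meant to match the real BBOBExperimenterFactory serialization.

```python
import dataclasses
import itertools
from typing import Callable


@dataclasses.dataclass(frozen=True)
class ToyFactory:
  """Hypothetical stand-in for a serializable experimenter factory."""

  name: str
  dim: int

  def dump(self) -> dict:
    # Serialize the configuration only, not the constructed objective.
    return dataclasses.asdict(self)

  def __call__(self) -> Callable[[list], float]:
    # Build a toy objective; only 'Sphere' is modelled in this sketch.
    if self.name != 'Sphere':
      raise ValueError(f'Unsupported toy function: {self.name}')
    return lambda x: float(sum(v * v for v in x))


# One factory per (name, dim) combination, as in the loop above.
toy_factories = [
    ToyFactory(name=name, dim=dim)
    for name, dim in itertools.product(['Sphere'], [4, 8])
]
configs = [factory.dump() for factory in toy_factories]

# Calling a factory builds a fresh objective from its configuration.
sphere_4d = toy_factories[0]()
value = sphere_4d([1.0, 2.0, 0.0, 0.0])
```

Because each factory carries only its configuration, the same list can be reused to build a fresh experimenter for every repeat of every algorithm.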

As mentioned in our previous tutorial, we can create a BenchmarkState from our algorithm and experimenter factories and apply a BenchmarkRunner benchmarking protocol to run the algorithm. We end up with a list of BenchmarkState objects, each representing a different benchmark run, possibly with repeats.

Conveniently, we provide analysis utility functions in our Analyzers API that convert our BenchmarkState into summarized curves stored compactly in BenchmarkRecord, which also holds the algorithm name and experimenter factory serialization. We can visualize and later analyze our results using a dataframe.

NUM_REPEATS = 5  # @param
NUM_ITERATIONS = 150  # @param

runner = vzb.BenchmarkRunner(
    benchmark_subroutines=[
        vzb.GenerateSuggestions(),
        vzb.EvaluateActiveTrials(),
    ],
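    # Here num_repeats counts the suggest/evaluate rounds within one run; the
    # number of independent runs is controlled by NUM_REPEATS below.
    num_repeats=NUM_ITERATIONS,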
)
algorithms = {
    'grid': designers.GridSearchDesigner.from_problem,
    'random': designers.RandomDesigner.from_problem,
    'eagle': designers.EagleStrategyDesigner,
}

records = []
for experimenter_factory in experimenter_factories:
  for algo_name, algo_factory in algorithms.items():
    benchmark_state_factory = vzb.ExperimenterDesignerBenchmarkStateFactory(
        experimenter_factory=experimenter_factory, designer_factory=algo_factory
    )
    states = []
    for _ in range(NUM_REPEATS):
      benchmark_state = benchmark_state_factory()
      runner.run(benchmark_state)
      states.append(benchmark_state)
    record = analyzers.BenchmarkStateAnalyzer.to_record(
        algorithm=algo_name,
        experimenter_factory=experimenter_factory,
        states=states,
    )
    records.append(record)

import pandas as pd

records_list = [
    (rec.algorithm, dict(rec.experimenter_metadata), rec) for rec in records
]
df = pd.DataFrame(records_list, columns=['algorithm', 'experimenter', 'record'])
df

Visualization from Records

Given a sequence of BenchmarkRecords, we provide utility plotting functions via the matplotlib.pyplot library to plot and visualize the relative performance of each algorithm on each benchmark. Currently, for single-objective optimization, we extract and plot the objective metric, which represents the objective of the best Trial seen so far as a function of Trial id/count (default).

Note: this objective curve is monotonic and is computed when the BenchmarkState objects are converted into a BenchmarkRecord.
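
To illustrate why the curve in the note above is monotonic, here is a minimal NumPy sketch of a best-so-far computation. The helper `best_so_far` and the sample values are hypothetical and are not Vizier's implementation; the sketch assumes objectives are arranged as (num_repeats, num_trials).

```python
import numpy as np


def best_so_far(objectives: np.ndarray, sign: int = 1) -> np.ndarray:
  """Running best along the trial axis.

  Args:
    objectives: Array of shape (num_repeats, num_trials).
    sign: 1 if larger is better, -1 if smaller is better.

  Returns:
    Array of the same shape holding the best value seen up to each trial.
  """
  if sign == 1:
    return np.maximum.accumulate(objectives, axis=1)
  return np.minimum.accumulate(objectives, axis=1)


# Hypothetical objective values from two repeats of a minimization run.
raw = np.array([[5.0, 3.0, 4.0, 1.0], [2.0, 6.0, 0.5, 0.7]])
curve = best_so_far(raw, sign=-1)
```

Each row of `curve` can only improve or stay flat as the trial count grows, which is the monotonicity the note refers to.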

analyzers.plot_from_records(records)

Observe that plot_from_records is a general plotting utility function that generates a grid of algorithm comparison plots. Specifically, it generates one plot for each Experimenter x Metrics in records, where each row represents an Experimenter and each column is a Metric represented in the record’s elements dictionary. Each plot has a curve for each algorithm.

Adding Analysis

Oftentimes, further analysis is needed to normalize metrics across multiple benchmarks or to visualize more context-dependent metrics, such as visualizing the Pareto frontier as a scatter plot.

We focus on the former case, where objective curves require some form of normalization before they can be compared across benchmarks. Many success metrics have been proposed: win rates, relative convergence, normalized objective score, and NeurIPS competition scores.

To broadly cover such analysis scores, our API introduces the ConvergenceComparator abstraction, which compares two ConvergenceCurves at specified quantiles:

@attr.define
class ConvergenceComparator(abc.ABC):
  """(Simplified) Base class for convergence curve comparators.

  Attributes:
    baseline_curve: The baseline ConvergenceCurve.
    compared_curve: The compared ConvergenceCurve.
  """

  _baseline_curve: ConvergenceCurve = attr.field()
  _compared_curve: ConvergenceCurve = attr.field()

  @abc.abstractmethod
  def score(self) -> float:
    """Returns a summary score for the comparison between base and compared.

    Usually, higher positive numbers mean the compared curve is better than the
    baseline and vice versa.
    """
    pass

  @abc.abstractmethod
  def curve(self) -> ConvergenceCurve:
    """Returns a score curve for each xs."""
    pass

Generally, a higher score should by convention indicate that the compared curve is better than the baseline. Furthermore, a score of 0.0 indicates that the performance is similar, and it would make sense for these scores to be symmetric. However, no such restrictions are imposed by the API.

As an example, we can add the LogEfficiencyScore, which is based on performance profiles, a gold standard in optimization benchmarking. The LogEfficiencyScore essentially measures, on a log scale, the fraction of Trials needed for the compared algorithm to match the baseline performance. If score = 1, then the compared algorithm uses \(e^{-1} \cdot T\) Trials to reach the same objective that the baseline algorithm reaches in \(T\) Trials.

analyzed_records = analyzers.BenchmarkRecordAnalyzer.add_comparison_metrics(
    records=records, baseline_algo='random'
)
analyzers.plot_from_records(analyzed_records)
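
To make the interpretation of the score concrete, the sketch below computes a simplified log efficiency from two best-so-far curves. It is an illustration under stated assumptions, not Vizier's LogEfficiencyScore implementation: the helpers `first_trial_reaching` and `toy_log_efficiency` are hypothetical, both curves are assumed to be minimization curves of equal length, and the case where the compared curve never reaches the baseline is handled by returning negative infinity.

```python
import math
from typing import Optional, Sequence


def first_trial_reaching(
    curve: Sequence[float], target: float
) -> Optional[int]:
  """1-based count of trials until the best-so-far value is <= target."""
  for trial_count, value in enumerate(curve, start=1):
    if value <= target:
      return trial_count
  return None


def toy_log_efficiency(
    baseline: Sequence[float], compared: Sequence[float]
) -> float:
  """Returns log(T / T_compared) for two minimization curves."""
  total_trials = len(baseline)
  target = baseline[-1]  # What the baseline achieved after T trials.
  trials_needed = first_trial_reaching(compared, target)
  if trials_needed is None:
    # Assumption of this sketch: never matching the baseline scores -inf.
    return -math.inf
  return math.log(total_trials / trials_needed)


# Hypothetical best-so-far curves over T = 8 trials.
baseline = [8.0, 6.0, 5.0, 4.0, 3.0, 2.5, 2.0, 1.0]
compared = [7.0, 3.0, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5]

score = toy_log_efficiency(baseline, compared)
self_score = toy_log_efficiency(baseline, baseline)
```

Here the compared curve matches the baseline's final value after 3 of the 8 trials, so `score` equals log(8/3), and inverting it with `math.exp(-score) * 8` recovers those 3 trials. Comparing a curve against itself gives 0.0, consistent with the convention that 0.0 means similar performance.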

Custom Comparators

To write a custom ConvergenceComparator, simply follow the abstract class defined above and form a ConvergenceComparatorFactory, which can then be passed into add_comparison_metrics. Note that we are constantly adding more benchmarking scores into our analyzers base and welcome submissions.

Let us try to write a custom WinRateComparator that looks at the simple metric of comparing whether one curve is better than the other, for each xs.

NOTE: You may always assume in a Comparator that both curves are either INCREASING (sign = 1) or DECREASING (sign = -1) and that the sign of the curves is stored in self._sign.

class WinRateComparator(analyzers.ConvergenceComparator):
  """Comparator method based on simple win rate comparison."""

  def score(self) -> float:
    return np.nanmedian(self.curve().ys)

  def curve(self) -> analyzers.ConvergenceCurve:
    baseline_ys = self._sign * self._baseline_curve.ys
    compared_ys = self._sign * self._compared_curve.ys

    # Compares all pairs of compared to baseline curve.
    all_comparisons = np.apply_along_axis(
        lambda base: np.mean(compared_ys > base, axis=0),
        axis=1,
        arr=baseline_ys,
    )
    return analyzers.ConvergenceCurve(
        xs=self._baseline_curve.xs,
        ys=np.mean(all_comparisons, axis=0, keepdims=True),
    )
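
The array shapes in `curve()` can be easier to follow on a small standalone example. The sketch below repeats the same NumPy steps on hypothetical data, assuming `ys` has shape (num_repeats, num_steps) and that larger values are better, so no sign flip is applied. The broadcasting form at the end is an equivalent alternative added for illustration.

```python
import numpy as np

# Hypothetical best-so-far curves with shape (num_repeats, num_steps).
baseline_ys = np.array([[1.0, 2.0, 3.0], [1.0, 1.0, 2.0]])
compared_ys = np.array([[0.0, 3.0, 4.0], [2.0, 3.0, 1.0], [0.5, 1.5, 3.5]])

# For each baseline repeat (one row), compute the fraction of compared
# repeats that beat it at each step. Shape: (num_baseline_repeats, num_steps).
all_comparisons = np.apply_along_axis(
    lambda base: np.mean(compared_ys > base, axis=0),
    axis=1,
    arr=baseline_ys,
)

# Average over baseline repeats to obtain a single win-rate curve.
win_rate = np.mean(all_comparisons, axis=0, keepdims=True)

# Equivalent broadcasting form: compare every (baseline, compared) pair.
pairwise = compared_ys[None, :, :] > baseline_ys[:, None, :]
win_rate_broadcast = pairwise.mean(axis=(0, 1))[None, :]
```

Note that for a win rate, parity between the two algorithms corresponds to a value of 0.5 rather than 0.0, which is allowed because the API does not enforce the score conventions described earlier.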

Now, we add a simple ComparatorFactory and inject the factory into add_comparison_metrics to create our new scoring plots. Note that one can also manually add customized PlotElements in histogram or scatter form.

class WinRateComparatorFactory(analyzers.ConvergenceComparatorFactory):
  """Factory class for WinRateComparator."""

  def __call__(
      self,
      baseline_curve: analyzers.ConvergenceCurve,
      compared_curve: analyzers.ConvergenceCurve,
      baseline_quantile: float = 0.5,
      compared_quantile: float = 0.5,
  ) -> analyzers.ConvergenceComparator:
    return WinRateComparator(
        baseline_curve=baseline_curve,
        compared_curve=compared_curve,
        baseline_quantile=baseline_quantile,
        compared_quantile=compared_quantile,
        name='win_rate',
    )


# Add WinRateComparator plots and visualize them.
analyzed_records_with_winrate = (
    analyzers.BenchmarkRecordAnalyzer.add_comparison_metrics(
        records=analyzed_records,
        baseline_algo='random',
        comparator_factory=WinRateComparatorFactory(),
    )
)
analyzers.plot_from_records(analyzed_records_with_winrate)

References

Benchmark analysis tools can be found here.
Convergence curve utils and comparators can be found here.