We will demonstrate below how to use our benchmark runner pipeline.
Installation and reference imports
```
!pip install google-vizier[jax,algorithms]
```
```python
from vizier import algorithms as vza
from vizier import benchmarks as vzb
from vizier.algorithms import designers
from vizier.benchmarks import experimenters
```
An example experimenter and designer factory, which we will use later:
```python
experimenter = experimenters.NumpyExperimenter(
    experimenters.bbob.Sphere,
    experimenters.bbob.DefaultBBOBProblemStatement(5),
)
designer_factory = designers.GridSearchDesigner.from_problem
```
Algorithms and Experimenters
Every study can be seen conceptually as a simple loop between an algorithm and an objective. In terms of code, the algorithm corresponds to a `Designer` or `Policy`, and the objective to an `Experimenter`.
Below is a simple sequential loop.
```python
designer = designer_factory(experimenter.problem_statement())
for _ in range(100):
  suggestion = designer.suggest(count=1)[0]  # suggest() returns a sequence.
  trial = suggestion.to_trial()
  experimenter.evaluate([trial])
  completed_trials = vza.CompletedTrials([trial])
  designer.update(completed_trials, vza.ActiveTrials())
```
One modification we can make to the loop above is to suggest and evaluate with variable batch sizes, rather than only one-by-one. More generally, several implementation questions arise:
- How many parallel suggestions should the algorithm generate?
- How many suggestions can be evaluated at once?
- Should we use early stopping on certain unpromising trials?
- Should we use a custom stopping condition instead of a fixed for-loop?
- Can we swap in a different algorithm mid-loop?
- Can we swap in a different objective mid-loop?
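For instance, the first two questions concern batching. Below is a minimal sketch of a batched suggest/evaluate loop, using a toy objective and random suggestions as hypothetical stand-ins for the Vizier API (none of these names come from the library):

```python
import random

# Toy stand-ins (hypothetical, not the Vizier API) showing how the sequential
# loop generalizes to batches: suggest several points per iteration and
# evaluate them together.

def evaluate_batch(xs):
    """Toy Sphere-like objective evaluated on a whole batch at once."""
    return [x * x for x in xs]

rng = random.Random(0)
history = []  # (suggestion, objective value) pairs
for _ in range(10):                                      # 10 iterations...
    batch = [rng.uniform(-5.0, 5.0) for _ in range(4)]   # ...of 4 suggestions
    values = evaluate_batch(batch)
    history.extend(zip(batch, values))

best_x, best_value = min(history, key=lambda pair: pair[1])
print(len(history))  # → 40
```

Varying the batch size per iteration, or stopping once `best_value` crosses a threshold, are exactly the kinds of loop variations the benchmark API below is designed to express.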
The code flexibility needed to simulate these real-life scenarios can cause complications, as the evaluation benchmark may no longer be stateless. To broadly cover such scenarios, our API introduces the `BenchmarkSubroutine`:
```python
class BenchmarkSubroutine(Protocol):
  """Abstraction for core benchmark routines.

  Benchmark protocols are modular alterations of BenchmarkState by reference.
  """

  def run(self, state: BenchmarkState) -> None:
    """Abstraction to alter BenchmarkState by reference."""
```
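To make the protocol pattern concrete, here is a self-contained toy analogue (the names are hypothetical, not the Vizier classes): any object with a conforming `run(state)` method counts as a subroutine, and subroutines mutate a shared state by reference:

```python
from typing import Protocol

class ToyState:
    """Toy stand-in for BenchmarkState: plain mutable state."""
    def __init__(self) -> None:
        self.num_suggestions = 0

class ToySubroutine(Protocol):
    """Structural protocol: anything with run(state) conforms."""
    def run(self, state: ToyState) -> None:
        ...

class GenerateOneSuggestion:
    """A conforming subroutine that alters the state in place."""
    def run(self, state: ToyState) -> None:
        state.num_suggestions += 1

state = ToyState()
GenerateOneSuggestion().run(state)
print(state.num_suggestions)  # → 1
```

Because the state is shared and mutated in place, stateful scenarios (algorithm swaps, early stopping, custom stopping conditions) become just different subroutines applied to the same state.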
All routines use, and potentially modify, a `BenchmarkState`, which holds information about the objective via an `Experimenter` and the algorithm itself wrapped by a `PolicySuggester`:
```python
class BenchmarkState:
  """State of a benchmark run. It is altered via benchmark protocols."""

  experimenter: Experimenter
  algorithm: PolicySuggester
```
To wrap multiple `BenchmarkSubroutine`s together, we can use the `BenchmarkRunner`:
```python
class BenchmarkRunner(BenchmarkSubroutine):
  """Run a sequence of subroutines, all repeated for a few iterations."""

  # A sequence of benchmark subroutines that alter BenchmarkState.
  benchmark_subroutines: Sequence[BenchmarkSubroutine]
  # Number of times to repeat applying benchmark_subroutines.
  num_repeats: int

  def run(self, state: BenchmarkState) -> None:
    """Run algorithm with benchmark subroutines with repetitions."""
```
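The intended `run()` semantics can be sketched with a self-contained toy analogue (hypothetical names, not the Vizier implementation): each repeat applies every subroutine in order, all mutating one shared state:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

ToyState = List[str]  # toy shared state: a log of executed steps

@dataclass
class ToyRunner:
    """Toy analogue of a runner: repeat a sequence of subroutines."""
    subroutines: Sequence[Callable[[ToyState], None]]
    num_repeats: int

    def run(self, state: ToyState) -> None:
        for _ in range(self.num_repeats):
            for subroutine in self.subroutines:
                subroutine(state)  # each call mutates state by reference

runner = ToyRunner(
    subroutines=[lambda s: s.append("suggest"),
                 lambda s: s.append("evaluate")],
    num_repeats=3,
)
state: ToyState = []
runner.run(state)
print(state)  # → ['suggest', 'evaluate', 'suggest', 'evaluate', 'suggest', 'evaluate']
```

Since the runner is itself a subroutine, runners can be nested to compose more elaborate benchmark loops.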
Below is a typical example of simple suggestion and evaluation:
```python
runner = vzb.BenchmarkRunner(
    benchmark_subroutines=[
        vzb.GenerateSuggestions(),
        vzb.EvaluateActiveTrials(),
    ],
    num_repeats=100,
)
benchmark_state_factory = vzb.DesignerBenchmarkStateFactory(
    experimenter=experimenter,
    designer_factory=designer_factory,
)
benchmark_state = benchmark_state_factory()
runner.run(benchmark_state)
```
We may obtain the evaluated trials via the `benchmark_state`, which contains a `PolicySupporter` via its `algorithm` field:
```python
all_trials = benchmark_state.algorithm.supporter.trials
print(all_trials)
```
Note that this design is maximally informative about everything that has happened so far in the study. For instance, we may also query incomplete/unused suggestions via the `PolicySupporter`.
Further reference details on Benchmark Runners can be found here.