We will demonstrate below how to use our benchmark runner pipeline.
Installation and reference imports
```
!pip install google-vizier[jax,algorithms]
```
```python
from vizier import algorithms as vza
from vizier import benchmarks as vzb
from vizier.algorithms import designers
from vizier.benchmarks import experimenters
```
An example experimenter and designer factory, which we will use later:
```python
experimenter = experimenters.NumpyExperimenter(
    experimenters.bbob.Sphere,
    experimenters.bbob.DefaultBBOBProblemStatement(5),
)
designer_factory = designers.GridSearchDesigner.from_problem
```
Algorithms and Experimenters
Every study can be seen conceptually as a simple loop between an algorithm and an objective. In terms of code, the algorithm corresponds to a `Designer` or `Policy`, and the objective to an `Experimenter`.
Below is a simple sequential loop.
```python
designer = designer_factory(experimenter.problem_statement())
for _ in range(100):
  suggestion = designer.suggest(count=1)[0]  # suggest() returns a sequence.
  trial = suggestion.to_trial()
  experimenter.evaluate([trial])
  completed_trials = vza.CompletedTrials([trial])
  designer.update(completed_trials, vza.ActiveTrials())
```
One modification we can make to the loop above is to suggest and evaluate with variable batch sizes, rather than only one-by-one. More generally, several implementation questions arise:
- How many parallel suggestions should the algorithm generate?
- How many suggestions can be evaluated at once?
- Should we use early stopping on certain unpromising trials?
- Should we use a custom stopping condition instead of a fixed for-loop?
- Can we swap in a different algorithm mid-loop?
- Can we swap in a different objective mid-loop?
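For instance, the first two questions concern batching. Below is a minimal sketch of a batched suggest/evaluate loop, using a toy objective and random suggestions as hypothetical stand-ins for the Vizier API (none of these names come from the library):

```python
import random

# Toy stand-ins (hypothetical, not the Vizier API) showing how the sequential
# loop generalizes to batches: suggest several points per iteration and
# evaluate them together.

def evaluate_batch(xs):
    """Toy Sphere-like objective evaluated on a whole batch at once."""
    return [x * x for x in xs]

rng = random.Random(0)
history = []  # (suggestion, objective value) pairs
for _ in range(10):                                      # 10 iterations...
    batch = [rng.uniform(-5.0, 5.0) for _ in range(4)]   # ...of 4 suggestions
    values = evaluate_batch(batch)
    history.extend(zip(batch, values))

best_x, best_value = min(history, key=lambda pair: pair[1])
print(len(history))  # → 40
```

Varying the batch size per iteration, or stopping once `best_value` crosses a threshold, are exactly the kinds of loop variations the benchmark API below is designed to express.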
The code flexibility needed to simulate these real-life scenarios can cause complications, as the evaluation benchmark may no longer be stateless. To broadly cover such scenarios, our API introduces the `BenchmarkSubroutine`:
```python
class BenchmarkSubroutine(Protocol):
  """Abstraction for core benchmark routines.

  Benchmark protocols are modular alterations of BenchmarkState by reference.
  """

  def run(self, state: BenchmarkState) -> None:
    """Abstraction to alter BenchmarkState by reference."""
```
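To make the protocol pattern concrete, here is a self-contained toy analogue (the names are hypothetical, not the Vizier classes): any object with a conforming `run(state)` method counts as a subroutine, and subroutines mutate a shared state by reference:

```python
from typing import Protocol

class ToyState:
    """Toy stand-in for BenchmarkState: plain mutable state."""
    def __init__(self) -> None:
        self.num_suggestions = 0

class ToySubroutine(Protocol):
    """Structural protocol: anything with run(state) conforms."""
    def run(self, state: ToyState) -> None:
        ...

class GenerateOneSuggestion:
    """A conforming subroutine that alters the state in place."""
    def run(self, state: ToyState) -> None:
        state.num_suggestions += 1

state = ToyState()
GenerateOneSuggestion().run(state)
print(state.num_suggestions)  # → 1
```

Because the state is shared and mutated in place, stateful scenarios (algorithm swaps, early stopping, custom stopping conditions) become just different subroutines applied to the same state.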
All routines use, and potentially modify, a `BenchmarkState`, which holds information about the objective via an `Experimenter` and the algorithm itself wrapped by a `PolicySuggester`:
```python
class BenchmarkState:
  """State of a benchmark run. It is altered via benchmark protocols."""

  experimenter: Experimenter
  algorithm: PolicySuggester
```
To wrap multiple `BenchmarkSubroutine`s together, we can use the `BenchmarkRunner`:
```python
class BenchmarkRunner(BenchmarkSubroutine):
  """Run a sequence of subroutines, all repeated for a few iterations."""

  # A sequence of benchmark subroutines that alter BenchmarkState.
  benchmark_subroutines: Sequence[BenchmarkSubroutine]
  # Number of times to repeat applying benchmark_subroutines.
  num_repeats: int

  def run(self, state: BenchmarkState) -> None:
    """Run algorithm with benchmark subroutines with repetitions."""
```
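The intended `run()` semantics can be sketched with a self-contained toy analogue (hypothetical names, not the Vizier implementation): each repeat applies every subroutine in order, all mutating one shared state:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

ToyState = List[str]  # toy shared state: a log of executed steps

@dataclass
class ToyRunner:
    """Toy analogue of a runner: repeat a sequence of subroutines."""
    subroutines: Sequence[Callable[[ToyState], None]]
    num_repeats: int

    def run(self, state: ToyState) -> None:
        for _ in range(self.num_repeats):
            for subroutine in self.subroutines:
                subroutine(state)  # each call mutates state by reference

runner = ToyRunner(
    subroutines=[lambda s: s.append("suggest"),
                 lambda s: s.append("evaluate")],
    num_repeats=3,
)
state: ToyState = []
runner.run(state)
print(state)  # → ['suggest', 'evaluate', 'suggest', 'evaluate', 'suggest', 'evaluate']
```

Since the runner is itself a subroutine, runners can be nested to compose more elaborate benchmark loops.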
Below is a typical example of simple suggestion and evaluation:
```python
runner = vzb.BenchmarkRunner(
    benchmark_subroutines=[
        vzb.GenerateSuggestions(),
        vzb.EvaluateActiveTrials(),
    ],
    num_repeats=100,
)
benchmark_state_factory = vzb.DesignerBenchmarkStateFactory(
    experimenter=experimenter,
    designer_factory=designer_factory,
)
benchmark_state = benchmark_state_factory()
runner.run(benchmark_state)
```
We may obtain the evaluated trials via the `benchmark_state`, which contains a `PolicySupporter` via its `algorithm` field:
```python
all_trials = benchmark_state.algorithm.supporter.trials
print(all_trials)
```
Note that this design is maximally informative about everything that has happened so far in the study. For instance, we may also query incomplete/unused suggestions via the `PolicySupporter`.
Further reference details on Benchmark Runners can be found here.