{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QWYBvrd2-GNH"
      },
      "source": [
        "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/vizier/blob/main/docs/guides/benchmarks/analyzing_benchmarks.ipynb)\n",
        "\n",
        "# Analyzing Benchmarks\n",
        "\n",
        "We will demonstrate below how to dstribute our benchmark runner pipeline over\n",
        "multiple benchmarks in conjunction with our suite of benchmark analysis tools to\n",
        "easily compare and visualize the performance of different algorithms over all\n",
        "benchmark problems."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MEydHqZa4F7C"
      },
      "source": [
        "## Installation and reference imports"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wkTkplOd4IDZ"
      },
      "outputs": [],
      "source": [
        "!pip install google-vizier[jax,algorithms]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "O_vf3X2W4KrB"
      },
      "outputs": [],
      "source": [
        "from vizier import benchmarks as vzb\n",
        "from vizier.algorithms import designers\n",
        "from vizier.benchmarks import experimenters\n",
        "from vizier.benchmarks import analyzers\n",
        "\n",
        "import itertools\n",
        "import numpy as np\n",
        "import pandas as pd"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0mV63H3C4Mw7"
      },
      "source": [
        "## Algorithm and Experimenter Factories\n",
        "\n",
        "To compare algorithms across multiple benchmarks, we want to first create a bunch of relevant benchmark experimenters. To do so, we use `SerializableExperimenterFactory` from our [Experimenters API](https://github.com/google/vizier/blob/main/vizier/benchmarks/experimenters/__init__.py) to modularize the construction of multiple benchmark components.\n",
        "\n",
        "For example, here we can create a diverse set of BBOB functions with different dimensions via the `BBOBExperimenterFactory`. Then, we can print out the full serialization of the benchmarks that we have created."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Py7bK9oF4OzR"
      },
      "outputs": [],
      "source": [
        "function_names = ['Sphere', 'Discus']\n",
        "dimensions = [4, 8]\n",
        "product_list = list(itertools.product(function_names, dimensions))\n",
        "\n",
        "experimenter_factories = []\n",
        "for product in product_list:\n",
        "  name, dim = product\n",
        "  bbob_factory = experimenters.BBOBExperimenterFactory(name=name, dim=dim)\n",
        "  experimenter_factories.append(bbob_factory)\n",
        "  print(bbob_factory.dump())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hAo_IKMV4TFa"
      },
      "source": [
        "As mentioned in our previous tutorial, we can create a `BenchmarkState` from our algorithm and experimenter factories and apply a `BenchmarkRunner` benchmarking protocol to run the algorithm. We end up with a list of `BenchmarkState` objects, each representing a different benchmark run, possibly with repeats.\n",
        "\n",
        "Conveniently, we provide analysis utility functions in our [Analyzers API](https://github.com/google/vizier/blob/main/vizier/benchmarks/analyzers.py) that convert our `BenchmarkState` into summarized curves stored compactly in `BenchmarkRecord`, which also holds the algorithm name and experimenter factory serialization. We can visualize and later analyze our results using a dataframe."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "c8RI8XN94Tp7"
      },
      "outputs": [],
      "source": [
        "NUM_REPEATS = 5  # @param\n",
        "NUM_ITERATIONS = 150  # @param\n",
        "\n",
        "runner = vzb.BenchmarkRunner(\n",
        "    benchmark_subroutines=[\n",
        "        vzb.GenerateSuggestions(),\n",
        "        vzb.EvaluateActiveTrials(),\n",
        "    ],\n",
        "    num_repeats=NUM_ITERATIONS,\n",
        ")\n",
        "algorithms = {\n",
        "    'grid': designers.GridSearchDesigner.from_problem,\n",
        "    'random': designers.RandomDesigner.from_problem,\n",
        "    'eagle': designers.EagleStrategyDesigner,\n",
        "}\n",
        "\n",
        "records = []\n",
        "for experimenter_factory in experimenter_factories:\n",
        "  for algo_name, algo_factory in algorithms.items():\n",
        "    benchmark_state_factory = vzb.ExperimenterDesignerBenchmarkStateFactory(\n",
        "        experimenter_factory=experimenter_factory, designer_factory=algo_factory\n",
        "    )\n",
        "    states = []\n",
        "    for _ in range(NUM_REPEATS):\n",
        "      benchmark_state = benchmark_state_factory()\n",
        "      runner.run(benchmark_state)\n",
        "      states.append(benchmark_state)\n",
        "    record = analyzers.BenchmarkStateAnalyzer.to_record(\n",
        "        algorithm=algo_name,\n",
        "        experimenter_factory=experimenter_factory,\n",
        "        states=states,\n",
        "    )\n",
        "    records.append(record)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Gdyzlkgs4XCf"
      },
      "outputs": [],
      "source": [
        "records_list = [\n",
        "    (rec.algorithm, dict(rec.experimenter_metadata), rec) for rec in records\n",
        "]\n",
        "df = pd.DataFrame(records_list, columns=['algorithm', 'experimenter', 'record'])\n",
        "df"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NwyTnBSu4bLk"
      },
      "source": [
        "## Visualization from Records\n",
        "\n",
        "Given a sequence of `BenchmarkRecords`, we provide utility plotting functions via the `matplotlib.pyplot` library to plot and visualize the relative performance of each algorithm on each benchmark. Currently, for single-objective optimization, we extract and plot the `objective` metric, which represents the objective of the best Trial seen so far as a function of Trial id/count (default).\n",
        "\n",
        "**Note**: this `objective` curve is monotonic and is computing upon converting to `BenchmarkRecord`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IbddkINO4fvO"
      },
      "outputs": [],
      "source": [
        "analyzers.plot_from_records(records)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "r2lPYeHZ4eHQ"
      },
      "source": [
        "Observe that `plot_from_records` is a general plotting utility function that generates a grid of algorithm comparison plots. Specifically, it generates one plot for each Experimenter x Metrics in records, where each row represents an Experimenter and each column is a Metric represented in the record's elements dictionary. Each plot has a curve for each algorithm."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jZh90Hxt4kFu"
      },
      "source": [
        "## Adding Analysis\n",
        "\n",
        "Oftentimes, further analysis is needed to normalize metrics across multiple benchmarks or to visualize more context-dependent metrics, such as visualizing the Pareto frontier as a scatter plot.\n",
        "\n",
        "We focus on the former case, where objective curves require some form of normalization for each comparison across benchmarks. Many success metrics have been proposed: win rates, relative convergence, normalized objective score, [NeurIPS competition scores](https://arxiv.org/pdf/2012.03826.pdf).\n",
        "\n",
        "To broadly cover such analysis scores, our [API](https://github.com/google/vizier/blob/main/vizier/benchmarks/__init__.py) introduces the `ConvergenceComparator` abstraction that compares two `ConvergenceCurve` at specified quantiles:\n",
        "\n",
        "```python\n",
        "@attr.define\n",
        "class ConvergenceComparator(abc.ABC):\n",
        "  \"\"\"(Simplified) Base class for convergence curve comparators.\n",
        "\n",
        "  Attributes:\n",
        "    baseline_curve: The baseline ConvergenceCurve.\n",
        "    compared_curve: The compared ConvergenceCurve.\n",
        "  \"\"\"\n",
        "\n",
        "  _baseline_curve: ConvergenceCurve = attr.field()\n",
        "  _compared_curve: ConvergenceCurve = attr.field()\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def score(self) -\u003e float:\n",
        "    \"\"\"Returns a summary score for the comparison between base and compared.\n",
        "\n",
        "    Usually, higher positive numbers mean the compared curve is better than the\n",
        "    baseline and vice versa.\n",
        "    \"\"\"\n",
        "    pass\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def curve(self) -\u003e ConvergenceCurve:\n",
        "    \"\"\"Returns a score curve for each xs.\"\"\"\n",
        "    pass\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0x8eLgvL4m9D"
      },
      "source": [
        "Generally, a higher score by convention should indicate that the compared curve is better than the baseline. Furthermore, a score of 0.0 indicates that the performance is similar and it would make sense of these scores to be symmetric. However, there is no such restrictions imposed on the API.\n",
        "\n",
        "As an example, we can add the `LogEfficiencyScore`, which is based off of [performance profiles](https://arxiv.org/pdf/cs/0102001.pdf), a gold standard in optimization benchmarking. The LogEfficiencyScore essentially measures the percentage of Trials needed for the compared algorithm to match the baseline performance. If score = 1, then the compared algorithm uses $e^{-1}*T$ Trials to reach the same objective as the baseline algorithm in $T$ trials.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PF0fABok4hun"
      },
      "outputs": [],
      "source": [
        "analyzed_records = analyzers.BenchmarkRecordAnalyzer.add_comparison_metrics(\n",
        "    records=records, baseline_algo='random'\n",
        ")\n",
        "analyzers.plot_from_records(analyzed_records)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7oRyLMHw4rmX"
      },
      "source": [
        "## Custom Comparators\n",
        "\n",
        " To write a custom `ConvergenceComparator`, simply follow the abstract class defined above and form a `ConvergenceComparatorFactory`, which can then be passed into `add_comparison_metrics`. Note that we are constantly adding more benchmarking scores into our analyzers base and welcome submissions.\n",
        "\n",
        " Let us try to write a custom `WinRateComparator` that looks at the simple metric of comparing whether one curve is better than the other, for each `xs`.\n",
        "\n",
        " **NOTE:** You may always assume in a `Comparator` that both curves are either `INCREASING` (sign = 1) or `DECREASING` (signand that the sign of the curves is stored in `self._sign`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ESUcbAQhCrMg"
      },
      "outputs": [],
      "source": [
        "class WinRateComparator(analyzers.ConvergenceComparator):\n",
        "  \"\"\"Comparator method based on simple win rate comparison.\"\"\"\n",
        "\n",
        "  def score(self) -\u003e float:\n",
        "    return np.nanmedian(self.curve().ys)\n",
        "\n",
        "  def curve(self) -\u003e analyzers.ConvergenceCurve:\n",
        "    baseline_ys = self._sign * self._baseline_curve.ys\n",
        "    compared_ys = self._sign * self._compared_curve.ys\n",
        "\n",
        "    # Compares all pairs of compared to baseline curve.\n",
        "    all_comparisons = np.apply_along_axis(\n",
        "        lambda base: np.mean(compared_ys \u003e base, axis=0),\n",
        "        axis=1,\n",
        "        arr=baseline_ys,\n",
        "    )\n",
        "    return analyzers.ConvergenceCurve(\n",
        "        xs=self._baseline_curve.xs,\n",
        "        ys=np.mean(all_comparisons, axis=0, keepdims=True),\n",
        "    )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "If1ct2I8tV-z"
      },
      "source": [
        "Now, we add a simple ComparatorFactory and inject the factory into `add_comparison_metrics` to create our new scoring plots. Note that one can also manually add customized `PlotElements` that can be in histogram, or scatter form."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "5CEuy-nhv2nW"
      },
      "outputs": [],
      "source": [
        "class WinRateComparatorFactory(analyzers.ConvergenceComparatorFactory):\n",
        "  \"\"\"Factory class for WinRateComparatorFactory.\"\"\"\n",
        "\n",
        "  def __call__(\n",
        "      self,\n",
        "      baseline_curve: analyzers.ConvergenceCurve,\n",
        "      compared_curve: analyzers.ConvergenceCurve,\n",
        "      baseline_quantile: float = 0.5,\n",
        "      compared_quantile: float = 0.5,\n",
        "  ) -\u003e analyzers.ConvergenceComparator:\n",
        "    return WinRateComparator(\n",
        "        baseline_curve=baseline_curve,\n",
        "        compared_curve=compared_curve,\n",
        "        baseline_quantile=baseline_quantile,\n",
        "        compared_quantile=compared_quantile,\n",
        "        name='win_rate',\n",
        "    )\n",
        "\n",
        "\n",
        "# Add WinRateComparator plots and visualize them.\n",
        "analyzed_records_with_winrate = (\n",
        "    analyzers.BenchmarkRecordAnalyzer.add_comparison_metrics(\n",
        "        records=analyzed_records,\n",
        "        baseline_algo='random',\n",
        "        comparator_factory=WinRateComparatorFactory(),\n",
        "    )\n",
        ")\n",
        "analyzers.plot_from_records(analyzed_records_with_winrate)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BOlTT1OGGuyM"
      },
      "source": [
        "## Summarizing with Comparators\n",
        "\n",
        "One of the main benefits of using `ConvergenceComparator` is that these metrics are usually already normalized across different optimization problems and contexts. We provide utility functions to perform summarization procedures across `BenchmarkRecords`, which can used with the `summarize` feature of the `BenchmarkRecordAnalyzer`.\n",
        "\n",
        "Note that we can write a custom score and summary function, but we use our default summarizer that simply plots the distribution of scores across problems as histograms.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7HrPNvSaHxzC"
      },
      "outputs": [],
      "source": [
        "import json\n",
        "# Summarized across experimenters by giving reduced keys.\n",
        "def record_to_reduced_keys(record):\n",
        "  bbob_dict = json.loads(record.experimenter_metadata['bbob_factory'])\n",
        "  experimenter = bbob_dict.pop('name')\n",
        "  return experimenter, json.dumps(bbob_dict)\n",
        "\n",
        "summarized_records = analyzers.BenchmarkRecordAnalyzer.summarize(analyzed_records_with_winrate, record_to_reduced_keys)\n",
        "analyzers.plot_from_records(summarized_records)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qdnQhNhQ4tVo"
      },
      "source": [
        "## References\n",
        "*   Benchmark analysis tools can be found [here](https://github.com/google/vizier/tree/main/vizier/benchmarks/analyzers.py).\n",
        "*   Convergence curve utils and comparators can be found [here](https://github.com/google/vizier/tree/main/vizier/_src/benchmarks/analyzers/convergence_curve.py)\n",
        "\n"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "last_runtime": {
        "build_target": "//learning/vizier/service/colab:notebook",
        "kind": "private"
      },
      "name": "Analyzing Benchmarks.ipynb",
      "private_outputs": true,
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}