docs: Add basic tutorials

Add two basic tutorials, one for grid_search and one for hp_optimization. The goal of these tutorials is to get new users started. As such, they are intentionally minimalistic and do not explain all the available options. They should enable beginners to get cluster_utils running quickly and then explore more options step by step (though much more documentation needs to be written for that eventually). I put them in the "Basics" section for now, but if more tutorials are added, it might make sense to add a dedicated "Tutorials" section.
martius-lab · Oct 31, 2024 · 71d0873
1 parent 58a363c
Showing 5 changed files with 1,824 additions and 0 deletions.
diff --git a/docs/configuration.rst b/docs/configuration.rst
@@ -530,6 +530,8 @@ settings (i.e. the ones independent of the optimisation method set in
     not set, the user will be asked at runtime in this case.
 
 
+.. _config.hp_optimization_iterations:
+
 About Iterations
 ~~~~~~~~~~~~~~~~
 

diff --git a/docs/images/Rosenbrock-contour.svg b/docs/images/Rosenbrock-contour.svg
diff --git a/docs/index.rst b/docs/index.rst
@@ -64,6 +64,8 @@ For more information see :doc:`usage` and the examples in the ``examples/basic/`
 
    installation
    usage
+   tutorials/grid_search.rst
+   tutorials/hp_optimization.rst
 
 
 .. toctree::

diff --git a/docs/tutorials/grid_search.rst b/docs/tutorials/grid_search.rst
@@ -0,0 +1,173 @@
+***************************
+Tutorial: Basic Grid Search
+***************************
+
+In this tutorial, we learn how to set up cluster_utils to run a basic grid search on an
+arbitrary optimization function.  It does not cover all available options but instead
+shows the minimal steps needed to get started.
+
+--------
+
+Prepare your code
+=================
+
+For the sake of this tutorial, we will use the two-dimensional Rosenbrock function.
+However, any other function could be used here without affecting the general setup to
+run with cluster_utils.
+
+.. code-block:: python
+
+    def rosenbrock(x, y):
+        return (1 - x) ** 2 + 100 * (y - x**2) ** 2
+
+The function has a minimum value of zero at (x, y) = (1, 1):
+
+.. figure:: ../images/Rosenbrock-contour.svg
+   :alt: Plot of the Rosenbrock function.
+
+   Image by Nschloe - Own work, CC BY-SA 4.0, `link <https://commons.wikimedia.org/w/index.php?curid=114931732>`_
+
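+As a quick sanity check (a minimal sketch that is not part of the job script), the
+function can be evaluated at the minimum and at a few other points:
+
+.. code-block:: python
+
+    def rosenbrock(x, y):
+        return (1 - x) ** 2 + 100 * (y - x**2) ** 2
+
+    # The global minimum is at (1, 1), where both squared terms vanish.
+    print(rosenbrock(1.0, 1.0))
+
+    # Any other point gives a strictly positive value.
+    for x, y in [(0.0, 0.0), (1.0, 1.5), (0.9, 0.81)]:
+        print((x, y), rosenbrock(x, y))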
+
+To be able to run the grid search on this function, we need to write a small script,
+which we call ``rosenbrock.py`` in the following:
+
+
+.. code-block:: python
+
+    # rosenbrock.py
+    from cluster_utils import cluster_main
+
+    def rosenbrock(x, y):
+        return (1 - x) ** 2 + 100 * (y - x**2) ** 2
+
+    @cluster_main
+    def main(**params):
+        value = rosenbrock(params["x"], params["y"])
+
+        metrics = {"rosenbrock_value": value}
+        return metrics
+
+    if __name__ == "__main__":
+        main()
+
+
+This script will later be called by cluster_utils for each set of parameters in the grid
+search.
+
+**cluster_utils expects your code to be committed to a Git repository.**  This
+helps to keep track of the exact version of the code you ran the grid search on (the
+Git revision will be included in the report).  Thus, create a Git repository, commit the
+``rosenbrock.py`` script and push it to the remote (cluster_utils will later pull from
+there).
+
+
+Write a cluster_utils configuration file
+========================================
+
+Now we need to write a configuration file to tell cluster_utils how to run the script,
+which parameters to do the grid search over, where to save results, etc.
+
+This config file can be written in JSON, YAML or TOML.  In the following, we use TOML,
+but the other formats work just as well (JSON is discouraged, though, as it is rather
+annoying to write by hand and doesn't support comments).
+
+
+.. code-block:: toml
+
+    # Name and base of the output directory.  With the given config, results will be
+    # written to /tmp/rosenbrock_grid_search/.
+    optimization_procedure_name = "rosenbrock_grid_search"
+    results_dir = "/tmp"
+
+    # Automatically generate a PDF report when finished
+    generate_report = "when_finished"
+
+    # Path to the job script.  Note that this is relative to the repository's root
+    # directory, not to this config file!
+    script_relative_path = "rosenbrock.py"
+
+    # How often to run each configuration (useful if there is some randomness
+    # in the result).
+    restarts = 1
+
+    [git_params]
+    # which repo/branch to check out
+    url = "<url to your git repository>"
+    branch = "main"
+
+    [cluster_requirements]
+    request_cpus = 1
+
+    [environment_setup]
+    # This section is required, even if no options are set here.
+
+    [fixed_params]
+    # Likewise required but may be empty.
+
+    [[hyperparam_list]]
+    param = "x"
+    values = [0.0, 0.5, 1.0, 1.5, 2.0]
+
+    [[hyperparam_list]]
+    param = "y"
+    values = [0.0, 0.5, 1.0, 1.5, 2.0]
+
+
+In plain words, this config tells cluster_utils to do the following: Run a grid search
+over the two parameters ``x`` and ``y``, checking the values
+``[0.0, 0.5, 1.0, 1.5, 2.0]`` for each of them (entries in ``hyperparam_list``).  Get
+the Python script ``rosenbrock.py`` (``script_relative_path``) from the specified Git
+repository (``git_params``).  For each combination of ``(x, y)``, execute the script
+once (``restarts``) on a single CPU core (``cluster_requirements``).  When finished,
+generate a PDF report (``generate_report``) and store it, together with other output
+files, in ``/tmp/rosenbrock_grid_search`` (``optimization_procedure_name``,
+``results_dir``).
+
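+The set of jobs resulting from this config can be sketched in plain Python.  This is
+only an illustration of what "grid search" means here, not how cluster_utils
+implements it; the variable names are made up for this example.
+
+.. code-block:: python
+
+    import itertools
+
+    def rosenbrock(x, y):
+        return (1 - x) ** 2 + 100 * (y - x**2) ** 2
+
+    # Same values as in the two [[hyperparam_list]] entries of the config.
+    x_values = [0.0, 0.5, 1.0, 1.5, 2.0]
+    y_values = [0.0, 0.5, 1.0, 1.5, 2.0]
+
+    # Every combination of x and y becomes one job: 5 * 5 = 25 jobs.
+    grid = list(itertools.product(x_values, y_values))
+    print(len(grid))
+
+    # The best grid point is the one with the smallest function value.
+    best = min(grid, key=lambda point: rosenbrock(*point))
+    print(best, rosenbrock(*best))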
+
+**Note:** You will need to adjust the settings in the ``[git_params]`` section to point
+to the repository that contains ``rosenbrock.py``.
+
+
+Run the grid search
+===================
+
+Now you can run the grid search locally:
+
+.. code-block:: sh
+
+    python3 -m cluster_utils.grid_search path/to/config.toml
+
+It will detect that it is not executed on a cluster and ask for confirmation to run
+locally.  Simply press enter to confirm.  It will then start executing jobs, and, when
+finished, create a report.  The output should look something like this:
+
+.. code-block:: text
+
+    Detailed logging available in /tmp/rosenbrock_grid_search/cluster_run.log
+    Creating directory /tmp/rosenbrock_grid_search/working_directories
+    Logs of individual jobs stored at /home/arada/.cache/cluster_utils/rosenbrock_grid_search-20241031-135040-jobs
+    Using project direcory /home/arada/.cache/cluster_utils/rosenbrock_grid_search-20241031-135040-project
+    No cluster detected. Do you want to run locally? [Y/n]: 
+    Completed:  92%|████████████████████████████████████████████████████▋        | 23/25
+    Started execution:  92%|████████████████████████████████████       | 23/25, Failed=0
+    Submitted: 100%|█████████████████████████████████████████████████████████████| 25/25
+
+    Killing remaining jobs...
+    Results are stored in /tmp/rosenbrock_grid_search
+    Procedure successfully finished
+    Producing basic report... 
+    Report saved at /tmp/rosenbrock_grid_search/rosenbrock_grid_search_report.pdf
+
+All results of the grid search are stored in ``/tmp/rosenbrock_grid_search``.  The most
+relevant files are:
+
+- ``rosenbrock_grid_search_report.pdf``: The PDF report, which includes a list of the
+  best parameters and several plots for further analysis.
+- ``all_data.csv``: Results of all runs as a CSV file.
+- ``cluster_run.log``: Log of cluster_utils.  Useful for debugging if something goes
+  wrong.
+
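+The CSV file can be inspected with standard tools.  The following is a minimal sketch
+using only the Python standard library.  It assumes that ``all_data.csv`` has one
+column per parameter and per metric, named ``x``, ``y`` and ``rosenbrock_value``; check
+the header of your file, as the actual column names may differ.  The helper
+``best_run`` is made up for this example and is not part of cluster_utils.
+
+.. code-block:: python
+
+    import csv
+
+    def best_run(csv_path, metric="rosenbrock_value"):
+        """Return the row (as a dict) with the smallest value of ``metric``."""
+        with open(csv_path, newline="") as f:
+            rows = list(csv.DictReader(f))
+        return min(rows, key=lambda row: float(row[metric]))
+
+    # Example usage (path as configured in this tutorial):
+    # best = best_run("/tmp/rosenbrock_grid_search/all_data.csv")
+    # print(best["x"], best["y"], best["rosenbrock_value"])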
+
+.. important::
+
+   Every time you run cluster_utils, it creates a temporary working copy of the
+   specified Git repository.  This means that when you change the code, you need to
+   **commit and push** the changes before running cluster_utils again.
diff --git a/docs/tutorials/hp_optimization.rst b/docs/tutorials/hp_optimization.rst
@@ -0,0 +1,197 @@
+*******************************************
+Tutorial: Basic Hyperparameter Optimization
+*******************************************
+
+In this tutorial, we learn how to set up cluster_utils to run a basic hyperparameter
+optimization on an arbitrary optimization function.  It does not cover all available
+options but instead shows the minimal steps needed to get started.
+
+.. note:: If you haven't done so, please read :doc:`grid_search` first.
+
+--------
+
+Prepare your code
+=================
+
+Please see the corresponding section in :doc:`grid_search`.  The exact same job
+script/git repository can be used here.
+
+
+
+Write a cluster_utils configuration file
+========================================
+
+The configuration file generally has the same structure as for grid search but some
+settings differ.
+
+.. code-block:: toml
+
+    # Name and base of the output directory.  With the given config, results will be
+    # written to /tmp/rosenbrock_optimization/.
+    optimization_procedure_name = "rosenbrock_optimization"
+    results_dir = "/tmp"
+
+    # Automatically generate a PDF report when finished
+    generate_report = "when_finished"
+
+    # Path to the job script.  Note that this is relative to the repository's root
+    # directory, not to this config file!
+    script_relative_path = "rosenbrock.py"
+
+    # which optimizer to use
+    optimizer_str = "cem_metaoptimizer"
+
+    # keep data of the 5 best runs (useful, for example, if checkpoints are saved)
+    num_best_jobs_whose_data_is_kept = 5
+
+    [git_params]
+    # which repo/branch to check out
+    url = "<url to your git repository>"
+    branch = "main"
+
+    [cluster_requirements]
+    request_cpus = 1
+
+    [environment_setup]
+    # This section is required, even if no options are set here.
+
+    [fixed_params]
+    # Likewise required but may be empty.
+
+    [optimizer_settings]
+    with_restarts = false
+    num_jobs_in_elite = 10
+
+    [optimization_setting]
+    # Which metric value to optimize on.  Refers to the metrics dictionary that is
+    # returned in rosenbrock.py.
+    metric_to_optimize = "rosenbrock_value"
+    minimize = true
+
+    # total number of samples that are tested
+    number_of_samples = 100
+    # how many jobs to run in parallel
+    n_jobs_per_iteration = 10
+
+    [[optimized_params]]
+    param = "x"
+    distribution = "TruncatedNormal"
+    bounds = [ 0, 2 ]
+
+    [[optimized_params]]
+    param = "y"
+    distribution = "TruncatedNormal"
+    bounds = [ 0, 2 ]
+
+
+
+Compared to the configuration from the :doc:`grid search tutorial <grid_search>`, the
+``restarts`` and ``hyperparam_list`` settings are gone.  Instead, several other
+settings have been added, which we will go through in the following:
+
+
+.. code-block:: toml
+
+    optimizer_str = "cem_metaoptimizer"
+
+The type of optimizer to use (see :confval:`optimizer_str` for available options).
+
+.. code-block:: toml
+
+    num_best_jobs_whose_data_is_kept = 5
+
+With this setting, the full output of the best 5 jobs throughout the whole optimization
+is kept.  This is mostly useful if your jobs store additional data (e.g. training
+snapshots), which you might want to analyse when finished.
+
+
+.. code-block:: toml
+
+    [optimizer_settings]
+    with_restarts = false
+    num_jobs_in_elite = 10
+
+Settings specific to the chosen optimizer.  See :ref:`config.optimizer_settings`.
+
+.. code-block:: toml
+
+    [optimization_setting]
+    # Which metric value to optimize on.  Refers to the metrics dictionary that is
+    # returned in rosenbrock.py.
+    metric_to_optimize = "rosenbrock_value"
+    minimize = true
+
+    # total number of samples that are tested
+    number_of_samples = 100
+    # how many jobs to run in parallel
+    n_jobs_per_iteration = 10
+
+These are general optimization settings that are valid for all optimizers.  Here we
+specify which metric should be used for the optimization (in this tutorial, we only
+return one value in ``rosenbrock.py`` but there could be multiple) and whether it should
+be minimized or maximized.
+
+Furthermore, the total number of samples and the number of jobs per iteration are
+configured here.  See :ref:`config.optimization_settings` for more information.
+
+.. code-block:: toml
+
+    [[optimized_params]]
+    param = "x"
+    distribution = "TruncatedNormal"
+    bounds = [ 0, 2 ]
+
+    [[optimized_params]]
+    param = "y"
+    distribution = "TruncatedNormal"
+    bounds = [ 0, 2 ]
+
+Finally, the hyperparameters that should be optimized are specified.  In this example,
+we use a truncated normal distribution over the range [0, 2] for both variables.  See
+:confval:`optimized_params` for a list of available distributions.
+
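+To give an intuition for what such an optimization does, here is a toy sketch of a
+cross-entropy-style loop in plain Python.  It is **not** the cluster_utils
+implementation; the function names, the rejection sampling and the way the
+distribution is refitted are assumptions made for illustration only.  In each
+iteration it samples ``x`` and ``y`` from truncated normal distributions, evaluates the
+function and refits the distributions to the best samples seen so far.
+
+.. code-block:: python
+
+    import random
+    import statistics
+
+    def rosenbrock(x, y):
+        return (1 - x) ** 2 + 100 * (y - x**2) ** 2
+
+    def sample_truncated_normal(rng, mean, std, low, high):
+        """Draw from a normal distribution, rejecting samples outside [low, high]."""
+        while True:
+            value = rng.gauss(mean, std)
+            if low <= value <= high:
+                return value
+
+    def toy_cem(bounds=(0.0, 2.0), n_iterations=10, jobs_per_iteration=10,
+                elite_size=10, seed=0):
+        rng = random.Random(seed)
+        low, high = bounds
+        # Start with a broad distribution centred in the allowed range.
+        mean = {"x": (low + high) / 2, "y": (low + high) / 2}
+        std = {"x": (high - low) / 2, "y": (high - low) / 2}
+        results = []  # (value, x, y) for all samples so far
+
+        for _ in range(n_iterations):
+            for _ in range(jobs_per_iteration):
+                x = sample_truncated_normal(rng, mean["x"], std["x"], low, high)
+                y = sample_truncated_normal(rng, mean["y"], std["y"], low, high)
+                results.append((rosenbrock(x, y), x, y))
+
+            # Refit the sampling distribution to the best samples seen so far.
+            elite = sorted(results)[:elite_size]
+            for name, index in (("x", 1), ("y", 2)):
+                values = [entry[index] for entry in elite]
+                mean[name] = statistics.mean(values)
+                # Small floor keeps the distribution from collapsing to a point.
+                std[name] = max(statistics.stdev(values), 1e-6)
+
+        # Best (value, x, y) over all samples.
+        return min(results)
+
+    value, x, y = toy_cem()
+    print(f"best value {value:.4f} at x={x:.2f}, y={y:.2f}")
+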
+**Note:** You will need to adjust the settings in the ``[git_params]`` section to point
+to the repository that contains ``rosenbrock.py``.
+
+
+Run the hyperparameter optimization
+===================================
+
+Now you can run the hyperparameter optimization locally:
+
+.. code-block:: sh
+
+    python3 -m cluster_utils.hp_optimization path/to/config.toml
+
+The output during execution is similar to grid search.  However, after each
+"iteration" (see :ref:`config.hp_optimization_iterations`), a list of current best
+results is printed:
+
+.. code-block:: text
+
+           x     y  rosenbrock_value  job_restarts  rosenbrock_value__std
+    20  1.00  1.00          0.000000             3                    0.0
+    15  0.90  0.81          0.010000             1                    NaN
+    18  0.96  1.00          0.616256             1                    NaN
+    17  0.95  1.00          0.953125             1                    NaN
+    10  0.85  0.82          0.973125             1                    NaN
+    13  0.89  0.90          1.176341             1                    NaN
+    8   0.80  0.50          2.000000             1                    NaN
+    21  1.00  1.20          4.000000             1                    NaN
+    14  0.90  0.60          4.420000             2                    0.0
+    9   0.80  0.90          6.800000             1                    NaN
+
+
+The result files in the output directory are also similar to grid search.  The most
+important ones are:
+
+- ``result.pdf``:  The PDF report.
+- ``all_data.csv``:  Results of all runs as a CSV file.
+- ``cluster_run.log``: Log of cluster_utils.  Useful for debugging if something goes
+  wrong.
+
+
+.. important::
+
+   Every time you run cluster_utils, it creates a temporary working copy of the
+   specified Git repository.  This means that when you change the code, you need to
+   **commit and push** the changes before running cluster_utils again.