docs: Add basic tutorials
Add two basic tutorials for grid_search and hp_optimization.
The goal of these tutorials is to get new users started.  As such, they
are intentionally minimalistic and do not explain all the available
options.  They should enable beginners to get cluster_utils running fast
and from there allow them to explore more options step by step (though
for that much more documentation needs to be written eventually).

I put them in the "Basics" section for now, but if more tutorials are
added, it might make sense to add a dedicated "Tutorials" section.
luator committed Oct 31, 2024
1 parent 58a363c commit 71d0873
Showing 5 changed files with 1,824 additions and 0 deletions.
2 changes: 2 additions & 0 deletions docs/configuration.rst
@@ -530,6 +530,8 @@ settings (i.e. the ones independent of the optimisation method set in
not set, the user will be asked at runtime in this case.


.. _config.hp_optimization_iterations:

About Iterations
~~~~~~~~~~~~~~~~

1,450 changes: 1,450 additions & 0 deletions docs/images/Rosenbrock-contour.svg
2 changes: 2 additions & 0 deletions docs/index.rst
@@ -64,6 +64,8 @@ For more information see :doc:`usage` and the examples in the ``examples/basic/`

   installation
   usage
   tutorials/grid_search.rst
   tutorials/hp_optimization.rst


.. toctree::
173 changes: 173 additions & 0 deletions docs/tutorials/grid_search.rst
@@ -0,0 +1,173 @@
***************************
Tutorial: Basic Grid Search
***************************

In this tutorial, we learn how to set up cluster_utils to run a basic grid search on an
arbitrary optimization function. It does not cover all available options but instead
shows the minimal steps needed to get started.

--------

Prepare your code
=================

For the sake of this tutorial, we will use the two-dimensional Rosenbrock function.
However, any other function could be used here without affecting the general setup to
run with cluster_utils.

.. code-block:: python

   def rosenbrock(x, y):
       return (1 - x) ** 2 + 100 * (y - x**2) ** 2

The function has a minimum value of zero at (x, y) = (1, 1):

.. figure:: ../images/Rosenbrock-contour.svg
   :alt: Plot of the Rosenbrock function.

   Image by Nschloe - Own work, CC BY-SA 4.0, `link <https://commons.wikimedia.org/w/index.php?curid=114931732>`_
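
We can verify this directly (a quick sanity check, using the function defined above):

.. code-block:: python

   print(rosenbrock(1.0, 1.0))  # 0.0 -- the global minimum
   print(rosenbrock(0.0, 0.0))  # 1.0 -- any other point yields a larger value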


To be able to run the grid search on this function, we need to write a small script,
which we will call ``rosenbrock.py``:


.. code-block:: python

   # rosenbrock.py
   from cluster_utils import cluster_main


   def rosenbrock(x, y):
       return (1 - x) ** 2 + 100 * (y - x**2) ** 2


   @cluster_main
   def main(**params):
       value = rosenbrock(params["x"], params["y"])
       metrics = {"rosenbrock_value": value}
       return metrics


   if __name__ == "__main__":
       main()

This script will later be called by cluster_utils for each set of parameters in the
grid search.
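
To make the contract concrete: cluster_utils calls ``main()`` with one parameter
combination at a time and expects a dictionary of metrics back. Conceptually it boils
down to the following (the parameter values here are hypothetical, for illustration
only):

.. code-block:: python

   params = {"x": 0.5, "y": 1.0}  # one point of the grid defined below
   value = rosenbrock(params["x"], params["y"])
   print({"rosenbrock_value": value})  # {'rosenbrock_value': 56.5}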

**cluster_utils expects your code to be committed to a Git repository.** This
helps to keep track of the exact version of the code you ran the grid search on (the
Git revision will be included in the report). Thus, create a Git repository, commit the
``rosenbrock.py`` script, and push it to the remote (cluster_utils will later pull from
there).


Write a cluster_utils configuration file
========================================

Now we need to write a configuration file that tells cluster_utils how to run the
script, which parameters to search over, where to save results, etc.

The config file can be written in JSON, YAML or TOML. In the following, we use TOML,
but the other formats work just as well (JSON is discouraged, though, as it is rather
tedious to write by hand and does not support comments).


.. code-block:: toml

   # Name and base of the output directory. With the given config, results will be
   # written to /tmp/rosenbrock_grid_search/.
   optimization_procedure_name = "rosenbrock_grid_search"
   results_dir = "/tmp"

   # Automatically generate a PDF report when finished
   generate_report = "when_finished"

   # Path to the job script. Note that this is relative to the repository's root
   # directory, not to this config file!
   script_relative_path = "rosenbrock.py"

   # How often to run each configuration (useful if there is some randomness
   # in the result).
   restarts = 1

   [git_params]
   # which repo/branch to check out
   url = "<url to your git repository>"
   branch = "main"

   [cluster_requirements]
   request_cpus = 1

   [environment_setup]
   # This section is required, even if no options are set here.

   [fixed_params]
   # Likewise required but may be empty.

   [[hyperparam_list]]
   param = "x"
   values = [0.0, 0.5, 1.0, 1.5, 2.0]

   [[hyperparam_list]]
   param = "y"
   values = [0.0, 0.5, 1.0, 1.5, 2.0]

In plain words, this config tells cluster_utils to do the following: Run a grid search
over the two parameters "x" and "y", checking the values ``[0.0, 0.5, 1.0, 1.5, 2.0]``
for each of them (entries in ``hyperparam_list``). Get the Python script
``rosenbrock.py`` (``script_relative_path``) from the specified Git repository
(``git_params``). For each combination of ``(x, y)``, execute the script once
(``restarts``) on a single CPU core (``cluster_requirements``). When finished, generate
a nice PDF report (``generate_report``) and store it, together with other output files,
in ``/tmp/rosenbrock_grid_search`` (``optimization_procedure_name``, ``results_dir``).


**Note:** You will need to adjust the settings in the ``[git_params]`` section to point
to the repository that contains the ``rosenbrock.py`` script.


Run the grid search
===================

Now you can run the grid search locally:

.. code-block:: sh

   python3 -m cluster_utils.grid_search path/to/config.toml

It will detect that it is not executed on a cluster and ask for confirmation to run
locally. Simply press enter to confirm. It will then start executing jobs and, when
finished, create a report. The output should look something like this:

.. code-block:: text

   Detailed logging available in /tmp/rosenbrock_grid_search/cluster_run.log
   Creating directory /tmp/rosenbrock_grid_search/working_directories
   Logs of individual jobs stored at /home/arada/.cache/cluster_utils/rosenbrock_grid_search-20241031-135040-jobs
   Using project direcory /home/arada/.cache/cluster_utils/rosenbrock_grid_search-20241031-135040-project
   No cluster detected. Do you want to run locally? [Y/n]:
   Completed: 92%|████████████████████████████████████████████████████▋ | 23/25
   Started execution: 92%|████████████████████████████████████ | 23/25, Failed=0
   Submitted: 100%|█████████████████████████████████████████████████████████████| 25/25
   Killing remaining jobs...
   Results are stored in /tmp/rosenbrock_grid_search
   Procedure successfully finished
   Producing basic report...
   Report saved at /tmp/rosenbrock_grid_search/rosenbrock_grid_search_report.pdf

All results of the grid search are stored in ``/tmp/rosenbrock_grid_search``. The most
relevant files are:

- ``rosenbrock_grid_search_report.pdf``: The PDF report, which includes a list of the
  best parameters and several plots for further analysis.
- ``all_data.csv``: Results of all runs as a CSV file.
- ``cluster_run.log``: Log of cluster_utils. Useful for debugging if something goes
  wrong.
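
For custom analysis beyond the report, ``all_data.csv`` can be loaded with any CSV
tool. A minimal sketch using pandas (assuming pandas is installed; the exact column
names may vary between cluster_utils versions):

.. code-block:: python

   import pandas as pd

   df = pd.read_csv("/tmp/rosenbrock_grid_search/all_data.csv")

   # parameter combinations with the lowest function value first
   print(df.sort_values("rosenbrock_value").head())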


.. important::

   Every time you run cluster_utils, it creates a temporary working copy of the
   specified git repository. This means that when you make changes to the code, you
   need to **commit and push** them before running cluster_utils again.
197 changes: 197 additions & 0 deletions docs/tutorials/hp_optimization.rst
@@ -0,0 +1,197 @@
*******************************************
Tutorial: Basic Hyperparameter Optimization
*******************************************

In this tutorial, we learn how to set up cluster_utils to run a basic hyperparameter
optimization on an arbitrary optimization function. It does not cover all available
options but instead shows the minimal steps needed to get started.

.. note:: If you haven't done so, please read :doc:`grid_search` first.

--------

Prepare your code
=================

Please see the corresponding section in :doc:`grid_search`. The exact same job
script/git repository can be used here.



Write a cluster_utils configuration file
========================================

The configuration file generally has the same structure as for grid search, but some
settings differ.

.. code-block:: toml

   # Name and base of the output directory. With the given config, results will be
   # written to /tmp/rosenbrock_optimization/.
   optimization_procedure_name = "rosenbrock_optimization"
   results_dir = "/tmp"

   # Automatically generate a PDF report when finished
   generate_report = "when_finished"

   # Path to the job script. Note that this is relative to the repository's root
   # directory, not to this config file!
   script_relative_path = "rosenbrock.py"

   # which optimizer to use
   optimizer_str = "cem_metaoptimizer"

   # keep data of the 5 best runs (useful, for example, if checkpoints are saved)
   num_best_jobs_whose_data_is_kept = 5

   [git_params]
   # which repo/branch to check out
   url = "<url to your git repository>"
   branch = "main"

   [cluster_requirements]
   request_cpus = 1

   [environment_setup]
   # This section is required, even if no options are set here.

   [fixed_params]
   # Likewise required but may be empty.

   [optimizer_settings]
   with_restarts = false
   num_jobs_in_elite = 10

   [optimization_setting]
   # Which metric value to optimize on. Refers to the metrics dictionary that is
   # returned in rosenbrock.py.
   metric_to_optimize = "rosenbrock_value"
   minimize = true

   # total number of samples that are tested
   number_of_samples = 100

   # how many jobs to run in parallel
   n_jobs_per_iteration = 10

   [[optimized_params]]
   param = "x"
   distribution = "TruncatedNormal"
   bounds = [ -2, 2 ]

   [[optimized_params]]
   param = "y"
   distribution = "TruncatedNormal"
   bounds = [ -2, 2 ]

Compared to the configuration from the :doc:`grid search tutorial <grid_search>`, the
``restarts`` and ``hyperparam_list`` settings are gone. Instead, a number of other
settings have been added, which we will go through in the following:


.. code-block:: toml

   optimizer_str = "cem_metaoptimizer"

The type of optimizer to use (see :confval:`optimizer_str` for available options).

.. code-block:: toml

   num_best_jobs_whose_data_is_kept = 5

With this setting, the full output of the best 5 jobs throughout the whole optimization
is kept. This is mostly useful if your jobs store additional data (e.g. training
snapshots), which you might want to analyse when finished.


.. code-block:: toml
[optimizer_settings]
with_restarts = false
num_jobs_in_elite = 10
Settings specific to the chosen optimizer. See :ref:`config.optimizer_settings`.
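
For background: the name ``cem_metaoptimizer`` suggests it is based on the
cross-entropy method (CEM), which repeatedly refits the sampling distribution to the
best-performing ("elite") jobs of an iteration. A simplified sketch of that idea (not
cluster_utils' actual implementation):

.. code-block:: python

   import numpy as np

   def cem_step(samples, scores, num_jobs_in_elite):
       """Refit a Gaussian to the elite (lowest-score) samples."""
       elite = samples[np.argsort(scores)[:num_jobs_in_elite]]
       return elite.mean(axis=0), elite.std(axis=0)

   rng = np.random.default_rng(0)
   samples = rng.uniform(-2, 2, size=(10, 2))  # 10 jobs, parameters (x, y)
   scores = np.array([(1 - x) ** 2 + 100 * (y - x**2) ** 2 for x, y in samples])
   mean, std = cem_step(samples, scores, num_jobs_in_elite=5)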

.. code-block:: toml

   [optimization_setting]
   # Which metric value to optimize on. Refers to the metrics dictionary that is
   # returned in rosenbrock.py.
   metric_to_optimize = "rosenbrock_value"
   minimize = true

   # total number of samples that are tested
   number_of_samples = 100

   # how many jobs to run in parallel
   n_jobs_per_iteration = 10

These are general optimization settings that are valid for all optimizers. Here we
specify which metric should be used for the optimization (in this tutorial, we only
return one value in ``rosenbrock.py``, but there could be multiple) and whether it
should be minimized or maximized.

Further, the number of samples and parallel jobs per iteration is configured here. See
:ref:`config.optimization_settings` for more information.
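
With these values, the 100 samples are processed in batches of 10 parallel jobs, so the
optimization runs for 10 iterations (a back-of-the-envelope calculation, assuming full
batches):

.. code-block:: python

   number_of_samples = 100
   n_jobs_per_iteration = 10

   # one iteration = one batch of parallel jobs
   print(number_of_samples // n_jobs_per_iteration)  # 10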

.. code-block:: toml

   [[optimized_params]]
   param = "x"
   distribution = "TruncatedNormal"
   bounds = [ -2, 2 ]

   [[optimized_params]]
   param = "y"
   distribution = "TruncatedNormal"
   bounds = [ -2, 2 ]
Finally, the hyperparameters that should be optimized are specified. In this example,
we use a truncated normal distribution over the range [-2, 2] for both variables. See
:confval:`optimized_params` for a list of available distributions.
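
To get an intuition for the ``TruncatedNormal`` distribution, here is an illustrative
sketch using ``scipy.stats.truncnorm`` (cluster_utils uses its own implementation; the
mean and standard deviation below are hypothetical):

.. code-block:: python

   from scipy.stats import truncnorm

   low, high = -2.0, 2.0  # the configured bounds
   mean, std = 0.0, 1.0   # hypothetical distribution parameters

   # scipy expects the bounds in units of standard deviations from the mean
   a, b = (low - mean) / std, (high - mean) / std
   samples = truncnorm.rvs(a, b, loc=mean, scale=std, size=5)
   print(samples)  # five candidate values, all within [-2, 2]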

**Note:** You will need to adjust the settings in the ``[git_params]`` section to point
to the repository that contains the ``rosenbrock.py`` script.


Run the hyperparameter optimization
===================================

Now you can run the hyperparameter optimization locally:

.. code-block:: sh

   python3 -m cluster_utils.hp_optimization path/to/config.toml

The output during execution is similar to grid search. However, after each
"iteration" (see :ref:`config.hp_optimization_iterations`), a list of the current best
results is printed:

.. code-block:: text

          x     y  rosenbrock_value  job_restarts  rosenbrock_value__std
   20  1.00  1.00          0.000000             3                    0.0
   15  0.90  0.81          0.010000             1                    NaN
   18  0.96  1.00          0.616256             1                    NaN
   17  0.95  1.00          0.953125             1                    NaN
   10  0.85  0.82          0.973125             1                    NaN
   13  0.89  0.90          1.176341             1                    NaN
   8   0.80  0.50          2.000000             1                    NaN
   21  1.00  1.20          4.000000             1                    NaN
   14  0.90  0.60          4.420000             2                    0.0
   9   0.80  0.90          6.800000             1                    NaN

The result files in the output directory are also similar to grid search. The most
important ones are:

- ``result.pdf``: The PDF report.
- ``all_data.csv``: Results of all runs as a CSV file.
- ``cluster_run.log``: Log of cluster_utils. Useful for debugging if something goes
  wrong.


.. important::

   Every time you run cluster_utils, it creates a temporary working copy of the
   specified git repository. This means that when you make changes to the code, you
   need to **commit and push** them before running cluster_utils again.
