Creating my first test

disservin edited this page Sep 24, 2024 · 253 revisions

Contributor Guidelines

  1. Always be polite, respectful, and kind: https://css-tricks.com/open-source-etiquette-guidebook/
  2. Keep your final change as small and neat as possible: https://tirania.org/blog/archive/2010/Dec-31.html
  3. Even if a change passes the Fishtest tests, it is not a guarantee that it will be merged. Patches that add significant complexity will need to show a big benefit to be considered.
  4. Never submit a test that is a bundle of multiple ideas. Submit each idea individually as its own test.
  5. Participate in the Stockfish Discord server. This is the place to communicate with moderators and other developers.

Requirements

To write patches for Stockfish and test them in the framework, you will need:

  • A recent C++ compiler
  • A working git on your system
  • A GitHub account
  • A git client on your computer (we recommend GitHub Desktop, for its simplicity)

Initial setup

Create a fork of the official Stockfish repository at https://github.com/official-stockfish/Stockfish and clone your fork to your computer. GitHub has good help on this process: https://help.github.com/articles/fork-a-repo

Synchronize your master branch with the official master branch

Before creating a new patch, you have to make sure that your master branch is up to date and has all the newest commits from the official Stockfish master branch. You can use the following script to ensure this (this script should be used each time the official master branch changes). Save the script in a file named "sync-with-official.sh", then type the following command in a terminal: sh sync-with-official.sh

#!/bin/sh

# change directory to the path of the script
cd "${0%/*}"

# go to the src directory for Stockfish on my hard drive (edit accordingly)
cd ./chess/stockfish/src

echo
echo "This command will sync with master of official-stockfish"
echo

echo "Adding official Stockfish's public GitHub repository URL as a remote in my local git repository..."
git remote add     official https://github.com/official-stockfish/Stockfish.git
git remote set-url official https://github.com/official-stockfish/Stockfish.git

echo
echo "Going to my local master branch..."
git checkout master

echo
echo "Downloading official Stockfish's branches and commits..."
git fetch official

echo
echo "Updating my local master branch with the new commits from official Stockfish's master..."
git reset --hard official/master

echo
echo "Pushing my local master branch to my online GitHub repository..."
git push origin master --force

echo
echo "Compiling new master..."
make clean
make build -j
make net

echo
echo "Done."

Create your test

Tip

For instructions on how to create your test in Fishtest, see the "Run your test" section.

Standard tests

  1. In terminal, browse to the Stockfish folder
  2. Create a branch for your work: git checkout -b my_branch
  3. Edit the Stockfish source to make your changes
  4. To compile the source, refer to https://github.com/official-stockfish/Stockfish/wiki/Compiling-from-source. Note that the commands given there need to be adapted, because they are written for compiling the official Stockfish source from GitHub rather than your own branch.
  5. Get the branch signature by running ./(binary name) bench. The relevant output line looks like Nodes searched : 4190940, where 4190940 is the signature.
  6. Commit your changes locally with git commit -am "My commit message"
  7. Push your branch to GitHub: git push origin my_branch
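The signature in step 5 can be picked out of the bench output mechanically. The following is a minimal sketch, assuming only that the output contains a line of the form `Nodes searched : <number>`; the helper name `bench_signature` and the other lines in the sample output are illustrative.

```python
import re


def bench_signature(bench_output: str) -> str:
    """Return the number on the 'Nodes searched' line of bench output."""
    match = re.search(r"Nodes searched\s*:\s*(\d+)", bench_output)
    if match is None:
        raise ValueError("no 'Nodes searched' line found in bench output")
    return match.group(1)


# Illustrative output; only the 'Nodes searched' line matters here.
sample_output = "some earlier output\nNodes searched : 4190940\nsome later output\n"
print(bench_signature(sample_output))
```

The printed value is what goes into the "Test signature" field or the `Bench:` line of the commit message.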

NNUE net tests

To test new nets, first, upload these nets to Fishtest (account needed). By uploading you license your network under a CC0 license. The networks must follow the naming convention

nn-SHA.nnue

Where SHA is the first 12 hexadecimal digits of the SHA-256 checksum of the nn.nnue data file, as printed by sha256sum (or shasum -a 256), for example: sha256sum nn.nnue | cut -c1-12.
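The naming rule can also be sketched in Python using the standard `hashlib` module. The function names below are illustrative, not part of Fishtest.

```python
import hashlib
from pathlib import Path


def net_filename(data: bytes) -> str:
    """Build the nn-SHA.nnue name from the raw bytes of a network file."""
    digest = hashlib.sha256(data).hexdigest()
    return f"nn-{digest[:12]}.nnue"


def net_filename_for(path: str) -> str:
    """Same as net_filename, but reads the network from disk first."""
    return net_filename(Path(path).read_bytes())
```

Calling `net_filename_for("nn.nnue")` gives the name the file should be renamed to before uploading.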

On a Stockfish git branch, change the default value of EvalFileDefaultName (see evaluate.h) to the name of this net and proceed as usual with short time control (STC) and long time control (LTC) sequential probability ratio test (SPRT) tests of this branch. Do not add an EvalFile test option: this is not supported on Fishtest, i.e. only the default net will be visible to the engine. Nets that require different features or are of different sizes can be tested as well, provided the necessary changes are made to the sources in the testing branch.

Elo measurements will be done as part of the usual progression/regression tests, which will probably be more frequent.

Tuning with SPSA

SPSA algorithm in Fishtest

The implementation of the simultaneous perturbation stochastic approximation (SPSA) algorithm in Fishtest can be described, in a heavily simplified form, as a loop over these two steps:

  • Evaluation step: play a mini match using 2 different values of the parameter to be optimized:
    • parameter_value - ck
    • parameter_value + ck
  • Update step: update the parameter according to the mini match result:
    • parameter_value = parameter_value + (ck * rk) * (wins - losses)

Where:

  • ck: is the variation applied to the parameter_value to detect the right direction of the update from the mini match result. The value must be large enough to have a measurable Elo difference during the mini match

  • rk: is the fraction of ck used to update the parameter_value

    ck and rk decrease with the number of iterations (1 iteration = 2 games played) in order to have bigger updates of the parameter_value at the start of the tuning and to settle to a final value at the end of the tuning:

    ck = c0 / (1 + k)**gamma
    ak = a0 / (A + 1 + k)**alpha
    rk = ak / ck**2
    

The Fishtest page for a new SPSA test requires for each parameter:

  • The starting value of parameter_value
  • The clipping min value, applied on parameter_value and parameter_value - ck
  • The clipping max value, applied on parameter_value and parameter_value + ck
  • The final value of ck (i.e. when k = total_games / 2 - 1)
  • The final value of rk (i.e. when k = total_games / 2 - 1)
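Because the form asks for the final values of ck and rk, the starting constants c0 and a0 follow by inverting the formulas above at k = total_games / 2 - 1. The sketch below shows that arithmetic. The values of alpha, gamma and A are not given on this page, so the defaults used here (alpha = 0.602, gamma = 0.101, A = 10% of the iterations) are assumptions borrowed from common SPSA practice, not confirmed Fishtest settings.

```python
def spsa_constants(c_end, r_end, total_games,
                   alpha=0.602, gamma=0.101, a_fraction=0.1):
    """Back-solve c0, a0 and A from the final ck and rk values.

    alpha, gamma and a_fraction are assumed defaults, not taken from Fishtest.
    """
    iterations = total_games / 2            # 1 iteration = 2 games
    k_final = iterations - 1
    big_a = a_fraction * iterations         # assumed choice of A
    c0 = c_end * (1 + k_final) ** gamma     # invert ck = c0 / (1 + k)**gamma
    a_end = r_end * c_end ** 2              # invert rk = ak / ck**2
    a0 = a_end * (big_a + 1 + k_final) ** alpha
    return c0, a0, big_a


def schedule(k, c0, a0, big_a, alpha=0.602, gamma=0.101):
    """Return (ck, rk) at iteration k, using the formulas above."""
    ck = c0 / (1 + k) ** gamma
    ak = a0 / (big_a + 1 + k) ** alpha
    rk = ak / ck ** 2
    return ck, rk


c0, a0, big_a = spsa_constants(c_end=5.0, r_end=0.002, total_games=10000)
print(schedule(0, c0, a0, big_a))       # larger ck at the start
print(schedule(4999, c0, a0, big_a))    # reaches the requested final values
```

With these assumptions ck shrinks slowly over the run, so the perturbation at the start is somewhat larger than the final value entered in the form.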

The min and max clipping values should establish the broadest feasible range within which the parameter is meaningful. If this range is set too narrowly, it could lead to unwarranted clipping, thereby diminishing the sensitivity of Elo during the mini match. To confine the evolution range for a parameter, simply set a smaller rk value.
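The effect of clipping on the two steps can be sketched as follows. This is a simplified illustration of the loop described above, not Fishtest's actual code; in particular, it assumes that wins and losses are counted from the point of view of the engine playing with parameter_value + ck.

```python
def clip(value, lo, hi):
    """Confine value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def mini_match_values(value, ck, lo, hi):
    """Evaluation step: the two clipped values played against each other."""
    return clip(value - ck, lo, hi), clip(value + ck, lo, hi)


def update(value, ck, rk, wins, losses, lo, hi):
    """Update step: move the parameter according to the mini match result.

    Assumes wins/losses are from the perspective of the value + ck side.
    """
    return clip(value + (ck * rk) * (wins - losses), lo, hi)


# A broad range leaves the perturbation intact ...
print(mini_match_values(10, 5, 0, 20))   # (5, 15)
# ... while a narrow range shrinks it from +-5 to +-2.
print(mini_match_values(10, 5, 8, 12))   # (8, 12)
```

In the second call the two engines differ by only 4 instead of 10, which is the loss of sensitivity the paragraph above warns about.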

Prepare the code for SPSA

  1. In the terminal, browse to the Stockfish folder

  2. Create a branch for your work: git checkout -b my_tuning_branch

  3. Move the definition of the variables to the global scope of the Stockfish namespace.

  4. Remove const qualifiers from the variables in the source code that you want to tune.

  5. Flag the variables you want to tune with the TUNE macro. For example, if you have:

    namespace Stockfish {
    
    int myKing = 10, myQueen = 20;
    Score myBonus = S(5, 15);
    Value myValue[][2] = { { V(100), V(20) }, { V(7), V(78) } };

    Simply add the following line somewhere after it

    TUNE(myKing, myBonus, myValue);

    The type of the variables must be one of int, Score, or Value. They can be arrays of arbitrary dimensions. Note that the variables will only be allowed to vary in the range [0, 2*v], where v is the initial value. See also point 7.

    Note that the c value must be above 1. For example if we have the following:

    int bonus1 = 1; int bonus2 = 2;
    int totalBonus = bonus1 * term1 + bonus2 * term2;

    We can instead scale up the values to be used in the tuner:

    int bonus1 = 10; int bonus2 = 20;
    int totalBonus = (bonus1 * term1 + bonus2 * term2) / 10;
    TUNE(bonus1, bonus2);

    In this case, we have multiplied up the values by 10x and can now use a c value of 5.

    You can have multiple invocations of TUNE in different places; the effect is the same as listing all the variables in a single invocation. For example:

    TUNE(myKing);
    ...
    TUNE(myQueen, myValue);

    Even more flexibility can be obtained with a custom range function. For example, use a range that is +- 20 for each variable, except those that are zero.

    auto myfunc = [](int m){return m == 0 ? std::pair<int, int>(0, 0) : std::pair<int, int>(m - 20, m + 20);};
    TUNE(SetRange(myfunc), QuadraticOurs);
  6. If you have a function that needs to be called after the variables are updated, for example void my_post_update() {}, simply add its name to the TUNE arguments.

    TUNE(myKing, myBonus, myValue, my_post_update);

    You can add multiple functions and they will be called in the order you add them.

  7. By default, a variable v is tuned in the range [0, 2 * v], and only that range is allowed for the parameter. You can change that by adding a custom range as another argument to TUNE as follows:

    TUNE(SetRange(-100, 100), myKing, myQueen);

    This will change the default range for all the variables. To customize it further, you can set another range for the remaining variables.

    TUNE(SetRange(-100, 100), myKing, SetRange(-20, 20), myQueen);

    Here myKing is tuned in [-100, 100] while myQueen is tuned in [-20, 20].

    To return the range to default use SetDefaultRange

    TUNE(SetRange(-100, 100), myKing, SetDefaultRange, myQueen);

    So that the range for myQueen is the default.

    Note: you can also change the range of each parameter manually as you enter them in Fishtest, as shown in the SPSA tests section. However, that range must lie within the allowed range for the parameter (that is, it can only be narrower than what Stockfish prints out).

  8. After you are done specifying what you want to tune and how, compile the source.

  9. Run the command ./stockfish. It prints a list of comma-separated lines, one per tuned parameter. Copy that list somewhere.

  10. Get the branch signature from ./stockfish bench, which will look like Nodes searched : 4190940. Here, 4190940 is the signature.

  11. Commit your changes locally by running the command git commit -am "My commit message"

  12. Push your changes to GitHub with git push origin my_tuning_branch

You can read more about SPSA in Fishtest in Issue #535.
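The scaling trick from step 5 (multiplying bonus1 and bonus2 by 10 and dividing the sum by 10) can be checked with a few lines of arithmetic. This is an illustrative sketch with made-up term values; it restricts itself to non-negative terms because Python's `//` rounds down while C++ integer division truncates toward zero.

```python
def total_bonus_original(term1, term2, bonus1=1, bonus2=2):
    """The untuned expression with small integer weights."""
    return bonus1 * term1 + bonus2 * term2


def total_bonus_scaled(term1, term2, bonus1=10, bonus2=20):
    """The same expression with weights scaled by 10 for tuning."""
    return (bonus1 * term1 + bonus2 * term2) // 10


# At the starting values both versions agree ...
print(total_bonus_original(3, 4), total_bonus_scaled(3, 4))
# ... but the scaled version can represent in-between weights such as 1.5.
print(total_bonus_scaled(4, 0, bonus1=15))
```

The scaled weights leave room for the tuner to perturb a value by several units in either direction, which is what makes a c value such as 5 usable.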

Run your test

Warning

Do not run too many tests at the same time. Having too many active tests will significantly reduce their internal throughput (ITP).

Standard tests

Please follow our testing methodology unless you have a good reason to do it differently.

  1. Go to https://tests.stockfishchess.org/tests/run

  2. Fill in the "Test repository" field with the link to the BASE GitHub repo of your fork of Stockfish, with no trailing slash, e.g. https://github.com/yourname/Stockfish. This is NOT a link to the test branch itself. Your "test repo" will remain constant for all of your tests across all of your branches.

  3. Fill in the "Test branch" field with the name of your branch, e.g. my_branch. This MUST exactly match the actual name of the branch on GitHub! It will not be deduced from your test repo link.

  4. Fill in the "Test signature" field with the bench output of your patch, e.g. 4190940. If you added a line containing "Bench: [the bench of the patch]" to the commit message, you don't need to insert this information.

  5. Fill in the "Info" field with a description of your change. Keep it short but complete.

  6. Click "Submit test".

SPSA tests

  1. Go to https://tests.stockfishchess.org/tests/run

  2. Nodestime lets SF manage its search using a node-count budget instead of wall-clock time, removing the noise introduced by inconsistent hardware speed. If the value you're tuning is unlikely to change the nps significantly, change Test options to read Hash=128 nodestime=600. With nodestime set to 600, the workers should be able to search at least 600 nodes per millisecond (that is, 0.6 Mnps) per thread, which is above the Fishtest minimum of 540 nodes per millisecond per thread, below which a worker is considered too slow.

  3. For search parameters, normal TC tuning is usually preferred.

  4. Choose the appropriate TC. A short TC is appropriate for getting faster approximations or for checking the tune's correctness and parameters. A long TC (60+0.6) is best for getting values that scale better. If using a regular time control, set the appropriate TC and hash (Hash=16 for 20+0.2, Hash=64 for 60+0.6). If using nodestime, set the time control to 160+1.6 to get the equivalent of 60+0.6 with normal TM. The nodestime games will usually finish long before the time is up, provided that the workers are able to search more nodes per millisecond than specified in the nodestime parameter; otherwise your games will end on time.

  5. Paste the list that you copied into the SPSA Parameters list. This is comma-separated data for parameter name, initial value, minimum, maximum, ck final value, rk final value in that order. Here you can also make manual changes to min and max values for parameters.

  6. A good tune run exhibits a significant change in the tuned values while minimizing random change. As a general rule, tuning many values at once (like a PSQT table) generates random change; while values that are only rarely used to score a position can have a hard time moving at all. Tweaking the ck value often helps to get a better result, with a higher ck forcing value change and a lower one reducing random change.

  7. Make sure your test repo is correct (e.g., https://github.com/yourname/Stockfish)

  8. Fill in the info describing your change

  9. Click "Submit test".

  10. If after a few thousand games the values are barely changing at all, the tuning run is useless and should be stopped. This usually happens when the ck value is too low for the parameter changes to influence the tuning result more than random noise.
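Before pasting or hand-editing the parameter list from step 5, it can help to check that every line has the six expected fields and that the starting value lies inside its range. The following is a small sketch under that assumption; the `SpsaParam` type and the sample lines are hypothetical and only mirror the field order described above.

```python
from typing import NamedTuple


class SpsaParam(NamedTuple):
    name: str
    start: float
    minimum: float
    maximum: float
    c_end: float   # final value of ck
    r_end: float   # final value of rk


def parse_spsa_params(text):
    """Parse lines of 'name, start, min, max, ck final, rk final'."""
    params = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        name, *numbers = fields
        param = SpsaParam(name, *map(float, numbers))
        if not param.minimum <= param.start <= param.maximum:
            raise ValueError(f"start value outside [min, max]: {line!r}")
        params.append(param)
    return params


# Hypothetical parameter lines, for illustration only.
sample = "myKing,10,0,20,1,0.0020\nmyQueen,20,0,40,2,0.0020\n"
for param in parse_spsa_params(sample):
    print(param)
```

A check like this catches a manually narrowed range that no longer contains the starting value before the test is submitted.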

Note

If you use the default range with an initial value of 0, the parameter will not be tuned since 2 * 0 and 0 / 2 are both 0, and empty intervals are not tuned.
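The zero case can be seen directly from the default range rule. This sketch only models the [0, 2 * v] rule stated in this page; ordering the bounds for a negative v is an assumption made so that the function stays well defined.

```python
def default_range(v):
    """Default tuning range [0, 2 * v]; bounds are ordered (an assumption for v < 0)."""
    lo, hi = sorted((0, 2 * v))
    return lo, hi


def is_tunable(v):
    """A parameter can only move if its range is not empty."""
    lo, hi = default_range(v)
    return lo < hi


print(default_range(10), is_tunable(10))
print(default_range(0), is_tunable(0))
```

For a parameter that starts at 0, use SetRange with explicit bounds so that the interval is not empty.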

Note

You cannot modify the number of games for SPSA tests, because the SPSA hyperparameters are derived from the initial number of games.

Testing methodology

Definitions
  • Hippopotamus: A patch that adds at least 10 lines of code. Please avoid them if possible; it is often better to make a minimal version of your idea.
  • Parameters tweak: Changing the value of some constants in the code. The generated machine code is the same complexity (aiming at the same number of processor instructions).
  • Simplification: A way to make the code and the algorithms clearer. Most of the time, a necessary condition is that the number of lines in the source code of Stockfish goes down, or the number of processor instructions in the generated code goes down -- but this is by no means a sufficient condition because it is of course unfortunately possible to lower the number of lines in the code while obfuscating it.
  • Bug: A bug that has been discussed in Discord or as an issue in GitHub and confirmed as such by the maintainers. Potential bug-fix solutions shall be first discussed in the server, then tested in the framework.
  • TC: Time control, should almost always be STC or LTC, but exceptionally (e.g., testing time management) can be 40/10 or 40/10+0.1 (40 moves in 10s or 10s + increment), or 10+0 (sudden death).
  • STC: Short time control (10+0.1)
  • LTC: Long time control (60+0.6)
  • SMP: Multi-threaded (symmetric multiprocessing) tests
  • SPRT(x,y): Sequential probability ratio test (SPRT) with elo0 = x and elo1 = y

The following recommendations to choose the right parameters can be best understood with the graphical SPRT calculator, which draws nice curves displaying the pass-rate and the average length of runs for various values of SPRT(x,y). When in doubt, stick with the standard test parameters.
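An SPRT stops as soon as the log-likelihood ratio (LLR) leaves a fixed interval. The example test results later in this page show that interval as (-2.94,2.94), which is what Wald's standard stopping bounds give for error rates alpha = beta = 0.05. That choice of error rates is an assumption inferred from the displayed bounds, not something stated in this page.

```python
import math


def llr_bounds(alpha=0.05, beta=0.05):
    """Wald's SPRT stopping bounds for the log-likelihood ratio.

    alpha and beta are assumed error rates, inferred from the (-2.94, 2.94)
    interval shown in Fishtest results.
    """
    lower = math.log(beta / (1 - alpha))
    upper = math.log((1 - beta) / alpha)
    return lower, upper


lower, upper = llr_bounds()
print(f"fail below {lower:.2f}, pass above {upper:.2f}")
```

A test passes when the LLR reaches the upper bound and fails when it reaches the lower one; until then it keeps running.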

There are single-threaded tests (e.g., STC, LTC) and multi-threaded (symmetric multiprocessing, "SMP") tests (e.g., STC SMP, LTC SMP) that run on exactly 8 threads. SMP testing is performed monthly as part of regression testing or pre-release testing of Stockfish.

Functional changes

Standard

We use these for almost all our tests. This is our workhorse, designed to commit only robust patches that almost surely work. Our goal is to reduce the possibility of regression to a minimum and to avoid adding unnecessary complexity.

  1. Run your test with the "Test Type": STC

  2. If your STC test passes, create a new test with the "Test Type": LTC

    Note: If there have been no new commits to the Stockfish master branch since you created your first test, you can click the "Reschedule" button inside your test so most of the information will be automatically filled for you in the new test.

  3. If your LTC test passes, congratulations! Create a pull request against the official-stockfish repository, so your changes can be reviewed. Please remember that it is not guaranteed to be committed.

Simplifications

These must be used for all simplifications that change functionality, even one-liners, to test whether the removal of the code is detrimental to Stockfish's strength.

We try to reject an Elo loss, so even a neutral patch can fail. Nevertheless, because the code under test is simpler or smaller than the original, we don't require the stricter standard mode. These tests are also used for bug fixes and other special cases, but only after they have been discussed in Discord and approved in advance, so that non-regression mode does not become a preferred shortcut around the stricter standard mode.

For the most part, follow the same procedure as for a standard test, but with the non-regression SPRT bounds.

Scalers

If you think that your patch will perform better in very long time controls than in shorter ones you might be able to test it at longer time controls than our standard LTC.

  • Time control: 180s + 1.8s, Threads: 1.
  • Time control: 60s + 0.6s, Threads: 8, or Time control: 70s + 0.7s, Threads: 7, depending on the availability of 8-thread workers.

Unions

Among parameter tweaks, a special sub-case is the so-called union patch or combo patch, which is a bundling of patches that failed SPRT but with positive or near positive scores. Sometimes retesting the union as a whole passes SPRT. Due to the nature of the approach and because each patch failed already, a union has some constraints:

  • Maximum 2 patches per union.
  • Each patch shall be trivial, like a parameter tweak. Patches that add/remove a concept/idea/feature shall pass individually.

Non-functional changes

If your patch is a non-functional change, you might still need to run it through Fishtest, but there are exceptions:

Cleanups

Usually, there is an open PR to collect small cleanups. They are trivial, don't change the bench, and are generally non-controversial. This might be typos, variable names, dead code, and similar.

Refactoring and non-functional simplifications

Code refactoring and non-functional simplifications are a very wide family of patches and, by their nature, more subjective than other kinds of patches. So also the acceptance guidelines rely more on maintainer knowledge, experience, and sensibility.

Anything that does not fit the small cleanup category above should typically be tested on Fishtest, unless the code is not really exercised on Fishtest (e.g. Syzygy) and doesn't change the bench. This approach verifies code correctness (no crashes) and makes sure there are no unexpected side effects or slowdowns.

  • Regression tests will use non-regression SPRT bounds or exceptionally other bounds according to maintainer judgment.
  • Rejects are always possible if the patch is worse than the original.

These kinds of patches, although very important for long-term code quality, are also the ones that can raise discussions, because code style is largely subjective. So be prepared to accept a negative judgment by the maintainers: judging these patches is a hard job, so please do not take it personally or start an endless discussion if your patch is rejected. Simply move on to your next Elo-winning idea.

Speedups

Test it on your machine by running ./stockfish bench. It is recommended to run it several times and compute the mean and standard deviation to see if the improvement is statistically significant. For Linux or Windows you can use psybench; for Windows you can also use the specialized tools FishBench and BuildTester, both excellent.
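The mean and standard deviation check can be sketched with Python's `statistics` module. The nodes-per-second figures below are made up for illustration, and the z-score is only a rough significance heuristic chosen for this sketch, not a procedure prescribed by the project.

```python
import math
import statistics


def speedup_percent(base, patch):
    """Relative change of the mean speed, in percent."""
    return 100.0 * (statistics.mean(patch) / statistics.mean(base) - 1.0)


def z_score(base, patch):
    """Difference of means divided by its standard error (rough heuristic)."""
    diff = statistics.mean(patch) - statistics.mean(base)
    std_err = math.sqrt(
        statistics.variance(base) / len(base)
        + statistics.variance(patch) / len(patch)
    )
    return diff / std_err


# Hypothetical nodes/second readings from repeated bench runs.
base = [1_000_000, 1_002_000, 998_000, 1_001_000, 999_000]
patch = [1_010_000, 1_012_000, 1_008_000, 1_011_000, 1_009_000]

print(f"speedup: {speedup_percent(base, patch):.2f}%")
print(f"z-score: {z_score(base, patch):.1f}")
```

A difference of means that is only a small multiple of its standard error is likely noise; more runs reduce the standard error.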

In case you have access to Linux, the most reliable way is to use the amazing perf tool:

$ sudo perf stat -r 5 -a -B -e cycles:u,instructions:u ./stockfish bench > /dev/null

This command will run the bench 5 times, counting instructions and cycles and averaging. At the end, a report will be printed:

 Performance counter stats for 'system wide' (5 runs):

    22.747.981.856      cycles:u                                     ( +-  0,05% )
    28.409.592.052      instructions:u     #    1,25  insn per cycle ( +-  0,06% )

      4,331400608 seconds time elapsed                               ( +-  0,11% )

Important

Note the very small error. If you get a noticeably larger one, please run the test again. If you run Linux inside a VM, such as VMware, you have to enable performance counters in the virtual machine settings.
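The counters in the report above can be extracted and compared programmatically. This sketch parses the sample report shown in this section, which uses "." as the thousands separator; it assumes the counter values are integers, so all separators are simply stripped. The helper name `parse_counter` is illustrative.

```python
import re

report = """\
 Performance counter stats for 'system wide' (5 runs):

    22.747.981.856      cycles:u                                     ( +-  0,05% )
    28.409.592.052      instructions:u     #    1,25  insn per cycle ( +-  0,06% )

      4,331400608 seconds time elapsed                               ( +-  0,11% )
"""


def parse_counter(text, event):
    """Return the integer count reported for a perf event such as 'cycles:u'."""
    pattern = r"^\s*([\d.,]+)\s+" + re.escape(event) + r"\b"
    match = re.search(pattern, text, re.MULTILINE)
    if match is None:
        raise ValueError(f"event {event!r} not found in report")
    # Counters are whole numbers, so any '.' or ',' is a thousands separator.
    return int(re.sub(r"[.,]", "", match.group(1)))


cycles = parse_counter(report, "cycles:u")
instructions = parse_counter(report, "instructions:u")
print(f"instructions per cycle: {instructions / cycles:.2f}")
```

Comparing the cycle counts of master and the patch gives the relative speedup independently of wall-clock noise.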

If the code is a trivial change, send us a pull request. Speedups need to be further verified, for at least 2 reasons:

  1. The speedup needs to be statistically significant and not just random noise.
  2. The speedup needs to be confirmed on different machines. Sometimes a speedup on one machine is a slowdown on another.

To be considered for inclusion, the speedup should be around 0.5%. If the patch is more complex, it will go through the normal STC+LTC Fishtest tests. This will require about 0.25% speedup at STC and about 0.7% speedup at LTC for a 50% passing chance. 1% speedups pass with an 85% chance. The rationale is that a speedup is entirely comparable to a normal patch: it adds complexity with the aim of improving Elo, so it makes sense to test it under the same conditions. Some data and discussion can be found in the following issue: https://github.com/official-stockfish/Stockfish/issues/2593

I'm ready for my pull request!

Once the tests of your idea have passed and/or you have enough speed data to support your improvements, congratulations: you can now open a pull request against the master branch of official-stockfish.

Guidelines for a great pull request:

  • Is my pull request up-to-date with current master?
  • Does my pull request consist of a single commit?
    • If your branch has a long history, you may squash the commits and force push to your development branch or consider creating a new branch where you squash your changes together, then open the pull request from this new branch.
  • Is my code really clean?
    • Employ a coding style that is similar to surrounding Stockfish code, remove all spaces on empty lines and trailing spaces everywhere, etc.
  • Can I improve the quality of my commit message?
    • Your git commit message should have a high-level description of the patch, explaining the reasoning behind the patch and why it improves on the current code. The pull request comment will automatically be filled with your commit message.
  • Do I provide easy ways to check my data?
    • Most changes should also report the results obtained in Fishtest at STC and LTC with links to the test pages.
  • What will be the next signature of Stockfish if my patch is accepted?
    • Both the commit message and the pull request comment on GitHub must mention if the patch is a 'No functional change' or changes the search. The last line of the commit message should be either 'No functional change' or 'Bench: XXXXXXX' where 'Nodes searched : XXXXXXX' can be found in the output of a ./stockfish bench invocation.
  • Is my patch portable?
    • We have continuous integration testing, which will check for standard conformance and reproducibility of search using various compilers. Test your code on your GitHub repository continuous integration pushing your code on the helper branch github_ci with the command git push -f origin HEAD:github_ci and analyze the continuous integration logs in case of failure.
  • How could we continue after the patch?
    • Write a few words in the commit message to offer perspectives for future work. This is often a nice way to get momentum for continuing research on this subject in Stockfish, and maybe somebody else in the community will pick up the challenge.
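The rule for the last line of the commit message ('No functional change' or 'Bench: XXXXXXX') is easy to check before pushing. The following is a small sketch that applies the rule exactly as worded in the checklist above; the function name is illustrative and not part of any Stockfish tooling.

```python
import re


def last_line_ok(message: str) -> bool:
    """Check that the last non-empty line states the bench or no functional change."""
    lines = [line.strip() for line in message.splitlines() if line.strip()]
    if not lines:
        return False
    last = lines[-1]
    return last == "No functional change" or re.fullmatch(r"Bench: \d+", last) is not None


print(last_line_ok("Tweak a search parameter\n\nBench: 4190940\n"))
print(last_line_ok("Fix a typo in a comment\n\nNo functional change\n"))
print(last_line_ok("Tweak a search parameter\n\nPassed STC and LTC.\n"))
```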

Examples:

Simplify away nnue scale pawn count multiplier

Removes 2x multipliers in nnue scale calculation along with the pawn count term that was recently reintroduced.

Passed non-regression STC:
https://tests.stockfishchess.org/tests/view/64305bc720eb941419bdf72e
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 38008 W: 10234 L: 10021 D: 17753
Ptnml(0-2): 96, 4151, 10323, 4312, 122

Passed non-regression LTC:
https://tests.stockfishchess.org/tests/view/6430b76a028b029b01ac9bfd
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 91232 W: 24686 L: 24547 D: 41999
Ptnml(0-2): 30, 8721, 27986, 8838, 41

bench 4017320

Set the length of GIT_SHA to 8 characters

Previously, the length of git commit hashes could vary depending on the git environment.

No functional change

You can find more examples in the commit history.
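The result blocks quoted in a commit message can be sanity-checked after copying. The sketch below parses the Total and Ptnml lines of the first example above and verifies that the numbers agree with each other. It assumes that Ptnml(0-2) counts game pairs by the points the patch scored in the pair (0, 0.5, 1, 1.5, 2); that reading is an assumption, although it is consistent with the example data.

```python
import re

result = """\
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 38008 W: 10234 L: 10021 D: 17753
Ptnml(0-2): 96, 4151, 10323, 4312, 122
"""


def parse_result(block):
    """Extract (total, wins, losses, draws, ptnml) from a result block."""
    totals = re.search(r"Total: (\d+) W: (\d+) L: (\d+) D: (\d+)", block)
    pairs = re.search(r"Ptnml\(0-2\): ([\d, ]+)", block)
    total, wins, losses, draws = map(int, totals.groups())
    ptnml = [int(count) for count in pairs.group(1).split(",")]
    return total, wins, losses, draws, ptnml


def is_consistent(block):
    """Check that games, game pairs and points all add up."""
    total, wins, losses, draws, ptnml = parse_result(block)
    games_match = wins + losses + draws == total
    pairs_match = 2 * sum(ptnml) == total          # each pair is two games
    # Doubled points, to stay in integers: pair scores 0, 0.5, 1, 1.5, 2.
    points_match = sum(i * n for i, n in enumerate(ptnml)) == 2 * wins + draws
    return games_match and pairs_match and points_match


print(is_consistent(result))
```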

Advanced options

The "Advanced options" section of the test creation page contains options that should be toggled on or off only by advanced users. Currently, the advanced options are:

  • Auto-purge: Toggles auto-purge on and off. Having it off can be beneficial when testing time-management patches or patches that affect different operating systems in different ways.
  • Time odds: Use different time controls for the test and the base.
  • Custom book: Use a custom opening book for tests.
  • Disable adjudication: Disables adjudication for testing.

Useful Resources