Creating my first test
- Always be polite, respectful, and kind: https://css-tricks.com/open-source-etiquette-guidebook/
- Keep your final change as small and neat as possible: https://tirania.org/blog/archive/2010/Dec-31.html
- Even if a change passes the Fishtest tests, it is not a guarantee that it will be merged. Patches that add significant complexity will need to show a big benefit to be considered.
- Never submit a test that is a bundle of multiple ideas. Submit each idea individually as its own test.
- Participate in the Stockfish Discord server. This is the place to communicate with moderators and other developers.
To write patches for Stockfish and test them in the framework, you will need:
- A recent C++ compiler
- A working git on your system
- A GitHub account
- A git client on your computer (we recommend GitHub Desktop, for its simplicity)
Create a fork of the official Stockfish repository at https://github.com/official-stockfish/Stockfish and clone your fork to your computer. GitHub has good help on this process: https://help.github.com/articles/fork-a-repo
Before creating a new patch, you have to make sure that your master branch is up to date and has all the newest commits from the official Stockfish master branch. You can use the following script to ensure this (this script should be used each time the official master branch changes). Save the script in a file named "sync-with-official.sh", then type the following command in a terminal: sh sync-with-official.sh
#!/bin/sh
# change directory to the path of the script
cd "${0%/*}"
# go to the src directory for Stockfish on my hard drive (edit accordingly)
cd ./chess/stockfish/src
echo
echo "This command will sync with master of official-stockfish"
echo
echo "Adding official Stockfish's public GitHub repository URL as a remote in my local git repository..."
git remote add official https://github.com/official-stockfish/Stockfish.git
git remote set-url official https://github.com/official-stockfish/Stockfish.git
echo
echo "Going to my local master branch..."
git checkout master
echo
echo "Downloading official Stockfish's branches and commits..."
git fetch official
echo
echo "Updating my local master branch with the new commits from official Stockfish's master..."
git reset --hard official/master
echo
echo "Pushing my local master branch to my online GitHub repository..."
git push origin master --force
echo
echo "Compiling new master..."
make clean
make build -j
make net
echo
echo "Done."
Tip
For instructions on how to create your test in Fishtest go to run your test.
- In a terminal, browse to the Stockfish folder.
- Create a branch for your work: `git checkout -b my_branch`
- Edit the Stockfish source to make your changes.
- To compile the source, refer to https://github.com/official-stockfish/Stockfish/wiki/Compiling-from-source but note that the commands given there build the Stockfish source downloaded from GitHub, so you will need to adapt them to compile your modified local copy.
- Get the branch signature from `./(binary name) bench`; the relevant line of the output looks like `Nodes searched : 4190940`. Here, 4190940 is the signature.
- Commit your changes locally with `git commit -am "My commit message"`
- Push your branch to GitHub: `git push origin my_branch`
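As a small illustration of the signature step above, the sketch below pulls the node count out of captured bench output. It is written in Python purely for convenience. The function name `bench_signature` and the sample text are hypothetical; only the `Nodes searched : 4190940` line format is taken from this page.

```python
import re


def bench_signature(bench_output: str) -> int:
    """Return the number on the last 'Nodes searched' line of bench output."""
    matches = re.findall(
        r"^Nodes searched\s*:\s*(\d+)\s*$", bench_output, flags=re.MULTILINE
    )
    if not matches:
        raise ValueError("no 'Nodes searched' line found in bench output")
    return int(matches[-1])


# Hypothetical captured output; the surrounding lines are made up for the demo.
sample = "Total time (ms) : 2000\nNodes searched  : 4190940\nNodes/second    : 2095470\n"
print(bench_signature(sample))
```

The returned integer is the value that goes into the "Test signature" field or the `Bench:` line of the commit message.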
To test new nets, first upload them to Fishtest (account needed). By uploading, you license your network under a CC0 license. The networks must follow the naming convention `nn-SHA.nnue`, where SHA is the first 12 digits of the sha256sum (`shasum -a 256`) of the nn.nnue data file (`sha256sum nn.nnue | cut -c1-12`).
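The naming rule above can also be checked programmatically. The following is a minimal Python sketch, assuming the net is small enough to read into memory in one go; `net_filename` and `rename_net` are illustrative names, not part of Fishtest or Stockfish.

```python
import hashlib
from pathlib import Path


def net_filename(data: bytes) -> str:
    """Build 'nn-SHA.nnue' from the first 12 hex digits of the SHA-256 of the data."""
    digest = hashlib.sha256(data).hexdigest()
    return f"nn-{digest[:12]}.nnue"


def rename_net(path: str) -> Path:
    """Rename a net file in place so that it follows the naming convention."""
    src = Path(path)
    dst = src.with_name(net_filename(src.read_bytes()))
    src.rename(dst)
    return dst
```

For a given file, `net_filename` should yield the same 12 characters as `sha256sum nn.nnue | cut -c1-12`.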
On a Stockfish git branch, change the default value of `EvalFileDefaultName` (see evaluate.h) to the name of this net and proceed as usual with short time control (STC) and long time control (LTC) sequential probability ratio test (SPRT) tests of this branch. Do not add an `EvalFile` test option; this is not supported on Fishtest, i.e. only the default net will be visible to the engine. Nets that require different features or are of different sizes can be tested as well, with the needed changes made to the sources in the testing branch.
Elo measurements will be done as part of the usual progression/regression tests, which will probably be more frequent.
The simultaneous perturbation stochastic approximation (SPSA) algorithm implemented in Fishtest can be summarized, in extremely simplified form, as a loop over these two steps:
- Evaluation step: play a mini match using 2 different values of the parameter to be optimized: `parameter_value - ck` and `parameter_value + ck`
- Update step: update the parameter according to the mini match result: `parameter_value = parameter_value + (ck * rk) * (wins - losses)`
Where:
- `ck` is the variation applied to `parameter_value` to detect the right direction of the update from the mini match result. The value must be large enough to produce a measurable Elo difference during the mini match.
- `rk` is the fraction of `ck` used to update `parameter_value`.
`ck` and `rk` decrease with the number of iterations `k` (1 iteration = 2 games played), so that `parameter_value` gets bigger updates at the start of the tuning and settles to a final value at the end of the tuning:
    ck = c0 / (1 + k)**gamma
    ak = a0 / (A + 1 + k)**alpha
    rk = ak / ck**2
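The two steps and the decay formulas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Fishtest implementation: the constants are hypothetical, `wins` and `losses` are assumed to be counted from the point of view of the `parameter_value + ck` side, and clipping to `[lo, hi]` is applied after the update.

```python
def ck(k, c0, gamma):
    """Perturbation size at iteration k."""
    return c0 / (1 + k) ** gamma


def ak(k, a0, A, alpha):
    """Step-size numerator at iteration k."""
    return a0 / (A + 1 + k) ** alpha


def rk(k, c0, gamma, a0, A, alpha):
    """Fraction of ck applied in the update step."""
    return ak(k, a0, A, alpha) / ck(k, c0, gamma) ** 2


def spsa_step(value, k, wins, losses, lo, hi, c0, gamma, a0, A, alpha):
    """Apply one update step and clip the result to [lo, hi]."""
    c = ck(k, c0, gamma)
    r = rk(k, c0, gamma, a0, A, alpha)
    updated = value + (c * r) * (wins - losses)
    return min(hi, max(lo, updated))


if __name__ == "__main__":
    # Hypothetical constants, chosen only to show how ck and rk decay with k.
    consts = dict(c0=4.0, gamma=0.101, a0=8.0, A=100, alpha=0.602)
    for k in (0, 10, 100, 1000):
        print(k, round(ck(k, consts["c0"], consts["gamma"]), 3), round(rk(k, **consts), 5))
```

Printing the schedule for a few values of `k` makes it easy to see how quickly the updates shrink for a given choice of constants.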
The Fishtest page for a new SPSA test requires for each parameter:
- The starting value of `parameter_value`
- The clipping `min` value, applied on `parameter_value` and `parameter_value - ck`
- The clipping `max` value, applied on `parameter_value` and `parameter_value + ck`
- The final value of `ck` (i.e. when `k = total_games / 2 - 1`)
- The final value of `rk` (i.e. when `k = total_games / 2 - 1`)
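Because the form asks for the final `ck` and `rk` rather than `c0` and `a0`, the decay formulas given earlier can be inverted to recover the starting constants. The Python sketch below does only that algebra. `A`, `alpha` and `gamma` are not specified on this page, so the defaults used here (0.602 and 0.101, values commonly quoted in the SPSA literature) and whatever `A` you pass in are assumptions.

```python
def initial_constants(c_end, r_end, total_games, A, alpha=0.602, gamma=0.101):
    """Invert ck and rk at the last iteration to obtain (c0, a0).

    Uses k_end = total_games / 2 - 1, as stated for the final values.
    """
    k_end = total_games / 2 - 1
    c0 = c_end * (1 + k_end) ** gamma          # from ck = c0 / (1 + k)**gamma
    a_end = r_end * c_end ** 2                 # from rk = ak / ck**2
    a0 = a_end * (A + 1 + k_end) ** alpha      # from ak = a0 / (A + 1 + k)**alpha
    return c0, a0


if __name__ == "__main__":
    # Hypothetical inputs: final ck of 1.0, final rk of 0.002, 40000 games, A = 2000.
    c0, a0 = initial_constants(1.0, 0.002, 40000, A=2000)
    print(round(c0, 4), round(a0, 4))
```

Plugging the returned `c0` and `a0` back into the formulas at `k = total_games / 2 - 1` should reproduce the final values entered in the form.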
The `min` and `max` clipping values should establish the broadest feasible range within which the parameter is meaningful. If this range is set too narrowly, it could lead to unwarranted clipping, which reduces the Elo difference that the mini match can detect.
To confine the evolution range for a parameter, simply set a smaller `rk` value.
- In the terminal, browse to the Stockfish folder.
- Create a branch for your work: `git checkout -b my_tuning_branch`
- Move the definition of the variables to the global scope of the Stockfish namespace.
- Remove `const` qualifiers from the variables in the source code that you want to tune.
- Flag the variables you want to tune with the `TUNE` macro. For example, if you have:
      namespace Stockfish {
      int myKing = 10, myQueen = 20;
      Score myBonus = S(5, 15);
      Value myValue[][2] = { { V(100), V(20) }, { V(7), V(78) } };
  Simply add the following line somewhere after it:
      TUNE(myKing, myBonus, myValue);
  The type of the variables must be one of `int`, `Score`, or `Value`. They can be arrays of arbitrary dimensions. Note that by default the variables will only be allowed to vary in the range [0, 2*v], where v is the initial value. See also the step on custom ranges below.
  Note that the c value must be above 1. For example, if we have the following:
      int bonus1 = 1;
      int bonus2 = 2;
      int totalBonus = bonus1 * term1 + bonus2 * term2;
  We can instead multiply up the values to be used in the tuner:
      int bonus1 = 10;
      int bonus2 = 20;
      int totalBonus = (bonus1 * term1 + bonus2 * term2) / 10;
      TUNE(bonus1, bonus2);
  In this case, we have multiplied the values by 10 and can now use a c value of 5.
  You can have multiple invocations of `TUNE` in different places. For example, the code below is equivalent to the one above:
      TUNE(myKing);
      ...
      TUNE(myQueen, myValue);
  Even more flexibility can be obtained with a custom range function. For example, use a range that is +- 20 for each variable, except those that are zero:
      auto myfunc = [](int m){return m == 0 ? std::pair<int, int>(0, 0) : std::pair<int, int>(m - 20, m + 20);};
      TUNE(SetRange(myfunc), QuadraticOurs);
- If you have a function that needs to be called after variables are updated, for example `void my_post_update() {}`, simply add its name to the `TUNE` arguments:
      TUNE(myKing, myBonus, myValue, my_post_update);
  You can add multiple functions and they will be called in the order you add them.
- By default, a variable `v` is tuned in the range from `0` to `2 * v`, and only that range is allowed for the parameter. You can change that by adding a custom range as another argument to `TUNE` as follows:
      TUNE(SetRange(-100, 100), myKing, myQueen);
  This will change the default range for all the variables. To customize it further, you can set another range for the remaining variables:
      TUNE(SetRange(-100, 100), myKing, SetRange(-20, 20), myQueen);
  Here `myKing` is tuned in [-100, 100] while `myQueen` is tuned in [-20, 20].
  To return the range to the default, use `SetDefaultRange`:
      TUNE(SetRange(-100, 100), myKing, SetDefaultRange, myQueen);
  so that the range for `myQueen` is the default.
  Note: you can also change the range of each parameter manually as you input them to Fishtest, as will be shown below. However, that range must be within the allowed range for the parameter (so reduced from what Stockfish prints out).
- After you are done specifying what you want to tune and how, compile the source.
- Run the command `./stockfish`. You will notice a comma-separated list printed. Copy that list somewhere.
- Get the branch signature from `./stockfish bench`; the relevant line of the output looks like `Nodes searched : 4190940`. Here, 4190940 is the signature.
- Commit your changes locally by running the command `git commit -am "My commit message"`
- Push your changes to GitHub with `git push origin my_tuning_branch`
You can read more about SPSA in Fishtest in issue #535.
Warning
Do not run too many tests at the same time. Having too many active tests running will reduce their internal throughput (ITP) significantly.
Please follow our testing methodology unless you have a good reason to do it differently.
- Fill in the "Test repository" field with the link to the BASE GitHub repo of your fork of Stockfish, with no trailing slash, e.g. `https://github.com/yourname/Stockfish`. This is NOT a link to the repo of the test branch itself. Your "test repo" will remain constant for all of your tests across all of your branches.
- Fill in the "Test branch" field with the name of your branch, e.g. `my_branch`. This MUST exactly match the actual name of the branch on GitHub! It will not be deduced from your test repo link.
- Fill in the "Test signature" field with the bench output of your patch, e.g. `4190940`. If you added a line containing "Bench: [the bench of the patch]" to the commit message, you don't need to insert this information.
- Fill in the "Info" field describing your change; keep it short but exhaustive.
- Click "Submit test".
- Nodestime lets SF use budgeted node-count management, which removes the noise introduced by inconsistent hardware speed. If the value you are tuning is unlikely to change the nps significantly, change `Test options` to read `Hash=128 nodestime=600`. With `nodestime` set to 600, the workers should be able to search at least 600 nodes per millisecond (that is, 0.6 Mnps) per thread, which is above the Fishtest minimum (540 nodes per millisecond per thread) below which a worker is considered too slow.
- For search parameters, normal TC tuning is usually preferred.
- Choose the appropriate TC. A short TC is appropriate to get faster approximations or for checking the tune's correctness and parameters. A long TC (`60+0.6`) is best to get values that scale better. If using regular time control, set the appropriate TC and hash (`Hash=16` for `20+0.2`, `Hash=64` for `60+0.6`). If using `nodestime`, set the time control to `160+1.6` to get the equivalent of `60+0.6` with normal TM. The `nodestime` games will usually finish long before the time is up, provided that the workers are able to search more nodes per millisecond than specified in the `nodestime` parameter; otherwise your games will end on time.
- Paste the list that you copied into the `SPSA Parameters` list. This is comma-separated data for `parameter name, initial value, minimum, maximum, ck final value, rk final value`, in that order. Here you can also make manual changes to min and max values for parameters.
- A good tune run exhibits a significant change in the tuned values while minimizing random change. As a general rule, tuning many values at once (like a PSQT table) generates random change, while values that are only rarely used to score a position can have a hard time moving at all. Tweaking the ck value often helps to get a better result, with a higher ck forcing value change and a lower one reducing random change.
- Make sure your test repo is correct (e.g., https://github.com/yourname/Stockfish).
- Fill in the info describing your change.
- Click "Submit test".
- If after a few thousand games the values are barely changing at all, the tuning run is useless and should be stopped. This usually happens when the ck value is too low for the parameter changes to influence the tuning result more than random noise.
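The relationship between `160+1.6` with `nodestime=600` and a normal `60+0.6` game can be made concrete with a little arithmetic. The Python sketch below assumes that `nodestime` turns every millisecond of clock time into that many nodes of budget; this is one reading of "budgeted node count management", not something this page states. The worker speed of 1600 nodes per millisecond is a hypothetical figure, picked because it makes the two time controls line up.

```python
def node_budget(seconds: float, nodestime: int) -> int:
    """Nodes granted for a given amount of clock time (assumed conversion)."""
    return int(round(seconds * 1000 * nodestime))


def wall_seconds(nodes: int, worker_nodes_per_ms: float) -> float:
    """Real time a worker of the given speed needs to search that many nodes."""
    return nodes / worker_nodes_per_ms / 1000


if __name__ == "__main__":
    base = node_budget(160, 600)   # budget for the base time of 160+1.6
    inc = node_budget(1.6, 600)    # budget added per move by the increment
    print(base, inc)
    print(wall_seconds(base, 600))    # a worker exactly at the nodestime speed
    print(wall_seconds(base, 1600))   # a hypothetical faster worker
```

Under these assumptions a worker searching exactly 600 nodes per millisecond uses the full 160 seconds, while a faster worker burns the same node budget in less real time, which is why such games usually finish early.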
Note
If you use the default range with an initial value of 0, the parameter will not be tuned since 2 * 0 and 0 / 2 are both 0, and empty intervals are not tuned.
Note
You cannot modify the number of games for SPSA tests, because the SPSA hyperparameters are based on the initial number of games.
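Before pasting the parameter list, it can help to sanity-check it against the notes above. The Python sketch below parses lines in the `parameter name, initial value, minimum, maximum, ck final value, rk final value` order and flags empty intervals and out-of-range start values. The names `SpsaParam`, `parse_spsa_params` and `problems` are hypothetical, and the sample values are invented for the demonstration.

```python
from typing import NamedTuple


class SpsaParam(NamedTuple):
    name: str
    start: float
    lo: float
    hi: float
    ck_end: float
    rk_end: float


def parse_spsa_params(text: str) -> list:
    """Parse one parameter per line, six comma-separated fields each."""
    params = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        name, *numbers = fields
        params.append(SpsaParam(name, *map(float, numbers)))
    return params


def problems(param: SpsaParam) -> list:
    """Return human-readable warnings for a single parameter."""
    issues = []
    if param.lo >= param.hi:
        issues.append("empty interval, parameter will not be tuned")
    elif not param.lo <= param.start <= param.hi:
        issues.append("start value outside [min, max]")
    return issues


if __name__ == "__main__":
    # Invented sample: myZero shows the 'initial value of 0' case from the note above.
    sample = "myKing,10,0,20,1,0.002\nmyZero,0,0,0,0,0.002\n"
    for p in parse_spsa_params(sample):
        print(p.name, problems(p))
```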
Definitions
- Hippopotamus: A patch that adds at least 10 lines of code. Please avoid them if possible; it is often better to make a minimal version of your idea.
- Parameters tweak: Changing the value of some constants in the code. The generated machine code is the same complexity (aiming at the same number of processor instructions).
- Simplification: A way to make the code and the algorithms clearer. Most of the time, a necessary condition is that the number of lines in the source code of Stockfish goes down, or the number of processor instructions in the generated code goes down -- but this is by no means a sufficient condition because it is of course unfortunately possible to lower the number of lines in the code while obfuscating it.
- Bug: A bug that has been discussed in Discord or as an issue in GitHub and confirmed as such by the maintainers. Potential bug-fix solutions shall be first discussed in the server, then tested in the framework.
- TC: Time control, should almost always be STC or LTC, but exceptionally (e.g., testing time management) can be 40/10 or 40/10+0.1 (40 moves in 10s or 10s + increment), or 10+0 (sudden death).
- STC: Short time control (10+0.1)
- LTC: Long time control (60+0.6)
- SMP: Multi-threaded (symmetric multiprocessing) tests
- SPRT(x,y): Sequential probability ratio test (SPRT) with elo0 = x and elo1 = y
The following recommendations to choose the right parameters can be best understood with the graphical SPRT calculator, which draws nice curves displaying the pass-rate and the average length of runs for various values of SPRT(x,y). When in doubt, stick with the standard test parameters.
There are single-threaded tests (e.g., STC, LTC) and multi-threaded (symmetric multiprocessing, "SMP") tests (e.g., STC SMP, LTC SMP) that run exactly on 8 threads. SMP testing is performed monthly as part of regression testing or pre-release testing of Stockfish.
Standard tests are what we use for almost all patches. They are our workhorse, designed to commit only robust patches that almost surely work. Our goal is to minimize the possibility of regression and to avoid adding unnecessary complexity.
- Run your test with the "Test Type": STC
- If your STC test passes, create a new test with the "Test Type": LTC. Note: If there have been no new commits to the Stockfish master branch since you created your first test, you can click the "Reschedule" button inside your test so most of the information will be automatically filled for you in the new test.
- If your LTC test passes, congratulations! Create a pull request against the official-stockfish repository, so your changes can be reviewed. Please remember that it is not guaranteed to be committed.
Non-regression tests must be used for all functional simplifications, even one-liners, to check whether removing the code is detrimental to Stockfish's strength.
We try to reject an Elo loss, and even a neutral patch can fail. Nevertheless, because the code under test is simpler/smaller than the original, we don't require the stricter standard mode. These tests are also used for bug fixes and other special cases, but only after being discussed in Discord and approved in advance, so that non-regression mode does not become people's preferred toy in place of the stricter standard mode.
For the most part, follow the same procedures as for a standard test, but use the non-regression SPRT bounds.
If you think that your patch will perform better in very long time controls than in shorter ones you might be able to test it at longer time controls than our standard LTC.
- Time control: 180s + 1.8s, Threads: 1.
- Time control: 60s + 0.6s, Threads: 8, or 70s + 0.7s, Threads: 7, depending on the availability of 8-thread workers.
Among parameter tweaks, a special sub-case is the so-called union patch or combo patch, which is a bundling of patches that failed SPRT but with positive or near positive scores. Sometimes retesting the union as a whole passes SPRT. Due to the nature of the approach and because each patch failed already, a union has some constraints:
- Maximum 2 patches per union.
- Each patch shall be trivial, like a parameter tweak. Patches that add/remove a concept/idea/feature shall pass individually.
If your patch is a non-functional change, you might still need to run it through Fishtest, but there are exceptions:
Usually, there is an open PR to collect small cleanups. They are trivial, don't change the bench, and are generally non-controversial. This might be typos, variable names, dead code, and similar.
Code refactoring and non-functional simplifications are a very wide family of patches and, by their nature, more subjective than other kinds of patches. So also the acceptance guidelines rely more on maintainer knowledge, experience, and sensibility.
Anything not fitting the above small cleanup category should typically be tested on Fishtest unless the code is not really exercised on Fishtest (e.g. syzygy) and doesn't change bench. This approach verifies code correctness (no crashes), and makes sure there are no unexpected side effects or slowdowns.
- Regression tests will use non-regression SPRT bounds or exceptionally other bounds according to maintainer judgment.
- Rejects are always possible if the patch is worse than the original.
These kinds of patches, although very important for long-term code quality, are also the ones that can raise discussions, because code style is largely subjective. So be prepared to accept a negative judgment by the maintainer: judging these patches is a hard job, so please do not take it personally or start an endless discussion if your patch is rejected. Simply move on to your next Elo-winning idea.
Test a speed optimization on your machine by running `./stockfish bench`. It is recommended to run it several times and compute a mean and stdev to see if the improvement is statistically significant. For Linux/Windows you can use psybench; for Windows you can also use the specialized tools FishBench and BuildTester, both excellent.
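The mean and stdev comparison suggested above can be sketched as follows. This is a minimal Python illustration, assuming you have already collected nodes-per-second figures from several bench runs of the base and patched binaries; `summarize` and `speedup_percent` are hypothetical helper names, and the standard error shown is the usual one for a difference of two independent means.

```python
import math
import statistics


def summarize(nps_runs):
    """Return (mean, sample standard deviation) of a list of nps measurements."""
    return statistics.mean(nps_runs), statistics.stdev(nps_runs)


def speedup_percent(base_runs, patch_runs):
    """Return (speedup in %, standard error in %) of patch relative to base."""
    base_mean, base_sd = summarize(base_runs)
    patch_mean, patch_sd = summarize(patch_runs)
    gain = 100 * (patch_mean / base_mean - 1)
    # Standard error of the difference of the two means, relative to the base mean.
    err = 100 * math.sqrt(
        base_sd ** 2 / len(base_runs) + patch_sd ** 2 / len(patch_runs)
    ) / base_mean
    return gain, err
```

As a rough guide, a gain that is not clearly larger than its standard error is indistinguishable from noise and calls for more runs.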
In case you have access to Linux, the most reliable way is to use the amazing perf tool:
$ sudo perf stat -r 5 -a -B -e cycles:u,instructions:u ./stockfish bench > /dev/null
This command will run the bench 5 times, counting instructions and cycles and averaging. At the end, a report will be printed:
Performance counter stats for 'system wide' (5 runs):
22.747.981.856 cycles:u ( +- 0,05% )
28.409.592.052 instructions:u # 1,25 insn per cycle ( +- 0,06% )
4,331400608 seconds time elapsed ( +- 0,11% )
Important
Note the very small error. If you get a noticeably bigger one, please run the test again. If you run Linux in a VM, like VMware, you have to enable performance counters in the virtual machine settings.
If the code is a trivial change, send us a pull request. Speedups need to be further verified, for at least 2 reasons:
- The speedup needs to be statistically significant and not just random noise.
- The speedup needs to be confirmed on different machines. Sometimes a speedup on one machine is a slowdown on another.
To be considered for inclusion, the speedup should be around 0.5%. If the patch is more complex, then the patch will go under normal STC+LTC Fishtest tests. This will require about 0.25% speedup at STC and about 0.7% speedup at LTC for a 50% passing chance. 1% speedups pass with an 85% chance. The rationale is that a speed-up is totally comparable to a normal patch: it adds complexity with the aim to improve Elo, so it makes sense to test under the same conditions. Some data and discussion in the following issue: https://github.com/official-stockfish/Stockfish/issues/2593
Once you are ready, once the tests with your nice idea have passed and/or you have enough speed data to support your improvements, congratulations: you can now open a pull request against the master branch of official-stockfish.
Guidelines for a great pull request:
- Is my pull request up-to-date with current master?
- The first thing to do before opening a pull request is to synchronize your master branch with the official master branch.
- Does my pull request consist of a single commit?
- If your branch has a long history, you may squash the commits and force push to your development branch or consider creating a new branch where you squash your changes together, then open the pull request from this new branch.
- Is my code really really clean?
- Employ a coding style that is similar to surrounding Stockfish code, remove all spaces on empty lines and trailing spaces everywhere, etc.
- Can I improve the quality of my commit message?
- Your git commit message should have a high-level description of the patch, explaining the reasoning behind the patch and why it improves on the current code. The pull request comment will automatically be filled with your commit message.
- Do I provide easy ways to check my data?
- Most changes should also report the results obtained in Fishtest at STC and LTC with links to the test pages.
- What will be the next signature of Stockfish if my patch is accepted?
- Both the commit message and the pull request comment on GitHub must mention if the patch is a 'No functional change' or changes the search. The last line of the commit message should be either 'No functional change' or 'Bench: XXXXXXX' where 'Nodes searched : XXXXXXX' can be found in the output of a `./stockfish bench` invocation.
- Is my patch portable?
- We have continuous integration testing, which will check for standard conformance and reproducibility of search using various compilers. Test your code with the continuous integration of your own GitHub repository by pushing it to the helper branch `github_ci` with the command `git push -f origin HEAD:github_ci`, and analyze the continuous integration logs in case of failure.
- How could we continue after the patch?
- Write a few words in the commit message to offer perspectives for future work. This is often a nice way to get momentum for continuing research on this subject in Stockfish, and maybe somebody else in the community will pick up the challenge.
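The rule about the last line of the commit message lends itself to a quick local check before opening the pull request. The Python sketch below is only an illustration; `closing_line_ok` is a hypothetical helper, and the pattern deliberately accepts both the `Bench: XXXXXXX` form described above and the lowercase `bench XXXXXXX` spelling that appears in the example commit on this page.

```python
import re

# Accepts 'Bench: 1234567' as well as 'bench 1234567'.
BENCH_LINE = re.compile(r"^[Bb]ench:?\s+(\d+)$")


def closing_line_ok(message: str) -> bool:
    """Check that the last non-empty line states the bench or 'No functional change'."""
    lines = [line.strip() for line in message.strip().splitlines()]
    if not lines:
        return False
    last = lines[-1]
    return last == "No functional change" or bool(BENCH_LINE.match(last))


if __name__ == "__main__":
    print(closing_line_ok("Tweak futility margin\n\nPassed STC and LTC.\n\nBench: 4190940\n"))
    print(closing_line_ok("Fix typo in comment\n\nNo functional change\n"))
```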
Examples:
Simplify away nnue scale pawn count multiplier
Removes 2x multipliers in nnue scale calculation along with the pawn count term that was recently reintroduced.
Passed non-regression STC:
https://tests.stockfishchess.org/tests/view/64305bc720eb941419bdf72e
LLR: 2.95 (-2.94,2.94) <-1.75,0.25>
Total: 38008 W: 10234 L: 10021 D: 17753
Ptnml(0-2): 96, 4151, 10323, 4312, 122
Passed non-regression LTC:
https://tests.stockfishchess.org/tests/view/6430b76a028b029b01ac9bfd
LLR: 2.94 (-2.94,2.94) <-1.75,0.25>
Total: 91232 W: 24686 L: 24547 D: 41999
Ptnml(0-2): 30, 8721, 27986, 8838, 41
bench 4017320
Set the length of GIT_SHA to 8 characters
Previously, the length of git commit hashes could vary depending on the git environment.
No functional change
You can find more examples in the commit history.
The advanced options section of the test creation page contains options that should be toggled on/off only by advanced users. Currently, the advanced options are:
- Auto-purge: Toggles auto-purge on and off. Having it off can be beneficial when testing for time-management patches or for patches affecting different OS in different manners.
- Time odds: Use different time controls for the test and the base.
- Custom book: Use a custom opening book for tests.
- Disable adjudication: Disables adjudication for testing.
- Zugzwang Test Suite must be used on patches that affect verification search.
- More Bench Positions