Fishtest FAQ
Even if you don't know programming yet, you can help the Stockfish developers improve the chess engine by connecting your computer to the Fishtest network. Your computer will then play chess games in the background to help the developers test ideas and improvements.
Instructions on how to connect your computer to the Fishtest network are given here:
- Running the worker
- Running the worker on Windows
- Running the worker on Linux
- Running the worker on macOS
- Running the worker in the Amazon AWS EC2 cloud
For SPRT tests, which are by far the most common type, the worker sends an update to Fishtest every eight games. So on average, you can expect to lose four games when quitting the worker. Four STC games on a 1-core worker represent about 2 minutes of work. Four LTC games (which are less common) represent about 12 minutes of work.
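The arithmetic above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the per-game durations (about 0.5 minutes for STC and 3 minutes for LTC on a 1-core worker) are simply derived from the figures quoted above.

```python
def expected_lost_minutes(update_interval: int, minutes_per_game: float) -> float:
    """Average work lost when quitting at a random point between two updates.

    Assumes the quit moment is spread evenly over the update interval, so on
    average half of the games in the interval have not been reported yet.
    """
    expected_lost_games = update_interval / 2
    return expected_lost_games * minutes_per_game


# Per-game durations are assumptions derived from the figures in the text.
stc_loss = expected_lost_minutes(update_interval=8, minutes_per_game=0.5)
ltc_loss = expected_lost_minutes(update_interval=8, minutes_per_game=3.0)
print(f"STC: about {stc_loss:.0f} minutes, LTC: about {ltc_loss:.0f} minutes")
```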
The statistical models that Fishtest uses are based on the assumption that the pentanomial probabilities (a variation on the win, loss, draw probabilities) are the same for each worker. Therefore for each worker, a "residual" is shown on the overview page of every test. It is a measure of how far the worker deviates from the average. Small deviations are normally just due to statistical fluctuations and these will be colored green. However, if the deviation is exceptionally large then the residual will be colored yellow or even red. If this happens on a regular basis for a particular worker then this may be some cause for concern.
The following questions are more technical and aimed at potential Stockfish developers:
You should first check whether the test has already been run. You can look at the test history by following the corresponding link on the left of Fishtest's main view.
Most tests should use the two-stage approach, starting with stage 1, and if that passes, using the reschedule button to create the stage 2 test.
Selecting the type of test according to the stage you are in will configure all the necessary options for you.
Stage 1 | Reschedule | Stage 2 |
---|---|---|
SPRT stands for sequential probability ratio test. In SPRT, the null hypothesis is that the two engines are equal in strength, while the alternative hypothesis is that one of the engines is stronger. SPRT lets us test the hypothesis with the least expected number of games; that is, we do not fix the number of games to be played in advance. The parameters of the test control the Type 1 and Type 2 error rates. Essentially, we run matches sequentially, and after each match we update a value computed from a likelihood function. The test is terminated when the value falls below a lower-bound threshold or rises above an upper-bound threshold. The thresholds are calculated from the two parameters given to the test (please read the paragraph "Testing methodology" on the page Creating my first test for details).
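As a rough illustration of the stopping rule, here is a minimal sketch of a textbook Wald SPRT. It is not Fishtest's implementation: it uses a toy win/loss model without draws (Fishtest works with pentanomial probabilities), the function names are hypothetical, and the error rates `alpha = beta = 0.05` as well as the win probabilities `p0` and `p1` are illustrative assumptions.

```python
import math
from typing import Iterable, Tuple


def sprt_bounds(alpha: float, beta: float) -> Tuple[float, float]:
    """Wald's approximate thresholds for the log-likelihood ratio (LLR)."""
    lower = math.log(beta / (1 - alpha))
    upper = math.log((1 - beta) / alpha)
    return lower, upper


def run_sprt(results: Iterable[bool], p0: float, p1: float,
             alpha: float = 0.05, beta: float = 0.05) -> Tuple[str, int]:
    """Toy SPRT on win/loss results: H0 says P(win) = p0, H1 says P(win) = p1.

    Returns the decision and the number of games consumed.
    """
    lower, upper = sprt_bounds(alpha, beta)
    llr = 0.0
    games = 0
    for won in results:
        games += 1
        # Update the LLR with the contribution of this single game.
        llr += math.log(p1 / p0) if won else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", games
        if llr <= lower:
            return "accept H0", games
    return "undecided", games


lower, upper = sprt_bounds(0.05, 0.05)
print(f"stop when LLR <= {lower:.2f} or LLR >= {upper:.2f}")
print(run_sprt([True] * 100, p0=0.50, p1=0.55))
```

The number of games is not fixed in advance: the loop stops as soon as the accumulated evidence crosses either threshold, which is the property that makes SPRT economical.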
You can use the NumGames stop rule, with 20000 games TC 10+0.1, and schedule a few tests around the direction you want to tune in. If you find a tuning that looks good, you can then schedule a two-stage SPRT test.
Generally, four or five tries is the limit. It's a good balance between exploring the change and not giving lucky tries too much of a chance to pass.
No. For various reasons, please base your tests on the current SF master.
A union is the bundling of patches that failed SPRT but with a positive or near-positive score. Sometimes retesting the union as a whole passes SPRT. Due to the nature of the approach and because each individual patch failed already, a union has some constraints:
- Maximum 2 patches per union
- Each patch shall be trivial, like a parameter tweak. Patches that add/remove a concept/idea/feature shall pass individually.
If your branch name is `passed_pawn`, you can enter `passed_pawn^`, `passed_pawn^^`, ... in the branch field of the test submission page at https://tests.stockfishchess.org/tests/run .
Important
Note for patch authors: when testing patches with more than 8 threads, it is necessary to disable "thread binding" in engine.cpp. Not doing so would have a negative effect on Windows contributors' machines with multiple NUMA nodes (more than one physical CPU) and more than 8 cores, because our test scripts run in parallel on Fishtest. This would bias the statistical value of the test.
Diff:
```diff
diff --git a/src/engine.cpp b/src/engine.cpp
index 81bb260b..bf3ebc12 100644
--- a/src/engine.cpp
+++ b/src/engine.cpp
@@ -184,23 +184,23 @@ void Engine::set_position(const std::string& fen, const std::vector<std::string>
 // modifiers
 void Engine::set_numa_config_from_option(const std::string& o) {
-    if (o == "auto" || o == "system")
-    {
-        numaContext.set_numa_config(NumaConfig::from_system());
-    }
-    else if (o == "hardware")
-    {
-        // Don't respect affinity set in the system.
-        numaContext.set_numa_config(NumaConfig::from_system(false));
-    }
-    else if (o == "none")
-    {
-        numaContext.set_numa_config(NumaConfig{});
-    }
-    else
-    {
-        numaContext.set_numa_config(NumaConfig::from_string(o));
-    }
+    // if (o == "auto" || o == "system")
+    // {
+    //     numaContext.set_numa_config(NumaConfig::from_system());
+    // }
+    // else if (o == "hardware")
+    // {
+    //     // Don't respect affinity set in the system.
+    //     numaContext.set_numa_config(NumaConfig::from_system(false));
+    // }
+    // else if (o == "none")
+    // {
+        numaContext.set_numa_config(NumaConfig{});  // <-------------
+    // }
+    // else
+    // {
+    //     numaContext.set_numa_config(NumaConfig::from_string(o));
+    // }
     // Force reallocation of threads in case affinities need to change.
     resize_threads();
```
First, note that regression tests are not actually run to detect regressions. SF quality control is very stringent and regressive patches are very unlikely to make it into master. Instead, they are run to get an idea of SF's progress over time, which is impressive. See
https://github.com/official-stockfish/Stockfish/wiki/Regression-Tests
But still... what if the Elo outcome of a regression test is disappointingly low? Usually, there is little reason to worry.
- First of all: wait till the test is finished. Drawing conclusions from an unfinished test is statistically meaningless.
- Look at the error bars. The previous test may have been a lucky run, and the current one is perhaps an unlucky one. Note that the error bar is for the Elo relative to the fixed release (base). Differences between two such Elo estimates have nearly double the statistical error (2-3 Elo).
- SFdev vs SF11: NNUE vs classical evaluation is very sensitive to the hardware mix present at the time of testing. If a fleet of AVX512 workers is present/absent, Elo will be larger/smaller.
- Error bars are designed to be right 95% of the time. So, conversely, 1 in 20 tests will be an outlier.
- Selection bias is a book-related effect: patches are more likely to be selected if they perform well with the testing book. When they are retested with a different book, their Elo score may be adversely affected.
- Elo estimates of single patches (SPRT runs) typically come with large error bars. Take this into account when adding Elo estimates. Furthermore, the Elo estimates of passing patches are biased. The SPRT Elo estimates are only unbiased if one takes all patches into account, both passed and non-passed ones. As a result, the Elo gain measured by a regression test will typically be less than the sum of the estimated Elo gains of the individual patches since the previous regression test.
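The bias in the Elo estimates of passing patches can be illustrated with a small simulation. This is a deliberately simplified sketch, not a model of Fishtest: the function name and all numbers are hypothetical, every simulated patch has the same true Elo, the measurement noise is Gaussian, and a fixed pass threshold stands in for the SPRT.

```python
import random
import statistics
from typing import Tuple


def simulate_selection_bias(n_patches: int = 10_000, true_elo: float = 0.0,
                            noise_sd: float = 2.0, pass_threshold: float = 2.0,
                            seed: int = 0) -> Tuple[float, float]:
    """Return (mean estimate over all patches, mean estimate over passed ones).

    Each patch has the same true Elo; its measured Elo is the true value plus
    Gaussian noise. A patch "passes" when its measured Elo exceeds the threshold.
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_elo, noise_sd) for _ in range(n_patches)]
    passed = [e for e in estimates if e > pass_threshold]
    return statistics.fmean(estimates), statistics.fmean(passed)


mean_all, mean_passed = simulate_selection_bias()
print(f"mean estimate, all patches:    {mean_all:+.2f} Elo")
print(f"mean estimate, passed patches: {mean_passed:+.2f} Elo")
```

Averaged over all patches, the estimates stay close to the true value, while the passed subset looks clearly better than it really is. Summing only the estimates of passed patches therefore overstates the real gain.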
If a book is new, first make a PR against the Stockfish book repo https://github.com/official-stockfish/books and wait for a maintainer to commit it.
Then use the books to run time odds tests of master vs itself with a fixed number of games and compare the normalized Elo estimates - taking into account the error bars. Don't make the time odds too large since the aim is to approximate standard testing conditions. On the other hand, you also cannot make them too small since in that case, you will need many games to separate the books. I have had good experiences with tests of 60000 games with 30% time odds. Using this procedure it has been shown that unbalanced books are definitely better than balanced books for engine tests.
- Do not run SPRT tests. They are a waste of resources for this application.
- Do not run tests of master vs an earlier version. This may give misleading results as it favors the current book. This effect (selection bias) has been shown to exist several times.
- This procedure can also be used to evaluate other testing changes (e.g. contempt). For changes that affect the amount of resources used (e.g. TC) one should take the resources into account (the amount of resources used by a test is ~ (game duration)/(normalized Elo)^2).
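The resource rule of thumb in the last item can be turned into a small comparison helper. This is a sketch under the stated proportionality only: the function name is hypothetical, the result is a relative figure with no absolute meaning, and the numbers in the example are made up.

```python
import math


def relative_cost(game_duration: float, normalized_elo: float) -> float:
    """Relative resources used by a test: (game duration) / (normalized Elo)^2.

    Only ratios between two configurations are meaningful.
    """
    if normalized_elo == 0:
        raise ValueError("normalized Elo must be non-zero")
    return game_duration / normalized_elo ** 2


# Hypothetical example: a setup with games twice as long breaks even only if
# it raises the normalized Elo by a factor of sqrt(2).
baseline = relative_cost(game_duration=1.0, normalized_elo=1.0)
longer_games = relative_cost(game_duration=2.0, normalized_elo=math.sqrt(2))
print(f"cost ratio: {longer_games / baseline:.2f}")
```

Because the normalized Elo enters squared, a modest gain in sensitivity can pay for a noticeably longer game duration.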