MTD best in Stockfish


Michiel Koorn

MTD best in Stockfish

Post by Michiel Koorn »

The following is an attempt to gather my thoughts and get some feedback from the forum, so any tips, tricks, or comments are appreciated.

I have been implementing an MTD-best approach in Stockfish. It is a hack, so I bend existing code wherever necessary, and from a clean and efficient code perspective it is ugly.
The basic paradigm of MTD best is that it is not necessary to determine the exact score of the best move, only to establish that there is no better move. As long as lowerbound(bestmove) <= upperbound(nextmove) this has not been established, and I can improve the estimate of either bound with a new zero-window search. I have implemented this at the root level search. The current implementation is basic, and optimisations to reduce the number of iterations required should be possible.
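To make the paradigm concrete, below is a minimal, self-contained sketch of such a root loop. It only illustrates the idea and is not the actual mod: the zero-window probe is a callback, the bound bookkeeping is simplified, and the probe values are chosen naively; a real implementation would seed the candidate's lower bound from the previous iteration and would have to cope with search instability.

Code:

#include <functional>
#include <vector>

struct RootBound {
    int lower = -32000;   // proven lower bound on this root move's score
    int upper =  32000;   // proven upper bound on this root move's score
};

// probe(moveIdx, beta) runs a zero-window search on the given root move and
// returns a fail-soft score (>= beta means fail high, < beta means fail low).
// Assumes at least two legal root moves.
int mtdBestRoot(std::vector<RootBound>& mv,
                const std::function<int(int, int)>& probe)
{
    for (;;) {
        // Candidate best = move with the highest lower bound;
        // rival = move, other than the candidate, with the highest upper bound.
        int best = 0;
        for (int i = 1; i < (int)mv.size(); ++i)
            if (mv[i].lower > mv[best].lower)
                best = i;

        int next = (best == 0) ? 1 : 0;
        for (int i = 0; i < (int)mv.size(); ++i)
            if (i != best && mv[i].upper > mv[next].upper)
                next = i;

        // Stop as soon as no other move can be better than the candidate.
        if (mv[best].lower >= mv[next].upper)
            return best;

        // Otherwise probe the rival just above the candidate's lower bound:
        // a fail low rules the rival out, a fail high makes it the new candidate.
        int beta  = mv[best].lower + 1;
        int score = probe(next, beta);
        if (score >= beta)
            mv[next].lower = score;   // fail high: raise the rival's lower bound
        else
            mv[next].upper = score;   // fail low: drop the rival's upper bound
    }
}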
Some other things need to change as well:
Stockfish uses a search stack with search information (currentmove, bestmove, killermoves, etc.). Since I may alternate between searching the best and the next move when improving bounds, I have made two search stacks: one for the best move and one for the next move. One of the implications is that I have killer moves focussed on the PV (the term is used loosely, since it is the lower estimate and not the best line) and a separate set for the rest.
The transposition table stores both an upper and a lower bound instead of just one score. This means the table contains more information but fewer entries for the same table size.
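As an illustration, an entry with two bounds could look something like the sketch below; the field names and sizes are made up for the example and are not Stockfish's actual packing. A probe can then cut off when lower >= beta (given enough lowerDepth) or when upper <= alpha (given enough upperDepth).

Code:

#include <cstdint>

// Hypothetical transposition-table entry that keeps both bounds instead of a
// single score plus a bound-type flag.
struct TTEntryTwoBounds {
    uint32_t key32;       // upper bits of the hash key, for verification
    uint16_t move16;      // best / refutation move found at this position
    int16_t  lower;       // proven lower bound (from fail-high results)
    int16_t  upper;       // proven upper bound (from fail-low results)
    int8_t   lowerDepth;  // draft to which the lower bound is valid
    int8_t   upperDepth;  // draft to which the upper bound is valid
    uint8_t  generation;  // aging information for replacement
};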
Some logic originally in Stockfish may now be broken or altered as an unintended consequence. I have seen one issue, possibly two:
The allotted-search-time algorithms are broken: the code apparently plays within the allotted time without functional extensions of the search time, with the result that the mod saves time relative to the original code. My hack is to increase the default allotted time, although this of course does not have the same impact.
Using zero-window searches exclusively means that relatively more time is used at the root and the searches become more independent. What does this mean for threading and efficient use of multiple cores?
Does anybody see other candidates?

To validate my implementation I have played small tournaments against the original code. The mod is functional, but loses 30-70. Sometimes a blunder makes it to the root level. If I analyse the position with both the original and the mod code, the blunder is not reproducible. My working assumption is that it is a bug in the transposition table logic, but I am wide open for suggestions as I am unable to locate it.

Analysis.
I have focussed the first analysis on the size of the tree. For each move made in the tournaments I have calculated the effective branching factor for both the original code and the mod. Reminder: the comments above apply to the data below.
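For reference, the per-move effective branching factor can be computed as in the sketch below, here assuming the definition EBF = nodes^(1/depth) for the node count and completed depth of that move's search.

Code:

#include <cmath>
#include <cstdint>

// Effective branching factor of a single search, assuming the definition
// EBF = nodes^(1/depth): the constant per-ply growth factor that would
// produce the observed node count at the completed depth.
double effectiveBranchingFactor(uint64_t nodes, int depth)
{
    if (depth <= 0 || nodes == 0)
        return 0.0;
    return std::pow(static_cast<double>(nodes), 1.0 / depth);
}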

The mod has a statistically significantly lower branching factor:

Code:

mod      N   Mean  StDev  SE Mean
MTD   4062  1,867  0,205   0,0032
Orig  4046  1,904  0,223   0,0035

Difference = mu (MTD) - mu (Orig)
Estimate for difference:  -0,037364
95% CI for difference:  (-0,046691; -0,028037)
T-Test of difference = 0 (vs not =): T-Value = -7,85  P-Value = 0,000  DF = 8038
The difference is small, but with an effective depth of 17 ply this means a node count reduction of just under 30%.
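For completeness, the compounding works out to (1.867 / 1.904)^17 ≈ 0.72, i.e. roughly 28% fewer nodes at the same effective depth.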

The mod has a more consistent branching factor (statistically significantly lower standard deviation):

Code:

mod     N     Lower     StDev     Upper
 MTD  4062  0,199608  0,204576  0,209789
Orig  4046  0,217980  0,223416  0,229121

F-Test (normal distribution)
Test statistic = 0,84; p-value = 0,000
The combination of the two means that the worst-case node count per ply decreases.

The cost can be seen in the nodes per second. I was shocked by the size of the difference (246k vs. 371k) and cannot explain it at present. I made no effort to program efficiently, but did not expect to have done so badly. This should be easy to improve, since the fundamentals are not very complex; it is either the efficiency of the root-level code or that of the transposition table logic.

Code:

mod      N    Mean   StDev  SE Mean
MTD   4062  246639  141267     2217
Orig  4046  371264  149617     2352

Difference = mu (MTD) - mu (Orig)
Estimate for difference:  -124625
95% CI for difference:  (-130960; -118289)
T-Test of difference = 0 (vs not =): T-Value = -38,56  P-Value = 0,000  DF = 8075
As it stands now I have to fix the bug, improve the NPS, and retest. I have a tendency to attribute the differences to MTD best, but it is possible that they are caused by one of the other changes. Tentatively, I would think that MTD best may be beneficial once the break-even point between the lower NPS and the lower node count is passed.
bob

Re: MTD best in Stockfish

Post by bob »

Michiel Koorn wrote:The following is an attempt to gather my thoughts and get some feedback from the forum, so any tips, tricks, or comments are appreciated.
Reduced NPS is not unexpected. When you go to a null-window search, you either search everything or search just one. No dilly-dallying around with PV nodes. This will reduce NPS since every CUT node involves a lot of work (move generation, etc.) and it all gets thrown away when you search just one node. I did an MTD several years ago, and someone else recently did an MTD implementation on 23.0. I tested it on our cluster, and NPS dropped quite a bit, although the overall performance was only about -10 Elo from the current PVS.

However, for debugging, I want a PV that leads to an endpoint. MTD(f) has a difficult time providing that, and your algorithm (MTD best) is even worse. You often find the best move, but nothing else.
Michiel Koorn

Re: MTD best in Stockfish

Post by Michiel Koorn »

bob wrote: Reduced NPS is not unexpected. When you go to a null-window search, you either search everything or search just one. No dilly-dallying around with PV nodes. This will reduce NPS since every CUT node involves a lot of work (move generation, etc) and it all gets thrown away when you search just one node. I did an MTD several years ago, and someone else recently did an mtd implementation on 23.0. I tested it on our cluster, and NPS dropped quite a bit, although the overall performance was only about -10 Elo from the current PVS.
Thanks, I hadn't realised that. This saves me looking for a performance gap that is already explained.
bob wrote: However, for debugging, I want a PV that leads to an endpoint.
This leads me to the realisation that I can revert to the original root search for debugging purposes and reinstate MTD best once the code changes are proven to be bug-free.
bob wrote: MTD(f) has a difficult time providing that, and your algorithm (mtd best) is even worse. You often find the best move, but nothing else.
Since I did self-play I had the option of comparing PVs, and what I saw was that they often match to a surprising depth. Part of this is expectation level, of course. Secondly, in close calls between alternative moves MTD best converges to the true PV. Thirdly, the implementation of the algorithm tries to converge to the score over the depth iterations.