parallelizing & move reduction

Discussion of chess software programming and technical issues.

Moderator: Ras

flok

parallelizing & move reduction

Post by flok »

Hi,

My chess program is now able to use all sockets/cores/threads of a system. A quick test showed me that for each extra complete core added (i.e. not for hyperthreading) the performance increases almost linearly.

Unfortunately the number of nodes visited also increases almost(!) linearly.
So I was wondering (before I reinvent the wheel): are there any well-known methods for e.g. alpha/beta value sharing, or any other optimization for multi-threaded chess engines?

I did a test run where I let it calculate the first move of a game: first with 1 core, then with 2 cores, and so on. The whole program was restarted after each iteration. The system has 6 cores, with roughly 1 in use for other tasks. Plenty of free RAM and a transposition table of only 65k elements.
Results: http://vanheusden.com/DeepBrutePos/pos1.7rev555.pdf
flok

Re: parallelizing & move reduction

Post by flok »

OK, one thing that helps: FIRST sort the move list (killer moves at the start of the list) and THEN move the transposition-table hit to the front; otherwise the whole point of putting the transposition-table move first is lost.
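A minimal sketch of that ordering step, in pseudo-C++ (the Move fields and names are illustrative assumptions, not the actual DeepBrutePos source):

Code:

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative sketch only, not the actual DeepBrutePos source;
// the Move fields are assumptions for the example.
struct Move {
    int  from, to;
    bool isKiller;   // set when this move matches a killer slot
};

void orderMoves(std::vector<Move>& moves, const Move& ttMove)
{
    // 1) stable sort: killers first, everything else keeps its order
    std::stable_sort(moves.begin(), moves.end(),
                     [](const Move& a, const Move& b) {
                         return a.isKiller && !b.isKiller;
                     });

    // 2) only THEN hoist the TT move to the front; hoisting before the
    //    sort would let the sort push it away from index 0 again
    for (std::size_t i = 0; i < moves.size(); ++i) {
        if (moves[i].from == ttMove.from && moves[i].to == ttMove.to) {
            std::rotate(moves.begin(), moves.begin() + i,
                        moves.begin() + i + 1);
            break;
        }
    }
}

The test results (2nd column: threads; 4th/5th/6th columns: nodes, seconds, nodes per second; last column: best move):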

Code:

7 1 8 1501940 88.519 16967.430721088127 E7-E6
7 2 8 518232 17.657 29349.946196975703 E7-E6
7 3 8 1554687 33.48 46436.29032258065 A7-A6
7 4 8 1332094 21.779 61164.14895082419 D7-D5
7 5 8 1363957 18.119 75277.71952094487 E7-E6
7 6 8 5027927 60.119 83632.91139240506 A7-A6
7 7 8 4763770 57.787 82436.70721788637 B7-B6
7 8 8 7190724 79.96 89929.01450725364 B7-B5
7 9 8 2306538 28.119 82027.73925104023 E7-E6
7 10 8 1887020 22.813 82716.87195897076 E7-E6
7 11 8 5264865 63.321 83145.63888757284 A7-A6
7 12 8 9174128 103.784 88396.36167424651 A7-A5
The second line is somewhat strange.
(also: A7-A6?!)
Then runs 6-8 and 11-12 are gigantic, while 9 and 10 are reasonably fast?
It is almost as if the number of analyzed positions/nodes skyrockets when there are more threads than cores (though not always, see 9/10). Probably some scheduling artifact. [...] Yes, I can confirm it has to do with scheduling: if I play with thread priority, the results look different:

Code:

7 1 8 315362 20.431 15435.465713866182 D7-D5
7 2 8 559034 18.484 30244.21120969487 E7-E6
7 3 8 1621968 34.831 46566.793947919956 A7-A6
7 4 8 1334194 21.954 60772.25107041997 D7-D5
7 5 8 1359045 18.34 74102.78080697928 E7-E6
7 6 8 4188522 53.81 77839.10053893329 A7-A6
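One way to sidestep that oversubscription entirely is to never start more search workers than there are hardware threads. A minimal sketch using only the standard std::thread API (the function name is made up):

Code:

#include <thread>

// Cap the number of search workers at the number of hardware threads.
// std::thread::hardware_concurrency() counts logical cores (so it
// includes hyperthreads) and may return 0 when the value is unknown.
unsigned pickWorkerCount(unsigned requested)
{
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0)
        hw = 1;
    return requested < hw ? requested : hw;
}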
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: parallelizing & move reduction

Post by Sven »

flok wrote:OK, one thing that helps: FIRST sort the move list (killer moves at the start of the list) and THEN move the transposition-table hit to the front [...] Probably some scheduling artifact. [...] Yes, I can confirm it has to do with scheduling: if I play with thread priority, the results look different.
Hi Folkert,

Successful move ordering looks a bit different. The following is typical (a sketch follows the list):

1. TT move,
2. non-losing captures ordered by MVV/LVA,
3. killer moves,
4. losing captures,
5. quiet moves ordered by history heuristic.
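One common way to realize those five stages is a single ordering score per move, sorting the move list by that score in descending order. In this sketch the stage offsets, piece values and helper functions are illustrative assumptions, not taken from any specific engine:

Code:

// Illustrative only: the stage offsets, piece values and helper
// functions are assumptions, not taken from a specific engine.
struct Move { int from, to; bool isCapture; };

bool sameMove(const Move& a, const Move& b);     // assumed helpers,
int  staticExchangeEval(const Move& m);          // declarations only;
int  victimValue(const Move& m);                 // SEE >= 0 means the
int  attackerValue(const Move& m);               // capture is non-losing

const int STAGE = 1000000;   // keeps the five stages strictly separated

int moveScore(const Move& m, const Move& ttMove,
              const Move killers[2], const int history[64][64])
{
    if (sameMove(m, ttMove))
        return 5 * STAGE;                         // 1. TT move
    if (m.isCapture) {
        int mvvLva = 10 * victimValue(m) - attackerValue(m);
        if (staticExchangeEval(m) >= 0)
            return 4 * STAGE + mvvLva;            // 2. non-losing captures
        return 2 * STAGE + mvvLva;                // 4. losing captures
    }
    if (sameMove(m, killers[0]) || sameMove(m, killers[1]))
        return 3 * STAGE;                         // 3. killer moves
    return history[m.from][m.to];                 // 5. quiet moves, history
}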

As to your parallel search, it appears to work differently from most other SMP engines. The most obvious thing is that it seems to search a lot more nodes than expected. In general, the goal of using N cores (N>1) is to finish the same search task, e.g. a fixed-depth search of 8 plies, in significantly less time than with 1 core.
EDIT: Or, as we learned in a recent thread about Komodo and some other engines, to improve the quality of the search, in terms of more Elo, by searching a wider tree.

Can you explain how your parallel search works, in very few sentences? Does it resemble YBW, ABDADA, PVSplit, ...?

As I already told you in a different thread, I would completely ignore all results for >6 logical cores on your system with 6 physical cores. And if one of those physical cores was also partially busy with other tasks during your tests, then the 6-core result is not very interesting either.

Sven
flok

Re: parallelizing & move reduction

Post by flok »

Sven Schüle wrote:Successful move ordering looks a bit different. The following is typical:

1. TT move,
2. non-losing captures ordered by MVV/LVA,
3. killer moves,
4. losing captures,
5. quiet moves ordered by history heuristic.
Ah thanks, I think I can do something with it.

I tried sorting not only on killer moves but also on the value of the piece captured (pawn 1, rook 3, etc.). This seems to help: I played a 6-game match and DBP with this enhanced sorting won 4 games.
Sven Schüle wrote:As to your parallel search, it appears to work differently from most other SMP engines. [...] Can you explain how your parallel search works, in very few sentences?
Sure. It generates the list of moves to search. Then, for each move, it checks whether there are any threads left to do the calculation work. If so, that move is searched in a separate thread; if not, search() is called from the current thread. This happens in each thread recursively while depth >= 2.

I made sure that all threads working below node 'x' can signal each other when one of them produces a cutoff.
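In pseudo-C++ the described scheme might look roughly like this; all type and helper names are hypothetical, not the actual DBP source, and the shared atomic flag is what lets sibling threads communicate a cutoff:

Code:

#include <atomic>
#include <thread>
#include <vector>

// Hypothetical types and helpers, assumed for this sketch:
struct Position { /* ... */ };
struct Move { /* ... */ };
std::vector<Move> generateMoves(const Position& pos);
Position makeMove(const Position& pos, const Move& m);
int search(const Position& pos, int depth, int alpha, int beta); // serial

// Sketch of the scheme as described, not the actual DBP source.
// All threads searching children of the same node share 'cutoff',
// so a beta cutoff in one thread stops its siblings.
int parallelSearch(const Position& pos, int depth, int alpha, int beta,
                   std::atomic<int>& freeThreads)
{
    if (depth < 2)
        return search(pos, depth, alpha, beta);   // serial below depth 2

    std::atomic<bool> cutoff{false};
    std::atomic<int>  best{alpha};
    std::vector<std::thread> helpers;

    for (const Move& m : generateMoves(pos)) {
        if (cutoff)
            break;

        auto job = [&, m]() {
            if (cutoff)
                return;
            int score = -parallelSearch(makeMove(pos, m), depth - 1,
                                        -beta, -best, freeThreads);
            if (score > best)
                best = score;                     // racy max; sketch only
            if (score >= beta)
                cutoff = true;                    // signal the siblings
        };

        // delegate to a worker thread if one is free, else search in-place
        int n = freeThreads;
        if (n > 0 && freeThreads.compare_exchange_strong(n, n - 1))
            helpers.emplace_back([&freeThreads, job] { job(); ++freeThreads; });
        else
            job();
    }

    for (std::thread& t : helpers)
        t.join();
    return best;
}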
Sven Schüle wrote:Does it resemble YBW, ABDADA, PVSplit, ...?
No idea. I woke up one night with this implementation idea. Had not done any research.
Sven Schüle wrote:As I already told you in a different thread, I would completely ignore all results for >6 logical cores on your system with 6 physical cores. And if one of those physical cores was also partially busy with other tasks during your tests, then the 6-core result is not very interesting either.
Yes, I just had not gotten around to altering the script. Tonight I'll see if I can boot up another old 6-core system (AMD), which is idle apart from the chess program, for cleaner testing. Also without that pesky hyperthreading.
Sven

Re: parallelizing & move reduction

Post by Sven »

flok wrote:
Sven Schüle wrote:As to your parallel search, it appears to work differently from most other SMP engines. [...] Can you explain how your parallel search works, in very few sentences?
Sure. It generates the list of moves to search. Then, for each move, it checks whether there are any threads left to do the calculation work. If so, that move is searched in a separate thread; if not, search() is called from the current thread. This happens in each thread recursively while depth >= 2.
Make sure that you do not delegate searching a move to any available thread until you are done searching the first move and that first move did not cause a cutoff. Otherwise you will search a whole lot of garbage, because you frequently let other threads search irrelevant sibling moves instead of letting them cooperate in searching the first move faster in parallel and thus getting the cutoff sooner. If you are not doing this already, it could be an explanation for your heavily increased node counts with >1 core.
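What Sven describes is essentially the Young Brothers Wait condition. A sketch, reusing the hypothetical names from the sketch above (handing the remaining moves to helper threads is elided and shown serially for brevity):

Code:

#include <cstddef>

// Sketch of the advice above (the Young Brothers Wait condition),
// reusing the hypothetical names from the previous sketch; handing the
// remaining moves to helper threads is elided and shown serially.
int ybwSearch(const Position& pos, int depth, int alpha, int beta,
              std::atomic<int>& freeThreads)
{
    std::vector<Move> moves = generateMoves(pos);
    if (depth < 2 || moves.empty())
        return search(pos, depth, alpha, beta);

    // 1) ALWAYS search the first (best-ordered) move in this thread
    int best = -ybwSearch(makeMove(pos, moves[0]), depth - 1,
                          -beta, -alpha, freeThreads);
    if (best >= beta)
        return best;          // cutoff: no sibling was ever searched
    if (best > alpha)
        alpha = best;

    // 2) only now may the remaining "younger brothers" be delegated
    //    to free helper threads (spawn-or-search-in-place as before)
    for (std::size_t i = 1; i < moves.size(); ++i) {
        int score = -ybwSearch(makeMove(pos, moves[i]), depth - 1,
                               -beta, -alpha, freeThreads);
        if (score > best)
            best = score;
        if (best >= beta)
            break;
        if (best > alpha)
            alpha = best;
    }
    return best;
}

The point is that a cutoff now costs at most one subtree, and helper threads only ever receive moves that genuinely have to be searched; that is what keeps the parallel node count close to the serial one.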

Sven
flok

Re: parallelizing & move reduction

Post by flok »

That indeed helps:

Code:

7 1 8 618106 48.878 12645.893858177504 1986 50383 17696 E7-E6 => 0
7 2 8 763812 30.69 24887.976539589443 2335 63398 19930 E7-E6 => 0
7 3 8 837934 26.552 31558.225369087075 2280 68968 22507 E7-E6 => 0
7 4 8 951210 22.109 43023.65552489936 2623 79130 24551 E7-E6 => 0
(see the 4th column: the node counts)
Not perfect yet.
Sven

Re: parallelizing & move reduction

Post by Sven »

flok wrote:That indeed helps. [...] (see the 4th column: the node counts) Not perfect yet.
Nice improvement; it seems you are now moving in the right direction!