Alayan wrote: ↑
Wed Jan 01, 2020 2:34 pm
Modern engines prune very very aggressively, and leave thus plenty of holes to fill by searching somewhat wider with LazySMP.
That's an interesting take for sure.
If randomized move ordering truly benefits an engine (even under a LazySMP-like algorithm), that suggests engines are pruning too aggressively these days, and a more methodical search (or, in the case of RMO LazySMP, a deliberately less methodical one) could gain Elo.
Another note which confuses me greatly: YBW as written in the Chess Programming Wiki seems suboptimal to me. It misses opportunities for parallelism while occasionally parallelizing wasteful work. In particular: ALL nodes can be parallelized fully, while CUT nodes should
visit their children strictly sequentially (not "sequential for the 1st child, parallel for the 2nd and later children").
After all, if the 2nd child's "information" causes a beta-cutoff that would have come from child #3, #4, or #5 (etc.), then all of the work on #3, #4, and #5 was in vain. That could have been a huge amount of wasted work. In practice, searching the children sequentially doesn't mean losing parallelism: the first child of any CUT node is an ALL node, so the grandchildren under the 1st child can always be searched in parallel.
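To make the rule concrete, here is a small illustrative sketch (my own, not from any engine): it uses the standard node-type prediction (the first child of a PV node is PV, later children are CUT; children of CUT are ALL; children of ALL are CUT) and reports which sibling groups would be eligible split points under the policy above. The tree representation and function names are made up for illustration.

```python
# Node-type prediction and split-point eligibility under the rule:
#   - ALL node: all children may be searched in parallel
#   - CUT node: children are searched strictly sequentially, but each
#     child is an ALL node, so parallelism moves one ply down.

PV, CUT, ALL = "PV", "CUT", "ALL"

def child_type(node_type, child_index):
    # Standard prediction: 1st child of PV is PV, later children are CUT;
    # every child of a CUT node is ALL; every child of an ALL node is CUT.
    if node_type == PV:
        return PV if child_index == 0 else CUT
    if node_type == CUT:
        return ALL
    return CUT  # node_type == ALL

def split_points(tree, node_type=PV, path=()):
    """tree: nested lists, leaves are ints. Returns paths of nodes whose
    children may be split (searched in parallel) under the rule above."""
    if not isinstance(tree, list):
        return []
    points = [path] if node_type == ALL else []
    for i, child in enumerate(tree):
        points += split_points(child, child_type(node_type, i), path + (i,))
    return points

# Toy tree: the root is PV; its 2nd child is a CUT node, so that CUT
# node's children are ALL nodes and become the split points.
tree = [[1, 2], [[3, 4], [5, 6]]]
print(split_points(tree))  # [(1, 0), (1, 1)]
```

Note how the CUT node at path `(1,)` never appears as a split point, yet both of its children do: the parallelism lost at the CUT node reappears one ply deeper, exactly as argued above.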
Other search strategies I see in Texel and some other engines, such as DTS, "modified YBW", or "modified ABDADA", do seem to analyze the search tree more carefully and choose split points and sources of parallelism more deliberately.
Another note: I think communication between threads can be optimized, lowering overhead costs. Under an AlphaBeta framework:
1. Younger brothers never have to send a message to their older brother.
2. When it completes, the older brother broadcasts a single message to its younger brothers: its centipawn evaluation.
3. Only active threads need to get the message immediately. A blocked thread could receive the message "later". As long as the message is processed before the blocked threads run, you're efficient.
#2 limits the number of broadcasts to the number of sibling groups covered by younger brothers. #3 means our broadcast mechanism only needs to scale to the number of cores in the system. Surely there is an algorithm that can provide #2 and #3 with low overhead?
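One candidate shape for such a mechanism, sketched below purely as an illustration (the class and names are hypothetical, not from any engine): a one-shot, single-writer result slot per sibling group. The older brother publishes its score exactly once (#2); active younger brothers poll the slot cheaply between nodes, while parked threads simply read it whenever they next run (#3), so cost scales with core count rather than tree size.

```python
# Hypothetical sketch: one-shot broadcast slot for the "older brother"
# result. Single writer, many readers; no per-node messaging.

import threading

class BrotherResult:
    """Single-writer, many-reader broadcast of one centipawn score."""
    def __init__(self):
        self._done = threading.Event()
        self._score = None

    def publish(self, score_cp):
        # Called once, by the older brother, when its subtree completes.
        self._score = score_cp
        self._done.set()  # score is written before the flag is raised

    def poll(self):
        # Cheap non-blocking check for active threads (e.g. at node entry).
        return self._score if self._done.is_set() else None

    def wait(self):
        # A parked thread processes the message before resuming work.
        self._done.wait()
        return self._score

slot = BrotherResult()
results = []

def younger_brother():
    # Modelled as a parked thread: it handles the broadcast on wake-up.
    results.append(slot.wait())

t = threading.Thread(target=younger_brother)
t.start()
slot.publish(+35)  # older brother finished: broadcast once
t.join()
print(results)  # [35]
```

In a real engine the `poll()` path would be the common case (a flag check at node entry costs almost nothing), and `wait()` would only matter for threads blocked at a split point, which matches point #3 above.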