Significant modification of Komodo SMP implementation

bob · Post by **bob** » Thu Jul 02, 2015 7:26 pm

Laskos wrote:
bob wrote:This ran overnight and here's the final numbers after 30K games:
Code: Select all
   1 Crafty-25.0   2603    1    1 30016   51%  2597   48% 
   2 Crafty-25.0x  2597    1    1 30016   49%  2603   48% 
So the current version appears to unintentionally widen just a bit, but since move ordering is never perfect, search overhead is a part of the game at present, although it has a remarkably small effect, although only measured at depth=10.

Let me also add this is a sub-optimal tuning. I am letting Crafty split everywhere in the search, with the usual min split depth set to zero, so the real search will widen less. A quick test suggested a 2 Elo change, 2602 vs 2598.

I ran another test as this represents something someone might do in a parallel search. In Crafty, when I split, a fail high at a split point instantly terminates all threads helping at that point or deeper in the same sub-tree (split here and results farther down one of the branches from this point). I disabled this, which means that every split has to complete normally, whether there is a fail high or not. This definitely widens the tree by a slight amount. But the Elo effect is almost nil. In a real search it would make a difference, since waiting on everything to finish after a fail high simply wastes time and slows the search down. But here, time doesn't matter (fixed depth search limit).

Just for fun, I repeated with null-move disabled. Since this is still to a fixed depth, I thought it would be interesting to see what effect this might have as this will definitely widen the tree with no apparent cost thanks to sd=10... This was a bit more interesting in that it resulted in a 2622-2578 (+44 Elo!) change. Of course we know null-move makes mistakes, but to compensate for those mistakes we get a significantly deeper search with more than compensates. Previous tests with Crafty showed removing null-move was an 80 Elo loss, which means that overall it is a 120 Elo gain due to extra depth and a 40 elo loss due to zugzwang/reduced depth errors.

Previous tests also showed that removing LMR by itself was about the same Elo loss, which means that in the above testing, it would likely produce about the same result as null-move removal.

It seems there are roughly four ways to "widen" realistically.

(1) search overhead caused by choosing poor split points and/or failing to stop/recover quickly when an unwanted fail-high occurs at a split point.

(2) back off on null-move reductions which will increase the size of the tree.

(3) back off on LMR reductions ditto.

(4) limit forward pruning (futility/LMP/etc) by increasing the margins, which again will make the tree larger and the depth lower.

I'm calling this test done (note that removing null-move greatly increases the length of the test games since sd=10 now takes a lot longer for one opponent.)

Next step is to re-run the parallel search tests. I don't expect major changes, but the results will tell the tale...
Would be curious now to see the speedup numbers with the new Crafty 25. I guess time-to-depth (TTD) speedup will go down a bit. Effective speedup in the narrowing Crafty 25 is maybe 5% smaller than TTD speedup, while in the corrected, a little bit widening Crafty 25, effective speedup is maybe 5% higher than TTD speedup. All in all, the corrected Crafty will probably gain on multicore several ELO points compared to the buggy one.

I will rerun those soon. But I am still a bit puzzled by some of the marked node count reductions, so I am "fine-toothing" it to make sure there are no oddities that only tend to show up on extreme architectures like the 20 core box. I started a test run last night but mother nature aborted it with a huge electrical storm. Will try again tonight just to get a "benchmark" with the current code...

bob · Post by **bob** » Fri Jul 03, 2015 8:03 am

I found what one might call a bug, or a design flaw. Very subtle. Moves are "LMR'ed" based on the order they are searched. But with a parallel search, order becomes less relevant. I have been chasing one issue that I had not expected, namely splitting at the root. That can shift the order of the moves a lot, which is not what I had intended. So I have spent two days working on a solution to guarantee that moves at a ply are at least given the same order "number" so that they will be reduced by the same amount whether at a split ply or a normal ply.

There are still quirks that can't be solved. For example, I "reset" the move counter when I start a search. After a fail high, it gets reset since that is now a new best move and it needs to be searched with no reductions and such. The next move searched will be move 2, and so forth. But with a serial search, the moves are searched in the order given by the root move list. With a parallel search, I have seen something like this...

search move 1 serially, splitting within the tree but not at ply=1 yet. After that finishes, split at ply=1 if the split rules don't have other moves that should be searched one at a time serially. Now the ply one moves get sucked away in big groups, and some of 'em get dismissed almost instantly. Then a move fails high at the root. All the moves that have been completed prior to the fail high won't be re-searched since they were finished. But some were probably in progress (particularly when there are multiple best move changes in the serial search). Now when the fail high move is searched, with the move counter reset to 1, the remaining moves are searched with 2, 3, ..., and they get reduced as the matrix says. But some of those moves that the serial search would normally search after the fail high have already been done. That shifts the rest of the moves up, and now one of those will fail high due to a lesser reduction, which will produce a different best move at the root. I don't see a viable solution for this except to not split at the root, and splitting at the root is too big a performance gain to get rid of.

So some inconsistencies are there.

I am going to try again to run 4 20cpu runs tonight, followed by a 1 cpu run. The 4 20cpu runs should end tonight, but the 1 cpu run will take most of the day. If we have no more storms, and I didn't break something I have not noticed, I should be able to at least provide the 1 vs 20cpu data tomorrow. Earlier I had a stupid bug that was making the 8 core program search MUCH larger trees, so that the 8 vs 1 test was showing a +100 Elo gain for the 8 core test, with both searching to 10 plies fixed. Thought you might enjoy that one Kai, as that was "massive widening."

It now tests as about 8 Elo difference after 3,000 games, favoring the 8 core version. So there is some useful search overhead, but not much... Most of it is totally wasted since it is at a ply where a fail high occurs and the other cores searching there are really doing absolutely nothing useful at all.

More results tomorrow, if something doesn't blow up and I shoot myself. This has been a MAJOR debugging session these last two days. And I mean MAJOR. I hope no one runs a diff on 24.1 search vs 25.0.

You will basically get a listing that contains almost everything from both.

Bed time here. Be a few minutes however as it is taking longer than expected to remove the duct tape from my head (a necessary precaution when debugging complex parallel algorithms, to prevent one's head from exploding.)

While removing the duct tape, I came up with a clever/simple solution for the ply-1 move sequence counting. It now is consistent between serial and parallel versions completely with respect to LMR at the root...

later....

Joerg Oster · Post by **Joerg Oster** » Fri Jul 03, 2015 10:51 am

bob wrote:I found what one might call a bug, or a design flaw. Very subtle. Moves are "LMR'ed" based on the order they are searched. But with a parallel search, order becomes less relevant. I have been chasing one issue that I had not expected, namely splitting at the root. That can shift the order of the moves a lot, which is not what I had intended. So I have spent two days working on a solution to guarantee that moves at a ply are at least given the same order "number" so that they will be reduced by the same amount whether at a split ply or a normal ply.

There are still quirks that can't be solved. For example, I "reset" the move counter when I start a search. After a fail high, it gets reset since that is now a new best move and it needs to be searched with no reductions and such. The next move searched will be move 2, and so forth. But with a serial search, the moves are searched in the order given by the root move list. With a parallel search, I have seen something like this...

search move 1 serially, splitting within the tree but not at ply=1 yet. After that finishes, split at ply=1 if the split rules don't have other moves that should be searched one at a time serially. Now the ply one moves get sucked away in big groups, and some of 'em get dismissed almost instantly. Then a move fails high at the root. All the moves that have been completed prior to the fail high won't be re-searched since they were finished. But some were probably in progress (particularly when there are multiple best move changes in the serial search). Now when the fail high move is searched, with the move counter reset to 1, the remaining moves are searched with 2, 3, ..., and they get reduced as the matrix says. But some of those moves that the serial search would normally search after the fail high have already been done. That shifts the rest of the moves up, and now one of those will fail high due to a lesser reduction, which will produce a different best move at the root. I don't see a viable solution for this except to not split at the root, and splitting at the root is too big a performance gain to get rid of.

So some inconsistencies are there.

I am going to try again to run 4 20cpu runs tonight, followed by a 1 cpu run. The 4 20cpu runs should end tonight, but the 1 cpu run will take most of the day. If we have no more storms, and I didn't break something I have not noticed, I should be able to at least provide the 1 vs 20cpu data tomorrow. Earlier I had a stupid bug that was making the 8 core program search MUCH larger trees, so that the 8 vs 1 test was showing a +100 Elo gain for the 8 core test, with both searching to 10 plies fixed. Thought you might enjoy that one Kai, as that was "massive widening." It now tests as about 8 Elo difference after 3,000 games, favoring the 8 core version. So there is some useful search overhead, but not much... Most of it is totally wasted since it is at a ply where a fail high occurs and the other cores searching there are really doing absolutely nothing useful at all.

More results tomorrow, if something doesn't blow up and I shoot myself. This has been a MAJOR debugging session these last two days. And I mean MAJOR. I hope no one runs a diff on 24.1 search vs 25.0. You will basically get a listing that contains almost everything from both.

Bed time here. Be a few minutes however as it is taking longer than expected to remove the duct tape from my head (a necessary precaution when debugging complex parallel algorithms, to prevent one's head from exploding.)

While removing the duct tape, I came up with a clever/simple solution for the ply-1 move sequence counting. It now is consistent between serial and parallel versions completely with respect to LMR at the root...

later....

Very interesting.
Looking forward to your final results.

Significant modification of Komodo SMP implementation

Re: final results

Re: final results

Re: final results