supersharp77 wrote:Thx Dan........are the tactical issues straightened out with the new edition? if so should be significantly stronger indeed! AR
Since you ask, an interesting bug. I thought the problem was in the parallel search and went over it VERY carefully. I found one almost impossible-to-occur race condition, and fixed it but the bug persisted. The problem with the bug was that it took about an hour's worth of running on 20 cores to cause a mysterious segfault. I kept adding debugging info and finally ran the issue down to "capturing a king" in the basic search. Could not figure out how this was possible. Here's the story from the beginning.
Years ago I played with an 8-byte hash entry, but was concerned about the excessive collisions. Of course the paper I wrote with Cozzie showed that they were really harmless, but still. And I had done that test prior to the paper. A couple of weeks ago I decided to visit that again. Easy enough. And it looked solid playing normal games. But when I started the extreme SMP tests it began to crash. A test run with 20 cores takes 2 hours. On average I would see one crash out of 4 test runs. VERY rare. So back to where I was above. I discovered the problem was capturing a king, which is not supposed to happen. The search then stepped forward 2 plies and generated moves, and the move generator was not particularly happy with no king. So it produced a move that caused the board to become corrupted.
But now I had a candidate, so how was the king getting captured? Inside the main search loop I test for being in check after making a move, to avoid searching illegal moves, so that seemed to be impossible. Debugging continued.
I finally tracked it down to what appeared to be an illegal move from the hash table. But that move is tested for legality in NextMove(). Or is it? I looked carefully, and decided "yes". So something is corrupting some other data structure that then mimics this behavior.
More debugging. I finally decided to look at one of these moves where the king was captured, so I added code to dump most everything when that happened. I found that at ply=N white checked black, and at ply=N+1 black played a move that did not address the check, leaving the king capture to ply=N+2. I am thinking "how?" But I decided to look at the move and sure enough it came from the hash table. So I figured this must be the new hash stuff, and then I thought further: "OK, when I take the move from the hash table, I do a basic legality check, but this only checks for the right piece on the from/to squares." Which this bogus move met. So I believed I was on the right track. But then I remembered, "After I make the move in SearchMoves(), I immediately check to see if the side on move is still in check after making the move." Or did I? Looking at the code, the answer was "yes".
And then it hit me. I only check for legality if I am NOT in check, because if I am in check I only generate legal moves and don't need the wasted test. And that showed me the error. I needed a position where I was in check, and got a real hash collision that gave me a move that looked legal (from/to squares, moving and captured pieces, etc.) but was in fact illegal. If that happens, I play the hash table move with a half-assed legality check that passes with flying colors here, and promptly crash and burn.
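For anyone who wants to see the hole in miniature, here is a toy Python sketch. It is not Crafty's code: the board representation, the rook-only attack test, and every function name are made up for illustration. It only models the control flow described above, a cheap hash-move check followed by a post-move check test that is skipped when the side to move is already in check. The "fixed" variant is just one possible way to close the hole and is an assumption, not what was actually changed.

```python
# Toy model: squares are (file, rank) tuples, pieces are two-character
# strings such as "wR" (white rook) or "bK" (black king). Only occupied
# squares appear in the board dict. All names here are hypothetical.

def cheap_hash_move_check(board, move):
    """Hash-move sanity test: right piece on the from-square, expected
    contents on the to-square. It says nothing about check."""
    frm, to, piece, captured = move
    return board.get(frm) == piece and board.get(to) == captured

def rook_attacks(board, rook_sq, target_sq):
    """True if a rook on rook_sq attacks target_sq with no blockers."""
    (rf, rr), (tf, tr) = rook_sq, target_sq
    if rook_sq == target_sq or (rf != tf and rr != tr):
        return False
    df = (tf > rf) - (tf < rf)  # step direction along the file axis
    dr = (tr > rr) - (tr < rr)  # step direction along the rank axis
    f, r = rf + df, rr + dr
    while (f, r) != target_sq:
        if (f, r) in board:     # a piece stands in the way
            return False
        f, r = f + df, r + dr
    return True

def black_in_check(board):
    """Only white rooks are modelled as attackers in this toy."""
    king = next(sq for sq, p in board.items() if p == "bK")
    return any(p == "wR" and rook_attacks(board, sq, king)
               for sq, p in board.items())

def make_move(board, move):
    """Return a new board with the move played (captures overwrite)."""
    frm, to, piece, _ = move
    new = dict(board)
    del new[frm]
    new[to] = piece
    return new

def accepts_buggy(board, move):
    """Mirrors the flaw: the post-move check test is skipped whenever
    the side to move started out in check."""
    if not cheap_hash_move_check(board, move):
        return False
    if black_in_check(board):
        return True  # trusts that every move came from a legal evasion generator
    return not black_in_check(make_move(board, move))

def accepts_fixed(board, move):
    """One possible fix (an assumption): always run the post-move check
    test on a move that came from the hash table."""
    if not cheap_hash_move_check(board, move):
        return False
    return not black_in_check(make_move(board, move))

# White rook on e1 checks the black king on e8. A colliding hash entry
# supplies the unrelated pawn push a7-a6, which ignores the check.
board = {(4, 0): "wR", (4, 7): "bK", (0, 6): "bP"}
bogus = ((0, 6), (0, 5), "bP", None)

print("buggy accepts bogus move:", accepts_buggy(board, bogus))
print("fixed accepts bogus move:", accepts_fixed(board, bogus))
```

The bogus pawn push has the right piece on the right squares, so the cheap test is satisfied, and because black is already in check nothing else ever looks at it.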
I removed the 8-byte hash entry stuff, and moved to the Cray Blitz 12-byte hash entry (Crafty was using a 16-byte entry, which was overkill). There is still a small hole: the longer hash signature stored in 12 bytes greatly reduces the potential for collisions, but it does NOT reduce it to zero.
Bottom line, it was a hashing algorithm change; the parallel search has been working fine all along. But a solid week+ down the drain chasing this silly bug.
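To put rough numbers on "greatly reduces but not to zero", here is a back-of-the-envelope Python sketch. The signature widths and the probe count are illustrative assumptions, not the actual layout of the 8-, 12-, or 16-byte entries, and the model is the simplest one available: every probe is compared against an unrelated entry whose stored signature bits are uniformly random.

```python
def expected_false_matches(probes: int, signature_bits: int) -> float:
    """Expected number of probes that match an unrelated entry by chance,
    assuming each comparison succeeds with probability 2**-signature_bits."""
    return probes / 2 ** signature_bits

# Hypothetical workload: one trillion hash probes over a long SMP run.
probes = 10 ** 12

# Hypothetical stored signature widths, in bits.
for bits in (32, 48, 64):
    print(f"{bits}-bit signature: "
          f"{expected_false_matches(probes, bits):.3g} expected false matches")
```

Under this model each extra signature bit halves the expected number of false matches, so the count shrinks quickly with a wider signature but never actually reaches zero.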