Resolving once in a trillion crashes

petero2 · Post by **petero2** » Thu Jan 20, 2022 10:00 am

I am a bit surprised that the use of core dumps has not been mentioned yet. Basically:

1. sudo sysctl kernel.core_pattern=core
2. ulimit -c unlimited
3. Run your test and wait for a crash
4. gdb -c core engine_binary
5. bt

That would give you a stack trace corresponding to the crash. You would have to use a non-stripped binary to get useful output. If the stack trace is not enough to figure out what the problem is, you can (at least if you use gcc, I have not tested clang) recompile the engine binary and add the "-g" compiler flag, but otherwise keep all flags the same. This will produce a binary that is compatible with the existing core file, so you can run "gdb -c core engine_binary_with_debug_info" and "bt" to get a stack trace containing also line numbers, and hopefully some variable/parameter info, although some of them may have been optimized out so the debugger cannot see their values.

A stack trace would not always be enough to figure out what the problem is, but I think it is at least a good way to start debugging the problem.
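Step 2 above can also be done from inside the engine at startup, which helps when the engine is launched by a GUI or test harness whose shell limits you do not control. The following is a minimal sketch assuming a POSIX system; the function name `enable_core_dumps` is hypothetical, and `kernel.core_pattern` still has to be configured separately.

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the current hard limit.
   Returns 0 on success, -1 on failure (errno is set by the failing call).
   Hypothetical helper, not taken from any engine discussed here. */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    /* An unprivileged process may raise its soft limit up to the hard limit. */
    rl.rlim_cur = rl.rlim_max;

    return setrlimit(RLIMIT_CORE, &rl);
}
```

This only matches "ulimit -c unlimited" if the hard limit is itself unlimited; if the hard limit is 0, no core file will be written either way.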

AndrewGrant · Post by **AndrewGrant** » Thu Jan 20, 2022 10:12 am

petero2 wrote: ↑Thu Jan 20, 2022 10:00 am I am a bit surprised that the use of core dumps has not been mentioned yet. Basically:

1. sudo sysctl kernel.core_pattern=core
2. ulimit -c unlimited
3. Run your test and wait for a crash
4. gdb -c core engine_binary
5. bt

That would give you a stack trace corresponding to the crash. You would have to use a non-stripped binary to get useful output. If the stack trace is not enough to figure out what the problem is, you can (at least if you use gcc, I have not tested clang) recompile the engine binary and add the "-g" compiler flag, but otherwise keep all flags the same. This will produce a binary that is compatible with the existing core file, so you can run "gdb -c core engine_binary_with_debug_info" and "bt" to get a stack trace containing also line numbers, and hopefully some variable/parameter info, although some of them may have been optimized out so the debugger cannot see their values.

A stack trace would not always be enough to figure out what the problem is, but I think it is at least a good way to start debugging the problem.

Well, that would have been just as sufficient in my case as what I did; the problem is the difficulty of actually getting a crash.

For Daniel, looks like maybe the same problem?

petero2 · Post by **petero2** » Thu Jan 20, 2022 10:35 am

AndrewGrant wrote: ↑Thu Jan 20, 2022 10:12 am
petero2 wrote: ↑Thu Jan 20, 2022 10:00 am I am a bit surprised that the use of core dumps has not been mentioned yet. Basically:
...
Well that would have been just as sufficient in my case as what I did; The problem is the difficulty of actually getting a crash.

But you wrote in "part 1" that "there was exactly one crash". If core dumps had been enabled then, you could have skipped the second half of "part 1" and all of "part 2", including the "So I launch hundreds of thousands of games on my own machine at various time controls", which seems quite time consuming.

The power of core dumps is that it makes it possible to debug non-repeatable crashes.

jstanback · Post by **jstanback** » Thu Jan 20, 2022 4:53 pm

petero2 wrote: ↑Thu Jan 20, 2022 10:00 am I am a bit surprised that the use of core dumps has not been mentioned yet. Basically:

1. sudo sysctl kernel.core_pattern=core
2. ulimit -c unlimited
3. Run your test and wait for a crash
4. gdb -c core engine_binary
5. bt

That would give you a stack trace corresponding to the crash. You would have to use a non-stripped binary to get useful output. If the stack trace is not enough to figure out what the problem is, you can (at least if you use gcc, I have not tested clang) recompile the engine binary and add the "-g" compiler flag, but otherwise keep all flags the same. This will produce a binary that is compatible with the existing core file, so you can run "gdb -c core engine_binary_with_debug_info" and "bt" to get a stack trace containing also line numbers, and hopefully some variable/parameter info, although some of them may have been optimized out so the debugger cannot see their values.

A stack trace would not always be enough to figure out what the problem is, but I think it is at least a good way to start debugging the problem.

The last time Wasp crashed at TCEC, Kan sent me a link to a core dump. I could never re-create the crash on my machines. I hadn't used a core dump in ages so it took a few minutes to find out how to use it with gdb. I was quite amazed that after only a few minutes of examining the values of variables I discovered that there was a white pawn on f8. The move f7f8 came from the hash table and sure enough the routine I use to verify that moves from the hash table are pseudo-legal did not check whether a promotion bit was set for a pawn moving to the 8th rank. I was surprised that this had only caused a crash once in probably a trillion or more positions searched.

John
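The missing check described above could look roughly like the sketch below. The move encoding, the square numbering (a1 = 0, h8 = 63), and the function name are all assumptions made for illustration and are not taken from Wasp. It covers only the promotion-flag part of a pseudo-legality test.

```c
#include <stdbool.h>

/* Hypothetical move representation, for illustration only. */
typedef struct {
    int  from;     /* 0..63, a1 = 0, h8 = 63 */
    int  to;       /* 0..63 */
    int  promo;    /* 0 = no promotion, otherwise a piece code */
    bool is_pawn;  /* true if the piece on 'from' is a pawn */
} Move;

/* 0 = first rank, 7 = eighth rank */
static int rank_of(int sq) { return sq >> 3; }

/* A pawn reaching the first or eighth rank must carry a promotion piece,
   and no other move may carry one. */
bool promotion_flag_consistent(Move m)
{
    bool to_last_rank = rank_of(m.to) == 0 || rank_of(m.to) == 7;
    bool must_promote = m.is_pawn && to_last_rank;

    return must_promote == (m.promo != 0);
}
```

With this encoding, a hash move f7f8 (from 53, to 61) by a pawn without a promotion piece would be rejected instead of leaving a white pawn on f8.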

Ras · Post by **Ras** » Fri Jan 21, 2022 4:37 pm

AndrewGrant wrote: ↑Tue Jan 18, 2022 8:58 am This should essentially be a blog post or something, but I'm going to write it here

Thanks - it's always interesting how people attack difficult bugs!

So naturally if we try to update God's parents, we run into issues. Memory access obviously.

That's the part I don't understand. Why didn't the root position have valid data? Wouldn't that be the first thing to ensure before starting the main search? Did you just forget to implement that kind of sentinel? Or is there some way that it can invalidate itself during and because of search?

jstanback wrote: ↑Thu Jan 20, 2022 4:53 pm the routine I use to verify that moves from the hash table are pseudo-legal did not check whether a promotion bit was set for a pawn moving to the 8th rank.

Nice idea, I'll also add that one!

As for my own story, the worst bug I had was occasional crashes in the mock-up of the microcontroller version. I used GDB to get a stack trace, and the crash happened on dereferencing a pointer variable, but that pointer was never written to after initialisation! So I put a write watchpoint on that variable and tested again. It turned out that the pointer was being overwritten by a hash table write.

The hash tables were dimensioned as a power of two, using the low bits of the hash value as the index, but the code may also try the two subsequent entries if the first one is not a good candidate for overwriting. So whenever the index landed at the very end of the table, that was an out-of-bounds write that garbled whatever happened to be next in memory, in this case an unrelated pointer. Enlarging the tables a bit solved the problem. Also nasty: this affected only the embedded version and its mock-up because they allocate the memory statically. The PC version didn't crash because it allocated on the heap, and while the same bug had always been there, inherited from NG-Play itself, it didn't seem to have consequences.
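To make the overrun concrete, here is a small sketch of the indexing. The table size, the macro names, and the function names are hypothetical and not taken from the engine's source. It shows the unchecked index, a wrapped variant, and the padded allocation size that corresponds to "enlarging the tables a bit".

```c
#include <stddef.h>
#include <stdint.h>

#define TT_BITS   4                        /* tiny table, for illustration */
#define TT_SIZE   ((size_t)1 << TT_BITS)   /* power of two: 16 slots */
#define TT_PROBES 3                        /* first slot plus two subsequent ones */

/* Padded allocation size: leaves room for the extra probes past the last slot. */
#define TT_ALLOC  (TT_SIZE + TT_PROBES - 1)

/* Unchecked index: for the last two base slots, base + i exceeds TT_SIZE - 1. */
size_t probe_slot_unchecked(uint64_t hash, size_t i)
{
    return (size_t)(hash & (TT_SIZE - 1)) + i;
}

/* Alternative: wrap around so every probe stays inside the table. */
size_t probe_slot_wrapped(uint64_t hash, size_t i)
{
    return probe_slot_unchecked(hash, i) & (TT_SIZE - 1);
}
```

Either padding the array to `TT_ALLOC` entries or wrapping the index keeps all probes in bounds; padding avoids the extra mask on every probe at the cost of a couple of entries.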

Another difficult bug was that the engine would sometimes, in slightly worse positions, throw away the queen, thinking the result would be a draw, and then simply lose. I solved that by staring at the code. The draw score was the hint, because that happens only with insufficient material (but this was in the middlegame), repetition (but there was no forced line anywhere), or stalemate. Stalemate is detected when the executed-move counter remains zero and the side to move is not in check. I checked in which cases the move counter would not be incremented, and that was with futility pruning, so that totally bad moves won't count against LMR. Basically, the examined position was so bad that all moves were pruned away. The solution was to modify the mate/stalemate detection to also account for legal but pruned moves.
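The corrected detection can be sketched as follows. The enum, the counters, and both function names are hypothetical; the real search code surely looks different. The buggy variant is included only to show the difference.

```c
#include <stdbool.h>

typedef enum { NODE_HAS_MOVES, NODE_CHECKMATE, NODE_STALEMATE } NodeStatus;

/* Buggy variant: only moves that were actually executed are counted, so a
   node where every legal move was futility-pruned looks like stalemate. */
NodeStatus classify_node_buggy(int executed_moves, bool in_check)
{
    if (executed_moves > 0)
        return NODE_HAS_MOVES;
    return in_check ? NODE_CHECKMATE : NODE_STALEMATE;
}

/* Fixed variant: legal moves that were pruned still prove that the side to
   move has a move, so the node is neither mate nor stalemate. */
NodeStatus classify_node(int executed_moves, int pruned_legal_moves, bool in_check)
{
    if (executed_moves + pruned_legal_moves > 0)
        return NODE_HAS_MOVES;
    return in_check ? NODE_CHECKMATE : NODE_STALEMATE;
}
```

Keeping the pruned moves in a separate counter means the executed-move counter can still be used unchanged for LMR decisions.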
