I am a bit surprised that the use of core dumps has not been mentioned yet. Basically:
1. sudo sysctl kernel.core_pattern=core
2. ulimit -c unlimited
3. Run your test and wait for a crash
4. gdb -c core engine_binary
5. bt
That gives you a stack trace for the crash. You need a non-stripped binary to get useful output. If the stack trace is not enough to figure out the problem, you can (at least with gcc; I have not tested clang) recompile the engine binary with the "-g" compiler flag added and all other flags unchanged. This produces a binary that is compatible with the existing core file, so you can run "gdb -c core engine_binary_with_debug_info" and "bt" to get a stack trace that also contains line numbers and, hopefully, some variable/parameter info, although some values may have been optimized out so the debugger cannot see them.
A stack trace would not always be enough to figure out what the problem is, but I think it is at least a good way to start debugging the problem.
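If the engine is launched by a GUI or a test harness, running "ulimit -c unlimited" in the parent shell is not always convenient. Step 2 can also be done from inside the process. The following is only a minimal sketch, assuming a Linux/POSIX system; the helper name enable_core_dumps is hypothetical and not taken from any engine. It can only raise the soft limit up to the existing hard limit, so it matches "ulimit -c unlimited" only when the hard limit is already unlimited.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard
 * limit. Returns 0 on success and -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("getrlimit");
        return -1;
    }

    lim.rlim_cur = lim.rlim_max;  /* soft limit := hard limit */

    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

Calling this once at the start of main() would be enough. Where the core file ends up is still controlled by kernel.core_pattern (step 1).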
Resolving once in a trillion crashes
Moderator: Ras
-
- Posts: 729
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
-
- Posts: 1960
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: Resolving once in a trillion crashes
petero2 wrote: ↑Thu Jan 20, 2022 10:00 am
I am a bit surprised that the use of core dumps has not been mentioned yet. [...]

Well, that would have been just as sufficient in my case as what I did; the problem is the difficulty of actually getting a crash.
For Daniel, looks like maybe the same problem?
-
- Posts: 729
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Resolving once in a trillion crashes
AndrewGrant wrote: ↑Thu Jan 20, 2022 10:12 am
Well that would have been just as sufficient in my case as what I did; The problem is the difficulty of actually getting a crash.

But you wrote in "part 1" that "there was exactly one crash". If core dumps had been enabled then, you could have skipped the second half of "part 1" and all of "part 2", including the "So I launch hundreds of thousands of games on my own machine at various time controls", which seems quite time-consuming.
The power of core dumps is that it makes it possible to debug non-repeatable crashes.
-
- Posts: 139
- Joined: Fri Jun 17, 2016 4:14 pm
- Location: Colorado, USA
- Full name: John Stanback
Re: Resolving once in a trillion crashes
petero2 wrote: ↑Thu Jan 20, 2022 10:00 am
I am a bit surprised that the use of core dumps has not been mentioned yet. [...]

The last time Wasp crashed at TCEC, Kan sent me a link to a core dump. I could never re-create the crash on my machines. I hadn't used a core dump in ages, so it took a few minutes to find out how to use it with gdb. I was quite amazed that after only a few minutes of examining the values of variables I discovered that there was a white pawn on f8. The move f7f8 came from the hash table, and sure enough, the routine I use to verify that moves from the hash table are pseudo-legal did not check whether a promotion bit was set for a pawn moving to the 8th rank. I was surprised that this had only caused a crash once in probably a trillion or more positions searched.
John
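The missing check described above can be sketched as follows. This is a minimal illustration, not Wasp's actual code: the Move layout, the square numbering (a1 = 0, h8 = 63) and the function name promotion_flag_consistent are all assumptions made for the example. A real pseudo-legality test would of course check much more than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical move representation, for illustration only. */
typedef struct {
    uint8_t from;   /* 0..63, a1 = 0, h8 = 63 */
    uint8_t to;     /* 0..63 */
    uint8_t promo;  /* NO_PROMO, or the piece promoted to */
} Move;

enum { WHITE = 0, BLACK = 1 };
enum { NO_PROMO = 0, KNIGHT, BISHOP, ROOK, QUEEN };

static int rank_of(int sq) { return sq >> 3; }  /* 0 = 1st rank, 7 = 8th */

/* A move taken from the hash table is consistent only if a promotion
 * piece is set exactly when a pawn reaches its last rank. */
bool promotion_flag_consistent(Move m, bool mover_is_pawn, int side)
{
    int  last_rank    = (side == WHITE) ? 7 : 0;
    bool reaches_last = mover_is_pawn && rank_of(m.to) == last_rank;
    bool has_promo    = (m.promo != NO_PROMO);

    return reaches_last == has_promo;
}
```

With this numbering f7 is square 53 and f8 is square 61, so a white pawn move f7f8 without a promotion piece is rejected instead of leaving a pawn on the 8th rank.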
-
- Posts: 2702
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Resolving once in a trillion crashes
AndrewGrant wrote: ↑Tue Jan 18, 2022 8:58 am
This should essentially be a blog post or something, but I'm going to write it here

Thanks - it's always interesting how people attack difficult bugs!

"So naturally if we try to update God's parents, we run into issues. Memory access obviously."

That's the part I don't understand. Why didn't the root position have valid data? Wouldn't that be the first thing to ensure before starting the main search? Did you just forget to implement that kind of sentinel? Or is there some way that it can invalidate itself during and because of search?
Nice idea, I'll also add that one!
As for my own story, the worst bug I had was occasional crashes in the mock-up of the microcontroller version. I used GDB to get a stack trace, and it crashed on dereferencing a pointer variable, but that pointer was never written to after initialisation! So I put a write watchpoint on that variable and tested again. It turned out that the pointer was overwritten by a hashtable write.
The hashtables were dimensioned as a power of two, using the low bits of the hash value as the index, but the code may also try the two subsequent entries if the first one is not good for overwriting. For an index at the very end of the table, that was an out-of-bounds write that garbled whatever happened to be next in memory, in this case an unrelated pointer. Enlarging the tables a bit solved the problem. Also nasty: this affected only the embedded version and its mock-up, because they allocate the memory statically. The PC version didn't crash because it allocated on the heap; the same bug had always been there, inherited from NG-Play itself, but it didn't seem to have consequences.
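To make the overflow concrete, here is a minimal sketch of the two usual ways to keep such probing in bounds: padding the table by the number of extra probes (the "enlarging the tables a bit" fix), or wrapping the probe index with the mask. The table size, entry layout and function names are assumptions chosen for illustration, not the CT800 code.

```c
#include <stddef.h>
#include <stdint.h>

#define TT_BITS   4u                /* tiny table, for illustration */
#define TT_SIZE   (1u << TT_BITS)   /* power of two */
#define TT_MASK   (TT_SIZE - 1u)
#define TT_PROBES 3u                /* first slot plus two subsequent ones */

typedef struct { uint64_t key; int16_t score; } TTEntry;

/* Statically allocated and padded by TT_PROBES - 1 entries, so probing
 * from the last regular index still stays inside the array. */
static TTEntry tt_padded[TT_SIZE + TT_PROBES - 1u];

size_t tt_padded_capacity(void)
{
    return sizeof tt_padded / sizeof tt_padded[0];
}

/* Option 1: linear probing into the padding area. */
size_t tt_slot_padded(uint64_t hash, unsigned probe)
{
    return (size_t)(hash & TT_MASK) + probe;
}

/* Option 2: no padding needed, the probe wraps around to the start. */
size_t tt_slot_wrapped(uint64_t hash, unsigned probe)
{
    return (size_t)((hash + probe) & TT_MASK);
}
```

Without either measure, a hash whose low bits equal TT_MASK would make probes 1 and 2 write past the end of a TT_SIZE-entry array, which is exactly the kind of write that lands on a neighbouring static variable.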
Another difficult bug was that the engine would sometimes, in slightly worse positions, throw away the queen and think the result would be a draw, when it was simply lost. I solved that via code staring. The draw score was the hint, because that happens only with insufficient material (it was in the middlegame), repetition (no forced line anywhere), or stalemate. Stalemate is when the executed-move counter remains zero and the side to move is not in check. I checked in which cases the move counter would not be incremented, and that was with futility pruning, so that totally bad moves won't count against LMR. Basically, the examined position was so bad that all moves were pruned away. The solution was to modify the mate/stalemate detection to also account for legal but pruned moves.
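The fix can be sketched by keeping two counters: one for moves actually searched (which can still drive LMR) and one for legal moves (which drives mate/stalemate detection). This is only an illustration under assumed names; MoveInfo, NodeStatus and classify_node are hypothetical, and a real search would interleave this with the move loop rather than run it over a precomputed array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-move summary, for illustration only. */
typedef struct {
    bool legal;   /* move does not leave the own king in check */
    bool pruned;  /* move was skipped, e.g. by futility pruning */
} MoveInfo;

typedef enum { NODE_HAS_MOVES, NODE_CHECKMATE, NODE_STALEMATE } NodeStatus;

/* Mate/stalemate is decided from the legal-move count, not from the
 * number of moves that were searched. */
NodeStatus classify_node(const MoveInfo *moves, size_t n, bool in_check,
                         int *searched_out)
{
    int legal = 0;
    int searched = 0;

    for (size_t i = 0; i < n; i++) {
        if (!moves[i].legal)
            continue;
        legal++;               /* counts even if the move gets pruned */
        if (moves[i].pruned)
            continue;
        searched++;            /* only these would count towards LMR */
    }

    if (searched_out != NULL)
        *searched_out = searched;

    if (legal > 0)
        return NODE_HAS_MOVES;
    return in_check ? NODE_CHECKMATE : NODE_STALEMATE;
}
```

In the buggy variant, the decision was made from the searched counter alone, so a node where every legal move was pruned looked like stalemate and got the draw score.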
Rasmus Althoff
https://www.ct800.net