Page 1 of 2

Volatile bug

Posted: Fri Jun 24, 2016 12:45 am
by hgm
Have you ever seen a thing like this? When I analyze the start position my engine now crashes reproducibly after 11 ply. But when I run it in gdb the crash is gone??? I thought gdb was a non-intervening tool, that only gets active when the process under scrutiny dies.

Code: Select all

hgm@hgm-xboard:~/haqikid$ ./haqikid
# hash-table address = 7f1dbd9de040
tellics say     HaQiKi D 1.8b
tellics say     by H.G. Muller
# 465885 of 465885 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 32611950 of 32611950 EGT bytes loaded
# last command = (null)# eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
memory 256
# peek memory
# command is 'memory 256
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
new
# peek new
# command is 'new
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=32 Ponder=0 move=0
# read command
analyze
# peek analyze
# command is 'analyze
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=64 Ponder=0 move=0
# start analysis
# eval=20
 1    124      0         76 f0e1
 2     92      0        734 f0e1 f9e8
 3     96      0       3183 f0e1 f9e8 g3g4
 4     76      1      19450 f0e1 h7e7 h2h3 f9e8
 5     96      2      31750 f0e1 f9e8 g3g4 h9g7
 6     72     11     176419 c3c4 h9g7 g3g4 f9e8 d0e1 b9c7
 7     72     22     327923 c3c4 g6g5 f0e1 h9g7 h2h6 g7f5 b0c2 f5g3 h6c6
 8     64     59     953991 f0e1 g6g5 g0e2 h9g7 h0f1 b9c7 i0f0 c6c5
 9     64    147    2463123 c0e2 g6g5 d0e1 h9g7 b0d1
10     68    316    5339229 c3c4 g6g5 b2g2 h9g7 b0c2 b9c7 a0b0 g7f5 b0b6 f9e8 g2g5
11     72    678   11557405 c3c4 f9e8 b2g2 h7e7 b0c2 h9g7 a0b0 b9c7 g2g6 i9h9 h2e2
Segmentation fault
Same thing under gdb:

Code: Select all

hgm@hgm-xboard:~/haqikid$ gdb ./haqikid
GNU gdb (GDB) 7.1-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+&#58; GNU GPL version 3 or later <http&#58;//gnu.org/licenses/gpl.html>
This is free software&#58; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see&#58;
<http&#58;//www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/hgm/haqikid/haqikid...&#40;no debugging symbols found&#41;...done.
&#40;gdb&#41; run
Starting program&#58; /home/hgm/haqikid/haqikid 
# hash-table address = 7ffff7656040
tellics say     HaQiKi D 1.8b
tellics say     by H.G. Muller
# 465885 of 465885 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 32611950 of 32611950 EGT bytes loaded
# last command = &#40;null&#41;# eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
memory 256
# peek memory
# command is 'memory 256
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
new
# peek new
# command is 'new
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=32 Ponder=0 move=0
# read command
analyze
# peek analyze
# command is 'analyze
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=64 Ponder=0 move=0
# start analysis
# eval=20
 1    124      0         76 f0e1
 2     92      0        734 f0e1 f9e8
 3     96      0       3183 f0e1 f9e8 g3g4
 4     76      1      19450 f0e1 h7e7 h2h3 f9e8
 5     96      2      31750 f0e1 f9e8 g3g4 h9g7
 6     72     12     176419 c3c4 h9g7 g3g4 f9e8 d0e1 b9c7
 7     72     22     327923 c3c4 g6g5 f0e1 h9g7 h2h6 g7f5 b0c2 f5g3 h6c6
 8     64     59     953991 f0e1 g6g5 g0e2 h9g7 h0f1 b9c7 i0f0 c6c5
 9     64    147    2463123 c0e2 g6g5 d0e1 h9g7 b0d1
10     68    316    5339229 c3c4 g6g5 b2g2 h9g7 b0c2 b9c7 a0b0 g7f5 b0b6 f9e8 g2g5
11     72    679   11557405 c3c4 f9e8 b2g2 h7e7 b0c2 h9g7 a0b0 b9c7 g2g6 i9h9 h2e2
12     64   1680   28568379 c3c4 f9e8 d0e1 g6g5 b2g2 h9g7 b0c2 b7e7 a0b0 g7f5 b0b8 f5e3

Re: Volatile bug

Posted: Fri Jun 24, 2016 2:10 am
by tttony
Weird!!

Try with http://valgrind.org/

Re: Volatile bug

Posted: Fri Jun 24, 2016 2:23 am
by bob
hgm wrote:Have you ever seen a thing like this? When I analyze the start position my engine now crashes reproducibly after 11 ply. But when I run it in gdb the crash is gone??? I thought gdb was a non-intervening tool, that only gets active when the process under scrutiny dies.

Code: Select all

hgm@hgm-xboard&#58;~/haqikid$ ./haqikid
# hash-table address = 7f1dbd9de040
tellics say     HaQiKi D 1.8b
tellics say     by H.G. Muller
# 465885 of 465885 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 32611950 of 32611950 EGT bytes loaded
# last command = &#40;null&#41;# eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
memory 256
# peek memory
# command is 'memory 256
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
new
# peek new
# command is 'new
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=32 Ponder=0 move=0
# read command
analyze
# peek analyze
# command is 'analyze
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=64 Ponder=0 move=0
# start analysis
# eval=20
 1    124      0         76 f0e1
 2     92      0        734 f0e1 f9e8
 3     96      0       3183 f0e1 f9e8 g3g4
 4     76      1      19450 f0e1 h7e7 h2h3 f9e8
 5     96      2      31750 f0e1 f9e8 g3g4 h9g7
 6     72     11     176419 c3c4 h9g7 g3g4 f9e8 d0e1 b9c7
 7     72     22     327923 c3c4 g6g5 f0e1 h9g7 h2h6 g7f5 b0c2 f5g3 h6c6
 8     64     59     953991 f0e1 g6g5 g0e2 h9g7 h0f1 b9c7 i0f0 c6c5
 9     64    147    2463123 c0e2 g6g5 d0e1 h9g7 b0d1
10     68    316    5339229 c3c4 g6g5 b2g2 h9g7 b0c2 b9c7 a0b0 g7f5 b0b6 f9e8 g2g5
11     72    678   11557405 c3c4 f9e8 b2g2 h7e7 b0c2 h9g7 a0b0 b9c7 g2g6 i9h9 h2e2
Segmentation fault
Same thing under gdb:

Code: Select all

hgm@hgm-xboard&#58;~/haqikid$ gdb ./haqikid
GNU gdb &#40;GDB&#41; 7.1-ubuntu
Copyright &#40;C&#41; 2010 Free Software Foundation, Inc.
License GPLv3+&#58; GNU GPL version 3 or later <http&#58;//gnu.org/licenses/gpl.html>
This is free software&#58; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see&#58;
<http&#58;//www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/hgm/haqikid/haqikid...&#40;no debugging symbols found&#41;...done.
&#40;gdb&#41; run
Starting program&#58; /home/hgm/haqikid/haqikid 
# hash-table address = 7ffff7656040
tellics say     HaQiKi D 1.8b
tellics say     by H.G. Muller
# 465885 of 465885 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 32611950 of 32611950 EGT bytes loaded
# last command = &#40;null&#41;# eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
memory 256
# peek memory
# command is 'memory 256
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
new
# peek new
# command is 'new
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=32 Ponder=0 move=0
# read command
analyze
# peek analyze
# command is 'analyze
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=64 Ponder=0 move=0
# start analysis
# eval=20
 1    124      0         76 f0e1
 2     92      0        734 f0e1 f9e8
 3     96      0       3183 f0e1 f9e8 g3g4
 4     76      1      19450 f0e1 h7e7 h2h3 f9e8
 5     96      2      31750 f0e1 f9e8 g3g4 h9g7
 6     72     12     176419 c3c4 h9g7 g3g4 f9e8 d0e1 b9c7
 7     72     22     327923 c3c4 g6g5 f0e1 h9g7 h2h6 g7f5 b0c2 f5g3 h6c6
 8     64     59     953991 f0e1 g6g5 g0e2 h9g7 h0f1 b9c7 i0f0 c6c5
 9     64    147    2463123 c0e2 g6g5 d0e1 h9g7 b0d1
10     68    316    5339229 c3c4 g6g5 b2g2 h9g7 b0c2 b9c7 a0b0 g7f5 b0b6 f9e8 g2g5
11     72    679   11557405 c3c4 f9e8 b2g2 h7e7 b0c2 h9g7 a0b0 b9c7 g2g6 i9h9 h2e2
12     64   1680   28568379 c3c4 f9e8 d0e1 g6g5 b2g2 h9g7 b0c2 b7e7 a0b0 g7f5 b0b8 f5e3
I don't think there is ever any way to be "completely hands off". PITA but it happens. Usually something uninitialized. Have you enabled core dumps and then used gdb to look at the dump after the stand-alone version crashes??? Sometimes this won't work either as if the stack gets blown out badly there is no way to view any sort of traceback data...

If you have not used gdb with a core file, I can explain and it is really easy.

Re: Volatile bug

Posted: Fri Jun 24, 2016 2:32 am
by michiguel
It is not that unusual. It is a Heisenbug
https://en.wikipedia.org/wiki/Heisenbug

The fact that it happens, should be telling you clues or something about the bug itself.

Miguel

Re: Volatile bug

Posted: Fri Jun 24, 2016 2:56 am
by jdart
You can also try compiling with -fsanitze=undefined (requires a recent GCC).

Could also be a timing thing if you are running multithreaded.

--Jon

Re: Volatile bug

Posted: Fri Jun 24, 2016 3:34 am
by smatovic
This is your third bug in a row, maybe you should consider to check your hardware....

afaik Ubuntu offers some memory test at boot and for the CPU prime95 could do the job.

--
Srdja

Re: Volatile bug

Posted: Fri Jun 24, 2016 6:56 am
by Evert
I think I've seen that a few times, but usually with programs that do a lot of floating point calculations.

If you enable core dumps you can debug after the code crashes, which helps.

Pre-emptively running through valgrind may also give a clue.

Re: Volatile bug

Posted: Fri Jun 24, 2016 8:33 am
by hgm
I have seen Heisenbugs before. But that was when modifying the code with printfs to debug the issue. Icould understand that, because printf uses the stack, leaving altered values there, which could actas starting value of uninitialized variables. But with gdb? Perhaps my understanding of how gdb works is wrong. I was under the impression that it absolutely recreated the conditions of the program under test, by running it as a separate process, and somehow peeking in the memory map of that process.

Anyway, the core dump helped me locate the problem. I was using a Zobrist key table in a routine that checked whether a perceived repeat (i.e. keys the same) was a real repeat, by recalculating key modifications by the moves in the loop with another assignment of Zobrist keys. But as this is Xiangqi not all pieces can access the entire board, and inadvertantly I was using an incomplete Zobrist table for a piece that could access the entire board. This caused an out-of-bound read access to that table.

This bug must have been there for years. Still no idea what activated it. I was merging two versions of the engine, one where I implemented speculative pondering, the other that did EGT probing. Both versions worked fine, but when I merged them I suddenly had this crashing problem. Calling the initialization routine of the EGT probing enabled the crash. I had been able to trace that to a single line in that initialization routine, which set the values in a table to translate 0x88 square numbers to sequential numbering. There was no out-of-bounds problem in that table. But when I left it uninitialized, the crash would go away. I still have no why changing the values in that table (which was not used at this point) would activate the dormant bug in the Zobrist access. It could be that the memory map was such that the out-of-bound Zobrist accesses happened to map into this table, making the engine rejecting some false repeats that would be accepted otherwise, affecting the search tree, so that it finally got a repeat that in verification triggered an out-of-bound access bad enough to cause a segfault.

Anyway, thanks everyone for the advice.

Re: Volatile bug

Posted: Fri Jun 24, 2016 9:16 am
by Evert
hgm wrote:I have seen Heisenbugs before. But that was when modifying the code with printfs to debug the issue. Icould understand that, because printf uses the stack, leaving altered values there, which could actas starting value of uninitialized variables. But with gdb? Perhaps my understanding of how gdb works is wrong. I was under the impression that it absolutely recreated the conditions of the program under test, by running it as a separate process, and somehow peeking in the memory map of that process.
At the very least it must catch and handle signals that are raised but not caught by the program (like SEGV). Doesn't it also need to do some modifications when you add breakpoints (insert int3)? I can imagine that it maps memory differently as well.

Re: Volatile bug

Posted: Fri Jun 24, 2016 9:56 am
by hgm
Normally the OS catches those signals (and the handler then terminates the process). I figured there would be a way to request the system to relay the an un-caught signal in a child process to yourself (just as normally it relays your own signals back to you when you catch them). But perhaps gdb works through threads rather than processes. Breakpoints are usually implemented by poking an invalid instruction in the code at that point. So definitely gdb must also have write access to the memory.