Volatile bug

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Volatile bug

Post by hgm »

Have you ever seen a thing like this? When I analyze the start position my engine now crashes reproducibly after 11 ply. But when I run it in gdb the crash is gone??? I thought gdb was a non-intervening tool, that only gets active when the process under scrutiny dies.

Code: Select all

hgm@hgm-xboard:~/haqikid$ ./haqikid
# hash-table address = 7f1dbd9de040
tellics say     HaQiKi D 1.8b
tellics say     by H.G. Muller
# 465885 of 465885 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 32611950 of 32611950 EGT bytes loaded
# last command = (null)# eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
memory 256
# peek memory
# command is 'memory 256
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
new
# peek new
# command is 'new
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=32 Ponder=0 move=0
# read command
analyze
# peek analyze
# command is 'analyze
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=64 Ponder=0 move=0
# start analysis
# eval=20
 1    124      0         76 f0e1
 2     92      0        734 f0e1 f9e8
 3     96      0       3183 f0e1 f9e8 g3g4
 4     76      1      19450 f0e1 h7e7 h2h3 f9e8
 5     96      2      31750 f0e1 f9e8 g3g4 h9g7
 6     72     11     176419 c3c4 h9g7 g3g4 f9e8 d0e1 b9c7
 7     72     22     327923 c3c4 g6g5 f0e1 h9g7 h2h6 g7f5 b0c2 f5g3 h6c6
 8     64     59     953991 f0e1 g6g5 g0e2 h9g7 h0f1 b9c7 i0f0 c6c5
 9     64    147    2463123 c0e2 g6g5 d0e1 h9g7 b0d1
10     68    316    5339229 c3c4 g6g5 b2g2 h9g7 b0c2 b9c7 a0b0 g7f5 b0b6 f9e8 g2g5
11     72    678   11557405 c3c4 f9e8 b2g2 h7e7 b0c2 h9g7 a0b0 b9c7 g2g6 i9h9 h2e2
Segmentation fault
Same thing under gdb:

Code: Select all

hgm@hgm-xboard:~/haqikid$ gdb ./haqikid
GNU gdb (GDB) 7.1-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+&#58; GNU GPL version 3 or later <http&#58;//gnu.org/licenses/gpl.html>
This is free software&#58; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see&#58;
<http&#58;//www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/hgm/haqikid/haqikid...&#40;no debugging symbols found&#41;...done.
&#40;gdb&#41; run
Starting program&#58; /home/hgm/haqikid/haqikid 
# hash-table address = 7ffff7656040
tellics say     HaQiKi D 1.8b
tellics say     by H.G. Muller
# 465885 of 465885 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 32611950 of 32611950 EGT bytes loaded
# last command = &#40;null&#41;# eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
memory 256
# peek memory
# command is 'memory 256
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
new
# peek new
# command is 'new
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=32 Ponder=0 move=0
# read command
analyze
# peek analyze
# command is 'analyze
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=64 Ponder=0 move=0
# start analysis
# eval=20
 1    124      0         76 f0e1
 2     92      0        734 f0e1 f9e8
 3     96      0       3183 f0e1 f9e8 g3g4
 4     76      1      19450 f0e1 h7e7 h2h3 f9e8
 5     96      2      31750 f0e1 f9e8 g3g4 h9g7
 6     72     12     176419 c3c4 h9g7 g3g4 f9e8 d0e1 b9c7
 7     72     22     327923 c3c4 g6g5 f0e1 h9g7 h2h6 g7f5 b0c2 f5g3 h6c6
 8     64     59     953991 f0e1 g6g5 g0e2 h9g7 h0f1 b9c7 i0f0 c6c5
 9     64    147    2463123 c0e2 g6g5 d0e1 h9g7 b0d1
10     68    316    5339229 c3c4 g6g5 b2g2 h9g7 b0c2 b9c7 a0b0 g7f5 b0b6 f9e8 g2g5
11     72    679   11557405 c3c4 f9e8 b2g2 h7e7 b0c2 h9g7 a0b0 b9c7 g2g6 i9h9 h2e2
12     64   1680   28568379 c3c4 f9e8 d0e1 g6g5 b2g2 h9g7 b0c2 b7e7 a0b0 g7f5 b0b8 f5e3
tttony
Posts: 268
Joined: Sun Apr 24, 2011 12:33 am

Re: Volatile bug

Post by tttony »

Weird!!

Try with http://valgrind.org/
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Volatile bug

Post by bob »

hgm wrote:Have you ever seen a thing like this? When I analyze the start position my engine now crashes reproducibly after 11 ply. But when I run it in gdb the crash is gone??? I thought gdb was a non-intervening tool, that only gets active when the process under scrutiny dies.

Code: Select all

hgm@hgm-xboard&#58;~/haqikid$ ./haqikid
# hash-table address = 7f1dbd9de040
tellics say     HaQiKi D 1.8b
tellics say     by H.G. Muller
# 465885 of 465885 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 32611950 of 32611950 EGT bytes loaded
# last command = &#40;null&#41;# eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
memory 256
# peek memory
# command is 'memory 256
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
new
# peek new
# command is 'new
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=32 Ponder=0 move=0
# read command
analyze
# peek analyze
# command is 'analyze
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=64 Ponder=0 move=0
# start analysis
# eval=20
 1    124      0         76 f0e1
 2     92      0        734 f0e1 f9e8
 3     96      0       3183 f0e1 f9e8 g3g4
 4     76      1      19450 f0e1 h7e7 h2h3 f9e8
 5     96      2      31750 f0e1 f9e8 g3g4 h9g7
 6     72     11     176419 c3c4 h9g7 g3g4 f9e8 d0e1 b9c7
 7     72     22     327923 c3c4 g6g5 f0e1 h9g7 h2h6 g7f5 b0c2 f5g3 h6c6
 8     64     59     953991 f0e1 g6g5 g0e2 h9g7 h0f1 b9c7 i0f0 c6c5
 9     64    147    2463123 c0e2 g6g5 d0e1 h9g7 b0d1
10     68    316    5339229 c3c4 g6g5 b2g2 h9g7 b0c2 b9c7 a0b0 g7f5 b0b6 f9e8 g2g5
11     72    678   11557405 c3c4 f9e8 b2g2 h7e7 b0c2 h9g7 a0b0 b9c7 g2g6 i9h9 h2e2
Segmentation fault
Same thing under gdb:

Code: Select all

hgm@hgm-xboard&#58;~/haqikid$ gdb ./haqikid
GNU gdb &#40;GDB&#41; 7.1-ubuntu
Copyright &#40;C&#41; 2010 Free Software Foundation, Inc.
License GPLv3+&#58; GNU GPL version 3 or later <http&#58;//gnu.org/licenses/gpl.html>
This is free software&#58; you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see&#58;
<http&#58;//www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/hgm/haqikid/haqikid...&#40;no debugging symbols found&#41;...done.
&#40;gdb&#41; run
Starting program&#58; /home/hgm/haqikid/haqikid 
# hash-table address = 7ffff7656040
tellics say     HaQiKi D 1.8b
tellics say     by H.G. Muller
# 465885 of 465885 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 21430710 of 21430710 EGT bytes loaded
# 32611950 of 32611950 EGT bytes loaded
# last command = &#40;null&#41;# eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
memory 256
# peek memory
# command is 'memory 256
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=0 Ponder=0 move=0
# read command
new
# peek new
# command is 'new
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=32 Ponder=0 move=0
# read command
analyze
# peek analyze
# command is 'analyze
'
# last command = # eval = 00000000, index = 05b105b1 Side=16 Comp=64 Ponder=0 move=0
# start analysis
# eval=20
 1    124      0         76 f0e1
 2     92      0        734 f0e1 f9e8
 3     96      0       3183 f0e1 f9e8 g3g4
 4     76      1      19450 f0e1 h7e7 h2h3 f9e8
 5     96      2      31750 f0e1 f9e8 g3g4 h9g7
 6     72     12     176419 c3c4 h9g7 g3g4 f9e8 d0e1 b9c7
 7     72     22     327923 c3c4 g6g5 f0e1 h9g7 h2h6 g7f5 b0c2 f5g3 h6c6
 8     64     59     953991 f0e1 g6g5 g0e2 h9g7 h0f1 b9c7 i0f0 c6c5
 9     64    147    2463123 c0e2 g6g5 d0e1 h9g7 b0d1
10     68    316    5339229 c3c4 g6g5 b2g2 h9g7 b0c2 b9c7 a0b0 g7f5 b0b6 f9e8 g2g5
11     72    679   11557405 c3c4 f9e8 b2g2 h7e7 b0c2 h9g7 a0b0 b9c7 g2g6 i9h9 h2e2
12     64   1680   28568379 c3c4 f9e8 d0e1 g6g5 b2g2 h9g7 b0c2 b7e7 a0b0 g7f5 b0b8 f5e3
I don't think there is ever any way to be "completely hands off". PITA but it happens. Usually something uninitialized. Have you enabled core dumps and then used gdb to look at the dump after the stand-alone version crashes??? Sometimes this won't work either as if the stack gets blown out badly there is no way to view any sort of traceback data...

If you have not used gdb with a core file, I can explain and it is really easy.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Volatile bug

Post by michiguel »

It is not that unusual. It is a Heisenbug
https://en.wikipedia.org/wiki/Heisenbug

The fact that it happens, should be telling you clues or something about the bug itself.

Miguel
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Volatile bug

Post by jdart »

You can also try compiling with -fsanitze=undefined (requires a recent GCC).

Could also be a timing thing if you are running multithreaded.

--Jon
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Volatile bug

Post by smatovic »

This is your third bug in a row, maybe you should consider to check your hardware....

afaik Ubuntu offers some memory test at boot and for the CPU prime95 could do the job.

--
Srdja
User avatar
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Volatile bug

Post by Evert »

I think I've seen that a few times, but usually with programs that do a lot of floating point calculations.

If you enable core dumps you can debug after the code crashes, which helps.

Pre-emptively running through valgrind may also give a clue.
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Volatile bug

Post by hgm »

I have seen Heisenbugs before. But that was when modifying the code with printfs to debug the issue. Icould understand that, because printf uses the stack, leaving altered values there, which could actas starting value of uninitialized variables. But with gdb? Perhaps my understanding of how gdb works is wrong. I was under the impression that it absolutely recreated the conditions of the program under test, by running it as a separate process, and somehow peeking in the memory map of that process.

Anyway, the core dump helped me locate the problem. I was using a Zobrist key table in a routine that checked whether a perceived repeat (i.e. keys the same) was a real repeat, by recalculating key modifications by the moves in the loop with another assignment of Zobrist keys. But as this is Xiangqi not all pieces can access the entire board, and inadvertantly I was using an incomplete Zobrist table for a piece that could access the entire board. This caused an out-of-bound read access to that table.

This bug must have been there for years. Still no idea what activated it. I was merging two versions of the engine, one where I implemented speculative pondering, the other that did EGT probing. Both versions worked fine, but when I merged them I suddenly had this crashing problem. Calling the initialization routine of the EGT probing enabled the crash. I had been able to trace that to a single line in that initialization routine, which set the values in a table to translate 0x88 square numbers to sequential numbering. There was no out-of-bounds problem in that table. But when I left it uninitialized, the crash would go away. I still have no why changing the values in that table (which was not used at this point) would activate the dormant bug in the Zobrist access. It could be that the memory map was such that the out-of-bound Zobrist accesses happened to map into this table, making the engine rejecting some false repeats that would be accepted otherwise, affecting the search tree, so that it finally got a repeat that in verification triggered an out-of-bound access bad enough to cause a segfault.

Anyway, thanks everyone for the advice.
User avatar
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Volatile bug

Post by Evert »

hgm wrote:I have seen Heisenbugs before. But that was when modifying the code with printfs to debug the issue. Icould understand that, because printf uses the stack, leaving altered values there, which could actas starting value of uninitialized variables. But with gdb? Perhaps my understanding of how gdb works is wrong. I was under the impression that it absolutely recreated the conditions of the program under test, by running it as a separate process, and somehow peeking in the memory map of that process.
At the very least it must catch and handle signals that are raised but not caught by the program (like SEGV). Doesn't it also need to do some modifications when you add breakpoints (insert int3)? I can imagine that it maps memory differently as well.
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Volatile bug

Post by hgm »

Normally the OS catches those signals (and the handler then terminates the process). I figured there would be a way to request the system to relay the an un-caught signal in a child process to yourself (just as normally it relays your own signals back to you when you catch them). But perhaps gdb works through threads rather than processes. Breakpoints are usually implemented by poking an invalid instruction in the code at that point. So definitely gdb must also have write access to the memory.