Trying to catch a difficult bug

Kempelen · Post by **Kempelen** » Thu Nov 10, 2016 10:53 pm

Ras wrote: I guess compiler optimisation could be the issue here. In C, the compiler is free to reorder non-volatile memory accesses as long as the result is equivalent, under the assumption that no undefined behaviour happens.

But crashing is pretty much the most undefined behaviour possible.

So the compiler may move the crashing instruction ahead of the printf, although in the source text it comes after the printf.

On the CPU level with out-of-order execution of modern x86 CPUs, there may be some additional fun.

Eureka! I compiled my code with -O in place of -O3 and the program works flawlessly. I can now use the engine in my test environment for hours without problems.

Anyway, continuing this hot thread and topic, I will describe another bug I have that only shows up when playing on FICS through WinBoard. After a few moves in the first game, the program exits on an assert. Repeating the test makes it exit on a different assert (and so on; it is always one of the same two or three asserts, hit in domove() after a long search). This time, surely, it is a bug in my code and not the compiler. My problem is how difficult the bug is to track: every test is a game on FICS that ends in a disconnection, and it costs me a lot of effort to work out where the problem is. There are a lot of timing conditions, commands and so on that could be the cause.

My question is: what would be a good strategy in this case to catch a bug that is not reproducible in the console or the debugger?

Thanks in advance...

AlvaroBegue · Post by **AlvaroBegue** » Thu Nov 10, 2016 11:05 pm

Kempelen wrote: Eureka! I compiled my code with -O in place of -O3 and the program works flawlessly. I can now use the engine in my test environment for hours without problems.

You shouldn't be so happy. If you haven't identified the cause of the bug in the first place, you are likely still invoking undefined behavior. It may work for hours now, and next week it may pop its ugly head up again, even with -O instead of -O3. You just don't know.

Ras · Post by **Ras** » Fri Nov 11, 2016 8:08 pm

Kempelen wrote: Eureka! I compiled my code with -O in place of -O3 and the program works flawlessly.

That is really bad news because the bug is still in there. The compiler optimisation isn't broken; rather, the optimisations break code that is buggy to begin with.

Another thing is, don't go for -O3 - it is usually slower than -O2 if used for the full program. Later on in the project, you can profile the code, locate hotspots and use #pragma in the sources for locally switching to -O3. But premature optimisation only yields trouble.
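
A minimal sketch of what switching a single hotspot to -O3 locally could look like, assuming GCC. The `#pragma GCC push_options` / `optimize` / `pop_options` directives are GCC-specific, and `count_bits_hot` is a made-up hotspot used only for illustration. GCC's documentation cautions that the related optimize attribute is meant for debugging rather than production code, so this is best treated as an experiment to be confirmed by profiling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Everything between push_options and pop_options is compiled at -O3;
   the rest of the file keeps the level given on the command line. */
#pragma GCC push_options
#pragma GCC optimize ("O3")

/* Hypothetical hotspot: count the set bits in an array of bitboards. */
uint64_t count_bits_hot(const uint64_t *boards, size_t n)
{
    uint64_t total = 0;

    assert(boards != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t b = boards[i];
        while (b) {
            b &= b - 1;   /* clear the lowest set bit */
            total++;
        }
    }
    return total;
}

#pragma GCC pop_options
```

Compilers that do not know these pragmas will normally just warn and ignore them, so the code stays portable.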

Did you actually try the sanitiser with the tons of options I gave?

This time, surely, it is a bug in my code and not the compiler.

Both issues are surely code issues, not compiler ones. I've been developing in C for more than two decades now, and while compiler bugs do exist, I haven't actually encountered any.

My question is, how in this case, would be a good strategy to catch this bug which is not reproducible in console and debugger?

Sounds like you have some kind of protocol engine going on there. Usually, that is implemented as a finite state machine. So you could add an fprintf at every point where you transition from one state to another. The idea is to log everything into a text file, line by line.

Each line would include a timestamp, the source state, the destination state, and possibly additional parameter data.

Now if the protocol weren't implemented as a clean state machine, then my first approach would be to change that.
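
A rough sketch of such transition logging, under stated assumptions: the state names, `change_state` and `current_state` are hypothetical and not taken from any real engine. Each transition writes one line with a timestamp, the source state, the destination state and a free-form parameter.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical protocol states; a real engine would have its own set. */
typedef enum { ST_IDLE, ST_THINKING, ST_PONDERING, ST_COUNT } proto_state;

static const char *const state_names[ST_COUNT] = {
    "IDLE", "THINKING", "PONDERING"
};

static proto_state current_state = ST_IDLE;

/* Switch to `next` and append one log line:
   timestamp, source state, destination state, optional parameter. */
void change_state(FILE *log, proto_state next, const char *param)
{
    assert((int)next >= 0 && (int)next < ST_COUNT);
    fprintf(log, "%ld %s -> %s [%s]\n", (long)time(NULL),
            state_names[current_state], state_names[next],
            param ? param : "");
    fflush(log);   /* flush so the line survives a crash right after it */
    current_state = next;
}
```

A call such as `change_state(logfile, ST_PONDERING, "e2e4")` would then leave a trace of the last transitions before an assert fires. The `fflush` costs some speed but keeps the log trustworthy when the program dies.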

Kempelen · Post by **Kempelen** » Fri Nov 11, 2016 8:37 pm

You are right, the problem is still there, and I am not so happy.

Surely both problems could be the same bug.

I think the problem is not the protocol; in fact, today I was able to reproduce it in the console. The program asserts at a point deep in the search, and it asserts because a data structure has had its data changed. So I know where the problem shows up, but not where the data gets changed... and this is a real problem that I don't know exactly how to track, above all because it is a node-dependent structure.

The sanitizer option does not work for me on MinGW; I get an error, which I will post next.

Ras · Post by **Ras** » Fri Nov 11, 2016 9:07 pm

Kempelen wrote: it asserts because a data structure has had its data changed.

So that could be a buffer overrun (or underrun), or a pointer gone wild. As I wrote with my story above, it could even be a pointer gone wild due to a buffer overrun elsewhere.
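
One way to narrow down an overrun next to a structure is to fence it with guard words and check them often. This is only a sketch: `node_data`, its fields and `NODE_CHECK` are invented for illustration, and the trick catches writes that run over the edges of the structure, not a wild pointer that lands in the middle of it.

```c
#include <assert.h>
#include <stdint.h>

#define GUARD_MAGIC 0xDEADBEEFu

/* Hypothetical per-node search data, fenced by two guard words. */
typedef struct {
    uint32_t guard_front;
    int      move_list[64];
    int      move_count;
    uint32_t guard_back;
} node_data;

void node_init(node_data *n)
{
    n->guard_front = GUARD_MAGIC;
    n->move_count  = 0;
    n->guard_back  = GUARD_MAGIC;
}

/* Returns 1 if both guards are intact, 0 if something wrote over them. */
int node_guards_ok(const node_data *n)
{
    return n->guard_front == GUARD_MAGIC && n->guard_back == GUARD_MAGIC;
}

/* Place at function entry and exit to bracket where the damage happens. */
#define NODE_CHECK(n) assert(node_guards_ok(n))
```

Moving `NODE_CHECK` calls closer together between runs works like a bisection: the first check that fails sits just after the code that did the damage.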

Or maybe something related to pointer aliasing, GCC has become a bit picky about that one. Do you use data format conversion by pointer casting anywhere?
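
To illustrate the aliasing point: converting data formats by pointer cast can break under strict aliasing, while copying the bytes with memcpy is well defined. The sketch below assumes a 32-bit IEEE 754 `float`; `float_bits` is a hypothetical helper, not something from the engine discussed here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Risky with strict aliasing enabled (GCC turns it on at -O2 and above):
 *     uint32_t bits = *(uint32_t *)&f;
 * The compiler may assume a uint32_t access never touches a float object. */

/* Well-defined alternative: copy the object representation. */
uint32_t float_bits(float f)
{
    uint32_t bits;

    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Compiling with -fno-strict-aliasing is a quick experiment: if the bug disappears with that flag, aliasing violations become a prime suspect.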

... and this is a real problem that I don't know exactly how to track, above all because it is a node-dependent structure.

Ouch, so it changes all the time, but now it changes when it shouldn't.

The sanitizer option does not work for me on MinGW

Correct, it doesn't even work on Cygwin. It works only with GCC under GNU/Linux, unfortunately.

But you could install Linux Mint on a USB stick, boot from that and then have GCC available. I did that with my program with the full sanitiser options even when I didn't have a problem, just for being more sure.

Kempelen · Post by **Kempelen** » Fri Nov 11, 2016 9:19 pm

Ras wrote:
Use the following options to compile the program:

-Wall -Wmaybe-uninitialized -Wstrict-aliasing -Wlogical-op -Wcast-align -g -fsanitize=address -fsanitize=bounds -fsanitize=object-size -fsanitize=alignment -fsanitize=null -fsanitize=undefined -fsanitize=shift -fsanitize=signed-integer-overflow -fsanitize=integer-divide-by-zero

I can't compile with the whole option line; I get an error:

D:\Rodin\cfg.c: In function 'CfgLeerOpcion':
D:\Rodin\cfg.c:84:5: internal compiler error: in pp_format, at pretty-print.c:611
 int CfgLeerOpcion(const char *seccion, const char *opcion, const char tipo,
     ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <http://tdm-gcc.tdragon.net/bugs> for instructions.
Process terminated with status 1 (0 minute(s), 0 second(s))

If I remove the arguments and keep only one, I get a link error that persists even when I link with -lasan, so I can't continue:

D:\Rodin\bench.c|223|undefined reference to `__asan_report_load4'|

Ras · Post by **Ras** » Fri Nov 11, 2016 9:51 pm

Kempelen wrote: I can't compile with the whole option line; I get an error:

Of course. "D:\Rodin" is surely under Windows, and as I wrote, the sanitiser feature of GCC works under GNU/Linux only. Not under Windows/GCC/MinGW, not under Windows/GCC/Cygwin.

I'm running Win7, so I had the same kind of problem, that's why I got a USB stick with Linux Mint going, only for checking with the GCC sanitiser feature.

But compared to the work you're in for with such a difficult bug, firing up GNU/Linux from a USB stick seems relatively little work.

Kempelen · Post by **Kempelen** » Sat Nov 12, 2016 3:08 pm

I finally caught the bug after three long, intensive days of hunting. The problem was in undoing a move in ponder mode: if the expected move was not played, the undo left the board in a bad state. That is why it only reproduced on FICS, as it is the only place where my engine plays with pondering on.

As for the -fsanitize options, I tried to compile on Linux but ran into problems because my engine has Windows-dependent functions that I need to port (reading files, ...). I plan to test this, as it looks like a very promising way to clean up the code.

Thanks for the advice...

cdani · Post by **cdani** » Sat Nov 12, 2016 8:53 pm

Kempelen wrote: As for the -fsanitize options, I tried to compile on Linux but ran into problems because my engine has Windows-dependent functions that I need to port (reading files, ...). I plan to test this, as it looks like a very promising way to clean up the code.

Yes! It also helped me find small bugs in Andscacs when I made the Linux version.

Ras · Post by **Ras** » Sun Nov 13, 2016 8:18 pm

Kempelen wrote: I finally caught the bug after three long, intensive days of hunting.

Congrats - I bet it's a good feeling!

The problem was in undoing a move in ponder mode: if the expected move was not played, the undo left the board in a bad state.

Uh, that's indeed a difficult one.
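
A bug of this kind can sometimes be caught earlier with a debug check that plays a move, takes it back, and compares the board with a snapshot. The sketch below uses a heavily simplified, hypothetical board (`board_t`, `do_move`, `undo_move` are invented names, not the engine's actual code); a real engine would also have to compare castling rights, en passant square, hash keys and so on.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, heavily simplified board state. */
typedef struct {
    int squares[64];
    int side_to_move;   /* 0 = white, 1 = black */
} board_t;

/* Information needed to take a move back. */
typedef struct { int from, to, captured; } undo_t;

void do_move(board_t *b, int from, int to, undo_t *u)
{
    assert(from >= 0 && from < 64 && to >= 0 && to < 64);
    u->from = from;
    u->to = to;
    u->captured = b->squares[to];
    b->squares[to] = b->squares[from];
    b->squares[from] = 0;
    b->side_to_move ^= 1;
}

void undo_move(board_t *b, const undo_t *u)
{
    b->squares[u->from] = b->squares[u->to];
    b->squares[u->to] = u->captured;
    b->side_to_move ^= 1;
}

/* Debug helper: returns 1 if do + undo restores the board exactly. */
int undo_restores_board(board_t *b, int from, int to)
{
    board_t before;
    undo_t u;

    memcpy(&before, b, sizeof before);
    do_move(b, from, to, &u);
    undo_move(b, &u);
    return memcmp(&before, b, sizeof before) == 0;
}
```

The same snapshot-and-compare idea can be applied around the ponder path: save the board before the ponder move is made, and assert that it matches once the ponder move has been taken back.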

As for the -fsanitize options, I tried to compile on Linux but ran into problems because my engine has Windows-dependent functions that I need to port (reading files, ...).

Files are the easy part if you use the C standard library, but the networking API is somewhat different, not to mention signals. Maybe encapsulating that in some kind of OS-dependent wrapper layer would be good.
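
A minimal sketch of such a wrapper layer, shown for one function only. `os_time_ms` is a hypothetical name; the Windows branch relies on `GetTickCount64` and the other branch on POSIX `clock_gettime`, and the rest of the engine would only ever call the wrapper.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime under strict -std */
#endif

#include <assert.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>

/* Windows implementation: milliseconds since system start. */
uint64_t os_time_ms(void)
{
    return (uint64_t)GetTickCount64();
}

#else
#include <time.h>

/* POSIX implementation: milliseconds from the monotonic clock. */
uint64_t os_time_ms(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;   /* avoid an unused-variable warning when NDEBUG is set */
    return (uint64_t)ts.tv_sec * 1000u + (uint64_t)ts.tv_nsec / 1000000u;
}
#endif
```

With every OS-specific call hidden behind functions like this, porting to Linux for a sanitiser run means rewriting one small file instead of touching the search code.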

I'm using something similar for developing under Cygwin/Windows, occasionally sanitising under GNU/Linux, and the target firmware runs on a Cortex-M microcontroller without any OS.

I plan to test this, as it looks like a very promising way to clean up the code.

Definitely, yes.
