Engine Crash Detective Story

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
emadsen
Posts: 434
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Engine Crash Detective Story

Post by emadsen »

Hi all. Recently I investigated a bug that caused my engine to crash in about 10 games of a 4,000 game tournament. One of those frustrating bugs that's not reproducible by a specific position or sequence of UCI commands. Rare enough that I kept putting off investigating it in favor of working on code to improve engine strength.

Last weekend I decided I'd lived with this bug long enough and it was time to exterminate it. I figured out the root cause and fixed the issue. Today I spent time writing a blog post that explains my debugging process: how I shed light on the problem, what I observed, a series of hypothesis I formulated and tested, and finally, the solution.

MadChess 3.0 Beta 2d855ec (Crash Bug)

I'm sharing it here in hope of helping novice chess engine programmers who may benefit from reading about the rigorous process required to locate and squash bugs in the complex codebase of a chess engine. Experienced programmers may wish to read the post to determine if they can find the "killer" quicker than I... or just to nod their head in acknowledgment, "Yes, I've experienced the very same issue."

One subtle aspect of this bug that I don't discuss in my blog post is the following: It's a vivid example of the kind of bug that is unlikely to be caught by defensive programming techniques like Debug.Assert(). I'm a fan of such techniques, and have included numerous assert statements in the MadChess 3.0 Beta codebase. However, playing a tournament with a debug version of the engine will not trigger this bug. Why? Because the Debug.Assert code recalculates so many values from scratch- inside the main search loop- that are calculated incrementally by the Release code, that it slows down the engine immensely. Encumbered by all that defensive code, the engine simply cannot search moves fast enough to accumulate a large enough value to trigger the bug described in my blog post. Truly a Heisenbug.

If you're so inclined, read my blog post and let me know if it's enlightening- or at least entertaining.

Cheers,
Erik
My C# chess engine: https://www.madchess.net
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Engine Crash Detective Story

Post by jdart »

I once had a programming job working on a compiler. There was a bug in the code generator, which would cause the generated program to crash. But the thing is, it didn't just generate an illegal instruction or something and the program crashed right at that point. No, it generated legal code that was an incorrect translation of the source, and much, much past that point, the execution of the program would fail due to a SIGSEGV or something. So I had to work backwards, with an assembly-language debugger, to the point where the bad code was generated. That's one of the hardest debugging tasks I have had. I can't think of a problem in my chess engine quite that hard to pin down, but there too there have been a lot of issues that required some sleuthing.

--Jon
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Engine Crash Detective Story

Post by lucasart »

emadsen wrote: Sun Aug 30, 2020 1:57 am Hi all. Recently I investigated a bug that caused my engine to crash in about 10 games of a 4,000 game tournament. One of those frustrating bugs that's not reproducible by a specific position or sequence of UCI commands. Rare enough that I kept putting off investigating it in favor of working on code to improve engine strength.

Last weekend I decided I'd lived with this bug long enough and it was time to exterminate it. I figured out the root cause and fixed the issue. Today I spent time writing a blog post that explains my debugging process: how I shed light on the problem, what I observed, a series of hypothesis I formulated and tested, and finally, the solution.

MadChess 3.0 Beta 2d855ec (Crash Bug)

I'm sharing it here in hope of helping novice chess engine programmers who may benefit from reading about the rigorous process required to locate and squash bugs in the complex codebase of a chess engine. Experienced programmers may wish to read the post to determine if they can find the "killer" quicker than I... or just to nod their head in acknowledgment, "Yes, I've experienced the very same issue."

One subtle aspect of this bug that I don't discuss in my blog post is the following: It's a vivid example of the kind of bug that is unlikely to be caught by defensive programming techniques like Debug.Assert(). I'm a fan of such techniques, and have included numerous assert statements in the MadChess 3.0 Beta codebase. However, playing a tournament with a debug version of the engine will not trigger this bug. Why? Because the Debug.Assert code recalculates so many values from scratch- inside the main search loop- that are calculated incrementally by the Release code, that it slows down the engine immensely. Encumbered by all that defensive code, the engine simply cannot search moves fast enough to accumulate a large enough value to trigger the bug described in my blog post. Truly a Heisenbug.

If you're so inclined, read my blog post and let me know if it's enlightening- or at least entertaining.

Cheers,
Erik
Well done. Debugging is an under-rated art. It is painful and unrewarding, but true programming skill is learnt debugging, much more so than writing code.

Starting from the original code, here's how I would have approached this problem (i'm not familiar with C#, so take my code as pseudo-code):

Code: Select all

PlayMove(Move);
int kingSquare = CurrentPosition.WhiteMove
    ? Bitwise.FindFirstSetBit(CurrentPosition.BlackKing)
    : Bitwise.FindFirstSetBit(CurrentPosition.WhiteKing);
if (IsSquareAttacked(kingSquare))
{
    UndoMove();
    return false;
}
The first hypothesis you make is interesting. But I think you are missing the potential for generalisation. Why write code like this ?

Code: Select all

int kingSquare = CurrentPosition.WhiteMove
    ? Bitwise.FindFirstSetBit(CurrentPosition.BlackKing)
    : Bitwise.FindFirstSetBit(CurrentPosition.WhiteKing);
When you could instead write all your code like this

Code: Select all

int kingSquare = CurrentPosition.kingSquare();
Now you delegate the work a separate function pos.kingSquare(), who is responsible for checking its arguments. The kingSquare() function should have an assert(), and scream (in debug compile at least) when it's going to return an invalid square. In fact, the FindFirstSetBit() should already have this protection. If the bitboard is zero, the result is undefined, and an assert should fire.

So you wouldn't need to formulate an hypothesis, and write specific logging, you just have the answer screaming at you with an assert. Not only in this specific case, but in every other case in your code that uses kingSquare() or FindFirstSetBit().

After playing the Kf5e6 move which "transforms the king into a pawn", any subsequent call to pos.kingSquare(WHITE) will scream.

After fixing a bug that was painful to debug, it's important to go the extra mile, and not just fix the bug. You need to refactor the code in such a way that this kind of bug can't happen again (eg. DRY representation to prevent inconsistencies), or will be detected immediately (assert). For example, you could add an assert that pos.whiteKing and pos.blackKing have exactly one bit. And more generally a assert(pos.isValid()) which checks all invariants.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Engine Crash Detective Story

Post by Dann Corbit »

Worst bug I ever had was in someone else's five million line code base that I was assigned to maintain. That's right, one person. It was an SMP program (a telephone call tracking and logging program written in OS/2) and so the very rare bug made me suspect a problem with guarding access to global or static data. It turned out that there was this struct that was passed down, down, down, down way down deep. Way down at the bottom something modified the struct. It looked benign. A tiny little struct had an assignment like this:

foo.bar = value;

The evil thing that caused the crash that made it so hard to find was that the struct was passed by reference.
If it were:
foo->bar = value;
you might immediately expect it could cause trouble in the caller. I did not look at the signatures closely enough and each and every one was a pass by reference (even stranger was that only the very bottom level call modified it and in every case a pass by value would have been more appropriate.) Now, if it was a big structure, you would have the overhead of making a big copy and pushing all that data on the stack. But it was a tiny little struct. It took three days to track it down, meanwhile, new bugs had been coming in at one or two per day and I was getting pretty nervous about my assignment. Fortunately, they sold the maintenance contract to an outside firm.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Engine Crash Detective Story

Post by lucasart »

Dann Corbit wrote: Sun Aug 30, 2020 5:18 am The evil thing that caused the crash that made it so hard to find was that the struct was passed by reference.
If it were:
foo->bar = value;
you might immediately expect it could cause trouble in the caller. I did not look at the signatures closely enough and each and every one was a pass by reference
Another "good idea" from C++, which, after decades of experience, even C++ fans are starting to admit was a bad idea …
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Engine Crash Detective Story

Post by Dann Corbit »

lucasart wrote: Sun Aug 30, 2020 6:28 am
Dann Corbit wrote: Sun Aug 30, 2020 5:18 am The evil thing that caused the crash that made it so hard to find was that the struct was passed by reference.
If it were:
foo->bar = value;
you might immediately expect it could cause trouble in the caller. I did not look at the signatures closely enough and each and every one was a pass by reference
Another "good idea" from C++, which, after decades of experience, even C++ fans are starting to admit was a bad idea …
If I were on the C++ standards committee, I would push for pass by reference allowed in exactly two contexts:
1. const reference to const object
2. operator overloading
#2 is just for convenience in something like a number class where things like:
Number a=<monstrous string of digits>, b = <monstrous string of digits>, c;
c = a + b;
is so much easier to read than a bunch of function calls
But beware, overloading exponention does not work.
Both ** (like Fortran) and ^ as exponentiation operators have the wrong precedence.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
MikeB
Posts: 4889
Joined: Thu Mar 09, 2006 6:34 am
Location: Pen Argyl, Pennsylvania

Re: Engine Crash Detective Story

Post by MikeB »

Dann Corbit wrote: Sun Aug 30, 2020 5:18 am ... Fortunately, they sold the maintenance contract to an outside firm.
I can relate. I was not a coder by profession, but I often had assignments that could have benefitted from some coding. Before becoming a CPA, i was going to school at night and working as an entry level accountant during the day for a major utility, A major utility that paid dividends on a quarterly basis. At some point , it was decided that i should reconcile the "dividend" account. This was back in the70's just over 40 years ago - every quarter, they would issue 3 to 4 million checks . all checks issued on the same day ,every quarter with all the checks in some multiple of the per share dividend amount. You would have thousands of checks issued the same day for the same amount. Everything was done in "batches" then, one batch may have 1000 checks. So you were talking about going through thousands of batches to find maybe 20 or so errors. Tooday , this process is highly automated and accurate, but 50 years ago, Banks often keyed an incorrect amount for check that was cashed. It was literally looking for needle in the haystack to find out which checked was debited incorrectly to the checking account. But we did it - they would me give 3 days a month to work on it. Was not unhappy to see that assignment go to someone else. Debugging can be very similar ...
Image
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Engine Crash Detective Story

Post by Ras »

Well written and interesting story!

The bugfix however seems to have another issue because the used history value will be somewhat off now if the bitwise & prevents an overflow. The value that is actually used can be pretty small. It would seem more correct to check for the limit value and use the largest legal value if the limit is exceeded. That would be closest to the intended, but (in this case) not possible code behaviour.

Example with made up values: history has three bits. The current history value is 8, ie. binary value 1000. You mask with 7, and the used value now is 0. It would be closer to check 8 against the allowed limit of 7, and use 7 because that's the closest to the actual value of 8 that you can have.
Rasmus Althoff
https://www.ct800.net
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Engine Crash Detective Story

Post by Dann Corbit »

MikeB wrote: Sun Aug 30, 2020 8:20 am
Dann Corbit wrote: Sun Aug 30, 2020 5:18 am ... Fortunately, they sold the maintenance contract to an outside firm.
I can relate. I was not a coder by profession, but I often had assignments that could have benefitted from some coding. Before becoming a CPA, i was going to school at night and working as an entry level accountant during the day for a major utility, A major utility that paid dividends on a quarterly basis. At some point , it was decided that i should reconcile the "dividend" account. This was back in the70's just over 40 years ago - every quarter, they would issue 3 to 4 million checks . all checks issued on the same day ,every quarter with all the checks in some multiple of the per share dividend amount. You would have thousands of checks issued the same day for the same amount. Everything was done in "batches" then, one batch may have 1000 checks. So you were talking about going through thousands of batches to find maybe 20 or so errors. Tooday , this process is highly automated and accurate, but 50 years ago, Banks often keyed an incorrect amount for check that was cashed. It was literally looking for needle in the haystack to find out which checked was debited incorrectly to the checking account. But we did it - they would me give 3 days a month to work on it. Was not unhappy to see that assignment go to someone else. Debugging can be very similar ...
When I was living in Turkey, a friend of mine had one days receipts for Exxon of Florida accidentally deposited in his bank account. (Something over $300,000 but I forget exactly how much. His name was Paul Abeyta)

Being a good Samaritan, he decided to return it. Which lead to all sorts of horrific consequences (following the law of "no good deed shall go unpunished")

In the end, we found out his account number was the binary compliment of Exxon of Florida's and a programming error had caused the problem.
Now there's a bug for you.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Engine Crash Detective Story

Post by mar »

Dann Corbit wrote: Sun Aug 30, 2020 5:18 am Worst bug I ever had was in someone else's five million line code base that I was assigned to maintain. That's right, one person. It was an SMP program (a telephone call tracking and logging program written in OS/2)
I'm puzzled how a telephone call tracking and logging program can be 5 million lines of code? That's nearly half of the whole Linux and four times Unreal Engine 4!... I find it hard to believe.

These days, we have amazing tools like sanitizers (even though they only cover code paths taken at runtime)
Martin Sedlak