Efficiently Generated Legal moves question

Discussion of chess software programming and technical issues.

Moderator: Ras

spirch
Posts: 95
Joined: Fri Nov 09, 2012 12:36 am

Re: Efficiently Generated Legal moves question

Post by spirch »

the screenshot is showing sampling profiling; redo it with tracing, untick "enable inlining" and tick "high accuracy"
check the number of calls, not the time

try removing some of these:

Code: Select all

 [MethodImpl(MethodImplOptions.AggressiveInlining)]
it can do some bad things in some scenarios, this will be try & see

converting to byte is kind of a code smell to me, why do it? accessing an array with a ulong is also weird, but i'm not yet using bitboards, maybe it's necessary

a jagged array is better than a multi-dimensional array, but what is even better is one flat array with an offset (please profile)
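for illustration, a rough sketch of the flat-array-with-offset idea (hypothetical MAX_PLY / MAX_MOVES limits and names, not necessarily yours):

Code: Select all

// one flat buffer instead of int[MAX_PLY][MAX_MOVES] or a 2D array
class MoveBuffer
{
    const int MAX_PLY = 64;      // assumed limits, adjust to the engine
    const int MAX_MOVES = 256;

    readonly int[] moves = new int[MAX_PLY * MAX_MOVES];

    // compute the offset from the ply, then index inside that slice
    public int Get(int ply, int i)         { return moves[ply * MAX_MOVES + i]; }
    public void Set(int ply, int i, int m) { moves[ply * MAX_MOVES + i] = m; }
}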

in your IsAttacked, i assume this might be to see if the king is in check; try changing the order to queen, rook, bishop, knight, pawn, king

i assume all these are inlined? again, try to see the impact by removing it

do you build in 32 bit or 64 bit?
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Efficiently Generated Legal moves question

Post by lithander »

pedrojdm2021 wrote: Mon Jul 12, 2021 1:32 am I have 2 "projects"

look at it as a frontend-backend structure, like a website:

frontend: unity implementation

backend: C# Chess Engine .dll built in .net framework 4.8 (the highest version of .net that unity currently supports)

for the C# chess engine i of course have a console application to profile, debug, and test the engine before i import it into unity as a .dll plugin

In unity things can get slower due to unity's implementation of the C# layer. it is weird, but C# in unity always runs slower, no matter if you produce GC allocations or not
While I would completely agree that C# in Unity always runs slower, it still makes a difference whether you want to optimize for the CLR or for Unity.

For example in my C# engine I create a new board class instance whenever I make a move. It's basically a copy constructor with the move as the 2nd parameter. There's no move undo, you just let the instance go out of scope for the garbage collector to clean it up. So with 1M nodes per second you'd expect the garbage collector to be very busy, right? But a smart managed runtime detects the recursive pattern of the search and if I search to depth 10 then it only needs 10 instances at the same time. The heap size stays perfectly flat and small. Good runtime!
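Roughly like this (a simplified sketch, not the actual code):

Code: Select all

struct Move { public byte From, To; }

class Board
{
    ulong[] bitboards = new ulong[16];   // whatever state a position needs

    public Board() { }

    // copy constructor with the move as the 2nd parameter:
    // clone the parent position, then play the move on the copy
    public Board(Board parent, Move move)
    {
        bitboards = (ulong[])parent.bitboards.Clone();
        Play(move);
    }

    void Play(Move move) { /* flip the relevant bits */ }
}

// the search creates new Board(parent, move) for each child and never undoes
// anything; at depth N only about N instances are live at the same time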

Not so in Unity. There you'll get 1M real allocations per second, the used heap memory would grow with a steep gradient until the garbage collector gets triggered, halts the main thread and scans for and removes all the stale copies of the Board class from the heap. That takes a few milliseconds in which you are not searching chess positions. Eventually you get a vertical drop in used heap size. Rinse, repeat. That is not a healthy pattern. It means that the millions of allocations have some actual cost and should be avoided. My code is fine (not the fastest but fine) in the CLR but it would be plain wasteful in Unity.

However, constantly worrying about avoiding runtime allocations is not how C# is meant to be used. In C# it should be fine to create a new List instance instead of clearing the old one to reuse it. It can actually be faster, because "clearing" is not just setting the length to zero, it's also setting all the occupied slots to null or default. If you let it go out of scope and allocate a new one you don't pay that price.
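For example (a tiny sketch):

Code: Select all

using System.Collections.Generic;

class Example
{
    List<object> moves = new List<object>(64);

    void Reuse()
    {
        // reuse: besides setting Count to 0, Clear() also nulls the occupied
        // slots when the element type holds references
        moves.Clear();
    }

    void Fresh()
    {
        // idiomatic C#: allocate a new list and let the old one become garbage
        moves = new List<object>(64);
    }
}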

So that's something I really don't like about Unity: it punishes you for writing idiomatic C# code.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: Efficiently Generated Legal moves question

Post by pedrojdm2021 »

lithander wrote: Mon Jul 12, 2021 2:01 am
While I would completely agree that C# in Unity always runs slower, it still makes a difference whether you want to optimize for the CLR or for Unity.

For example in my C# engine I create a new board class instance whenever I make a move. It's basically a copy constructor with the move as the 2nd parameter. There's no move undo, you just let the instance go out of scope for the garbage collector to clean it up. So with 1M nodes per second you'd expect the garbage collector to be very busy, right? But a smart managed runtime detects the recursive pattern of the search and if I search to depth 10 then it only needs 10 instances at the same time. The heap size stays perfectly flat and small. Good runtime!

Not so in Unity. There you'll get 1M real allocations per second, the used heap memory would grow with a steep gradient until the garbage collector gets triggered, halts the main thread and scans for and removes all the stale copies of the Board class from the heap. That takes a few milliseconds in which you are not searching chess positions. Eventually you get a vertical drop in used heap size. Rinse, repeat. That is not a healthy pattern. It means that the millions of allocations have some actual cost and should be avoided. My code is fine (not the fastest but fine) in the CLR but it would be plain wasteful in Unity.

However, constantly worrying about avoiding runtime allocations is not how C# is meant to be used. In C# it should be fine to create a new List instance instead of clearing the old one to reuse it. It can actually be faster, because "clearing" is not just setting the length to zero, it's also setting all the occupied slots to null or default. If you let it go out of scope and allocate a new one you don't pay that price.

So that's something I really don't like about Unity: it punishes you for writing idiomatic C# code.
Yeah, my code was really bad when i first started on this. before i knew about bitboards and all that stuff i was doing it my own "way" and it was a disaster lol

after the bitboards implementation i had it like yours: a new List<int> on each GenerateMoves() call, and it for sure was causing big issues on the unity side. now everything is just an int[] array with an index accessor and it feels so much smoother on the unity side

so in my loop i just do something like:

Code: Select all

// GenerateMoves fills movelist[ply] in place and returns the index of the last move written
sbyte targetIndex = board.GenerateMoves(movelist[ply]);

for (i = 0; i <= targetIndex; i++)
{
      currentMove = movelist[ply][i];
      .......
}
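the movelist itself is allocated once up front, roughly like this (a sketch with assumed MAX_PLY / MAX_MOVES limits):

Code: Select all

class Search
{
    const int MAX_PLY = 64;      // assumed limits
    const int MAX_MOVES = 256;

    // allocated once, reused for every node: no per-node GC pressure
    readonly int[][] movelist = new int[MAX_PLY][];

    public Search()
    {
        for (int ply = 0; ply < MAX_PLY; ply++)
            movelist[ply] = new int[MAX_MOVES];
    }
}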
However, constantly worrying about avoiding runtime allocations is not how C# is meant to be used.
yeah i know, but when one wants some speed it is required, at least based on my tests with this. in my case even extracting a function that mixes two bitboards from the bitboards array generates issues with performance
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Efficiently Generated Legal moves question

Post by Edsel Apostol »

pedrojdm2021 wrote: Mon Jul 12, 2021 1:15 am
spirch wrote: Mon Jul 12, 2021 12:50 am
emadsen wrote: Mon Jul 12, 2021 12:05 am
pedrojdm2021 wrote: Sun Jul 11, 2021 10:16 pm Yeah but you use .net 5, that's the problem. any fast C# engine that i've seen is written in .net 5; in .net 4 we don't have access to the new "native" stuff in .net:

Just to name a few, in .net 4 we don't have:
- Span<T>
- System.Runtime.Intrinsics.X86;
OK Pedro, you know best. Yet you seem to have missed the entire point of my post. Span<T> and System.Runtime.Intrinsics are not magic bullets that get you a 10x speed boost. I assume you're new to chess programming. You've been asking questions on this forum for what, two weeks? Now you're tossing away, as too slow, a programming language and runtime (or at least version 4 of the .NET runtime) that's supported by 20+ years of engineering effort from a tech pioneering company. Doesn't that strike you as a bit rash and out of proportion?

You're relying too much on profiler CPU metrics. Time In Method percentages will not indicate fundamental flaws in data structures, memory allocation, or software algorithms. TIM metrics are much more useful when you have recorded a baseline performance profile, you refactor or enhance code, then you observe a performance regression (slowdown). You can compare the TIM metrics from the previous version to the new version to determine if the performance of specific methods has degraded. If so, those methods are culprits. Code changes within them (or the methods underneath them on the call stack) may have wrecked performance.

You can't just look at TIM percentages and declare a method with a high CPU% indicates a flaw in the underlying programming language or runtime. For example, if you profile a strong chess engine, you're likely to find high CPU% for the quiescence search and evaluation methods. Should we eliminate QSearch & Eval to speed up the engine? Of course not. So the high CPU% metric for those methods does not indicate a performance problem.

The absolute value of program run time (or time-to-depth in a chess engine search) is the metric that matters. Program run time is overwhelmingly influenced by proper choice of data structures, by light memory management (avoiding churn via excessive allocation and GC), by efficient algorithms, and by bug elimination (a never-ending quest). Even the run time and time-to-depth metrics are fraught with subtlety and require context to understand. You can make a chess engine search very fast but very stupidly by carelessly reducing and pruning too many moves.

It is entirely possible that a code refactor that better aligns data structures to the data access patterns of a chess engine, or reduces memory allocations, or improves search efficiency (spending more time on critical positions and less time on unprofitable continuations) actually increases one of the method CPU percentages you're fixated on, yet significantly reduces the absolute value of program run time (or time-to-depth), thereby increasing the playing strength of the engine.

I'm repeating myself, but for emphasis: High CPU percentages are relevant when the same method's CPU% was low in a previous baseline performance profile and increased significantly in a new version. The metric provides a lead to follow when investigating performance degradations introduced by newly committed code. The metric rarely provides much value when interpreted a) in isolation or b) when compared to the same metric from a vastly different codebase.
i totally agree with this, look at the implementation, see if you can optimize it, refactor, refactor, refactor, ... ok i hope you understand 8-)

also, in one other thread i suggested that you try the trial version of dotTrace instead of the built-in visual studio one; if you didn't, try it. i personally don't like the built-in one, the trial is free, go play with it. try tracing profiling, not sampling. this could help you figure out if some methods are called too often. untick "allow inlining".

i'm going to link this post by me from a few months ago http://talkchess.com/forum3/viewtopic.p ... 10#p886029
focus on the screenshot and feel free to look at the whole thread

in there i'm saying that I went from 3.5m to 21m in my perft because of an error in my code

today the same perft is now doing between 35m and 36m, what did I do? some pretty minor refactoring

(i should stop spending time on x88 and start the damn bitboard haha)
yes, i have downloaded dotTrace and everything seems to be pointing more or less to the same code...

[screenshot: dotTrace profiling results]

It points to the same IsSquareAttacked that seems heavy; inside that it points to:

5% on this GetBitboard. i thought extracted methods like this, when inlined, should not cause any overhead on the cpu, but it seems they do, or the inlining is not being applied at all?

Code: Select all

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public ulong GetBitboard(byte _player , byte _pieceType)
        {
            return bitboards[_player] & bitboards[_pieceType];
        }
Another significant part it points to is this one, GetQueenMoves:

Code: Select all

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static ulong GetQueenMoves(byte _square , ulong _occupancy)
        {
            return GetBishopMoves(_square , _occupancy) | GetRookMoves(_square , _occupancy);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static ulong GetBishopMoves(byte _square , ulong _occupancy)
        {
            _occupancy &= BISHOP_MASKS[_square];
            _occupancy *= Magics.BISHOP_MAGICS[_square];
            _occupancy >>= (64 - BISHOP_MOVE_COUNT[_square]);
            // magic bitboard lookup (square)(magic index)
            return BISHOP_MOVES[_square][_occupancy];
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static ulong GetRookMoves(byte _square, ulong _occupancy)
        {
            _occupancy &= ROOK_MASKS[_square];
            ........
you can see it is only a combination of the 2 magic bitboard lookups for rook and bishop, and that is heavy according to the profiler...
so all the profiling work that i've put in here just points to the same thing, but i don't see anything wrong with it :?

i am using bitboards, i am only returning a value from some arrays and applying some bitwise operations to get the attacks, and i am using inlining, so it should not cause any issue to combine the values of bishop and rook to get the queens
Your IsSquareAttacked is inefficient. You can save the call to GetQueenMoves and speed it up by 5.6%.

Code: Select all

// bishops and queens share one lookup, no GetQueenMoves call
if ((GetBishopMoves(_square, _occupancy) & (GetBitboard(_player, Bishop) | GetBitboard(_player, Queen))) != 0) return true;
Do the same for the rook attacks.

You can optimize further by having GetBishopSliders like this:

Code: Select all

        public ulong GetBishopSliders(byte _player)
        {
            return bitboards[_player] & (bitboards[Bishop] | bitboards[Queen]);
        }

        // then in IsSquareAttacked:
        if ((GetBishopMoves(_square, _occupancy) & GetBishopSliders(_player)) != 0) return true;
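Put together, the attacked-square test could look roughly like this (a sketch built from the GetBishopMoves/GetRookMoves and GetBitboard methods shown above; the piece constants and the knight/pawn/king attack tables are assumed):

Code: Select all

public bool IsSquareAttacked(byte _square, byte _attacker, ulong _occupancy)
{
    // bishops and queens share one magic lookup (no separate GetQueenMoves call)
    if ((GetBishopMoves(_square, _occupancy) &
         (GetBitboard(_attacker, Bishop) | GetBitboard(_attacker, Queen))) != 0) return true;

    // same idea for rooks and queens
    if ((GetRookMoves(_square, _occupancy) &
         (GetBitboard(_attacker, Rook) | GetBitboard(_attacker, Queen))) != 0) return true;

    // non-sliders are plain table lookups (tables assumed)
    if ((KNIGHT_ATTACKS[_square] & GetBitboard(_attacker, Knight)) != 0) return true;
    // pawn attacks are looked up from the defender's side (colors assumed 0/1)
    if ((PAWN_ATTACKS[1 - _attacker][_square] & GetBitboard(_attacker, Pawn)) != 0) return true;
    if ((KING_ATTACKS[_square] & GetBitboard(_attacker, King)) != 0) return true;

    return false;
}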
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: Efficiently Generated Legal moves question

Post by pedrojdm2021 »

Edsel Apostol wrote: Mon Jul 12, 2021 3:28 pm
Your IsSquareAttacked is inefficient. You can save the call to GetQueenMoves and speed it up by 5.6%.

Code: Select all

// bishops and queens share one lookup, no GetQueenMoves call
if ((GetBishopMoves(_square, _occupancy) & (GetBitboard(_player, Bishop) | GetBitboard(_player, Queen))) != 0) return true;
Do the same for the rook attacks.

You can optimize further by having GetBishopSliders like this:

Code: Select all

        public ulong GetBishopSliders(byte _player)
        {
            return bitboards[_player] & (bitboards[Bishop] | bitboards[Queen]);
        }

        // then in IsSquareAttacked:
        if ((GetBishopMoves(_square, _occupancy) & GetBishopSliders(_player)) != 0) return true;
hey, thank you! :D i applied your suggestion and gained ~50ms on perft 4 for the kiwipete position

before: ~550ms
now: ~505ms

it's not very much but at least it's something :)
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: Efficiently Generated Legal moves question

Post by pedrojdm2021 »

I don't know if these numbers are correct, but for the AI search on the kiwipete position at depth 9

[d]r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -

It executes Evaluate() 2,386,401 times:

[screenshot: dotTrace showing the Evaluate() call count]

i looked at my code, and i was counting the nodes wrong in Quiescence Search :lol: :lol: :lol:

i had:

Code: Select all

  private int Quiescense(int _alpha, int _beta)
        {
            if (AbortSearch) return 0;
            
            int eval = Evaluate();
          

            // fail-hard beta cut-off
            if (eval >= _beta) 
                return _beta;
            
            if (eval > _alpha)
                _alpha = eval;
                
             // increment node count
             // (bug: this runs after the beta cut-off, so fail-high nodes are never counted)
               searched_nodes++;
while it should be:

Code: Select all

 private int Quiescense(int _alpha, int _beta)
        {
            if (AbortSearch) return 0;
            
            int eval = Evaluate();
            
            // increment node count
            searched_nodes++;

            // fail-hard beta cut-off
            if (eval >= _beta) 
                return _beta;
            
            if (eval > _alpha)
                _alpha = eval;
And now my search prints:

Code: Select all

bestmove e2a6 score 132 seached nodes 2648244
search done in 3380ms
Illegal nodes: 350426
So the reality was not ~800,000 positions, the reality WAS nearly 3 million positions :mrgreen:

now i know that something is really broken with the pruning. i will find you, dammit bug! :twisted: