Efficiently Generated Legal moves question


Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Efficiently Generated Legal moves question

Post by Pio »

pedrojdm2021 wrote: Sun Jul 11, 2021 10:16 pm
emadsen wrote: Sun Jul 11, 2021 9:59 pm
pedrojdm2021 wrote: Sun Jul 11, 2021 8:43 pm My conclusion on this is that possibly the main cause is the slowness of C# in .NET Framework 4.
You are jumping to conclusions way too quickly. C# and .NET are not to blame for your engine performing at 1/10th the speed it should. My engine is written in C# using magic bitboards. It searches 3.7 million NPS in the WAC test suite, counting nodes only in the Board.PlayMove method.

In the source code of my C# engine I define #if POPCOUNT to a) fall back to loops (via Value &= Value - 1ul) to count set bits and b) use a De Bruijn sequence to find the first set bit on older processors that lack the popcount instruction. I haven't done extensive testing, but in a few positions I examined, the popcount version was 1.25x faster than the non-popcount version. A 20% slowdown due to missing support for popcount cannot explain why your engine is an order of magnitude (1/10) slower than it could be. It's not due to the techniques used to count and locate bits.

In addition, I know Graham Banks uses the non-popcount version of MadChess when running tournaments. It's unlikely all CCRL testers use the non-popcount version of my engine (each tester has their own hardware); however, CCRL has rated MadChess at 2636 ELO. This aligns with my own testing on my home, popcount-capable PC (2638 ELO). My point being, yes, MadChess runs slower on non-popcount hardware, but so do most other engines. I haven't measured any detrimental effect of non-popcount hardware on my engine's playing strength.
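For illustration, here is a minimal sketch (not MadChess's actual code; the class and method names are made up) of the two non-popcount fallbacks described above: a loop that clears the lowest set bit to count bits, and a De Bruijn multiplication to find the first set bit, using the standard lookup table from the Chess Programming Wiki.

Code: Select all

public static class BitOps
{
    // De Bruijn lookup table for bit scan forward (Chess Programming Wiki).
    private static readonly int[] DeBruijnIndex =
    {
         0,  1, 48,  2, 57, 49, 28,  3,
        61, 58, 50, 42, 38, 29, 17,  4,
        62, 55, 59, 36, 53, 51, 43, 22,
        45, 39, 33, 30, 24, 18, 12,  5,
        63, 47, 56, 27, 60, 41, 37, 16,
        54, 35, 52, 21, 44, 32, 23, 11,
        46, 26, 40, 15, 34, 20, 31, 10,
        25, 14, 19,  9, 13,  8,  7,  6
    };

    // a) Count set bits by repeatedly clearing the lowest one (value &= value - 1ul).
    public static int CountSetBits(ulong value)
    {
        var count = 0;
        while (value != 0)
        {
            value &= value - 1ul;
            count++;
        }
        return count;
    }

    // b) Index of the least significant set bit via a De Bruijn multiplication.
    //    Caller must ensure value != 0.
    public static int FirstSetBit(ulong value)
    {
        return DeBruijnIndex[unchecked(((value & (~value + 1ul)) * 0x03F79D71B4CB0A89ul) >> 58)];
    }
}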
Yeah, but you use .NET 5 and that's the problem. Any fast C# engine that I've seen is written in .NET 5; in .NET 4 we don't have access to the new "native" stuff in .NET:

Just to name a few, in .NET 4 we don't have:
- Span<T>
- System.Runtime.Intrinsics.X86;

I have to stick with normal arrays, without any of the "fancy" .NET 5 features that Microsoft added in that version.

In .NET 4, bitwise operations seem to be extremely expensive:

A simple least-significant-bit index computed from a few bitwise operations accounts for 6% of the time?

[Image: profiler screenshot]

I have tried many things and everything from Visual Studio points to the same culprits: the bitwise operations and the "square attacked by" function (to detect checks) that I implemented as documented in the Chess Programming Wiki :?

Sadly I can't use .NET 5 because the framework I am using to build the game itself (the Unity engine) does not support .NET 5.
If you look at my last post, you have a solution that should speed up your isincheck function so that it takes approximately 2/3 of the time, I guess.
spirch
Posts: 95
Joined: Fri Nov 09, 2012 12:36 am

Re: Efficiently Generated Legal moves question

Post by spirch »

hgm wrote: Sun Jul 11, 2021 8:50 pm Or refrain from using bitboards...
That is why I went with 0x88 for mine a long, long time ago; I thought going with bitboards was a bad idea back then.

Now my 0x88 engine runs under .NET 5, and now I wish I were using bitboards.
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: Efficiently Generated Legal moves question

Post by pedrojdm2021 »

Pio wrote: Sun Jul 11, 2021 10:20 pm
pedrojdm2021 wrote: Sun Jul 11, 2021 10:16 pm
emadsen wrote: Sun Jul 11, 2021 9:59 pm
pedrojdm2021 wrote: Sun Jul 11, 2021 8:43 pm My conclusion on this is that possibly the main cause is the slowness of C# in .NET Framework 4.
You are jumping to conclusions way too quickly. C# and .NET are not to blame for your engine performing at 1/10th the speed it should. My engine is written in C# using magic bitboards. It searches 3.7 million NPS in the WAC test suite, counting nodes only in the Board.PlayMove method.

In the source code of my C# engine I define #if POPCOUNT to a) fall back to loops (via Value &= Value - 1ul) to count set bits and b) use a De Bruijn sequence to find the first set bit on older processors that lack the popcount instruction. I haven't done extensive testing, but in a few positions I examined, the popcount version was 1.25x faster than the non-popcount version. A 20% slowdown due to missing support for popcount cannot explain why your engine is an order of magnitude (1/10) slower than it could be. It's not due to the techniques used to count and locate bits.

In addition, I know Graham Banks uses the non-popcount version of MadChess when running tournaments. It's unlikely all CCRL testers use the non-popcount version of my engine (each tester has their own hardware); however, CCRL has rated MadChess at 2636 ELO. This aligns with my own testing on my home, popcount-capable PC (2638 ELO). My point being, yes, MadChess runs slower on non-popcount hardware, but so do most other engines. I haven't measured any detrimental effect of non-popcount hardware on my engine's playing strength.
Yeah, but you use .NET 5 and that's the problem. Any fast C# engine that I've seen is written in .NET 5; in .NET 4 we don't have access to the new "native" stuff in .NET:

Just to name a few, in .NET 4 we don't have:
- Span<T>
- System.Runtime.Intrinsics.X86;

I have to stick with normal arrays, without any of the "fancy" .NET 5 features that Microsoft added in that version.

In .NET 4, bitwise operations seem to be extremely expensive:

A simple least-significant-bit index computed from a few bitwise operations accounts for 6% of the time?

[Image: profiler screenshot]

I have tried many things and everything from Visual Studio points to the same culprits: the bitwise operations and the "square attacked by" function (to detect checks) that I implemented as documented in the Chess Programming Wiki :?

Sadly I can't use .NET 5 because the framework I am using to build the game itself (the Unity engine) does not support .NET 5.
If you look at my last post, you have a solution that should speed up your isincheck function so that it takes approximately 2/3 of the time, I guess.
Yeah, I am going to apply that and only generate legal moves :D
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: Efficiently Generated Legal moves question

Post by pedrojdm2021 »

spirch wrote: Sun Jul 11, 2021 10:26 pm
hgm wrote: Sun Jul 11, 2021 8:50 pm Or refrain from using bitboards...
That is why I went with 0x88 for mine a long, long time ago; I thought going with bitboards was a bad idea back then.

Now my 0x88 engine runs under .NET 5, and now I wish I were using bitboards.
Now I wish I had implemented my engine as a 0x88 engine :lol: but I wanted something fancy, and here I am.
emadsen
Posts: 434
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Re: Efficiently Generated Legal moves question

Post by emadsen »

pedrojdm2021 wrote: Sun Jul 11, 2021 10:16 pm Yeah, but you use .NET 5 and that's the problem. Any fast C# engine that I've seen is written in .NET 5; in .NET 4 we don't have access to the new "native" stuff in .NET:

Just to name a few, in .NET 4 we don't have:
- Span<T>
- System.Runtime.Intrinsics.X86;
OK Pedro, you know best. Yet you seem to have missed the entire point of my post. Span<T> and System.Runtime.Intrinsics are not magic bullets that get you a 10x speed boost. I assume you're new to chess programming. You've been asking questions on this forum for what, two weeks? Now you're tossing away, as too slow, a programming language and runtime (or at least version 4 of the .NET runtime) that's supported by 20+ years of engineering effort from a pioneering tech company. Doesn't that strike you as a bit rash and out of proportion?

You're relying too much on profiler CPU metrics. Time In Method percentages will not indicate fundamental flaws in data structures, memory allocation, or software algorithms. TIM metrics are much more useful when you have recorded a baseline performance profile, you refactor or enhance code, then you observe a performance regression (slowdown). You can compare the TIM metrics from the previous version to the new version to determine if the performance of specific methods has degraded. If so, those methods are culprits. Code changes within them (or the methods underneath them on the call stack) may have wrecked performance.

You can't just look at TIM percentages and declare that a method with a high CPU% indicates a flaw in the underlying programming language or runtime. For example, if you profile a strong chess engine, you're likely to find high CPU% for the quiescence search and evaluation methods. Should we eliminate QSearch & Eval to speed up the engine? Of course not. So the high CPU% metric for those methods does not indicate a performance problem.

The absolute value of program run time (or time-to-depth in a chess engine search) is the metric that matters. Program run time is overwhelmingly influenced by proper choice of data structures, by light memory management (avoiding churn via excessive allocation and GC), by efficient algorithms, and by bug elimination (a never-ending quest). Even the run time and time-to-depth metrics are fraught with subtlety and require context to understand. You can make a chess engine search very fast but very stupidly by carelessly reducing and pruning too many moves.

It is entirely possible that a code refactor that better aligns data structures to the data access patterns of a chess engine, or reduces memory allocations, or improves search efficiency (spending more time on critical positions and less time on unprofitable continuations) actually increases one of the method CPU percentages you're fixated on, yet significantly reduces the absolute value of program run time (or time-to-depth), thereby increasing the playing strength of the engine.

I'm repeating myself, but for emphasis: high CPU percentages are relevant when the same method's CPU% was low in a previous baseline performance profile and increased significantly in a new version. The metric provides a lead to follow when investigating performance degradations introduced by newly committed code. The metric rarely provides much value when interpreted a) in isolation or b) compared to the same metric from a vastly different codebase.
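To make that concrete, here is a minimal sketch of measuring the absolute metric (perft speed) with a Stopwatch. The Board, GenerateMoves, PlayMove and UndoMove members are assumed for the example and are not any particular engine's API.

Code: Select all

using System;
using System.Diagnostics;

public static class PerftBenchmark
{
    public static void Run(Board board, int depth)
    {
        var stopwatch = Stopwatch.StartNew();
        long nodes = Perft(board, depth);
        stopwatch.Stop();

        var seconds = stopwatch.Elapsed.TotalSeconds;
        Console.WriteLine($"perft({depth}) = {nodes} nodes in {seconds:F2} s ({nodes / seconds:N0} NPS)");
    }

    private static long Perft(Board board, int depth)
    {
        if (depth == 0) return 1;
        long nodes = 0;
        foreach (var move in board.GenerateMoves())
        {
            if (!board.PlayMove(move)) continue; // skip moves that leave the king in check
            nodes += Perft(board, depth - 1);
            board.UndoMove(move);
        }
        return nodes;
    }
}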
My C# chess engine: https://www.madchess.net
spirch
Posts: 95
Joined: Fri Nov 09, 2012 12:36 am

Re: Efficiently Generated Legal moves question

Post by spirch »

emadsen wrote: Mon Jul 12, 2021 12:05 am
pedrojdm2021 wrote: Sun Jul 11, 2021 10:16 pm Yeah, but you use .NET 5 and that's the problem. Any fast C# engine that I've seen is written in .NET 5; in .NET 4 we don't have access to the new "native" stuff in .NET:

Just to name a few, in .NET 4 we don't have:
- Span<T>
- System.Runtime.Intrinsics.X86;
OK Pedro, you know best. Yet you seem to have missed the entire point of my post. Span<T> and System.Runtime.Intrinsics are not magic bullets that get you a 10x speed boost. I assume you're new to chess programming. You've been asking questions on this forum for what, two weeks? Now you're tossing away, as too slow, a programming language and runtime (or at least version 4 of the .NET runtime) that's supported by 20+ years of engineering effort from a pioneering tech company. Doesn't that strike you as a bit rash and out of proportion?

You're relying too much on profiler CPU metrics. Time In Method percentages will not indicate fundamental flaws in data structures, memory allocation, or software algorithms. TIM metrics are much more useful when you have recorded a baseline performance profile, you refactor or enhance code, then you observe a performance regression (slowdown). You can compare the TIM metrics from the previous version to the new version to determine if the performance of specific methods has degraded. If so, those methods are culprits. Code changes within them (or the methods underneath them on the call stack) may have wrecked performance.

You can't just look at TIM percentages and declare that a method with a high CPU% indicates a flaw in the underlying programming language or runtime. For example, if you profile a strong chess engine, you're likely to find high CPU% for the quiescence search and evaluation methods. Should we eliminate QSearch & Eval to speed up the engine? Of course not. So the high CPU% metric for those methods does not indicate a performance problem.

The absolute value of program run time (or time-to-depth in a chess engine search) is the metric that matters. Program run time is overwhelmingly influenced by proper choice of data structures, by light memory management (avoiding churn via excessive allocation and GC), by efficient algorithms, and by bug elimination (a never-ending quest). Even the run time and time-to-depth metrics are fraught with subtlety and require context to understand. You can make a chess engine search very fast but very stupidly by carelessly reducing and pruning too many moves.

It is entirely possible that a code refactor that better aligns data structures to the data access patterns of a chess engine, or reduces memory allocations, or improves search efficiency (spending more time on critical positions and less time on unprofitable continuations) actually increases one of the method CPU percentages you're fixated on, yet significantly reduces the absolute value of program run time (or time-to-depth), thereby increasing the playing strength of the engine.

I'm repeating myself, but for emphasis: high CPU percentages are relevant when the same method's CPU% was low in a previous baseline performance profile and increased significantly in a new version. The metric provides a lead to follow when investigating performance degradations introduced by newly committed code. The metric rarely provides much value when interpreted a) in isolation or b) compared to the same metric from a vastly different codebase.
I totally agree with this. Look at the implementation, see if you can optimize it, refactor, refactor, refactor... OK, I hope you understand 8-)

Also, in another thread I suggested that you try the trial version of dotTrace instead of the built-in Visual Studio profiler; if you didn't, try it. I personally don't like the built-in one, the trial is free, go play with it. Try tracing profiling, not sampling. This could help you figure out if some methods are called too often. Untick "allow inlining".

I'm going to link this post of mine from a few months ago: http://talkchess.com/forum3/viewtopic.p ... 10#p886029
Focus on the screenshot, and feel free to look at the whole thread.

In there I mention that my perft went from 3.5M to 21M because of an error in my code.

Today the same perft does between 35M and 36M. What did I do? Some pretty minor refactoring.

(I should stop spending time on 0x88 and start the damn bitboard engine, haha)
spirch
Posts: 95
Joined: Fri Nov 09, 2012 12:36 am

Re: Efficiently Generated Legal moves question

Post by spirch »

Also, I hope you are running your actual tests in Release mode and not Debug mode.
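As a quick safeguard, here is a small sketch (illustrative, not from any engine in this thread) that makes the build configuration visible at startup; DEBUG is the symbol Visual Studio defines by default for Debug builds.

Code: Select all

using System;

public static class BuildCheck
{
    public static void WarnIfDebugBuild()
    {
#if DEBUG
        // Compiled in only for Debug builds; Release builds omit this entirely.
        Console.WriteLine("WARNING: Debug build - timings and NPS numbers will be misleading.");
#endif
    }
}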
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: Efficiently Generated Legal moves question

Post by pedrojdm2021 »

spirch wrote: Mon Jul 12, 2021 12:50 am
emadsen wrote: Mon Jul 12, 2021 12:05 am
pedrojdm2021 wrote: Sun Jul 11, 2021 10:16 pm Yeah, but you use .NET 5 and that's the problem. Any fast C# engine that I've seen is written in .NET 5; in .NET 4 we don't have access to the new "native" stuff in .NET:

Just to name a few, in .NET 4 we don't have:
- Span<T>
- System.Runtime.Intrinsics.X86;
OK Pedro, you know best. Yet you seem to have missed the entire point of my post. Span<T> and System.Runtime.Intrinsics are not magic bullets that get you a 10x speed boost. I assume you're new to chess programming. You've been asking questions on this forum for what, two weeks? Now you're tossing away, as too slow, a programming language and runtime (or at least version 4 of the .NET runtime) that's supported by 20+ years of engineering effort from a pioneering tech company. Doesn't that strike you as a bit rash and out of proportion?

You're relying too much on profiler CPU metrics. Time In Method percentages will not indicate fundamental flaws in data structures, memory allocation, or software algorithms. TIM metrics are much more useful when you have recorded a baseline performance profile, you refactor or enhance code, then you observe a performance regression (slowdown). You can compare the TIM metrics from the previous version to the new version to determine if the performance of specific methods has degraded. If so, those methods are culprits. Code changes within them (or the methods underneath them on the call stack) may have wrecked performance.

You can't just look at TIM percentages and declare that a method with a high CPU% indicates a flaw in the underlying programming language or runtime. For example, if you profile a strong chess engine, you're likely to find high CPU% for the quiescence search and evaluation methods. Should we eliminate QSearch & Eval to speed up the engine? Of course not. So the high CPU% metric for those methods does not indicate a performance problem.

The absolute value of program run time (or time-to-depth in a chess engine search) is the metric that matters. Program run time is overwhelmingly influenced by proper choice of data structures, by light memory management (avoiding churn via excessive allocation and GC), by efficient algorithms, and by bug elimination (a never-ending quest). Even the run time and time-to-depth metrics are fraught with subtlety and require context to understand. You can make a chess engine search very fast but very stupidly by carelessly reducing and pruning too many moves.

It is entirely possible that a code refactor that better aligns data structures to the data access patterns of a chess engine, or reduces memory allocations, or improves search efficiency (spending more time on critical positions and less time on unprofitable continuations) actually increases one of the method CPU percentages you're fixated on, yet significantly reduces the absolute value of program run time (or time-to-depth), thereby increasing the playing strength of the engine.

I'm repeating myself, but for emphasis: high CPU percentages are relevant when the same method's CPU% was low in a previous baseline performance profile and increased significantly in a new version. The metric provides a lead to follow when investigating performance degradations introduced by newly committed code. The metric rarely provides much value when interpreted a) in isolation or b) compared to the same metric from a vastly different codebase.
I totally agree with this. Look at the implementation, see if you can optimize it, refactor, refactor, refactor... OK, I hope you understand 8-)

Also, in another thread I suggested that you try the trial version of dotTrace instead of the built-in Visual Studio profiler; if you didn't, try it. I personally don't like the built-in one, the trial is free, go play with it. Try tracing profiling, not sampling. This could help you figure out if some methods are called too often. Untick "allow inlining".

I'm going to link this post of mine from a few months ago: http://talkchess.com/forum3/viewtopic.p ... 10#p886029
Focus on the screenshot, and feel free to look at the whole thread.

In there I mention that my perft went from 3.5M to 21M because of an error in my code.

Today the same perft does between 35M and 36M. What did I do? Some pretty minor refactoring.

(I should stop spending time on 0x88 and start the damn bitboard engine, haha)
Yes, I have downloaded dotTrace and everything seems to be pointing more or less to the same code...

[Image: profiler screenshot]

It points to the same IsSquareAttacked, which seems heavy; inside it, the profiler points to:

5% on this GetBitboard. I thought extracting methods like this, when inlined, should not cause any overhead on the CPU, but it seems that it does, or the inlining is not being applied at all?

Code: Select all

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public ulong GetBitboard(byte _player , byte _pieceType)
        {
            return bitboards[_player] & bitboards[_pieceType];
        }
Another significant part it points to is this one, in GetQueenMoves:

Code: Select all

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static ulong GetQueenMoves(byte _square , ulong _occupancy)
        {
            return GetBishopMoves(_square , _occupancy) | GetRookMoves(_square , _occupancy);
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static ulong GetBishopMoves(byte _square , ulong _occupancy)
        {
            _occupancy &= BISHOP_MASKS[_square];
            _occupancy *= Magics.BISHOP_MAGICS[_square];
            _occupancy >>= (64 - BISHOP_MOVE_COUNT[_square]);
            // magic bitboard lookup (square)(magic index)
            return BISHOP_MOVES[_square][_occupancy];
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static ulong GetRookMoves(byte _square, ulong _occupancy)
        {
            _occupancy &= ROOK_MASKS[_square];
            ........
You can see it is only a combination of the two magic bitboard lookups, for the rook and the bishop, and that is heavy according to the profiler...
So all the profiling work I've put in here just points to the same thing, but I don't see anything wrong with it :?

I am using bitboards, I am only returning values from some arrays and applying some bitwise operations to get the attacks, and I am using inlining, so combining the bishop and rook values to get the queen moves should not cause any issue.
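For comparison, this is the typical shape of a magic-bitboard "square attacked by" test, in the spirit of the Chess Programming Wiki version; it is only a sketch, not Pedro's code. It reuses the GetBitboard and Get*Moves helpers shown above, and assumes precomputed PAWN_ATTACKS, KNIGHT_ATTACKS and KING_ATTACKS tables, piece-type constants (PAWN, KNIGHT, ...), and a 0/1 side encoding, all of which are illustrative names.

Code: Select all

public bool IsSquareAttacked(byte _square, byte _attacker, ulong _occupancy)
{
    // Pawns: pawn attacks from the target square for the defending side,
    // intersected with the attacker's pawns (_attacker ^ 1 assumes sides are 0 and 1).
    if ((PAWN_ATTACKS[_attacker ^ 1][_square] & GetBitboard(_attacker, PAWN)) != 0) return true;

    // Knights and kings: plain table lookups.
    if ((KNIGHT_ATTACKS[_square] & GetBitboard(_attacker, KNIGHT)) != 0) return true;
    if ((KING_ATTACKS[_square] & GetBitboard(_attacker, KING)) != 0) return true;

    // Sliders: one magic lookup per line type; queens are covered by both.
    ulong diagonalAttacks = GetBishopMoves(_square, _occupancy);
    if ((diagonalAttacks & (GetBitboard(_attacker, BISHOP) | GetBitboard(_attacker, QUEEN))) != 0) return true;

    ulong straightAttacks = GetRookMoves(_square, _occupancy);
    if ((straightAttacks & (GetBitboard(_attacker, ROOK) | GetBitboard(_attacker, QUEEN))) != 0) return true;

    return false;
}

Each call is only a handful of table lookups and bitwise ANDs, so when a profiler attributes a lot of time here it is usually because the function is called very often, not because the individual operations are slow.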
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Efficiently Generated Legal moves question

Post by lithander »

Keep in mind that Pedro is using Unity. Unity has C# scripting, but that doesn't mean the performance characteristics of code written for Unity are really comparable to C# written for the CLR (the runtime everyone else uses). In Unity you have two choices. Either the C# code gets compiled to IL code, and then the IL code gets converted to C++ and finally compiled to a native, platform-specific binary. This is called IL2CPP, and it is potentially fast, but not exactly comparable to IL code being run and JIT-compiled in a high-performance, managed runtime like today's CLR.

The alternative to IL2CPP is to use Mono. They call their fork MonoBleedingEdge, which is very funny considering that Mono in itself isn't exactly "bleeding edge" but can almost be considered obsolete since Microsoft made .NET open source. And their Mono fork is also about two years behind the latest upstream code. So it's bleeding-edge in a very painful, sarcastic way, maybe. The biggest issue with the Mono scripting backend is that the garbage collector is just bad. A managed language relies on a runtime that does decent automatic memory management, and Unity struggles with providing that.

And Pedro, to be honest I'm not even sure how you're using the Microsoft Visual Studio profiler effectively to profile a Unity application. Or is your chess code a standalone library that you're just planning to use within Unity, but you develop (and profile) it outside of Unity and run it in Microsoft's CLR when you profile it?
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: Efficiently Generated Legal moves question

Post by pedrojdm2021 »

lithander wrote: Mon Jul 12, 2021 1:22 am Keep in mind that Pedro is using Unity. Unity has C# scripting, but that doesn't mean the performance characteristics of code written for Unity are really comparable to C# written for the CLR (the runtime everyone else uses). In Unity you have two choices. Either the C# code gets compiled to IL code, and then the IL code gets converted to C++ and finally compiled to a native, platform-specific binary. This is called IL2CPP, and it is potentially fast, but not exactly comparable to IL code being run and JIT-compiled in a high-performance, managed runtime like today's CLR.

The alternative to IL2CPP is to use Mono. They call their fork MonoBleedingEdge, which is very funny considering that Mono in itself isn't exactly "bleeding edge" but can almost be considered obsolete since Microsoft made .NET open source. And their Mono fork is also about two years behind the latest upstream code. So it's bleeding-edge in a very painful, sarcastic way, maybe. The biggest issue with the Mono scripting backend is that the garbage collector is just bad. A managed language relies on a runtime that does decent automatic memory management, and Unity struggles with providing that.

And Pedro, to be honest I'm not even sure how you're using the Microsoft Visual Studio profiler effectively to profile a Unity application. Or is your chess code a standalone library that you're just planning to use within Unity, but you develop (and profile) it outside of it using the Microsoft CLR?
I have two "projects".

Look at this as a frontend/backend-like structure of a website:

Frontend: the Unity implementation

Backend: a C# chess engine .dll built on .NET Framework 4.8 (the highest version of .NET that Unity currently supports)

For the C# chess engine I do have a console application to profile, debug, and test the engine before I import it into Unity as a .dll plugin.

In Unity things can get even slower due to Unity's implementation of the C# layer. It is weird, but C# in Unity always runs slower, no matter whether you allocate garbage or not.