Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release

dangi12012 · Post by **dangi12012** » Wed Oct 06, 2021 6:39 pm

I would like to release the sourcecode to the fastest Chess movegenerator I was able to write during the last 2½ years.
It is able to generate 2 Billion Moves per Thread per Second. No Hashing is used - and it still is single threaded.

What makes it so fast is summarized below - but in short: Make/Unmake and a Movelist is not needed and 2x slower than a visitor pattern - which is even more powerful and faster and implemented in this movegen. Also expanding the enumeration template for color to include enpassant and all castling squares helps a lot!

Summary:
Explanation Article: https://www.codeproject.com/Articles/53 ... egenerator
Sourcecode: https://github.com/Gigantua/Gigantua
Linux/Windows Binary: https://www.codeproject.com/KB/Articles ... nLinux.zip

It is a Bitboard movegenerator. As it turns out incremental bitboards are slower for this type of generator. Remembering previous from | to square is more expensive than recalculating many of the moves. It is possible to add that to the sourcecode easily and maybe I will merge the exploration branches in the future.

It uses heavy template instantiation with a template inlined "callback" function. What makes this so fast is that you dont need a movelist anymore for this approach. When a legal move is encountered it instantly is decided what to do with it. Expand more - Store - Or sort into a list of good moves. All possible with this approach.

The pin and checkmask is what makes bitboards so good. With a handful of & operations all moves are pruned to only cointain legal moves. An I mean all moves. Even the most ugliest of pinned - prevention of promotion - or check double check pin combos are almost free and certainly branchless!

Also what I think no one implemented before me: Enpassant and Castling are moved to a Boardstate Template. You can input any fen - the appropriate template is called with a single switch and off you go enumerating. The cost of a single switch at the very start of enumerating saves billions of noop instructions for castling / enpassant. The Boardstate contains: Moving Color, HasEnpassant, 4 Castling Squares. Which makes the C++ compiler emit 64 functions for all 6 independent flags.

The beauty of this is that the current color implemented as a template removes all lookups like pawn move direction etc. Code checks for Castling are only in the assembly if it is possible to castle. Once a Rook / King moves - Not a single instruction is used for castling. Not even a single if. It gets if constexpr pruned away.

The reciever for all possible moves are 8 Methods that are named after each piece and movetype - and the from - to squares are there for you in bitboard form. So its almost free to count all castling moves etc. It wont even add a single if instruction. This approach needs no movelist at all - but you can still store the move inside the callback.

What I would like for you the reader to do: Try to find any mistakes or slow stuff in my code. Tear it apart - think about it - ideally comment below with ideas how to make it faster - or better send a pull request into the github repo!
https://github.com/Gigantua/Gigantua

Compile with Visual studio or Linux - Clang -march=native -std=c++20 -lstdc++ -O3 Gigantua.cpp -flto -o giga_clang

Greetings! - Daniel Inführ

References:
http://www.talkchess.com/forum3/viewtop ... =7&t=78230

Perft Start 7: 3195901860 2191ms 1458.31 MNodes/s
Perft Kiwi 6: 8031647685 4004ms 2005.51 MNodes/s
Perft aggregate: 18999768562 10149ms 1871.94 MNodes/s

klx · Post by **klx** » Wed Oct 06, 2021 7:53 pm

That's nice, thanks for releasing the source.

So... are you going to credit me or what? Lol I literally see my code from this post in your repo (this line and onwards), after which you suddenly claimed a 30% performance boost overnight in spite of having worked on this for 2.5 years =)

dangi12012 · Post by **dangi12012** » Wed Oct 06, 2021 8:07 pm

I shall credit you for a faster bitloop. - Unrolled switch is 80% slower = 1/5th of the speed because branching on popcnt is not predictable. changing from while to for was a good idea.

It was 5% faster and as you can read in the source - there are still many while loops - because the for loop is slower when you actually need the source square as a single bit for the bitboard. 1ull<<sq is slower there.

Now go forth and find more optimisations i missed - it was a good tip!

abulmo2 · Post by **abulmo2** » Thu Oct 07, 2021 1:25 am

dangi12012 wrote: ↑Wed Oct 06, 2021 6:39 pm Perft Start 7: 3195901860 2191ms 1458.31 MNodes/s
Perft Kiwi 6: 8031647685 4004ms 2005.51 MNodes/s
Perft aggregate: 18999768562 10149ms 1871.94 MNodes/s

I know it is supposed to run at high speed only on recent AMD or Intel CPU, but the number I get on my (not so) old Ryzen 1700x are very far from yours:
(Run under Fedora, after compiling it with clang 12)

Code: Select all

Perft Start 7: 3195901860 21051ms 151.813 MNodes/s
Perft Kiwi 6: 8031647685 39598ms 202.825 MNodes/s
Perft Midgame 6: 6923051137 35298ms 196.126 MNodes/s
Perft Endgame 6: 849167880 2911ms 291.63 MNodes/s
Perft aggregate: 18999768562 101618ms 186.971 MNodes/s

My own classical perft with bulk counting is faster:

Code: Select all

perft  7 :      3195901860 leaves in      5.754 s    555431950 leaves/s
perft  6 :      8031647685 leaves in     12.853 s    624892362 leaves/s
perft  6 :      6923051137 leaves in      8.334 s    830732235 leaves/s
perft  6 :       849167880 leaves in      1.454 s    584123663 leaves/s
So an aggregate time of 28.395 s.

The compilation time of your perft is also impressive. About 10 minutes on my computer. I guess you are using PEXT instruction, which are fast on ryzen 5000 but slow on previous version. You should maybe provide an alternative code without PEXT for older CPU.

klx · Post by **klx** » Thu Oct 07, 2021 2:04 am

abulmo2 wrote: ↑Thu Oct 07, 2021 1:25 am I know it is supposed to run at high speed only on recent AMD or Intel CPU, but the number I get on my (not so) old Ryzen 1700x are very far from yours

That's expected. You basically can't use PEXT on Zen or Zen 2. It could clog up your whole system, as in your example.

abulmo2 wrote: ↑Thu Oct 07, 2021 1:25 am You should maybe provide an alternative code without PEXT for older CPU.

It's already configurable.

Code: Select all

//This is the most important definition for performance!
namespace Lookup = Chess_Lookup::Lookup_Pext;

dangi12012 · Post by **dangi12012** » Thu Oct 07, 2021 10:17 am

abulmo2 wrote: ↑Thu Oct 07, 2021 1:25 am The compilation time of your perft is also impressive. About 10 minutes on my computer. I guess you are using PEXT instruction, which are fast on ryzen 5000 but slow on previous version. You should maybe provide an alternative code without PEXT for older CPU.

What you can do on Zen1/2 is have your compiler emit the PEXT emulation. That will be faster than the horrible horrbile Zen1/2 microcode:

Code: Select all

Movemap.hpp - struct SliderPext_t
	_ForceInline constexpr uint64_t operator[](const uint64_t blocker) const
	{
		if (std::is_constant_evaluated()) {
			return AttackPtr[_pext_u64_emulated(blocker, Mask)];
		}
		else {
			return AttackPtr[_pext_u64(blocker, Mask)];
		}
	}

Just make this - return AttackPtr[_pext_u64_emulated(blocker, Mask)];
Should be much faster on Zen2.

As for compiletime you can define _ForceInline as whitespace. - much faster compile - much slower executable.

dangi12012 · Post by **dangi12012** » Thu Oct 07, 2021 11:23 am

We did it guys! Top post on Reddit C++.
Nice

Rein Halbersma · Post by **Rein Halbersma** » Sat Oct 09, 2021 2:01 pm

Your use of macros starting with an underscore and a capital letter (such as _Compiletime) is undefined behavior in C++. These are all reserved for the C++ Standard Library. Use them at your own risk.

AndrewGrant · Post by **AndrewGrant** » Sat Oct 09, 2021 8:55 pm

https://github.com/ankan-ban/perft_gpu

Steve Maughan · Post by **Steve Maughan** » Sat Oct 09, 2021 9:01 pm

This is awesome - congrats! I've always maintained that Perft is something worth optimizing. If your engine has a good perft speed you can probably add a decent evaluation too.

– Steve

Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release

Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release

Re: Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release

Re: Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release

Re: Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release

Re: Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release

Re: Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release

Re: Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release

Re: Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release

Re: Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release

Re: Gigantua: 2 Gigamoves per Second per Core move generator - Sourcecode Release