New Engine announcement: Pygmalion

klx · Post by **klx** » Sun Sep 05, 2021 11:14 am

R. Tomasi wrote: ↑Thu Sep 02, 2021 9:28 pm For lengths up to 32 it will use parallel sorting networks ...

What does your code look like that chooses the sorting network function to call?

smatovic · Post by **smatovic** » Sun Sep 05, 2021 11:55 am

R. Tomasi wrote: ↑Thu Sep 02, 2021 3:00 pm ...
Over the last years the "engine" was rewritten several times and ported to C++. I even played around with an OpenCL version in the hopes of harnessing the power of modern GPUs, but (as you may already guess) failed miserably at efficiently parallelizing the search in a way that fits these vector-based architectures of the GPUs. About two years ago I had a version that was playing reasonably strong (it consistently beat the FairyMax version bundled with Winboard at the time), but it was buggy, full of search instabilities and generally not anything I would show around without being ashamed of my work.
...

Curious, there are not that many OpenCL engines out there. What kind of design did you choose? Did you couple multiple gpu-threads to one worker? What kind of move generation?

--
Srdja

R. Tomasi · Post by **R. Tomasi** » Sun Sep 05, 2021 11:58 am

klx wrote: ↑Sun Sep 05, 2021 11:14 am
R. Tomasi wrote: ↑Thu Sep 02, 2021 9:28 pm For lengths up to 32 it will use parallel sorting networks ...
What does your code look like that chooses the sorting network function to call?

Code: Select all

template<typename VALUE, typename SCORE>
class sortAlgorithm
{
private:
/* 
	[...]
*/
	typedef void SORTFUNCTION(VALUE* pValues, SCORE* pScores);
	constexpr static inline SORTFUNCTION* m_Sort[31]
	{
		&sort_N2,
		&sort_N3,
		&sort_N4,
		&sort_N5,
		&sort_N6,
		&sort_N7,
		&sort_N8,
		&sort_N9,
		&sort_N10,
		&sort_N11,
		&sort_N12,
		&sort_N13,
		&sort_N14,
		&sort_N15,
		&sort_N16,
		&sort_N17,
		&sort_N18,
		&sort_N19,
		&sort_N20,
		&sort_N21,
		&sort_N22,
		&sort_N23,
		&sort_N24,
		&sort_N25,
		&sort_N26,
		&sort_N27,
		&sort_N28,
		&sort_N29,
		&sort_N30,
		&sort_N31,
		&sort_N32
	};
	constexpr static size_t sort_tail() noexcept
	{
		return 32;
	}
public:
	constexpr static void sortValues(VALUE* pValues, SCORE* pScores, const size_t length) noexcept
	{
		if (length > 1)
		{
			if (length <= sort_tail())
			{
				m_Sort[length - 2](pValues, pScores);
			}
			else
			{
				quickSort(0, length - 1, pValues, pScores);
			}
		}
	}
}

R. Tomasi · Post by **R. Tomasi** » Sun Sep 05, 2021 12:05 pm

smatovic wrote: ↑Sun Sep 05, 2021 11:55 am
R. Tomasi wrote: ↑Thu Sep 02, 2021 3:00 pm ...
Over the last years the "engine" was rewritten several times and ported to C++. I even played around with an OpenCL version in the hopes of harnessing the power of modern GPUs, but (as you may already guess) failed miserably at efficiently parallelizing the search in a way that fits these vector-based architectures of the GPUs. About two years ago I had a version that was playing reasonably strong (it consistently beat the FairyMax version bundled with Winboard at the time), but it was buggy, full of search instabilities and generally not anything I would show around without being ashamed of my work.
...
Curious, there are not that many OpenCL engines out there. What kind of design did you choose? Did you couple multiple gpu-threads to one worker? What kind of move generation?

--
Srdja

My Idea was to break down the processing of a node into different phases (prologue, make move, returned from move, loop next move, epilogue) and have a kernel that can process such a phase. Then in essence process all moves of a node in a vector-parallel fashion using these kernels. Trouble is, that depending on what phase the kernel is in it will take different time to execute it, such that all moves always wait for the one that takes the longest to process. Also, the different threads branch differently, which is a huge performance hit on a GPU (you typically want neighbouring workers to branch the same way).
Concerning the move generation: well, I simply ported my C++ code to C99 and replaced anything recursive by equivalent iterations (since you cannot do recursion on the GPU).
It simply did not work out the way I hoped. Although it did play chess, it became relatively quickly clear that this design is totally inefficient.

Edit: I may have the code lying around somewhere. If you're interested I can try to dig it out, but I don't want to promise you that I still have it - I may have lost that code. I'm not sure.

smatovic · Post by **smatovic** » Sun Sep 05, 2021 12:31 pm

R. Tomasi wrote: ↑Sun Sep 05, 2021 12:05 pm My Idea was to break down the processing of a node into different phases (prologue, make move, returned from move, loop next move, epilogue) and have a kernel that can process such a phase. Then in essence process all moves of a node in a vector-parallel fashion using these kernels. Trouble is, that depending on what phase the kernel is in it will take different time to execute it, such that all moves always wait for the one that takes the longest to process. Also, the different threads branch differently, which is a huge performance hit on a GPU (you typically want neighbouring workers to branch the same way).
Concerning the move generation: well, I simply ported my C++ code to C99 and replaced anything recursive by equivalent iterations (since you cannot do recursion on the GPU).
It simply did not work out the way I hoped. Although it did play chess, it became relatively quickly clear that this design is totally inefficient.

Edit: I may have the code lying around somewhere. If you're interested I can try to dig it out, but I don't want to promise you that I still have it - I may have lost that code. I'm not sure.

Not sure, maybe I had a similar idea, I called it SPPS, simple parallel processing scheme, couple one work-group, 128 threads, to one worker, process up to 128 moves of a node in parallel, store the moves as tree in a way that you can process the moves in parallel. SPPS was able to play chess, but IIRC I had to realize that I am wasting most of the threads during Quiescence-Search and such an approach can not work out...

--
Srdja

R. Tomasi · Post by **R. Tomasi** » Sun Sep 05, 2021 12:36 pm

smatovic wrote: ↑Sun Sep 05, 2021 12:31 pm
R. Tomasi wrote: ↑Sun Sep 05, 2021 12:05 pm My Idea was to break down the processing of a node into different phases (prologue, make move, returned from move, loop next move, epilogue) and have a kernel that can process such a phase. Then in essence process all moves of a node in a vector-parallel fashion using these kernels. Trouble is, that depending on what phase the kernel is in it will take different time to execute it, such that all moves always wait for the one that takes the longest to process. Also, the different threads branch differently, which is a huge performance hit on a GPU (you typically want neighbouring workers to branch the same way).
Concerning the move generation: well, I simply ported my C++ code to C99 and replaced anything recursive by equivalent iterations (since you cannot do recursion on the GPU).
It simply did not work out the way I hoped. Although it did play chess, it became relatively quickly clear that this design is totally inefficient.

Edit: I may have the code lying around somewhere. If you're interested I can try to dig it out, but I don't want to promise you that I still have it - I may have lost that code. I'm not sure.
Not sure, maybe I had a similar idea, I called it SPPS, simple parallel processing scheme, couple one work-group, 128 threads, to one worker, process up to 128 moves of a node in parallel, store the moves as tree in a way that you can process the moves in parallel. SPPS was able to play chess, but IIRC I had to realize that I am wasting most of the threads during Quiescence-Search and such an approach can not work out...

--
Srdja

Sounds like a similar concept, yes. Anyways, I dumped that idea relatively quickly. I think there still are ways to use the GPU in a chess engine, but imho it would have to be some sort of hybrid-aproach.

R. Tomasi · Post by **R. Tomasi** » Sun Sep 05, 2021 3:11 pm

pygmalion-mechanics

This module provides the framework for defining moves and how they are supposed to be made/unmade. It builds on top of pygmalion-state. There is no move generation or legality checks going on, that's all in the pygmalion-dynamics module which builds on top of pygmalion-mechanics. The mechanics module is first and foremost responsible for describing what exactly a certain movetype is supposed to do with the board. It also provides everything needed to keep the move history of a game and bloomfilters for repetition checks.

As already done with the pygmalion-state module it is configured using a description template:

Code: Select all

template<typename MOVE, size_t COUNT_BITS_BLOOMFILTER, size_t COUNT_VALUES_BLOOMFILTER>
class descriptor_mechanics;

The template arguments have the following semantics:

COUNT_BITS_BLOOMFILTER defines how many bits of a position hash should be used for the repetition bloomfilter.
COUNT_VALUES_BLOOMFILTER defines the range of the individual entries in the bloomfilter. Essentially this should be larger than the maximum number of moves a game can have, which obviously depends on the rules of the boardgame that is targetted.
MOVE is the class that defines a move of the game. It must inherit from the move template/class that is also defined in the pygmalion-mechanics module. Afaik there is no elegant way in C++17 to enforce template arguments inheriting a certain base class. I think C++20 provides ways to do that, but I have not really looked into that, so don't take me at face-value here.

That base template/class for all movetypes is defined as follows:

Code: Select all

template<typename BOARD, size_t COUNT_BITS, typename MOVEDATA, typename INSTANCE>
class move
{
public:
	using instanceType = INSTANCE;
	using boardType = BOARD;
	using descriptorState = typename boardType::descriptorState;
#include <pygmalion-state/include_state.h>
	constexpr static const size_t countBits{ COUNT_BITS };
	using movebitsType = uint_t<countBits, false>;
	using movedataType = MOVEDATA;
protected:
	constexpr move() noexcept = default;
	constexpr move(move&&) noexcept = default;
	constexpr move(const move&) noexcept = default;
	constexpr move& operator=(move&&) noexcept = default;
	constexpr move& operator=(const move&) noexcept = default;
	~move() noexcept = default;
public:
	std::string name() const noexcept
	{
		return static_cast<const instanceType*>(this)->name_Implementation();
	}
	constexpr movedataType doMove(boardType& position, const movebitsType& moveBits) const noexcept
	{
		return static_cast<const instanceType*>(this)->doMove_Implementation(position, moveBits);
	}
	constexpr void undoMove(boardType& position, const movedataType& data) const noexcept
	{
		static_cast<const instanceType*>(this)->undoMove_Implementation(position, data);
	}
	bool parse(const boardType& position, std::string& text, movebitsType& moveBits) const noexcept
	{
		return static_cast<const instanceType*>(this)->parse_Implementation(position, text, moveBits);
	}
	std::string toString(const boardType& position, const movebitsType& moveBits) const noexcept
	{
		return static_cast<const instanceType*>(this)->toString_Implementation(position, moveBits);
	}
	constexpr squaresType otherOccupancyDelta(const boardType& position, const movebitsType& moveBits) const noexcept
	{
		return static_cast<const instanceType*>(this)->otherOccupancyDelta_Implementation(position, moveBits);
	}
	constexpr squaresType ownOccupancyDelta(const boardType& position, const movebitsType& moveBits) const noexcept
	{
		return static_cast<const instanceType*>(this)->ownOccupancyDelta_Implementation(position, moveBits);
	}
	constexpr squareType fromSquare(const boardType& position, const movebitsType& moveBits) const noexcept
	{
		return static_cast<const instanceType*>(this)->fromSquare_Implementation(position, moveBits);
	}
	constexpr squareType toSquare(const boardType& position, const movebitsType& moveBits) const noexcept
	{
		return static_cast<const instanceType*>(this)->toSquare_Implementation(position, moveBits);
	}
};

Again it's using the CRTP for static polymorphism. The fromSquare and toSquare methods probably should be moved from this class into the inheriting class, since for some movetypes (like nullmove) these may not be well-defined. I plan to do that eventually, but for the moment I could not be bothered tidying this up. It's on the to-do list, though.

Building on top of that base-class the module defines some movetypes which one typically encounters in boardgames:

transportmove : a move that just transports a piece from one square to another
capturemove: a move that transports a piece from one square to another and removes the piece that occupies the target square
promotionmove: a move that removes the piece from it's initial square and drops a different piece on the target square
promocapturemove: a move that removes the piece from it's initial square and drops a different piece on the target square after removing the piece that occupies the target square
dropmove: a move that just spawns a piece on a square
killmove: a move that removes a piece from a square
nullmove: does nothing, except from changing which player is on the move

These classes will provide the framework with information like for example how many bits they require to store the information that is encoded in a move of that type. If the boardgame requires anything fancy on top of that (for example en-passant or castling in chess) one will have to derive a specialized class for that from the base move class.

Now, since typically you will need more than one of these movetypes in a boardgame (in chess for example you need transport, captures, promotions, promocaptures) another class that inherits from move is provided: disjunctive move. What it does is multiplexing different movetypes into a single movetype.

In chess, for example, my final move class looks like that:

Code: Select all

class combinedmoves :
		public pygmalion::mechanics::disjunctivemove<board, queenpromocapturemove, queenpromotionmove, rookpromocapturemove, rookpromotionmove, bishoppromocapturemove, bishoppromotionmove, knightpromocapturemove, knightpromotionmove, doublepushmove, enpassantmove, kingsidecastlemove, queensidecastlemove, capturemove, quietmove, nullmove>

This turns out to be a quite compact way of storing moves, for chess a move will require 16bit storage size and it could even be extended with movetypes for offering/accepting draws without increasing the storage size.

tcusr · Post by **tcusr** » Sun Sep 05, 2021 4:13 pm

you can do a static_assert and use std::is_base_of to verify a class inherits from another one

R. Tomasi · Post by **R. Tomasi** » Sun Sep 05, 2021 4:20 pm

tcusr wrote: ↑Sun Sep 05, 2021 4:13 pm you can do a static_assert and use std::is_base_of to verify a class inherits from another one

You're right! Thanks for pointing that out. I guess I'll incorporate static asserts of that kind. It's more or less a cosmetic thing, since when the inheritance does not fit, it will not compile anyways. But it's better to have one static assert failing instead of like 100 compiler errors without clear indication of what went wrong.

R. Tomasi · Post by **R. Tomasi** » Mon Sep 13, 2021 1:00 am

Source code is now available on github under GPLv3 license:

https://github.com/rtomasi75/pygmalion

New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion

Re: New Engine announcement: Pygmalion