Back To The Beginning

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

fabianVDW
Posts: 146
Joined: Fri Mar 15, 2019 8:46 pm
Location: Germany
Full name: Fabian von der Warth

Re: Back To The Beginning

Post by fabianVDW »

Joost Buijs wrote: Sat Sep 07, 2019 12:57 pm
fabianVDW wrote: Sat Sep 07, 2019 12:21 pm AFAIK, movegeneration speed is a good chunk of the runtime in modern engines. Atleast in FabChess, it is about 25-30% of the time used, being on par with the evaluation function (IIRC).
25% to 30% of the time used by the move generation is quite a lot. Maybe it depends upon staged move generation or not and whether you execute the hash move with or without move generation, when I profile my engine move generation takes considerably less than 10% of the total time used.
I also do staged movegen in FabChess. Maybe I remembered the numbers wrong... I will do another profiling session tomorrow.
Author of FabChess: https://github.com/fabianvdW/FabChess
A UCI compliant chess engine written in Rust.
FabChessWiki: https://github.com/fabianvdW/FabChess/wiki
fabianvonderwarth@gmail.com
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Back To The Beginning

Post by Michael Sherwin »

zullil wrote: Sat Sep 07, 2019 2:12 pm
Michael Sherwin wrote: Sat Sep 07, 2019 12:42 pm
On an i7- 3930k overclocked to 4.4 GHz using only one thread in the opening position the assembler version generates 75 million positions per second. That includes all moves made and unmade and no hash counting. On an i9-9900k overclocked to 5GHz it would do double that.
Why would going from 4.4 GHz to 5.0 Ghz double the speed of the move generator? What else does the i9 provide?
An out of the box comparison. 3930k at 3.2 GHz vs 9900k at 3.6 GHz
Multithreaded
3930k CPU Mark 11980
9900k CPU Mark 20217

Single Thread
3930k CPU Mark 1935
9900k CPU Mark 2898

I was remembering that last number wrong. I thought it was 3898, sorry. However a 9900k can be easily overclocked to 5GHz using a good air cooler and 5.2 GHz with a really good liquid cooler. A 3930k needs a really good liquid cooler to get to 4.4GHz. Normally I keep my 3930k at stock 3.2 GHz. At 4.4GHz it will run 12 threaded SF just fine. Some other non chess apps will lock it up. The 3930k will run stable at 4GHz for all apps but still gets very hot. The 9900k is 95 watts and the 3930k is 130 watts.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Back To The Beginning

Post by Joost Buijs »

On my 6950X at 4GHz. I get on average 52 mnps without bulk counting and including updating/downdating hash, pawn-hash, material-hash, material-score and phase. The latest processors from Intel have a 15 to 20% higher IPC and a somewhat higher clock-frequency, so my guess is that I would get between 60 and 70 mnps on a more recent processor. This comes close to the 75 mnps you get with your assembler version. I'm pretty sure that this figure can be improved, but it is rather useless to spend time on it because it will have no influence on playing strength.

With bulk counting I get 185 mnps, with PGO it even gets over 200 mnps.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Back To The Beginning

Post by Michael Sherwin »

For those that might be interested in the difference in speed of the pure C version and the assembler version these are the numbers on a 3930k at stock 3.2 GHz.

Bench 6

Pure C,
33 879 522 moves per second

Bench code in C, move gen and MakeMove in assembler.
65 608 247 moves per second

I don't believe in the mantra, "don't program in assembler any more because an optimizing C++ compiler can produce code that is more efficient than any human". :D
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Back To The Beginning

Post by Michael Sherwin »

Joost Buijs wrote: Sat Sep 07, 2019 7:34 pm On my 6950X at 4GHz. I get on average 52 mnps without bulk counting and including updating/downdating hash, pawn-hash, material-hash, material-score and phase. The latest processors from Intel have a 15 to 20% higher IPC and a somewhat higher clock-frequency, so my guess is that I would get between 60 and 70 mnps on a more recent processor. This comes close to the 75 mnps you get with your assembler version. I'm pretty sure that this figure can be improved, but it is rather useless to spend time on it because it will have no influence on playing strength.

With bulk counting I get 185 mnps, with PGO it even gets over 200 mnps.
Single Thread Comparison CPU Mark Rating
3930k 1935
6950x 2147

I guess it is a minor difference but it is a difference. 2147 / 1935 * 75 = 83 million.

P.S And my bench function also uses the normal MakeMove() with all that extra updating. This code may only be for special use cases like mate finding or a different type of engine or whatever. I just posted it in case anyone could use it for whatever. :D
Last edited by Michael Sherwin on Sat Sep 07, 2019 8:02 pm, edited 1 time in total.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Back To The Beginning

Post by Joost Buijs »

Well, I don't think it is impossible to beat the compiler with assembler code, but a programmer who knows all the ins and outs of the compiler he uses is capable of producing a program in a short amount of time that is very hard to beat with assembler code. And then there is the issue of portability, if you want to target another CPU all the code you have written is useless and you have to do it all over again.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Back To The Beginning

Post by Joost Buijs »

Michael Sherwin wrote: Sat Sep 07, 2019 7:53 pm
Joost Buijs wrote: Sat Sep 07, 2019 7:34 pm On my 6950X at 4GHz. I get on average 52 mnps without bulk counting and including updating/downdating hash, pawn-hash, material-hash, material-score and phase. The latest processors from Intel have a 15 to 20% higher IPC and a somewhat higher clock-frequency, so my guess is that I would get between 60 and 70 mnps on a more recent processor. This comes close to the 75 mnps you get with your assembler version. I'm pretty sure that this figure can be improved, but it is rather useless to spend time on it because it will have no influence on playing strength.

With bulk counting I get 185 mnps, with PGO it even gets over 200 mnps.
Single Thread Comparison CPU Mark Rating
3930k 1935
6950x 2147

I guess it is a minor difference but it is a difference. 2147 / 1935 * 75 = 83 million.
Don't forget I up en down-date everything something you don't seem to do, this will make a difference of course.
And you run at 4.4 GHz. while I run at 4.0, also a 10% difference.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Back To The Beginning

Post by Michael Sherwin »

Joost Buijs wrote: Sat Sep 07, 2019 7:58 pm
Michael Sherwin wrote: Sat Sep 07, 2019 7:53 pm
Joost Buijs wrote: Sat Sep 07, 2019 7:34 pm On my 6950X at 4GHz. I get on average 52 mnps without bulk counting and including updating/downdating hash, pawn-hash, material-hash, material-score and phase. The latest processors from Intel have a 15 to 20% higher IPC and a somewhat higher clock-frequency, so my guess is that I would get between 60 and 70 mnps on a more recent processor. This comes close to the 75 mnps you get with your assembler version. I'm pretty sure that this figure can be improved, but it is rather useless to spend time on it because it will have no influence on playing strength.

With bulk counting I get 185 mnps, with PGO it even gets over 200 mnps.
Single Thread Comparison CPU Mark Rating
3930k 1935
6950x 2147

I guess it is a minor difference but it is a difference. 2147 / 1935 * 75 = 83 million.
Don't forget I up en down-date everything something you don't seem to do, this will make a difference of course.
And you run at 4.4 GHz. while I run at 4.0, also a 10% difference.
Okay, good point on the overclocking difference! But as I edited above I do update everything up and down. You did not say if you make and unmake all moves generated. I think in a previous thread you said that you do. Anyway I did not post this to get into a mine is bigger than yours discussion. I posted it because I thought it might be interesting for whatever reason.

How does your move generator work? I invite you to post some example code.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Back To The Beginning

Post by Joost Buijs »

Michael Sherwin wrote: Sat Sep 07, 2019 8:11 pm
Joost Buijs wrote: Sat Sep 07, 2019 7:58 pm
Michael Sherwin wrote: Sat Sep 07, 2019 7:53 pm
Joost Buijs wrote: Sat Sep 07, 2019 7:34 pm On my 6950X at 4GHz. I get on average 52 mnps without bulk counting and including updating/downdating hash, pawn-hash, material-hash, material-score and phase. The latest processors from Intel have a 15 to 20% higher IPC and a somewhat higher clock-frequency, so my guess is that I would get between 60 and 70 mnps on a more recent processor. This comes close to the 75 mnps you get with your assembler version. I'm pretty sure that this figure can be improved, but it is rather useless to spend time on it because it will have no influence on playing strength.

With bulk counting I get 185 mnps, with PGO it even gets over 200 mnps.
Single Thread Comparison CPU Mark Rating
3930k 1935
6950x 2147

I guess it is a minor difference but it is a difference. 2147 / 1935 * 75 = 83 million.
Don't forget I up en down-date everything something you don't seem to do, this will make a difference of course.
And you run at 4.4 GHz. while I run at 4.0, also a 10% difference.
Okay, good point on the overclocking difference! But as I edited above I do update everything up and down. You did not say if you make and unmake all moves generated. I think in a previous thread you said that you do. Anyway I did not post this to get into a mine is bigger than yours discussion. I posted it because I thought it might be interesting for whatever reason.

How does your move generator work? I invite you to post some example code.
I meant up and down-dating the hashes, material score and phase, which takes time and is not needed for perft(). This is not about a mine is bigger than yours discussion, but I just want to show that you don't need assembler to get good performance.

There is nothing secret about my move generator, it is a very straightforward bit-board implementation, I used some meta-programming, this can make it somewhat difficult to comprehend.

Code: Select all

namespace movegen
{
	enum : int { mvNW = 7, mvFW = 8, mvNE = 9, mvF2 = 16 };
	template <int S, int N> constexpr bb_t bbFile = S == White ? masks::bbFile<N> : masks::bbFile<7 - N>;
	template <int S, int N> constexpr bb_t bbRank = S == White ? masks::bbRank<N> : masks::bbRank<7 - N>;
	template <int S> constexpr bb_t bbNW = ~bbFile<S, 0> & ~bbRank<S, 7>;
	template <int S> constexpr bb_t bbNE = ~bbFile<S, 7> & ~bbRank<S, 7>;
	template <int S> constexpr bb_t bbFW = ~bbRank<S, 7>;
	template <int S, int N> constexpr bb_t SHL(bb_t bb) { return S == White ? bb << N : bb >> N; }
	template <int S, int N> constexpr bb_t SHR(bb_t bb) { return S == White ? bb >> N : bb << N; }

	template <int S, int N>
	constexpr void gen_promotion(position_t* pos, bb_t src, bb_t tgt, move_t*& move, int type)
	{
		switch (N)
		{
		case  mvNW: src &= bbNW<S> & SHR<S, N>(tgt); break;
		case  mvFW: src &= bbFW<S> & SHR<S, N>(tgt); break;
		case  mvNE: src &= bbNE<S> & SHR<S, N>(tgt); break;
		}

		while (src)
		{
			int fsq = LSBR(src);
			int tsq = S == White ? fsq + N : fsq - N;
			move++->pack(fsq, tsq, type, S | wQueen);
			move++->pack(fsq, tsq, type, S | wKnight);
			move++->pack(fsq, tsq, type, S | wRook);
			move++->pack(fsq, tsq, type, S | wBishop);
		}
	}

	template <int S, int N>
	constexpr void gen_pawn(position_t* pos, bb_t src, bb_t tgt, move_t*& move, int type)
	{
		switch (N)
		{
		case  mvNW: src &= bbNW<S> & SHR<S, N>(tgt); break;
		case  mvFW: src &= bbFW<S> & SHR<S, N>(tgt); break;
		case  mvNE: src &= bbNE<S> & SHR<S, N>(tgt); break;
		case  mvF2: src &= bbRank<S, 1> & SHR<S, mvFW>(pos->empty())& SHR<S, N>(tgt); break;
		}

		while (src)
		{
			int fsq = LSBR(src);
			S == White ? move++->pack(fsq, fsq + N, type) : move++->pack(fsq, fsq - N, type);
		}
	}

	template <int S, int T>
	constexpr void gen_piece(position_t* pos, bb_t tgt, move_t*& move, int type)
	{
		bb_t src = 0;

		switch (T)
		{
		case Knight: src = pos->knights(S); break;
		case Bishop: src = pos->bishops_queens(S); break;
		case Rook: src = pos->rooks_queens(S); break;
		case King: src = pos->king(S); break;
		}

		while (src)
		{
			int fsq = LSBR(src);
			bb_t dst = 0;

			switch (T)
			{
			case Knight: dst = tgt & masks::knight_attacks[fsq]; break;
			case Bishop: dst = tgt & magics::bishop_attacks(fsq, pos->occupied()); break;
			case Rook: dst = tgt & magics::rook_attacks(fsq, pos->occupied()); break;
			case King: dst = tgt & masks::king_attacks[fsq]; break;
			}

			while (dst) move++->pack(fsq, LSBR(dst), type);
		}
	}

	template <int S>
	constexpr void gen_castling(position_t* pos, move_t*& move)
	{
		if (S == White)
		{
			if (Bit(H1) & pos->castling() && !(0x60ULL & pos->occupied()) && !(0x70ULL & pos->attacked()))
				move++->pack(E1, G1, mtQuiet | mtKing | mtCastle);
			if (Bit(A1) & pos->castling() && !(0x0EULL & pos->occupied()) && !(0x1CULL & pos->attacked()))
				move++->pack(E1, C1, mtQuiet | mtKing | mtCastle);
		}
		else
		{
			if (Bit(H8) & pos->castling() && !(0x60ULL << 56 & pos->occupied()) && !(0x70ULL << 56 & pos->attacked()))
				move++->pack(E8, G8, mtQuiet | mtKing | mtCastle);
			if (Bit(A8) & pos->castling() && !(0x0EULL << 56 & pos->occupied()) && !(0x1CULL << 56 & pos->attacked()))
				move++->pack(E8, C8, mtQuiet | mtKing | mtCastle);
		}
	}

	template <int S>
	void gen_captures(position_t* pos, move_t*& move)
	{
		bb_t src = bbRank<S, 6> & pos->pawns(S);
		bb_t tgt = pos->occupied(Opp(S));
		gen_promotion<S, mvNW>(pos, src, tgt, move, mtPromote | mtCapture | mtPawn);
		gen_promotion<S, mvNE>(pos, src, tgt, move, mtPromote | mtCapture | mtPawn);
		gen_promotion<S, mvFW>(pos, src, pos->empty(), move, mtPromote | mtPawn);
		src = ~bbRank<S, 6> & pos->pawns(S);
		gen_pawn<S, mvNW>(pos, src, tgt, move, mtCapture | mtPawn);
		gen_pawn<S, mvNE>(pos, src, tgt, move, mtCapture | mtPawn);

		if (pos->enpassant_square() > 0)
		{
			bb_t epb = Bit(pos->enpassant_square());
			gen_pawn<S, mvNW>(pos, src, epb, move, mtCapture | mtEnpcapt | mtPawn);
			gen_pawn<S, mvNE>(pos, src, epb, move, mtCapture | mtEnpcapt | mtPawn);
		}

		gen_piece<S, Knight>(pos, tgt, move, mtCapture);
		gen_piece<S, Bishop>(pos, tgt, move, mtCapture);
		gen_piece<S, Rook>(pos, tgt, move, mtCapture);
		gen_piece<S, King>(pos, tgt & ~pos->attacked(), move, mtCapture | mtKing);
	}

	template <int S>
	void gen_quiet_moves(position_t* pos, move_t*& move)
	{
		bb_t src = ~bbRank<S, 6> & pos->pawns(S);
		bb_t tgt = pos->empty();
		gen_pawn<S, mvFW>(pos, src, tgt, move, mtQuiet | mtPawn);
		gen_pawn<S, mvF2>(pos, src, tgt, move, mtQuiet | mtPawn | mtPawn2);
		gen_piece<S, Knight>(pos, tgt, move, mtQuiet);
		gen_piece<S, Bishop>(pos, tgt, move, mtQuiet);
		gen_piece<S, Rook>(pos, tgt, move, mtQuiet);
		gen_piece<S, King>(pos, tgt & ~pos->attacked(), move, mtQuiet | mtKing);
		gen_castling<S>(pos, move);
	}

	template <int S>
	void gen_captures_to_target(position_t* pos, bb_t tgt, move_t*& move)
	{
		bb_t src = bbRank<S, 6> & pos->pawns(S);
		gen_promotion<S, mvNW>(pos, src, tgt, move, mtPromote | mtCapture | mtPawn);
		gen_promotion<S, mvNE>(pos, src, tgt, move, mtPromote | mtCapture | mtPawn);

		src = ~bbRank<S, 6> & pos->pawns(S);
		gen_pawn<S, mvNW>(pos, src, tgt, move, mtCapture | mtPawn);
		gen_pawn<S, mvNE>(pos, src, tgt, move, mtCapture | mtPawn);

		if (pos->enpassant_square() > 0)
		{
			bb_t epb = Bit(pos->enpassant_square());

			if (tgt & SHR<S, 8>(epb))
			{
				gen_pawn<S, mvNW>(pos, src, epb, move, mtCapture | mtEnpcapt | mtPawn);
				gen_pawn<S, mvNE>(pos, src, epb, move, mtCapture | mtEnpcapt | mtPawn);
			}
		}

		gen_piece<S, Knight>(pos, tgt, move, mtCapture);
		gen_piece<S, Bishop>(pos, tgt, move, mtCapture);
		gen_piece<S, Rook>(pos, tgt, move, mtCapture);
	}

	template <int S>
	void gen_interpositions(position_t* pos, bb_t tgt, move_t*& move)
	{
		gen_promotion<S, mvFW>(pos, bbRank<S, 6> & pos->pawns(S), tgt, move, mtPromote | mtPawn);
		bb_t src = ~bbRank<S, 6> & pos->pawns(S);
		gen_pawn<S, mvFW>(pos, src, tgt, move, mtQuiet | mtPawn);
		gen_pawn<S, mvF2>(pos, src, tgt, move, mtQuiet | mtPawn | mtPawn2);
		gen_piece<S, Knight>(pos, tgt, move, mtQuiet);
		gen_piece<S, Bishop>(pos, tgt, move, mtQuiet);
		gen_piece<S, Rook>(pos, tgt, move, mtQuiet);
	}

	template <int S>
	void gen_evasion_captures(position_t* pos, move_t*& move)
	{
		gen_piece<S, King>(pos, pos->occupied(Opp(S)) & ~pos->attacked(), move, mtCapture | mtKing);

		if (POPCNT(pos->checkers()) == 1)
			gen_captures_to_target<S>(pos, pos->checkers(), move);
	}

	template <int S>
	void gen_evasion_moves(position_t* pos, move_t*& move)
	{
		gen_piece<S, King>(pos, pos->empty() & ~pos->attacked(), move, mtQuiet | mtKing);

		if (POPCNT(pos->checkers()) == 1 && (pos->checkers() & (pos->pawns(Opp(S)) | pos->knights(Opp(S)))) == 0)
			gen_interpositions<S>(pos, masks::between[pos->king_square(S)][LSB(pos->checkers())], move);
	}

	move_t* gen_captures(position_t* pos, move_t* moves)
	{
		pos->own() == White ? gen_captures<White>(pos, moves) : gen_captures<Black>(pos, moves);  
		return moves;
	}

	move_t* gen_quiet_moves(position_t* pos, move_t* moves)
	{
		pos->own() == White ? gen_quiet_moves<White>(pos, moves) : gen_quiet_moves<Black>(pos, moves);
		return moves;
	}

	move_t* gen_evasion_captures(position_t* pos, move_t* moves)
	{
		pos->own() == White ? gen_evasion_captures<White>(pos, moves) : gen_evasion_captures<Black>(pos, moves);  
		return moves;
	}

	move_t* gen_evasion_moves(position_t* pos, move_t* moves)
	{
		pos->own() == White ? gen_evasion_moves<White>(pos, moves) : gen_evasion_moves<Black>(pos, moves);
		return moves;
	}

	move_t* gen_all_moves(position_t* pos, move_t* moves)
	{
		if (pos->own() == White)
		{
			gen_captures<White>(pos, moves);
			gen_quiet_moves<White>(pos, moves);
		}
		else
		{
			gen_captures<Black>(pos, moves);
			gen_quiet_moves<Black>(pos, moves);
		}
		return moves;
	}

	move_t* gen_all_evasions(position_t* pos, move_t* moves)
	{
		if (pos->own() == White)
		{
			gen_evasion_captures<White>(pos, moves);
			gen_evasion_moves<White>(pos, moves);
		}
		else
		{
			gen_evasion_captures<Black>(pos, moves);
			gen_evasion_moves<Black>(pos, moves);
		}
		return moves;
	}
}
fabianVDW
Posts: 146
Joined: Fri Mar 15, 2019 8:46 pm
Location: Germany
Full name: Fabian von der Warth

Re: Back To The Beginning

Post by fabianVDW »

fabianVDW wrote: Sat Sep 07, 2019 6:55 pm
Joost Buijs wrote: Sat Sep 07, 2019 12:57 pm
fabianVDW wrote: Sat Sep 07, 2019 12:21 pm AFAIK, movegeneration speed is a good chunk of the runtime in modern engines. Atleast in FabChess, it is about 25-30% of the time used, being on par with the evaluation function (IIRC).
25% to 30% of the time used by the move generation is quite a lot. Maybe it depends upon staged move generation or not and whether you execute the hash move with or without move generation, when I profile my engine move generation takes considerably less than 10% of the total time used.
I also do staged movegen in FabChess. Maybe I remembered the numbers wrong... I will do another profiling session tomorrow.
Okay, I did some profiling. First of all, Fab runs perft 6 on startpos with about 80MNPS (with bulk counting) on my I5-6400. During search, movegen takes up about 5.5% of the time, but it uses a struct which contains all the pseudo legal moves for every piece. This struct is later reused in eval for mobility etc. Calculation of that struct is about 15% of the time.

I am sure my engine is badly tuned in that regard aswell, as I seem not to be able to recognize bottlenecks well
Author of FabChess: https://github.com/fabianvdW/FabChess
A UCI compliant chess engine written in Rust.
FabChessWiki: https://github.com/fabianvdW/FabChess/wiki
fabianvonderwarth@gmail.com