I'm Puzzled - Storing Piece Info & Magic Move Gen...

Rein Halbersma · Post by **Rein Halbersma** » Wed Mar 27, 2013 10:19 pm

Evert wrote:
Rein Halbersma wrote: How big is your Position class? Would adding 64 bytes of board[64] make such a big difference? Just one cache line per copy.
Six bitboards (one for each piece type) and two for each side, plus some extra data for a total of 96 bytes (some of which is padding added by the compiler). So adding 64 bytes to that is significant, but actually the perft results aren't that much worse if I add 64 (unused) bytes to the end of the struct, only about 10%. I'd recover some of that from a more efficient get_piece() function.

Still, I suspect a more sensible way to do it would be to use make/unmake for the board[64] array even if I use copy/make for the bitboards, but then it might make more sense to switch to make/unmake altogether (which would be a big hassle since parts of the code rely on being able to access previous positions directly).

Nice to see you have such compact data structures. In that case, have you tried to use Gerd's quad-bitboard approach to use only 4 bitboards to encode all 6 pieces and colors and also have a reasonably efficient array view over these 4 bitboards?

Evert · Post by **Evert** » Wed Mar 27, 2013 10:58 pm

Rein Halbersma wrote:Nice to see you have such compact data structures.

Thanks.

It used to be bigger when I was using rotated bitboards. Not much room to squeeze things out now though; there's a mostly useless "init" bitboard that I think is only used anymore for castling rights, which can be dealt with more compactly of course. That would give a reduction of 8 bytes at the expense of slightly more complicated code. I have a vague plan to do that "someday", but it's such a hassle that I keep finding other things to do first.

In that case, have you tried to use Gerd's quad-bitboard approach to use only 4 bitboards to encode all 6 pieces and colors and also have a reasonably efficient array view over these 4 bitboards?

No.
I'd looked at it, but decided against it because my gut feeling was that it was too complicated. Perhaps unfairly, but it would be quite difficult to switch over now because I have (or had, when I started writing the code) the bad habit of accessing the struct members directly to get at the data (as opposed to using accessor functions that hide the implementation).

Daniel Shawul · Post by **Daniel Shawul** » Wed Mar 27, 2013 11:31 pm

When i was in bad need of reducing size of the position structure, I used the representation that you use now i.e. with two white/black pieces to differentiate colors. That was enough for me so i didn't need to compress them further to quad-bitboards. But Srdja used them. The more you compress the more time to decompress so unless you are in severe need of space (as is on gpu), i don't see much use for them.

spambanane · Post by **spambanane** » Thu Mar 28, 2013 11:25 am

Evert wrote:
Rein Halbersma wrote: How big is your Position class? Would adding 64 bytes of board[64] make such a big difference? Just one cache line per copy.
Six bitboards (one for each piece type) and two for each side, plus some extra data for a total of 96 bytes (some of which is padding added by the compiler). So adding 64 bytes to that is significant, but actually the perft results aren't that much worse if I add 64 (unused) bytes to the end of the struct, only about 10%.

Is 64 byte more that significant? If yes: why? Can you go more into detail please (I am new to chess programming.)? I can't imagine a big difference with L1 chaches having 32kb. What am I missing?

Rein Halbersma · Post by **Rein Halbersma** » Thu Mar 28, 2013 12:24 pm

spambanane wrote:
Evert wrote:
Rein Halbersma wrote: How big is your Position class? Would adding 64 bytes of board[64] make such a big difference? Just one cache line per copy.
Six bitboards (one for each piece type) and two for each side, plus some extra data for a total of 96 bytes (some of which is padding added by the compiler). So adding 64 bytes to that is significant, but actually the perft results aren't that much worse if I add 64 (unused) bytes to the end of the struct, only about 10%.
Is 64 byte more that significant? If yes: why? Can you go more into detail please (I am new to chess programming.)? I can't imagine a big difference with L1 chaches having 32kb. What am I missing?

The problem is not that the 64 bytes do not fit inside the cache, but rather the limited memory bandwidth. With 30 ply searches, 192 byte positions, 35 moves per ply of 2 bytes each, you have 30 * 262 ~ 8K, so comfortably inside the cache. But with copy-make, positions of 3 cache lines (192 bytes) and ~10 Mnps you are copying ~2 Gb/s *per core*.

Evert · Post by **Evert** » Thu Mar 28, 2013 2:20 pm

spambanane wrote: Is 64 byte more that significant? If yes: why? Can you go more into detail please (I am new to chess programming.)? I can't imagine a big difference with L1 chaches having 32kb. What am I missing?

I use copy-make. That means I make a copy of the current state before making the move, and when I "unmake" the move, instead of unmaking it, I simply go back to my previous copy.

Compared to make/unmake, where you make the move and then reverse it, this makes the "unmake" much simpler (and faster, since it's free), but it adds a cost to "make", which is copying the entire content of the board. Increasing the size from 96 to 140 bytes is significant because the size increase is significant (and that's before we factor in crossing different cache lines).

Let's not blow that out of proportion though: the bottle-neck is going to be the speed of the evaluation, not the speed of make/unmake move.

By the way: even if you use make/unmake as opposed to copy/make you still need to copy some information (like the castling rights, en-passant square and depending on how you encode captures, the captured piece).

Chan Rasjid · Post by **Chan Rasjid** » Thu Mar 28, 2013 3:50 pm

Chan Rasjid wrote:
AlvaroBegue wrote:
Chan Rasjid wrote:I could think of scanning the bits of bits[0][Rook] for white rooks and use bitscan to get the squares for the rooks. Is this way faster. I use to dislike bitScanForward as it is said to be expensive.
"It is said to be expensive" is not much of an argument for anything. Measure it!

Modern CPUs (the ones that support SSE 4.2) have hardware implementations for population count and finding the least-significant bit set (accessible through __builtin_popcountl and __builtin_ctzl, if you are using gcc).
I think yes - my piece list goes. I'll keep the current version, do the changes and compare.

I was thinking a little, that it should be better without the piece list. Updating them don't seem to come cheap. Less global array of pclist[2][8][16] means more cache friendly - just need u64 bits[2][8] all the time. Also not much need for pawn piece list in evaluation as pawn hash has hit rate of 95%.

Best Regards,
Rasjid.

I have changed my codes taking away the piece list. My cpu is a core-2 duo and using the gcc builtin popcnt and bitscans.

It is good news and good news!

The first good new is - there is about a speed-up of 2.5% in the nps.

The second good news is - I am not that bad a programmer after all.

My implementation of the piece list was good.

Besides, the codes are much simpler.

Best Regards,
Rasjid.

I'm Puzzled - Storing Piece Info & Magic Move Gen...

Re: I'm Puzzled - Storing Piece Info & Magic Move Gen...

Re: I'm Puzzled - Storing Piece Info & Magic Move Gen...

Re: I'm Puzzled - Storing Piece Info & Magic Move Gen...

Re: I'm Puzzled - Storing Piece Info & Magic Move Gen...

Re: I'm Puzzled - Storing Piece Info & Magic Move Gen...

Re: I'm Puzzled - Storing Piece Info & Magic Move Gen...

Re: I'm Puzzled - Storing Piece Info & Magic Move Gen...