Perft as a Measure of Speed...

hgm · Post by **hgm** » Mon Apr 22, 2013 8:41 pm

In a bitboard there are usually multiple boards that have to be updated. E.g. on moving a Queen you would need to update occupancy, color, diagonal sliders, orthogonal sliders. All 64 bits. And the same for removing the victim, if the move was a capture. With mailbox it is just moving a single byte to the new square, and clearing the old.

AlvaroBegue · Post by **AlvaroBegue** » Mon Apr 22, 2013 8:54 pm

hgm wrote:In a bitboard there are usually multiple boards that have to be updated. E.g. on moving a Queen you would need to update occupancy, color, diagonal sliders, orthogonal sliders. All 64 bits. And the same for removing the victim, if the move was a capture. With mailbox it is just moving a single byte to the new square, and clearing the old.

Oh, I see. Well, I use bitboards, but I simply have one bitboard per piece type, and I use magics for sliding attacks, so I don't need diagonal sliders or orthogonal sliders. I do need to recompute color and occupancy, but I recompute them every time, which just takes a few clock cycles (6 is a quick guess, but I'll try to measure it).

hgm · Post by **hgm** » Mon Apr 22, 2013 9:10 pm

Well, that is even worse, right? With 6 piece types the color boards each need you to OR 6 boards, which you then have to or together to get occupancy (12 loads, 11 ORs, 3 stores, plus 2 loads, 2 XORs and 2 stores to update the mover and victim piece types). Updating color incrementally (and occupancy from those) would just require 2 loads, 2 XOR, 3 stores (so 6 loads, 4 XOR, 1 OR and 5 stores in total).

Mailbox would require 1 load, 1 XOR (to create a zero) and 2 stores.

This does not take into account saving stuff for the later UnMake (which in the mailbox case would just be 1 load, 2 stores).

Having Q on separate boards from R and B would require you to alway test two boards if you want to know if a piece is attacked from a given direction. (E.g. the RookAttacks board on the King square would have to be ANDed with both opponent Q and R to detect if you are in check. Which loses you time there.

AlvaroBegue · Post by **AlvaroBegue** » Mon Apr 22, 2013 9:30 pm

I am not sure that counting operations the way you are doing is very relevant to how long it's going to take the CPU to perform these operations. ORs are ridiculously simple to implement in hardware, and it's very likely the CPU will be able to parallelize a bunch of them. Stores and loads from L1 cache are incredibly fast too.

Doing an extra OR (like taking the union of rook and queen) can often be scheduled together with other operations, so it's effectively free.

My best attempt at measuring how long it takes to compute those 11 ORs tells me it takes 5 cycles.

Rein Halbersma · Post by **Rein Halbersma** » Mon Apr 22, 2013 9:52 pm

hgm wrote:Well, that is even worse, right? With 6 piece types the color boards each need you to OR 6 boards, which you then have to or together to get occupancy (12 loads, 11 ORs, 3 stores, plus 2 loads, 2 XORs and 2 stores to update the mover and victim piece types). Updating color incrementally (and occupancy from those) would just require 2 loads, 2 XOR, 3 stores (so 6 loads, 4 XOR, 1 OR and 5 stores in total).

Mailbox would require 1 load, 1 XOR (to create a zero) and 2 stores.

This does not take into account saving stuff for the later UnMake (which in the mailbox case would just be 1 load, 2 stores).

Having Q on separate boards from R and B would require you to alway test two boards if you want to know if a piece is attacked from a given direction. (E.g. the RookAttacks board on the King square would have to be ANDed with both opponent Q and R to detect if you are in check. Which loses you time there.

Isn't having 2 color boards + 6 piece_type boards a good compromise? MakeMove takes 2 updates (one for the color, one for the piece type) and using Occupied takes 2 loads + 1 xor. Bitboards are a tradeoff: you gain some by doing pattern matching across squares in parallel, but you lose some by needing supporting data to also get patterns within squares (i.e. which color/piece occupy a single square).

hgm · Post by **hgm** » Mon Apr 22, 2013 10:18 pm

Well, let me stress I have never written a bitboard program, but I am not sure why it would matter whether you have a full set of type-boards for each color or share them. You would only update one of them anyway (or two, for a capture). It seems just a nuisance that you would have to add an extra &myPieces or &hisPieces in every test you do. I cannot imagine situations in which you would be interested to get all Bishops (say) of both colors.

The way I imagine a bitboard MakeMove would work is:

Code: Select all

Make:
  change = square[move->from] | square[move->to];
  capture = square[move->to] & occupied;
  pieceType = ?; // from move struct? 
  myPieces[pieceType] ^= change;
  myPieces[COLOR] ^= change;
  victimType = ?; // not sure how you would get this at all
  hisPieces[victimType] ^= capture;
  hisPieces[COLOR] ^= capture;
  occupied = myPieces[COLOR] | hisPieces[COLOR];
UnMake:
  myPieces[pieceType] ^= change;
  myPieces[COLOR] ^= change;
  hisPieces[victimType] ^= capture;
  hisPieces[COLOR] ^= capture;
  occupied = myPieces[COLOR] | hisPieces[COLOR];

hgm · Post by **hgm** » Mon Apr 22, 2013 10:24 pm

AlvaroBegue wrote:I am not sure that counting operations the way you are doing is very relevant to how long it's going to take the CPU to perform these operations. ORs are ridiculously simple to implement in hardware, and it's very likely the CPU will be able to parallelize a bunch of them. Stores and loads from L1 cache are incredibly fast too.

Doing an extra OR (like taking the union of rook and queen) can often be scheduled together with other operations, so it's effectively free.

That holds for all operations. Individually they are all 'free', because they are scheduled together with something else.

But if they would not have been executed 'for free', some other instruction would have been executed for free in their place, which now takes time. In the end the only thing that matters is how much you want the CPU to do. When it has to do half as much, it will be twice as fast. That is true whether it can do 1 or 10 instructions at a time.

AlvaroBegue · Post by **AlvaroBegue** » Mon Apr 22, 2013 10:41 pm

hgm wrote:
AlvaroBegue wrote:I am not sure that counting operations the way you are doing is very relevant to how long it's going to take the CPU to perform these operations. ORs are ridiculously simple to implement in hardware, and it's very likely the CPU will be able to parallelize a bunch of them. Stores and loads from L1 cache are incredibly fast too.

Doing an extra OR (like taking the union of rook and queen) can often be scheduled together with other operations, so it's effectively free.
That holds for all operations. Individually they are all 'free', because they are scheduled together with something else.

But if they would not have been executed 'for free', some other instruction would have been executed for free in their place, which now takes time. In the end the only thing that matters is how much you want the CPU to do. When it has to do half as much, it will be twice as fast. That is true whether it can do 1 or 10 instructions at a time.

That's only partially correct. Very often code has long dependency chains (the parameters of an operation depend on the result of previous operations), where the processor has to churn through them one at a time. This code is cheap to execute because it's easy for a superscalar CPU to perform some of these operations in parallel.

An example of code that has dependency chains:

Code: Select all

u64 rook_attacks(int from, u64 occupied) {
  MagicInfo const &magic_info = rook_attack_table[from];
  unsigned scrambled = ((occupied & magic_info.mask) * magic_info.magic) >> magic_info.shift;

  return magic_info.table[scrambled];
}

So what is that code doing? Looking up a table, ANDing something from the table entry with `occupied', multiplying the result by something, shifting the result of that, and looking up another table using the result of that. As you see, every operation uses the result of the previous operation in some way, and the processor has tons of arithmetic units sitting idle.

In a program that has a lot of code like that (and I am afraid mine does), performing the occasional OR operation turns out to be free.

Evert · Post by **Evert** » Mon Apr 22, 2013 10:42 pm

hgm wrote:I cannot imagine situations in which you would be interested to get all Bishops (say) of both colors.

I can. For instance: I skip bishop (rook, knight) evaluation if there are no bishops (rooks, knights) remaining, which is a simple test. If I want to classify an ending it's convenient to test for the presence of queens (say) using "if (board[QUEEN] != 0)". I'm sure there are other examples.
Typically though you do want to know about pieces of a particular side.

The reason I use 6 piece types + 2 colours in Jazz is that I use copy-make and copying less stuff means it's faster.

hgm · Post by **hgm** » Mon Apr 22, 2013 10:49 pm

AlvaroBegue wrote:That's only partially correct. Very often code has long dependency chains (the parameters of an operation depend on the result of previous operations), where the processor has to churn through them one at a time.

This was true on Pentium I. But-of-order execution would completely hide that, by executing other dependency chains that are independent of this one (like calculating the attack set of the other Rook, or of the Bishops) in parallel with this one.

Perft as a Measure of Speed...

Re: Perft as a Measure of Speed...

Re: Perft as a Measure of Speed...

Re: Perft as a Measure of Speed...

Re: Perft as a Measure of Speed...

Re: Perft as a Measure of Speed...

Re: Perft as a Measure of Speed...

Re: Perft as a Measure of Speed...

Re: Perft as a Measure of Speed...

Re: Perft as a Measure of Speed...

Re: Perft as a Measure of Speed...