TalkChess.com

Posted: **Sat Aug 25, 2007 9:06 pm**

This is the bishop code in RomiChess.

      case WBISHOP&#58;
        h->moves&#91;id&#93; = (&#40;ray7p&#91;fs&#93; & belowInc&#91;FirstBit&#40;ray7p&#91;fs&#93; & aPieces&#41;&#93;)
                     |  &#40;ray9p&#91;fs&#93; & belowInc&#91;FirstBit&#40;ray9p&#91;fs&#93; & aPieces&#41;&#93;)
                     |  &#40;ray7m&#91;fs&#93; & aboveInc&#91;LastBit&#40;ray7m&#91;fs&#93; & aPieces&#41;&#93;)
                     |  &#40;ray9m&#91;fs&#93; & aboveInc&#91;LastBit&#40;ray9m&#91;fs&#93; & aPieces&#41;&#93;));
        h->attacked |= h->moves&#91;id&#93;;
        h->moves&#91;id&#93; &= notMe;
        continue;

The ifs are hidden in the FirstBit and LastBit functions.

Code: Select all

__inline s32 FirstBit&#40;u64 bits&#41; &#123;
	s32 index;
	if&#40;_BitScanForward64&#40;&index, bits&#41;)
		return index;
	return 64;
&#125;

__inline s32 LastBit&#40;u64 bits&#41; &#123;
	s32 index;
	if&#40;_BitScanReverse64&#40;&index, bits&#41;)
		return index;
	return 64;
&#125;

Posted: **Sat Aug 25, 2007 9:26 pm**

Michael Sherwin wrote:This is the bishop code in RomiChess.

So what kind of performance does it actually get compared to all these other methods.

I haven't tried the code in that testing framework because I don't have a 64 bit system.

He tried it in his code and he said it ran rather well, but the program was so unoptimized at that point he couldn't get dependable timings. He ended up going with Gerd's method simply because it seemed 'cleaner'.

It'd be kind of funny if it turns out to be competitive to all those other methods.

I figured surely somebody was using that method, but I wasn't seeing any performance results about it.

Everyboyd always talks about Gerd's and Rotated and Magics, etc.

The ifs are hidden in the FirstBit and LastBit functions.

Sure is a shame that Intel screwed up the specs for BSR & BSR...

Didn't AMD fix that in their 64 bit cpu's?? I thought I remembered reading somebody saying that some new processors do it right.

Posted: **Sat Aug 25, 2007 10:09 pm**

Carey wrote:
Michael Sherwin wrote:This is the bishop code in RomiChess.
So what kind of performance does it actually get compared to all these other methods.

I haven't tried the code in that testing framework because I don't have a 64 bit system.

He tried it in his code and he said it ran rather well, but the program was so unoptimized at that point he couldn't get dependable timings. He ended up going with Gerd's method simply because it seemed 'cleaner'.

It'd be kind of funny if it turns out to be competitive to all those other methods.

I figured surely somebody was using that method, but I wasn't seeing any performance results about it.

Everyboyd always talks about Gerd's and Rotated and Magics, etc.

The ifs are hidden in the FirstBit and LastBit functions.
Sure is a shame that Intel screwed up the specs for BSR & BSR...

Didn't AMD fix that in their 64 bit cpu's?? I thought I remembered reading somebody saying that some new processors do it right.

Hi Carey,

I do not have any comparison figures for the classic bitboard method, except that my perft example ran at ~twice the rate of Crafty's. On new 64 bit machines running a 64 bit compile it would be extreamly fast. But, I do not have a new 64 bit machine, so I can not give an answer there either.

The best thing to do would be to use Harald's source code to devise a test for the various methods that can be compiled for both 32 bit and 64 bit machines. Then you will have your answers. I just hope that you share them with us!

Mike

Posted: **Sat Aug 25, 2007 10:44 pm**

Michael Sherwin wrote: I do not have any comparison figures for the classic bitboard method, except that my perft example ran at ~twice the rate of Crafty's. On new

That sounds encouraging.

64 bit machines running a 64 bit compile it would be extreamly fast. But, I do not have a new 64 bit machine, so I can not give an answer there either.

Neither do I. Not yet, anyway. I'm hoping for enough spare cash towards the end of the year to buy a decent new system & large monitor. But until them, I'm using an old P3 for my own work. (Yes, I know the P3 is old and I should have upgraded long ago. But I didn't.... It did what I needed. Just not what I wanted...)

The best thing to do would be to use Harald's source code to devise a test for the various methods that can be compiled for both 32 bit and 64 bit machines. Then you will have your answers. I just hope that you share them with us!

That probably would be the best thing.

But I don't have a 64 bit system. Which is the system of most interest. Both the X2 and Core2.

For 32 bit, all I have is an old P3, which would give very different results than a P4 or a 64 bit cpu in 32 bit mode.

(Plus, we all know that we tend to tune the code for whatever system & compiler we are working on, so the results would be further skewed away from results usable for anyone else.)

Even compiling them would be a little bit of a problem because I don't have a C compiler installed. I just recently reinstalled Windows fresh, getting rid of a lot of accumulated junk and instabilities, and I'm not eager to install MSVC Express and all the assocaited junk onto my nice clean working system.

I just hadn't heard anybody talk about the "Pitty" method and was hoping somebody already had some numbers for it.

Posted: **Sat Aug 25, 2007 10:58 pm**

Carey wrote:Something I've been curious about for a while. Since I don't have a 64 bit system to test it on, I figured I'd ask...

What about the "Classic" method. (Called here the "Pitty method".)

You know, the old school kind from before rotated bitboards were developed.

(Somewhat like how Chess 4.x did it, except they were more concerned about updating their database, which was very slow.)

Do the 8 rays, do the FindFirstBit() & FindLastBit() stuff, etc.

The memory is only 8 rays * 64 squares * sizeof(BitBoard) = 4k.
I figured that with 64 bit systems having fast BSF & BSR instructions, something like this might be practical.

But I don't have a 64 bit system to test it on.

(Note: This code snippet came from a friend. So if you talk about it, please call it the 'Pitty method". Any newbie who's ever tried to figure out how to do rotated bitboards for his first chess program might be able to figure out why it's called the "Pitty method".... And yes, there are two T's in this spelling.)

edit: Minor code typo due to cleanup for posting.

The fascinating thing is the number of possible approaches to get sliding attacks. It is still a nice cycle hunting challenge and contest for some of us

Your mentioned Pitty method (was not aware on that name) is still worth to consider with nowadays faster bitscans (core2duo, K10 - K8 has slow 9-10 cycles vector path bitscans - not to mention P4). Specially if you need one direction only. It is for instance great to get the source square of direction wise target-sets from kogge-stone fills, with different cases (bsr/bsf) for positive and negative directions.

The advantage is the small 4-KByte lookup array and moderate register usage. Also it only deals with one but no rotated occupied bitboards and is therefor as versatile as magics. Drawback are the four branches with some chance of miss-predictions up and then and the sequential nature. Up to eight L1 reads, likely distinct cachelines, four (conditional) bitscans, xors, some ors sounds not that cheap. Otoh with a polluted L1 it might be better to stay with a few branch misses rather than a lot of L1-misses. I wonder how many % L1 misses are necessary so that and, mul, shr, lookup, becomes slower, reading mask/factor/shiftamount[sq] from one cacheline and the lookup from a second.

Finally, emprical evidence for a particular program/cpu is what counts. And if we are talking about 1%-noise in total program performance, I suggest to take the appraoch you like or even (re-)invent by yourself.

Posted: **Sun Aug 26, 2007 12:14 am**

Gerd Isenberg wrote:The fascinating thing is the number of possible approaches to get sliding attacks. It is still a nice cycle hunting challenge and contest for some of us

I am very happy with forgotten Gerd's "bswap" approach:

Code: Select all

    struct M &#123;
        BitBoard singleton; //optional well-known "bitmask" &#40;1 << sq&#41;
        BitBoard vertical; //attack direction on empty board
        BitBoard diagonal;
        BitBoard antidiag;
        BitBoard knight;
        BitBoard king;
        BitBoard whitePawn;
        BitBoard blackPawn;
    &#125; mask&#91;64&#93;;

    BitBoard attack&#40;Square from, BitBoard occupied, BitBoard dirMask&#41; const &#123;
        bit64_t forward = occupied & dirMask;
        bit64_t reverse = bswap&#40;forward&#41;;
        forward -= mask&#91;from&#93;.singleton;
        reverse -= mask&#91;from ^ 070&#93;.singleton;
        forward ^= bswap&#40;reverse&#41;;
        return BitBoard&#40;forward & dirMask&#41;;
    &#125;

    BitBoard bishop&#40;Square from, BitBoard occupied&#41; const &#123;
        return
            attack&#40;from, occupied, mask&#91;from&#93;.diagonal&#41; |
            attack&#40;from, occupied, mask&#91;from&#93;.antidiag&#41;;
    &#125;

    BitBoard rook&#40;Square from, BitBoard occupied&#41; const &#123;
        return
            attack&#40;from, occupied, mask&#91;from&#93;.vertical&#41; |
            rankAttack&#40;from, occupied&#41;; //any traditional approach
    &#125;

32 bit and cache friendly. Memory acceses nicely interleaved with simple arithmetic.

Posted: **Sun Aug 26, 2007 12:51 am**

Gerd Isenberg wrote: The fascinating thing is the number of possible approaches to get sliding attacks. It is still a nice cycle hunting challenge and contest for some of us

There has been a staggering amount of research and results the past couple years.

You and a few others have really revolutionized bitboard data structures.

It's safe to say that if you had tried to do all this back in the days of the ICCA, with its quartlery journal, there wouldn't have been nearly as much progress made.

Your mentioned Pitty method (was not aware on that name) is still worth

The method is so old I doubt it has a name.

That was probably the way Slate & Atkin did some bitboard stuff way back in the selective search Chess 3.x. At the very least, it comes pretty direct from their Chess 4.x and Atkin's Chess 0.5 program.

I called it the 'Pitty method' for several reasons, including that if you try to do rotated bitboards for your very first chess program, somebody needs to take pity on you and suggest a simpler way.

(It is worth pointing out that when he told me he had tried this method, I was definetly surprised he got good results. With the branch misses and the bsf/bsr's I did not expect decent performance.)

to consider with nowadays faster bitscans (core2duo, K10 - K8 has slow 9-10 cycles vector path bitscans - not to mention P4). Specially if you need

That was kind of my feeling too. But since I don't have a 64 bit system (or even a fast 32 bit system), I don't really know what the performance would be. Possibly pretty good though.

one direction only. It is for instance great to get the source square of direction wise target-sets from kogge-stone fills, with different cases (bsr/bsf) for positive and negative directions.

The advantage is the small 4-KByte lookup array and moderate register

Your 1.5k aproach is better. That's what I prefer, even if it's not as fast as other methods. It's just more elegant.

usage. Also it only deals with one but no rotated occupied bitboards and is therefor as versatile as magics. Drawback are the four branches with some chance of miss-predictions up and then and the sequential nature.

Don't the newer X2's & Core2's behave differently in this?

Also, what about the built in compare & swap etc. instruction(s)? Would those be better?

What about a jumpless version. Something to make Mask = $f...f or 0 depending whether Board has any bits at all. Then you could do Bit=Mask&(bsf()+1)

Don't know.... Just throwing out ideas. I don't know the x86 instruction set too well. Can't stand assembler anymore.

Up to eight L1 reads, likely distinct cachelines, four (conditional) bitscans,

It'd only have to be one cache line, though. Put all 8 rays into one data structure. 8 rays of 8 bytes each equals 64 bytes. One cache line.

xors, some ors sounds not that cheap. Otoh with a polluted L1 it might be better to stay with a few branch misses rather than a lot of L1-misses. I wonder how many % L1 misses are necessary so that and, mul, shr, lookup, becomes slower, reading mask/factor/shiftamount[sq] from one cacheline and the lookup from a second.

Finally, emprical evidence for a particular program/cpu is what counts. And if we are talking about 1%-noise in total program performance, I suggest to take the appraoch you like or even (re-)invent by yourself.

There is always personal preference, yes.

But if it works and is competitive, then I think it could be quite practical. Even perhaps recommended for newbies. It's something they could easily understand.

As for the performance... Who knows... it might be 10% faster. Or 20% slower.

But it is an interesting idea. Everybody has been talking about every other fancy method that I just finally decided I had to ask about this classical 'stupid' method.

Posted: **Sun Aug 26, 2007 1:33 am**

Carey wrote:
Gerd Isenberg wrote: Up to eight L1 reads, likely distinct cachelines, four (conditional) bitscans,
It'd only have to be one cache line, though. Put all 8 rays into one data structure. 8 rays of 8 bytes each equals 64 bytes. One cache line.

Arghhhh. I'm stupid.

I realize what you mean now.

Don't know what I was thinking. Probably not thinking at all.

Yes, it would cause 8 cache lines. But you might be able to reduce that a little.

Maybe adding two new arrays. One for the 'forward' directions and one for the 'reverse' directions. Those would get accessed more often and might still be in the cache? Just a guess, though.

I wonder if you could compress the 8 rays down to just 4, with each one covering both directions. Combining them with the Forward & Reverse arrays would get the direction needed.

Not sure it'd be worth the effort, though.

(shrug)

Posted: **Sun Aug 26, 2007 10:29 am**

Carey wrote:
Gerd Isenberg wrote: The fascinating thing is the number of possible approaches to get sliding attacks. It is still a nice cycle hunting challenge and contest for some of us
There has been a staggering amount of research and results the past couple years.

You and a few others have really revolutionized bitboard data structures.

It's safe to say that if you had tried to do all this back in the days of the ICCA, with its quartlery journal, there wouldn't have been nearly as much progress made.

Thank you for your kind words, I only picked up some ideas from others. Kindergarten bitboards is just a minor extension of Steffan Westcott's collapsed file/rank-index routines. http://www.stmintz.com/ccc/index.php?id=491079

I can't imagine that multiplicative hashing of the occupancies wasn't thought or tried before by the old pioneers, but multiplication was much too expensive that times.

Beside their Inventors, Arthur Samuel, Adelson-Velsky, Slate and Atkins, the bitboard revolutionaries were Bob Hyatt and Ernst Heinz with the rotated bitboards idea. http://en.wikipedia.org/wiki/Bitboard#History

The guy (knows somebody his name?) who wrote the hyperbolica paper in 99, introducing reversed bitboards and the idea (at least to me) to use

Code: Select all

   attacks &#58;= occupied ^ &#40;occupied - 2*rooks&#41;

Albeit it was only used with single sliders, masked rays and shifted squares, the expression works setwise (multiple rooks) in principle - that is with carry or borrow propagation in all eight but not only one direction.
Steffan Westcott, introduced the Kogge-Stone routines - which perform in principle the above expression in all eight directions setwise. To treat sliding attacks as carries of parallel prefix adders. Probably this was the most revolutionary or vanguard approach.

If we classify sliding attack approaches, I would come up with

1.) mapping the relevant occupancy to consecutive bits of an index, to lookup some precalculated tables. Rotated, magic, kindergarten, etc.

2.) arithmetical approaches to direct calculate the attack set. With more or less help of precalculated tables.

Posted: **Sun Aug 26, 2007 10:53 am**

Carey wrote: Don't the newer X2's & Core2's behave differently in this?

Branch prediction has improved a lot. But still it is possible to pollute
some ressources related to it, namely branch target buffer.

Carey wrote: Also, what about the built in compare & swap etc. instruction(s)? Would those be better?

What about a jumpless version. Something to make Mask = $f...f or 0 depending whether Board has any bits at all. Then you could do Bit=Mask&(bsf()+1)

Don't know.... Just throwing out ideas. I don't know the x86 instruction set too well. Can't stand assembler anymore.

We will have fast leading zero count (beside popcount) in K10 and new intels. They may be treated as a branchless bsr^63, since lzc returns 64 for empty sets. For emulating bitscan forward by lzc, intersect the two's complement is necessary, to isolate the least significant bit (if any).

TalkChess.com

BitBoard Tests Magic v Non-Rotated 32 Bits v 64 Bits

Re: BitBoard Tests Magic v Non-Rotated 32 Bits v 64 Bits

Re: BitBoard Tests Magic v Non-Rotated 32 Bits v 64 Bits

Re: BitBoard Tests Magic v Non-Rotated 32 Bits v 64 Bits

Re: BitBoard Tests Magic v Non-Rotated 32 Bits v 64 Bits

Re: BitBoard Tests Magic v Non-Rotated 32 Bits v 64 Bits

Re: BitBoard Tests Magic v Non-Rotated 32 Bits v 64 Bits

Re: BitBoard Tests Magic v Non-Rotated 32 Bits v 64 Bits

Re: BitBoard Tests Magic v Non-Rotated 32 Bits v 64 Bits

Re: BitBoard Tests Magic v Non-Rotated 32 Bits v 64 Bits

Re: BitBoard Tests Magic v Non-Rotated 32 Bits v 64 Bits