Gerd Isenberg wrote:
The fascinating thing is the number of possible approaches to get sliding attacks. It is still a nice cycle-hunting challenge and contest for some of us.
There has been a staggering amount of research and results over the past couple of years.
You and a few others have really revolutionized bitboard data structures.
It's safe to say that if you had tried to do all this back in the days of the ICCA, with its quarterly journal, there wouldn't have been nearly as much progress made.
The Pitty method you mentioned (I was not aware of that name) is still worth
The method is so old I doubt it has a name.
That was probably the way Slate & Atkin did some bitboard stuff way back in the selective-search Chess 3.x. At the very least, it comes pretty directly from their Chess 4.x and Atkin's Chess 0.5 program.
I called it the 'Pitty method' for several reasons, including that if you try to do rotated bitboards for your very first chess program, somebody needs to take pity on you and suggest a simpler way.
(It is worth pointing out that when he told me he had tried this method, I was definitely surprised he got good results. With the branch misses and the bsf/bsr's, I did not expect decent performance.)
to consider with today's faster bitscans (core2duo, K10 - K8 has slow 9-10 cycle vector-path bitscans - not to mention P4). Especially if you need
That was kind of my feeling too. But since I don't have a 64-bit system (or even a fast 32-bit system), I don't really know what the performance would be. Possibly pretty good, though.
one direction only. It is, for instance, great for getting the source square of direction-wise target sets from Kogge-Stone fills, with different cases (bsr/bsf) for positive and negative directions.
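For anybody following along, here is a rough C sketch of what one such occluded Kogge-Stone fill looks like, for the south direction (the function name is just mine):

    typedef unsigned long long U64;

    /* Occluded Kogge-Stone fill towards south: the sliders in 'gen' flood
     * downwards through the empty squares in 'pro' (propagator = ~occupied).
     * Three shift/and/or steps cover all seven possible ray lengths. */
    U64 southOccludedFill(U64 gen, U64 pro) {
        gen |= pro & (gen >>  8);
        pro &=       (pro >>  8);
        gen |= pro & (gen >> 16);
        pro &=       (pro >> 16);
        gen |= pro & (gen >> 32);
        return gen;
    }

Given a target from such a direction-wise set, one bitscan along the opposite ray then recovers the source square, with bsf or bsr chosen by the sign of the direction, as described above.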
The advantage is the small 4-KByte lookup array and moderate register
Your 1.5k approach is better. That's what I prefer, even if it's not as fast as other methods. It's just more elegant.
usage. Also, it deals with only the one occupied bitboard and no rotated ones, and is therefore as versatile as magics. Drawbacks are the four branches, with some chance of mispredictions now and then, and the sequential nature.
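To make sure we are talking about the same thing, here is the classical ray lookup as I understand it, as a rough C sketch of one positive-direction case (the table name is mine; GCC's __builtin_ctzll stands in for bsf):

    typedef unsigned long long U64;

    U64 rayAttacks[8][64];   /* precomputed: 8 dirs * 64 sqs * 8 bytes = 4 KByte */

    /* Classical ray lookup for one positive direction (e.g. north or east):
     * take the empty-board ray, find the first blocker with bsf, and xor
     * away everything behind it. Negative directions use bsr instead. */
    U64 getPositiveRayAttacks(U64 occupied, int dir, int sq) {
        U64 attacks  = rayAttacks[dir][sq];
        U64 blockers = attacks & occupied;
        if (blockers) {                            /* one of the four branches */
            int first = __builtin_ctzll(blockers); /* bsf */
            attacks ^= rayAttacks[dir][first];     /* cut the ray at the blocker */
        }
        return attacks;
    }

Rook attacks are then the union of two positive (bsf) and two negative (bsr) ray cases, which is where the four potentially mispredicted branches come from.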
Don't the newer X2's & Core2's behave differently in this respect?
Also, what about the built-in compare-and-swap etc. instruction(s)? Would those be better?
What about a jumpless version? Something to make Mask = $f...f or 0 depending on whether Board has any bits set at all. Then you could do Bit = Mask & (bsf()+1).
Don't know.... Just throwing out ideas. I don't know the x86 instruction set too well. Can't stand assembler anymore.
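Something like this rough C sketch, maybe, reusing the hypothetical rayAttacks table from above (completely unmeasured):

    typedef unsigned long long U64;
    extern U64 rayAttacks[8][64];   /* same hypothetical 4-KByte table as above */

    /* Jumpless variant of one positive-direction case: build an all-ones
     * or all-zero mask from "any blockers at all?", and keep the bitscan
     * input nonzero so it is defined even for an empty blocker set. */
    U64 positiveRayAttacksBranchless(U64 occupied, int dir, int sq) {
        U64 attacks  = rayAttacks[dir][sq];
        U64 blockers = attacks & occupied;
        U64 mask     = -(U64)(blockers != 0);       /* all ones iff any blocker */
        U64 safe     = blockers | (blockers == 0);  /* force a valid bsf input  */
        int first    = (int)__builtin_ctzll(safe);  /* bsf                      */
        return attacks ^ (mask & rayAttacks[dir][first]); /* no-op if no blockers */
    }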
Up to eight L1 reads, likely from distinct cache lines, four (conditional) bitscans,
It'd only have to be one cache line, though. Put all 8 rays into one data structure. 8 rays of 8 bytes each equals 64 bytes. One cache line.
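In rough C, that packing might look like this (the alignment attribute is GCC-specific):

    typedef unsigned long long U64;

    /* All eight 8-byte rays for one square packed into a single 64-byte,
     * cacheline-aligned block, so one square's rays cost one L1 line. */
    struct RaySet {
        U64 ray[8];              /* e.g. N, NE, E, SE, S, SW, W, NW */
    } __attribute__((aligned(64)));

    static struct RaySet raySets[64];  /* 64 squares * 64 bytes = 4 KByte total */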
xors, and some ors does not sound that cheap. OTOH, with a polluted L1 it might be better to stay with a few branch misses rather than a lot of L1 misses. I wonder what percentage of L1 misses is necessary before and, mul, shr, lookup becomes slower, reading mask/factor/shiftamount[sq] from one cache line and doing the lookup from a second.
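For reference, the and, mul, shr, lookup sequence being weighed there is the usual magic-multiplication pattern; a rough C sketch with hypothetical table names:

    typedef unsigned long long U64;

    extern U64 rookMask[64];        /* relevant-occupancy masks              */
    extern U64 rookFactor[64];      /* magic multipliers                     */
    extern int rookShift[64];       /* 64 minus the number of index bits     */
    extern U64 *rookAttackPtr[64];  /* per-square slices of the attack table */

    /* mask/factor/shift come from one cache line's worth of per-square
     * data; the final table read then touches a second line */
    U64 rookAttacks(U64 occupied, int sq) {
        U64 index = ((occupied & rookMask[sq]) * rookFactor[sq]) >> rookShift[sq];
        return rookAttackPtr[sq][index];
    }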
Finally, empirical evidence for a particular program/CPU is what counts. And if we are talking about 1% noise in total program performance, I suggest taking the approach you like, or even (re-)inventing one yourself.
There is always personal preference, yes.
But if it works and is competitive, then I think it could be quite practical. Perhaps even recommended for newbies. It's something they could easily understand.
As for the performance... Who knows... it might be 10% faster. Or 20% slower.
But it is an interesting idea. Everybody has been talking about every other fancy method, so I finally decided I had to ask about this classical 'stupid' method.