Speedup with bitboards on 64-bit CPUs

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
jswaff

Re: Speedup with bitboards on 64-bit CPUs

Post by jswaff » Sat Apr 28, 2007 12:21 am

Tord Romstad wrote: 60% is a lot, but of course we don't know exactly what Rybka is doing. Does anyone have any numbers for Crafty, SmarThink or other bitboard engines?
I don't have an exact figure, but Prophet's speedup is significant. On my 2ghz Core Duo I would typically see NPS's in the low 200's while playing on ICC. On my wife's AMD64 3200+ (2.2ghz I think) it is right around 500knps.

If I get a chance next week I'll try to take some measurements, but I'm sure it's a significant speedup. I know I won't be able to until at least Tuesday evening, but I should have some time after that.

Oh, Prophet is a rotated bitboard engine BTW. Prophet is also open source, you can get it at http://chessprogramming.org/prophet .

--
James

frankp
Posts: 216
Joined: Sun Mar 12, 2006 2:11 pm

Re: Speedup with bitboards on 64-bit CPUs

Post by frankp » Sat Apr 28, 2007 10:38 am

This strikes a cord. I was writing a new 0x88 program for the Palm z22 I bought recently, but it was just too slow (on the pc) so I have gone back to bitboards mainly because this is the way I think. Eval on 0x88 was so hard. I guess it is familiarity that matters. There seems no intrinsic reason why 0x88 would not be as fast. (Bitboards will be no good for the z22 of course.).

To answer your question, IIRC my old bitboard program got the order of 50% speedup moving to amd64 (gcc/Linux), but does not use rotated bitboards.

Frank

bob
Posts: 20557
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Speedup with bitboards on 64-bit CPUs

Post by bob » Sun Apr 29, 2007 12:34 am

Tord Romstad wrote:
hgm wrote:I would not expect any speedup from bitboards compaired to mailbox, even in 64-bit mode, if you jsut use them to replace the move generator. For bulk move generation bitboards cannot beat mailbox. They are very much faster than mailbox for generating moves selectively (e.g. only captures, only checks), though. But if your program should be doing that a lot in order to see any benefits.
I do generate moves incrementally with bitboards, and have separate move generation functions for captures, checks, non-captures and check evasions. But move generation doesn't consume a significant fraction of the CPU time in my program in any case, neither with bitboards nor with mailbox. The places where bitboards seem to be significantly slower are king safety evaluation, scanning for attacks in a single direction, and making and unmaking moves.

Tord
I suspect you just need some "seat time" to get acclimated. "scanning for attacks" is a bad idea. We don't want to "scan" for anything in bitboards...

bob
Posts: 20557
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Speedup with bitboards on 64-bit CPUs

Post by bob » Sun Apr 29, 2007 12:38 am

Michael Sherwin wrote:
jdart wrote:Arasan uses bitboards extensively (including rotated). The 64-bit version is faster but not by much (10-25%). I am not sure why.

--Jon
Rotated bitboards use lots and lots of multidimentional look-ups which is great for 32 bit processors, because it keeps 64 bit access to a minimum. However, when compiled for 64 bits all the same array indexing must be done and 64 bit access is still minimized. There are other bitboard methods that minimize look-ups and maximize 64 operations that run much faster on 64 bit processors when compiled to 64 bit code.
I haven't found anything "much faster". I tested magic move generation and found it to be identical in speed with the rotated bitmaps I used in older versions. I converted to magic because they are simpler, but not because they are faster. There are a couple of advantages about being able to update one "occupied" bitmap and get moves from that, but for normal stuff, the two approaches are really equal if done properly. And since move generation is a small part of most chess programs, it doesn't make a hill of beans worth of difference no matter what approach you use, bitmaps or mailbox. The advantage of bitmaps lies elsewhere.

Michael Sherwin
Posts: 3041
Joined: Fri May 26, 2006 1:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Speedup with bitboards on 64-bit CPUs

Post by Michael Sherwin » Sun Apr 29, 2007 2:01 am

bob wrote:
Michael Sherwin wrote:
jdart wrote:Arasan uses bitboards extensively (including rotated). The 64-bit version is faster but not by much (10-25%). I am not sure why.

--Jon
Rotated bitboards use lots and lots of multidimentional look-ups which is great for 32 bit processors, because it keeps 64 bit access to a minimum. However, when compiled for 64 bits all the same array indexing must be done and 64 bit access is still minimized. There are other bitboard methods that minimize look-ups and maximize 64 operations that run much faster on 64 bit processors when compiled to 64 bit code.
I haven't found anything "much faster". I tested magic move generation and found it to be identical in speed with the rotated bitmaps I used in older versions. I converted to magic because they are simpler, but not because they are faster. There are a couple of advantages about being able to update one "occupied" bitmap and get moves from that, but for normal stuff, the two approaches are really equal if done properly. And since move generation is a small part of most chess programs, it doesn't make a hill of beans worth of difference no matter what approach you use, bitmaps or mailbox. The advantage of bitmaps lies elsewhere.
Magic is very interesting, no doubt about that, but the 64 bit multiplication is exspensive and the huge data tables that are required stresses the resources available to the processor on lesser hardware. Newer processors with humungous cashes are a different story. What also matters is how 'large the eval is compared to the move generator. In programs with smaller evals the speed difference is substantial. When I compiled Crafty to 64 bits on typical hardware there was about a 30% speedup. And for my program, RomiChess, the speedup going to 64 bit was 50%. I use the real old fashion bitboards that you describe in one of your papers. Although, I reinvented them prior to reading that paper.

I have developed (but not fully implemented) two new bitboard move generators that are faster than magic on both 32 bit and 64 bit machines. I hope to finish this work and publish some results in the coming months.
I hate if statements. Pawns demand if statements. Therefore I hate pawns.

bob
Posts: 20557
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Speedup with bitboards on 64-bit CPUs

Post by bob » Mon Apr 30, 2007 12:41 am

Michael Sherwin wrote:
bob wrote:
Michael Sherwin wrote:
jdart wrote:Arasan uses bitboards extensively (including rotated). The 64-bit version is faster but not by much (10-25%). I am not sure why.

--Jon
Rotated bitboards use lots and lots of multidimentional look-ups which is great for 32 bit processors, because it keeps 64 bit access to a minimum. However, when compiled for 64 bits all the same array indexing must be done and 64 bit access is still minimized. There are other bitboard methods that minimize look-ups and maximize 64 operations that run much faster on 64 bit processors when compiled to 64 bit code.
I haven't found anything "much faster". I tested magic move generation and found it to be identical in speed with the rotated bitmaps I used in older versions. I converted to magic because they are simpler, but not because they are faster. There are a couple of advantages about being able to update one "occupied" bitmap and get moves from that, but for normal stuff, the two approaches are really equal if done properly. And since move generation is a small part of most chess programs, it doesn't make a hill of beans worth of difference no matter what approach you use, bitmaps or mailbox. The advantage of bitmaps lies elsewhere.
Magic is very interesting, no doubt about that, but the 64 bit multiplication is exspensive and the huge data tables that are required stresses the resources available to the processor on lesser hardware. Newer processors with humungous cashes are a different story. What also matters is how 'large the eval is compared to the move generator. In programs with smaller evals the speed difference is substantial. When I compiled Crafty to 64 bits on typical hardware there was about a 30% speedup. And for my program, RomiChess, the speedup going to 64 bit was 50%. I use the real old fashion bitboards that you describe in one of your papers. Although, I reinvented them prior to reading that paper.

I have developed (but not fully implemented) two new bitboard move generators that are faster than magic on both 32 bit and 64 bit machines. I hope to finish this work and publish some results in the coming months.
I compared magic to rotated on my 32 bit PIV xeon box at my office, and using more core-2 laptop. They were pretty much break-even when run on either of those two (both generators compared on the same processor of course, not PIV vs core2...

Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 8:19 pm
Location: Oslo, Norway

Re: Speedup with bitboards on 64-bit CPUs

Post by Tord Romstad » Mon Apr 30, 2007 3:01 pm

Pradu wrote:Making and unmaking moves is slow for bitboard regardless. But perhaps you can optimize your kingsafety code with either flood-filling or specialized magic routines and have it be significantly faster than mailbox for this.
Flood-filling would be terribly slow on 32-bit CPUs, wouldn't it? At the moment, even though I would be happy to see a significant speedup on 64-CPUs, the performance on 32-bit hardware is still by far the most important. The majority of computer users will still run their chess programs on 32-bit CPUs for several more years, I think.

Specialized magic routines might be an option, but that would probably force me to add one or two new big tables to my program, which seems unattractive. Nevertheless, it is possible that I will eventually try something like this.
I haven't looked into Glaurung to see how you do kingsafety but I suppose it is the usual way of counting the number/type of attackers to squares around the king.
Yes and no. This is what Glaurung 1.x does. In Glaurung 2.x, I am currently using a cheap surrogate king safety. I don't consider the attackers or defenders for each square around the king separately, because I haven't found a fast way to compute it. Instead, I just find the number and types of pieces which attack some square around the king, without bothering to find out precisely which squares each piece attacks. This is a lot faster, but of course nowhere near as good.
It is quite strange scanning for attacks in a single direction is faster for you, do you have a SEE to bench for mailbox and bitboards?
My SEE runs at similar speed with both representations, I think, but SEE is a poor benchmark for finding attacks in a single direction. SEE isn't about finding attacks in a single direction, but about finding all pieces attacking a given square. I do this in entirely different ways with mailbox and bitboard code: With mailbox code, I loop through each piece and check whether the piece attacks the destination square. With bitboards, this approach is extremely slow, and I do almost the opposite: I begin with the destination square and compute an attackers bitboard by generating sliding attack bitboards in all directions from the square and ANDing the attack bitboards with sliding pieces.

A better example of a basic operation where I struggle to make bitboards perform well is testing whether a move gives check.

Tord

Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 8:19 pm
Location: Oslo, Norway

Re: Speedup with bitboards on 64-bit CPUs

Post by Tord Romstad » Mon Apr 30, 2007 3:14 pm

bob wrote:I suspect you just need some "seat time" to get acclimated.
I've had plenty of seat time by now. In fact, I have almost certainly spent more time on the bitboard version than on the mailbox version of Glaurung 2, and the mailbox version is beginning to lag behind.
"scanning for attacks" is a bad idea. We don't want to "scan" for anything in bitboards...
Yes, "scan" was a bad choice of verb. Let me be more precise: I haven't found a cheap way to test whether a sliding piece on square X attacks square Y with bitboards. Bitboards are nice if I want to find all squares a sliding piece attacks, but I hardly ever want that (except when doing mobility evaluation, but then the expense of counting the nonzero bits is big enough to offset the speed of generating the attack bitboards).

But please, let's not make this thread degenerate into yet another bitboards vs mailbox fight. All I wanted to know was really how much of a speed improvement I could expect on a 64-bit CPU, and I have received several nice estimates of that by now. :)

Tord

Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 8:19 pm
Location: Oslo, Norway

Re: Speedup with bitboards on 64-bit CPUs

Post by Tord Romstad » Mon Apr 30, 2007 3:22 pm

frankp wrote:This strikes a cord. I was writing a new 0x88 program for the Palm z22 I bought recently, but it was just too slow (on the pc) so I have gone back to bitboards mainly because this is the way I think. Eval on 0x88 was so hard. I guess it is familiarity that matters. There seems no intrinsic reason why 0x88 would not be as fast. (Bitboards will be no good for the z22 of course.).
In my opinion, 0x88 is obsolete. It is similar to mailbox, but less intuitive and somewhat slower, without offering any advantages in compensation.

Personally, I can neither think in terms of bitboards nor in terms of mailbox boards. The only way I can think about chess programming is in terms of higher-level terms. I try to isolate the the parts of my program that make use of the internals of my board representation to a few low-level functions, and keep the interesting parts of my search and evaluation agnostic to the board representation. Most of my code is exactly the same with bitboards and mailbox.
To answer your question, IIRC my old bitboard program got the order of 50% speedup moving to amd64 (gcc/Linux), but does not use rotated bitboards.
Not bad! If my results turn out to be similar, it is likely that my bitboard version will be faster than the mailbox version on a 64-bit CPU.

Thanks to everyone for the answers! I'm getting more optimistic now.

Tord

bob
Posts: 20557
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Speedup with bitboards on 64-bit CPUs

Post by bob » Mon Apr 30, 2007 5:08 pm

Tord Romstad wrote:
bob wrote:I suspect you just need some "seat time" to get acclimated.
I've had plenty of seat time by now. In fact, I have almost certainly spent more time on the bitboard version than on the mailbox version of Glaurung 2, and the mailbox version is beginning to lag behind.
"scanning for attacks" is a bad idea. We don't want to "scan" for anything in bitboards...
Yes, "scan" was a bad choice of verb. Let me be more precise: I haven't found a cheap way to test whether a sliding piece on square X attacks square Y with bitboards. Bitboards are nice if I want to find all squares a sliding piece attacks, but I hardly ever want that (except when doing mobility evaluation, but then the expense of counting the nonzero bits is big enough to offset the speed of generating the attack bitboards).

But please, let's not make this thread degenerate into yet another bitboards vs mailbox fight. All I wanted to know was really how much of a speed improvement I could expect on a 64-bit CPU, and I have received several nice estimates of that by now. :)

Tord
One solution for that we used in Cray Blitz. Set up an array called "intervening squares", as intervening[64][64]. For any two subscripts intervening[a], you set the corresponding bitmap to contain 1 bits between the two squares. For example, intervening[a1][h1] would have 1 bits on b1, c1, d1, e1, f1 and g1. For a rook or queen to attack h1 from a1, the intervening squares must be empty. Simply AND this bitmap with the occupied squares and if the result is zero, the attack is present, otherwise it is blocked by an intervening piece...

That's why I nitpicked the "scan" word. This is more about "sets" than "scanning". The above bitmap is the set of squares that must be empty for a piece on sq1 to be able to slide to / attack sq2...

There are other ways if you don't like the 4K array that I could explain. A little more computation but no big array is needed.

BTW I've been doing bitmaps for 13 years now and I _still_ learn something new every time I look hard..

Post Reply