Do bitboards save time, or are they just space-efficient?

Discussion of chess software programming and technical issues.

Moderator: Ras

Chessnut1071
Posts: 313
Joined: Tue Aug 03, 2021 2:41 pm
Full name: Bill Beame

Do bitboards save time, or are they just space-efficient?

Post by Chessnut1071 »

I developed my engine using pre-computed byte tables. It requires loops to search through the 8 vectors, 4 for rooks, 4 for bishops and 8 for queens, from a given position, but it doesn't compute anything. How can bitboards save time against pre-computed tables? I understand you only need one machine cycle to shift bits, but sooner or later you have to convert back, guard against going off the board, and account for blocked enemy and friendly pieces. Do they really save time, or are they just space-efficient?
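For reference, a minimal sketch of the loop-over-precomputed-vectors approach described above (hypothetical names such as ray, board and rook_targets; assumes plain 0..63 square indices and a ray table filled once at startup):

    #include <cstdint>

    // Precomputed at startup: ray[dir][from] lists the squares reachable from
    // 'from' in one direction, terminated by -1, so no edge test is needed here.
    extern int8_t ray[8][64][8];
    extern int    board[64];   // 0 = empty, >0 = own piece, <0 = enemy piece

    // Collect rook target squares for the side to move from square 'from'.
    int rook_targets(int from, int targets[14])
    {
        int n = 0;
        for (int dir = 0; dir < 4; ++dir) {                // 4 rook directions
            for (int i = 0; ray[dir][from][i] >= 0; ++i) {
                int to = ray[dir][from][i];
                if (board[to] > 0) break;                  // own piece blocks the ray
                targets[n++] = to;
                if (board[to] < 0) break;                  // capture ends the ray
            }
        }
        return n;
    }

The per-square loop bodies and the blocker/terminator tests here are exactly the work that the bitboard replies below push into table construction.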
emadsen
Posts: 441
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Re: Do bitboards save time, or are they just space-efficient?

Post by emadsen »

Chessnut1071 wrote: Fri Oct 29, 2021 4:13 am How can bitboards save time against pre-computed tables? I understand you only need one machine cycle to shift bits, but sooner or later you have to convert back, guard against going off the board and account for blocked enemy and friendly pieces.
Sooner. The guarding against going off the board and the handling of blocked pieces happen at engine startup and never again.
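A minimal sketch of what that looks like for a simple leaper table (hypothetical names; the off-board test runs only while the table is filled, once at startup):

    #include <cstdint>

    uint64_t knight_attacks[64];

    void init_knight_attacks()              // called once at engine startup
    {
        const int df[8] = { 1,  2,  2,  1, -1, -2, -2, -1 };
        const int dr[8] = { 2,  1, -1, -2, -2, -1,  1,  2 };
        for (int sq = 0; sq < 64; ++sq) {
            uint64_t bb = 0;
            int f = sq & 7, r = sq >> 3;
            for (int i = 0; i < 8; ++i) {
                int tf = f + df[i], tr = r + dr[i];
                if (tf >= 0 && tf < 8 && tr >= 0 && tr < 8)   // the only off-board check
                    bb |= 1ULL << (tr * 8 + tf);
            }
            knight_attacks[sq] = bb;
        }
    }

During search, knight_attacks[sq] is a single load with no bounds checks left in it.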
Erik Madsen | My C# chess engine: https://www.madchess.net
hgm
Posts: 28433
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Do bitboards save time, or are they just space-efficient?

Post by hgm »

The usual implementation of bitboards already accounts for blocking pieces and the board edge in the tabulated attack boards, so the conditions you mention never have to be tested.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Do bitboards save time, or are they just space-efficient?

Post by dangi12012 »

Chessnut1071 wrote: Fri Oct 29, 2021 4:13 am I developed my engine using pre-computed byte tables. It requires loops to search through the 8 vectors, 4 for rooks, 4 for bishops and 8 for queens, from a given position, but it doesn't compute anything. How can bitboards save time against pre-computed tables? I understand you only need one machine cycle to shift bits, but sooner or later you have to convert back, guard against going off the board, and account for blocked enemy and friendly pieces. Do they really save time, or are they just space-efficient?
With bitboards you also use a pre-computed table. It contains all relevant blocking configurations for each square. This means you can do a rook lookup in about 5 branchless instructions: the bitboard contains the occupancy, and together with a mask you index directly into the table (via a hardware instruction or perfect hashing).
https://www.chessprogramming.org/Magic_ ... hift_Fancy

If you don't want to initialize anything, here is a constexpr version:
https://raw.githubusercontent.com/Gigan ... ovemap.hpp

So, in summary: the time saved is enormous, and no slider loop will ever be as fast as bitboards.
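For illustration, a minimal sketch of a fancy-magic rook lookup along the lines of the linked page (names are placeholders, not taken from the linked code; the masks, magics, shifts and attack table are assumed to have been filled at startup):

    #include <cstdint>

    struct Magic {
        uint64_t  mask;      // relevant blocker squares for this rook square
        uint64_t  magic;     // multiplier found offline by trial search
        uint64_t *attacks;   // slice of the shared attack table
        int       shift;     // 64 minus the number of relevant bits
    };

    extern Magic rook_magics[64];

    inline uint64_t rook_attacks(int sq, uint64_t occupied)
    {
        const Magic &m = rook_magics[sq];
        // Mask, multiply, shift, load: a handful of branchless instructions.
        // With BMI2, the multiply/shift pair can be replaced by _pext_u64(occupied, m.mask).
        return m.attacks[((occupied & m.mask) * m.magic) >> m.shift];
    }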
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Chessnut1071
Posts: 313
Joined: Tue Aug 03, 2021 2:41 pm
Full name: Bill Beame

Re: Do bitboards save time, or are they just space-efficient?

Post by Chessnut1071 »

dangi12012 wrote: Fri Oct 29, 2021 3:09 pm
Chessnut1071 wrote: Fri Oct 29, 2021 4:13 am I developed my engine using pre-computed byte tables. It requires loops to search through the 8 vectors, 4 for rooks, 4 for bishops and 8 for queens, from a given position, but it doesn't compute anything. How can bitboards save time against pre-computed tables? I understand you only need one machine cycle to shift bits, but sooner or later you have to convert back, guard against going off the board, and account for blocked enemy and friendly pieces. Do they really save time, or are they just space-efficient?
With bitboards you also use a pre-computed table. It contains all relevant blocking configurations for each square. This means you can do a rook lookup in about 5 branchless instructions: the bitboard contains the occupancy, and together with a mask you index directly into the table (via a hardware instruction or perfect hashing).
https://www.chessprogramming.org/Magic_ ... hift_Fancy

If you don't want to initialize anything, here is a constexpr version:
https://raw.githubusercontent.com/Gigan ... ovemap.hpp

So, in summary: the time saved is enormous, and no slider loop will ever be as fast as bitboards.
Do you have an estimate of the time saved vs. pre-computed byte tables? I spent 2 years developing and redeveloping my engine. It looks like another year here. It never ends.
Bo Persson
Posts: 262
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: Do bitboards save time, or are they just space-efficient?

Post by Bo Persson »

Chessnut1071 wrote: Fri Oct 29, 2021 3:25 pm I spent 2 years developing and redeveloping my engine. It looks like another year here. It never ends.
Right. The fun goes on forever. :D
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Do bitboards save time, or are they just space-efficient?

Post by dangi12012 »

Chessnut1071 wrote: Fri Oct 29, 2021 3:25 pm
dangi12012 wrote: Fri Oct 29, 2021 3:09 pm
Chessnut1071 wrote: Fri Oct 29, 2021 4:13 am I developed my engine using pre-computed byte tables. It requires loops to search through the 8 vectors, 4 for rooks, 4 for bishops and 8 for queens, from a given position, but it doesn't compute anything. How can bitboards save time against pre-computed tables? I understand you only need one machine cycle to shift bits, but sooner or later you have to convert back, guard against going off the board, and account for blocked enemy and friendly pieces. Do they really save time, or are they just space-efficient?
With bitboards you also use a pre-computed table. It contains all relevant blocking configurations for each square. This means you can do a rook lookup in about 5 branchless instructions: the bitboard contains the occupancy, and together with a mask you index directly into the table (via a hardware instruction or perfect hashing).
https://www.chessprogramming.org/Magic_ ... hift_Fancy

If you don't want to initialize anything, here is a constexpr version:
https://raw.githubusercontent.com/Gigan ... ovemap.hpp

So, in summary: the time saved is enormous, and no slider loop will ever be as fast as bitboards.
Do you have an estimate of the time saved vs. pre-computed byte tables? I spent 2 years developing and redeveloping my engine. It looks like another year here. It never ends.
I can only say that I went bitboard-only - not a single array. The perft speed started at 200 Mnps, which is as fast as I could ever manage with arrays. With all the C++ optimisations I could get to 2 billion moves/s.

I think what's important here is that you say sooner or later you have to convert back into an array - but with bitboards, one array becomes a single 64-bit register, so you don't need to maintain both.
So where a knight can go, in terms of bitboards, is Lookup::Knight[square] & ~brd.White;
For sliders it's Lookup::Rook[square, brd.Occupy] & ~brd.White;
assuming White is to move.

So in both cases it's 3-5 branchless x86-64 instructions - which is very fast. Think about how many instructions and variables you need just to enter a loop body once. If you have ifs inside that loop, you are already an order of magnitude slower than a pure bitboard approach.
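A minimal sketch of how such a lookup result is consumed (hypothetical names, C++20): the whole target set sits in one 64-bit register, and the serialization loop contains no bounds checks or blocker tests.

    #include <bit>
    #include <cstdint>

    template <typename EmitMove>
    void emit_moves(int from, uint64_t attacks_from_sq, uint64_t own_pieces, EmitMove emit)
    {
        // e.g. attacks_from_sq = Lookup::Knight[from], own_pieces = brd.White
        uint64_t targets = attacks_from_sq & ~own_pieces;
        while (targets) {
            int to = std::countr_zero(targets);   // index of the lowest set bit
            targets &= targets - 1;               // clear that bit
            emit(from, to);
        }
    }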
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Harald
Posts: 318
Joined: Thu Mar 09, 2006 1:07 am

Re: Do bitboards save time, or are they just space-efficient?

Post by Harald »

The real bitboard fun starts when you learn how to check evaluation patterns with the bitboards stored in your position or board struct. A few bitwise operations (&, |, ~) and shifts (<<, >>) can save you a lot of ifs and loops.

Example 1:
int some_king_safety_term = popcount(bb_king_attacked_squares[white_king_pos] & bb_white_pawns) * some_factor;

Example 2:
Bitboard bb_pawn_rams = bb_white_pawns & (bb_black_pawns >> 8); // or << 8 depending on your board encoding
Bitboard bb_center_pawn_rams = bb_pawn_rams & 0x00003e3e3e3e0000; // or use some macro for the constant to make the compiler happy
...
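The two snippets above, made self-contained for clarity (hypothetical names; assumes a little-endian layout with a1 = bit 0, so "directly in front of a white pawn" means shifting the black pawns down by 8):

    #include <bit>
    #include <cstdint>

    using Bitboard = uint64_t;

    extern Bitboard bb_king_attacked_squares[64];  // squares adjacent to each king square
    extern Bitboard bb_white_pawns, bb_black_pawns;

    // Example 1: own pawns shielding the white king, weighted by a tuning factor.
    int king_shelter_term(int white_king_pos, int some_factor)
    {
        return std::popcount(bb_king_attacked_squares[white_king_pos] & bb_white_pawns)
               * some_factor;
    }

    // Example 2: white pawns blocked head-on by a black pawn, restricted to a central mask.
    Bitboard center_pawn_rams()
    {
        Bitboard bb_pawn_rams = bb_white_pawns & (bb_black_pawns >> 8);
        return bb_pawn_rams & 0x00003e3e3e3e0000ULL;
    }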
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: Do bitboards save time, or are they just space-efficient?

Post by pedrojdm2021 »

It's easier to mask out things like moves, positions, and so on...

For example, for pawn promotion I can make a bitboard for rank 8 and another for rank 1, where each bitboard has its bits set on those rank squares,
and then I can simply do a bitwise operation, which is faster than, for example, a for loop over a normal list or something like that.
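A minimal sketch of that idea (hypothetical names; assumes a1 = bit 0, so white pawns push with << 8 and rank 8 is the top byte):

    #include <cstdint>

    constexpr uint64_t RANK_1 = 0x00000000000000FFULL;  // black's promotion rank
    constexpr uint64_t RANK_8 = 0xFF00000000000000ULL;  // white's promotion rank

    // Single white pawn pushes whose target lands on rank 8 are promotions;
    // one shift and two ANDs, no loop over a piece list.
    inline uint64_t white_promotion_pushes(uint64_t white_pawns, uint64_t empty)
    {
        uint64_t single_pushes = (white_pawns << 8) & empty;
        return single_pushes & RANK_8;
    }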

I think bitboards are just a "framework" that helps you do things; the biggest part of the performance will be in your algorithms and, of course, the language of choice. For example, you will hardly see good performance if you do it in pure Python rather than in C or C++, but again, more than 60% of your application's performance will depend on how the algorithms are written.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Do bitboards save time, or are they just space-efficient?

Post by diep »

Chessnut1071 wrote: Fri Oct 29, 2021 4:13 am I developed my engine using pre-computed byte tables. It requires loops to search through the 8 vectors, 4 for rooks, 4 for bishops and 8 for queens, from a given position, but it doesn't compute anything. How can bitboards save time against pre-computed tables? I understand you only need one machine cycle to shift bits, but sooner or later you have to convert back, guard against going off the board, and account for blocked enemy and friendly pieces. Do they really save time, or are they just space-efficient?
The old uncle Bob discussion - where has he gone?

I don't think move generation is the big issue once your evaluation function grows larger.
Bitboards are difficult to program chess knowledge into.

A completely different question is how to use AVX effectively. That is seemingly 4 lanes of 64 bits, yet critical instructions are missing in AVX.

One thing completely missing in AVX, last time I checked (I could be wrong there), is multiplying 64-bit integers by 64-bit integers with a 128-bit result,
even if you are willing to do it with 2 instructions.

Now that possibly has nothing to do with computer chess, but it's a severe problem with AVX. If you implement any sort of FFT-type thing, the integer equivalent would be an NTT (number theoretic transform). That needs a 64 x 64 = 128-bit multiplication, even if you split it up into 2 instructions (one for the high bits and one for the low bits), whereas the floating-point equivalent for multiplying doubles is there and has good throughput.
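For reference, the operation in question in scalar form, as a sketch using the GCC/Clang __int128 extension on a 64-bit target; this is the product the post says has no AVX counterpart:

    #include <cstdint>

    // Full 64 x 64 = 128-bit product, split into high and low halves.
    inline void mul_64x64_to_128(uint64_t a, uint64_t b, uint64_t &hi, uint64_t &lo)
    {
        unsigned __int128 p = (unsigned __int128)a * b;
        lo = (uint64_t)p;
        hi = (uint64_t)(p >> 64);
    }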

Yet doing things with doubles when the problem itself is in integers - mwah - do you really want to do that?

For many years, even on 64-bit processors, you could execute far more 32-bit instructions per clock than 64-bit instructions per clock. This has been made obsolete by AVX, and of course Intel has been toying with AVX-512 for many years already without it being much faster, except on their manycore parts.

Yet for a chess program on a manycore-type chip, the underlying problem is that working with array lookups becomes totally impossible as soon as you vectorize it, or you face serious slowdowns.

So, for example, if we have a manycore chip - either a GPU or some other sort of manycore - let me show some code of mine in CUDA:

while (nextentry != 2048) {
    uint32 value    = scratch[nextentry],
           bitindex = (value >> 5) & 511,
           bitmask  = (1 << (value & 31)) | (1 << ((value >> 14) & 31));

    nextentry = value >> 20;      // index of the next entry is packed into the top bits
    bit[bitindex] |= bitmask;     // set the bits encoded in this entry
}

This is some code from a prime sieve of mine (not the sieve that generates small primes, but the discrete-logarithm-cracking algorithm that normally runs in O(sqrt(n))).

The whole big deal here is:
value = scratch[nextentry];

Realize that in CUDA a vector of 32 cores executes this, yet in reality a 'core' is just a lane of the vector.
AVX, for comparison, would have 8 integers of 32 bits there, so in manycore terminology that's 8 'cores'. I'm speaking in broad gestures here, seen from a big distance.

As we see in 'nextentry = (value >> 20);',
each of the 32 CUDA cores (the minimum size of a warp in CUDA) might end up with a different index there.

On all hardware this is ugly slow if you do it from a vector,
as it means the CPU, GPU or manycore chip has to retrieve a different cache line for each individual vector lane.
You get punished severely there, because that lookup is terribly slow.

Power-wise it's easy to make the registers very wide, so a vector of 16 integers of 32 bits is not really a problem with AVX-512.
That is nearly the size of what Nvidia uses for its GPUs, where a warp has a minimum vector size of 32.

So, comparing bitboards with how I do move generation: it's practically the same speed.

If you want to get a lot faster, you'll need to find your way to vectorizing things. The question is whether computer chess is worth all that effort.
When vectorizing you can't use tables, except when you read the entire vector from the same cache line. On Nvidia you can in fact do this non-sequentially as well: for example, if vector lane 17 looks up entry 15 and CUDA core 10 looks up entry 13, that's fine, as long as no individual CUDA core looks up, at the same atomic time, something that another CUDA core is looking up at that time.
So that's an additional constraint you will never read about for CPUs, simply because the 'vectors' are one whole in AVX, let alone AVX-512.
Power-wise it's cheaper to make new CPUs with wider vectors - yet what is then the difference between a CPU and a GPU?
In short, GPUs are ugly slow at doing random lookups into arrays, because all 32 CUDA cores will then access something like 32 different cache lines at the same time, which obviously is a lot slower.
In AVX you already 'avoid' this problem, as you look up the entire 256 bits at once.
Yet how happy are you with that?

So whatever we write here about the ancient past of 32 bits versus 64 bits is totally useless, given that you also have AVX at your disposal.

But if you intend not to vectorize and to do things in the classical manner, then the major difference is how much knowledge you want to program. If you want to keep it stupid and simple, bitboards hold, quite literally, 1 bit of information per square. If you want more information, non-bitboards are more interesting.

I'd vote for the latter in that case.

I want an ARM CPU with an L1 and L2 cache for each core and 2048 cores in total. Where is that CPU?