OliverBr wrote: ↑Mon Aug 31, 2020 10:11 pm
Funny fact: As it is known there is a direct 1:1 translation of OliThink into Java.
Since Version 5.6.9 the built-in-pop-count method is used, I had a look to the Java version and found this deep in the Java runtime library:
Code: Select all
public static int bitCount(long i) {
// HD, Figure 5-14
i = i - ((i >>> 1) & 0x5555555555555555L);
i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
i = i + (i >>> 8);
i = i + (i >>> 16);
i = i + (i >>> 32);
return (int)i & 0x7f;
}
It's quite fast. Interesting!
Yes, it is quite fast, especially if there are many bits set. If we do a quick C++ translation (so gcc can understand it)
Code: Select all
int bitCount(unsigned long i) {
// HD, Figure 5-14
i = i - ((i >> 1) & 0x5555555555555555L);
i = (i & 0x3333333333333333L) + ((i >> 2) & 0x3333333333333333L);
i = (i + (i >> 4)) & 0x0f0f0f0f0f0f0f0fL;
i = i + (i >> 8);
i = i + (i >> 16);
i = i + (i >> 32);
return (int)i & 0x7f;
we get the straight line code sequence
Code: Select all
bitCount(unsigned long):
movabs rdx, 6148914691236517205
mov rax, rdi
shr rax
and rax, rdx
movabs rdx, 3689348814741910323
sub rdi, rax
mov rax, rdi
shr rdi, 2
and rdi, rdx
and rax, rdx
add rax, rdi
mov rdi, rax
shr rdi, 4
add rdi, rax
movabs rax, 1085102592571150095
and rdi, rax
mov rax, rdi
shr rax, 8
add rdi, rax
mov rax, rdi
shr rax, 16
add rdi, rax
mov rax, rdi
shr rax, 32
add rax, rdi
and eax, 127
ret
I count that as 27 assembly instructions (excluding the RET) for the bit counting part. It has no loops and no conditionals, which is an advantage. Same exectution time regardless of number of bits set.
On the other hand, the "type A" code
Code: Select all
char _bitcnt(unsigned long bit) {
char c = 0;
while (bit) { bit &= (bit - 1); c++; }
return c;
}
(assuming no popcnt available) translates to
Code: Select all
_bitcnt(unsigned long):
xor r8d, r8d
test rdi, rdi
je .L1
.L3:
lea rax, [rdi-1]
add r8d, 1
and rdi, rax
jne .L3
.L1:
mov eax, r8d
ret
which has a 4 instructions per bit central loop. If you have very few bits set in your bitboard, this would result in fewer instructions executed.
So we have to consider which case is more common in a chess program - many bits set or just a few bits set.