Kindergarten bitboards without multiplying

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

Gian-Carlo Pascutto wrote:
Gerd Isenberg wrote:
Gian-Carlo Pascutto wrote: In any case there is no need to do 64 bit multiplies with Kindergarten.
Sure, imul64 on 32-bit is a call with three imuls. But Piotr's proposal seems even a faster solution than a dedicated 2 times 32-bit mul approach.
Ah, really? I would have expected that the 32 bit folded kindergarten (1 mul per direction) would be faster than a series of dependent shifts.
Ok, I was probably too optimistic, since I wrongly referred two 32-bit imuls per direction. Still even with one imul you have some shifts around. Processing two lines in parallel (for bishop or rook) relaxes the dependencies, and the 32-bit shifts are nops in 32-bit mode, so Collapsed Files for diagonals, anti-diagonal and ranks seems very competitive, even if compiler don't emit or al, ah. I think it is worth a try if you focus on 32-bit machines.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

Gian-Carlo Pascutto wrote:
Gerd Isenberg wrote: Sure, GetTickCount() resolution is far to coarse for that purpose, but you may try cpuid with eax zero before rdtsc, to make sure it doesn't executed out of order.
Still doesn't guarantee it stays on the same CPU. Again: use QueryPerformanceCounter or get burned.
Ok, is something similar available for gcc/linux?
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Kindergarten bitboards without multiplying

Post by Gian-Carlo Pascutto »

gettimeofday()

If you really absolutely need nanoseconds, there is clock_gettime().
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Kindergarten bitboards without multiplying

Post by wgarvin »

Gian-Carlo Pascutto wrote:
Gerd Isenberg wrote:
Gian-Carlo Pascutto wrote: In any case there is no need to do 64 bit multiplies with Kindergarten.
Sure, imul64 on 32-bit is a call with three imuls. But Piotr's proposal seems even a faster solution than a dedicated 2 times 32-bit mul approach.
Ah, really? I would have expected that the 32 bit folded kindergarten (1 mul per direction) would be faster than a series of dependent shifts.
Unless I misunderstand something, the shifts are not dependent on each other (but combining the result together is).
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Kindergarten bitboards without multiplying

Post by Gian-Carlo Pascutto »

wgarvin wrote: Unless I misunderstand something, the shifts are not dependent on each other (but combining the result together is).
The second shift is operating on the result of the first shift. The first shift is operating on the result of the operation just before it.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Kindergarten bitboards without multiplying

Post by wgarvin »

Gian-Carlo Pascutto wrote:
wgarvin wrote: Unless I misunderstand something, the shifts are not dependent on each other (but combining the result together is).
The second shift is operating on the result of the first shift. The first shift is operating on the result of the operation just before it.
Okay, I see what you mean now. I misunderstood that code.

I guess the shifts are a bit expensive, but not as expensive as doing 2 or more 32-bit multiplies.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

wgarvin wrote:
Gian-Carlo Pascutto wrote:
wgarvin wrote: Unless I misunderstand something, the shifts are not dependent on each other (but combining the result together is).
The second shift is operating on the result of the first shift. The first shift is operating on the result of the operation just before it.
Okay, I see what you mean now. I misunderstood that code.

I guess the shifts are a bit expensive, but not as expensive as doing 2 or more 32-bit multiplies.
Two shifts plus "or" are about the same as 32-bit mul on current cpus >= K8, core2. The advantage of 32-bit mode is one safes the shift right 32. One of the seldom cases where 32-bit mode is advantageous for bitboards ;-)

To get the 2* six inner bits collapsedFilesIndex into eax might be 6 cycles each, the same for two indices in parallel:

Code: Select all

; input edx:eax masked rank, dia or antidia
; output eax
or   eax, edx    1
mov  edx, eax     1
shr  eax, 16       1
or   eax, edx       1
or   al, ah          1
and  eax, 2*63        1
versus imul

Code: Select all

; input edx:eax masked rank, dia or antidia
; output eax
or   eax, edx  ; folded  1
imul eax, 0x01010101      111
shr  eax, 24                 1
and  eax, 2*63                1
Beside shorter code, the advantage of imul is to preserve rdx. Also comiler will likely not emit or al,ah, but mov,shr 8,or with two additional cycles. So fast imul is advantageous on recent processors with 3-cycles imul. Oups, nehalem has 4 cycles imul, Netburst has 10 (0F_3H) or 14 (0F_2H) cycles imul.