Page 1 of 1

### Resource for bit twiddlers

Posted: Fri May 04, 2007 1:55 am
I just ran accross this. It is a preprint of part of Knuth's Art of Computer Programming, Vol 4, and covers bit twiddling tricks and techniques.
http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz

### Re: Resource for bit twiddlers

Posted: Mon May 28, 2007 12:40 pm
jwes wrote:I just ran accross this. It is a preprint of part of Knuth's Art of Computer Programming, Vol 4, and covers bit twiddling tricks and techniques.
http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz
This is really a great ressource - thanks for providing the link.

"tweaking several bytes at once" (page 19), nicely explains some SWAR (SIMD within a regsiter) tricks suggested by H.G. Dietz.

Code: Select all

``````SWAR add z = x + y
z = (&#40;x &~H&#41; + &#40;y &~H&#41;) ^ (&#40;x ^ y&#41; & H&#41;

SWAR sub z = x - y
z = (&#40;x | H&#41; - &#40;y &~H&#41;) ^ (&#40;x ^~y&#41; & H&#41;

SWAR average z = &#40;x+y&#41;/2 based on x + y = &#40;x^y&#41; + 2*&#40;x&y&#41;
z = &#40;x & y&#41; + ((&#40;x ^ y&#41; & ~L&#41; >> 1&#41;
``````
For bytewise math, H = 0x8080808080808080 and L = 0x0101010101010101. But one can use arbitrary masks for other sizes such as 6*10 or 5*12 bit as well - or even 64-bit structures with mixed "word"-sizes.

Performing bytewise or rankwise sub using the o ^(o-2r)-trick saves some instructions compared to one Kogge-Stone attack getter, even with general purpose registers:

Code: Select all

``````u64 eastAttacks&#40;u64 rooks, u64 empty&#41;
&#123;
empty  = empty &  0xfefefefefefefefe;
rooks |= empty & &#40;rooks << 1&#41;;
empty  = empty & &#40;empty << 1&#41;;
rooks |= empty & &#40;rooks << 2&#41;;
empty  = empty & &#40;empty << 2&#41;;
rooks |= empty & &#40;rooks << 4&#41;;
return &#40;rooks << 1&#41; & 0xfefefefefefefefe;
&#125;
``````

Code: Select all

``````u64 eastAttacks&#40;u64 rooks, u64 empty&#41;
&#123;
const u64 H     = 0x8080808080808080;
u64 occupied    = rooks | ~empty;
u64 occNoRook   = rooks ^ occupied;
u64 difference  = (&#40;occNoRook | H&#41; - &#40;rooks & ~H&#41;)
^  (~occupied & H&#41;;
return occupied ^ difference;
&#125;
``````
Gerd

### Re: Resource for bit twiddlers

Posted: Mon May 28, 2007 3:16 pm
Gerd Isenberg wrote:

Code: Select all

``````u64 eastAttacks&#40;u64 rooks, u64 empty&#41;
&#123;
const u64 H     = 0x8080808080808080;
u64 occupied    = rooks | ~empty;
u64 occNoRook   = rooks ^ occupied;
u64 difference  = (&#40;occNoRook | H&#41; - &#40;rooks & ~H&#41;)
^  (~occupied & H&#41;;
return occupied ^ difference;
&#125;
``````
Gerd
Oups, since

A ^ (~A & B) == A | B

the code becomes a bit cheaper:

Code: Select all

``````u64 eastAttacks&#40;u64 rooks, u64 empty&#41;
&#123;
const u64 H     = 0x8080808080808080;
u64 occInclRook = rooks | ~empty;
u64 occExclRook = rooks ^ occInclRook;
u64 attacks     = (&#40;occExclRook | H&#41; - &#40;rooks & ~H&#41;)
^  &#40;occInclRook | H&#41;;
return attacks;
&#125;
``````

### Re: Resource for bit twiddlers

Posted: Mon May 28, 2007 4:35 pm
The vc2005 compiler surprisingly was not aware of the
A ^ (~A & B) == A | B simplification.

Code: Select all

``````?eastAttacks@@YA_K_K0@Z PROC
00000	49 b9 80 80 80
80 80 80 80 80	 mov	 r9, 8080808080808080H
0000a	48 f7 d2	 not	 rdx
0000d	49 b8 7f 7f 7f
7f 7f 7f 7f 7f	 mov	 r8, 7f7f7f7f7f7f7f7fH
00017	48 0b d1	 or	 rdx, rcx
0001a	48 8b c2	 mov	 rax, rdx
0001d	48 33 c1	 xor	 rax, rcx
00020	49 23 c8	 and	 rcx, r8
00023	49 0b c1	 or	 rax, r9
00026	48 2b c1	 sub	 rax, rcx
00029	48 8b ca	 mov	 rcx, rdx
0002c	48 f7 d1	 not	 rcx
0002f	49 23 c9	 and	 rcx, r9
00032	48 33 c1	 xor	 rax, rcx
00035	48 33 c2	 xor	 rax, rdx
00038	c3		 ret	 0
``````

Code: Select all

``````rooks\$ = 8
empty\$ = 16
?eastAttacks@@YA_K_K0@Z PROC
00000	48 f7 d2	 not	 rdx
00003	49 b9 80 80 80
80 80 80 80 80	 mov	 r9, 8080808080808080H
0000d	49 b8 7f 7f 7f
7f 7f 7f 7f 7f	 mov	 r8, 7f7f7f7f7f7f7f7fH
00017	48 0b d1	 or	 rdx, rcx
0001a	48 8b c2	 mov	 rax, rdx
0001d	49 0b d1	 or	 rdx, r9
00020	48 33 c1	 xor	 rax, rcx
00023	49 23 c8	 and	 rcx, r8
00026	49 0b c1	 or	 rax, r9
00029	48 2b c1	 sub	 rax, rcx
0002c	48 33 c2	 xor	 rax, rdx
0002f	c3		 ret	 0
``````
This one with a De Morgan and -i = ~i+1 transformation takes one register and a few code bytes less. The additional +1 is done together with the add by the AGU instaed of the ALU, performing a lea-instruction.

Code: Select all

``````u64 eastAttacks&#40;u64 rooks, u64 empty&#41;
&#123;
const u64 H     = 0x8080808080808080;
u64 occInclRook = rooks | ~empty;
u64 occExclRook = rooks ^ occInclRook;
u64 attacks     = (&#40;occExclRook | H&#41; + (~rooks | H&#41; + 1&#41;
^  &#40;occInclRook | H&#41;;
return attacks;
&#125;
``````

Code: Select all

``````rooks\$ = 8
empty\$ = 16
?eastAttacks@@YA_K_K0@Z PROC
00000	49 b8 80 80 80
80 80 80 80 80	 mov	 r8, 8080808080808080H
0000a	48 f7 d2	 not	 rdx
0000d	48 0b d1	 or	 rdx, rcx
00010	48 8b c2	 mov	 rax, rdx
00013	49 0b d0	 or	 rdx, r8
00016	48 33 c1	 xor	 rax, rcx
00019	48 f7 d1	 not	 rcx
0001c	49 0b c0	 or	 rax, r8
0001f	49 0b c8	 or	 rcx, r8
00022	48 8d 44 08 01	 lea	 rax, QWORD PTR &#91;rax+rcx+1&#93;
00027	48 33 c2	 xor	 rax, rdx
0002a	c3		 ret	 0
``````

### Re: Resource for bit twiddlers

Posted: Wed May 30, 2007 5:55 pm
Some further optimization by using common sub-expressions. One operation less, and more importantly, only three scratch registers used. The compiler prefers 10 byte opcode loading rax with an 64-bit immediate, while there is already the one's complement in rax, where a 3-byte "not rax" would result in 7 bytes shorter opcode but higher alu-pressure. With 64-bit immediate moves, the 32-byte prefetch of the K10 makes sense.

Code: Select all

``````u64 eastAttacks&#40;u64 rooks, u64 empty&#41;
&#123;
const u64 H  =  0x8080808080808080;
u64 inclRook =  rooks | H | ~empty;
u64 exclRook = &#40;rooks   &=  ~H  ) ^ inclRook;
u64 attacks  = &#40;exclRook - rooks&#41; ^ inclRook;
return attacks;
&#125;

rooks\$ = 8
empty\$ = 16
?eastAttacks@@YA_K_K0@Z PROC
00000	48 f7 d2	 not	 rdx
00003	48 b8 80 80 80
80 80 80 80 80	 mov	 rax, 8080808080808080H
0000d	48 0b d1	 or	 rdx, rcx
00010	48 0b d0	 or	 rdx, rax
00013	48 b8 7f 7f 7f   ; not rax would also possible here
7f 7f 7f 7f 7f	 mov	 rax, 7f7f7f7f7f7f7f7fH
0001d	48 23 c8	 and	 rcx, rax
00020	48 8b c2	 mov	 rax, rdx
00023	48 33 c1	 xor	 rax, rcx
00026	48 2b c1	 sub	 rax, rcx
00029	48 33 c2	 xor	 rax, rdx
0002c	c3		 ret	 0
``````
If one passes already the occupied set, and guarantee that rooks are subset of occupied, one safes another two instructions of course, but that loses a bit flexibility and tends to confuse due to the prototype and semantic of the Kogge-Stone getters.

Gerd

### Re: Resource for bit twiddlers

Posted: Wed May 30, 2007 10:51 pm
Gerd Isenberg wrote:Some further optimization by using common sub-expressions. One operation less, and more importantly, only three scratch registers used. The compiler prefers 10 byte opcode loading rax with an 64-bit immediate, while there is already the one's complement in rax, where a 3-byte "not rax" would result in 7 bytes shorter opcode but higher alu-pressure. With 64-bit immediate moves, the 32-byte prefetch of the K10 makes sense.

Code: Select all

``````u64 eastAttacks&#40;u64 rooks, u64 empty&#41;
&#123;
const u64 H  =  0x8080808080808080;
u64 inclRook =  rooks | H | ~empty;
u64 exclRook = &#40;rooks   &=  ~H  ) ^ inclRook;
u64 attacks  = &#40;exclRook - rooks&#41; ^ inclRook;
return attacks;
&#125;

rooks\$ = 8
empty\$ = 16
?eastAttacks@@YA_K_K0@Z PROC
00000	48 f7 d2	 not	 rdx
00003	48 b8 80 80 80
80 80 80 80 80	 mov	 rax, 8080808080808080H
0000d	48 0b d1	 or	 rdx, rcx
00010	48 0b d0	 or	 rdx, rax
00013	48 b8 7f 7f 7f   ; not rax would also possible here
7f 7f 7f 7f 7f	 mov	 rax, 7f7f7f7f7f7f7f7fH
0001d	48 23 c8	 and	 rcx, rax
00020	48 8b c2	 mov	 rax, rdx
00023	48 33 c1	 xor	 rax, rcx
00026	48 2b c1	 sub	 rax, rcx
00029	48 33 c2	 xor	 rax, rdx
0002c	c3		 ret	 0
``````
If one passes already the occupied set, and guarantee that rooks are subset of occupied, one safes another two instructions of course, but that loses a bit flexibility and tends to confuse due to the prototype and semantic of the Kogge-Stone getters.

Gerd
From what I can understand this subtraction technique can only be used for right fills and left fills. Here's what I could infer from your filling algorithm:

I'll term rooks as squares.

so basically the concept is to use the fact that

Code: Select all

``````&#40;LSB at left, looking at only first rank for now&#41;
HFile      "rooks"    Result
00000001 - 00001000 = 00001110
Result     HFile      Fill
00001110 ^ 00000001 = 00001111
``````
Now lets add an occupancy to see if it still works:

Code: Select all

``````&#40;LSB at left like for the first rank&#41;
HFile|occ  "rooks"    Result
00000011 - 00001000 = 00001101
Result     HFile|occ  Fill
00001101 ^ 00000011 = 00001110
``````
Works perfect! But because of the subtraction, we have to make sure the occupancy and the rook don't coinside eg:

Code: Select all

``````&#40;LSB at left like for the first rank&#41;
HFile|occ  "rooks"    Result
00001001 - 00001000 = 00000001 &#40;BAD&#41;
Result     HFile|occ  Fill
00000001 ^ 00000011 = 00000010 &#40;WRONG&#41;
``````
So we have to make sure we take the rooks out of the occupancy. Assuming the same occupancy as before:

Code: Select all

``````(&#40;HFile|occ&#41;&~rooks&#41; - "rooks"    Result
00000001             - 00001000 = 00001110
Result ^ (&#40;HFile|occ&#41;&~rooks&#41;   = Fill
00001110 00000001               = 00001110
``````
From the above we can make an easy right fill:

Code: Select all

``````U64 rightFill&#40;U64 squares, U64 occ&#41;
&#123;
occ = &#40;occ|FILEHBB&#41;&~squares; //occupancy without squares
return (&#40;occ - squares&#41;^occ&#41; | squares; //xor to get rid of unused occupancy
&#125;``````
I hope my iterpretation is correct? Is it also possible to to do fills other than left and right with the subtraction technique?

### Re: Resource for bit twiddlers

Posted: Thu May 31, 2007 6:17 am
Pradu wrote: From what I can understand this subtraction technique can only be used for right fills and left fills. Here's what I could infer from your filling algorithm:

I'll term rooks as squares.

so basically the concept is to use the fact that

Code: Select all

``````&#40;LSB at left, looking at only first rank for now&#41;
HFile      "rooks"    Result
00000001 - 00001000 = 00001110
Result     HFile      Fill
00001110 ^ 00000001 = 00001111
``````
Now lets add an occupancy to see if it still works:

Code: Select all

``````&#40;LSB at left like for the first rank&#41;
HFile|occ  "rooks"    Result
00000011 - 00001000 = 00001101
Result     HFile|occ  Fill
00001101 ^ 00000011 = 00001110
``````
Works perfect! But because of the subtraction, we have to make sure the occupancy and the rook don't coinside eg:

Code: Select all

``````&#40;LSB at left like for the first rank&#41;
HFile|occ  "rooks"    Result
00001001 - 00001000 = 00000001 &#40;BAD&#41;
Result     HFile|occ  Fill
00000001 ^ 00000011 = 00000010 &#40;WRONG&#41;
``````
So we have to make sure we take the rooks out of the occupancy. Assuming the same occupancy as before:

Code: Select all

``````(&#40;HFile|occ&#41;&~rooks&#41; - "rooks"    Result
00000001             - 00001000 = 00001110
Result ^ (&#40;HFile|occ&#41;&~rooks&#41;   = Fill
00001110 00000001               = 00001110
``````
From the above we can make an easy right fill:

Code: Select all

``````U64 rightFill&#40;U64 squares, U64 occ&#41;
&#123;
occ = &#40;occ|FILEHBB&#41;&~squares; //occupancy without squares
return (&#40;occ - squares&#41;^occ&#41; | squares; //xor to get rid of unused occupancy
&#125;``````
I hope my iterpretation is correct? Is it also possible to to do fills other than left and right with the subtraction technique?

no, the o^(o-2r) trick with r is subset of o, unfortunately works only for "positive" directions, in this setwise case for the a->h direction (east, right from white points of view on the chessboard) with mapping a1==0, h8==63. For all other seven directions one has to use Steffan Westcott's Kogge-Stone routines.

The bytewise or rankwise o^(o-2r) works setwise even with multiple rooks per rank. The first substraction of -2r == - r - r can be done by xor or "andnot" to clear the whole rook-bits in the occupancy, while the substration does the carry propagation, and the final xor leaves the attacks. Simd instructions with MMX or SSE2 are nice for that bytewise-substraction, but it turned out that the SWAR-trick is as quite effective as well.

With single rooks/bishops by square metric, one can use this trick for four positive vertical and diagonal directions with pre- and post-masks of the complete file or diagonal. For bishops and file-attacks one can use vertical flips (bswap) to do the negative directions. But a bishop attack-getter based on this technique was a bit slower than kindergarten, not to mention magic:

Code: Select all

``````u64 bishopAttacks&#40;u64 occ, u32 sq&#41; &#123;
u64 rvsdia, rvsant, bishop;
u64 occdia = occ, occant = occ;
rvsdia  = _byteswap_uint64&#40;occdia&#41;;
rvsant  = _byteswap_uint64&#40;occant&#41;;
occdia ^=  occdia - bishop;
occant ^=  occant - bishop;
bishop  = _byteswap_uint64&#40;bishop&#41;;
rvsdia ^=  rvsdia - bishop;
rvsant ^=  rvsant - bishop;
occdia ^= _byteswap_uint64&#40;rvsdia&#41;;
occant ^= _byteswap_uint64&#40;rvsant&#41;;
return occdia ^ occant;
&#125;
``````
Gerd

### Re: Resource for bit twiddlers

Posted: Fri Jun 01, 2007 12:16 am
Gerd Isenberg wrote:

no, the o^(o-2r) trick with r is subset of o, unfortunately works only for "positive" directions, in this setwise case for the a->h direction (east, right from white points of view on the chessboard) with mapping a1==0, h8==63. For all other seven directions one has to use Steffan Westcott's Kogge-Stone routines.
My mistake. I guess also that my code doesn't work well sometimes for multiple rooks on the same rank when one rook coincides with an occupancy. I need to look at your code a bit more closely.

### Re: Resource for bit twiddlers

Posted: Fri Jun 01, 2007 8:19 am
Gerd Isenberg wrote:

no, the o^(o-2r) trick with r is subset of o, unfortunately works only for "positive" directions, in this setwise case for the a->h direction (east, right from white points of view on the chessboard) with mapping a1==0, h8==63. For all other seven directions one has to use Steffan Westcott's Kogge-Stone routines.
My mistake. I guess also that my code doesn't work well sometimes for multiple rooks on the same rank when one rook coincides with an occupancy. I need to look at your code a bit more closely.

Code: Select all

``````               hex    bin        bin
arithmetical order   reversed according to the a1 = 0 mapping
o              0xDB   11011011   11011011   occupancy including rooks
r = subset&#40;o&#41;  0x12   00010010   01001000   rooks
o - r          0xC9   11001001   10010011   occupancy excluding rooks
o - 2r         0xB7   10110111   11101101   subtracting rooks from either next higher occupancy or borrow
o ^ &#40;o - 2r&#41;   0x6C   01101100   00110110   rook attacks of both rooks
``````
You can not do it the other way around - subtracting none-rooks from rooks to get the opposite direction

### Re: Resource for bit twiddlers

Posted: Mon Jun 04, 2007 2:58 pm
Gerd Isenberg wrote:
jwes wrote:I just ran accross this. It is a preprint of part of Knuth's Art of Computer Programming, Vol 4, and covers bit twiddling tricks and techniques.
http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz
This is really a great ressource - thanks for providing the link.

"tweaking several bytes at once" (page 19), nicely explains some SWAR (SIMD within a regsiter) tricks suggested by H.G. Dietz.

Code: Select all

``````SWAR add z = x + y
z = (&#40;x &~H&#41; + &#40;y &~H&#41;) ^ (&#40;x ^ y&#41; & H&#41;

SWAR sub z = x - y
z = (&#40;x | H&#41; - &#40;y &~H&#41;) ^ (&#40;x ^~y&#41; & H&#41;

SWAR average z = &#40;x+y&#41;/2 based on x + y = &#40;x^y&#41; + 2*&#40;x&y&#41;
z = &#40;x & y&#41; + ((&#40;x ^ y&#41; & ~L&#41; >> 1&#41;
``````
For bytewise math, H = 0x8080808080808080 and L = 0x0101010101010101. But one can use arbitrary masks for other sizes such as 6*10 or 5*12 bit as well - or even 64-bit structures with mixed "word"-sizes.

Performing bytewise or rankwise sub using the o ^(o-2r)-trick saves some instructions compared to one Kogge-Stone attack getter, even with general purpose registers:

Code: Select all

``````u64 eastAttacks&#40;u64 rooks, u64 empty&#41;
&#123;
empty  = empty &  0xfefefefefefefefe;
rooks |= empty & &#40;rooks << 1&#41;;
empty  = empty & &#40;empty << 1&#41;;
rooks |= empty & &#40;rooks << 2&#41;;
empty  = empty & &#40;empty << 2&#41;;
rooks |= empty & &#40;rooks << 4&#41;;
return &#40;rooks << 1&#41; & 0xfefefefefefefefe;
&#125;
``````

Code: Select all

``````u64 eastAttacks&#40;u64 rooks, u64 empty&#41;
&#123;
const u64 H     = 0x8080808080808080;
u64 occupied    = rooks | ~empty;
u64 occNoRook   = rooks ^ occupied;
u64 difference  = (&#40;occNoRook | H&#41; - &#40;rooks & ~H&#41;)
^  (~occupied & H&#41;;
return occupied ^ difference;
&#125;
``````
Gerd

I'll hope the attack-getter does not violate any United States Patent