Resource for bit twiddlers
Moderators: bob, hgm, Harvey Williamson
Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
Resource for bit twiddlers
I just ran accross this. It is a preprint of part of Knuth's Art of Computer Programming, Vol 4, and covers bit twiddling tricks and techniques.
http://wwwcsfaculty.stanford.edu/~knuth/fasc1a.ps.gz
http://wwwcsfaculty.stanford.edu/~knuth/fasc1a.ps.gz

 Posts: 2139
 Joined: Wed Mar 08, 2006 7:47 pm
 Location: Hattingen, Germany
Re: Resource for bit twiddlers
This is really a great ressource  thanks for providing the link.jwes wrote:I just ran accross this. It is a preprint of part of Knuth's Art of Computer Programming, Vol 4, and covers bit twiddling tricks and techniques.
http://wwwcsfaculty.stanford.edu/~knuth/fasc1a.ps.gz
"tweaking several bytes at once" (page 19), nicely explains some SWAR (SIMD within a regsiter) tricks suggested by H.G. Dietz.
Code: Select all
SWAR add z = x + y
z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)
SWAR sub z = x  y
z = ((x  H)  (y &~H)) ^ ((x ^~y) & H)
SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y)
z = (x & y) + (((x ^ y) & ~L) >> 1)
Performing bytewise or rankwise sub using the o ^(o2r)trick saves some instructions compared to one KoggeStone attack getter, even with general purpose registers:
Code: Select all
u64 eastAttacks(u64 rooks, u64 empty)
{
empty = empty & 0xfefefefefefefefe;
rooks = empty & (rooks << 1);
empty = empty & (empty << 1);
rooks = empty & (rooks << 2);
empty = empty & (empty << 2);
rooks = empty & (rooks << 4);
return (rooks << 1) & 0xfefefefefefefefe;
}
Code: Select all
u64 eastAttacks(u64 rooks, u64 empty)
{
const u64 H = 0x8080808080808080;
u64 occupied = rooks  ~empty;
u64 occNoRook = rooks ^ occupied;
u64 difference = ((occNoRook  H)  (rooks & ~H))
^ (~occupied & H);
return occupied ^ difference;
}

 Posts: 2139
 Joined: Wed Mar 08, 2006 7:47 pm
 Location: Hattingen, Germany
Re: Resource for bit twiddlers
Oups, sinceGerd Isenberg wrote:GerdCode: Select all
u64 eastAttacks(u64 rooks, u64 empty) { const u64 H = 0x8080808080808080; u64 occupied = rooks  ~empty; u64 occNoRook = rooks ^ occupied; u64 difference = ((occNoRook  H)  (rooks & ~H)) ^ (~occupied & H); return occupied ^ difference; }
A ^ (~A & B) == A  B
the code becomes a bit cheaper:
Code: Select all
u64 eastAttacks(u64 rooks, u64 empty)
{
const u64 H = 0x8080808080808080;
u64 occInclRook = rooks  ~empty;
u64 occExclRook = rooks ^ occInclRook;
u64 attacks = ((occExclRook  H)  (rooks & ~H))
^ (occInclRook  H);
return attacks;
}

 Posts: 2139
 Joined: Wed Mar 08, 2006 7:47 pm
 Location: Hattingen, Germany
Re: Resource for bit twiddlers
The vc2005 compiler surprisingly was not aware of the
A ^ (~A & B) == A  B simplification.
This one with a De Morgan and i = ~i+1 transformation takes one register and a few code bytes less. The additional +1 is done together with the add by the AGU instaed of the ALU, performing a leainstruction.
A ^ (~A & B) == A  B simplification.
Code: Select all
?eastAttacks@@YA_K_K0@Z PROC
00000 49 b9 80 80 80
80 80 80 80 80 mov r9, 8080808080808080H
0000a 48 f7 d2 not rdx
0000d 49 b8 7f 7f 7f
7f 7f 7f 7f 7f mov r8, 7f7f7f7f7f7f7f7fH
00017 48 0b d1 or rdx, rcx
0001a 48 8b c2 mov rax, rdx
0001d 48 33 c1 xor rax, rcx
00020 49 23 c8 and rcx, r8
00023 49 0b c1 or rax, r9
00026 48 2b c1 sub rax, rcx
00029 48 8b ca mov rcx, rdx
0002c 48 f7 d1 not rcx
0002f 49 23 c9 and rcx, r9
00032 48 33 c1 xor rax, rcx
00035 48 33 c2 xor rax, rdx
00038 c3 ret 0
Code: Select all
rooks$ = 8
empty$ = 16
?eastAttacks@@YA_K_K0@Z PROC
00000 48 f7 d2 not rdx
00003 49 b9 80 80 80
80 80 80 80 80 mov r9, 8080808080808080H
0000d 49 b8 7f 7f 7f
7f 7f 7f 7f 7f mov r8, 7f7f7f7f7f7f7f7fH
00017 48 0b d1 or rdx, rcx
0001a 48 8b c2 mov rax, rdx
0001d 49 0b d1 or rdx, r9
00020 48 33 c1 xor rax, rcx
00023 49 23 c8 and rcx, r8
00026 49 0b c1 or rax, r9
00029 48 2b c1 sub rax, rcx
0002c 48 33 c2 xor rax, rdx
0002f c3 ret 0
Code: Select all
u64 eastAttacks(u64 rooks, u64 empty)
{
const u64 H = 0x8080808080808080;
u64 occInclRook = rooks  ~empty;
u64 occExclRook = rooks ^ occInclRook;
u64 attacks = ((occExclRook  H) + (~rooks  H) + 1)
^ (occInclRook  H);
return attacks;
}
Code: Select all
rooks$ = 8
empty$ = 16
?eastAttacks@@YA_K_K0@Z PROC
00000 49 b8 80 80 80
80 80 80 80 80 mov r8, 8080808080808080H
0000a 48 f7 d2 not rdx
0000d 48 0b d1 or rdx, rcx
00010 48 8b c2 mov rax, rdx
00013 49 0b d0 or rdx, r8
00016 48 33 c1 xor rax, rcx
00019 48 f7 d1 not rcx
0001c 49 0b c0 or rax, r8
0001f 49 0b c8 or rcx, r8
00022 48 8d 44 08 01 lea rax, QWORD PTR [rax+rcx+1]
00027 48 33 c2 xor rax, rdx
0002a c3 ret 0

 Posts: 2139
 Joined: Wed Mar 08, 2006 7:47 pm
 Location: Hattingen, Germany
Re: Resource for bit twiddlers
Some further optimization by using common subexpressions. One operation less, and more importantly, only three scratch registers used. The compiler prefers 10 byte opcode loading rax with an 64bit immediate, while there is already the one's complement in rax, where a 3byte "not rax" would result in 7 bytes shorter opcode but higher alupressure. With 64bit immediate moves, the 32byte prefetch of the K10 makes sense.
If one passes already the occupied set, and guarantee that rooks are subset of occupied, one safes another two instructions of course, but that loses a bit flexibility and tends to confuse due to the prototype and semantic of the KoggeStone getters.
Gerd
Code: Select all
u64 eastAttacks(u64 rooks, u64 empty)
{
const u64 H = 0x8080808080808080;
u64 inclRook = rooks  H  ~empty;
u64 exclRook = (rooks &= ~H ) ^ inclRook;
u64 attacks = (exclRook  rooks) ^ inclRook;
return attacks;
}
rooks$ = 8
empty$ = 16
?eastAttacks@@YA_K_K0@Z PROC
00000 48 f7 d2 not rdx
00003 48 b8 80 80 80
80 80 80 80 80 mov rax, 8080808080808080H
0000d 48 0b d1 or rdx, rcx
00010 48 0b d0 or rdx, rax
00013 48 b8 7f 7f 7f ; not rax would also possible here
7f 7f 7f 7f 7f mov rax, 7f7f7f7f7f7f7f7fH
0001d 48 23 c8 and rcx, rax
00020 48 8b c2 mov rax, rdx
00023 48 33 c1 xor rax, rcx
00026 48 2b c1 sub rax, rcx
00029 48 33 c2 xor rax, rdx
0002c c3 ret 0
Gerd
Re: Resource for bit twiddlers
From what I can understand this subtraction technique can only be used for right fills and left fills. Here's what I could infer from your filling algorithm:Gerd Isenberg wrote:Some further optimization by using common subexpressions. One operation less, and more importantly, only three scratch registers used. The compiler prefers 10 byte opcode loading rax with an 64bit immediate, while there is already the one's complement in rax, where a 3byte "not rax" would result in 7 bytes shorter opcode but higher alupressure. With 64bit immediate moves, the 32byte prefetch of the K10 makes sense.
If one passes already the occupied set, and guarantee that rooks are subset of occupied, one safes another two instructions of course, but that loses a bit flexibility and tends to confuse due to the prototype and semantic of the KoggeStone getters.Code: Select all
u64 eastAttacks(u64 rooks, u64 empty) { const u64 H = 0x8080808080808080; u64 inclRook = rooks  H  ~empty; u64 exclRook = (rooks &= ~H ) ^ inclRook; u64 attacks = (exclRook  rooks) ^ inclRook; return attacks; } rooks$ = 8 empty$ = 16 ?eastAttacks@@YA_K_K0@Z PROC 00000 48 f7 d2 not rdx 00003 48 b8 80 80 80 80 80 80 80 80 mov rax, 8080808080808080H 0000d 48 0b d1 or rdx, rcx 00010 48 0b d0 or rdx, rax 00013 48 b8 7f 7f 7f ; not rax would also possible here 7f 7f 7f 7f 7f mov rax, 7f7f7f7f7f7f7f7fH 0001d 48 23 c8 and rcx, rax 00020 48 8b c2 mov rax, rdx 00023 48 33 c1 xor rax, rcx 00026 48 2b c1 sub rax, rcx 00029 48 33 c2 xor rax, rdx 0002c c3 ret 0
Gerd
I'll term rooks as squares.
so basically the concept is to use the fact that
Code: Select all
(LSB at left, looking at only first rank for now)
HFile "rooks" Result
00000001  00001000 = 00001110
Result HFile Fill
00001110 ^ 00000001 = 00001111
Code: Select all
(LSB at left like for the first rank)
HFileocc "rooks" Result
00000011  00001000 = 00001101
Result HFileocc Fill
00001101 ^ 00000011 = 00001110
Code: Select all
(LSB at left like for the first rank)
HFileocc "rooks" Result
00001001  00001000 = 00000001 (BAD)
Result HFileocc Fill
00000001 ^ 00000011 = 00000010 (WRONG)
Code: Select all
((HFileocc)&~rooks)  "rooks" Result
00000001  00001000 = 00001110
Result ^ ((HFileocc)&~rooks) = Fill
00001110 00000001 = 00001110
Code: Select all
U64 rightFill(U64 squares, U64 occ)
{
occ = (occFILEHBB)&~squares; //occupancy without squares
return ((occ  squares)^occ)  squares; //xor to get rid of unused occupancy
}

 Posts: 2139
 Joined: Wed Mar 08, 2006 7:47 pm
 Location: Hattingen, Germany
Re: Resource for bit twiddlers
Hi Pradu,Pradu wrote: From what I can understand this subtraction technique can only be used for right fills and left fills. Here's what I could infer from your filling algorithm:
I'll term rooks as squares.
so basically the concept is to use the fact thatNow lets add an occupancy to see if it still works:Code: Select all
(LSB at left, looking at only first rank for now) HFile "rooks" Result 00000001  00001000 = 00001110 Result HFile Fill 00001110 ^ 00000001 = 00001111
Works perfect! But because of the subtraction, we have to make sure the occupancy and the rook don't coinside eg:Code: Select all
(LSB at left like for the first rank) HFileocc "rooks" Result 00000011  00001000 = 00001101 Result HFileocc Fill 00001101 ^ 00000011 = 00001110
So we have to make sure we take the rooks out of the occupancy. Assuming the same occupancy as before:Code: Select all
(LSB at left like for the first rank) HFileocc "rooks" Result 00001001  00001000 = 00000001 (BAD) Result HFileocc Fill 00000001 ^ 00000011 = 00000010 (WRONG)
From the above we can make an easy right fill:Code: Select all
((HFileocc)&~rooks)  "rooks" Result 00000001  00001000 = 00001110 Result ^ ((HFileocc)&~rooks) = Fill 00001110 00000001 = 00001110
I hope my iterpretation is correct? Is it also possible to to do fills other than left and right with the subtraction technique?Code: Select all
U64 rightFill(U64 squares, U64 occ) { occ = (occFILEHBB)&~squares; //occupancy without squares return ((occ  squares)^occ)  squares; //xor to get rid of unused occupancy }
no, the o^(o2r) trick with r is subset of o, unfortunately works only for "positive" directions, in this setwise case for the a>h direction (east, right from white points of view on the chessboard) with mapping a1==0, h8==63. For all other seven directions one has to use Steffan Westcott's KoggeStone routines.
The bytewise or rankwise o^(o2r) works setwise even with multiple rooks per rank. The first substraction of 2r ==  r  r can be done by xor or "andnot" to clear the whole rookbits in the occupancy, while the substration does the carry propagation, and the final xor leaves the attacks. Simd instructions with MMX or SSE2 are nice for that bytewisesubstraction, but it turned out that the SWARtrick is as quite effective as well.
With single rooks/bishops by square metric, one can use this trick for four positive vertical and diagonal directions with pre and postmasks of the complete file or diagonal. For bishops and fileattacks one can use vertical flips (bswap) to do the negative directions. But a bishop attackgetter based on this technique was a bit slower than kindergarten, not to mention magic:
Code: Select all
u64 bishopAttacks(u64 occ, u32 sq) {
u64 rvsdia, rvsant, bishop;
u64 occdia = occ, occant = occ;
occdia &= smsk[sq].diagonalMaskEx;
occant &= smsk[sq].antidiagMaskEx;
bishop = smsk[sq].bitMask;
rvsdia = _byteswap_uint64(occdia);
rvsant = _byteswap_uint64(occant);
occdia ^= occdia  bishop;
occant ^= occant  bishop;
bishop = _byteswap_uint64(bishop);
rvsdia ^= rvsdia  bishop;
rvsant ^= rvsant  bishop;
occdia ^= _byteswap_uint64(rvsdia);
occant ^= _byteswap_uint64(rvsant);
occdia &= smsk[sq].diagonalMaskEx;
occant &= smsk[sq].antidiagMaskEx;
return occdia ^ occant;
}
Re: Resource for bit twiddlers
My mistake. I guess also that my code doesn't work well sometimes for multiple rooks on the same rank when one rook coincides with an occupancy. I need to look at your code a bit more closely.Gerd Isenberg wrote:
Hi Pradu,
no, the o^(o2r) trick with r is subset of o, unfortunately works only for "positive" directions, in this setwise case for the a>h direction (east, right from white points of view on the chessboard) with mapping a1==0, h8==63. For all other seven directions one has to use Steffan Westcott's KoggeStone routines.

 Posts: 2139
 Joined: Wed Mar 08, 2006 7:47 pm
 Location: Hattingen, Germany
Re: Resource for bit twiddlers
Pradu wrote:My mistake. I guess also that my code doesn't work well sometimes for multiple rooks on the same rank when one rook coincides with an occupancy. I need to look at your code a bit more closely.Gerd Isenberg wrote:
Hi Pradu,
no, the o^(o2r) trick with r is subset of o, unfortunately works only for "positive" directions, in this setwise case for the a>h direction (east, right from white points of view on the chessboard) with mapping a1==0, h8==63. For all other seven directions one has to use Steffan Westcott's KoggeStone routines.
Code: Select all
hex bin bin
arithmetical order reversed according to the a1 = 0 mapping
o 0xDB 11011011 11011011 occupancy including rooks
r = subset(o) 0x12 00010010 01001000 rooks
o  r 0xC9 11001001 10010011 occupancy excluding rooks
o  2r 0xB7 10110111 11101101 subtracting rooks from either next higher occupancy or borrow
o ^ (o  2r) 0x6C 01101100 00110110 rook attacks of both rooks

 Posts: 2139
 Joined: Wed Mar 08, 2006 7:47 pm
 Location: Hattingen, Germany
Re: Resource for bit twiddlers
Gerd Isenberg wrote:This is really a great ressource  thanks for providing the link.jwes wrote:I just ran accross this. It is a preprint of part of Knuth's Art of Computer Programming, Vol 4, and covers bit twiddling tricks and techniques.
http://wwwcsfaculty.stanford.edu/~knuth/fasc1a.ps.gz
"tweaking several bytes at once" (page 19), nicely explains some SWAR (SIMD within a regsiter) tricks suggested by H.G. Dietz.
For bytewise math, H = 0x8080808080808080 and L = 0x0101010101010101. But one can use arbitrary masks for other sizes such as 6*10 or 5*12 bit as well  or even 64bit structures with mixed "word"sizes.Code: Select all
SWAR add z = x + y z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H) SWAR sub z = x  y z = ((x  H)  (y &~H)) ^ ((x ^~y) & H) SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y) z = (x & y) + (((x ^ y) & ~L) >> 1)
Performing bytewise or rankwise sub using the o ^(o2r)trick saves some instructions compared to one KoggeStone attack getter, even with general purpose registers:Code: Select all
u64 eastAttacks(u64 rooks, u64 empty) { empty = empty & 0xfefefefefefefefe; rooks = empty & (rooks << 1); empty = empty & (empty << 1); rooks = empty & (rooks << 2); empty = empty & (empty << 2); rooks = empty & (rooks << 4); return (rooks << 1) & 0xfefefefefefefefe; }
GerdCode: Select all
u64 eastAttacks(u64 rooks, u64 empty) { const u64 H = 0x8080808080808080; u64 occupied = rooks  ~empty; u64 occNoRook = rooks ^ occupied; u64 difference = ((occNoRook  H)  (rooks & ~H)) ^ (~occupied & H); return occupied ^ difference; }
I'll hope the attackgetter does not violate any United States Patent
see also
http://aggregate.org/MAGIC/#SIMD%20With ... Operations
by Dr. Henry G. (Hank) Dietz
SIMD Within A Register (SWAR) Operations
Before we coined the name SWAR in Fall 1996, we already had defined a complete set of basic operations and described how they could be implemented with good efficiency. On February 4, 1997, we posted this fairly complete overview document and there also are slides from a seminar presentation I gave at Purdue. These methods were used in our SWARC compiler and were detailed in a number of our publications throughout the 1990s. We hadn't posted them on this page because they were so prominently published elsewhere.
However, much to our surprize, United States Patent 7039906, "Compiler for enabling multiple signed independent data elements per register," was filed October 20, 2000 and was issued to IBM on May 2, 2006! By our reading, this patent appears to seriously overlap my and Randy Fisher's earlier published work  much of which was cited by the patent examiner. I am posting this note here so that people who happen to hear about the IBM patent will not be discouraged from using the prior art technologies developed by us, which, by being cited in the patent, are explicitly not covered by the patent.