| View previous topic :: View next topic |
| Author |
Message |
Gerd Isenberg
Joined: 08 Mar 2006 Posts: 1785 Location: Hattingen, Germany
|
Posted: Mon Apr 23, 2007 10:28 pm Post subject: Two small in-register-lookups |
|
|
Color of square - assumes 32-bit shift by mod 32.
Is that portable for 32-bit ints?
(otherwise sq&31 or sq&15 is necessary)
| Code: |
unsigned int colorOfSquare (unsigned int sq) {
return (0xAA55AA55 >> sq) & 1;
}
|
Center-Distance in the 0..3 range as a folded [32]-element lookup of two bits:
| Code: |
3 3 3 3 3 3 3 3
3 2 2 2 2 2 2 3
3 2 1 1 1 1 2 3
3 2 1 0 0 1 2 3
3 2 1 0 0 1 2 3
3 2 1 1 1 1 2 3
3 2 2 2 2 2 2 3
3 3 3 3 3 3 3 3
|
| Code: |
unsigned int centerDistance(unsigned int a) {
const u64 _13 = 0xffffeaabe55be41b;
a ^= (a - 32) >> 27;
return (_13 >> 2*a) & 3;
}
|
or 32-bit friendly:
| Code: |
unsigned int centerDistance(unsigned int a) {
a ^= (a - 32) >> 27;
return ((-8274523 >> a) & 1)
| (( -30842 >> a) & 2);
}
|
Cheers,
Gerd |
|
| Back to top |
|
 |
J. Wesley Cleveland
Joined: 01 Jul 2006 Posts: 622
|
Posted: Tue Apr 24, 2007 9:08 am Post subject: Re: Two small in-register-lookups |
|
|
| Gerd Isenberg wrote: |
Color of square - assumes 32-bit shift by mod 32.
Is that portable for 32-bit ints?
(otherwise sq&31 or sq&15 is necessary)
| Code: |
unsigned int colorOfSquare (unsigned int sq) {
return (0xAA55AA55 >> sq) & 1;
}
|
|
or
| Code: |
unsigned int colorOfSquare (unsigned int sq) {
return ((sq>>3)^ sq) & 1;
}
|
|
|
| Back to top |
|
 |
H.G.Muller

Joined: 10 Mar 2006 Posts: 12765 Location: Amsterdam
|
Posted: Tue Apr 24, 2007 9:51 am Post subject: Re: Two small in-register-lookups |
|
|
Why would you want to use in-register lookups? Fetching such simple things from a memory table should be very competative, if not faster. I noticed that these out-of-order CPUs often perform very poorly if you only use registers, it seems that there is some internal bottleneck on how many registers you can access per clock. Having some of the operands come from memory is usually more successful in keeping the ALUs busy.
On AMD machines load and stores are piggybacked on ALU makro ops, so there is no decoder penalty in using instructions with memory operands. uOp fusion in Intel machines now does the same (Pentium M and Core CPUs), while Pentium IV circumvented decoder limits by using a trace cache.
In addition, shifts are usually a bottleneck, as most CPUs have only one ALU capable of doing shifts, wich can become a bottleneck if your code uses them heavily. So better reserve them for what you really can't do without. (Plus, the i86 machine language requires a variable shift count to be in one specific register, leading to extra register-to-register moves to implement them.)
If you are worried about L1 footprint, you could pack several such tables in 64 byte. E.g. bits 0 and 1 the center distance, bit 2 the square color, etc.
| Code: |
colorOfSquare = Data[sqr] & 4; // assuming it is used as a boolean, so 4=true
centerDistance = Data[sqr] & 3;
|
(Btw, Joker uses Wesley's method for square color, but I am considering to change that.) |
|
| Back to top |
|
 |
Gerd Isenberg
Joined: 08 Mar 2006 Posts: 1785 Location: Hattingen, Germany
|
Posted: Tue Apr 24, 2007 10:58 am Post subject: Re: Two small in-register-lookups |
|
|
| jwes wrote: |
| Gerd Isenberg wrote: |
Color of square - assumes 32-bit shift by mod 32.
Is that portable for 32-bit ints?
(otherwise sq&31 or sq&15 is necessary)
| Code: |
unsigned int colorOfSquare (unsigned int sq) {
return (0xAA55AA55 >> sq) & 1;
}
|
|
or
| Code: |
unsigned int colorOfSquare (unsigned int sq) {
return ((sq>>3)^ sq) & 1;
}
|
|
Yes, I am aware of that. It has one dependent instruction more and with my mapping (a1 == 0, -> dark square), I even need one additional xor 1 or not
Thus loading the immediate constant to eax, plus shift, mask looks not that bad, despite variable shift needs cl.
| Code: |
; ecx == sq
mov eax, AA55AA55H
shr eax, cl
and eax, 1
|
|
|
| Back to top |
|
 |
Gerd Isenberg
Joined: 08 Mar 2006 Posts: 1785 Location: Hattingen, Germany
|
Posted: Tue Apr 24, 2007 11:00 am Post subject: Re: Two small in-register-lookups |
|
|
| hgm wrote: |
| Why would you want to use in-register lookups? Fetching such simple things from a memory table should be very competative, if not faster. I noticed that these out-of-order CPUs often perform very poorly if you only use registers, it seems that there is some internal bottleneck on how many registers you can access per clock. Having some of the operands come from memory is usually more successful in keeping the ALUs busy. |
Bit-twiddling addiction
Some alternative methods to compute such data - no serious claims intended. Quite modest register usage on x64, RAX (EAX) and CL, one variable shift. Sometimes with some other memory accesses around such computation may be helpfull to find the "best" balance between memory accesss and register computation...
| hgm wrote: |
On AMD machines load and stores are piggybacked on ALU makro ops, so there is no decoder penalty in using instructions with memory operands. uOp fusion in Intel machines now does the same (Pentium M and Core CPUs), while Pentium IV circumvented decoder limits by using a trace cache.
In addition, shifts are usually a bottleneck, as most CPUs have only one ALU capable of doing shifts, wich can become a bottleneck if your code uses them heavily. So better reserve them for what you really can't do without. (Plus, the i86 machine language requires a variable shift count to be in one specific register, leading to extra register-to-register moves to implement them.)
If you are worried about L1 footprint, you could pack several such tables in 64 byte. E.g. bits 0 and 1 the center distance, bit 2 the square color, etc.
| Code: |
colorOfSquare = Data[sqr] & 4; // assuming it is used as a boolean, so 4=true
centerDistance = Data[sqr] & 3;
|
(Btw, Joker uses Wesley's method for square color, but I am considering to change that.) |
|
|
| Back to top |
|
 |
H.G.Muller

Joined: 10 Mar 2006 Posts: 12765 Location: Amsterdam
|
Posted: Tue Apr 24, 2007 12:37 pm Post subject: Re: Two small in-register-lookups |
|
|
You could of course replace the (sqr>>3^sqr)&1 by
for the purpose of making a boolean.
The multiplication could be implemented with a single LEA instruction:
| Code: |
movl _sqr, %eax
leal (%eax, %eax, 8), %eax
andl $8, %eax
|
|
|
| Back to top |
|
 |
Gerd Isenberg
Joined: 08 Mar 2006 Posts: 1785 Location: Hattingen, Germany
|
Posted: Tue Apr 24, 2007 6:36 pm Post subject: Re: Two small in-register-lookups |
|
|
| hgm wrote: |
You could of course replace the (sqr>>3^sqr)&1 by
for the purpose of making a boolean.
The multiplication could be implemented with a single LEA instruction:
| Code: |
movl _sqr, %eax
leal (%eax, %eax, 8), %eax
andl $8, %eax
|
|
Cute, lea takes only one register.
Since I need "1" for dark squares (square a1 == 0) and "0" for light squares, lea eax+8*eax+8 would be nice...
But... lea with scaling takes two cycles on K8.
And i can't force vc2005 to generate that code. It always uses imul -7
| Code: |
0000 8D 44 C9 08 lea eax, [eax*8 + eax + 8]
0004 C1 E8 01 shr eax, 3
0007 83 E0 01 and eax, 1
|
But with sq already in ecx, i still would prefere the in-register lookup.
Codesize is exactly the same
| Code: |
0000 B8 55 AA 55 AA mov eax, aa55aa55
0005 D3 E8 shr eax, cl
0007 83 E0 01 and eax, 1
|
|
|
| Back to top |
|
 |
Gerd Isenberg
Joined: 08 Mar 2006 Posts: 1785 Location: Hattingen, Germany
|
Posted: Tue Apr 24, 2007 8:11 pm Post subject: Re: Two small in-register-lookups |
|
|
Three of zillions versions for a boolean subroutine, to look whether two squares share same color. And what vc2005 makes out of it...
| Code: |
bool sameSquareColor(unsigned int sq1, unsigned int sq2) {
//return ((0xAA55AA55 >> sq1) ^ (sq2*9 >> 3)) & 1;
//return (( ((sq1 ^ sq2)>>3) ^ sq1 ^ sq2 ) ^ 1) & 1;
return ( (0xAA55AA55 >> sq1) ^ (sq2>>3) ^ sq2 ) & 1;
}
|
| Code: |
?sameSquareColor@@YA_NII@Z PROC
00000 6b d2 f9 imul edx, -7
00003 b8 55 aa 55 aa mov eax, aa55aa55H
00008 d3 e8 shr eax, cl
0000a c1 ea 03 shr edx, 3
0000d 33 c2 xor eax, edx
0000f 83 e0 01 and eax, 1
00012 c3 ret 0
|
| Code: |
?sameSquareColor@@YA_NII@Z PROC
00000 8b c1 mov eax, ecx
00002 33 c2 xor eax, edx
00004 c1 e8 03 shr eax, 3
00007 f7 d0 not eax
00009 33 c1 xor eax, ecx
0000b 33 c2 xor eax, edx
0000d 83 e0 01 and eax, 1
00010 c3 ret 0
|
| Code: |
?sameSquareColor@@YA_NII@Z PROC
00000 b8 55 aa 55 aa mov eax, aa55aa55H
00005 d3 e8 shr eax, cl
00007 8b ca mov ecx, edx
00009 c1 e9 03 shr ecx, 3
0000c 33 c1 xor eax, ecx
0000e 33 c2 xor eax, edx
00010 83 e0 01 and eax, 1
00013 c3 ret 0
|
handmade assembly (ml64)
| Code: |
00000000 _sameSquareColor PROC
00000000 8D 44 C9 08 lea eax, [rcx + 8*rcx + 8]
00000004 8D 0C D2 lea ecx, [rdx + 8*rdx]
00000007 33 C1 xor eax, ecx
00000009 C1 E8 03 shr eax, 3
0000000C 83 E0 01 and eax, 1
0000000F C3 ret 0
|
|
|
| Back to top |
|
 |
H.G.Muller

Joined: 10 Mar 2006 Posts: 12765 Location: Amsterdam
|
Posted: Tue Apr 24, 2007 9:48 pm Post subject: Re: Two small in-register-lookups |
|
|
You can XOR the squares first, and then do the test for one square.
| Code: |
bool sameSquareColor(unsigned int sq1, unsigned int sq2) {
sq1 ^= sq2;
return ( (0xAA55AA55 >> sq1) & 1);
}
|
|
|
| Back to top |
|
 |
Gerd Isenberg
Joined: 08 Mar 2006 Posts: 1785 Location: Hattingen, Germany
|
Posted: Tue Apr 24, 2007 9:57 pm Post subject: Re: Two small in-register-lookups |
|
|
| hgm wrote: |
You can XOR the squares first, and then do the test for one square.
| Code: |
bool sameSquareColor(unsigned int sq1, unsigned int sq2) {
sq1 ^= sq2;
return ( (0xAA55AA55 >> sq1) & 1);
}
|
|
Of course! Looks quite competetive now.
| Code: |
?sameSquareColor@@YA_NII@Z PROC
00000 33 ca xor ecx, edx
00002 b8 55 aa 55 aa mov eax, aa55aa55H
00005 d3 e8 shr eax, cl
00007 83 e0 01 and eax, 1
|
|
|
| Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
|