Score multiply

Zach Wegner · Post by **Zach Wegner** » Sun Nov 08, 2009 11:05 pm

bob wrote:First idea, why not get rid of the negative scores completely? Negative numbers embedded like that are going to be an issue. If you make everything positive, then you don't worry about 2's complement stuff letting the lower order half wreck the high-order half.

Whether this is a good idea at all is a good question. Extra adds are not bad. You rarely keep multiple pipelines busy anyway...

What do you mean? Like having two counters, one for each side, and each is only incremented? So if one side gets a penalty, it is just added to the other side? That could work, but seems a bit messy to me.

One other possibility: since the scores are going to be in a certain range, you could use maybe 12 bits instead of 16 to represent the score. Since adding negative numbers in two's complement results in overflows that we need to get rid of, the extra bits gained out of 32 would be in the middle, and could be used for overflow space for the lower score (and would be ignored). So one score is &0xFFF, the other is >>20. To add -1 to the lower score, you add 0xFFF. This generates a carry into 0x1000, but we can add up to 512 negative numbers without the carries reaching the upper score.

mcostalba · Post by **mcostalba** » Sun Nov 08, 2009 11:06 pm

Gerd Isenberg wrote: Sure, 70*248 + 71*271 == 36601 which is an short overflow. The value range is a bit restricted.

Yes, you are right, when filtering out values that don't respect the condition

Code: Select all

(abs(a*c)+abs(b*d) < 30000)

There are no more hits and the routine works as expected !

Thanks !

mcostalba · Post by **mcostalba** » Sun Nov 08, 2009 11:11 pm

Here is the disassembled

Code: Select all


  // Mobility
  ei.value += apply_weight(ei.mobility, WeightMobility);
00402ABC  mov         eax,dword ptr [ebx+0C0h] 
00402AC2  imul        dword ptr [`anonymous namespace'::WeightMobility (51F83Ch)] 
00402AC8  mov         ecx,eax 
00402ACA  mov         ebp,edx 
00402ACC  and         eax,80000000h 
00402AD1  mov         edx,2 
00402AD6  mul         eax,edx 
00402AD8  add         eax,ecx 
00402ADA  adc         edx,ebp 
00402ADC  mov         eax,edx 
00402ADE  movsx       edx,dx 
00402AE1  sar         eax,1Fh 
00402AE4  movsx       eax,cx 
00402AE7  shl         edx,10h 
00402AEA  add         edx,eax 
00402AEC  add         dword ptr [ebx],edx

Where

Code: Select all

   inline Score apply_weight(Score v, Score w) {

      int64_t v64 = int64_t(v) * int64_t(w);
      short bd = short(v64);
      short ac = short((v64 + 2*(v64 & 0x80000000)) >> 32);
      return make_score(ac, bd);
  }

Gerd Isenberg · Post by **Gerd Isenberg** » Sun Nov 08, 2009 11:26 pm

mcostalba wrote:Here is the disassembled

Code: Select all


  // Mobility
  ei.value += apply_weight(ei.mobility, WeightMobility);
00402ABC  mov         eax,dword ptr [ebx+0C0h] 
00402AC2  imul        dword ptr [`anonymous namespace'::WeightMobility (51F83Ch)] 
00402AC8  mov         ecx,eax 
00402ACA  mov         ebp,edx 
00402ACC  and         eax,80000000h 
00402AD1  mov         edx,2 
00402AD6  mul         eax,edx 
00402AD8  add         eax,ecx 
00402ADA  adc         edx,ebp 
00402ADC  mov         eax,edx 
00402ADE  movsx       edx,dx 
00402AE1  sar         eax,1Fh 
00402AE4  movsx       eax,cx 
00402AE7  shl         edx,10h 
00402AEA  add         edx,eax 
00402AEC  add         dword ptr [ebx],edx

Where

Code: Select all

   inline Score apply_weight(Score v, Score w) {

      int64_t v64 = int64_t(v) * int64_t(w);
      short bd = short(v64);
      short ac = short((v64 + 2*(v64 & 0x80000000)) >> 32);
      return make_score(ac, bd);
  }

How ugly! A second mul by 2! Better try this one

Code: Select all

inline Score apply_weight(Score v, Score w) {

      int64_t v64 = int64_t(v) * int64_t(w);
      unsigned int v32 = (int) v64;
      short bd = short(v64);
      short ac = short(v64 >> 32) + (v32 >> 31);
      return make_score(ac, bd);
  }

Assembly should look like that

Code: Select all

mov         eax,dword ptr [ebx+0C0h]
imul        dword ptr [`anonymous namespace'::WeightMobility (51F83Ch)]
mov         ecx, edx
shr         ecx, 31
and         eax, 0xffff
add         ecx, edx
shl         ecx, 16
or          eax, ecx

Gerd Isenberg · Post by **Gerd Isenberg** » Sun Nov 08, 2009 11:39 pm

mcostalba wrote:

Code: Select all

   inline Score apply_weight(Score v, Score w) {

      int64_t v64 = int64_t(v) * int64_t(w);
      short bd = short(v64);
      short ac = short((v64 + 2*(v64 & 0x80000000)) >> 32);
      return make_score(ac, bd);
  }

And better don't use short at all, but int. Compiler are really weak with shorts.

Code: Select all

Score make_score(int ac, int bd) {return (ac << 16) | (bd & 0xffff);}
    
inline Score apply_weight(Score v, Score w) {

      int64_t v64 = int64_t(v) * int64_t(w);
      unsigned int v32 = (int) v64;
      int bd = int(v64);
      int ac = int(v64 >> 32) + (v32 >> 31);
      return make_score(ac, bd);
  }

mcostalba · Post by **mcostalba** » Sun Nov 08, 2009 11:40 pm

Under MSVC I get (UPDATED)

Code: Select all

       int64_t v64 = int64_t(v) * int64_t(w);
      unsigned int v32 = (int) v64;
      int bd = int(v64);
      int ac = int(v64 >> 32) + (v32 >> 31);
      return make_score(ac / 16, bd / 16);


00402360  imul        dword ptr [esp+4] 
00402364  mov         ecx,eax 
00402366  mov         eax,edx 
00402368  sar         edx,1Fh 
0040236B  mov         edx,ecx 
0040236D  shr         edx,1Fh 
00402370  add         eax,edx 
00402372  cdq              
00402373  and         edx,0Fh 
00402376  add         eax,edx 
00402378  push        esi  
00402379  mov         esi,eax 
0040237B  mov         eax,ecx 
0040237D  cdq              
0040237E  and         edx,0Fh 
00402381  sar         esi,4 
00402384  add         eax,edx 
00402386  shl         esi,10h 
00402389  sar         eax,4 
0040238C  add         eax,esi

The last two /16 are to scale to working values.

And it doesn't works !!!

Gerd Isenberg · Post by **Gerd Isenberg** » Sun Nov 08, 2009 11:49 pm

mcostalba wrote:Under MSVC I get (UPDATED)

Code: Select all

       int64_t v64 = int64_t(v) * int64_t(w);
      unsigned int v32 = (int) v64;
      int bd = int(v64);
      int ac = int(v64 >> 32) + (v32 >> 31);
      return make_score(ac / 16, bd / 16);


00402360  imul        dword ptr [esp+4] 
00402364  mov         ecx,eax 
00402366  mov         eax,edx 
00402368  sar         edx,1Fh 
0040236B  mov         edx,ecx 
0040236D  shr         edx,1Fh 
00402370  add         eax,edx 
00402372  cdq              
00402373  and         edx,0Fh 
00402376  add         eax,edx 
00402378  push        esi  
00402379  mov         esi,eax 
0040237B  mov         eax,ecx 
0040237D  cdq              
0040237E  and         edx,0Fh 
00402381  sar         esi,4 
00402384  add         eax,edx 
00402386  shl         esi,10h 
00402389  sar         eax,4 
0040238C  add         eax,esi

The last two /16 are to scale to working values.

And it doesn't works !!!

with this

Code: Select all

Score make_score(int ac, int bd) {return (ac << 16) | (bd & 0xffff);}

?

mcostalba · Post by **mcostalba** » Sun Nov 08, 2009 11:55 pm

Gerd Isenberg wrote: with this
Code: Select all
Score make_score(int ac, int bd) {return (ac << 16) | (bd & 0xffff);} 
?

Still doesn't works

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Nov 09, 2009 6:51 am

mcostalba wrote:
Gerd Isenberg wrote: with this
Code: Select all
Score make_score(int ac, int bd) {return (ac << 16) | (bd & 0xffff);} 
?
Still doesn't works

No idea, short in time. You may still fall back to the working version, may be better with the conditional ac++ to avoid the second mul - otherwise with all the other overhead one may better use two distinct multiplications ac and bd. This is the intended (untested) assembly with scaling down by 16.

Code: Select all

mov         eax,dword ptr [ebx+0C0h]
imul        dword ptr [`anonymous namespace'::WeightMobility (51F83Ch)]
mov         ecx, edx
sar         eax, 4
shr         ecx, 31
and         eax, 0xffff
add         ecx, edx
sar         ecx, 4
shl         ecx, 16
or          eax, ecx

bob · Post by **bob** » Mon Nov 09, 2009 5:13 pm

Zach Wegner wrote:
bob wrote:First idea, why not get rid of the negative scores completely? Negative numbers embedded like that are going to be an issue. If you make everything positive, then you don't worry about 2's complement stuff letting the lower order half wreck the high-order half.

Whether this is a good idea at all is a good question. Extra adds are not bad. You rarely keep multiple pipelines busy anyway...
What do you mean? Like having two counters, one for each side, and each is only incremented? So if one side gets a penalty, it is just added to the other side? That could work, but seems a bit messy to me.

One other possibility: since the scores are going to be in a certain range, you could use maybe 12 bits instead of 16 to represent the score. Since adding negative numbers in two's complement results in overflows that we need to get rid of, the extra bits gained out of 32 would be in the middle, and could be used for overflow space for the lower score (and would be ignored). So one score is &0xFFF, the other is >>20. To add -1 to the lower score, you add 0xFFF. This generates a carry into 0x1000, but we can add up to 512 negative numbers without the carries reaching the upper score.

I was thinking of two different alternatives.

A penalty for one side is a bonus for the other, yes. But this still leads to negative numbers since the two scores being combined are EG and MG values.

1. I was thinking about the idea I use in my hash table scores. I simply add a large enough value to guarantee that the stored score is always positive (sort of like IEEE floating point does for the exponent). I subtract the bias before actually using one of the scores.

2. One could use two 31 bit values. This leaves the 32nd bit as unused. You can then construct constants that will effectively "subtract" and cause an overflow into that unused bit, which promptly gets ANDed back to zero so that it can't overflow into the upper half of the 64 bit value, which would screw up that value.

However, I don't think the idea is worth anything, since it is hard enough to keep the functional units inside the CPU busy. An extra add that is completely independent of the first add won't add a single clock cycle to the time required to complete the computation. It just gets slipped into another pipe and is done in parallel with the first add.

Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply