Score multiply

wgarvin · Post by **wgarvin** » Wed Nov 11, 2009 7:15 am

Gerd Isenberg wrote:
wgarvin wrote: Just a wild guess: mov involving edx followed by sar edx,31 seems like an "optimized for speed" expansion of cdq. Maybe they model it like a cdq in intermediate code, so that they are able to generate cdq when optimizing for size.

I have no idea why that first sar wouldn't get marked as dead and removed after the expansion, though, because its obviously dead. It does look pretty weird.
Good catch, looks similar to cdq, but makes the result 33-bit (instead of 32) sign extended to 64. I first thought the "dead" sar is related to the explicit sar 31 from the source, but it isn't.

Its still just a 32-bit result (and the first sar is definitely dead, both edx and flags get clobbered by the next 2 insns).

The first two sars are for the int to __int64 conversion. For both of them, it generated the mov / sar insns, and it used eax:edx register pair (not noticing that one half of it was dead, in the first case at least). So internally the sign-extension was probably a cdq, and then a late peephole pass or something, converted it to those two instructions (too late for DCE).

[Edit: or maybe it wasn't THAT late of a pass.. the register allocator probably has to allocate 2 registers to the __int64, and liveness might be tracked for the entire variable rather than for the individual register. Its not really a surprise to see dumb generated code for 64-bit variables from a 32-bit compiler, its the sort of functionality that is not used much by most programs.]

[Edit 2: You could try the Int32x32To64 or UInt32x32To64 intrinsics. MS documentation claims that they produce a single multiply instruction.]

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Nov 11, 2009 8:34 am

wgarvin wrote:
Gerd Isenberg wrote:
wgarvin wrote: Just a wild guess: mov involving edx followed by sar edx,31 seems like an "optimized for speed" expansion of cdq. Maybe they model it like a cdq in intermediate code, so that they are able to generate cdq when optimizing for size.

I have no idea why that first sar wouldn't get marked as dead and removed after the expansion, though, because its obviously dead. It does look pretty weird.
Good catch, looks similar to cdq, but makes the result 33-bit (instead of 32) sign extended to 64. I first thought the "dead" sar is related to the explicit sar 31 from the source, but it isn't.
Its still just a 32-bit result (and the first sar is definitely dead, both edx and flags get clobbered by the next 2 insns).

The first two sars are for the int to __int64 conversion. For both of them, it generated the mov / sar insns, and it used eax:edx register pair (not noticing that one half of it was dead, in the first case at least). So internally the sign-extension was probably a cdq, and then a late peephole pass or something, converted it to those two instructions (too late for DCE).

[Edit: or maybe it wasn't THAT late of a pass.. the register allocator probably has to allocate 2 registers to the __int64, and liveness might be tracked for the entire variable rather than for the individual register. Its not really a surprise to see dumb generated code for 64-bit variables from a 32-bit compiler, its the sort of functionality that is not used much by most programs.]

[Edit 2: You could try the Int32x32To64 or UInt32x32To64 intrinsics. MS documentation claims that they produce a single multiply instruction.]

Yes, compiler emits a 32-bit imul where edx:eax contains the signed 64-bit product, which might be used for our purpose. The sign bit is MSB from edx (bit 63). It might be set, even if the MSB from eax (bit 31) is zero. The sar edx,31 "smears" all edx bits as sign. So there are 33-bits inside the product, 32-bits inside eax plus one bit from edx. Otoh CDQ only sign extends eax to edx:eax and smears MSB from eax to edx. Thus imho the semantics are a (one) bit different, for whatever reason

mcostalba · Post by **mcostalba** » Wed Nov 11, 2009 9:28 am

Gerd Isenberg wrote: I'll vote for the sse2 approach with the option to multiply vector of eight shorts instead of two. It also has a range advantage.

Hi Gerd,

the sse2 approach could handle this ?

Code: Select all

  // apply_weight&#40;) applies an evaluation weight to a value trying to prevent overflow

  inline Score apply_weight&#40;Score v, Score w&#41; &#123;
      return make_score&#40;&#40;int&#40;mg_value&#40;v&#41;) * mg_value&#40;w&#41;) / 0x100, &#40;int&#40;eg_value&#40;v&#41;) * eg_value&#40;w&#41;) / 0x100&#41;;
  &#125;

This is the usual formula but now as you can see they are scaled by 0x100, not by 0x10 as before. The end result is the same and this means that the values are up to 16x times bigger then before, IOW there can be some overflow on short size of 16bit, let's assume I need 16+4 = 20 bit of precision for the multiplication of each term before the final scale down of 0x100.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Nov 11, 2009 11:07 am

mcostalba wrote:
Gerd Isenberg wrote: I'll vote for the sse2 approach with the option to multiply vector of eight shorts instead of two. It also has a range advantage.
Hi Gerd,

the sse2 approach could handle this ?
Code: Select all
  // apply_weight&#40;) applies an evaluation weight to a value trying to prevent overflow

  inline Score apply_weight&#40;Score v, Score w&#41; &#123;
      return make_score&#40;&#40;int&#40;mg_value&#40;v&#41;) * mg_value&#40;w&#41;) / 0x100, &#40;int&#40;eg_value&#40;v&#41;) * eg_value&#40;w&#41;) / 0x100&#41;;
  &#125;
This is the usual formula but now as you can see they are scaled by 0x100, not by 0x10 as before. The end result is the same and this means that the values are up to 16x times bigger then before, IOW there can be some overflow on short size of 16bit, let's assume I need 16+4 = 20 bit of precision for the multiplication of each term before the final scale down of 0x100.

Yes, sse2 can handle this with less brain damage

Code: Select all

int simdIMulDiv256&#40;int s1, int s2&#41;
&#123;
   __asm &#123;
        movd   xmm0, &#91;s1&#93;
        movd   xmm1, &#91;s2&#93;
        pmullw xmm0, xmm1
        psraw  xmm0, 8
        movd   eax, xmm0
   &#125;
&#125;

One may even use pmulhw additionally to get the signed higher 16-bit results into a second register for 32-bit results.

mcostalba · Post by **mcostalba** » Wed Nov 11, 2009 11:53 am

Gerd Isenberg wrote: Yes, sse2 can handle this with less brain damage

I have tried to subsitute as is but it is not equivalent to the commented out code:

Code: Select all

  inline Score apply_weight&#40;Score v, Score w&#41; &#123;

      //return make_score&#40;&#40;int&#40;mg_value&#40;v&#41;) * mg_value&#40;w&#41;) / 0x100, &#40;int&#40;eg_value&#40;v&#41;) * eg_value&#40;w&#41;) / 0x100&#41;;

    __asm &#123;
        movd   xmm0, &#91;v&#93;
        movd   xmm1, &#91;w&#93;
        pmullw xmm0, xmm1
        psraw  xmm0, 8
        movd   eax, xmm0
   &#125;
  &#125;

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Nov 11, 2009 12:33 pm

mcostalba wrote:
Gerd Isenberg wrote: Yes, sse2 can handle this with less brain damage
I have tried to subsitute as is but it is not equivalent to the commented out code:
Code: Select all
  inline Score apply_weight&#40;Score v, Score w&#41; &#123;

      //return make_score&#40;&#40;int&#40;mg_value&#40;v&#41;) * mg_value&#40;w&#41;) / 0x100, &#40;int&#40;eg_value&#40;v&#41;) * eg_value&#40;w&#41;) / 0x100&#41;;

    __asm &#123;
        movd   xmm0, &#91;v&#93;
        movd   xmm1, &#91;w&#93;
        pmullw xmm0, xmm1
        psraw  xmm0, 8
        movd   eax, xmm0
   &#125;
  &#125;

I guess this is also true if you replace / x0x100 by >> 8?

Code: Select all

return make_score&#40;&#40;int&#40;mg_value&#40;v&#41;) * mg_value&#40;w&#41;) >> 8, &#40;int&#40;eg_value&#40;v&#41;) * eg_value&#40;w&#41;)  >> 8&#41;;

Div is not equivalent to shift arithmetical right for negative numbers. Shift right rounds negative number towards -oo but not zero, e.g. -1 >> 1 == -1 , but -1 div 2 == 0. But I guess there are solutions for the simd code as well, as you may figur out by yourself by comparing the assembly from / 256 versus >> 8

Cheers,
Gerd

mcostalba · Post by **mcostalba** » Sun Feb 07, 2010 11:55 am

Here we are again

Due to overflow I abandoned unified score multilpy, but now I come up with another idea that works.

Score is an integer of 32 bits where the lower 16 bits are used for endgame score value and the higher 16 bits for midgame value.

A weight is another Score, splitted in 16 lower bits for weight for endgame and higher 16 bits for weight for midgame.

I need to multiply those two scores and get a score as result so that:

Code: Select all

Score s = Score&#40;mg_s, eg_s&#41;;
Score w = Score&#40;mg_w, eg_w&#41;

Score r = Score&#40;mg_s * mg_w  / 0x100, eg_s * eg_w / 0x100&#41;;

I want to do this with a single multiplication, but due to overflow I cannot. Now the idea is to use a Score64 where the highest 22 bits (from 44 to 63) keep the midgame score and the lowest 22 bits (from 0 to 21) keeps the endgame score. Now I can multiply these Score64 without any problem, the new apply_weight() function is

Code: Select all

Score apply_weight&#40;int64_t s64, int64_t w64&#41;
&#123;
  int64_t v64;
  int ac, bd;
  bool sign;

  static const int SH = 22;

  v64 = s64 * w64;

  sign = v64 & &#40;1ULL << &#40;2*SH - 1&#41;);
  ac = &#40;int&#40;v64 >> &#40;2*SH&#41;) + sign&#41; / 0x100;
  bd = &#40;int&#40;v64&#41; << &#40;32 - SH&#41;) / &#40;1 << &#40;32 - SH + 8&#41;);

  return Score&#40;&#40;ac << 16&#41; + bd&#41;;
&#125;

Now my question is what is the best and fastest way to transform a Score32 in a Score64 ?

Given a 32 bit integer like

s32 = MSB16 | LSB16

I want to get a 64 bit quantity like:

s64 = MSB16 00000000000000000 LSB16

IOW I whould like to space the MSb and LSB parts putting zeros in the middle (or 1 in the middle to keep the sign).

The current way to do it is:

Code: Select all

int64_t s64 = &#40;int64_t&#40;mg_value&#40;v&#41;) << SH&#41; + int64_t&#40;eg_value&#40;v&#41;);

But I would like something faster that do not pass through extracting mg and eg parts and then recombining again to get the 64 bit quantity.

Could someone help me solve this bit tweaks puzzle ?

Thanks
Marco

Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply