Score multiply

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Nov 09, 2009 8:08 pm

mcostalba wrote:
Gerd Isenberg wrote: with this
Code: Select all
Score make_score&#40;int ac, int bd&#41; &#123;return &#40;ac << 16&#41; | &#40;bd & 0xffff&#41;;&#125; 
?
Still doesn't works

Does this work for you?

Code: Select all

int swarIMulDiv16&#40;int s1, int s2&#41;
&#123;
  __int64 v64 = __int64&#40;s1&#41; * __int64&#40;s2&#41;;
  int bd = int&#40;v64&#41;;
  int ac = int&#40;v64>>32&#41; - &#40;bd>>31&#41;;
  return &#40;ac<<12&#41; ^ (&#40;bd>>4&#41; & 0xffff&#41;;
&#125;

the VC6 generated assembly looks a bit funny with three redundant instructions, typical compiler confusion

Code: Select all

?swarIMulDiv16@@YAHHH@Z PROC NEAR
  00000	8b 44 24 04	 	mov	 eax, DWORD PTR _s1$&#91;esp-4&#93;
  00004	f7 6c 24 08	 	imul	 DWORD PTR _s2$&#91;esp-4&#93;
  00008	8b c8		 		mov	 ecx, eax    ; ???
  0000a	8b c2		 		mov	 eax, edx    ; ???
  0000c	c1 fa 1f	 		sar	 edx, 31     ; ????
  0000f	8b d1		 		mov	 edx, ecx
  00011	c1 fa 1f	 		sar	 edx, 31
  00014	2b c2		 		sub	 eax, edx
  00016	c1 f9 04	 		sar	 ecx, 4
  00019	c1 e0 0c	 		shl	 eax, 12
  0001c	81 e1 ff ff 00 00 and	 ecx, 65535
  00022	33 c1		 		xor	 eax, ecx
  00024	c3		 			ret	 0
?swarIMulDiv16@@YAHHH@Z ENDP

mcostalba · Post by **mcostalba** » Tue Nov 10, 2009 8:12 am

Gerd Isenberg wrote: Does this work for you?

Code: Select all

int swarIMulDiv16&#40;int s1, int s2&#41;
&#123;
  __int64 v64 = __int64&#40;s1&#41; * __int64&#40;s2&#41;;
  int bd = int&#40;v64&#41;;
  int ac = int&#40;v64>>32&#41; - &#40;bd>>31&#41;;
  return &#40;ac<<12&#41; ^ (&#40;bd>>4&#41; & 0xffff&#41;;
&#125;

Hi Gerd, sorry for the late reply.

It gives me problems, for instance:

Code: Select all

&#40;512, 0&#41; * &#40;46, 0&#41; -> ac = 23552, bd = 0 -> return 1543503872 &#40;instead of 96468992&#41;


(-2, -9&#41; * &#40;15, 16&#41; -> ac = -30, bd = -10944656 ->return -86025 &#40;instead of -1966224&#41;

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Nov 10, 2009 8:42 am

mcostalba wrote:

Gerd Isenberg wrote: Does this work for you?

Code: Select all

int swarIMulDiv16&#40;int s1, int s2&#41;
&#123;
  __int64 v64 = __int64&#40;s1&#41; * __int64&#40;s2&#41;;
  int bd = int&#40;v64&#41;;
  int ac = int&#40;v64>>32&#41; - &#40;bd>>31&#41;;
  return &#40;ac<<12&#41; ^ (&#40;bd>>4&#41; & 0xffff&#41;;
&#125;

Hi Gerd, sorry for the late reply.

It gives me problems, for instance:

Code: Select all

&#40;512, 0&#41; * &#40;46, 0&#41; -> ac = 23552, bd = 0 -> return 1543503872 &#40;instead of 96468992&#41;


(-2, -9&#41; * &#40;15, 16&#41; -> ac = -30, bd = -10944656 ->return -86025 &#40;instead of -1966224&#41;

Oups, I only focused on the higher part

But the first case works correct:
(512, 0) * (46, 0) -> 96468992 or 0x05c0 0000 = (1472, 0)
I will see what works wrong in the second case, where the lower short is wrong.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Nov 10, 2009 11:05 am

Gerd Isenberg wrote:
mcostalba wrote:
Gerd Isenberg wrote: Does this work for you?
Code: Select all
int swarIMulDiv16&#40;int s1, int s2&#41;
&#123;
  __int64 v64 = __int64&#40;s1&#41; * __int64&#40;s2&#41;;
  int bd = int&#40;v64&#41;;
  int ac = int&#40;v64>>32&#41; - &#40;bd>>31&#41;;
  return &#40;ac<<12&#41; ^ (&#40;bd>>4&#41; & 0xffff&#41;;
&#125;
Hi Gerd, sorry for the late reply.

It gives me problems, for instance:
Code: Select all
&#40;512, 0&#41; * &#40;46, 0&#41; -> ac = 23552, bd = 0 -> return 1543503872 &#40;instead of 96468992&#41;


(-2, -9&#41; * &#40;15, 16&#41; -> ac = -30, bd = -10944656 ->return -86025 &#40;instead of -1966224&#41;
Oups, I only focused on the higher part
But the first case works correct:
(512, 0) * (46, 0) -> 96468992 or 0x05c0 0000 = (1472, 0)
I will see what works wrong in the second case, where the lower short is wrong.

The swarImul (without div 16) seems to work correct with the appropriate range constrains:

Code: Select all

int swarIMul&#40;int s1, int s2&#41;
&#123;
  __int64 v64 = __int64&#40;s1&#41; * __int64&#40;s2&#41;;
  int bd = int&#40;v64&#41;;
  int ac = int&#40;v64>>32&#41; - &#40;bd>>31&#41;;
  return &#40;ac<<16&#41; + &#40;bd & 0xffff&#41;;
&#125;

and produces reasonable assembly (VC2008), despite three needless instructions

Code: Select all

?swarIMul@@YAHHH@Z PROC
; _s1$ = eax
  00000	f7 6c 24 04	 imul	 DWORD PTR _s2$&#91;esp-4&#93;
  00004	8b c8		 mov	 ecx, eax
  00006	8b c2		 mov	 eax, edx
  00008	c1 fa 1f	 sar	 edx, 31
  0000b	8b d1		 mov	 edx, ecx
  0000d	c1 fa 1f	 sar	 edx, 31
  00010	2b c2		 sub	 eax, edx
  00012	c1 e0 10	 shl	 eax, 16
  00015	81 e1 ff ff 00
	00		 and	 ecx, 65535
  0001b	03 c1		 add	 eax, ecx
  0001d	c3		 ret	 0

Problem is the swarDiv by 16, which might work that way, but with horrible assembly

Code: Select all

int swarIMulDiv16&#40;int s1, int s2&#41;
&#123;
  __int64 v64 = __int64&#40;s1&#41; * __int64&#40;s2&#41;;
  int bd = int&#40;v64&#41;;
  int ac = int&#40;v64>>32&#41; - &#40;bd>>31&#41;;
  ac /= 16;
  bd = short&#40;bd&#41; / 16;
  return &#40;ac<<16&#41; + &#40;bd & 0xffff&#41;;
&#125;

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Nov 10, 2009 9:43 pm

Gerd Isenberg wrote: [/code]
Problem is the swarDiv by 16, which might work that way, but with horrible assembly

Code: Select all

int swarIMulDiv16&#40;int s1, int s2&#41;
&#123;
  __int64 v64 = __int64&#40;s1&#41; * __int64&#40;s2&#41;;
  int bd = int&#40;v64&#41;;
  int ac = int&#40;v64>>32&#41; - &#40;bd>>31&#41;;
  ac /= 16;
  bd = short&#40;bd&#41; / 16;
  return &#40;ac<<16&#41; + &#40;bd & 0xffff&#41;;
&#125;

Of course I should have used >> 4, which "rounds" differently as div for negative numbers. This seems to give quite fine assembly as well.

Code: Select all

int swarIMulDiv16&#40;int s1, int s2&#41;
&#123;
  __int64 v64 = __int64&#40;s1&#41; * __int64&#40;s2&#41;;
  int bd = int&#40;v64&#41;;
  int ac = int&#40;v64>>32&#41; - &#40;bd>>31&#41;;
  bd = short&#40;bd&#41; >> 4;
  return (&#40;ac>>4&#41;<<16&#41; + &#40;bd & 0xffff&#41;; 
&#125;

Strange thing, all ms-compilers seem to produce this strange code with 3 needless instructions and "dead" first sar edx,31!? I wonder this a kind of trick for register renaming or something. The 16 bit sar cx,4 does the trick. I really had to look twice, since it is long time ago I saw such 16-bit code regulary

Code: Select all

_s1$ = 8
_s2$ = 12
?swarIMulDiv16@@YAHHH@Z PROC NEAR
  00000	8b 44 24 04	 mov	 eax, DWORD PTR _s1$&#91;esp-4&#93;
  00004	f7 6c 24 08	 imul	 DWORD PTR _s2$&#91;esp-4&#93;
  00008	8b c8		 mov	 ecx, eax
  0000a	8b c2		 mov	 eax, edx
  0000c	c1 fa 1f	 sar	 edx, 31
  0000f	8b d1		 mov	 edx, ecx
  00011	c1 fa 1f	 sar	 edx, 31
  00014	2b c2		 sub	 eax, edx
  00016	24 f0		 and	 al, -16
  00018	66 c1 f9 04	 sar	 cx, 4
  0001c	c1 e0 0c	 shl	 eax, 12
  0001f	81 e1 ff ff 00
	00		 and	 ecx, 65535
  00025	03 c1		 add	 eax, ecx
  00027	c3		 ret	 0
?swarIMulDiv16@@YAHHH@Z ENDP

Alternatively one may have a closer look to _mm_mullo_epi16 for eight shorts at once ...

wgarvin · Post by **wgarvin** » Tue Nov 10, 2009 9:51 pm

Gerd Isenberg wrote:Strange thing, all ms-compilers seem to produce this strange code with 3 needless instructions and "dead" first sar edx,31!? I wonder this a kind of trick for register renaming or something. The 16 bit sar cx,4 does the trick. I really had to look twice, since it is long time ago I saw such 16-bit code regulary
Code: Select all
_s1$ = 8
_s2$ = 12
?swarIMulDiv16@@YAHHH@Z PROC NEAR
  00000	8b 44 24 04	 mov	 eax, DWORD PTR _s1$&#91;esp-4&#93;
  00004	f7 6c 24 08	 imul	 DWORD PTR _s2$&#91;esp-4&#93;
  00008	8b c8		 mov	 ecx, eax
  0000a	8b c2		 mov	 eax, edx
  0000c	c1 fa 1f	 sar	 edx, 31
  0000f	8b d1		 mov	 edx, ecx
  00011	c1 fa 1f	 sar	 edx, 31
  00014	2b c2		 sub	 eax, edx
  00016	24 f0		 and	 al, -16
  00018	66 c1 f9 04	 sar	 cx, 4
  0001c	c1 e0 0c	 shl	 eax, 12
  0001f	81 e1 ff ff 00
	00		 and	 ecx, 65535
  00025	03 c1		 add	 eax, ecx
  00027	c3		 ret	 0
?swarIMulDiv16@@YAHHH@Z ENDP
Alternatively one may have a closer look to _mm_mullo_epi16 for eight shorts at once ...

Just a wild guess: mov involving edx followed by sar edx,31 seems like an "optimized for speed" expansion of cdq. Maybe they model it like a cdq in intermediate code, so that they are able to generate cdq when optimizing for size.

I have no idea why that first sar wouldn't get marked as dead and removed after the expansion, though, because its obviously dead. It does look pretty weird.

mcostalba · Post by **mcostalba** » Tue Nov 10, 2009 10:48 pm

Gerd Isenberg wrote: This seems to give quite fine assembly as well.

Code: Select all

int swarIMulDiv16&#40;int s1, int s2&#41;
&#123;
  __int64 v64 = __int64&#40;s1&#41; * __int64&#40;s2&#41;;
  int bd = int&#40;v64&#41;;
  int ac = int&#40;v64>>32&#41; - &#40;bd>>31&#41;;
  bd = short&#40;bd&#41; >> 4;
  return (&#40;ac>>4&#41;<<16&#41; + &#40;bd & 0xffff&#41;; 
&#125;

Hi Gerd, sorry again for not answering you today, but it was not possible for me to test your code.

Tomorrow will not be better, I will be short of time but I hope to give you a deserved answer about your last solution.

Thanks
Marco

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Nov 10, 2009 11:55 pm

mcostalba wrote:
Gerd Isenberg wrote: This seems to give quite fine assembly as well.
Code: Select all
int swarIMulDiv16&#40;int s1, int s2&#41;
&#123;
  __int64 v64 = __int64&#40;s1&#41; * __int64&#40;s2&#41;;
  int bd = int&#40;v64&#41;;
  int ac = int&#40;v64>>32&#41; - &#40;bd>>31&#41;;
  bd = short&#40;bd&#41; >> 4;
  return (&#40;ac>>4&#41;<<16&#41; + &#40;bd & 0xffff&#41;; 
&#125;
Hi Gerd, sorry again for not answering you today, but it was not possible for me to test your code.

Tomorrow will not be better, I will be short of time but I hope to give you a deserved answer about your last solution.

Thanks
Marco

Hi Marco,
no need to test it. Still buggy. I found a bug in my test ad hoc framework in synthesizing the two shorts to an int

Yes, this is a time waster, and I am also a bit short off due to work. But an interesting bit twiddling challenge

So far this should be the reference.

Code: Select all

int simdIMulDiv16&#40;int s1, int s2&#41;
&#123;
	__asm &#123;
        movd   xmm0, &#91;s1&#93;
        movd   xmm1, &#91;s2&#93;
        pmullw xmm0, xmm1
        psraw  xmm0, 4
        movd   eax, xmm0
	&#125;
&#125;

More later ...

Cheers,
Gerd

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Nov 11, 2009 12:35 am

From my synthesizing bug, I learned to add the {0,-1}signextension of the lower short, i.e subtract one from high if low is negative before multiplication to make it work like the sse2 simd approach. Looks ugly, and makes me believe the whole idea to safe one imul sucks.

Code: Select all

int swarIMulDiv16&#40;int s1, int s2&#41;
&#123;
  s1 += int&#40;short&#40;s1&#41;) & 0xffff0000;
  s2 += int&#40;short&#40;s2&#41;) & 0xffff0000;
  __int64 v64 = __int64&#40;s1&#41; * __int64 &#40;s2&#41;;
  int bd = int&#40;v64&#41;;
  int ac = int&#40;v64>>32&#41; - &#40;bd>>31&#41;;
  bd = short&#40;bd&#41; >> 4;
  return (&#40;ac>>4&#41;<<16&#41; + &#40;bd & 0xffff&#41;; 
&#125;

I'll vote for the sse2 approach with the option to multiply vector of eight shorts instead of two. It also has a range advantage.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Nov 11, 2009 12:46 am

wgarvin wrote: Just a wild guess: mov involving edx followed by sar edx,31 seems like an "optimized for speed" expansion of cdq. Maybe they model it like a cdq in intermediate code, so that they are able to generate cdq when optimizing for size.

I have no idea why that first sar wouldn't get marked as dead and removed after the expansion, though, because its obviously dead. It does look pretty weird.

Good catch, looks similar to cdq, but makes the result 33-bit (instead of 32) sign extended to 64. I first thought the "dead" sar is related to the explicit sar 31 from the source, but it isn't.

Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply

Re: Score multiply