(512, 0) * (46, 0) -> ac = 23552, bd = 0 -> return 1543503872 (instead of 96468992)
(-2, -9) * (15, 16) -> ac = -30, bd = -10944656 -> return -86025 (instead of -1966224)
Oops, I only focused on the upper half.
But the first case works correctly:
(512, 0) * (46, 0) -> 96468992 or 0x05c0 0000 = (1472, 0)
I will look into what goes wrong in the second case, where the lower short is wrong.
The swarImul (without div 16) seems to work correctly with the appropriate range constraints:
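int swarIMulDiv16(int s1, int s2)
{
    // Full 64-bit signed product of the two packed operands.
    __int64 v64 = __int64(s1) * __int64(s2);
    int bd = int(v64);                     // low 32 bits of the product
    // High 32 bits; bd>>31 is 0 or -1, so this adds 1 when the low word is negative.
    int ac = int(v64>>32) - (bd>>31);
    bd = short(bd) >> 4;                   // low short, arithmetic shift right by 4
    // Repack: ac>>4 in the upper short, the low 16 bits of bd in the lower short.
    return ((ac>>4)<<16) + (bd & 0xffff);
}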
Strange thing: all MS compilers seem to produce this odd code with three needless instructions and a "dead" first sar edx,31!? I wonder whether this is some kind of trick for register renaming or something. The 16-bit sar cx,4 does the trick. I really had to look twice, since it has been a long time since I regularly saw such 16-bit code.
Gerd Isenberg wrote:
Strange thing: all MS compilers seem to produce this odd code with three needless instructions and a "dead" first sar edx,31!? I wonder whether this is some kind of trick for register renaming or something. The 16-bit sar cx,4 does the trick. I really had to look twice, since it has been a long time since I regularly saw such 16-bit code.
Alternatively, one may take a closer look at _mm_mullo_epi16, which handles eight shorts at once ...
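For reference, here is a minimal sketch of what the _mm_mullo_epi16 route could look like. It assumes an x86 target with SSE2 and a compiler that ships <emmintrin.h>. The helper names mulLo8 and demoMulLo8 are made up for illustration and are not from this thread. Each of the eight lanes is multiplied independently and only the low 16 bits of each product are kept, so nothing carries or borrows between neighbouring shorts.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: multiply eight signed shorts lane by lane.
// Each result lane holds the low 16 bits of a[i] * b[i].
inline void mulLo8(const std::int16_t a[8], const std::int16_t b[8],
                   std::int16_t out[8]) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_mullo_epi16(va, vb));
}

// Small usage check with operands taken from the cases discussed above.
inline void demoMulLo8() {
    const std::int16_t a[8] = {512, -2, -9, 3, 0, 1, 100, -100};
    const std::int16_t b[8] = {46, 15, 16, -7, 5, -1, 100, 100};
    std::int16_t r[8];
    mulLo8(a, b, r);
    assert(r[0] == 23552);  // 512 * 46
    assert(r[1] == -30);    // -2 * 15
    assert(r[2] == -144);   // -9 * 16
}
```

The products must fit in 16 bits for the lanes to be meaningful as signed results. Anything larger is silently truncated to its low 16 bits.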
Just a wild guess: mov involving edx followed by sar edx,31 seems like an "optimized for speed" expansion of cdq. Maybe they model it like a cdq in intermediate code, so that they are able to generate cdq when optimizing for size.
I have no idea why that first sar wouldn't get marked as dead and removed after the expansion, though, because it's obviously dead. It does look pretty weird.
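To make the cdq comparison concrete, here is a small sketch in C++ of what cdq computes: it fills edx with copies of the sign bit of eax, which is the same value a mov edx,eax followed by sar edx,31 produces. The function names are hypothetical, and the code assumes that >> on a negative int is an arithmetic shift (guaranteed since C++20, and the usual behaviour of x86 compilers before that).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical name: the value cdq leaves in edx for a given eax,
// i.e. 0 for non-negative inputs and -1 for negative ones.
inline std::int32_t signWord(std::int32_t eax) {
    return eax >> 31;  // assumes arithmetic shift
}

// Upper 32 bits of the value after sign extension to 64 bits.
inline std::int32_t upperOfSignExtended(std::int32_t eax) {
    return static_cast<std::int32_t>(static_cast<std::int64_t>(eax) >> 32);
}

// Both views agree for every 32-bit input; this checks a single value.
inline void checkCdqEquivalence(std::int32_t x) {
    assert(signWord(x) == upperOfSignExtended(x));
}
```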
From my synthesizing bug, I learned to add the {0,-1} sign extension of the lower short, i.e. subtract one from the high short if the low short is negative, before the multiplication, to make it work like the SSE2 SIMD approach. It looks ugly, and makes me believe the whole idea of saving one imul sucks.
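One way to read the {0,-1} correction, sketched under assumptions: if the two shorts are packed by plain addition (hi * 65536 + lo), a negative low short borrows from the upper half, so the upper 16 bits as stored are hi - 1. The helper names below are hypothetical and not from this thread. The sketch again assumes an arithmetic >> on negative ints. It only covers the packing and unpacking side, not the multiplication itself.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, for illustration only.

// Pack two signed shorts by plain addition: hi * 65536 + lo.
inline std::int32_t packAdd(int hi, int lo) {
    std::int64_t v = std::int64_t(hi) * 65536 + lo;
    assert(v >= INT32_MIN && v <= INT32_MAX);  // range constraint on the operands
    return static_cast<std::int32_t>(v);
}

// Lower short: the low 16 bits reinterpreted as a signed value.
inline int lowShort(std::int32_t x) {
    int lo = x & 0xffff;
    return lo >= 0x8000 ? lo - 0x10000 : lo;
}

// Upper half exactly as stored (assumes arithmetic shift).
inline int rawUpper(std::int32_t x) {
    return x >> 16;
}

// Logical high short: undo the borrow by subtracting the {0,-1} sign of lo.
inline int highShort(std::int32_t x) {
    int signOfLow = lowShort(x) < 0 ? -1 : 0;
    return rawUpper(x) - signOfLow;
}
```

For the operand (-2, -9) from the failing case, packAdd gives -131081, rawUpper sees -3, and highShort recovers -2. The `- (bd>>31)` term in swarIMulDiv16 applies an analogous correction at the 32-bit boundary of the 64-bit product.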
wgarvin wrote:
Just a wild guess: mov involving edx followed by sar edx,31 seems like an "optimized for speed" expansion of cdq. Maybe they model it like a cdq in intermediate code, so that they are able to generate cdq when optimizing for size.
I have no idea why that first sar wouldn't get marked as dead and removed after the expansion, though, because it's obviously dead. It does look pretty weird.
Good catch. It looks similar to cdq, but makes the result 33-bit (instead of 32-bit) sign-extended to 64 bits. I first thought the "dead" sar was related to the explicit sar 31 from the source, but it isn't.