Rebel wrote:
However, when I was porting my ASM engine back to C using MSVC, I ran into several problems that caused speed losses. One example:
In my eval I have a bunch of variables that need zeroing before starting. For instance, when I declare them as follows:
Code:
static char a1,a2,a3,a4,a5,a6,a7,a8;
static char b1,b2,b3,b4,b5,b6,b7,b8;
Then, using "Digital Mars" in ASM and C, I could clear those 16 variables in 4 instructions:
Code:
ASM
mov dword ptr a1,0
mov dword ptr a5,0
mov dword ptr b1,0
mov dword ptr b5,0
Code:
C
long *p_a1 = (long *) &a1; // 32-bit redefinition
long *p_b1 = (long *) &b1; // 32-bit redefinition
p_a1[0] = p_a1[1] = p_b1[0] = p_b1[1]=0;
This was (still is in the 2012 version?) impossible with MSVC because the compiler apparently has its own philosophy for organizing a1-a8 and b1-b8 in memory, while Digital Mars just leaves the chain intact, as declared by the programmer.
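The layout issue described here follows from the C language itself: separately declared objects have no guaranteed placement relative to each other, whereas members of a struct are stored in declaration order. A minimal sketch of that idea, assuming a hypothetical `eval_flags` struct and helper names that are not from the original engine:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical grouping (not from the original engine): struct members
   are laid out in declaration order, unlike separate static variables. */
struct eval_flags {
    char a1, a2, a3, a4, a5, a6, a7, a8;
    char b1, b2, b3, b4, b5, b6, b7, b8;
};

struct eval_flags ev;

/* Checks the layout assumption. Mainstream compilers insert no padding
   between char members, but the standard does not strictly promise it. */
void check_eval_flags_layout(void)
{
    assert(offsetof(struct eval_flags, a5) == 4);
    assert(offsetof(struct eval_flags, b1) == 8);
    assert(sizeof(struct eval_flags) == 16);
}

/* Clears all 16 members at once. With a fixed size like this, optimizing
   compilers usually expand memset inline into a few wide stores. */
void clear_eval_flags(void)
{
    memset(&ev, 0, sizeof ev);
}
```

Calling `clear_eval_flags()` at the top of the eval replaces the pointer casts, and the members stay accessible by name as `ev.a1` through `ev.b8`.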
This is not a problem:
- are they not already zeroed at startup, as specified by the C standard?
- this task is only done at startup and is so fast it's not even measurable. If it is, you should perhaps reorganize your C code because there must be something wrong with it
- assuming this task is indeed on a performance-critical path, which means it is called in a loop many times, you can organize your data better: use an array and memset (which is basically a "rep stosb"), a compiler intrinsic for an 8-byte movq, or a 16-byte movaps SSE instruction. Or use a union! In all these cases, well-written C code without inline assembly will be as fast as hand-optimized assembly
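The array, memset, and union suggestions in the last point could look roughly like the sketch below. All names (`flags8`, `flags_a`, `flags_b`, `flags`, and the functions) are hypothetical, and whether the compiler really emits single wide stores depends on the compiler and its options:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: eight byte-sized flags overlaid on one 64-bit word. */
union flags8 {
    char     byte[8];
    uint64_t word;
};

union flags8 flags_a, flags_b;

/* Union variant: two 64-bit stores clear all 16 flags. Reading the bytes
   back through the char array is well defined in C. */
void clear_flags_union(void)
{
    flags_a.word = 0;
    flags_b.word = 0;
}

/* Array variant: one memset over a plain 16-byte array. */
char flags[16];

void clear_flags_memset(void)
{
    memset(flags, 0, sizeof flags);
}

/* Verifies that every flag in both variants reads back as zero. */
void check_flags_cleared(void)
{
    for (int i = 0; i < 8; i++)
        assert(flags_a.byte[i] == 0 && flags_b.byte[i] == 0);
    for (int i = 0; i < 16; i++)
        assert(flags[i] == 0);
}
```

Individual flags are then addressed as `flags_a.byte[0]` through `flags_b.byte[7]` instead of `a1` through `b8`, which is the price of making the layout explicit.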
Let's have a look at a trivial example:
Code:
#include <stdio.h>
static char a,b,c,d,e,f,g,h;
void main()
{
a=b=c=d=e=f=g=h=0;
printf("%d,%d,%d,%d,%d,%d,%d,%d\n",a,b,c,d,e,f,g,h);
}
64-bit compile, using GCC 4.7.2, and spitting out the assembly code.
So what do I see:
Code:
.file "main.c"
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "%d,%d,%d,%d,%d,%d,%d,%d\n"
.section .text.startup,"ax",@progbits
.p2align 4,,15
.globl main
.type main, @function
main:
.LFB22:
.cfi_startproc
subq $40, %rsp
.cfi_def_cfa_offset 48
/* a=b=c=d=e=f=g=h=0 */
xorl %r9d, %r9d
xorl %r8d, %r8d
/***********************/
/* now the printf call, pushing all the param on the stack*/
movl $0, 24(%rsp)
movl $0, 16(%rsp)
xorl %ecx, %ecx
movl $0, 8(%rsp)
movl $0, (%rsp)
xorl %edx, %edx
movl $.LC0, %esi
movl $1, %edi
xorl %eax, %eax
movb $0, h(%rip)
movb $0, g(%rip)
movb $0, f(%rip)
movb $0, e(%rip)
movb $0, d(%rip)
movb $0, c(%rip)
movb $0, b(%rip)
movb $0, a(%rip)
call __printf_chk
/*************************/
/* return to the operating system in good order */
addq $40, %rsp
.cfi_def_cfa_offset 8
ret
.cfi_endproc
/******************************************/
/* and some useless bullshit to add bloat to the executable */
.LFE22:
.size main, .-main
.local h
.comm h,1,1
.local g
.comm g,1,1
.local f
.comm f,1,1
.local e
.comm e,1,1
.local d
.comm d,1,1
.local c
.comm c,1,1
.local b
.comm b,1,1
.local a
.comm a,1,1
.ident "GCC: (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2"
.section .note.GNU-stack,"",@progbits
So pretty good, no ?
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.