C vs ASM

hgm · Post by **hgm** » Thu Mar 07, 2013 9:43 am

Rebel wrote:However when I was porting my ASM engine back to C using MSVC I ran into several problems causing speed losses. One of the examples:

In my eval I have a bunch of variables that need zeroing before starting. For instance, when I declare them as follows:
Code: Select all
static char a1,a2,a3,a4,a5,a6,a7,a8;
static char b1,b2,b3,b4,b5,b6,b7,b8;
Then using "Digital Mars" in ASM and C I could clear those 16 variables in 4 instructions:
Code: Select all
ASM
mov dword ptr a1,0
mov dword ptr a5,0
mov dword ptr b1,0
mov dword ptr b5,0
Code: Select all
C
long *p_a1 = (long *) &a1;       // 32-bit redefinition
long *p_b1 = (long *) &b1;       // 32-bit redefinition
p_a1[0] = p_a1[1] = p_b1[0] = p_b1[1]=0;
This was (still is in the 2012 version?) impossible with MSVC because the compiler apparently has its own philosophy organizing a1-a8 and b1-b8 into memory while Digital Mars just leaves the chain as declared by the programmer intact.

There are tricks to force a certain memory layout on the compiler. You make the variables in question part of a struct or array. In your case you could write

Code: Select all

static char a[8];
static char b[8];

{
    ....
    *(long long int*)a = 0;
    *(long long int*)b = 0;
}

to clear them in two instructions. Or, if you don't want to rewrite existing code by replacing a1...a8 by a[0]...a[7] everywhere, you can use preprocessor macros:

Code: Select all

#define a1 a[0]
...
#define a8 a[7]
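
A minimal sketch of the array-plus-macro idea, assuming zero-based indices (a1 maps to a[0]) and using memset instead of a pointer cast so that no aliasing or alignment assumptions are needed. Only a few of the sixteen macros are written out, and the helper name `clear_eval_vars` is hypothetical:

```c
#include <string.h>

/* Arrays guarantee a contiguous layout, whatever the compiler does
   with separately declared variables. */
static char a[8];
static char b[8];

/* The old names stay usable; the remaining macros follow the same pattern. */
#define a1 a[0]
#define a2 a[1]
#define a8 a[7]
#define b1 b[0]
#define b8 b[7]

/* Hypothetical helper: zero all 16 variables.
   Compilers typically lower a fixed-size memset like this to one or two stores. */
static void clear_eval_vars(void)
{
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
}
```

Existing code that reads or writes a1 or b8 compiles unchanged, and calling clear_eval_vars() at the start of the evaluation resets everything.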

In Spartacus I have a lot of interleaved tables, because 0x88-style mailbox boards have a lot of unused elements. I fill those elements with other tables to make optimal use of cache space, with code like this:

Code: Select all

unsigned char raw[1024];
#define promoTab raw
#define boardStep ((signed char *)raw + 8)
#define promoPiece (raw+128)

and then I can use promoTab[sqr], boardStep[sqr] and promoPiece[sqr] as if they were simple arrays, and the compiler will treat them exactly as such (except that they sit at chosen offsets inside raw, which is what creates the overlap). Expressions like ((signed char*)raw + 8) evaluate to a constant (known at compile time) of type (signed char*), which is exactly what the name of an array of signed char would decay to.
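
As a rough illustration of how such overlaid views behave, here is a small sketch built on the three macros above. The helper `step_at` is hypothetical and not taken from Spartacus:

```c
/* One backing buffer; each "table" is a view at a fixed offset into it. */
static unsigned char raw[1024];
#define promoTab   raw
#define boardStep  ((signed char *)raw + 8)
#define promoPiece (raw + 128)

/* Hypothetical helper: read a (possibly negative) step for a square.
   boardStep[sqr] is the byte raw[sqr + 8], interpreted as signed char. */
static int step_at(int sqr)
{
    return boardStep[sqr];
}
```

Writing boardStep[3] = -16 stores the byte 0xF0 in raw[11] on a two's complement machine, and step_at(3) reads it back as -16. Accessing the buffer through both unsigned char and signed char pointers is allowed, because character types may alias any object.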

Evert · Post by **Evert** » Thu Mar 07, 2013 9:51 am

Rebel wrote:However when I was porting my ASM engine back to C using MSVC I ran into several problems causing speed losses. One of the examples:

In my eval I have a bunch of variables that need zeroing before starting. For instance, when I declare them as follows:
Code: Select all
static char a1,a2,a3,a4,a5,a6,a7,a8;
static char b1,b2,b3,b4,b5,b6,b7,b8;
Then using "Digital Mars" in ASM and C I could clear those 16 variables in 4 instructions:
Code: Select all
ASM
mov dword ptr a1,0
mov dword ptr a5,0
mov dword ptr b1,0
mov dword ptr b5,0
Code: Select all
C
long *p_a1 = (long *) &a1;       // 32-bit redefinition
long *p_b1 = (long *) &b1;       // 32-bit redefinition
p_a1[0] = p_a1[1] = p_b1[0] = p_b1[1]=0;
This was (still is in the 2012 version?) impossible with MSVC because the compiler apparently has its own philosophy organizing a1-a8 and b1-b8 into memory while Digital Mars just leaves the chain as declared by the programmer intact.

I think this is a very bad example because there is a perfectly well-defined and portable way to do what you describe without making unsafe assumptions about what the compiler does. In general it's bad to rely on behaviour the language leaves undefined or unspecified, like how the compiler organises variables in memory. It may well have a good (performance) reason for organising things differently, depending on target architecture.

Anyway, if you care about the exact memory layout of variables, the correct (and portable) solution is to put them in a struct (but the compiler may add padding to the end of the struct) or, if they're all the same size, put them in an array. Then you can use memset to clear the lot, or (if you insist) a union with an array of the same size but using a larger integer type.
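
A minimal sketch of the union variant, assuming a C11 compiler and a named inner struct (`v`) rather than compiler-specific anonymous members. The names `ev`, `clear_with_union` and `clear_with_memset` are made up for illustration:

```c
#include <stdint.h>
#include <string.h>

/* The struct fixes the member order; the union overlays one 64-bit integer. */
static union {
    struct { char a1, a2, a3, a4, a5, a6, a7, a8; } v;
    uint64_t all;
} ev;

/* Fails at compile time if the compiler ever pads the struct of eight chars. */
_Static_assert(sizeof ev.v == sizeof(uint64_t), "struct must be exactly 8 bytes");

/* Clear all eight variables with a single 64-bit store. */
static void clear_with_union(void)
{
    ev.all = 0;
}

/* Equivalent clear that does not need the union member at all. */
static void clear_with_memset(void)
{
    memset(&ev.v, 0, sizeof ev.v);
}
```

In C, writing one union member and reading another is permitted type punning, so reading ev.v.a1 after clear_with_union() is fine; C++ is stricter here.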

Joost Buijs · Post by **Joost Buijs** » Thu Mar 07, 2013 11:33 am

Rebel wrote:
lucasart wrote: (*) Ed please don't take this as a personal attack. I write sucky code too, and so does everyone (then we fix it, programming is often an iterative process). And I would like to thank you for your efforts and time on this case study.
I don't feel offended, instead I blame myself for (unconsciously) cherry-picking a too-small piece of code that performed faster in ASM than in C on my PC.

It certainly helped to debunk the myth that putting assembly code in a chess program is a good idea: it looks tempting at first, until you do it and realize that it's a bloody stupid idea...
Certainly my respect for the compiler has grown.

However when I was porting my ASM engine back to C using MSVC I ran into several problems causing speed losses. One of the examples:

In my eval I have a bunch of variables that need zeroing before starting. For instance, when I declare them as follows:
Code: Select all
static char a1,a2,a3,a4,a5,a6,a7,a8;
static char b1,b2,b3,b4,b5,b6,b7,b8;
Then using "Digital Mars" in ASM and C I could clear those 16 variables in 4 instructions:
Code: Select all
ASM
mov dword ptr a1,0
mov dword ptr a5,0
mov dword ptr b1,0
mov dword ptr b5,0
Code: Select all
C
long *p_a1 = (long *) &a1;       // 32-bit redefinition
long *p_b1 = (long *) &b1;       // 32-bit redefinition
p_a1[0] = p_a1[1] = p_b1[0] = p_b1[1]=0;
This was (still is in the 2012 version?) impossible with MSVC because the compiler apparently has its own philosophy organizing a1-a8 and b1-b8 into memory while Digital Mars just leaves the chain as declared by the programmer intact.

Like Evert already said you can use an anonymous union for it.
For instance MSVC and Intel C++ allow you to do something like this:

Code: Select all

static union {
	#pragma pack(1)
	struct {
		char a1, a2, a3, a4, a5, a6, a7, a8;
	};
	#pragma pack()
	__int64 an;
};

an = 0; // this clears all 8 characters at once

You have to use the pack(1) pragma, otherwise the characters will be aligned on the default boundary (probably 8).

lucasart · Post by **lucasart** » Thu Mar 07, 2013 11:41 am

Rebel wrote: However when I was porting my ASM engine back to C using MSVC I ran into several problems causing speed losses. One of the examples:

In my eval I have a bunch of variables that need zeroing before starting. For instance, when I declare them as follows:
Code: Select all
static char a1,a2,a3,a4,a5,a6,a7,a8;
static char b1,b2,b3,b4,b5,b6,b7,b8;
Then using "Digital Mars" in ASM and C I could clear those 16 variables in 4 instructions:
Code: Select all
ASM
mov dword ptr a1,0
mov dword ptr a5,0
mov dword ptr b1,0
mov dword ptr b5,0
Code: Select all
C
long *p_a1 = (long *) &a1;       // 32-bit redefinition
long *p_b1 = (long *) &b1;       // 32-bit redefinition
p_a1[0] = p_a1[1] = p_b1[0] = p_b1[1]=0;
This was (still is in the 2012 version?) impossible with MSVC because the compiler apparently has its own philosophy organizing a1-a8 and b1-b8 into memory while Digital Mars just leaves the chain as declared by the programmer intact.

This is not a problem:
- are they not already zeroed out at startup, as specified by the C standard?
- this task is only done at startup and is so fast it's not even measurable. If it is, you should perhaps reorganize your C code because there must be something wrong with it
- assuming this task is indeed on a performance-critical path, which means it is called in a loop a lot of times, you can organize your data better (use an array and memset, which is basically a "rep stosb", or a compiler intrinsic for an 8-byte movq, or a 16-byte movaps SSE instruction), or use a union! In all cases well-written C code without inline assembly will be as fast as hand-optimized assembly

Let's have a look at a trivial example:

Code: Select all

#include <stdio.h>
static char a,b,c,d,e,f,g,h;
void main()
{
        a=b=c=d=e=f=g=h=0;
        printf("%d,%d,%d,%d,%d,%d,%d,%d\n",a,b,c,d,e,f,g,h);
}

64-bit compile, using GCC 4.7.2, and spitting out the assembly code

Code: Select all

$ gcc ./main.c -O3 -S

So what do I see:

Code: Select all

	.file	"main.c"
	.section	.rodata.str1.1,"aMS",@progbits,1
.LC0:
	.string	"%d,%d,%d,%d,%d,%d,%d,%d\n"
	.section	.text.startup,"ax",@progbits
	.p2align 4,,15
	.globl	main
	.type	main, @function
main:
.LFB22:
	.cfi_startproc
	subq	$40, %rsp
	.cfi_def_cfa_offset 48
/* zero the register arguments for printf (%r8d, %r9d here; %ecx, %edx below) */
	xorl	%r9d, %r9d
	xorl	%r8d, %r8d
/***********************/
/* now the printf call: the remaining four arguments go on the stack */
	movl	$0, 24(%rsp)
	movl	$0, 16(%rsp)
	xorl	%ecx, %ecx
	movl	$0, 8(%rsp)
	movl	$0, (%rsp)
	xorl	%edx, %edx
	movl	$.LC0, %esi
	movl	$1, %edi
	xorl	%eax, %eax
/* a=b=c=d=e=f=g=h=0: eight separate byte stores */
	movb	$0, h(%rip)
	movb	$0, g(%rip)
	movb	$0, f(%rip)
	movb	$0, e(%rip)
	movb	$0, d(%rip)
	movb	$0, c(%rip)
	movb	$0, b(%rip)
	movb	$0, a(%rip)
	call	__printf_chk
/*************************/
/* return to the operating system in good order */
	addq	$40, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc
/******************************************/
/* and some useless bullshit to add bloat to the executable */
.LFE22:
	.size	main, .-main
	.local	h
	.comm	h,1,1
	.local	g
	.comm	g,1,1
	.local	f
	.comm	f,1,1
	.local	e
	.comm	e,1,1
	.local	d
	.comm	d,1,1
	.local	c
	.comm	c,1,1
	.local	b
	.comm	b,1,1
	.local	a
	.comm	a,1,1
	.ident	"GCC: (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2"
	.section	.note.GNU-stack,"",@progbits

So pretty good, no ?

Joost Buijs · Post by **Joost Buijs** » Thu Mar 07, 2013 12:05 pm

Joost Buijs wrote: Like Evert already said you can use an anonymous union for it.
For instance MSVC and Intel C++ allow you to do something like this:
Code: Select all
static union {
	#pragma pack(1)
	struct {
		char a1, a2, a3, a4, a5, a6, a7, a8;
	};
	#pragma pack()
	__int64 an;
};

an = 0; // this clears all 8 characters at once
You have to use the pack(1) pragma, otherwise the characters will be aligned on the default boundary (probably 8).

I guess alignment is also the problem with your example.
Probably Digital Mars does not give a damn about alignment.
You can tell MSVC to align all data items on a 1 byte boundary, of course this will decrease performance.
Anyway it is not good practice to rely on unspecified behavior of a compiler.

hgm · Post by **hgm** » Thu Mar 07, 2013 2:14 pm

I thought the C standard for aligning char was on 1-byte boundaries? I am pretty sure it must be, as in Fairy-Max my hash entry is defined as

Code: Select all

struct _ { int signature, score; char from, to, depth, flags; } *hashTable;

and I know from the memory footprint that this measures 12 bytes.

Joost Buijs · Post by **Joost Buijs** » Thu Mar 07, 2013 2:27 pm

hgm wrote:I thought the C standard for aligning char was on 1-byte boundaries? I am pretty sure it must be, as in Fairy-Max my hash entry is defined as
Code: Select all
struct _ { int signature, score; char from, to, depth, flags; } *hashTable;
and I know from the memory footprint that this measures 12 bytes.

Well, I don't know which compiler you use, but with MSVC and Intel C++ I think this is not true.

But of course I could have it wrong. It is something I read in the documentation a long time ago, and since then I have always used the pragma. Now you have made me curious, and I'm going to check it immediately.

Joost Buijs · Post by **Joost Buijs** » Thu Mar 07, 2013 2:54 pm

hgm wrote:I thought the C standard for aligning char was on 1-byte boundaries? I am pretty sure it must be, as in Fairy-Max my hash entry is defined as
Code: Select all
struct _ { int signature, score; char from, to, depth, flags; } *hashTable;
and I know from the memory footprint that this measures 12 bytes.

It is a bit fishy, the alignment has to do with the padding at the end of the struct. With the default alignment of 8 I get the following:

struct a { char a, b; } sizeof(struct a) == 2
struct b { int x; char a, b; } sizeof(struct b) == 8

To make things easier I always use the pragma when I want to have the struct packed.

Joost Buijs · Post by **Joost Buijs** » Thu Mar 07, 2013 4:34 pm

hgm wrote:I thought the C standard for aligning char was on 1-byte boundaries? I am pretty sure it must be, as in Fairy-Max my hash entry is defined as
Code: Select all
struct _ { int signature, score; char from, to, depth, flags; } *hashTable;
and I know from the memory footprint that this measures 12 bytes.

As it seems, a struct is always padded to a multiple of the size of its largest element or a multiple of the default alignment, whichever is smaller.
So, when you only use chars in a struct it is always packed.
The example I gave is not wrong, but the pragma pack(1) is redundant.
A human is never too old to learn something new.
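
The sizes discussed above can be checked directly. This is a small sketch assuming a C11 compiler and no pack pragma in effect; the struct names and the helper `tail_padding` are hypothetical, and the concrete numbers in the comments assume a 4-byte int as on common x86 targets:

```c
#include <stddef.h>

struct chars_only     { char a, b; };                 /* typically 2 bytes */
struct int_then_chars { int x; char a, b; };          /* typically 8 bytes */
struct hash_entry     { int signature, score;
                        char from, to, depth, flags; }; /* typically 12 bytes */

/* char has the weakest alignment requirement, so no pragma is needed
   to keep a run of chars contiguous. */
_Static_assert(_Alignof(char) == 1, "char is byte-aligned");

/* Hypothetical helper: bytes of padding after the last member of
   struct int_then_chars (2 with a 4-byte int). */
static size_t tail_padding(void)
{
    return sizeof(struct int_then_chars)
         - (offsetof(struct int_then_chars, b) + sizeof(char));
}
```

The trailing padding exists so that every element of an array of such structs keeps its int member aligned; a struct containing only chars needs none.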

rbarreira · Post by **rbarreira** » Fri Mar 08, 2013 10:21 am

Joost Buijs wrote:
hgm wrote:I thought the C standard for aligning char was on 1-byte boundaries? I am pretty sure it must be, as in Fairy-Max my hash entry is defined as
Code: Select all
struct _ { int signature, score; char from, to, depth, flags; } *hashTable;
and I know from the memory footprint that this measures 12 bytes.
It is a bit fishy, the alignment has to do with the padding at the end of the struct. With the default alignment of 8 I get the following:

struct a { char a, b; } sizeof(struct a) == 2
struct b { int x; char a, b; } sizeof(struct b) == 8

To make things easier I always use the pragma when I want to have the struct packed.

I don't think that that padding is only at the end of a struct. AFAIK there can be padding in between elements too.
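
A short sketch of padding between members, using offsetof; the struct `mixed` and the helper `interior_padding` are invented for illustration, and the result depends on the target's alignment of int:

```c
#include <stddef.h>

/* A char followed by an int: the int must start on its own alignment
   boundary, so the compiler inserts padding right after c. */
struct mixed { char c; int i; char d; };

/* Hypothetical helper: bytes of padding between c and i
   (3 when int is 4-byte aligned). */
static size_t interior_padding(void)
{
    return offsetof(struct mixed, i)
         - (offsetof(struct mixed, c) + sizeof(char));
}
```

Ordering members from largest to smallest, as in the hash entry quoted above, avoids this interior padding and leaves at most some padding at the end.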
