Compilation question for gcc experts

hgm · Post by **hgm** » Wed Jul 04, 2007 8:53 pm

Joker runs lik shit on a Pentium 4, because its frequently used variables collide in the small L1 cache of the latter. To avoid this, I want to have better control over the location where variables map into the cache. Having several stacks, the top of which moving through memory in an unpredictable way (depending on ply depth and moves per ply) makes this almost impossible.

To reduce the risk for cache thrashing, I would like to merge all stacks into one. In particular, the move stack should go into the stack frame of local variables. Problem is that the number of moves per ply is variable, with a huge worst case. And leaving large unused holes in the combined stack, would defeat the purpose, as it would quickly make this stack 'wrap around' the cache, and collide with itself.

So I was thinking about allocating space on the system stack through an assembler routine that adjusts the stack pointer, and returns a pointer to the area thus allocated. I can then allocate ample space, pass the pointer to my move generator, (which will put its own locals on top of the expanded stack), and let it fill the area with all moves it generates. After that, I call the assembly routine again, to deallocate the unused part of the move array. The recursive call will then put its locals directly on top of the moves, without any hole. Before the final return, I deallocate the used part of the move array.

Problem is that I have to know how much space gcc wants to use near the top of the stack, to know which (say) 100 bytes to use for the move list if I expand the stack by 100 bytes. If I look how my version of gcc handles the stack frame, there seems an inconsistency. I compiled the following routine (gcc -O2):

Code: Select all

int test&#40;int a, int b&#41;
&#123;
    int m&#91;10&#93;;

    m&#91;a&#93; += m&#91;b&#93;;

    printf&#40;"hello\n");
    return m&#91;b&#93;;
&#125;

It has 40 bytes of local variables, in the array m[]. The generated code is:

Code: Select all

	.file	"test.c"
	.section .rdata,"dr"
LC0&#58;
	.ascii "hello\0"
	.text
	.p2align 4,,15
.globl _test
	.def	_test;	.scl	2;	.type	32;	.endef
_test&#58;
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%ebx
	subl	$68, %esp
	movl	12&#40;%ebp&#41;, %ebx
	movl	$LC0, (%esp&#41;
	movl	8&#40;%ebp&#41;, %edx
	movl	-56&#40;%ebp,%ebx,4&#41;, %eax
	addl	%eax, -56&#40;%ebp,%edx,4&#41;
	call	_puts
	movl	-56&#40;%ebp,%ebx,4&#41;, %eax
	addl	$68, %esp
	popl	%ebx
	popl	%ebp
	ret
	.def	_puts;	.scl	3;	.type	32;	.endef

On entry, after the 'movl %esp, %ebp' the layout of the new stack frame is:
0(%ebp): the saved old frame pointer (that was just pushed).
4(%ebp): the return address
8(%ebp): argument a
12(%ebp): argument b
The rest is then used, after expanding the stack, as:
-4(%ebp): the saved register %ebp (which is used as a temporary)
-56(%ebp): the array m[]. As this is 40 bytes, the first location behind it is -16(%ebp).

now -16(%ebp), -12(%ebp) and -8(%ebp) are unused. In more complicated routines, though, it seems these are used for saving other registers that are used as scratch registers in the routine: %edi, %esi, and %ecx. (%edx and %eax are never saved, even if they are used as scratch registers, as they will contain the returned value). A bit sloppy that the optimizer does not squeeze out this unused space, but at least it is understandable.

Now the puzzle is that in allocating the stack frame, 68 is subtracted from %esp. Together with the pushed %ebp, this makes the new top of the stack the same as -72(%ebp). This top is used to pass the argument LC0 for the call to printf (which apparently is optimized to puts, as there are no variables printed). but from -68(%ebp) to -56(%ebp), there seems another unused area.

I have no idea what purpose that area serves. If I squeeze in the move list, should I leave that mystery area of 12 bytes before it (closer to the top of the stack) or after it (adjacent to the other locals)? What is this area for? Is it a bug in gcc that it wastes this space? Or is it reserved for returning values that wouldn't fit a register?

Aleks Peshkov · Post by **Aleks Peshkov** » Wed Jul 04, 2007 9:26 pm

alloca()

Michael Sherwin · Post by **Michael Sherwin** » Wed Jul 04, 2007 11:58 pm

Hi HG,

I was facinated by the GCC compiler when I first found out about it. It is free and it comes with source. But, grew disinchanted with it when I realized that optimized code from GCC ran just a little faster than full debug code from the Watcom compiler. The Watcom compiler is now free and the source code is also available. If you have not downloaded it already (Open Watcom) then you should give it a try. MSVC 6 produces code that is about 5% (10% at most) faster than Watcom and that is why I use MSVC 6. Watcom produces faster code than MSVS 2005, unless you want a pgo compile then I imagine that MSVS 2005 is faster than what watcom can do. Just to be clear, MSVC 6 produces faster code than MSVS 2005!

Mike

hgm · Post by **hgm** » Thu Jul 05, 2007 12:01 am

Aleks Peshkov wrote:alloca()

Great, thanks!

I didn't know about this function. If it also allows negative argument, to deallocate part of the earlier allocated storage, it will be all I need. If not, I can reverse engineer it, to see where exactly in the stack the allocated space should be positioned.

(Note that I don't know in advance how many moves will be generated, so I have to allocate for the worst case, and then give the unused space back.)

Carey · Post by **Carey** » Thu Jul 05, 2007 12:46 am

Michael Sherwin wrote:Hi HG,

I was facinated by the GCC compiler when I first found out about it. It is free and it comes with source. But, grew disinchanted with it when I realized that optimized code from GCC ran just a little faster than full debug code from the Watcom compiler.

I'm surprised by that statement.

I haven't done any chess programming in close to a year, but for a while I was doing my testing with MingW (GNU C for Win), MSVC, TurboC, OpenWatcom, and LCC. All of which were free. (And for the record, no, not all of them were giving the same results... Joys of compiler bugs and unsafe programming techniques.)

The best I could ever get OpenWatcom was about 10% slower than GNU C.

It has so few optimization abilities that you can't really compile for any modern processor. All you can do is compile for a generic processor and hope for the best.

Of course, that has been a while. Have the newer versions of OW improved?

The worst compile, of course, was LCC, but Borland's free TurboC compiler wasn't to far from that. Their performance was nearly double what MSVC was giving.

Of course, your results can vary a lot depending on your system, your coding style, etc. etc.

The Watcom compiler is now free and the source code is also available. If you have not downloaded it already (Open Watcom) then you should give it a try. MSVC 6 produces code that is about 5% (10% at most) faster than Watcom and that is why I use MSVC 6.

For me, MSVC was usually around 20% faster than GNU C.

Of course, coding style, etc. etc. certainly influences this.

Watcom produces faster code than MSVS 2005, unless you want a pgo compile then I imagine that MSVS 2005 is faster than what watcom can do. Just to be clear, MSVC 6 produces faster code than MSVS 2005!
Mike

I never bothered with PGO optimizations, so I can't comment on that.

I do readily admit that everybody's results will vary depending on their processor, their coding style, and lots of other factors.

I was just very surprised by your statement that OpenWatcom was easily beating GNU C and was nearly as good as MSVC.

I've never heard anybody say that many good things about OW.

I have a couple versions of Watcom from before it went free, and I tried a couple versions afterwards, but I haven't tried any lately.

I was never really all that impressed with the performance and optimization abilities. It was okay, but nothing special.

Has it really improved that much in the past year or two??

Michael Sherwin · Post by **Michael Sherwin** » Thu Jul 05, 2007 1:10 am

Watcom has always had a reputation for producing fast code. I was surprised when I got a student version of the MSVC 6 compiler and it was faster. So I switched. But that was about three years ago. I did order OW on CD and made a couple of compiles and the code that it produced was about the same. Like you said, it depends a lot on programming style. Maybe Watcom just does better at compiling my code than GCC does!

Roman Hartmann · Post by **Roman Hartmann** » Thu Jul 05, 2007 9:59 am

Carey wrote:...
For me, MSVC was usually around 20% faster than GNU C.
...

That's also my experience with gcc (3.2.3) and MSVC (Version 14.00.50727.42).

My codes seems to run equally well (or bad) either on Intel or on AMD.

Roman

Carey · Post by **Carey** » Thu Jul 05, 2007 4:28 pm

Michael Sherwin wrote:Watcom has always had a reputation for producing fast code.

They did way back in the days of 486's and regular Pentiums, but they never really kept up with all the versions of processors available and the differences in instruction scheduling required, etc.

Like you said, it depends a lot on programming style. Maybe Watcom just does better at compiling my code than GCC does!

That is certainly possible.

When you use a specific compiler to do your development work with, and you start doing performance tuning, you do tend to write code that works well with the particular compiler you are using.

Switching to other compilers that generate code just a little differently suddenly makes your tuned written code not so tuned anymore.

I guess really it's just a fact of programming. Either you write generic code for all compilers and processors, or you optimize for the specific platform you are using.

bob · Post by **bob** » Fri Jul 06, 2007 4:21 am

hgm wrote:
Aleks Peshkov wrote:alloca()
Great, thanks!

I didn't know about this function. If it also allows negative argument, to deallocate part of the earlier allocated storage, it will be all I need. If not, I can reverse engineer it, to see where exactly in the stack the allocated space should be positioned.

(Note that I don't know in advance how many moves will be generated, so I have to allocate for the worst case, and then give the unused space back.)

that won't work. alloca() allocates space on the stack. When the calling function (the one that calls alloca()) returns, the space is freed at that instant. You don't need to, nor can you free the space via a call... It would leave holes in the stack which can't be handled with a pure stack architecture like X86.

hgm · Post by **hgm** » Fri Jul 06, 2007 8:06 am

Well, it allocates the space at the top of the stack. So I should be able to de-allocate it when it is still (or again) at the top of the stack without leaving holes.

But for some reason,

Code: Select all

    Move *MovePtr = &#40;Move *) alloca&#40;256*&#40;sizeof Move&#41;) + 256; // points to end of area
    n = GenMoves&#40;MovePtr&#41;;         // fills area from the back
    alloca&#40;&#40;256-n&#41; * &#40;sizeof Move&#41;);   // give unused space back

crashes the program.

However, when I edit the generated assembly code, to remove the second call to __alloca, and just subtract the calculated argument from the stack pointer with

Code: Select all

    subl %eax, %esp

It does run as I intended.

It is not faster, though. But that might depend on the search depth and other allocations. Accessing the move list through a pointer *CurMove, rather than as a (global) array MoveStack[CurMove] in itself did not produce slower code (when the pointer was still pointing into the global array).

Compilation question for gcc experts

Compilation question for gcc experts

Re: Compilation question for gcc experts

Re: Compilation question for gcc experts

Re: Compilation question for gcc experts

Re: Compilation question for gcc experts

Re: Compilation question for gcc experts

Re: Compilation question for gcc experts

Re: Compilation question for gcc experts

Re: Compilation question for gcc experts

Re: Compilation question for gcc experts