AMD hex core

rbarreira · Post by **rbarreira** » Sun Oct 03, 2010 6:15 pm

No problem!

It seems your crafty had more configuration options than mine. Is there an easy way to run mine with the same parameters (except with 6 threads, of course)?

frankp · Post by **frankp** » Sun Oct 03, 2010 6:41 pm

rbarreira wrote:No problem!

It seems your crafty had more configuration options than mine. Is there an easy way to run mine with the same parameters (except with 6 threads, of course)?

This is my .craftyrc file:

mt=4
log=off
hash 4096M
hashp 64M
cache=5M
tbpath /data/CompressedTb
egtb=5
ponder on
info
resign 6
learn=7
exit

rbarreira · Post by **rbarreira** » Sun Oct 03, 2010 7:12 pm

Two problems: I was using crafty 23.2, and I was not using -DPOPCNT before.

I'll run a new benchmark with the new crafty, -DPOPCNT and some of your settings later.

frankp · Post by **frankp** » Sun Oct 03, 2010 7:15 pm

rbarreira wrote:Two problems: I was using crafty 23.2, and I was not using -DPOPCNT before.

I'll run a new benchmark with the new crafty, -DPOPCNT and some of your settings later.

Do not forget to use mt=6 for your CPU :-)
(My Intel CPU, a Q9550, does not have a hardware popcount instruction.)
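For readers wondering what a hardware popcount buys: a bitboard engine counts set bits in 64-bit words all the time. The sketch below is not Crafty's code, and the names `popcount_soft` and `popcount64` are made up for illustration. It contrasts a portable loop with the GCC/Clang builtin, which the compiler can turn into a single `popcnt` instruction when the target allows it (for example with `-mpopcnt`).

```c
#include <stdint.h>

/* Portable fallback: each loop iteration clears the lowest set bit,
 * so the loop runs once per set bit. */
int popcount_soft(uint64_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;   /* drop the lowest set bit */
        count++;
    }
    return count;
}

/* Uses the GCC/Clang builtin when available. Whether this becomes a
 * hardware popcnt instruction depends on the compiler target flags;
 * otherwise the compiler falls back to its own software routine. */
int popcount64(uint64_t x)
{
#if defined(__GNUC__)
    return __builtin_popcountll(x);
#else
    return popcount_soft(x);
#endif
}
```

Both functions return the same count for any input; only the speed differs, which is why the compile flag matters for a benchmark.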

rbarreira · Post by **rbarreira** » Sun Oct 03, 2010 7:34 pm

Now my .craftyrc is the same as yours, except that it uses 6 CPUs and 2048 MB of hash (my system only has 4 GB of memory).

Using:

crafty 23.3.

gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)

Makefile:

linux-amd64:
	$(MAKE) target=LINUX \
		CC=gcc CXX=g++ \
		CFLAGS='-Wall -pipe -fbranch-probabilities -fomit-frame-pointer -O3 -march=k8' \
		CXFLAGS='' \
		LDFLAGS='$(LDFLAGS) -lpthread -lstdc++' \
		opt='$(opt) -DINLINE64 -DCPUS=8 -DPOPCNT' \
		crafty-make

Result:

unable to open book file [./book.bin].
book is disabled
unable to open book file [./books.bin].
Warning-- xboard 'cores' option disabled
max threads set to 6.
Warning-- xboard 'memory' option disabled
hash table memory = 2048M bytes.
Warning-- xboard 'memory' option disabled
pawn hash table memory = 64M bytes.
EGTB cache memory = 5M bytes.
EGTB access enabled
using tbpath=/data/CompressedTb
0 piece tablebase files found
pondering enabled.
Crafty version 23.3
number of threads = 6
hash table memory = 2048M
pawn hash table memory = 64M
EGTB cache memory = 5M
60 moves/30 minutes 0 seconds primary time control
30 moves/15 minutes 0 seconds secondary time control
book frequency (freq)..............1.00
book static evaluation (eval)......0.10
book learning (learn)..............1.00
resign after 5 consecutive moves with score < -6.
book learning enabled

Crafty v23.3 (6 cpus)

White(1): bench
Running benchmark. . .
......
Total nodes: 431150642
Raw nodes per second: 18802906
Total elapsed time: 22.93
White(1): bench
Running benchmark. . .
......
Total nodes: 555185946
Raw nodes per second: 18903164
Total elapsed time: 29.37
White(1): bench
Running benchmark. . .
......
Total nodes: 485104754
Raw nodes per second: 18758884
Total elapsed time: 25.86
White(1): bench
Running benchmark. . .
......
Total nodes: 355304681
Raw nodes per second: 18563462
Total elapsed time: 19.14
White(1):
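To compare these runs with other machines it helps to boil them down to one number. A minimal sketch, assuming a plain arithmetic mean of the reported raw NPS values is a fair summary; the helper names `nodes_per_second` and `mean` are hypothetical and not part of Crafty.

```c
#include <stddef.h>

/* Raw nodes-per-second figures copied from the four bench runs above. */
const double reported_nps[4] = {
    18802906.0, 18903164.0, 18758884.0, 18563462.0
};

/* Nodes per second for a single run: total nodes divided by elapsed seconds. */
double nodes_per_second(double total_nodes, double elapsed_seconds)
{
    return total_nodes / elapsed_seconds;
}

/* Arithmetic mean of n values; returns 0.0 for an empty array. */
double mean(const double *values, size_t n)
{
    double sum = 0.0;
    size_t i;

    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++)
        sum += values[i];
    return sum / (double)n;
}
```

`mean(reported_nps, 4)` gives 18757104, i.e. roughly 18.76 million nodes per second, and `nodes_per_second(431150642, 22.93)` lands within rounding of the first run's reported figure.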

diep · Post by **diep** » Sun Oct 10, 2010 5:34 am

rbarreira wrote:In my experience the Intel compiler either generates executables that don't run at all when they detect an AMD CPU (if one of the -x options is used) or generates a really crappy codepath which is selected at runtime for AMD CPUs (by default, or if one of the -ax options is used).

Even with the newest AMD CPUs an executable made by icc will run something that would probably work on a 80386 when it detects an AMD CPU. So it will probably be more efficient if I use gcc.

And in order to be able to do this, they need a total fubar sabotaged GCC.

You know, Intel has a bigger lookahead than AMD. So if you reschedule branches to fall just outside AMD's lookahead, you can kill AMD's performance without crippling Intel too much, and there are likewise tricks to kill Intel.

That's exactly what GCC has been doing for years, and there is zero excuse for GCC not to generate normal, simple, straightforward code in all those cases.

It's totally sabotaged there. It really still produces RISC-style code, and any attempt to get rid of that gets undone right away because it "would slow down processor XYZ from 1980", which no one uses anymore. That argument keeps a big sabotage in GCC; removing it would directly speed things up by 10% or so, since PGO would then work better for those cases as well.

Another good example is Linus posting on this same topic; to quote him, "there is no reason to not use cmov now that both intels core as well as AMD have fast cmov handling".

A Polish guy replied to Linus: "but then it is slower at P4". Thereby overruling Linus, ignoring all objective discussions and parameters like march and mtune. Three years after Linus's post, GCC still generates the sabotaged code, which basically keeps the branch or even rewrites it such that you need to jump inside the code, whereas a simple way to speed it up is to generate a cmov.

Under no condition can GCC produce efficient code for a big chess program.

If it did, manufacturers of all kinds would not be able to use the compiler tricks they use now to keep competitors from profiting. That's true for both Intel and AMD.

The biggest scandal resulting from this is GCC's PGO pass. Where Intel C++ gets a 22% speedup from it, GCC gets just a few percent.

Now suppose GCC produced efficient code and were not deliberately sabotaged, even against the wishes of big guys like Linus. In that case manufacturers would have nowhere to hide: from a performance viewpoint they simply COULD NOT AFFORD any longer to produce code that runs worse on their own CPUs.

Even with the sabotage there is not a big IPC difference between AMD and Intel CPUs. Many programmers mess up their measurements or let themselves be fooled. For many of the top programmers, when they measure objectively, Nehalem is at most 5% faster in IPC than AMD processors using Intel C++.

So it's all about how many GHz in total you can throw at it, after turning off hyperthreading tricks.

Vincent

wgarvin · Post by **wgarvin** » Thu Oct 14, 2010 11:21 pm

Unfortunately, as was alluded to earlier in the thread, for many years ICC has deliberately crippled performance on AMD chips by selecting a slower code path for them at run-time.

http://www.agner.org/optimize/blog/read.php?i=49

The latest compiler versions still do this.

If you want to use ICC, I would recommend patching out the CPUID vendor checks for 'GenuineIntel' that cause other vendors' chips to take the older/slower code path even if their CPUID return values indicate that they support the newer instruction sets. These checks are generated by ICC as part of the executable (and are also found in several Intel math libraries you might link against, although chess programs are unlikely to use those). Someone has surely written a tool to automate this by now. It can be done by hand, but it's probably too tedious for anything you're going to compile more than once.

rbarreira · Post by **rbarreira** » Fri Oct 15, 2010 1:14 am

wgarvin wrote: If you want to use ICC, I would recommend patching out the CPUID vendor checks for 'GenuineIntel' that cause other vendors' chips to take the older/slower code path even if their CPUID return values indicate that they support the newer instruction sets.

It's actually very easy to override the vendor verification; you don't need to patch the executable. Just adding this to your source code will do the trick:

Code: Select all

int __intel_cpu_indicator = 0;

// this function gets called automatically, don't call it yourself
void __intel_cpu_indicator_init()
{
    __intel_cpu_indicator = 0x8000; // Pretend we're running on an Intel CPU with SSE 4.2 no matter what CPU we're using (lower bits set other architectures)
}

This has a big problem though, which is that Intel's SSE 4.2 is not fully supported on AMD CPUs. The correct way for the compiler to generate code would be to check for instruction compatibility (using the various CPUID mechanisms, most of which Intel itself invented), not a broad classification like "genuine intel sse 4.2".

Of course Intel doesn't want to do this since it keeps some benchmarks out there favoring Intel CPUs even though they might not be really faster in those cases...
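As a rough illustration of checking features instead of the vendor string, here is a minimal sketch that assumes GCC or Clang on x86 and their `<cpuid.h>` helper `__get_cpuid`. The names `cpu_feature_flags_ecx`, `cpu_has`, `FEATURE_SSE42` and `FEATURE_POPCNT` are hypothetical; this is not what ICC's dispatcher does, only the kind of test it could do.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>            /* GCC/Clang wrapper around the CPUID instruction */
#define HAVE_CPUID_H 1
#endif

/* Feature bits reported in ECX by CPUID leaf 1. */
#define FEATURE_SSE42  (1u << 20)
#define FEATURE_POPCNT (1u << 23)

/* Returns the ECX feature flags of CPUID leaf 1, or 0 when CPUID is
 * unavailable (non-x86 build or unsupported leaf). */
unsigned int cpu_feature_flags_ecx(void)
{
#ifdef HAVE_CPUID_H
    unsigned int eax, ebx, ecx, edx;

    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return ecx;
#endif
    return 0;
}

/* Non-zero if the CPU reports the given feature bit, whatever the vendor. */
int cpu_has(unsigned int feature_bit)
{
    return (cpu_feature_flags_ecx() & feature_bit) != 0;
}
```

A dispatcher built this way would pick the SSE 4.2 path only when `cpu_has(FEATURE_SSE42)` is true, so a chip that lacks the instructions never runs them and a chip that has them is not held back by its vendor name.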

micron · Post by **micron** » Fri Oct 15, 2010 2:25 am

diep wrote: And in order to be able to do this, they need a total fubar sabotaged GCC.
<snip>
Another good example is Linus posting on this same topic; to quote him, "there is no reason to not use cmov now that both intels core as well as AMD have fast cmov handling".
A Polish guy replied to Linus: "but then it is slower at P4". Thereby overruling Linus, ignoring all objective discussions and parameters like march and mtune. Three years after Linus's post, GCC still generates the sabotaged code, which basically keeps the branch or even rewrites it such that you need to jump inside the code, whereas a simple way to speed it up is to generate a cmov.

Reality check:

Code: Select all

$ cat test.c

int test( int n ) {
 if ( n == 0 ) n = 42;
 return n;
}

$ gcc -O3 -S test.c && cat test.s
	.text
	.align 4,0x90
.globl _test
_test:
LFB2:
	pushq	%rbp
LCFI0:
	movq	%rsp, %rbp
LCFI1:
	testl	%edi, %edi
	movl	$42, %eax
	cmove	%eax, %edi
	movl	%edi, %eax
	leave
	ret

Robert P.
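The same experiment can be repeated on patterns closer to what a chess program does. A small sketch with hypothetical function names: both functions are simple selects that a compiler may turn into a conditional move, but whether it actually does depends on the compiler version, optimization level and target flags, so the assembly from `gcc -O3 -S` should be inspected rather than assumed.

```c
/* Returns n unless it is zero, in which case the fallback is used.
 * Same shape as the test() function above, with the constant made
 * a parameter. */
int select_nonzero(int n, int fallback)
{
    return n != 0 ? n : fallback;
}

/* Keeps the larger of two scores, the kind of select that shows up
 * when tracking a best value in a search loop. */
int max_int(int a, int b)
{
    return a > b ? a : b;
}
```

Compiling these with and without `-march`/`-mtune` options and diffing the output shows directly whether a branch or a `cmov` was chosen on a given setup.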
