Developments of the last two years

hgm · Post by **hgm** » Mon Mar 04, 2013 4:33 pm

Yet the speedup of 1.7 was obtained by an automated process, so the hand-coding should beat that, and not just the uninformed gcc -O2 compile. And in a sense loop unrolling is a for of cheating; the program is not doing really the same.

Don · Post by **Don** » Mon Mar 04, 2013 4:44 pm

hgm wrote:Yet the speedup of 1.7 was obtained by an automated process, so the hand-coding should beat that, and not just the uninformed gcc -O2 compile. And in a sense loop unrolling is a for of cheating; the program is not doing really the same.

Years ago, before compiler technolgy was very mature I noticed some very simply optimization that I could make in C code. If you had:

int foo( ... )
{
int i;
int j;
int k;

...

}

I could simply experiment with the order I declared the variables! Moving the "int k" to the top was sometimes worth a lot. I am pretty sure the stupid compiler just assigned the first 2 or 3 declarations to registers or something brain-dead such as that.

Aaron Becker · Post by **Aaron Becker** » Mon Mar 04, 2013 7:31 pm

Martin Thoresen wrote: Hello Aaron!
Quite funny, when selecting participants for the now "resurrected" TCEC, I was thinking of Daydreamer just when I saw that you stopped working on it, so it hadn't been updated since the "old" TCEC.

In any case, welcome back. I can't remember if it was SMP or not, but if it is, then it is a good candidate for Season 2 that will start sometime in April.

Best,
Martin

Thanks, Martin. SMP support isn't there now, but it's probably the first thing I'd add, so we'll see.

michiguel · Post by **michiguel** » Mon Mar 04, 2013 8:19 pm

Don wrote:
hgm wrote:Yet the speedup of 1.7 was obtained by an automated process, so the hand-coding should beat that, and not just the uninformed gcc -O2 compile. And in a sense loop unrolling is a for of cheating; the program is not doing really the same.
Years ago, before compiler technolgy was very mature I noticed some very simply optimization that I could make in C code. If you had:

int foo( ... )
{
int i;
int j;
int k;

...

}

I could simply experiment with the order I declared the variables! Moving the "int k" to the top was sometimes worth a lot. I am pretty sure the stupid compiler just assigned the first 2 or 3 declarations to registers or something brain-dead such as that.

I remember once I renamed the *.c files and I got significant speed up (5% or something like that). That changed the order they were linked in my script, IIRC.

Miguel

Rebel · Post by **Rebel** » Mon Mar 04, 2013 9:00 pm

hgm wrote:Pairing??? You can't mean the Pentium I concept?

From the link Gerd posted:

9.4 Instruction latency and throughput
The latency of an instruction is the number of clock cycles it takes from the time the instruction starts to execute till the result is ready. The time it takes to execute a dependency chain is the sum of the latencies of all instructions in the chain.

The throughput of an instruction is the maximum number of instructions of the same kind that can be executed per clock cycle if the instructions are independent. I prefer to list the reciprocal throughputs because this makes it easier to compare latency and throughput. The reciprocal throughput is the average time it takes from the time an instruction starts to execute till another independent instruction of the same type can start to execute, or the number of clock cycles per instruction in a series of independent instructions of the same kind. For example, floating point addition on a Core 2 processor has a latency of 3 clock cycles and a reciprocal throughput of 1 clock per instruction. This means that the processor uses 3 clock cycles per addition if each addition depends on the result of the preceding addition, but only 1 clock cycle per addition if the additions are independent.

Looks like good pairing is even more important than in the Pentium days.

BubbaTough · Post by **BubbaTough** » Mon Mar 04, 2013 9:13 pm

Aaron Becker wrote:About two years ago, I stopped developing Daydreamer, my own engine, because it was taking up time that I needed to spend on school. Now I have more free time again and have been thinking about coming back to computer chess, and I was wondering what's been going on in the last few years.

I see some new names among engine developers, and certainly the engines at the top are stronger. Have there been any important new techniques developed, or are the improvements due to something else, maybe just incremental tuning.

In short, what did I miss?

A brief interruption to address the original topic

.

I may not be the best guy to address this subject as I do not look at a lot of source code (from others) but my impression is a lot of the elo in the last few years is coming from a wider acceptance that massive reductions / pruning is the way to go. So more aggressive LMR, more pruning near the tips, etc.. Most of the rest is really from improvements in the testing environment (being able to run more games faster) which has helped improve pretty much all areas of the engines as people are not better able to distinguish between there good changes and their bad changes.

-Sam

Rebel · Post by **Rebel** » Mon Mar 04, 2013 9:15 pm

michiguel wrote:
Don wrote:
hgm wrote:Yet the speedup of 1.7 was obtained by an automated process, so the hand-coding should beat that, and not just the uninformed gcc -O2 compile. And in a sense loop unrolling is a for of cheating; the program is not doing really the same.
Years ago, before compiler technolgy was very mature I noticed some very simply optimization that I could make in C code. If you had:

int foo( ... )
{
int i;
int j;
int k;

...

}

I could simply experiment with the order I declared the variables! Moving the "int k" to the top was sometimes worth a lot. I am pretty sure the stupid compiler just assigned the first 2 or 3 declarations to registers or something brain-dead such as that.
I remember once I renamed the *.c files and I got significant speed up (5% or something like that). That changed the order they were linked in my script, IIRC.

Miguel

Changing the order of your include files can do that too, swapping the order of routines inside a source file as well. Or change the order in your variables. Like a slot machine try your luck. I am not sure if that's still possible with nowadays compilers.

bob · Post by **bob** » Mon Mar 04, 2013 9:20 pm

lucasart wrote:
Rebel wrote:
lucasart wrote: So let's make this a challenge. I challenge anyone to write an assembly perft() that will beat my C code. PM me if you are interested!

Anyone up for the challenge ?
I will take the bait provided:

1. As long as your 32-bit Perft (stand alone) code compiles with my compiler;

2. It is not too much code (work).

I will optimize and then we compare.

*** edit *** 32-bit
*** edit *** added stand alone
That's interesting, although it's not the original challenge. Basically you take my code, get your compiler to spue out assembly, and optimize manually the assembly. I'm sure the gain will be negligible if any (certainly not a factor 2 as was discussed earlier). Besides, this challenge is meaningless, and would simply prove that your compiler is deficient (if you manage to beat significantly GCC I will bow before you).

The reason why no one can win the originlal challenge is that:
- writing in assembly fro m scratch a perft (with all the board and move gen code of course) is an enormous task, that would take a huge amount of time
- optimozing that code manually until it beats what GCC generates for my C code is even harder

But fair enough, you certainly can optimize manually the asm produced by the compiler. I say you won't optimize much, and the gain (if any) is certainly not worth transforming your engine into an unmaintainable assembly mess.

I need to hack the code to remove all the DiscoCheck that is not useful and will send you a cleand up version that only does perft.

A key disagreement: beating GCC is hard. A basic hand-written asm code will likely be just as fast, if not faster than what GCC produces, assuming a competent asm programmer. Beating gcc from that point is not an enormous task. It is work, to be sure. But we are not talking weeks or months once the original code is written.

There are simply too many "tricks of the trade" in asm coding that compilers do not do for many different reasons.

hgm · Post by **hgm** » Mon Mar 04, 2013 10:02 pm

Rebel wrote:
hgm wrote:Pairing??? You can't mean the Pentium I concept?
From the link Gerd posted:

9.4 Instruction latency and throughput
The latency of an instruction is the number of clock cycles it takes from the time the instruction starts to execute till the result is ready. The time it takes to execute a dependency chain is the sum of the latencies of all instructions in the chain.

The throughput of an instruction is the maximum number of instructions of the same kind that can be executed per clock cycle if the instructions are independent. I prefer to list the reciprocal throughputs because this makes it easier to compare latency and throughput. The reciprocal throughput is the average time it takes from the time an instruction starts to execute till another independent instruction of the same type can start to execute, or the number of clock cycles per instruction in a series of independent instructions of the same kind. For example, floating point addition on a Core 2 processor has a latency of 3 clock cycles and a reciprocal throughput of 1 clock per instruction. This means that the processor uses 3 clock cycles per addition if each addition depends on the result of the preceding addition, but only 1 clock cycle per addition if the additions are independent.
Looks like good pairing is even more important than in the Pentium days.

So you did mean that kind of pairing?

That is really completely useless since Intel's P6 chip (Pentium Pro / II ) came out. It was based on the Pentium I having two parallel pipelines (U and V), where U could decode complex instruction, and V only basic ones, and the V instruction had to be independent of the one in U, or it would cause a bubble.

P6 and later (with the exception of Atom) did not work like that at all. They only have a single pipe-line, which can push 3 instructions (decoded to 3-5 uOps) in a re-order buffer per clock, and it does not care the slightest whether the instructions are dependent or not, as it can execute them in arbitrary order while they are in the buffer. So it simply picks three that are independent, no matter whether they are consecutive in the code or 30 instructions apart. In Nehalem architecture the pipeline even can accept and decode 4 instructions per clock.

So 'pairing' is based on an architecture that was abandoned more than a decade ago, and is utterly useless with out-of-order architectures. It is not even based on the right number, unless your have 'pairs' of 3 or 4.

It beats me how you can conclude anything about pairing from what Gerd writes.

bob · Post by **bob** » Mon Mar 04, 2013 10:32 pm

mar wrote:
lucasart wrote: How about writing this instead?
Code: Select all
if (piece >= 12) goto skip;
switch (piece) {...};
skip:
The fact that 0 <= piece <= 11 can help to optimize the code is of course not known by a C compiler, nor is it known by an assembler...
It is of course known by the compiler and it will do exactly this (except in assembly), just look at the assembly the compiler produces.

This misses the point. _I_ know I don't need to make that unnecessary test. I am 100% certain that my pieces are never out of range. So whether the compiler produces the test as part of the switch, or you do it yourself, it is STILL a wasted test. One I will NOT have in my hand-written assembler code.

One I can safely omit because I know something the compiler doesn't (can't) know, that the piece will never be outside of the range I gave for the cases...

Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years