Developments of the last two years
Moderator: Ras
-
hgm
- Posts: 28419
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Developments of the last two years
Yet the speedup of 1.7 was obtained by an automated process, so the hand-coding should beat that, and not just the uninformed gcc -O2 compile. And in a sense loop unrolling is a for of cheating; the program is not doing really the same.
-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Developments of the last two years
Years ago, before compiler technolgy was very mature I noticed some very simply optimization that I could make in C code. If you had:hgm wrote:Yet the speedup of 1.7 was obtained by an automated process, so the hand-coding should beat that, and not just the uninformed gcc -O2 compile. And in a sense loop unrolling is a for of cheating; the program is not doing really the same.
int foo( ... )
{
int i;
int j;
int k;
...
}
I could simply experiment with the order I declared the variables! Moving the "int k" to the top was sometimes worth a lot. I am pretty sure the stupid compiler just assigned the first 2 or 3 declarations to registers or something brain-dead such as that.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
-
Aaron Becker
- Posts: 292
- Joined: Tue Jul 07, 2009 4:56 am
Re: Developments of the last two years
Thanks, Martin. SMP support isn't there now, but it's probably the first thing I'd add, so we'll see.Martin Thoresen wrote: Hello Aaron!
Quite funny, when selecting participants for the now "resurrected" TCEC, I was thinking of Daydreamer just when I saw that you stopped working on it, so it hadn't been updated since the "old" TCEC.
In any case, welcome back. I can't remember if it was SMP or not, but if it is, then it is a good candidate for Season 2 that will start sometime in April.
Best,
Martin
-
michiguel
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: Developments of the last two years
I remember once I renamed the *.c files and I got significant speed up (5% or something like that). That changed the order they were linked in my script, IIRC.Don wrote:Years ago, before compiler technolgy was very mature I noticed some very simply optimization that I could make in C code. If you had:hgm wrote:Yet the speedup of 1.7 was obtained by an automated process, so the hand-coding should beat that, and not just the uninformed gcc -O2 compile. And in a sense loop unrolling is a for of cheating; the program is not doing really the same.
int foo( ... )
{
int i;
int j;
int k;
...
}
I could simply experiment with the order I declared the variables! Moving the "int k" to the top was sometimes worth a lot. I am pretty sure the stupid compiler just assigned the first 2 or 3 declarations to registers or something brain-dead such as that.
Miguel
-
Rebel
- Posts: 7425
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: Developments of the last two years
From the link Gerd posted:hgm wrote:Pairing??? You can't mean the Pentium I concept?
Looks like good pairing is even more important than in the Pentium days.9.4 Instruction latency and throughput
The latency of an instruction is the number of clock cycles it takes from the time the instruction starts to execute till the result is ready. The time it takes to execute a dependency chain is the sum of the latencies of all instructions in the chain.
The throughput of an instruction is the maximum number of instructions of the same kind that can be executed per clock cycle if the instructions are independent. I prefer to list the reciprocal throughputs because this makes it easier to compare latency and throughput. The reciprocal throughput is the average time it takes from the time an instruction starts to execute till another independent instruction of the same type can start to execute, or the number of clock cycles per instruction in a series of independent instructions of the same kind. For example, floating point addition on a Core 2 processor has a latency of 3 clock cycles and a reciprocal throughput of 1 clock per instruction. This means that the processor uses 3 clock cycles per addition if each addition depends on the result of the preceding addition, but only 1 clock cycle per addition if the additions are independent.
-
BubbaTough
- Posts: 1154
- Joined: Fri Jun 23, 2006 5:18 am
Re: Developments of the last two years
A brief interruption to address the original topicAaron Becker wrote:About two years ago, I stopped developing Daydreamer, my own engine, because it was taking up time that I needed to spend on school. Now I have more free time again and have been thinking about coming back to computer chess, and I was wondering what's been going on in the last few years.
I see some new names among engine developers, and certainly the engines at the top are stronger. Have there been any important new techniques developed, or are the improvements due to something else, maybe just incremental tuning.
In short, what did I miss?
I may not be the best guy to address this subject as I do not look at a lot of source code (from others) but my impression is a lot of the elo in the last few years is coming from a wider acceptance that massive reductions / pruning is the way to go. So more aggressive LMR, more pruning near the tips, etc.. Most of the rest is really from improvements in the testing environment (being able to run more games faster) which has helped improve pretty much all areas of the engines as people are not better able to distinguish between there good changes and their bad changes.
-Sam
-
Rebel
- Posts: 7425
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: Developments of the last two years
Changing the order of your include files can do that too, swapping the order of routines inside a source file as well. Or change the order in your variables. Like a slot machine try your luck. I am not sure if that's still possible with nowadays compilers.michiguel wrote:I remember once I renamed the *.c files and I got significant speed up (5% or something like that). That changed the order they were linked in my script, IIRC.Don wrote:Years ago, before compiler technolgy was very mature I noticed some very simply optimization that I could make in C code. If you had:hgm wrote:Yet the speedup of 1.7 was obtained by an automated process, so the hand-coding should beat that, and not just the uninformed gcc -O2 compile. And in a sense loop unrolling is a for of cheating; the program is not doing really the same.
int foo( ... )
{
int i;
int j;
int k;
...
}
I could simply experiment with the order I declared the variables! Moving the "int k" to the top was sometimes worth a lot. I am pretty sure the stupid compiler just assigned the first 2 or 3 declarations to registers or something brain-dead such as that.
Miguel
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Developments of the last two years
A key disagreement: beating GCC is hard. A basic hand-written asm code will likely be just as fast, if not faster than what GCC produces, assuming a competent asm programmer. Beating gcc from that point is not an enormous task. It is work, to be sure. But we are not talking weeks or months once the original code is written.lucasart wrote:That's interesting, although it's not the original challenge. Basically you take my code, get your compiler to spue out assembly, and optimize manually the assembly. I'm sure the gain will be negligible if any (certainly not a factor 2 as was discussed earlier). Besides, this challenge is meaningless, and would simply prove that your compiler is deficient (if you manage to beat significantly GCC I will bow before you).Rebel wrote:I will take the bait provided:lucasart wrote: So let's make this a challenge. I challenge anyone to write an assembly perft() that will beat my C code. PM me if you are interested!
Anyone up for the challenge ?
1. As long as your 32-bit Perft (stand alone) code compiles with my compiler;
2. It is not too much code (work).
I will optimize and then we compare.
*** edit *** 32-bit
*** edit *** added stand alone
The reason why no one can win the originlal challenge is that:
- writing in assembly fro m scratch a perft (with all the board and move gen code of course) is an enormous task, that would take a huge amount of time
- optimozing that code manually until it beats what GCC generates for my C code is even harder
But fair enough, you certainly can optimize manually the asm produced by the compiler. I say you won't optimize much, and the gain (if any) is certainly not worth transforming your engine into an unmaintainable assembly mess.
I need to hack the code to remove all the DiscoCheck that is not useful and will send you a cleand up version that only does perft.
There are simply too many "tricks of the trade" in asm coding that compilers do not do for many different reasons.
-
hgm
- Posts: 28419
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Developments of the last two years
So you did mean that kind of pairing?Rebel wrote:From the link Gerd posted:hgm wrote:Pairing??? You can't mean the Pentium I concept?
Looks like good pairing is even more important than in the Pentium days.9.4 Instruction latency and throughput
The latency of an instruction is the number of clock cycles it takes from the time the instruction starts to execute till the result is ready. The time it takes to execute a dependency chain is the sum of the latencies of all instructions in the chain.
The throughput of an instruction is the maximum number of instructions of the same kind that can be executed per clock cycle if the instructions are independent. I prefer to list the reciprocal throughputs because this makes it easier to compare latency and throughput. The reciprocal throughput is the average time it takes from the time an instruction starts to execute till another independent instruction of the same type can start to execute, or the number of clock cycles per instruction in a series of independent instructions of the same kind. For example, floating point addition on a Core 2 processor has a latency of 3 clock cycles and a reciprocal throughput of 1 clock per instruction. This means that the processor uses 3 clock cycles per addition if each addition depends on the result of the preceding addition, but only 1 clock cycle per addition if the additions are independent.
That is really completely useless since Intel's P6 chip (Pentium Pro / II ) came out. It was based on the Pentium I having two parallel pipelines (U and V), where U could decode complex instruction, and V only basic ones, and the V instruction had to be independent of the one in U, or it would cause a bubble.
P6 and later (with the exception of Atom) did not work like that at all. They only have a single pipe-line, which can push 3 instructions (decoded to 3-5 uOps) in a re-order buffer per clock, and it does not care the slightest whether the instructions are dependent or not, as it can execute them in arbitrary order while they are in the buffer. So it simply picks three that are independent, no matter whether they are consecutive in the code or 30 instructions apart. In Nehalem architecture the pipeline even can accept and decode 4 instructions per clock.
So 'pairing' is based on an architecture that was abandoned more than a decade ago, and is utterly useless with out-of-order architectures. It is not even based on the right number, unless your have 'pairs' of 3 or 4.
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Developments of the last two years
This misses the point. _I_ know I don't need to make that unnecessary test. I am 100% certain that my pieces are never out of range. So whether the compiler produces the test as part of the switch, or you do it yourself, it is STILL a wasted test. One I will NOT have in my hand-written assembler code.mar wrote:It is of course known by the compiler and it will do exactly this (except in assembly), just look at the assembly the compiler produces.lucasart wrote: How about writing this instead?The fact that 0 <= piece <= 11 can help to optimize the code is of course not known by a C compiler, nor is it known by an assembler...Code: Select all
if (piece >= 12) goto skip; switch (piece) {...}; skip:
One I can safely omit because I know something the compiler doesn't (can't) know, that the piece will never be outside of the range I gave for the cases...