Developments of the last two years

Discussion of chess software programming and technical issues.


hgm
Posts: 28419
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Developments of the last two years

Post by hgm »

bob wrote:
mar wrote:
lucasart wrote: How about writing this instead?

Code:

if (piece >= 12) goto skip;
switch (piece) {...}
skip: ;
The fact that 0 <= piece <= 11 can help to optimize the code is of course not known by a C compiler, nor is it known by an assembler...
It is of course known by the compiler and it will do exactly this (except in assembly), just look at the assembly the compiler produces.
This misses the point. _I_ know I don't need to make that unnecessary test. I am 100% certain that my pieces are never out of range. So whether the compiler produces the test as part of the switch, or you do it yourself, it is STILL a wasted test. One I will NOT have in my hand-written assembler code.

One I can safely omit because I know something the compiler doesn't (can't) know, that the piece will never be outside of the range I gave for the cases...
The compiler can know that if you use an enum type for piece.

The compiler does not have to know it if you don't use a switch().

Basically what you are saying is: "if the C code is poorly written, I can write assembly code that is better". But that is pretty obvious, and not the issue. The issue is whether you can do better in assembler than your best in C.
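
For illustration, a minimal sketch of the enum idea (the piece names are made up, not taken from any engine). Whether a given compiler actually uses the enum to drop the range check for the switch is compiler-dependent: in C an enum object may legally hold any value of its underlying integer type, so many compilers keep the check anyway.

Code:

enum piece {
    WP, WN, WB, WR, WQ, WK,
    BP, BN, BB, BR, BQ, BK      /* 12 values, 0..11 */
};

int piece_value(enum piece p)
{
    switch (p) {                /* compiler may or may not emit a bounds check */
    case WP: case BP: return 100;
    case WN: case BN: return 300;
    case WB: case BB: return 325;
    case WR: case BR: return 500;
    case WQ: case BQ: return 900;
    case WK: case BK: return 0;
    }
    return 0;                   /* unreachable when p holds a valid piece */
}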
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Developments of the last two years

Post by bob »

hgm wrote:
bob wrote:
mar wrote:
lucasart wrote: How about writing this instead?

Code:

if (piece >= 12) goto skip;
switch (piece) {...}
skip: ;
The fact that 0 <= piece <= 11 can help to optimize the code is of course not known by a C compiler, nor is it known by an assembler...
It is of course known by the compiler and it will do exactly this (except in assembly), just look at the assembly the compiler produces.
This misses the point. _I_ know I don't need to make that unnecessary test. I am 100% certain that my pieces are never out of range. So whether the compiler produces the test as part of the switch, or you do it yourself, it is STILL a wasted test. One I will NOT have in my hand-written assembler code.

One I can safely omit because I know something the compiler doesn't (can't) know, that the piece will never be outside of the range I gave for the cases...
The compiler can know that if you use an enum type for piece.

The compiler does not have to know it if you don't use a switch().

Basically what you are saying is: "if the C code is poorly written, I can write assembly code that is better". But that is pretty obvious, and not the issue. The issue is whether you can do better in assembler than your best in C.
I only consider "my best in C" vs. "my best in asm". I've never found a case I could not improve on, unless you are talking about completely trivial computations.

The switch case was simply an example of something I might know that the compiler can't determine. Another example is carrying important things in registers across procedures; the compiler can't do that, since it doesn't know what happens downstream when modules are compiled individually. There are plenty of good books on such optimizations. Compilers are very good at the common cases, like recognizing common sub-expressions, but a good C programmer probably doesn't write those anyway because they look ugly. The individual "tricks of the trade" simply are not included in optimizers because they benefit so few programs, and nobody wants a horrendously complex optimizer; they are already complex enough.
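
As a hedged illustration of the "carrying important things in registers across procedures" point: GCC offers a global register variable extension that pins a variable to a register for a whole compilation unit. The variable name and register choice below are hypothetical, and this is GCC-specific and not portable.

Code:

/* GCC extension: reserve r15 for this counter in every function of the file.
   Other translation units must repeat the declaration, and code compiled
   without it (e.g. libraries) may still clobber the register. */
register unsigned long long node_count asm("r15");

void search_node(void)
{
    node_count++;               /* no load/store: the counter stays in r15 */
}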

Note that this is not about "classic optimizations" such as the above-mentioned common-subexpression elimination, strength reduction, loop unrolling, blocking (for cache), and so on. It is more about unusual things like using unexpected registers, uncommon instruction choices, and instruction-hoisting and scheduling tricks that the compiler simply doesn't try to deal with.

Compilers are certainly very good. But they are far from perfect. The issue becomes one of time. Is the speed gain from hand-coding worth the increased cost of doing the programming at a low level?

BTW, an enum doesn't help in every case. Suppose _I_ know that at this particular point the variable can only hold white piece values? The compiler likely won't be able to determine that, which still leaves it at a disadvantage on switch() statements.
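
A hedged sketch of handing the compiler exactly that kind of range knowledge, assuming GCC or Clang (the names below are made up). The __builtin_unreachable() hint declares out-of-range values impossible, so the compiler is free to omit the bounds check it would otherwise generate for the switch's jump table.

Code:

enum { WPAWN, WKNIGHT, WBISHOP, WROOK, WQUEEN, WKING };  /* white pieces only */

int mobility_weight(int piece)
{
    if (piece < WPAWN || piece > WKING)
        __builtin_unreachable();   /* promise: only white piece values reach this */
    switch (piece) {
    case WPAWN:   return 0;
    case WKNIGHT: return 4;
    case WBISHOP: return 5;
    case WROOK:   return 2;
    case WQUEEN:  return 1;
    case WKING:   return 0;
    }
    return 0;
}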
Rebel
Posts: 7425
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Developments of the last two years

Post by Rebel »

hgm wrote:
Rebel wrote:
hgm wrote: Pairing??? You can't mean the Pentium I concept?
From the link Gerd posted:
9.4 Instruction latency and throughput
The latency of an instruction is the number of clock cycles it takes from the time the instruction starts to execute till the result is ready. The time it takes to execute a dependency chain is the sum of the latencies of all instructions in the chain.

The throughput of an instruction is the maximum number of instructions of the same kind that can be executed per clock cycle if the instructions are independent. I prefer to list the reciprocal throughputs because this makes it easier to compare latency and throughput. The reciprocal throughput is the average time it takes from the time an instruction starts to execute till another independent instruction of the same type can start to execute, or the number of clock cycles per instruction in a series of independent instructions of the same kind. For example, floating point addition on a Core 2 processor has a latency of 3 clock cycles and a reciprocal throughput of 1 clock per instruction. This means that the processor uses 3 clock cycles per addition if each addition depends on the result of the preceding addition, but only 1 clock cycle per addition if the additions are independent.
Looks like good pairing is even more important than in the Pentium days.
So you did mean that kind of pairing? :shock:

That has been completely useless since Intel's P6 chip (Pentium Pro / Pentium II) came out. It was based on the Pentium I having two parallel pipelines (U and V), where U could decode complex instructions and V only basic ones, and the V instruction had to be independent of the one in U, or it would cause a bubble.

P6 and later (with the exception of Atom) do not work like that at all. They have only a single pipeline, which can push 3 instructions (decoded into 3-5 uOps) into a re-order buffer per clock, and it does not care in the slightest whether the instructions are dependent or not, since it can execute them in arbitrary order while they sit in the buffer. So it simply picks three that are independent, no matter whether they are consecutive in the code or 30 instructions apart. In the Nehalem architecture the pipeline can even accept and decode 4 instructions per clock.

So 'pairing' is based on an architecture that was abandoned more than a decade ago, and it is utterly useless with out-of-order architectures. It is not even based on the right number, unless you have 'pairs' of 3 or 4. :lol: It beats me how you can conclude anything about pairing from what Gerd writes.
The document I quoted from is about MODERN processors.

http://www.agner.org/optimize/optimizing_assembly.pdf

Start reading at chapter 9 and perhaps you will understand why your 40% code reduction gave you no speed improvement. Some snips:

Execution units
Out-of-order execution becomes even more efficient when the CPU can do more than one thing at the same time. Many CPUs can do two, three or four things at the same time if the things to do are independent of each other and do not use the same execution units in the CPU. Most CPUs have at least two integer ALU’s (Arithmetic Logic Units) so that they can do two or more integer additions per clock cycle.

• A chain of instructions where each instruction depends on the previous one cannot execute out of order. Avoid long dependency chains. (See page 64).

• The CPU can do more things simultaneously if the code contains a good mixture of instructions from different categories, such as: simple integer instructions, floating point addition, multiplication, memory read, memory write.

----

Loads of other tips.
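
To make the dependency-chain bullet quoted above concrete, here is a minimal C sketch (not taken from any engine): the first loop is one long chain where every addition waits for the previous one, while the second splits the work into two independent chains that an out-of-order core with two integer ALUs can run side by side.

Code:

#include <stddef.h>
#include <stdint.h>

/* Single accumulator: latency-bound, one long dependency chain. */
uint64_t sum_chained(const uint64_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Two accumulators: the adds are independent and can execute in parallel. */
uint64_t sum_split(const uint64_t *v, size_t n)
{
    uint64_t s0 = 0, s1 = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n)
        s0 += v[i];               /* leftover element when n is odd */
    return s0 + s1;
}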
lucasart
Posts: 3242
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Developments of the last two years

Post by lucasart »

Gusev wrote:
I laughed so much when I read the passage about Thomas Edison. Seriously, some people vastly overrate themselves...
(...)
The quote from Edison is quite relevant, but I did not want to project a false impression that I considered Norm a failure.
(...)
You misunderstood. What is laughable is you comparing yourself to Thomas Edison. So you compiled a program with a better compiler (and added Intel intrinsics for popcnt). Quite a feat indeed. You deserve the Nobel prize for this at least, perhaps even the Fields Medal :wink:

Don't take this too harshly, I just found it too funny not to point out. With that Edison thing you're just handing over the stick to be beaten with.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Developments of the last two years

Post by bob »

Uri Blass wrote:
bob wrote:
Don wrote:
Uri Blass wrote:
Don wrote:

There are some examples on the web of fairly simple programs being optimized in assembler to get a 2x speedup.

Don
I wonder if you think that a 2x speedup is possible for a chess program
by some good programmer.

If yes, then maybe somebody can prove it by improving Stockfish only by translating it to assembler and making it twice as fast.
You probably think that is ridiculous, right? I don't.

I think an exceptionally talented assembler guru could squeeze a lot out of Stockfish. I don't know if it would be 2X but it might be even more and that would not surprise me whatsoever.

There are 2 limits to how fast a "functionally identical" optimized version of Stockfish could get. The first limit is simply what is possible which we can probably never know. A metaphor for this is to ask, "what could God achieve?" I can easily imagine that there are many possible optimizations that we never even thought of and probably even a few "head slappers" in there.

The second limitation is the more relevant one: what we believe is possible - and that is the one that limits us the most. If we become satisfied at some point that we have pretty much gotten close to the limit, we won't look any further.

If you are talking about pure code optimization, where you completely ignore anything that might speed up the code substantially but is more of an algorithmic nature, then there is probably still at least a 50% speedup possible. Maybe a lot more. You should realize that compilers are probably missing quite a bit even though they continue to improve. They cannot understand anything that is too complicated, they have to make the most conservative assumptions to be sure of being correct, and they cannot rewrite algorithms a better way. Keep in mind that C has limitations too, such as lack of access to all the instructions that a chip provides. In fact most programs now have to work around the popcount thing - either with special intrinsics to address this limitation or a small assembly language routine. That should tell you something!
2x is likely possible. More? Very difficult to say, and it gets more unlikely as you increase the improvement. BUT, and there is a very big BUT here... time. It is a time-consuming project to rewrite in asm. Cray Blitz had 20K lines of asm to go with the base Fortran source. It was a daunting task to write, to debug, and to maintain. It was much easier to add null-move (a 1988-or-so idea) to the Fortran search than it was to add it to the Cray asm code I had written. One was a 10-minute project, the other a several-day project, due to all the hand-optimizing Harry and I had done...

It is sometimes worth it for code that will not change, but most parts of a chess program are subject to rewrite/modification at any point in time, making the effort expended go way up.
Even if the code changes, there is no contradiction between modifying the code and optimizing it.

One programmer in a team of two may modify only the slow code, while another programmer simply optimizes the code without modifying the algorithm.

The program will always have two versions (the best slow version, based on testing, and some fast version based on older code).

If the programmer who optimizes the code is fast enough to get a 100% speed improvement in 6 months, then you can expect the optimized older program to be better, because you usually do not gain something equivalent to a factor-of-two speedup in 6 months.
The question is, is that the best use of programmer #2, who will NEVER be involved in "innovation", only in "rewriting in asm"...
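
On the popcount point Don raises in the quote above, the usual C-level workaround looks roughly like this (the function name is made up; the intrinsic path assumes GCC or Clang).

Code:

#include <stdint.h>

static inline int popcount64(uint64_t b)
{
#if defined(__GNUC__)
    return __builtin_popcountll(b);   /* maps to POPCNT when built with -mpopcnt */
#else
    int n = 0;
    while (b) {                       /* Kernighan's trick: clear lowest set bit */
        b &= b - 1;
        n++;
    }
    return n;
#endif
}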
hgm
Posts: 28419
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Developments of the last two years

Post by hgm »

Rebel wrote: The document I quoted from is about MODERN processors.

http://www.agner.org/optimize/optimizing_assembly.pdf

Start reading at chapter 9 and perhaps you will understand why your 40% code reduction gave you no speed improvement. Some snips:

Execution units
Out-of-order execution becomes even more efficient when the CPU can do more than one thing at the same time. Many CPUs can do two, three or four things at the same time if the things to do are independent of each other and do not use the same execution units in the CPU. Most CPUs have at least two integer ALU’s (Arithmetic Logic Units) so that they can do two or more integer additions per clock cycle.

• A chain of instructions where each instruction depends on the previous one cannot execute out of order. Avoid long dependency chains. (See page 64).

• The CPU can do more things simultaneously if the code contains a good mixture of instructions from different categories, such as: simple integer instructions, floating point addition, multiplication, memory read, memory write.

----

Loads of other tips.
No need for me to read that stuff; I have known it by heart for a decade. Remember that I reverse-engineered the P6, which enables me to add twice as many details about the workings of its pipeline as you can read there, or in fact anywhere.

And no, it did not explain why I did not get a speedup: the speed of the code was still well below the 2-instructions-per-clock level, which it should be able to reach if the limitations listed in that text (2 ALUs, 3 uOps/clock) were the only ones. Note that store instructions are expensive: they decode into 2 uOps. That means they put an extra burden on the throughput; when you run at 2 ALU instructions per clock, you cannot do load/store pairs on the side for free as the third instruction, even though there is a separate load/store unit with its own port to the register unit, because that would violate the 3-uOps/clock limit of the retirement unit. In addition, it means that store instructions can only be handled by decoder 1, so you create decode bubbles in decoders 3 and 2 if you don't carefully align the instructions in triples.

Now explain to us what you think any of this has to do with 'pairing'. Remember, I asked you what you meant by that. Having a program consist of a mix of instructions from different classes is not generally referred to as pairing. Would you consider a program that first does 4 memory loads (A-D), then a dependency chain of 4 ALU operations on A and B (say calculating (A+B<<1)-A^B), and then 4 instructions that do the same with C and D, as 'paired'? Do you think re-ordering the instructions could give you a speedup?