Developments of the last two years

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Developments of the last two years

Post by Don »

Rebel wrote:
Don wrote: Look at some of the writings of Michael Abrash and I think you will change your tune
Precisely, the man has been my teacher :lol:
I have learned over my short lifetime that you have to question almost anything that comes into the public domain as "conventional wisdom."

With the web it's just as true, because if someone writes something and it sounds good, especially if they are a good writer, it becomes a standard sound bite that will automatically get repeated until it's just accepted without any critical thinking. I think this is clearly one of those things.

Of course some things are much more believable than others and this one is very believable and for most people more or less true.

Another one that makes me cringe: "modern garbage collectors give better performance than people can achieve with manual memory management."

Another one that has been going around for 30 years is: "with Moore's law we no longer have any need for compiled languages." This one (or something like it) is based on the idea that an interpreted language running on a computer today runs faster than a compiled program running on a computer of a few years ago, and it assumes that you will never need more performance than we have "right now" with a compiled language.

There are a bunch of these types of silly statements going around that people just accept once they hear someone say it.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Uri Blass
Posts: 11073
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Developments of the last two years

Post by Uri Blass »

Don wrote:

There are some examples on the web of fairly simple programs being optimized in assembler to get a 2x speedup.

Don
I wonder if you think that a 2x speedup is possible for a chess program by some good programmer.

If yes then maybe somebody can prove it by improving Stockfish only by translating it to assembler and making it twice as fast.
Joost Buijs
Posts: 1650
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Developments of the last two years

Post by Joost Buijs »

Rebel wrote:
Joost Buijs wrote:In another thread I saw you are using the "Digital Mars" compiler. I have no experience with this compiler but I can imagine that it produces (much) slower code than e.g. the Intel, Microsoft or the latest version of the GCC compiler.
I was curious how the MS compiler would perform and the 2010 edition was a disappointing experience, a speed loss (ASM->C) of 15% (IIRC). Besides, although the DigitalMars compiler is old and development has stopped, its code is speed-wise on the same level as MSVC-2010.
A couple of years ago I wrote my whole evaluation function in 64 bit ASM and at best I could reach approximately the same performance as with the C++ version. I really tried very hard to get it better but it just didn't work.
I take your word for it. Just curious, did you apply the rules of "pairing" ?
No I didn't apply rules of instruction pairing.
I have a very old book over here called "Pentium Processor Optimization Tools"; that's where they talk about instruction pairing for the Pentium.
This doesn't seem to hold anymore for the core architecture, although there still seems to be a pairing mechanism when you use MMX or SSE instructions.

It surprises me that a very old compiler like the Digital Mars produces code that is speed-wise the same as code produced by MSVC 2010.

I can imagine that you lose some speed when you translate directly from ASM to C. You probably have to restructure your program at some point to regain the lost speed. It is just a matter of knowing what your compiler does, some constructs work better than others.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Developments of the last two years

Post by Don »

Uri Blass wrote:
Don wrote:

There are some examples on the web of fairly simple programs being optimized in assembler to get a 2x speedup.

Don
I wonder if you think that a 2x speedup is possible for a chess program by some good programmer.

If yes then maybe somebody can prove it by improving Stockfish only by translating it to assembler and making it twice as fast.
You probably think that is ridiculous, right? I don't.

I think an exceptionally talented assembler guru could squeeze a lot out of Stockfish. I don't know if it would be 2X but it might be even more and that would not surprise me whatsoever.

There are 2 limits to how fast a "functionally identical" optimized version of Stockfish could get. The first limit is simply what is possible which we can probably never know. A metaphor for this is to ask, "what could God achieve?" I can easily imagine that there are many possible optimizations that we never even thought of and probably even a few "head slappers" in there.

The second limitation is the one that is most relevant and that is what we believe is possible - and that is the one that limits us the most. If we become satisfied at some point that we have pretty much gotten close to the limit, we won't look for any more.

If you are talking about pure code optimization, where you completely ignore anything that might speed up the code substantially but is more of an algorithmic nature, then there is probably still at least a 50% speedup possible. Maybe a lot more. You should realize that compilers are probably missing quite a bit even though they continue to improve. They cannot understand anything that is too complicated, they have to make the most conservative assumptions to be sure of being correct, and they cannot rewrite algorithms a better way. Keep in mind that C has limitations too, such as lack of access to all the instructions that a chip provides. In fact most programs now have to work around the popCount thing - either with special intrinsics to address this limitation or a small assembly language routine. That should tell you something!
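A minimal sketch of that popCount workaround (the names here are hypothetical; the builtin shown is the GCC/Clang spelling, and the fallback is Kernighan's bit trick, one standard choice among several):

```c
#include <stdint.h>

/* Portable fallback: x &= x - 1 clears the lowest set bit, so the
   loop body runs once per 1-bit in x. */
static int popcount64_fallback(uint64_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Wrapper: use the compiler's builtin when available (it maps to the
   hardware POPCNT instruction when the target supports it). */
static int popcount64(uint64_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcountll(x);
#else
    return popcount64_fallback(x);
#endif
}
```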
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
syzygy
Posts: 5801
Joined: Tue Feb 28, 2012 11:56 pm

Re: Developments of the last two years

Post by syzygy »

hgm wrote:
Don wrote:The reasoning is that a human cannot beat a compiler unless he has the same knowledge that a compiler has. The good compilers "understand" many of the important details of instruction scheduling. That can be worked out by hand and improved upon (because compilers STILL do not do it very well.) Chess is a different task which requires massive calculation.
But that is exactly my point: current CPUs are so complex that it would be a computationally intensive task to produce optimum code. Elementary operations that describe a task can be ordered in combinatorially many sequences. It is far from trivial how each of these performs; that would require complex simulation of the core.
Current compilers don't perform those simulations, either.
Don Dailey wrote:I have learned over my short lifetime that you have to question almost anything that comes into the public domain as "conventional wisdom."
Wasn't it not so long ago that the "conventional wisdom" was that compiled C code was a factor of 10 or so slower than manually written assembly? That is of course not true anymore. But I would be surprised if many people really believe that compiled C code is faster than assembly code.

Of course C does become faster than assembly when the program is sufficiently complex that writing it in C, thanks to its higher level of abstraction, allows you to come up with more efficient data structures and/or algorithms.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Developments of the last two years

Post by bob »

hgm wrote:I don't think that reasoning is sound. I can build an engine that plays better Chess than I do. Why shouldn't I be able to write a compiler that makes better assembly than I do? Even if I were the world's number one expert. We have engines that beat the World Champion at Chess...
Here's the biggest win you get, from someone that has been programming in ASM forever...

The compiler has to follow the semantics of the language. It doesn't know anything about the semantics of the program itself, other than what it sees. But that is not much.

For example, if you do a switch() it must, by standard, do what the specs say. If the switch variable value does not match any of the cases, the compiler is required to test for that condition to avoid errors. In my code, I might well KNOW that the "piece" value will only be one of 13 possibilities, and I can exclude that "if piece < -6 or piece > +6" sanity test. There are lots of such things that one can exploit, but it requires knowledge about the program that can't readily be discerned just by looking at the source code for one procedure. Other things might include the min and max range of a value, where the compiler only knows "int" or "long" but you know it is 17 bits max. So there is room to out-perform the compiler, but nowhere near what it was 20-30 years ago when hand-coded asm was the way to go everywhere.
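On GCC and Clang that kind of program-level knowledge can nowadays be handed to the compiler directly: marking the default case unreachable lets it drop the out-of-range test. A sketch, with hypothetical piece codes and values (not taken from any real engine):

```c
/* Hypothetical piece encoding; the program guarantees values 0..12. */
enum { EMPTY, WP, WN, WB, WR, WQ, WK, BP, BN, BB, BR, BQ, BK };

int piece_value(int piece) {
    switch (piece) {
    case EMPTY:       return 0;
    case WP: case BP: return 100;
    case WN: case BN: return 325;
    case WB: case BB: return 325;
    case WR: case BR: return 500;
    case WQ: case BQ: return 975;
    case WK: case BK: return 10000;
    default:
#if defined(__GNUC__) || defined(__clang__)
        /* Promise to the compiler: piece is never out of range,
           so no sanity branch is needed. */
        __builtin_unreachable();
#else
        return 0;
#endif
    }
}
```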

Compilers are good at doing the classic optimizations, like common-subexpression elimination, loop-invariant detection and such. So I doubt a 10:1 is possible, UNLESS one intentionally (or unknowingly) writes C code that ties the compiler's hands. Bad C code can still produce bad asm code. But with care, one can write C code that produces pretty well optimized asm code. Unless you step outside the semantics of the language somewhere. For example, there is no standard way to get to the popcnt() instruction. Compilers do include intrinsics for such, but they are not portable.
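One concrete way C code can "tie the compiler's hands" is pointer aliasing. In the first routine below (names hypothetical), the compiler must assume scale might point into out, so it has to re-read *scale after every store; hoisting it into a local (or using C99 restrict) removes that assumption. A sketch:

```c
#include <stddef.h>

/* Aliasing-pessimized: *scale may alias out[], so the compiler
   conservatively reloads it on every iteration. */
void scale_v1(int *out, const int *in, const int *scale, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * *scale;
}

/* Hand-hoisted: the local copy tells the compiler the factor is
   loop-invariant and can live in a register. */
void scale_v2(int *out, const int *in, const int *scale, size_t n) {
    int s = *scale;
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * s;
}
```

Both functions compute the same result; only the generated asm differs.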

I've spent a lot of time looking at compiler output and am generally impressed. They can be improved on, with effort, and perhaps with a lot of pragma usage to tell the compiler things it can't spot for itself.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Developments of the last two years

Post by bob »

Don wrote:
Uri Blass wrote:
Don wrote:

There are some examples on the web of fairly simple programs being optimized in assembler to get a 2x speedup.

Don
I wonder if you think that a 2x speedup is possible for a chess program by some good programmer.

If yes then maybe somebody can prove it by improving Stockfish only by translating it to assembler and making it twice as fast.
You probably think that is ridiculous, right? I don't.

I think an exceptionally talented assembler guru could squeeze a lot out of Stockfish. I don't know if it would be 2X but it might be even more and that would not surprise me whatsoever.

There are 2 limits to how fast a "functionally identical" optimized version of Stockfish could get. The first limit is simply what is possible which we can probably never know. A metaphor for this is to ask, "what could God achieve?" I can easily imagine that there are many possible optimizations that we never even thought of and probably even a few "head slappers" in there.

The second limitation is the one that is most relevant and that is what we believe is possible - and that is the one that limits us the most. If we become satisfied at some point that we have pretty much gotten close to the limit, we won't look for any more.

If you are talking about pure code optimization, where you completely ignore anything that might speed up the code substantially but is more of an algorithmic nature, then there is probably still at least a 50% speedup possible. Maybe a lot more. You should realize that compilers are probably missing quite a bit even though they continue to improve. They cannot understand anything that is too complicated, they have to make the most conservative assumptions to be sure of being correct, and they cannot rewrite algorithms a better way. Keep in mind that C has limitations too, such as lack of access to all the instructions that a chip provides. In fact most programs now have to work around the popCount thing - either with special intrinsics to address this limitation or a small assembly language routine. That should tell you something!
2x is likely possible. More? Very difficult to say, and it gets more unlikely as you increase the improvement. BUT, and there is a very big BUT here... time. It is a time-consuming project to rewrite to asm. Cray Blitz had 20K lines of asm to go with the base Fortran source. It was a daunting task to write, to debug, and to maintain. It was much easier to add null-move (a 1988 or so idea) to the Fortran search than it was to add it to the Cray ASM code I had written. One was a 10-minute project, one was a several-day project, due to all the hand-optimizing Harry and I had done...

It is sometimes worth it for code that will not change, but most parts of a chess program are subject to rewrite/modification at any point in time, making the effort expended go way up.
lucasart
Posts: 3242
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Developments of the last two years

Post by lucasart »

bob wrote:
hgm wrote:I don't think that reasoning is sound. I can build an engine that plays better Chess than I do. Why shouldn't I be able to write a compiler that makes better assembly than I do? Even if I were the world's number one expert. We have engines that beat the World Champion at Chess...
Here's the biggest win you get, from someone that has been programming in ASM forever...

The compiler has to follow the semantics of the language. It doesn't know anything about the semantics of the program itself, other than what it sees. But that is not much.

For example, if you do a switch() it must, by standard, do what the specs say. If the switch variable value does not match any of the cases, the compiler is required to test for that condition to avoid errors. In my code, I might well KNOW that the "piece" value will only be one of 13 possibilities, and I can exclude that "if piece < -6 or piece > +6" sanity test. There are lots of such things that one can exploit, but it requires knowledge about the program that can't readily be discerned just by looking at the source code for one procedure. Other things might include the min and max range of a value, where the compiler only knows "int" or "long" but you know it is 17 bits max. So there is room to out-perform the compiler, but nowhere near what it was 20-30 years ago when hand-coded asm was the way to go everywhere.

Compilers are good at doing the classic optimizations, like common-subexpression elimination, loop-invariant detection and such. So I doubt a 10:1 is possible, UNLESS one intentionally (or unknowingly) writes C code that ties the compiler's hands. Bad C code can still produce bad asm code. But with care, one can write C code that produces pretty well optimized asm code. Unless you step outside the semantics of the language somewhere. For example, there is no standard way to get to the popcnt() instruction. Compilers do include intrinsics for such, but they are not portable.

I've spent a lot of time looking at compiler output and am generally impressed. They can be improved on, with effort, and perhaps with a lot of pragma usage to tell the compiler things it can't spot for itself.
I appreciate that your knowledge of assembly is far better than mine, and probably almost everyone here.

However, the arguments you are making have little to do with Assembly vs C. For example, the switch argument. Let's say that your piece values range from 0 to 11. How about writing this instead?

Code: Select all

if (piece >= 12) goto skip;
switch (piece) {...};
skip:
The fact that 0 <= piece <= 11 can help to optimize the code is of course not known by a C compiler, nor is it known by an assembler...

In C++, with typed enum, you can let the compiler know the range of possible values

Code: Select all

enum Piece {wPawn, wKnight...};
And then the switch knows that the set of possible values for a "Piece" is relatively compact, so the compiler may decide to use a jump table, making branching an O(1) operation.

Regarding int/long, there are many integer types in C. It's all about choosing the right one, and knowing that sign extension is not a zero-cost operation. If you use unsigned judiciously, there is some (tiny) performance gain there. But in reality, writing compact and cache-friendly code is far more important than these microscopic optimizations.
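As an illustration of the sign-extension point (function names hypothetical): on x86-64 a signed 32-bit index may need a widening move before it can be added to a 64-bit pointer, while a pointer-width index does not. The effect is only visible in the generated asm; both routines compute the same sum, and modern compilers can often hoist the widening anyway, so this is very much a measure-first micro-optimization:

```c
#include <stddef.h>

/* Signed 32-bit index: may cost a sign-extending move (movsxd)
   per address computation on 64-bit targets. */
int sum_signed(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* size_t index: already pointer-width, no widening needed. */
int sum_sized(const int *a, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```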

What I am trying to say is that you will not make your code faster by using assembly, except to use hardware features otherwise unavailable in C (in the case of chess, I think there are only 4 functions that deserve assembly or intrinsics: popcnt(), lsb(), msb(), prefetch()).
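Those four can be sketched as thin wrappers over compiler builtins; the spelling below is GCC/Clang (MSVC uses different names, e.g. _BitScanForward64), and the lsb/msb results are undefined for an empty bitboard:

```c
#include <stdint.h>

/* Index (0..63) of the least-significant set bit; b must be nonzero. */
static int lsb(uint64_t b) { return __builtin_ctzll(b); }

/* Index (0..63) of the most-significant set bit; b must be nonzero. */
static int msb(uint64_t b) { return 63 - __builtin_clzll(b); }

/* Number of set bits. */
static int popcnt(uint64_t b) { return __builtin_popcountll(b); }

/* Hint to start pulling addr (e.g. a hash-table entry) into cache
   ahead of the actual probe. */
static void prefetch(const void *addr) { __builtin_prefetch(addr); }
```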

But you will make your program faster by writing *better C* code. And in order to write efficient C code, some knowledge of assembly, and how a compiler/optimizer works certainly helps.

You can see that the Linux kernel has a very small percentage of assembly code. It only uses assembly where it must (to do things not possible in C, or to access hardware features like task switching, port I/O, etc.). Everything else is in C, even though Torvalds is a performance freak, and a hardcore assembly programmer too.

And even if there is some performance gain that assembler gives you over *perfectly written* C code (which I doubt), it is certainly not worth doing. Imagine if your entire search() function was in assembly. It would be pages and pages long, and completely illegible. And you'd have to write plenty of versions of it for all sorts of different platforms (and maintain them, risking divergence between them). A ridiculous waste of time, not to mention the fact that you'd have to constantly add new versions to support new CPU features etc.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
lucasart
Posts: 3242
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Developments of the last two years

Post by lucasart »

lucasart wrote:But you will make your program faster by writing *better C* code. And in order to write efficient C code, some knowledge of assembly, and how a compiler/optimizer works certainly helps.
Anyway, we can talk about it in abstract terms as much as we want. But it's a waste of time (especially when people who have no knowledge of assembly start spamming the discussion).

So let's make this a challenge. I challenge anyone to write an assembly perft() that will beat my C code. PM me if you are interested!

Anyone up for the challenge?
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Ron Murawski
Posts: 397
Joined: Sun Oct 29, 2006 4:38 am
Location: Schenectady, NY

Re: Developments of the last two years

Post by Ron Murawski »

Rebel wrote:... although the DigitalMars compiler is old and development has stopped its code speed-wise is on the same level as MSVC-2010.
The Digital Mars C compiler is still being actively developed. It was last updated December 15, 2012 to version 8.56.
http://www.digitalmars.com/download/freecompiler.html

Ron