Developments of the last two years

bob · Post by **bob** » Mon Mar 04, 2013 5:37 am

lucasart wrote:
bob wrote:
hgm wrote:I don't think that reasoning is sound. I can build an engine that plays better Chess than I do. Why shouldn't I be able to write a compiler that makes better assembly than I do? Even if I would be the worlds's number one expert. We have engines that beat the World champ Chess...
Here's the biggest win you get, from someone that has been programming in ASM forever...

The compiler has to follow the semantics of the language. It doesn't know anything about the semantics of the program itself, other than what it sees. But that is not much.

For example, if you do a switch() it must, by standard, do what the specs say. If the switch variable value does not match any of the cases, the compiler is required to test for that condition to avoid errors. In my code, I might well KNOW that the "piece" value will only be one of 13 possibilities, and I can exclude that "if piece < -6 or piece > +6" sanity test. There are lots of such things that one can exploit, but it requires knowledge about the program that can't readily be discerned just looking at the source code for one procedure. Other things might include min and max range of a value where the compiler only knows "int" or "long" but you know it is 17 bits max. So there is room to out-perform the compiler, but nowhere near what it was 20-30 years ago when hand-coded asm was the way to go everywhere.

Compilers are good at doing the classic optimizations, like common-subexpression elimination, loop-invariant detection and such. So I doubt a 10:1 is possible, UNLESS one intentionally (or unknowingly) writes C code that ties the compiler's hands. Bad C code can still produce bad asm code. But with care, one can write C code that produces pretty well optimized asm code. Unless you step outside the semantics of the language somewhere. For example, there is no standard way to get to the popcnt() instruction. compilers do include intrinsics for such, but it is not something that is portable.

I've spent a lot of time looking at compiler output and am generally impressed. They can be improved on, with effort, and perhaps with a lot of pragma usage to tell the compiler things it can't spot for itself.
I appreciate that your knowledge of assembly is far better than mine, and probably almost everyone here.

However, the arguments you are making have little to do with Assembly vs C. For example, the switch argument. Let's say that your piece values range from 0 to 11. How about writing this instead?
Code: Select all
if (piece >= 12) goto skip;
switch (piece) {...};
skip:
The fact that 0 <= piece <= 11 can help to optimize the code is of course not known by a C compiler, nor is it known by an assembler...

How does that help? You are STILL doing the test.

If I write that in ask, I won't do that test at all, and that was my point. You know things about the code that the compiler can't determine from inspection, or even from execution in profiling.

In C++, with typed enum, you can let the compiler know the range of possible values
Code: Select all
enum Piece {wPawn, wKnight...};
And then the switch knows that the set of possible values for a "Piece" is realatively compact so the compiler may decide to use a jump table, to make branching a O(1) operation.

Regarding the int/long, there are many integer types in C. It's all about choosing the right one, and knowing that sign extension is not a zero cost operation. If you use unsigned judiciously, there is some (tiny) performance gain there. But in reality, writing compact and cache friendly code is far more important than these microscopic optimizations.

Again, there are things you know that the compiler can't know. A simple example: you want to change the side to move (WTM) variable. You could do the common "wtm = !wtm" which is horrible, or you could do the less obvious wtm ^= 1. The latter is a one instruction operation that will execute as fast as any instruction in the cpu. The former is a conditional comparison with a branch. the semantics of ! say "if wtm is zero, !wtm is 1, if wtm is non-zero, !wtm=0. The gotcha is that "is non-zero". The compiler can't know that the only two possible values are 0 and 1 and do the xor optimization, since we don't have any true boolean data type in C.

What I am trying to say is that you will not make your code faster by using assembly, except to use hardware features otherwise unavailable in C (in the case of chess, I think there are only 4 functions that deserve assembly or intrinsics: popcnt(), lsb(), msb(), prefetch()).

But you will make your program faster by writing *better C* code. And in order to write efficient C code, some knowledge of assembly, and how a compiler/optimizer works certainly helps.

Two points, one we disagree on.

(1) it is always a possibility that a better algorithm can be developed, one that dwarfs any hand-coded asm code. No argument there. No point in hand-coding a poor algorithm.

(2) hand-coding, for an experienced x86 (or any other processor) asm programmer can produce significant performance improvements. How big a gain is debatable, but I would NEVER say that a compiler can't be beat. I've looked at too much compiler output. They are good. Very good. But they are nowhere near perfect.

Whether (2) is a good idea is a different topic. Asm code is harder to read, harder to write, harder to debug, and harder to modify later. In return, you will generally get something that executes faster. The question is, how much faster? 2x is really not uncommon at all. Question is, for a chess engine, is it worth the effort required to get that 2x speedup, when compared to the effort required to make future changes to it? That is certainly a valid topic for debate.

You can see that the Linux kernel has a very small %age of assembly code. It only uses assembly where it must (to do things not possible in C or to access hardware features like task switching port I/O etc.). Everything else is in C, even though Torvaldes is a performance freak, and a hardcore assembly programmer too.

Different topic, entirely. An OS, by definition, is constantly changing. We've had pure asm kernels in the past. IBM os/360 was an example. The xerox operating system CP-V for another. Etc. Problem with them is the difficulty of making changes. The linux kernel is very fast already. 2x would not make any measurable speed improvement for most users, and the advantage of having a c code base is a significant offsetting factor in favor of C.

And even if there is some performance gain that assembler gives you over *perfectly written* C code (which I doubt), it is certainly not worth doing. Imagine if your entire search() function was in assembly. It would be pages and pages long, and completely illegible. And you'd have to write plenty of versions of it for all sorts of different plateforms (and maintain them risking to have differences). A ridiculous waste of time, not to mention the fact that you'd have to constantly add new versions to support new CPU features etc.

That is the ONLY argument that makes sense. But there are plenty of places that STILL hand-code asm code for performance. Lawrence Livermore Lab for one... Sometimes speed is the overriding consideration, and there asm is the right answer. The only answer, in fact. If development time and such are important, then asm is never the answer.

bob · Post by **bob** » Mon Mar 04, 2013 5:40 am

lucasart wrote:
lucasart wrote:But you will make your program faster by writing *better C* code. And in order to write efficient C code, some knowledge of assembly, and how a compiler/optimizer works certainly helps.
Anyway, we can talk about it in abstract terms as much as we want. But it's a waste of time (especially when people who have no knowledge of assembly start spamming the discussion).

So let's make this a challenge. I challenge anyone to write an assembly perft() that will beat my C code. PM me if you are interested!

Anyone up for the challenge ?

If you make it financially worthwhile, you will lose the challenge. But who is willing to do that for a piece of code that won't make their engine any stronger???

I've not even optimized my perft code at all. That was an idea I came up with years ago to help me debug a move generator + make/unmake (circa 1994, first versions of Crafty). I never intended perft to be used as a performance measure, and certainly don't intend to waste any time making it run faster. The move generator and make/unmake are a different topic, however, since they are a part of the chess program.

bob · Post by **bob** » Mon Mar 04, 2013 5:45 am

velmarin wrote:
hgm wrote:For the record: I have absolutely nothing against derivatives. I have something against people that lie about their engine being original, or lie about which engine their derivative is based on.

One day I might even make my own derivative. It seems fun to try if I can modify Ippolit to play Seirawan Chess. (I already have a name for it: Ippocryt! )

But of course my admiration / respect for people is related to their achievements. And I rate the achievement of producing a 3000+ Elo Ivanhoe clone a lot below producing a 2000-Elo engine from scratch. In most cases, the making of these 'top engines' rates about as high as being able to write a 'Hello world' program, as far as I am concerned.

Mr. Muller,
You always respect me and I appreciate it, I wanted to tell you two things.

Not a day that I regret having started this project, even one day.
Open your browser and read Luke, the two Alex, post on Facebook and always the same answer. Login to chat TCEC and the same, day after day.

But what I started and finish it.

Ideas I assure that I have more and more, we must translate them to the engine and the values, that's another story.
Ippolit is an amazing machine, it is difficult to get out of his personality, I do not understand when people talk of unreadable code, ect, ect.
Sure that you have taken a look at the code, is sublime.
For example Bouquet in H7 does not attack, I'm on it, to press on the Black King (without putting masks nor cramfiple nor Xray in eval.h)
I repeat, EXAMPLE.
would be as easy as this example:
Code: Select all
#define DiagA1H8  0x8040201008040201   // diagonal A1 a H8
#define DiagB1H7  0x0080402010080402    // diagonal B1 a H7
#define DiagA2G8  0x4020100804020100   // diagonal A2 a G8
#define SideKing    0xF0F0F0F0F0F0F0F0   // Files= E,F,G,H
#define WingKing    0xE0E0E0E0E0E0E0E0   // Files= F,G,H


	if (King_BlacK& G8){
               if (Queen_white & DiagB1_H7 ) 
			Value += Score(5, 15);

		if (Bhisop_white & DiagB1H7 )
			Value += Score(5, 15);
		if (Bhisop_white  & DiagA1H8 )
			Value += Score(5, 15);
		if (Bhisop_white & DiagA2G8 )
			Value += Score(4, 10);
        if (Nkigth_white  & WingKing )
			Value += Score(4, 10);
		if (POPCNT(Nkigth_Black & WingKing)==0)
			Value += Score(4, 10);

	}
Needless to say this is perfect, but you can not deny that it is simple to write, almost like the "Hello World"

For the record, the PERFECT piece of code does not exist. The above, for example, is easy to speedup, depending on exactly what you want to detect. What are your assumptions about popcnt()? Could you combine both bishop diagonal tests into one AND and popcnt()? You should read what some of the well-known compiler gurus Don mentioned have to say. They know their stuff code, and they all recognize that hand-coded will always win. Only question is, by how much...

Rebel · Post by **Rebel** » Mon Mar 04, 2013 8:32 am

Ron Murawski wrote:
Rebel wrote:... although the DigitalMars compiler is old and development has stopped its code speed-wise is on the same level as MSVC-2010.
The Digital Mars C compiler is still being actively developed. It was last updated December 15, 2012 to version 8.56.
http://www.digitalmars.com/download/freecompiler.html

1. Unfortunately since version 8.42 (the one I use) it produces slower code.

2. During the years I (and others) have asked for x64 support, at least for inline x64 ASM support. Walter has hinted to that without making hard promises but unfortunately it never came to that.

Walter is a great guy, superb support but he isn't the youngest any longer.

velmarin · Post by **velmarin** » Mon Mar 04, 2013 8:41 am

bob wrote: For the record, the PERFECT piece of code does not exist. The above, for example, is easy to speedup, depending on exactly what you want to detect. What are your assumptions about popcnt()? Could you combine both bishop diagonal tests into one AND and popcnt()? You should read what some of the well-known compiler gurus Don mentioned have to say. They know their stuff code, and they all recognize that hand-coded will always win. Only question is, by how much...

Thanks for answering, Bob.

It was just a silly example, when one is told day after day, only Hexeditas, ect.
My knowledge is 3 out of 10, I continually open the books to see, still I have enthusiasm.

Of course an attack on (example H7) must be defined much more, jaques, developed pawns, pawns locked, escape the king, ect.
Make a popcnt of whether there are horses in the area, it was an idea.
And that if {} else {} only, we do nothing.
The concept is that the structure created in the engine (Ippolit) can use these instructions and Bitboards hardly yield losses of KNS.

Thanks, Bob.

Rebel · Post by **Rebel** » Mon Mar 04, 2013 8:59 am

Gerd Isenberg wrote:
Rebel wrote:
Don wrote: Look at some of the writings of Michael Abrash and I think you will change your tune
Precisely, the man has been my teacher
Hey, you never told us before
Thanks to Don for link and hint.

Gerd, good to see you join the discussion. Is there a good page about how to write good and optimized X64 ASM code?

Joost Buijs · Post by **Joost Buijs** » Mon Mar 04, 2013 9:08 am

lucasart wrote: What I am trying to say is that you will not make your code faster by using assembly, except to use hardware features otherwise unavailable in C (in the case of chess, I think there are only 4 functions that deserve assembly or intrinsics: popcnt(), lsb(), msb(), prefetch()).

But you will make your program faster by writing *better C* code. And in order to write efficient C code, some knowledge of assembly, and how a compiler/optimizer works certainly helps.

You can see that the Linux kernel has a very small %age of assembly code. It only uses assembly where it must (to do things not possible in C or to access hardware features like task switching port I/O etc.). Everything else is in C, even though Torvaldes is a performance freak, and a hardcore assembly programmer too.

And even if there is some performance gain that assembler gives you over *perfectly written* C code (which I doubt), it is certainly not worth doing. Imagine if your entire search() function was in assembly. It would be pages and pages long, and completely illegible. And you'd have to write plenty of versions of it for all sorts of different plateforms (and maintain them risking to have differences). A ridiculous waste of time, not to mention the fact that you'd have to constantly add new versions to support new CPU features etc.

This is exactly the way I think about this subject. To write efficient C/C++ code you got to have basic knowledge about assembler and you have to know how your compiler behaves. It is not about writing wtm=!wtm or wtm^=1, it is more about choosing the right data structures and making your program as cache-friendly as possible. I won't say that it's impossible to produce something faster with handwritten assembler, but if it is, it will be only by a very small margin. I see people here talking about gains of a factor of 2 in my opinion that's completely ridiculous.

In the past I wrote several programs in assembler including 2 chess programs and an Othello program. At a certain point I started using C with inline assembler because - at that time - the compilers didn't produce very efficient code. About 8 years ago I noticed that the compilers were getting so efficient that replacing the inline assembler with C code actually improved performance. Since that time I decided to throw out all assembler code, not only because it didn't help, but maintaining this code was also hampering development by a big margin.

A couple of years ago I was so stupid to waste a lot of time to rewrite my whole evaluation function is assembler, in the hope I was able to improve performance by - lets say - 20% or so. As I already told before, this was a very big disappointment, at best I was able to get on par with the C++ version.

Rebel · Post by **Rebel** » Mon Mar 04, 2013 9:16 am

Joost Buijs wrote: It surprises me that a very old compiler like the Digital Mars produces code that is speed-wise the same as code produced by MSVC 2010.

If you compare the produced ASM code of the 2 you will notice. The author of DM is simply good. He is the man behind Zortech undoubtedly the fastest compiler in the early-mid 90's.

Rebel · Post by **Rebel** » Mon Mar 04, 2013 9:32 am

lucasart wrote: In C++, with typed enum, you can let the compiler know the range of possible values
Code: Select all
enum Piece {wPawn, wKnight...};
And then the switch knows that the set of possible values for a "Piece" is realatively compact so the compiler may decide to use a jump table, to make branching a O(1) operation.

This is true, but did you ever checked the created code if the compiler actually created a jump table?

Besides (and even in C) you can create a jump table yourself for every one byte switch-case, see: http://www.top-5000.nl/authors/rebel/chess840.htm#INTRO

You simply force the compiler to create it. Works for goto's and calling routines.

Joost Buijs · Post by **Joost Buijs** » Mon Mar 04, 2013 9:40 am

Rebel wrote:
Joost Buijs wrote: It surprises me that a very old compiler like the Digital Mars produces code that is speed-wise the same as code produced by MSVC 2010.
If you compare the produced ASM code of the 2 you will notice. The author of DM is simply good. He is the man behind Zortech undoubtedly the fastest compiler in the early-mid 90's.

Zortech was the very first C compiler I ever used, as I recall this was in the beginning of the 80's. I used it for some years, after that I went to Watcom C/C++.

When I have some time I will download the DM compiler and compare it with VC2012 and Intel C++ 13.0.

I can't imagine that the optimizer from the DM compiler has any knowledge about the latest CPU cores. But maybe I'm mistaken.

Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years

Re: Developments of the last two years