Questions about getting ready for multicore programming.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Apr 08, 2008 4:49 pm

dzhao wrote:
Gerd Isenberg wrote: I have the impression that accessing global variables in 64-bit mode becomes more expensive. There is no compact mode with 32-bit addresses. There is a rip-relative addressing mode, but assembly generated by vc2005 indicates a pointer is needed to access globals all the time. Globals like static class members as well as statics inside the local scope of a function.
Code: Select all
 lea r10, base of some data_segment
 mov rax, [r10 + offset global var]
Thus passing a board-, search- or equivalently a this-pointer around - even in a recursive search, might be faster than accessing globals. It might even make sense to keep all the constant data inside a one time initialized, embedded none static "const" member.
Gerd
I noticed this when I played with x64 first time. I think the reason the compiler uses a register instead of an explicit constant to access a global is to reduce the code size. A 64 bit constant is 8 bytes, which is a large operand and no good for fast instruction decoding.

I don't think you need to put constants on heap if you do multithreading and use pointer to address a search tree or board.
In such a case r10 is already known (or initialized), that is the board (or tree) pointer for a thread. The first load only executes once.

As far as I understand there are only two instructions which take an immediate 64-bit address, called moffset64 — direct memory offset that specifies a quadword (64-bit) operand in memory:

Code: Select all

MOV RAX, moffset64   opcode A1
MOV moffset64, RAX   opcode A3

See AMD64 Architecture Programmer's Manual. Volume 3: General-Purpuse and System Instructions, 1 Instruction Formats.

Otherwise compiler rely on a ModRM/SIB mode - or on rip-relative addressing mode. ModRM/SIB mode requires base-register which is adjusted by the os-loader. Of course the lea is not always needed if the compiler keeps the pointer to global data over function boundaries - anyway the register is "wasted" to access globals, similar to a this-pointer to access "private" data. I managed to relax register pressure to speedup my code by eliminating all global data references in critical code to keep once initialized constants as embedded objects accessible via this-ptr which I use anyway.

Here is one sample with static data, where the vc-compiler for some reason does not emit a lea-instruction, but uses four times memory operands with four byte displacements each.

Code: Select all

U32 popCount(U64 bb) {
   static const U64 CACHE_ALIGN masks[8] = {
      C64(0x0101010101010101), C64(0x0202020202020202),
      C64(0x0404040404040404), C64(0x0808080808080808),
      C64(0x1010101010101010), C64(0x2020202020202020),
      C64(0x4040404040404040), C64(0x8080808080808080),
   };
   __m128i x0, x1, x2, x3, zr; U32 cnt;
   __m128i * pM = (__m128i*) masks;
   x0  =  _mm_cvtsi64x_si128 ( bb );
   x0  =  _mm_unpacklo_epi64 ( x0, x0 );
   zr  =  _mm_setzero_si128();
   x3  =  _mm_andnot_si128 ( x0, pM[3] );
   x2  =  _mm_andnot_si128 ( x0, pM[2] );
   x1  =  _mm_andnot_si128 ( x0, pM[1] );
   x0  =  _mm_andnot_si128 ( x0, pM[0] );
   x3  =  _mm_cmpeq_epi8   ( x3, zr );
   x2  =  _mm_cmpeq_epi8   ( x2, zr );
   x1  =  _mm_cmpeq_epi8   ( x1, zr );
   x0  =  _mm_cmpeq_epi8   ( x0, zr );
   x2  =  _mm_add_epi8     ( x2, x3 );
   x0  =  _mm_add_epi8     ( x0, x1 );
   x0  =  _mm_add_epi8     ( x0, x2 );
   x0  =  _mm_sad_epu8     ( x0, zr );
   cnt = -_mm_cvtsi128_si32( x0 )
         -_mm_extract_epi16( x0, 4 );
   return cnt & 255;
}

bb$ = 8
?popCount@@YAI_K@Z PROC
  00000   66 0f ef db    pxor    xmm3, xmm3
  00004   66 48 0f 6e d1 movd    xmm2, rcx
  00009   66 0f 6c d2    punpcklqdq xmm2, xmm2
  0000d   66 0f 6f e2    movdqa  xmm4, xmm2
  00011   66 0f 6f c2    movdqa  xmm0, xmm2
  00015   66 0f 6f ca    movdqa  xmm1, xmm2
  00019   66 0f df 15 30 00 00 00    pandn    xmm2, XMMWORD PTR ?masks+48
  00021   66 0f df 25 00 00 00 00    pandn    xmm4, XMMWORD PTR ?masks
  00029   66 0f df 0d 20 00 00 00    pandn    xmm1, XMMWORD PTR ?masks+32
  00031   66 0f df 05 10 00 00 00    pandn    xmm0, XMMWORD PTR ?masks+16
  00039   66 0f 74 e3    pcmpeqb xmm4, xmm3
  0003d   66 0f 74 c3    pcmpeqb xmm0, xmm3
  00041   66 0f 74 cb    pcmpeqb xmm1, xmm3
  00045   66 0f fc e0    paddb   xmm4, xmm0
  00049   66 0f 74 d3    pcmpeqb xmm2, xmm3
  0004d   66 0f fc ca    paddb   xmm1, xmm2
  00051   66 0f fc e1    paddb   xmm4, xmm1
  00055   66 0f f6 e3    psadbw  xmm4, xmm3
  00059   66 0f 7e e0    movd    eax, xmm4
  0005d   66 0f c5 cc 04 pextrw  ecx, xmm4, 4
  00062   03 c8          add     ecx, eax
  00064   f7 d9          neg     ecx
  00066   0f b6 c1       movzx   eax, cl
  00069   c3             ret     0
?popCount@@YAI_K@Z ENDP

Tord Romstad · Post by **Tord Romstad** » Thu Apr 10, 2008 5:32 pm

Hello Wyle,

Great post. It sums up many of my own gripes with C++.

wgarvin wrote: I suspect that a new language could offer 95% of the useful power of C++ while (initially) having about 10% of the complexity. The three main things I would want out of this language are:

(1) Be designed for efficient incremental compilation and linking, and efficient manipulation by tools/IDEs (in other words: no more preprocessor!).

(2) Strong compile-time metaprogramming support built into the language. Let me write *imperative* metaprograms which look just like my imperative C++ code! They should be able to inspect declarations, types, sizes and alignments as known to the compiler, and generate new statements or declarations or types on-the-fly. In short, all of the power of LISP macros combined with all the nice syntactic sugar of curly-brace languages. Also, there should be a phase which evaluates all the metaprograms, templates, compile-time conditionals etc. and outputs the results *as human-readable source code*, with the option to switch between this generated source and the original source while debugging. The language features would all have to be designed so that this generated source would stay comprehensible by humans (in contrast to, for example, the generated type names of C++ templates, which can easily run to thousands of characters and be nigh-unparseable by a human).

(3) Must be suitable for embedded real-time programming, must provide full control over memory allocation (i.e. it can't require a garbage collector), and must be able to link directly to compiled C code or libraries on the target machine.

The only language I know of that's close to what I want is the D by Walter Bright.

To me, BitC looks far more promising, from a technical point of view. Unfortunately, I think it has even smaller chances than D of ever becoming a mainstream language with good tools and compilers.

[Edit: Chess engines are complex and have complex behaviour,

Actually, they are not. Most people tend to believe that strong chess engines must be very complex programs, probably because playing chess well is so difficult for humans. The truth is that the vast majority modern chess engines (including the best ones) are very simple programs. The state of the art algorithms and data structures are extremely straightforward to implement.

but the program itself is not very large, so C++ does an adequate job.

Yes.

But for large projects, C++ can be painful to work with.

No kidding.

For example, the debug info for our *release* executables is still hundreds of megabytes. (I don't even try to build debug ones anymore, they take forever to link.)]

Debugging C/C++ code is immensely painful even for tiny programs like a chess engine, IMHO. I really can't stand the infinite edit-compile-run-debug loop typical of C programming. Since several years, I don't debug my program at all. My philosophy is that f something doesn't work, and it is not immediately obvious why it doesn't work, my code is too complex to be manageable, and I rewrite it in a simpler and clearer way.

My preferred way of working with C/C++ is to do almost all the work with pencil and paper. I break all problems down to sub-problems so simple that it just isn't possible to make any mistakes implementing them, and only start typing the code when I have worked out all little details and am 100% sure everything works (needless to say, it often happens that it doesn't work, even though I am 100% sure it will). It's a slow and tedious way to work, but to me it's the only bearable way to work in C/C++. For something simple like a chess engine, this approach works. For something bigger and more complex, it would be hopeless.

Tord

Tord Romstad · Post by **Tord Romstad** » Thu Apr 10, 2008 5:41 pm

Tord Romstad wrote:Perhaps not, but there is no reason to use OOP just because you use C++. At least to me, OOP is not among the most compelling advantages of C++ compared to C. Stronger typing, exception handling and real strings are far more important.

I forgot namespaces.

I don't think C++'s slightly improved type checking is any significant benefit. Just not strong enough to really stop a programmer from making mistakes.

To me, it helps a lot, especially the fact that enums in C++ are real types, and not just synonyms for ints. I can define separate types for squares, colors, pieces, moves, and so on, which are all represented by machine integers, but which are checked for type correctness at compile-time. I can't pass a move to a function which expects a square, and I can't assign a piece to a variable of type Color.

The type system in C++ leaves a lot to be desired, but it's a tremendous improvement over C typing.

I don't think exception handling is all that useful for chess, either.

I agree. I was talking about programming in general, not about chess programming in particular. Chess programming is so simple that it doesn't matter all that much what language you use, as long as you have a good optimizing compiler.

Tord

Tord Romstad · Post by **Tord Romstad** » Thu Apr 10, 2008 6:08 pm

Zach Wegner wrote:You might as well compare C to Java and conclude that you can't do anything in C because there isn't a standard library function for a hashtable. The real answer is, you actually need to learn what a hashtable is to program one in C.

If only hash tables were the most important thing missing. We don't even have integers or strings.

What the paper seems to be arguing is that _new_ programmers can come in and write code that is similar in efficiency to optimized C without learning all of the low-level details.

No, it is not only about that. Because of the poor support for abstractions of all kinds in C, it is often difficult or impossible to implement algorithms in the most general possible way without suffering a significant performance penalty. Consider the example of sorting again: You made the point that if highest possible performance was the goal, you wouldn't use the standard qsort() function, but write your own function optimized for the particular data you want to sort. Imagine that you have a big program where you have to sort lots of different types of data many places in your code. You would then have to re-implement exactly the same algorithm again and again, just tailor-made for each type of data you need to sort. This is very tedious and error-prone, especially if you discover a better algorithm some time later, and have to update your code many places instead of just one place.

Of course, in theory it is always possible to manually write out tailor-made and optimized versions of your algorithms for all situations you need in your program, but in practice it can become unrealistic in big programs. Moreover, why should you spend your time implementing the same algorithm again and again, with only minor changes? It is a task which requires great accuracy, but very little intelligence. In other words, it is the kind of task which is better left to the machine. This is what metaprogramming is about.

If anything, good metaprogramming support is even more important in low-level languages. You really want to be able to implement powerful abstractions without a runtime performance penalty. C and C++ gives you the unpleasant choice between no metaprogramming at all (C), or metaprogramming made so bizarrely complicated and ugly that only wizards can use it effectively (C++).

Tord

Tord Romstad · Post by **Tord Romstad** » Thu Apr 10, 2008 6:16 pm

Guetti wrote:I also would like to point out that glaurung2 is a beautiful piece of software written in C++, despite Tords remark that he detests C++. It is a great help for me to look from time to time at glaurungs code, which gives me the ideas how to implement some things in a better way in C++. And apart of that, it is much easier to read than any C program like Crafty or C+(+) programs like Fruit.
Actually glaurung is the C++ chess engine I always wanted to write myself. Unfortunately Tord is much better and was faster than me.

Hello Andreas,

Thanks for the compliments! I'm glad you like my code. But why do you think it shows how to do things "in a better way in C++"? As far as I can see, I don't make use of a lot of C++ features except stronger typing. There are a few classes, but because I don't subclass anything (except in the endgame code, where I do some subclassing in order to implement a form of poor man's closures), I could just as well have used C structs. The difference between

Code: Select all

pos.piece_on(square);

and

Code: Select all

piece_on(pos, square);

is purely syntactic.

Apart from a few string operations in my IO code and the use of stronger typing to catch bugs, I can't remember anything in Glaurung which couldn't have been implemented just as easily in plain C.

Tord

Guetti · Post by **Guetti** » Thu Apr 10, 2008 6:31 pm

Tord Romstad wrote: Hello Andreas,

Thanks for the compliments! I'm glad you like my code. But why do you think it shows how to do things "in a better way in C++"? As far as I can see, I don't make use of a lot of C++ features except stronger typing.

But this is a main feature of C++ and one that is beneficial for a chess engine, while others like subclassing are probably not so useful in a chess engine. But I'm a C++ beginner, thus I can learn also from very simple constructs.

There are a few classes, but because I don't subclass anything (except in the endgame code, where I do some subclassing in order to implement a form of poor man's closures), I could just as well have used C structs. The difference between
Code: Select all
pos.piece_on(square);
and
Code: Select all
piece_on(pos, square);
is purely syntactic.

Apart from a few string operations in my IO code and the use of stronger typing to catch bugs, I can't remember anything in Glaurung which couldn't have been implemented just as easily in plain C.

Tord

I only say in contrary to Fruit, Glaurung looks like a C++ programm. You use the C++ I/O functions, classes, namespaces, enums etc.
Mine is much less C++. I planned to write mainly C but use some useful features of C++, but I'm often to conservative and concerned to much about speed. E.g. I currently don't type everything in enums, like you do in glaurung, but use typedefs. I was worried enums would be slower.

Zach Wegner · Post by **Zach Wegner** » Thu Apr 10, 2008 8:57 pm

Tord Romstad wrote:If only hash tables were the most important thing missing. We don't even have integers or strings.

Hehe, I think the strings are OK. The standard library could use some things like replace(), but it's possible to write them (though it must return a new string).
And with integers, I'd rather have access to machine integers (which I mainly use) and then be able to just add libraries for real integers. It's nice to have native support of them, but I don't like to be forced to use them, like in Lisp and Haskell (right?)

What the paper seems to be arguing is that _new_ programmers can come in and write code that is similar in efficiency to optimized C without learning all of the low-level details.
No, it is not only about that. Because of the poor support for abstractions of all kinds in C, it is often difficult or impossible to implement algorithms in the most general possible way without suffering a significant performance penalty. Consider the example of sorting again: You made the point that if highest possible performance was the goal, you wouldn't use the standard qsort() function, but write your own function optimized for the particular data you want to sort. Imagine that you have a big program where you have to sort lots of different types of data many places in your code. You would then have to re-implement exactly the same algorithm again and again, just tailor-made for each type of data you need to sort. This is very tedious and error-prone, especially if you discover a better algorithm some time later, and have to update your code many places instead of just one place.

I don't disagree. I did say that C wasn't designed for that kind of code, and it's a fair argument against it. In the case you describe I'd likely use qsort, even though it's hideous.

Of course, in theory it is always possible to manually write out tailor-made and optimized versions of your algorithms for all situations you need in your program, but in practice it can become unrealistic in big programs. Moreover, why should you spend your time implementing the same algorithm again and again, with only minor changes? It is a task which requires great accuracy, but very little intelligence. In other words, it is the kind of task which is better left to the machine. This is what metaprogramming is about.

If anything, good metaprogramming support is even more important in low-level languages. You really want to be able to implement powerful abstractions without a runtime performance penalty. C and C++ gives you the unpleasant choice between no metaprogramming at all (C), or metaprogramming made so bizarrely complicated and ugly that only wizards can use it effectively (C++).

Of course, the problem is that its impossible to do metaprogramming without either having separate code for each type or carrying the metadata for size, length, etc. around. You can do either implicitly or explicitly (basically any high level language vs. C). I guess that in the kind of programming I do, I just don't really need metaprogramming.

bob · Post by **bob** » Thu Apr 10, 2008 10:56 pm

Zach Wegner wrote:
Tord Romstad wrote:If only hash tables were the most important thing missing. We don't even have integers or strings.
Hehe, I think the strings are OK. The standard library could use some things like replace(), but it's possible to write them (though it must return a new string).
And with integers, I'd rather have access to machine integers (which I mainly use) and then be able to just add libraries for real integers. It's nice to have native support of them, but I don't like to be forced to use them, like in Lisp and Haskell (right?)

What the paper seems to be arguing is that _new_ programmers can come in and write code that is similar in efficiency to optimized C without learning all of the low-level details.
No, it is not only about that. Because of the poor support for abstractions of all kinds in C, it is often difficult or impossible to implement algorithms in the most general possible way without suffering a significant performance penalty. Consider the example of sorting again: You made the point that if highest possible performance was the goal, you wouldn't use the standard qsort() function, but write your own function optimized for the particular data you want to sort. Imagine that you have a big program where you have to sort lots of different types of data many places in your code. You would then have to re-implement exactly the same algorithm again and again, just tailor-made for each type of data you need to sort. This is very tedious and error-prone, especially if you discover a better algorithm some time later, and have to update your code many places instead of just one place.
I don't disagree. I did say that C wasn't designed for that kind of code, and it's a fair argument against it. In the case you describe I'd likely use qsort, even though it's hideous.

Of course, in theory it is always possible to manually write out tailor-made and optimized versions of your algorithms for all situations you need in your program, but in practice it can become unrealistic in big programs. Moreover, why should you spend your time implementing the same algorithm again and again, with only minor changes? It is a task which requires great accuracy, but very little intelligence. In other words, it is the kind of task which is better left to the machine. This is what metaprogramming is about.

If anything, good metaprogramming support is even more important in low-level languages. You really want to be able to implement powerful abstractions without a runtime performance penalty. C and C++ gives you the unpleasant choice between no metaprogramming at all (C), or metaprogramming made so bizarrely complicated and ugly that only wizards can use it effectively (C++).
Of course, the problem is that its impossible to do metaprogramming without either having separate code for each type or carrying the metadata for size, length, etc. around. You can do either implicitly or explicitly (basically any high level language vs. C). I guess that in the kind of programming I do, I just don't really need metaprogramming.

OK, for the record, what is the beef with qsort()???

1. It is the fastest known sorting algorithm, an O(N log N) algorithm.

2. It has the rather clunky comparison facility where you write a procedure to compare two values and return <, = or >. But a _generalized_ sort is going to have to do the same thing, and I'll bet I can _always_ write a better comparison algorithm than a general-purpose one supplied for a more general sort algorithm.

Software development cost is a different issue, but for a chess engine, I consider that irrelevant, since it is a "one-off" type algorithm where reusability is of little importance when compared to raw speed.

C is still completely viable for high-performance algorithms...

Aleks Peshkov · Post by **Aleks Peshkov** » Fri Apr 11, 2008 12:51 am

Tord Romstad wrote:
Chess engines are complex and have complex behaviour,
Actually, they are not. Most people tend to believe that strong chess engines must be very complex programs, probably because playing chess well is so difficult for humans. The truth is that the vast majority modern chess engines (including the best ones) are very simple programs. The state of the art algorithms and data structures are extremely straightforward to implement.

I think many people get bored with the "state of the art" chess. I think I will not mistake if you expressed a similar personal opinion in forums.

It is very difficult to succeed in innovation at established field of algorithms and data structures, but I believe current chess programming fashion is not the final solution. Some innovative and important things like Magic Attacks Bitboards and material evaluation tables was relatively late additions to chess community.

Carey · Post by **Carey** » Fri Apr 11, 2008 1:04 am

bob wrote: OK, for the record, what is the beef with qsort()???

1. It is the fastest known sorting algorithm, an O(N log N) algorithm.

(This isn't why they are harping on qsort, but I thought it worth mentioning...)

Actually Prof. Hyatt, you've made a slight error there...

The C standard does *NOT* require the qsort routine to be implemented with any particular method.

(edited. Used 'algorithm' when I meant 'routine')

Because of the name, you expect it to be the quick-sort algorithm, but it doesn't have to be. It could be bubble sort.

I know, it sounds stupid. But the C standard simply doesn't require it to be any particular method but still calls itself 'qsort'. You'd think they'd either require it to be quick-sort or have changed it to just 'sort' but they didn't.

True, most compilers will do quick-sort. But they don't have to.

Also, the quick-sort method, while being usually efficient, can have some degenerate cases where the performance really degrades. The exact sequence depends on the implementation and the choice of the pivot point.

That's why the currently prefered methods are to use a hybrid sort method. It starts out as quick-sort but if monitoring code thinks it's taking too long, it'll switch to a different algorithm, such as a heap-sort.

Ironically enough, I think it was one of the original designers of C++'s STL that came up with this aproach. He called it the 'Intro sort'.

Writing a good sort routine can be surprisingly difficult. I remember reading an article by P. J. Plauger (he was on both the C & C++ standards commitees) about all the errors he made over the years while writing supposedly simple sort routines.

I used to thoroughly enjoy his articles back when he wrote in Embedded Systems Programming magazine.

To get this back onto chess....

I used to use the C library qsort routine back in my early chess programs. This was with a K&R C compiler.

It had so much overhead that it was actually faster for me to just do a little selection sort. The n^2 growth was still faster than what the C library offered.

Questions about getting ready for multicore programming.

Re: Consistant performance penalty for C++ classes

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.

Re: Questions about getting ready for multicore programming.