Questions about getting ready for multicore programming.

Discussion of chess software programming and technical issues.

Moderator: Ras

Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Questions about getting ready for multicore programming.

Post by Carey »

bob wrote:
Carey wrote:Big quote snipped. I know you like to quote everything, but to me it gets in the way of reading what is currently being said.
bob wrote: The way you handle memory is almost independent of whether you decide to use threads or processes. In modern unix systems, both will work the same way, where all instruction pages are shared, and data that is initialized prior to creating the new processes or threads, and then never modified (magic data, rotated bitboard data, etc) is also shared and not duplicated.
I think I understand that part. I've read yours (and others) posts in here about that.
So what you are left with is the data that you either want to share with other threads/processes, or data you want to keep private from other threads or processes.

The simplest solution is to create a large block of memory that contains your private data, but create N of them. Give each thread/process a pointer to its private data and they will not interact with each other at all. For data that is shared and modified, you create one large block of memory and give all threads/processes a pointer to that data, and then you take proper care by using atomic locks. It will be likely that some of your private data will also be shared with processes helping you search that sub-tree, so you will need locks to limit that access as well... Other types of "private" data will work themselves out simply, but how depends on whether you use threads or processes.

Right. I'm pretty sure I understand that.

That's just basically creating a big struct to hold the private data and pass that pointer to every routine. (Or some similar way.) Just like creating a C++ class for each thread, except the C++ nature saves you some notational effort about passing the data pointer, etc.

Or with processes you do the global setup first, fork, and then use shared-memory calls so the child processes can see the parent's shared data.

I think I understand the basics. I haven't done it, but I think I understand the idea.
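
Something like this is roughly what I picture for the threads case. (Just a sketch, compiled only in my head; all the names are made up and the pthread calls are from memory.)

Code: Select all

/* Sketch: per-thread private state plus one shared, lock-protected block. */
#include <pthread.h>

#define MAX_THREADS 4

struct ThreadData {        /* private to one searcher */
    int board[64];
    int ply;
    /* killers, current path, etc. */
};

struct SharedData {        /* shared by everybody, guarded by a lock */
    pthread_mutex_t lock;
    long nodes;
};

static ThreadData thread_data[MAX_THREADS];   /* one block per thread */
static SharedData shared;                     /* one block for all    */

static void *search_thread(void *arg)
{
    ThreadData *td = (ThreadData *)arg;       /* my private data        */
    /* ... search away, touching only *td ... */
    pthread_mutex_lock(&shared.lock);         /* shared data needs lock */
    shared.nodes += 1;
    pthread_mutex_unlock(&shared.lock);
    return NULL;
}

int main()
{
    pthread_t tid[MAX_THREADS];
    pthread_mutex_init(&shared.lock, NULL);
    for (int i = 0; i < MAX_THREADS; i++)
        pthread_create(&tid[i], NULL, search_thread, &thread_data[i]);
    for (int i = 0; i < MAX_THREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

Each thread only ever sees the one ThreadData block it was handed, so the private stuff never needs a lock at all.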


The problem I have with that is that it seems 'clunky' and it hurts performance.

It just seems like a kludge doing it that way. It just doesn't feel 'right' manually handling all the data like that.


I know going multicore will be a win overall, but it still seems fundamentally wrong to choose a method that carries a 10% or so performance penalty when you aren't multi-core.


That's part of my problem. It's not so much that I can't just pick a method and make it work. I believe I could do threads or processes or even separate external programs and 'make it work'.

It's that none of these seem to feel right. They all feel like they are a 'work around'. Like a kludge. Like I'm forcing a solution onto a problem and the fit isn't quite right.

And when something doesn't feel right, then there is something wrong somewhere. Either with the program or the tools or your own mental process.

(I guess I should have written my original message a little clearer about that part. But my only excuse is that it was late at night and I was about to go to bed.)
You are not going to see a 10% penalty. Probably 1% at worst. I was originally afraid that passing the TREE pointer around in Crafty would be a 10% penalty, but when I finished, there was no significant speed difference at all. So you don't have to worry about speed,
Really?

I had just been experimenting to see what kind of code changes I could / should do and picked the old move generator as being tolerably representative while being easiest to convert for my tests.

I would expect the evaluator & (Un)MakeMove to have similar results.

Between them, that's covering most of the cost of the program. (At least for my program it would be.)

Of course, that is with a simple mailbox program. Eventually that'll get switched to bitboard. But I would think it would see a similar penalty, if not more, because bitboard operations are more 'atomic' than mailbox ones, hence the pointer cost would be a higher percentage of the work.


Maybe I ought to hack in some of the other parts of my old program and compare what performance differences they have when I use a pointer or a class for local data.

Maybe the mailbox move generator isn't the most representative.

I guess that's my project for the next few days...

and actually should not be at this point anyway.
I'm not too concerned about speed, about squeezing out the last 0.001% of performance. I'm way too old for that kind of cycle counting. I stopped being able to do that with the 8 bit micros. The last time I tried was with some numerical programs, and I ran into the extreme variability among processors and memory architectures.

But I don't want to pick a method that is inherently going to be slow, either.

You simply want to find the simplest way to share what is needed because this kind of programming is non-trivial and anything you can do to simplify the design will reduce the number of bugs later...
I'm all for simple & clean code... (grin)

That's why I actually took some code and started playing around. Just thinking about what I could / should do wasn't worth the time I had spent on it.

I didn't want to do anything more complicated than I had to.

Here the approach that I use, which works with either threads or processes by the way, seems effective enough, and bearable in terms of effort required for results obtained... You are going to have independent tasks running since you will be searching different parts of the tree in parallel anyway, and that will begin to feel "natural" after a while...
Maybe it is a change in thinking that's part of my problem.

It just seemed like most of the ideas I came up with were more of a kludge than a solution.


Oh, one question. How do you handle the hash table with multiple search threads?

Do you share them or does each thread (or process) get its own table?

I would assume you share it. Threads just access it directly, and processes would use some system call to map it into their own address spaces?
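
For the process version I'm imagining something along these lines, but I could easily be off (sketch only, POSIX calls, made-up names):

Code: Select all

/* Guess at the shape of it: map the table shared *before* forking, so the
   parent and all the child searchers see the exact same memory. */
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

struct HashEntry { uint64_t key; uint64_t data; };

int main()
{
    size_t entries = 1 << 20;
    HashEntry *table = (HashEntry *)mmap(NULL, entries * sizeof(HashEntry),
                                         PROT_READ | PROT_WRITE,
                                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    for (int i = 0; i < 3; i++) {
        if (fork() == 0) {
            /* child searcher: probes and stores into 'table' directly */
            _exit(0);
        }
    }
    /* parent: same 'table' pointer, same physical memory */
    return 0;
}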
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Consistent performance penalty for C++ classes

Post by Carey »

After Prof. Hyatt said he saw only about a 1% performance hit when he went to a TREE struct and explicitly passing a pointer, I decided I needed to run some more tests.

Rather than continuing with my previous test program fragment, I went a simpler route.

I took a 15-year-old program that was based on one I wrote for an 8 bit micro. In other words, it was fairly small. (It was also buggy about checkmates. The qsearch was just an unbounded capture search, and the eval was based upon the one in John Stanback's public domain program. The only move ordering was MVV/LVA and a 16 bit History Heuristic. Nothing fancy. Nothing pretty.)

I first ran a 9 and then 10 ply search from the opening position. It took 48 and 391 seconds. (As I said, this was an old, simple program. That is indeed horrible runtime growth.)

I then stuffed all the relevant search data into a C++ class and put in all the necessary functions.

I then reran the search and it took 55 and 443 seconds. An increase of 14% and 13%.

Just to see if it was some obscure C++ issue (or the compiler doing additional C++ stuff), I moved most of the data out of the class but left in a few, along with the functions.

This time the search took 48 and 394 seconds. Essentially the same as the original plain C version.

So having the data inside the C++ class is indeed responsible for at least a 10% performance penalty.

Sure, the program is crud. But I don't think this is an unrealistic test. It's a real program doing a real search.

I don't know why Prof. Hyatt saw only a 1% performance penalty when he went to C structs and explicitly passing pointers. The C++ class is basically doing the same thing, just hiding the ugly details by automatically doing that rather than forcing you to do it.

I don't think trying to do it with a struct and manually passing the pointer around would result in any improvements. I could try it if anybody seriously believes it would make a difference and can persuade me.

I don't know if a bitboard program would see similar performance penalty. I don't happen to have a small bitboard program to test it with. (My last program was a bitboard program. Big. Ugly. Cluttered. Abandoned.)

Any comments or ideas?

I'm not willing to accept a 13%+ performance penalty simply to prepare for multi-core programming.
Dann Corbit
Posts: 12777
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Consistent performance penalty for C++ classes

Post by Dann Corbit »

Unless you are using virtual functions, you should not see any performance penalty.

If the code is small, I would like to see it (both for the fully "in-class" version and the one with the class stuff extracted).
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Consistent performance penalty for C++ classes

Post by Carey »

Dann Corbit wrote:Unless you are using virtual functions, you should not see any performance penalty.

If the code is small, I would like to see it (both for the fully "in-class" version and the one with the class stuff extracted).
I'm not using virtual functions. Just sticking the data and the functions into a C++ class and making it all public.

I wasn't really expecting too much performance difference either, but I had just enough doubt that I did the original experiment. And this second test is showing similar results.


As for the code, sure I can send it to you.

Just PM me your email address and I'll do it tomorrow.

However, the code is absolute crud. It is not an example of good programming. It's also 15 year old code that was a direct port of even older code that was designed for an 8 bit micro running a K&R C compiler.

The code is bad and it is buggy.

But it was small and it was handy, which made it okay for this quick & dirty test. (I recently went through my archives and dug out all my early chess programs for nostalgia reasons. That's why it was handy.)

(And no, my current programs aren't like that anymore. But they are also bigger and would have taken more time to convert.)

As long as you are aware of that and don't expect anything from it, sure, I'll let you have it, provided you don't pass it around. (It's too embarrassing to let anybody see it....)


This isn't much of a C to C++ conversion. All I did was stick the variables and the functions into a class, then go through the code and stick "ChessGame::" in front of all the member function definitions.

Not exactly elegant, but it worked well enough for the test.
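
Roughly, the whole conversion was this shape (a heavily cut-down sketch, not the actual code):

Code: Select all

// Everything public, no virtual functions, data moved into the object.
class ChessGame {
public:
    int Board[64];          // was a file-scope global in the C version
    int Evaluate();         // Search(), MakeMove(), etc. were done the same way
};

// the definitions just got "ChessGame::" stuck in front of them
int ChessGame::Evaluate()
{
    int score = 0;
    for (int sq = 0; sq < 64; sq++)
        score += Board[sq];  // every access now goes through 'this'
    return score;
}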

I'm thinking that the performance penalty is due to the 'this' pointer having to be referenced for every access to the 'Board[]' array. That's got to add some overhead. I wouldn't have thought 10% or more, but.... (shrug)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions about getting ready for multicore programming.

Post by bob »

The hash table needs to be shared, because it carries information from all nodes searched... There are issues when storing the thing in parallel, because memory writes can be done out of order in today's microprocessors, and that can result in inconsistent information in a single entry. You either have to be able to detect that this has happened (see the lockless hashing paper Tim Mann and I wrote for the JICCA) or else you have to make sure that an inconsistent entry won't break your program...
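
The basic trick from that paper, boiled down to a sketch (not the exact code we used, but the same idea):

Code: Select all

/* Each entry is two 64-bit words.  The signature word is stored XORed with
   the data word, so a torn (half-written) entry simply fails the check and
   is treated as a miss -- no lock needed. */
#include <stdint.h>

struct HashEntry { uint64_t key; uint64_t data; };

void hash_store(HashEntry *e, uint64_t key, uint64_t data)
{
    e->key  = key ^ data;     /* two independent writes, in any order */
    e->data = data;
}

int hash_probe(const HashEntry *e, uint64_t key, uint64_t *data)
{
    uint64_t d = e->data;
    if ((e->key ^ d) != key)  /* torn or foreign entry: signature fails */
        return 0;
    *data = d;
    return 1;
}

If the two stores get separated, the signature no longer matches the position's hash key, so the worst case is a wasted probe rather than a bogus score or an illegal hash move.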
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Consistent performance penalty for C++ classes

Post by Carey »

I just couldn't let this go.

When Hyatt says he ran such & such test and got such & such results, you can be sure he did. He may not have done the exact same test you would have, or some such difference, but he is definitely a big believer in testing.

So, I began thinking that maybe the problem isn't C++ or my test program, but maybe my compiler.

I'm using the current version of MinGW / GCC.

Admittedly GCC has never been known for the best code generation, but it should have produced reasonable results.

But maybe it didn't. Maybe GCC has an optimization problem with C++.

So, I installed MSVC Express 2008 and ran the C++ tests again.

I can't give you exact numbers because my laptop is on battery, so the times are much slower than before, but this time the two versions are much closer together.

My battery is almost dead and it's late and I need to go to bed, but it looks like my problem might have been GNU C++ generating poor code for data in classes.

I'll run some more tests tomorrow.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Consistent performance penalty for C++ classes

Post by bob »

Carey wrote:After Prof. Hyatt said he saw only about a 1% performance hit when he went to a TREE struct and explicitly passing a pointer, I decided I needed to run some more tests.

Rather than continuing with my previous test program fragment, I went a simpler route.

I took a 15-year-old program that was based on one I wrote for an 8 bit micro. In other words, it was fairly small. (It was also buggy about checkmates. The qsearch was just an unbounded capture search, and the eval was based upon the one in John Stanback's public domain program. The only move ordering was MVV/LVA and a 16 bit History Heuristic. Nothing fancy. Nothing pretty.)

I first ran a 9 and then 10 ply search from the opening position. It took 48 and 391 seconds. (As I said, this was an old, simple program. That is indeed horrible runtime growth.)

I then stuffed all the relevant search data into a C++ class and put in all the necessary functions.

I then reran the search and it took 55 and 443 seconds. An increase of 14% and 13%.

Just to see if it was some obscure C++ issue (or the compiler doing additional C++ stuff), I moved most of the data out of the class but left in a few, along with the functions.

This time the search took 48 and 394 seconds. Essentially the same as the original plain C version.

So having the data inside the C++ class is indeed responsible for at least a 10% performance penalty.

Sure, the program is crud. But I don't think this is an unrealistic test. It's a real program doing a real search.

I don't know why Prof. Hyatt saw only a 1% performance penalty when he went to C structs and explicitly passing pointers. The C++ class is basically doing the same thing, just hiding the ugly details by automatically doing that rather than forcing you to do it.

I don't think trying to do it with a struct and manually passing the pointer around would result in any improvements. I could try it if anybody seriously believes it would make a difference and can persuade me.

I don't know if a bitboard program would see similar performance penalty. I don't happen to have a small bitboard program to test it with. (My last program was a bitboard program. Big. Ugly. Cluttered. Abandoned.)

Any comments or ideas?

I'm not willing to accept a 13%+ performance penalty simply to prepare for multi-core programming.
The penalty I saw was from going to normal global data that was all addressed explicitly, to data that was all accessed indirectly through a pointer... the board state, search state, etc... The overhead was caused by needing an extra register here and there to do the referencing, which didn't add nearly as much overhead as I expected, particularly on 64 bit processors that have 8 extra registers they can use...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Consistent performance penalty for C++ classes

Post by bob »

Carey wrote:I just couldn't let this go.

When Hyatt says he ran such & such test and got such & such results, you can be sure he did. He may not have done the exact same test you would have, or some such difference, but he is definitely a big believer in testing.

So, I began thinking that maybe the problem isn't C++ or my test program, but maybe my compiler.

I'm using the current version of MinGW / GCC.

Admittedly GCC has never been known for the best code generation, but it should have produced reasonable results.

But maybe it didn't. Maybe GCC has an optimization problem with C++.

So, I installed MSVC Express 2008 and ran the C++ tests again.

I can't give you exact numbers because my laptop is on battery, so the times are much slower than before, but this time the two versions are much closer together.

My battery is almost dead and it's late and I need to go to bed, but it looks like my problem might have been GNU C++ generating poor code for data in classes.

I'll run some more tests tomorrow.
I tested this carefully... Originally all my board state was in global arrays and stuff that were simply referenced directly. After the "big change" I had to pass a pointer to all functions, and then the state was referenced relative to that pointer address. Each function suddenly had one more formal argument passed in, and each memory reference had a register (pointer) requirement that direct addressing did not.

The overhead was actually minimal. I originally guessed maybe 10%, but that turned out to be way high, and I didn't notice much difference at all, which was a surprise. At the time, this was also GCC based testing under linux, although some "helpers" tested it on windows with the same result...
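
Schematically, the change looked something like this (just the shape of it, not the real code):

Code: Select all

/* Before: state in globals, addressed directly. */
int board[64];
int ply;

int SearchOld(int alpha, int beta, int depth)
{
    ply++;                    /* direct, absolute addressing */
    /* ... */
    ply--;
    return alpha;
}

/* After: state lives in a structure, and a pointer to it is the extra
   formal argument every function now takes. */
struct TREE {
    int board[64];
    int ply;
};

int SearchNew(TREE *tree, int alpha, int beta, int depth)
{
    tree->ply++;              /* same access, now base + offset via a register */
    /* ... */
    tree->ply--;
    return alpha;
}

With N of those TREE blocks you can have N searches going at once, and the compiler mostly just keeps the pointer parked in a register.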
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Consistent performance penalty for C++ classes

Post by Gerd Isenberg »

Carey wrote: I'm thinking that the performance penalty is due to the 'this' pointer having to be referenced for every access to the 'Board[]' array. That's got to add some overhead. I wouldn't have thought 10% or more, but.... (shrug)
I have the impression that accessing global variables in 64-bit mode becomes more expensive. There is no compact mode with 32-bit addresses. There is a RIP-relative addressing mode, but the assembly generated by VC2005 indicates a pointer is needed to access globals all the time: globals, static class members, and statics inside the local scope of a function alike.

Code: Select all

 lea r10, base of some data_segment
 mov rax, [r10 + offset global var]
Thus passing a board-, search-, or equivalently a this-pointer around, even in a recursive search, might be faster than accessing globals. It might even make sense to keep all the constant data inside a once-initialized, embedded, non-static "const" member.

A nice aspect of C++ is that you may wrap and hide SSE2 intrinsics (or other SIMD architectures like AltiVec) to use 128-bit registers as a vector of two bitboards for fill stuff, with the usual C++ operator syntax. The VC2005-generated assembly is quite optimal with that stuff.
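
Something in this direction, for instance (untested sketch, names made up):

Code: Select all

// A pair of bitboards in one 128-bit register, with operator syntax
// hiding the SSE2 intrinsics.
#include <emmintrin.h>
typedef unsigned long long u64;

struct DoubleBB {
    __m128i bb;                                 // two bitboards side by side
    DoubleBB(__m128i x) : bb(x) {}
    DoubleBB(u64 lo, u64 hi) : bb(_mm_set_epi64x(hi, lo)) {}
    DoubleBB operator|(const DoubleBB &o) const { return DoubleBB(_mm_or_si128 (bb, o.bb)); }
    DoubleBB operator&(const DoubleBB &o) const { return DoubleBB(_mm_and_si128(bb, o.bb)); }
    DoubleBB operator<<(int n)            const { return DoubleBB(_mm_slli_epi64(bb, n)); }
};

// e.g. one step of a north fill, for two generator bitboards at once
inline DoubleBB northFillStep(const DoubleBB &gen) { return gen | (gen << 8); }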

Whether you use move.isCapture() or isCapture(move) with TMove as an ordinal scalar int is a matter of taste or pragmatism. Whether you internally use a bitfield struct/class or explicit masking and filling by shift/and/or shouldn't matter, as long as you "hide" the implementation and the object fits inside a register and can be passed by value with the default fastcall convention. Endian issues matter for persistent objects, where one may use endian- and size-independent write/read methods.
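
For instance (sketch; the flag bit is made up):

Code: Select all

// A move wrapped in a class that still fits in one register and is passed
// by value; isCapture() just hides the masking.
struct TMove {
    unsigned m;                                   // from/to/flags packed into an int
    explicit TMove(unsigned x) : m(x) {}
    bool isCapture() const { return (m & 0x4000) != 0; }
};

// the free-function spelling does exactly the same thing
inline bool isCapture(unsigned m) { return (m & 0x4000) != 0; }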

Gerd
Aleks Peshkov
Posts: 892
Joined: Sun Nov 19, 2006 9:16 pm
Location: Russia

Re: Consistent performance penalty for C++ classes

Post by Aleks Peshkov »

Gerd, how efficient is the x86 implementation of thread-scope variables, which are supported by all major compilers and seem set to be added in the next C++ standard?
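
I mean the compiler-specific spellings like these (if I have them right):

Code: Select all

// two current compiler-specific spellings (one per compiler, not both at once)
__thread int node_count;               // GCC / MinGW
// __declspec(thread) int node_count;  // MSVC spelling of the same thing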