Questions about getting ready for multicore programming.

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions about getting ready for multicore programming.

Post by bob »

Carey wrote:
bob wrote:
Carey wrote:Okay, I just got through running some tests.

I am using the latest MingW, which is v3.x. As Bo Persson points out, that may be a good part of my problem. (But since MingW hasn't officially released a version based on a current version of GNU C 4, and probably won't until the next decade, there isn't a lot I am willing to do. I don't like running alphas or private builds.)

I tested my program with GCC v3, OpenWatcom and MSVC 2008 Express.

I did the 'data in class' vs. 'data outside of class'. I didn't bother testing the plain C version because that should be comparable to 'data outside of class'.

Needless to say, OpenWatcom was the slowest. Dead dog slow. Almost twice as long as MSVC. The only other 'professional' compiler that would be slower would probably be Borland's current free Turbo C++.

Anyway, OpenWatcom had a performance penalty of about 8%.

I tried a couple versions of MingW with a few different switches. (I didn't try all the switches it offers. Just what the CodeBlocks IDE offers.) The performance penalty was 9%.

I tried MSVC 2008 Express. The performance penalty was 3%.

(I don't know if the free student version of MSVC pro tools would do any better. I'm not a college student and unfortunately, I don't know any to ask if they'd get me a free copy from Microsoft.)


So it looks like there is indeed a non-trivial performance penalty when you put the data into a C++ class.

Much of that can be optimized away by using a state-of-the-art compiler with significant optimization abilities.

Anything with just 'average' or 'good' optimization abilities will be at a serious disadvantage with C++ code.


So I guess most of this thread has been taken care of.... If I want to do multi-core programming, I'm going to have to stay with MSVC. In which case it doesn't matter too much whether it's threads or processes or something else.
Just to clarify, there isn't any "something else".

:)
Actually, there is, although it's only a little removed.

The normal 'process' program just forks, shares some read-only data, and then communicates through shared memory, while sharing the trans table.

The 'something else' is to go all the way and use entirely separate programs communicating through some channel (pipes, or LAN, or whatever you want). No shared memory, etc.

Each engine can be running on a different core or even a different processor.

It solves all the shared memory bugs & issues, while increasing the communication complexity.

Not much removed, but there are enough differences in what can be done that it's worth calling it 'something else'.

Threads are on the left, common fork()ing processes in the middle, and entirely separate programs on the right.
There's no difference between doing a fork() and simply starting N copies of the program, from a practical perspective. You can _still_ have shared memory. That's what the System V shared memory library is all about: you can create shared memory objects that any process can map into its virtual address space. Ditto for the other approach using mmap(). Otherwise, on modern Unix systems, if you start two instances of the program (assuming it is the exact same program, as in having the same i-node for both instances) they will even share memory pages for instructions and such, just like the fork() processes will.

That was what I meant by "there is no other approach". Threads share _everything_, the other approach is to just share what you explicitly want to share...
User avatar
Bo Persson
Posts: 257
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: Results for GCC v4.2

Post by Bo Persson »

Volker Annuss wrote:
Carey wrote: I don't have a 64 bit compiler, so even if I installed Vista64 I wouldn't be able to check that aspect. The people here may be right when they say that in 64 bit mode, it's not at all a problem.

But, VC 2008 Express doesn't support 64 bit code, and I don't think spending $700 for the professional one is worth it. And I'm not a student so I can't get it for free. So I won't be testing it.
You can get a 64-bit compiler for free by downloading the Windows SDK. It works from the command line, but I did not get it to work inside VC2005 and VC2008 Express Edition.

Greetings
Volker
If it is the compiler from the Windows Server 2003 SDK, that one is older than even the VS2005 Beta. Not recommended.
User avatar
Bo Persson
Posts: 257
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: Questions about getting ready for multicore programming.

Post by Bo Persson »

Carey wrote:Okay, I just got through running some tests.

I am using the latest MingW, which is v3.x. As Bo Persson points out, that may be a good part of my problem. (But since MingW hasn't officially released a version based on a current version of GNU C 4, and probably won't until the next decade, there isn't a lot I am willing to do. I don't like running alphas or private builds.)

I tested my program with GCC v3, OpenWatcom and MSVC 2008 Express.

I did the 'data in class' vs. 'data outside of class'. I didn't bother testing the plain C version because that should be comparable to 'data outside of class'.

Needless to say, OpenWatcom was the slowest. Dead dog slow. Almost twice as long as MSVC. The only other 'professional' compiler that would be slower would probably be Borland's current free Turbo C++.

Anyway, OpenWatcom had a performance penalty of about 8%.

I tried a couple versions of MingW with a few different switches. (I didn't try all the switches it offers. Just what the CodeBlocks IDE offers.) The performance penalty was 9%.

I tried MSVC 2008 Express. The performance penalty was 3%.
I believe that just dumping everything in a class is not really fair. If you actually transform the code from C to "proper" C++, I believe you can regain more than these 3%.
Carey wrote: (I don't know if the free student version of MSVC pro tools would do any better. I'm not a college student and unfortunately, I don't know any to ask if they'd get me a free copy from Microsoft.)
There is no difference in the compiler; the Pro edition has a more "advanced" IDE and additional libraries.
Carey wrote: So it looks like there is indeed a non-trivial performance penalty when you put the data into a C++ class.

Much of that can be optimized away by using a state-of-the-art compiler with significant optimization abilities.

Anything with just 'average' or 'good' optimization abilities will be at a serious disadvantage with C++ code.
Not having a good compiler will be a disadvantage anyway. :-)
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Results for GCC v4.2

Post by Carey »

Gerd Isenberg wrote:
Carey wrote:I just can't seem to leave this alone...

I downloaded and installed into a separate directory the 'TDM' port of gcc v4.2 for MingW.

I then did the 'data in class' versus 'data not in class' test.

My laptop was on battery, so the numbers aren't comparable to my other tests, but it was faster than with GCC 3.4 that MingW normally offers.

However, the performance penalty was still bad. In this case, nearly 15% performance reduction for data in the class versus global data.

Maybe I missed the magical option that would improve this. My IDE (CodeBlocks) doesn't give a lot of choices for optimization or code tweaking.

Or maybe I screwed up the install of GCC 420 and it's somehow still using my old gcc. (I don't think so, because this is faster than what I was getting before.)


It still looks like my previous conclusion is right. If you intend to put data in a class or in a struct for multi-threading, you had better have a darn good compiler, else you will be getting a significant performance penalty with 32 bit code.

With this performance penalty for GNU C (which I prefer over msvc), I'm definitely going to have to come up with an approach where I don't have to access the data via pointers.


I don't have a 64 bit compiler, so even if I installed Vista64 I wouldn't be able to check that aspect. The people here may be right when they say that in 64 bit mode, it's not at all a problem.

But, VC 2008 Express doesn't support 64 bit code, and I don't think spending $700 for the professional one is worth it. And I'm not a student so I can't get it for free. So I won't be testing it.


Well.... I hope this has been entertaining, if not informative for everybody here. It certainly was informative for me. Not massively helpful (since I don't like MSVC and that one is the only one without a major performance penalty), but definitely informative.
Carey
Tying up one additional register for the this-pointer everywhere takes space and time in 32-bit mode, with only a few registers available.
Naturally.

The point of the tests was to see how much penalty was there. I discovered that the penalty can be significant for some compilers and minimal with others.

It's not going to be a 'fixed cost' for the programmer.
The smaller the program is initially, the larger the relative effect. Despite compiler and optimization issues, you'll always have chaotic "non-linearities" if
Right. I can give you some horror stories from my numerical programming days.

Three very similar Pentium systems giving radically different results depending on which compiled program they ran. The only change between versions was that I simply recompiled it again. The linker turned out to be placing stuff nearly at random, totally destroying alignment.

Even after dealing with that, the three systems still behaved so differently that I ended up giving up.

And then, of course, you get into OS issues.... Kind of makes you wish for the DOS days where you could take control of the hardware and do what you wanted.

you add code or increase data inside your program. If you already exceeded some threshold before, you may add code and data to some extent without further (or even negative) slowdown. If you are below that threshold and cross some boundary while adding code/data, the slowdown may be noticeable, since you suddenly need more pages and cachelines for code and/or data/bss/stack.
Well, for this program that's not an issue. The thing is so small it entirely fits into the L1 cache.

It was originally written on an 8 bit micro.
Does your global version keep the variables in the same order as with classes, e.g. by using a global struct? Changing the order inside those structs may have enormous effects as well.
Yes, except for some minor ones at the bottom.

I thought of that too, and I played with the alignment switches too.
I recommend keeping the search threadsafe. All the better for your 64-bit speedup ;-)

Cheers,
Gerd
I still haven't entirely decided which route I'm going.

Part of me does like having all the search data encapsulated, nice & neat.

But another part of me is leaning towards making the search an isolated program.

I've been so busy with the tests and other stuff that I haven't gotten around to really making any decision.
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Results for GCC v4.2

Post by Carey »

Volker Annuss wrote:
Carey wrote: I don't have a 64 bit compiler, so even if I installed Vista64 I wouldn't be able to check that aspect. The people here may be right when they say that in 64 bit mode, it's not at all a problem.

But, VC 2008 Express doesn't support 64 bit code, and I don't think spending $700 for the professional one is worth it. And I'm not a student so I can't get it for free. So I won't be testing it.
You can get a 64-bit compiler for free by downloading the Windows SDK. It works from the command line, but I did not get it to work inside VC2005 and VC2008 Express Edition.

Greetings
Volker
I may have to check into that.

It may not work from within VC2008 Express. Since the native compiler is built in, they may not be providing any hooks to let you use it. Gives people another reason to pay $$$ for the Pro version.

I'm not really looking forward to doing a plain command line, though.

At least with GCC I have an IDE that can deal with it and the debugger. I'm not sure if CodeBlocks would do that with Vista64 & the 64 bit Microsoft compiler.

But at least it's an option.
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Questions about getting ready for multicore programming.

Post by Carey »

bob wrote:
Carey wrote:
bob wrote:
Carey wrote:Okay, I just got through running some tests.

I am using the latest MingW, which is v3.x. As Bo Persson points out, that may be a good part of my problem. (But since MingW hasn't officially released a version based on a current version of GNU C 4, and probably won't until the next decade, there isn't a lot I am willing to do. I don't like running alphas or private builds.)

I tested my program with GCC v3, OpenWatcom and MSVC 2008 Express.

I did the 'data in class' vs. 'data outside of class'. I didn't bother testing the plain C version because that should be comparable to 'data outside of class'.

Needless to say, OpenWatcom was the slowest. Dead dog slow. Almost twice as long as MSVC. The only other 'professional' compiler that would be slower would probably be Borland's current free Turbo C++.

Anyway, OpenWatcom had a performance penalty of about 8%.

I tried a couple versions of MingW with a few different switches. (I didn't try all the switches it offers. Just what the CodeBlocks IDE offers.) The performance penalty was 9%.

I tried MSVC 2008 Express. The performance penalty was 3%.

(I don't know if the free student version of MSVC pro tools would do any better. I'm not a college student and unfortunately, I don't know any to ask if they'd get me a free copy from Microsoft.)


So it looks like there is indeed a non-trivial performance penalty when you put the data into a C++ class.

Much of that can be optimized away by using a state-of-the-art compiler with significant optimization abilities.

Anything with just 'average' or 'good' optimization abilities will be at a serious disadvantage with C++ code.


So I guess most of this thread has been taken care of.... If I want to do multi-core programming, I'm going to have to stay with MSVC. In which case it doesn't matter too much whether it's threads or processes or something else.
Just to clarify, there isn't any "something else".

:)
Actually, there is, although it's only a little removed.

The normal 'process' program just forks, shares some read-only data, and then communicates through shared memory, while sharing the trans table.

The 'something else' is to go all the way and use entirely separate programs communicating through some channel (pipes, or LAN, or whatever you want). No shared memory, etc.

Each engine can be running on a different core or even a different processor.

It solves all the shared memory bugs & issues, while increasing the communication complexity.

Not much removed, but there are enough differences in what can be done that it's worth calling it 'something else'.

Threads are on the left, common fork()ing processes in the middle, and entirely separate programs on the right.
There's no difference between doing a fork() and simply starting N copies of the program, from a practical perspective. You can _still_ have shared memory. That's what the System V shared memory library is all about: you can create shared memory objects that any process can map into its virtual address space. Ditto for the other approach using mmap(). Otherwise, on modern Unix systems, if you start two instances of the program (assuming it is the exact same program, as in having the same i-node for both instances) they will even share memory pages for instructions and such, just like the fork() processes will.

That was what I meant by "there is no other approach". Threads share _everything_, the other approach is to just share what you explicitly want to share...
I understand what you are saying. I'm not disagreeing with that.

I was doing some things like that as far back as 20 years ago, on an 8 bit micro with 512k and a real-time multitasking OS.

What I am saying is more a matter of degree and intent.

Threads share everything, the common fork()ing method shares only what you want. The 'something else' is where you either don't or can't share anything at all and it has to be considered a 'black box'.

Like if it was an entirely separate program (maybe not even yours!) and the OS didn't allow you to share data, or if you were running it on an entirely separate computer on the other side of the world.

It's a matter of degree and intent of isolation rather than technical details, especially when the common chess meaning is that you will share data. So if you can't or just don't, then that has to be considered 'something else.'
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Results for GCC v4.2

Post by Carey »

Bo Persson wrote:
Volker Annuss wrote:
Carey wrote: I don't have a 64 bit compiler, so even if I installed Vista64 I wouldn't be able to check that aspect. The people here may be right when they say that in 64 bit mode, it's not at all a problem.

But, VC 2008 Express doesn't support 64 bit code, and I don't think spending $700 for the professional one is worth it. And I'm not a student so I can't get it for free. So I won't be testing it.
You can get a 64-bit compiler for free by downloading the Windows SDK. It works from the command line, but I did not get it to work inside VC2005 and VC2008 Express Edition.

Greetings
Volker
If it is the compiler from the Windows Server 2003 SDK, that one is older than even the VS2005 Beta. Not recommended.
They have released a couple newer SDKs since then. One for Vista and then one shortly after that for Win2k8.
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Questions about getting ready for multicore programming.

Post by Carey »

Bo Persson wrote:
Carey wrote:Okay, I just got through running some tests.

I am using the latest MingW, which is v3.x. As Bo Persson points out, that may be a good part of my problem. (But since MingW hasn't officially released a version based on a current version of GNU C 4, and probably won't until the next decade, there isn't a lot I am willing to do. I don't like running alphas or private builds.)

I tested my program with GCC v3, OpenWatcom and MSVC 2008 Express.

I did the 'data in class' vs. 'data outside of class'. I didn't bother testing the plain C version because that should be comparable to 'data outside of class'.

Needless to say, OpenWatcom was the slowest. Dead dog slow. Almost twice as long as MSVC. The only other 'professional' compiler that would be slower would probably be Borland's current free Turbo C++.

Anyway, OpenWatcom had a performance penalty of about 8%.

I tried a couple versions of MingW with a few different switches. (I didn't try all the switches it offers. Just what the CodeBlocks IDE offers.) The performance penalty was 9%.

I tried MSVC 2008 Express. The performance penalty was 3%.
I believe that just dumping everything in a class is not really fair. If you actually transform the code from C to "proper" C++, I believe you can regain more than these 3%.
Well, it very simply simulates data isolation. Whether it's done as a very simple class with no other C++ features or keeping everything as C and passing a struct pointer around (like Hyatt does), the resulting code is going to be very similar.

The class method just gives you the advantage of not having to manually pass the pointer and modify the code to reference the data through that pointer.

My test numbers were very similar for C with global data and C++ with a class but global data.

And considering this test was originally about the cost of having some data isolation in preparation for moving to threads, I think the test is a valid comparison.


As for additional C++ features... although C++ can certainly allow you a certain amount of organization improvements and programmer productivity, it's not really capable of providing any higher performance than what C can.

In fact, due to the extra difficulty of optimizing C++ code, it's likely to be somewhat worse. Maybe not a lot, but at least a little. At best, no faster than C.

I'm certainly not opposed to C++. I'm just saying that it's not capable of being any faster than C. Any differences would be attributable more to programmer style than the language.


It's also not a trivial task to separate good C++ classes and features out of a chess program.

A chess program just doesn't seem to want to be organized in good OOP style.

Carey wrote: (I don't know if the free student version of MSVC pro tools would do any better. I'm not a college student and unfortunately, I don't know any to ask if they'd get me a free copy from Microsoft.)
There is no difference in the compiler; the Pro edition has a more "advanced" IDE and additional libraries.
A 64 bit compiler & debugger integrated into the IDE.

A profiler which is useful for both 32 & 64 bit systems.

But it does look like the 32 bit compiler is the same.

Carey wrote: So it looks like there is indeed a non-trivial performance penalty when you put the data into a C++ class.

Much of that can be optimized away by using a state-of-the-art compiler with significant optimization abilities.

Anything with just 'average' or 'good' optimization abilities will be at a serious disadvantage with C++ code.
Not having a good compiler will be a disadvantage anyway. :-)
Well... Not quite like what you might expect.

I do realize you were making a bit of a joke, but comparing these compilers has been educational.

C is inherently easier to optimize than C++ is. Compiler writers have been complaining about that since C++ was in the development stages.

Even something as simple as classes can have hidden surprises. The more complicated aspects of C++ can be a mine field of performance issues.

Also, my tests have shown a somewhat interesting result. This is for a 10 ply search. C++/Global means there was a class, but the data was global. C++/Class means the data was in the class. Results are in seconds.

Code: Select all

GCC 3.4
C           391
C++/Global  394
C++/Class   443

GCC 4.3.0
C++/Global  344
C++/Class   397

OpenWatcom
C++/Global  660
C++/Class   711

VC2008 Express
C++/Global  376
C++/Class   388
GCC 3.4 had an insignificant penalty for C versus C++/Global. Going to C++/Class there was a 13% penalty.

GCC 4.3.0 was very interesting. The C++/Global was the fastest of any of them, but going to C++/Class resulted in a higher 15% penalty, making it slower than VC2008.

OpenWatcom had terrible results all around. The performance penalty was less than GCC though.

VC2008 was pretty consistent, at just a 3% penalty. The 'data in a class' was the best of any of them.

So clearly VC2008 doesn't optimize as well as GCC430 does for general stuff, but it can better handle the case where data is in a class.

So not all compilers & optimizers are created equal.

You can have a good quality compiler that just isn't able to handle C++ classes well.

If I was going to do multi-core as a process, I'd choose GCC430. If I was going to multi-thread within the same program, I'd choose VC2008.


(I didn't test the Borland compiler. I would expect it to be pretty bad. I didn't test the Intel compiler. I would expect it to be pretty good. Whatever 'good' and 'bad' means....)

And, of course, your results will vary depending on your particular program.

So don't get too in love with a particular compiler. By changing your programming style just a little and switching to a different compiler, you may end up getting a truly significant performance improvement or penalty.
User avatar
Bo Persson
Posts: 257
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: Questions about getting ready for multicore programming.

Post by Bo Persson »

Carey wrote:
Bo Persson wrote:
Carey wrote: I tried MSVC 2008 Express. The performance penalty was 3%.
I believe that just dumping everything in a class is not really fair. If you actually transform the code from C to "proper" C++, I believe you can regain more than these 3%.

As for additional C++ features... although C++ can certainly allow you a certain amount of organization improvements and programmer productivity, it's not really capable of providing any higher performance than what C can.

In fact, due to the extra difficulty of optimizing C++ code, it's likely to be somewhat worse. Maybe not a lot, but at least a little. At best, no faster than C.

I'm certainly not opposed to C++. I'm just saying that it's not capable of being any faster than C. Any differences would be attributable more to programmer style than the language.
I bet you haven't seen this paper by Bjarne Stroustrup, where he shows a case of the C++ standard library being inherently faster than the C library. It uses C++ templates to do things a C compiler isn't able to do.

"Learning Standard C++ as a New Language"

http://www.research.att.com/~bs/new_learning.pdf

The idea is that you can do some things differently in C++, and the language lets the compiler optimize the code better. You do need a good compiler, but in some cases a C++ compiler can do things a C compiler can not.

Dumping some old C code at the C++ compiler just isn't fair. :-)
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: Questions about getting ready for multicore programming.

Post by Carey »

Bo Persson wrote:
Carey wrote:
Bo Persson wrote:
Carey wrote: I tried MSVC 2008 Express. The performance penalty was 3%.
I believe that just dumping everything in a class is not really fair. If you actually transform the code from C to "proper" C++, I believe you can regain more than these 3%.

As for additional C++ features... although C++ can certainly allow you a certain amount of organization improvements and programmer productivity, it's not really capable of providing any higher performance than what C can.

In fact, due to the extra difficulty of optimizing C++ code, it's likely to be somewhat worse. Maybe not a lot, but at least a little. At best, no faster than C.

I'm certainly not opposed to C++. I'm just saying that it's not capable of being any faster than C. Any differences would be attributable more to programmer style than the language.
I bet you haven't seen this paper by Bjarne Stroustrup, where he shows a case of the C++ standard library being inherently faster than the C library. It uses C++ templates to do things a C compiler isn't able to do.

"Learning Standard C++ as a New Language"

http://www.research.att.com/~bs/new_learning.pdf

The idea is that you can do some things differently in C++, and the language lets the compiler optimize the code better. You do need a good compiler, but in some cases a C++ compiler can do things a C compiler can not.
I just glanced through it and from what little I saw, it's much more sleight of hand. Like comparing a watermelon to an apple and saying the apple is better because it's smaller and is a pretty red color.

Comparing black-box library routines is pretty much in the same category. They were written with different requirements and specifications and interfaces, and that says nothing about your code.

What he's really comparing isn't C & C++ but the interfaces to their libraries. That really says very little about your code.


Write your own library to suit your own programming style and you'd probably get comparable performance to what Bjarne is alleging that C can't do.



This is really like saying interpreted BASIC is faster than assembler because it has strings & floating point data types built in.

Dumping some old C code at the C++ compiler just isn't fair. :-)
For the test I did, for the reasons I did it, I stand by my assessment that it was fair.

It wasn't about C vs. C++.

It wasn't about whether you could write a chess program in C++ and make it nice & OOPy and as fast as C.

It was about the cost of gathering up the data so it wouldn't be global, in preparation for going multi-core. Nothing more.


If you really, really wish me to tediously convert the C code to a struct and pass a pointer around, I'm willing to do so. But you may have to pay me to do it, because I'm really not looking forward to that work. It'd be tedious, I don't believe it would show anything new, and therefore it feels like doing something that would be a waste of time.



As for OOP & chess... I'm still not convinced that chess is a good subject for C++ & OOP. It really doesn't want to be broken into nice neat isolated sections.

You end up doing member functions and small classes just to satisfy the OOP goal, even though actually using them gets in the way and slows down the program.

So many sections of chess need access to other parts that pretty soon you are working around your classes just to keep the performance up.

You are doing OOP because you are supposed to be doing it that way, but you are having to fight it every step just to keep it from hurting the performance.

I've said a few times that if anybody can come up with some nice, neat classes etc. for a C++ chess program, where you don't end up trying to work around the restrictions, where you actually want to use the classes and member functions etc., I'd like to hear about it.

From what I've seen and others I've talked with, chess programming just doesn't seem to easily fit into C++ & OOP.