MP Implementation: One Method or Varied?

Howard E · Post by **Howard E** » Sat May 26, 2012 8:32 pm

When designing multi-processor chess programs is there typically one way to do this? Or do we see different approaches in how to utilize the processors?

ZirconiumX · Post by **ZirconiumX** » Sat May 26, 2012 9:21 pm

There are several ways, as with most things in life.

http://chessprogramming.wikispaces.com/Parallel+Search

Matthew:out

jdart · Post by **jdart** » Sun May 27, 2012 1:22 am

Most engines use YBW, a good tradeoff between complexity and performance.

--Jon

diep · Post by **diep** » Mon May 28, 2012 5:09 pm

Howard E wrote:When designing multi-processor chess programs is there typically one way to do this? Or do we see different approaches in how to utilize the processors?

there are many ways to do it. However nowadays chessprograms search very deep and get a lot of positions per second and the number of cores keeps growing.

So the first requirement is to implement a SMP search that doesn't deteriorate the branching factor too much. Many 'old school' SMP solutions really have a bad effect on the branching factor. Branching factor is the time to get to the next ply here to use not an accurate definition of it. In 80s and 90s this was far over factor 10 for most supercomputers or some other big factor. Nowadays we're at 2.0 or better.

If you manage to parallellize your program getting that branching factor up from 2.0 to 3.0, to give an example, then even a 1000 processor supercomputer will be outsearched by a simple laptop.

So februari 1998 when talking to Rainer Feldmann about how to parallellize Diep, he did do the historic prediction that initially if you parallellize your program, you're actually SLOWER, not faster! He had a talent there to predict the future. Februari 1999 first run on Bob's quad Xeon, diep was 6x slower than Diep single cpu. Not faster at all!

Then there is a number of SMP approaches that definitely will not scale well or simply have a very bad speedup when the number of cores used grows.

As it appears more complex ways to parallellize simply perform better then, yet there is very few who managed and/or did do the effort to implement more complex forms of YBW.

So what you see typically is that several 'cheapskate' solutions have been developed to get *some* speedup meanwhile still following YBW.

As usual such algorithms don't dictate the datastructure how to efficiently implement YBW.

Then there can also be distinguished between the types of implementation.

One can implement YBW with central locking, such as crafty for example is doing and get a decent speedup, or you can try to look for a way to do it without central locking such as Diep. In that case you scale easier to more cores meanwhile having a magnificent speedup.

Now YBW is a rather simple idea. Search the first move sequential and after that parallellize the rest of the moves. This means obviously that the real difficult to accomplish goal is cheap splitting and cheap abortion.

So some enthusiasts saw therefore another cheapskate implementation that avoids the complex way to abort cpu's, such as i do in Diep, and simply not abort at all. The question is whether one would be able to call what some do there a full YBW implementation, if you don't abort.

In short all sorts of hybrid forms of SMP search are there.

In general the surviving mechanism everywhere is YBW.

What YBW doesn't dictate of course is what to do when in this given position P that has been split all moves are 'under search' and 1 of the cores is ready with its job there, yet doesn't win a queen or something, so no fail high occurs - the other cpu's do not need to get aborted.

So there is a wildgrow of mechanisms there that get used what to do if something like that happens.

Some engines, the ones that are recursive in general, are stuck there in the mudd as they might own the recursions previous to this position. Others not using recursion such as Diep can frankly do whatever they want.

Yet realize YBW is not easy to implement if you try to fully implement it. It does preserve however your branching factor very well and we cannot say that from most other forms of SMP such as Aphid, to give an example.
So in general spoken a good performing SMP search is a very mathematical type of accomplishment where the programmer needs a big skillset to prove correctness of search. Only a few on this planet have that.

Picking up just another software architect from the street and have him implement a good performing SMP search that's asking for trouble.

A quote that definitely describes this for SMP:

"Computerchess is very boring, you always see the same persons!" Jonathan Schaeffer.

Jouni · Post by **Jouni** » Mon May 28, 2012 7:33 pm

I had just one basic question in mind: why can't operating system (OS) do MP thing automatically?! Simply run software with all available CPU units automatically as fast as possible. Can't be too impossible to OS - or is it?

Uri Blass · Post by **Uri Blass** » Tue May 29, 2012 10:39 am

Jouni wrote:I had just one basic question in mind: why can't operating system (OS) do MP thing automatically?! Simply run software with all available CPU units automatically as fast as possible. Can't be too impossible to OS - or is it?

If every cpu get the same position then it is not faster and you simply do the same work 4 times instead of one time.

I think that it may be possible to get some small improvement for all one cpu programs by a special software.

The idea is the following:
The program prints a pv of at least 2 plies in most cases so it is possible that one cpu get the position in the board and second cpu get the position after 2 plies of the pv that has good chances to happen in the game later.

If the position after 2 plies of the pv happens later in the game then the program already analyzed it for some minutes so it has some advantage relative to 1 cpu.

I will give an example.

Assume that we play a game at 120/40 and assume that after 1 minute CPU 1 has the line 1.e4 e5 that is not changed during analysis(earlier d4 was suggested).

CPU 2 is asked to analyze the position after 1.e4 e5 after 1 minute

Suppose that after 3 minutes the program plays e4 then already CPU 2 has 2 minutes of analysis after 1.e4 e5 and CPU 2 continue to analyze the position after 1.e4 e5

Now suppose that the opponent replies after 3 minutes with e5.
The program already have 5 minutes of analysis for that move(when practically with 1 cpu it could have only 3 minutes of analysis for that move so the program got some advantage relative to 1 cpu)

rbarreira · Post by **rbarreira** » Tue May 29, 2012 11:19 am

Jouni wrote:I had just one basic question in mind: why can't operating system (OS) do MP thing automatically?! Simply run software with all available CPU units automatically as fast as possible. Can't be too impossible to OS - or is it?

Automatically turning a single-thread program into a multi-threaded program is the holy grail of parallelization.

Keep in mind that what you suggest is probably impossible for some classes of programs:

http://en.wikipedia.org/wiki/P-complete

For the ones where it is possible, it requires massive intelligence on the OS (perhaps strong AI) to do something like that, since the OS would have to understand quite well what the program is doing, in order to parallelize it effectively and safely.

Then we have chess engines where effective parallelization changes the output of the program by making it non-deterministic, which makes this mythical OS's job even harder as it has to make a subjective interpretation of what is an acceptable change in the output.

In short - what you suggest will not happen anytime soon, if ever...

zamar · Post by **zamar** » Tue May 29, 2012 2:03 pm

Jouni wrote:I had just one basic question in mind: why can't operating system (OS) do MP thing automatically?! Simply run software with all available CPU units automatically as fast as possible. Can't be too impossible to OS - or is it?

It is

Program is a stream of commands. The commands needs to be executed in correct order. One CPU can only handle one stream of commands. To make any use of additional CPUs, the program design needs to be changed from one stream of commands to multiple streams of commands (also called multiple threads or processes).

This is a very difficult task to get right, because you need split work and synchronize threads. It's like four people trying to write a book instead of just one. You may get some speed up, but also a lot of time is wasted in meetings making sure everybody is doing the right thing.

Daniel Shawul · Post by **Daniel Shawul** » Tue May 29, 2012 3:03 pm

I would like to think it may very well be possible in the future, definately something not to ruled out as being impossible. For data based parallelism , there are good compilers that target loops in a program. Dynamic scheduling of tasks as each processor finishes a sub-task is something you would _not_ expect from a compiler but it is possible. Even for task based parallelizm, as chess alpha-beta do require, the programmer needs to provide only hints here and there which part needs to be parallized (for example cilk & openmp). A smart AI could replace that in the future...

rbarreira · Post by **rbarreira** » Tue May 29, 2012 3:27 pm

zamar wrote:It's like four people trying to write a book instead of just one.

I like the "nine women making a baby in one month" analogy even more

MP Implementation: One Method or Varied?

MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?