MP Implementation: One Method or Varied?

rbarreira · Post by **rbarreira** » Thu May 31, 2012 10:26 am

Jan Brouwer wrote:
Jouni wrote:I had just one basic question in mind: why can't operating system (OS) do MP thing automatically?! Simply run software with all available CPU units automatically as fast as possible. Can't be too impossible to OS - or is it?
People can parallelize a program, so in principle a program can parallelize a program as well.
First, write a serial chess program in a functional language to ease synchronization between multiple threads.
Then design another program which performs transformations on the chess program while maintaining correctness, and introduces additional threads to increase parallelism.
Finally, we we need a optimization program to find the best performing transformations for the second program to apply to the first program.

I mean, how hard can that be?

Jan

One of the difficulties with what you proposed is the "while maintaining correctness" part. An easy definition of correctness is unlikely to work well, since every parallel alpha-beta algorithm that is known to work well, does not in fact give the exact same output as the serial search.

bob · Post by **bob** » Thu May 31, 2012 5:30 pm

rbarreira wrote:
Jan Brouwer wrote:
Jouni wrote:I had just one basic question in mind: why can't operating system (OS) do MP thing automatically?! Simply run software with all available CPU units automatically as fast as possible. Can't be too impossible to OS - or is it?
People can parallelize a program, so in principle a program can parallelize a program as well.
First, write a serial chess program in a functional language to ease synchronization between multiple threads.
Then design another program which performs transformations on the chess program while maintaining correctness, and introduces additional threads to increase parallelism.
Finally, we we need a optimization program to find the best performing transformations for the second program to apply to the first program.

I mean, how hard can that be?

Jan
One of the difficulties with what you proposed is the "while maintaining correctness" part. An easy definition of correctness is unlikely to work well, since every parallel alpha-beta algorithm that is known to work well, does not in fact give the exact same output as the serial search.

This can, and has been, done. The problem with a deterministic algorithm is that it is not as efficient as the non-deterministic approach and has too much synchronization overhead.

diep · Post by **diep** » Sat Jun 02, 2012 5:57 pm

Jan Brouwer wrote:
Jouni wrote:I had just one basic question in mind: why can't operating system (OS) do MP thing automatically?! Simply run software with all available CPU units automatically as fast as possible. Can't be too impossible to OS - or is it?
People can parallelize a program, so in principle a program can parallelize a program as well.
First, write a serial chess program in a functional language to ease synchronization between multiple threads.
Then design another program which performs transformations on the chess program while maintaining correctness, and introduces additional threads to increase parallelism.
Finally, we we need a optimization program to find the best performing transformations for the second program to apply to the first program.

I mean, how hard can that be?

Jan

Your idea will lose, because my manually programmed program parallellizes better than the beginners ideas you put into that OS

It's like the calculator that's inside windows versus matlab and then claim that with that built in calculator you can in principle do huge matrix calculations with the OS. It's the same bullshit.

Note that for specialistic work matlab isn't very fast. Specialistic software is factors faster in such case.

diep · Post by **diep** » Sat Jun 02, 2012 5:58 pm

Daniel Shawul wrote:Yeah Jan, how hard can that be. Conventional chess program is still for loops... If you disregard efficiency all in all, I am sure something can be done as a proof of concept. And maybe even more if the target is specific, as is in this case chess. Compilers already do much more than a human can optimize by itself if the programmer follows a good coding style. eg. profile guided optimization from run time behavior of program. It is not at all a far fetched thought to think so otherwise no one would attempt it in the first place.. Just don't set the goal so high.

Compilers make 0% chance against manual coded assembler. Sure it's a lot of work...

diep · Post by **diep** » Sat Jun 02, 2012 6:07 pm

bob wrote:
diep wrote:
bob wrote:
Jouni wrote:I had just one basic question in mind: why can't operating system (OS) do MP thing automatically?! Simply run software with all available CPU units automatically as fast as possible. Can't be too impossible to OS - or is it?
Completely impossible until the O/S actually UNDERSTANDS the algorithm and can make the necessary changes to make a single-cpu program use more than one. Take a single-thread chess engine and rewrite it to use more than one CPU effectively. That is what the operating system / compiler would have to do to accomplish the same thing. Could a compiler really either (a) understand all possible computationally intensive algorithms or (b) be able to look at a source code and actually understand the underlying ideas and then rewrite to use more than one CPU?

Not impossible, but at least 20-30-40 years away from the present state of the art...

The major projects on doing this have basically faded away for pretty obvious reasons...
Not only impossible, also it would be a crap algorithm as some loser would implement it

If you'd use Cilk, which means you have to call Cilk functions to get things done, which already is a step further than 'have the OS figure it out', you still would get maybe 2000 chesspositions a second at each machine.

That's factor 10k slower now or so?

So 500 cpu's would be slower with Cilk than a single CPU program without, with 100% sureness

Simply because Leierson & co might be good book authors, but they cannot program themselves.

Generic workstealing/loadbalancing/auto-parallellisation algorithms always will be factor 1000 too slow for computerchess. The latency of those algorithms is always in the dozens of milliseconds, as the runqueue still ticks at that speed of 100Hz, and that won't change any soon.

So we can prove that a generic implementation will always fail to deliver as splitting, aborting and other things, at least in Diep is just a fraction of the time that the management of Cilk and similar systems eats.

When running at machines with hundreds of cores you simply need those millions of splits and Cilk just simply can't deliver that quickly. So it's always exponential slower.

And against that mathematical proof not even the help of some clever PHD students let alone intel, will ever help that.

Yet please realize that in case of Cilk the YBW algorithm you still have to implement yourself doing Cilk function calls. So it really is far off from the one liner dropped here - at which we always gladly respond

Vincent
One of the best automatic parallelizing projects was done by Kuch and associates for Cray years ago. Cray asked me to try the compiler and I did. It had no clue about how to parallelize what was actually an iterated (rather than recursive) search already. It parallelized move input and output loops and such, and produced zero speedup for the actual search, because it could not understand the dependencies caused by a variable-depth search.

I don't think software will ever be able to deal with automatically transforming a normal alpha/beta search into a parallel search. At least not until the point where the software actually begins to understand source code. That is a long, long way out in the future, if ever...

Amazingly it seems some government dudes has revived Cilk as intel is quoting it in their speeches. Speaking of worthless crap, cilk sure belongs there.

Also note it isn't actually parallellizing yoru software. You still must give cilk the notion of where to split and how. So it's not autoparallellisation at all.

From my viewpoint seen most scientific and government people total underestimate how tough it is to parallellize a game tree search. they just have no clue.

If we realize that start 21th century around year 2000, when i asked a bunch, 0 programmers that i met who had themselves by then parallellized their own engine, 0 of them believed it was possible to parallellize a MODERN chess engine with a low branching factor for a big supercomputer, without losing some big gigantic factor rendering the supercomputer near worthless.

They just didn't believe it, because of the total inefficiency of guys before them that had tried to do this, most of them losing factor 40 to 50, and even then they just got af ew plies more because of big forward pruning.

So when comparing the searches 1 to 1, they were crap everywhere.

And i have to admit - when i started parallellizing, had i known in advance the total crap latencies from cpu to cpu, i would NOT have started the project of parallellizing diep for the supercomputer. When i finally figured out the bad latencies, after having written a test myself and i had to wait a month or 6 to run it on the 512 processor partition, using a couple of hundreds of cpu's, then i didn't really know myself whether i would manage.

That's basically showing the big difference between experts and guys on the street.

So we can prove very simple that it's impossible that an OS or some abstraction layer can do this for you, as the hardware simply changes shape each time.

So without first getting it to run perfectly by an expert, some abstraction layer cannot do it.

As they never will hire an expert who can do it, to put it in an abstraction layer, simply because that expert is not gonna have an US passport, we can therefore prove that this is impossible.

If you say: "oh i have a car that drives me from A to B, so i can drink something tonight as the car will bring me at home".

Most will simply offer you a beer and go on - not realizing how difficult such achievement is.

Alexander Zacharias · Post by **Alexander Zacharias** » Thu Jun 07, 2012 3:05 pm

This is the basic idea of "optimistic pondering" (Himstedt, ICGAJ 2005), implemented in "GridChess" (WCCC 2007) and "Cluster Toga" (WCCC 2008).
I think it has some potential even on single cpu systems if the number of cores per cpu continues to grow, as YBWC/DTS etc. search efficiency degrades with growing thread count.

MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?

Re: MP Implementation: One Method or Varied?