MP Implementation: One Method or Varied?

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: MP Implementation: One Method or Varied?

Post by bob »

Jouni wrote:I had just one basic question in mind: why can't the operating system (OS) do the MP thing automatically?! Simply run the software on all available CPU units automatically, as fast as possible. Can't be too impossible for the OS - or is it?
Completely impossible until the O/S actually UNDERSTANDS the algorithm and can make the necessary changes to let a single-CPU program use more than one. Take a single-threaded chess engine and rewrite it to use more than one CPU effectively. That is what the operating system / compiler would have to do to accomplish the same thing. Could a compiler really either (a) understand all possible computationally intensive algorithms or (b) be able to look at source code, actually understand the underlying ideas, and then rewrite it to use more than one CPU?

Not impossible, but at least 20-30-40 years away from the present state of the art...

The major projects on doing this have basically faded away for pretty obvious reasons...
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: MP Implementation: One Method or Varied?

Post by diep »

bob wrote:
Jouni wrote:I had just one basic question in mind: why can't the operating system (OS) do the MP thing automatically?! Simply run the software on all available CPU units automatically, as fast as possible. Can't be too impossible for the OS - or is it?
Completely impossible until the O/S actually UNDERSTANDS the algorithm and can make the necessary changes to let a single-CPU program use more than one. Take a single-threaded chess engine and rewrite it to use more than one CPU effectively. That is what the operating system / compiler would have to do to accomplish the same thing. Could a compiler really either (a) understand all possible computationally intensive algorithms or (b) be able to look at source code, actually understand the underlying ideas, and then rewrite it to use more than one CPU?

Not impossible, but at least 20-30-40 years away from the present state of the art...

The major projects on doing this have basically faded away for pretty obvious reasons...
Not only is it impossible, it would also be a crap algorithm, as some loser would implement it :)

If you used Cilk, which means you have to call Cilk functions to get things done - already a step beyond 'have the OS figure it out' - you would still get maybe 2000 chess positions per second on each machine.

That's a factor of 10k slower than what we get now, or so?

So 500 CPUs with Cilk would be slower than a single-CPU program without it, with 100% certainty :)

Simply because Leiserson & co might be good book authors, but they cannot program themselves.

Generic work-stealing/load-balancing/auto-parallelisation algorithms will always be a factor of 1000 too slow for computer chess. The latency of those algorithms is always in the dozens of milliseconds, as the run queue still ticks at 100 Hz, and that won't change any time soon.

So we can prove that a generic implementation will always fail to deliver, as splitting, aborting and the other operations, at least in Diep, take just a fraction of the time that the management of Cilk and similar systems eats.

When running on machines with hundreds of cores you simply need those millions of splits, and Cilk just can't deliver them quickly enough. So it's always exponentially slower.

And against that mathematical proof, not even the help of some clever PhD students, let alone Intel, will ever change that.

Yet please realize that in the case of Cilk you still have to implement the YBW algorithm yourself through Cilk function calls. So it really is far from the one-liner dropped here - to which we always gladly respond :)
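
To make concrete what 'implement YBW yourself' means here, a minimal sketch in plain C++, with std::async standing in purely as an illustration for a Cilk-style spawn; Position, gen_moves, make_move and eval are stubbed placeholders, not Diep's or anyone else's real code. The eldest brother is searched serially, and only after it fails low are the younger brothers spawned.

#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

struct Position { /* board, side to move, etc. omitted */ };

// Stubs standing in for a real engine's move generator, move maker and evaluation.
std::vector<int> gen_moves(const Position&)        { return {}; }
Position         make_move(const Position& p, int) { return p; }
int              eval(const Position&)             { return 0; }

int search(const Position& pos, int depth, int alpha, int beta) {
    if (depth == 0) return eval(pos);
    std::vector<int> moves = gen_moves(pos);
    if (moves.empty()) return eval(pos);

    // Young Brothers Wait: the eldest brother is always searched serially first.
    int best = -search(make_move(pos, moves[0]), depth - 1, -beta, -alpha);
    if (best >= beta) return best;              // fail high, no split needed
    alpha = std::max(alpha, best);

    // Only now are the younger brothers spawned. Placing this spawn is exactly
    // the part a Cilk-like runtime does NOT decide for you.
    std::vector<std::future<int>> jobs;
    for (std::size_t i = 1; i < moves.size(); ++i)
        jobs.push_back(std::async(std::launch::async, [&, i] {
            return -search(make_move(pos, moves[i]), depth - 1, -beta, -alpha);
        }));

    for (auto& j : jobs)                        // the "sync" point
        best = std::max(best, j.get());
    return best;
}

Even this toy leaves out what a real parallel search needs: sharing the improving alpha between siblings, aborting them on a beta cutoff, a shared hash table, split-point bookkeeping, and so on. The runtime only steals work; all the chess-specific decisions stay yours.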

Vincent
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: MP Implementation: One Method or Varied?

Post by Daniel Shawul »

Yet please realize that in the case of Cilk you still have to implement the YBW algorithm yourself through Cilk function calls. So it really is far from the one-liner dropped here - to which we always gladly respond
That is the point :) Someone not familiar with chess programming asks a general legitimate question based on his intuition, and we scramble to express our views (back to basics).
Howard E
Posts: 261
Joined: Wed Mar 08, 2006 8:49 pm

Re: MP Implementation: One Method or Varied?

Post by Howard E »

Thanks everyone for your replies.

This is cut and pasted from the chess wiki link ...

Other Approaches
Many different approaches have been tried that do not directly split the search tree. These algorithms have not enjoyed popular success due to the fact that they are not scalable. Typical examples include one processor that evaluates positions fed to it by a searching processor, or a tactical search that confirms or refutes a positional search.

Has anyone tried this, where one or two threads are devoted just to tactical analysis? Then somehow a "master brain algorithm" decides which move to make? I imagine this would be problematic, yet if successfully implemented it could have payoffs.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: MP Implementation: One Method or Varied?

Post by diep »

Daniel Shawul wrote:
Yet please realize that in the case of Cilk you still have to implement the YBW algorithm yourself through Cilk function calls. So it really is far from the one-liner dropped here - to which we always gladly respond
That is the point :) Someone not familiar with chess programming asks a general legitimate question based on his intuition, and we scramble to express our views (back to basics).
With the introduction of many-core machines, huge supercomputers and GPUs, the increased difficulty of programming them is hard to explain.

In fact this is difficult to explain even to most PhDs in computer science or math. Most, for example, come up with the parallelisation trick of splitting all the root moves across different processors.

Now, to start with, not having a good bound means you're going to lose an extra factor of 6 if you do that; yet even if you have one, you can show that the expected speedup is under a factor of 2.0, as the main line usually eats 50% of all nodes, if not more in some of the deep-searching mainline checkers that reign today.
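
A back-of-the-envelope sketch of that factor-2.0 bound, assuming the first root move (the main line) really does cost about half of the total work T: one processor has to search that move serially, taking roughly 0.5 T, while the other N-1 processors share the remaining 0.5 T. The wall-clock time is therefore at best max(0.5 T, 0.5 T / (N-1)), which stays around 0.5 T however large N gets, so the speedup is at most T / 0.5 T = 2 no matter how many processors you throw at the root.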

Apart from the problem, of course, that some of us have more than a handful of cores. How to split 30 root moves between the 64 real Intel cores here is yet another problem of course :)

So if it's already so difficult to explain this to PhDs, who are not beginners in algorithms in general as they got a PhD in it, then obviously one cannot blame others for asking that question.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: MP Implementation: One Method or Varied?

Post by diep »

Howard E wrote:Thanks everyone for your replies.

This is cut and pasted from the chess wiki link ...

Other Approaches
Many different approaches have been tried that do not directly split the search tree. These algorithms have not enjoyed popular success due to the fact that they are not scalable. Typical examples include one processor that evaluates positions fed to it by a searching processor, or a tactical search that confirms or refutes a positional search.

Has anyone tried this, where one or two threads are devoted just to tactical analysis? Then somehow a "master brain algorithm" decides which move to make? I imagine this would be problematic, yet if successfully implemented it could have payoffs.
Let me respond to that.

For most chess engines of today, combining this doesn't make sense: the evaluation function is already so limited that we can regard the engine itself as a stripped engine, so combining it with a tactics-only engine doesn't make sense at all, as your nps will not speed up much, at most a few percent if not less.

For an engine like Diep, which is a factor of 10-20 slower in nps (if not more) than the fast beancounters, it makes more sense to take a look there, as the potential speedup is very large.

Where to use such a fast beancounter effectively? Not to replace the ENTIRE search, just the last few plies of course. If we're already up a full piece and have just 2 or 3 plies left to search, what are the odds that we have a lost position there?

Very small, huh?

So if we just check that with a fast tactical search there, we do not need to do it with the slow Diep search, huh?

That's roughly the idea... now more nerdy talk below...

I spent around 6+ months trying to speed up Diep in 2004-2005. I had a sponsor for the 2005 world championship and the promise of an opening book from Erdo, yet his book was only suited to tactically strong engines that positionally sucked a tad. It was all tactical tricks: run with passed pawns and pray for a mistake by the opponent.

Diep's nps was rather limited, so I tried to combine Diep's slow positional search with a fast beancounter search.

Not in parallel of course, but each core by itself would simply do the following.
The observation was that 90% of all positions at a remaining depth of 3-5 plies were more than 3 pawns away from beta. So 3 pawns above beta.

In itself you can usually prune that; it's just the few cases where such a cutoff would be wrong that you cannot.

What today's software does is just throw it away and pray for the best.
Yet for Diep either the search depth isn't enough, or the huge evaluation is spending a lot of time that you don't want to throw away that cheaply.

So then with a quick tactical search I tried to get a cutoff there.
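
A minimal sketch of that idea, purely illustrative and not Diep's actual code: eval_slow, qsearch_fast and search_slow are stubbed placeholders, and the 3-pawn margin and 3-5 ply window are the ones mentioned above.

struct Position { /* board, side to move, etc. omitted */ };

const int PAWN = 100;   // assumed centipawn scale

// Stubs standing in for the real thing: the big slow evaluation, the fast
// tactics-only beancounter search, and the normal slow full-width search.
int eval_slow(const Position&)                   { return 0; }
int qsearch_fast(const Position&, int, int)      { return 0; }
int search_slow(const Position&, int, int, int)  { return 0; }

int search_node(const Position& pos, int depth, int alpha, int beta) {
    // Only in the last few plies, and only when the slow static eval is already
    // more than 3 pawns above beta (the 90% case described above).
    if (depth >= 3 && depth <= 5 && eval_slow(pos) >= beta + 3 * PAWN) {
        // Null-window probe with the cheap tactical search: if even that fails
        // high, take the cutoff without paying for the slow Diep-style search.
        int score = qsearch_fast(pos, beta - 1, beta);
        if (score >= beta)
            return score;
        // Otherwise tactics may be lurking: fall through to the full slow search.
    }
    return search_slow(pos, depth, alpha, beta);
}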

In itself this plan is not new. Some engines already tried this approach in the 80s/90s.

As for Diep back then: at home on my K7, on a single 2.1 GHz CPU, the normal Diep search was doing around 100k nps. The fast beancounter search I wrote, which took many months of full-time work, ran at 2.5 million nps though (including giving checks in qsearch and a limited evaluation). This was all in 32 bits, obviously.

So it would be a tremendous speedup for Diep to search like that.

Realize all this applies ONLY in positions where we are more than 3 pawns above the opponent's current best score (beta). So in 99% of cases it really is useless to search further there.

The normal Diep search wastes on average 7 nodes on that, and in total it's more than 50% of all nodes searched by Diep, so the actual speedup would win a ply hands down, and replacing the entire search there would directly win Diep 3 plies of search depth.

Yet despite the search-depth win, it didn't work very well.

I didn't keep my plan especially secret back then. Stefan Meyer-Kahlen had a short comment about it: "it's difficult to combine 2 different searches".

Now I still feel that with some toying something would be possible, yet the fact is that I simply didn't get it to work. Maybe with more time put into it I would have, yet for now you can write Stefan's comment in big capitals.

Combining different searches driven by different evaluation functions is simply very difficult.

The first problem you run into, for example, is that Diep picks up a lot of tactics this way, whereas a fast beancounter search simply sees less there. So even just replacing Diep's search there by a fast beancounter, you're already directly a ply or so weaker tactically on many occasions, as Diep spends far more time on selecting which moves to try.

So of that factor-20 increase in nps, you directly lose a factor of 4 or so to the fact that Diep picks up more within its search, thanks to its evaluation and to trying more moves, already in a TACTICAL sense.

The next problem is that the 'best move' such a fast search finds is not close to what Diep thinks is the best move. So it also suffers in move ordering.

Everywhere you have problems. Now I still feel it is solvable to some extent, yet the effort needed to combine a normal search with a full slow eval and a fast tactical search is really complicated.

With the hardware I have here today I AM capable of testing such things better. Back then the problem was simply that I didn't have enough hardware to test enough games to be sure of anything :)

The end of it is that with endless talks I managed to convince Anthony to use Erdo's book and Erdo to give his book to Anthony, and Anthony then became world champion with Zappa with 10.5 out of 11.

Three weeks before the 2005 world championship I concluded that all my experiments at combining the 2 searches had failed and that I had to stop research on it, as I also wanted to score a few points.

So improvement of the evaluation had totally stopped until then, and that proved devastating, as the few 'changes' I had made for more aggressive scoring of passed pawns totally backfired.

Arturo's book, historically more suited to Diep and the one I used at the 2005 world championship, wasn't ready in time either, also proving a disaster for Diep at that event.

Interestingly, Diep managed to win against Fruit pretty easily based on some chess knowledge, so I'm very convinced that massive chess knowledge is the way to go.

However, it has to be said that optimizing it is very complicated, as no one else is doing that.

We also see that whatever Stefan MK is doing there is not exactly effective compared to the Elo rating one would project for Shredder.

Obviously one needs totally different algorithms and approaches for engines with a lot of chess knowledge, and over the past nearly 20 years the only one who has really put a lot of time into this is me; doing that on your own is not easy if it doesn't get paid.

Whereas most of the simplistic beancounters just copy whatever the hell someone else is doing, and any trick one of them manages to squeeze out usually works for the others as well.

With more chess knowledge more tricks are also possible - yet it's full-time work, make no mistake there. If you have that time it will of course work out superior - yet you do need a clever person who can invent new stuff, otherwise it will fail.

The last systematic improvement of the evaluation code in Diep was at the start of 2006, so we will see how much Elo I manage to gain within a few months now.

Making some effort there every few years is in fact not much if you consider the huge amount of code the evaluation is.
syzygy
Posts: 5554
Joined: Tue Feb 28, 2012 11:56 pm

Re: MP Implementation: One Method or Varied?

Post by syzygy »

Howard E wrote:Thanks everyone for your replies.

This is cut and pasted from the chess wiki link ...

Other Approaches
Many different approaches have been tried that do not directly split the search tree. These algorithms have not enjoyed popular success due to the fact that they are not scalable. Typical examples include one processor that evaluates positions fed to it by a searching processor, or a tactical search that confirms or refutes a positional search.
I'm pretty sure it has been tried (and probably the chess programming wiki has some references). One obvious way is to run a specialised mate solver (such as a proof number searcher) on a separate machine or a separate core. This will not help too much in chess though, since if a mate is in sight there is probably also a winning move that a normal alpha/beta-type search will find.

For games with deep forcing variations it can be different. For suicide chess / loser's chess, only doing a "regular" alpha/beta search won't cut it.
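
A minimal sketch of that division of labour, purely illustrative: solve_mate and search_root are stubbed placeholders, not any real engine's API, and a real proof-number searcher is of course a project in itself.

#include <atomic>
#include <future>
#include <optional>

struct Position { /* omitted */ };
struct Move     { /* omitted */ };

// Stubs: a proof-number-style mate solver and a normal alpha/beta root search.
std::optional<Move> solve_mate(const Position&, std::atomic<bool>&)  { return std::nullopt; }
Move                search_root(const Position&, std::atomic<bool>&) { return Move{}; }

Move pick_move(const Position& pos) {
    std::atomic<bool> stop{false};

    // The mate solver gets its own core while the normal search runs here.
    auto mate_job = std::async(std::launch::async,
                               [&] { return solve_mate(pos, stop); });

    Move normal = search_root(pos, stop);   // searches until its time budget runs out
    stop = true;                            // then tell the mate solver to stop as well

    // If a forced mate has been proven, trust it over the heuristic search result.
    if (std::optional<Move> mate = mate_job.get())
        return *mate;
    return normal;
}

The only coupling is the shared stop flag and the final arbitration over which result to trust, and that arbitration is exactly the "master brain" part of the question above that is hard to get right.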
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: MP Implementation: One Method or Varied?

Post by bob »

diep wrote:
bob wrote:
Jouni wrote:I had just one basic question in mind: why can't the operating system (OS) do the MP thing automatically?! Simply run the software on all available CPU units automatically, as fast as possible. Can't be too impossible for the OS - or is it?
Completely impossible until the O/S actually UNDERSTANDS the algorithm and can make the necessary changes to let a single-CPU program use more than one. Take a single-threaded chess engine and rewrite it to use more than one CPU effectively. That is what the operating system / compiler would have to do to accomplish the same thing. Could a compiler really either (a) understand all possible computationally intensive algorithms or (b) be able to look at source code, actually understand the underlying ideas, and then rewrite it to use more than one CPU?

Not impossible, but at least 20-30-40 years away from the present state of the art...

The major projects on doing this have basically faded away for pretty obvious reasons...
Not only is it impossible, it would also be a crap algorithm, as some loser would implement it :)

If you used Cilk, which means you have to call Cilk functions to get things done - already a step beyond 'have the OS figure it out' - you would still get maybe 2000 chess positions per second on each machine.

That's a factor of 10k slower than what we get now, or so?

So 500 CPUs with Cilk would be slower than a single-CPU program without it, with 100% certainty :)

Simply because Leiserson & co might be good book authors, but they cannot program themselves.

Generic work-stealing/load-balancing/auto-parallelisation algorithms will always be a factor of 1000 too slow for computer chess. The latency of those algorithms is always in the dozens of milliseconds, as the run queue still ticks at 100 Hz, and that won't change any time soon.

So we can prove that a generic implementation will always fail to deliver, as splitting, aborting and the other operations, at least in Diep, take just a fraction of the time that the management of Cilk and similar systems eats.

When running on machines with hundreds of cores you simply need those millions of splits, and Cilk just can't deliver them quickly enough. So it's always exponentially slower.

And against that mathematical proof, not even the help of some clever PhD students, let alone Intel, will ever change that.

Yet please realize that in the case of Cilk you still have to implement the YBW algorithm yourself through Cilk function calls. So it really is far from the one-liner dropped here - to which we always gladly respond :)

Vincent
One of the best automatic parallelizing projects was done by Kuck and Associates for Cray years ago. Cray asked me to try the compiler and I did. It had no clue how to parallelize what was already an iterated (rather than recursive) search. It parallelized move input and output loops and such, and produced zero speedup for the actual search, because it could not understand the dependencies caused by a variable-depth search.

I don't think software will ever be able to deal with automatically transforming a normal alpha/beta search into a parallel search. At least not until the point where the software actually begins to understand source code. That is a long, long way out in the future, if ever...
Jan Brouwer
Posts: 201
Joined: Thu Mar 22, 2007 7:12 pm
Location: Netherlands

Re: MP Implementation: One Method or Varied?

Post by Jan Brouwer »

Jouni wrote:I had just one basic question in mind: why can't the operating system (OS) do the MP thing automatically?! Simply run the software on all available CPU units automatically, as fast as possible. Can't be too impossible for the OS - or is it?
People can parallelize a program, so in principle a program can parallelize a program as well.
First, write a serial chess program in a functional language to ease synchronization between multiple threads.
Then design another program which performs transformations on the chess program while maintaining correctness, and introduces additional threads to increase parallelism.
Finally, we need an optimization program to find the best-performing transformations for the second program to apply to the first program.

I mean, how hard can that be? 8-)

Jan
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: MP Implementation: One Method or Varied?

Post by Daniel Shawul »

Yeah Jan, how hard can that be? 8-) A conventional chess program is still for loops... If you disregard efficiency altogether, I am sure something could be done as a proof of concept. And maybe even more if the target is specific, as in this case chess. Compilers already do much more than a human can optimize by hand if the programmer follows a good coding style, e.g. profile-guided optimization based on the runtime behavior of the program. It is not at all a far-fetched thought, otherwise no one would attempt it in the first place... Just don't set the goal too high.